• hive中导入json格式的数据(hive分区表)


    hive中建立外部分区表,外部数据格式是json的如何导入呢?

    json格式的数据表不必含有分区字段,只需要在hdfs目录结构中体现出分区就可以了

    This is all according to this guide: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

    1 hive> ADD JAR /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar;
    2 Added /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar to class path
    3 Added resource: /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar

    In /tmp/new I have a file abc.json 

    The CREATE EXTERNAL TABLE command runs properly, but it doesn't take in any data:

     1 hive> 
     2     > CREATE EXTERNAL TABLE tweets (
     3     >   id BIGINT,
     4     >   created_at STRING,
     5     >   source STRING,
     6     >   favorited BOOLEAN,
     7     >   retweeted_status STRUCT<
     8     >     text:STRING,
     9     >     user:STRUCT<screen_name:STRING,name:STRING>,
    10     >     retweet_count:INT>,
    11     >   entities STRUCT<
    12     >     urls:ARRAY<STRUCT<expanded_url:STRING>>,
    13     >     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    14     >     hashtags:ARRAY<STRUCT<text:STRING>>>,
    15     >   text STRING,
    16     >   user STRUCT<
    17     >     screen_name:STRING,
    18     >     name:STRING,
    19     >     friends_count:INT,
    20     >     followers_count:INT,
    21     >     statuses_count:INT,
    22     >     verified:BOOLEAN,
    23     >     utc_offset:INT,
    24     >     time_zone:STRING>,
    25     >   in_reply_to_screen_name STRING
    26     > ) 
    27     > PARTITIONED BY (datehour INT)
    28     > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    29     > LOCATION '/tmp/new';
    30 OK
    31 Time taken: 0.142 seconds

    注意外部分区表需要手工添加分区

    具体步骤

    1.) Run the create table statement.
    2.) In the directory /tmp/new/ create a sub directory datehour=<some int value>, and then put your .json file inside this.这里就是说只需要在hdfs上建立目录,目录体现分区信息,将数据放到对应目录,然后直接add partiton就好了
    3.) Run alter table statement adding this partition to metadata:
        alter table tweets add partition(datehour=<some int value>);
    4.) Now run the select statement.
  • 相关阅读:
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    Python成长笔记
    解决Jenkins生成测试报告的问题
  • 原文地址:https://www.cnblogs.com/amazement/p/4885903.html
Copyright © 2020-2023  润新知