hive中建立外部分区表,外部数据格式是json的如何导入呢?
json格式的数据表不必含有分区字段,只需要在hdfs目录结构中体现出分区就可以了
This is all according to this guide: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
1 hive> ADD JAR /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar; 2 Added /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar to class path 3 Added resource: /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar
In /tmp/new
I have a file abc.json
The CREATE EXTERNAL TABLE command runs properly, but it doesn't take in any data:
1 hive> 2 > CREATE EXTERNAL TABLE tweets ( 3 > id BIGINT, 4 > created_at STRING, 5 > source STRING, 6 > favorited BOOLEAN, 7 > retweeted_status STRUCT< 8 > text:STRING, 9 > user:STRUCT<screen_name:STRING,name:STRING>, 10 > retweet_count:INT>, 11 > entities STRUCT< 12 > urls:ARRAY<STRUCT<expanded_url:STRING>>, 13 > user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>, 14 > hashtags:ARRAY<STRUCT<text:STRING>>>, 15 > text STRING, 16 > user STRUCT< 17 > screen_name:STRING, 18 > name:STRING, 19 > friends_count:INT, 20 > followers_count:INT, 21 > statuses_count:INT, 22 > verified:BOOLEAN, 23 > utc_offset:INT, 24 > time_zone:STRING>, 25 > in_reply_to_screen_name STRING 26 > ) 27 > PARTITIONED BY (datehour INT) 28 > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' 29 > LOCATION '/tmp/new'; 30 OK 31 Time taken: 0.142 seconds
注意外部分区表需要手工添加分区
具体步骤
1.) Run the create table statement. 2.) In the directory /tmp/new/ create a sub directory datehour=<some int value>, and then put your .json file inside this.这里就是说只需要在hdfs上建立目录,目录体现分区信息,将数据放到对应目录,然后直接add partiton就好了 3.) Run alter table statement adding this partition to metadata: alter table tweets add partition(datehour=<some int value>); 4.) Now run the select statement.