• Importing Data into Hive


    Data can be imported into a Hive table in several ways.

    1. Loading via an external table

    Create an external table in Hive, specifying an HDFS path at creation time; copying data into that path makes it immediately queryable through the external table, with no separate load step.
    For example:
    Create a file test.txt:
    $ cat test.txt
    1       hello
    2       world
    3       test
    4       case
    Fields are separated by '\t' (tab).

    Start hive:
    $ hive

    Create the external table:
    hive> CREATE EXTERNAL TABLE MYTEST(num INT, name STRING)
        > COMMENT 'this is a test'
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE
        > LOCATION '/data/test';
    OK
    Time taken: 0.714 seconds

    hive> show tables;
    OK
    mytest
    partition_test
    partition_test_input
    test
    Time taken: 0.07 seconds
    hive> desc mytest ;
    OK
    num     int
    name    string
    Time taken: 0.121 seconds

    Copy the data to HDFS:
    $ hadoop fs -put test.txt /data/test

    Query the Hive table:
    hive> select * from mytest;
    OK
    1       hello
    2       world
    3       test
    4       case
    Time taken: 0.375 seconds
    hive> select num from mytest;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ......
    Total MapReduce CPU Time Spent: 510 msec
    OK
    1
    2
    3
    4
    Time taken: 27.157 seconds

    This approach is typically used when historical data already exists on HDFS and we want to run Hive queries over it; it avoids the cost of copying the data.
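
    An external table has another useful property: DROP TABLE removes only the metadata, leaving the files under the LOCATION path untouched. A minimal sketch (not a captured session):
    hive> DROP TABLE mytest;
    $ hadoop fs -cat /data/test/test.txt    # the data file survives the drop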

    2. Loading from the local filesystem

    If the data is not on HDFS, it can be loaded into a Hive table directly from the local filesystem.

    The file /home/work/test.txt has the same content as above.

    Create the table:
    hive> CREATE TABLE MYTEST2(num INT, name STRING)
        > COMMENT 'this is a test2'
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;
    OK
    Time taken: 0.077 seconds

    Load the data into the table:
    hive> LOAD DATA LOCAL INPATH '/home/work/test.txt' INTO TABLE MYTEST2;
    Copying data from file:/home/work/test.txt
    Copying file: file:/home/work/test.txt
    Loading data to table default.mytest2
    OK
    Time taken: 0.24 seconds

    View the data:
    hive> select * from MYTEST2;
    OK
    1       hello
    2       world
    3       test
    4       case
    Time taken: 0.11 seconds

    The local path given to LOAD DATA LOCAL INPATH can be a single file, a directory, or a glob pattern. Note that if it is a directory, it must not contain subdirectories; likewise, a glob may only match files, not directories.
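
    For example, to load every matching file in one statement (the /home/work/staging path is hypothetical):
    hive> LOAD DATA LOCAL INPATH '/home/work/staging/*.txt' INTO TABLE MYTEST2;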

    3. Loading from HDFS

    With test.txt already uploaded to /data/test as above, the following commands load it straight into a Hive table:
    hive> CREATE TABLE MYTEST3(num INT, name STRING)
        > COMMENT "this is a test3"
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;
    OK
    Time taken: 4.735 seconds
    hive> LOAD DATA INPATH '/data/test/test.txt' INTO TABLE MYTEST3;
    Loading data to table default.mytest3
    OK
    Time taken: 0.337 seconds
    hive> select * from MYTEST3 ;
    OK
    1       hello
    2       world
    3       test
    4       case
    Time taken: 0.227 seconds
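
    Note that LOAD DATA INPATH without the LOCAL keyword moves the source file into the table's warehouse directory rather than copying it, so test.txt is gone from /data/test once the load completes. You can confirm with a listing (illustrative):
    $ hadoop fs -ls /data/test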

    4. Inserting data from another table

    hive> CREATE EXTERNAL TABLE MYTEST4(num INT) ;
    OK
    Time taken: 0.091 seconds
    hive> FROM MYTEST3 test3
        > INSERT OVERWRITE TABLE MYTEST4
        > select test3.num where name='world';
    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201207230024_0002, Tracking URL =
    http://localhost:50030/jobdetails.jsp?jobid=job_201207230024_0002
    Kill Command = /home/work/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201207230024_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2012-07-23 18:59:02,365 Stage-1 map = 0%,  reduce = 0%
    2012-07-23 18:59:08,417 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:09,435 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:10,445 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:11,455 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:12,470 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:13,489 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.62 sec
    2012-07-23 18:59:14,508 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 0.62 sec
    MapReduce Total cumulative CPU time: 620 msec
    Ended Job = job_201207230024_0002
    Ended Job = -174856900, job is filtered out (removed at runtime).
    Moving data to: hdfs://localhost:9000/tmp/hive-work/hive_2012-07-23_18-58-44_166_189728317691010041/-ext-10000
    Loading data to table default.mytest4
    Deleted hdfs://localhost:9000/user/hive/warehouse/mytest4
    Table default.mytest4 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 2, raw_data_size: 0]
    1 Rows loaded to mytest4
    MapReduce Jobs Launched:
    Job 0: Map: 1   Accumulative CPU: 0.62 sec   HDFS Read: 242 HDFS Write: 2 SUCESS
    Total MapReduce CPU Time Spent: 620 msec
    OK
    Time taken: 30.663 seconds
    hive> select * from mytest4;
    OK
    2
    Time taken: 0.103 seconds
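
    The FROM ... INSERT form also supports writing to several tables in a single pass over the source data. A sketch of a multi-table insert (MYTEST5 is a hypothetical second target that would need to be created first):
    hive> FROM MYTEST3 test3
        > INSERT OVERWRITE TABLE MYTEST4 select test3.num where test3.name='world'
        > INSERT OVERWRITE TABLE MYTEST5 select test3.num where test3.name='hello';
    Because both INSERT clauses share one FROM, MYTEST3 is scanned only once.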

  • Original post: https://www.cnblogs.com/dyllove98/p/3228557.html