• [Hadoop大数据]——Hive连接JOIN用例详解


    SQL里面通常都会用Join来连接两个表,做复杂的关联查询。比如用户表和订单表,能通过join得到某个用户购买的产品;或者某个产品被购买的人群....

    Hive也支持这样的操作,而且由于Hive底层运行在hadoop上,因此有很多地方可以进行优化。比如小表到大表的连接操作、小表进行缓存、大表进行避免缓存等等...

    下面就来看看hive里面的连接操作吧!其实跟SQL还是差不多的...

    数据准备:创建数据-->创建表-->导入数据

    首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:

    [xingoo@localhost tmp]$ cat aa.txt 
    1 a 3
    2 b 4
    3 c 1
    [xingoo@localhost tmp]$ cat bb.txt 
    1 xxx 2
    2 yyy 3
    3 zzz 5
    

    接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。

    hive> create table aa
        > (a string,b string,c string)
        > row format delimited
        > fields terminated by ' ';
    OK
    Time taken: 0.19 seconds
    hive> create table bb like aa;
    OK
    Time taken: 0.188 seconds
    

    查看两个表的结构:

    hive> describe aa;
    OK
    a                   	string              	                    
    b                   	string              	                    
    c                   	string              	                    
    Time taken: 0.068 seconds, Fetched: 3 row(s)
    hive> describe bb;
    OK
    a                   	string              	                    
    b                   	string              	                    
    c                   	string              	                    
    Time taken: 0.045 seconds, Fetched: 3 row(s)
    

    下面可以基于本地的文件,导入数据

    hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa;
    Loading data to table test.aa
    OK
    Time taken: 0.519 seconds
    hive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb;
    Loading data to table test.bb
    OK
    Time taken: 0.321 seconds
    

    内连接

    内连接即基于on语句,仅列出表1和表2符合连接条件的数据。

    hive> select * from aa a join bb b on a.c=b.a;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313
    Total jobs = 1
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2016-08-24 16:12:44	Starting to launch local task to process map join;	maximum memory = 518979584
    2016-08-24 16:12:47	Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
    2016-08-24 16:12:47	Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (332 bytes)
    2016-08-24 16:12:47	End of local task; Time Taken: 3.425 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-24 16:12:50,222 Stage-3 map = 100%,  reduce = 0%
    Ended Job = job_local944389202_0007
    MapReduce Jobs Launched: 
    Stage-Stage-3:  HDFS Read: 1264 HDFS Write: 90 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    3	c	1	1	xxx	2
    1	a	3	3	zzz	5
    Time taken: 17.083 seconds, Fetched: 2 row(s)
    
    

    左连接

    左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null

    ive> select * from aa a left outer join bb b on a.c=b.a;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533
    Total jobs = 1
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2016-08-24 16:16:48	Starting to launch local task to process map join;	maximum memory = 518979584
    2016-08-24 16:16:51	Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
    2016-08-24 16:16:51	Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (338 bytes)
    2016-08-24 16:16:51	End of local task; Time Taken: 2.634 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-24 16:16:53,843 Stage-3 map = 100%,  reduce = 0%
    Ended Job = job_local1670258961_0008
    MapReduce Jobs Launched: 
    Stage-Stage-3:  HDFS Read: 1282 HDFS Write: 90 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1	a	3	3	zzz	5
    2	b	4	NULL	NULL	NULL
    3	c	1	1	xxx	2
    Time taken: 16.048 seconds, Fetched: 3 row(s)
    
    

    右连接

    类似左连接,同理。

    hive> select * from aa a right outer join bb b on a.c=b.a;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b
    Total jobs = 1
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2016-08-24 16:22:37	Starting to launch local task to process map join;	maximum memory = 518979584
    2016-08-24 16:22:40	Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
    2016-08-24 16:22:40	Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (332 bytes)
    2016-08-24 16:22:40	End of local task; Time Taken: 2.368 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-24 16:22:43,060 Stage-3 map = 100%,  reduce = 0%
    Ended Job = job_local2001415675_0009
    MapReduce Jobs Launched: 
    Stage-Stage-3:  HDFS Read: 1306 HDFS Write: 90 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    3	c	1	1	xxx	2
    NULL	NULL	NULL	2	yyy	3
    1	a	3	3	zzz	5
    Time taken: 15.483 seconds, Fetched: 3 row(s)
    
    

    全连接

    相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.

    hive> select * from aa a full outer join bb b on a.c=b.a;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2016-08-24 16:22:54,111 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_local1766586034_0010
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 4026 HDFS Write: 270 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    3	c	1	1	xxx	2
    NULL	NULL	NULL	2	yyy	3
    1	a	3	3	zzz	5
    2	b	4	NULL	NULL	NULL
    Time taken: 1.689 seconds, Fetched: 4 row(s)
    
    

    左半开连接

    这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。

    hive> select * from aa a left semi join bb b on a.c=b.a;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816
    Total jobs = 1
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2016-08-24 16:23:37	Starting to launch local task to process map join;	maximum memory = 518979584
    2016-08-24 16:23:41	Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
    2016-08-24 16:23:41	Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (317 bytes)
    2016-08-24 16:23:41	End of local task; Time Taken: 3.586 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-24 16:23:43,798 Stage-3 map = 100%,  reduce = 0%
    Ended Job = job_local521961878_0011
    MapReduce Jobs Launched: 
    Stage-Stage-3:  HDFS Read: 1366 HDFS Write: 90 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1	a	3
    3	c	1
    Time taken: 16.811 seconds, Fetched: 2 row(s)
    
    

    笛卡尔积

    笛卡尔积会针对表1和表2的每条数据做连接...

    hive> select * from aa join bb;
    Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f
    Total jobs = 1
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2016-08-24 16:25:00	Starting to launch local task to process map join;	maximum memory = 518979584
    2016-08-24 16:25:02	Dump the side-table for tag: 0 with group count: 1 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
    2016-08-24 16:25:02	Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (305 bytes)
    2016-08-24 16:25:02	End of local task; Time Taken: 2.892 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-24 16:25:05,677 Stage-3 map = 100%,  reduce = 0%
    Ended Job = job_local2068422373_0012
    MapReduce Jobs Launched: 
    Stage-Stage-3:  HDFS Read: 1390 HDFS Write: 90 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1	a	3	1	xxx	2
    2	b	4	1	xxx	2
    3	c	1	1	xxx	2
    1	a	3	2	yyy	3
    2	b	4	2	yyy	3
    3	c	1	2	yyy	3
    1	a	3	3	zzz	5
    2	b	4	3	zzz	5
    3	c	1	3	zzz	5
    

    上面就是hive中的连接查询,其实与SQL一样的。

  • 相关阅读:
    RabbitMQ简介、特性、使用场景、安装、启动与关闭
    mybatis的工作原理
    bzoj2119 股市的预测
    Noi2014 购票
    51Nod 算法马拉松22 开黑记
    COGS2485 从零开始的序列
    Codeforces Round #402 (Div.2)
    BestCoder Round #92
    COGS2294 释迦
    bzoj4764 弹飞大爷
  • 原文地址:https://www.cnblogs.com/xing901022/p/5804836.html
Copyright © 2020-2023  润新知