一、前言
公司实用Hadoop构建数据仓库,期间不可避免的实用HiveSql,在Etl过程中,速度成了避无可避的问题。本人有过几个数据表关联跑1个小时的经历,你可能觉得无所谓,可是多次Etl就要多个小时,非常浪费时间,所以HiveSql优化不可避免。
注:本文只是从sql层面介绍一下日常需要注意的点,不涉及Hadoop、MapReduce等层面,关于Hive的编译过程,请参考文章:http://tech.meituan.com/hive-sql-to-mapreduce.html
二、准备数据
假设咱们有两张数据表。
景区表:sight,12W条记录,数据表结构:
hive> desc sight; OK area string None city string None country string None county string None id string None name string None region string None
景区订单明细表:order_sight,1040W条记录,数据表结构:
hive> desc order_sight; OK create_time string None id string None order_id string None sight_id bigint None
三、分析
3.1 where条件
那么咱们希望看见景区id是9718,日期是2015-10-10的所有订单id,那么sql需要如下书写:
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10'; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/ Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562174 Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1 2015-10-12 22:58:22,706 Stage-1 map = 0%, reduce = 0% 2015-10-12 22:58:29,882 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 4.73 sec 2015-10-12 22:58:30,907 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 4.73 sec 2015-10-12 22:58:31,933 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:32,968 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:33,995 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:35,020 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:36,046 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:37,070 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:38,096 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:39,121 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:40,153 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec 2015-10-12 22:58:41,182 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec 2015-10-12 22:58:42,209 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec 2015-10-12 22:58:43,236 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec 2015-10-12 22:58:44,263 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec 2015-10-12 22:58:45,289 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec 2015-10-12 22:58:46,316 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec 2015-10-12 22:58:47,344 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec 2015-10-12 22:58:48,370 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec 2015-10-12 22:58:49,397 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec 2015-10-12 22:58:50,424 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec 2015-10-12 22:58:51,452 Stage-1 map = 83%, reduce = 17%, Cumulative CPU 37.62 sec 2015-10-12 22:58:52,478 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 38.06 sec 2015-10-12 22:58:53,506 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 38.06 sec 2015-10-12 22:58:54,534 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec 2015-10-12 22:58:55,560 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec 2015-10-12 22:58:56,587 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec 2015-10-12 22:58:57,615 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec 2015-10-12 22:58:58,642 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec 2015-10-12 22:58:59,674 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec 2015-10-12 22:59:00,708 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec 2015-10-12 22:59:01,736 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec 2015-10-12 22:59:02,763 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec 2015-10-12 22:59:03,791 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.41 sec 2015-10-12 22:59:04,817 Stage-1 map = 96%, reduce = 29%, Cumulative CPU 49.13 sec 2015-10-12 22:59:05,843 Stage-1 map = 100%, reduce = 29%, Cumulative CPU 49.59 sec 2015-10-12 22:59:06,870 Stage-1 map = 100%, reduce = 41%, Cumulative CPU 49.76 sec 2015-10-12 22:59:07,897 Stage-1 map = 100%, reduce = 41%, Cumulative CPU 49.76 sec 2015-10-12 22:59:08,922 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 52.79 sec 2015-10-12 22:59:09,947 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 52.79 sec MapReduce Total cumulative CPU time: 52 seconds 790 msec Ended Job = job_1434099279301_3562174 MapReduce Jobs Launched: Job 0: Map: 8 Reduce: 1 Cumulative CPU: 52.79 sec HDFS Read: 371210469 HDFS Write: 330 SUCCESS Total MapReduce CPU Time Spent: 52 seconds 790 msec OK 9718 210294949 9718 210294421 9718 210296438 9718 210295344 9718 210297567 9718 210296076 9718 210295525 9718 210298219 9718 210295840 9718 210301363 9718 210297733 9718 210298066 9718 210295239 9718 210298328 9718 210298008 9718 210299712 9718 210295586 9718 210295050 9718 210295566 9718 210299105 9718 210296318 9718 210295277 Time taken: 52.068 seconds, Fetched: 22 row(s)
可见需要的时间是52秒,如果咱们换一个sql的书写方式:
hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/ Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562218 Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1 2015-10-12 23:03:54,926 Stage-1 map = 0%, reduce = 0% 2015-10-12 23:04:01,075 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec 2015-10-12 23:04:02,101 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec 2015-10-12 23:04:03,126 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec 2015-10-12 23:04:04,151 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 5.53 sec 2015-10-12 23:04:05,176 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 5.53 sec 2015-10-12 23:04:06,201 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.62 sec 2015-10-12 23:04:07,226 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec 2015-10-12 23:04:08,250 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec 2015-10-12 23:04:09,275 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec 2015-10-12 23:04:10,300 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec 2015-10-12 23:04:11,324 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec 2015-10-12 23:04:12,356 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec 2015-10-12 23:04:13,384 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec 2015-10-12 23:04:14,410 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec 2015-10-12 23:04:15,437 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec 2015-10-12 23:04:16,463 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec 2015-10-12 23:04:17,487 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec 2015-10-12 23:04:18,514 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec 2015-10-12 23:04:19,538 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec 2015-10-12 23:04:20,569 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec 2015-10-12 23:04:21,595 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec 2015-10-12 23:04:22,620 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec 2015-10-12 23:04:23,646 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec 2015-10-12 23:04:24,673 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec 2015-10-12 23:04:25,698 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec 2015-10-12 23:04:26,723 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec 2015-10-12 23:04:27,748 Stage-1 map = 75%, reduce = 21%, Cumulative CPU 23.32 sec 2015-10-12 23:04:28,774 Stage-1 map = 88%, reduce = 21%, Cumulative CPU 27.27 sec 2015-10-12 23:04:29,799 Stage-1 map = 100%, reduce = 21%, Cumulative CPU 32.82 sec 2015-10-12 23:04:30,823 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 34.35 sec 2015-10-12 23:04:31,846 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 34.35 sec MapReduce Total cumulative CPU time: 34 seconds 350 msec Ended Job = job_1434099279301_3562218 MapReduce Jobs Launched: Job 0: Map: 8 Reduce: 1 Cumulative CPU: 34.35 sec HDFS Read: 371210469 HDFS Write: 330 SUCCESS Total MapReduce CPU Time Spent: 34 seconds 350 msec OK 9718 210297733 9718 210298066 9718 210295239 9718 210298328 9718 210298008 9718 210299712 9718 210297567 9718 210296076 9718 210295525 9718 210298219 9718 210295840 9718 210301363 9718 210295586 9718 210295050 9718 210295566 9718 210299105 9718 210296318 9718 210295277 9718 210294949 9718 210294421 9718 210296438 9718 210295344 Time taken: 43.709 seconds, Fetched: 22 row(s)
实用43秒,快了一些。当然咱们并不是仅仅分析说快了20%(我还多次测试,这次的差距最小),而是分析原因!
单从两个sql的写法上看的出来,特别是第二条的红色部分,我将left的条件写到里面了。那么执行的结果随之不一样,第二条的Reduce时间明显小于第一条的Reduce时间。
原因是这两个sql都分解成8个Map任务和1个Reduce任务,如果left的条件写在后面,那么这些关联操作会放在Reduce阶段,1个Reduce操作的时间必然大于8个Map的执行时间,造成执行时间超长。
结论:当使用外关联时,如果将副表的过滤条件写在Where后面,那么就会先全表关联,之后再过滤