MapReduce执行计划及优化

WordCount:

select rank, count(*) cnt from city group by rank;

Map与Reduce 都是一个JVM进程，可以理解为都是一个独立的应用程序

MapReduce 框架的作用就是自动帮你把map端的输出，经过shuffle后变为reduce端的输入

方法：

public void map(Text value,Context context);

public void reduce(Text key, Iterable<IntWritable> values, Context context)

Shuffle:

common join

Map Join

MapReduce

word count：

### 执行计划
explain select
a.name,a.entity_id
from
tangbao.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

explain select
a.name,a.entity_id
from
tangbao.shop a
left join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;

+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string), fee (type: double) |
| outputColumnNames: entity_id, curr_date, kindpay_id, fee |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: avg(fee), max(fee) |
| keys: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+

set hive.map.aggr=false;

set hive.groupby.skewindata=true;

explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;

+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: curr_date (type: string), entity_id (type: string), fee (type: double), kindpay_id (type: string) |
| outputColumnNames: curr_date, entity_id, fee, kindpay_id |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| sort order: +++ |
| Map-reduce partition columns: rand() (type: double) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: fee (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col0) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: partial1 |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: true |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: final |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+

set hive.optimize.skewjoin=true;

explain select
a.name,a.entity_id
from
ods_shop_org.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';

explain select
a.name,a.entity_id
from
ods_order_org.instancedetail a
left semi join
tangbao.shop b
on a.entity_id=b.entity_id
where a.pt='20180501';

相关阅读:
POJ 1166 The Clocks 高斯消元 + exgcd(纯属瞎搞)
防止登录页面出如今frame中
android--显式跳转和隐式跳转的差别使用方法
卫星照片
poj 2586 Y2K Accounting Bug (贪心)
【转】关于Python脚本开头两行的：#!/usr/bin/python和# -*- coding: utf-8 -*-的作用 – 指定文件编码类型
【转】在Eclipse中使用PyDev进行Python开发
【转】eclipse + Pydev 配置Python开发环境
【转】Python自动化测试 (一) Eclipse+Pydev 搭建开发环境
【转】Eclipse的启动问题【an error has occurred see the log file】

原文地址：https://www.cnblogs.com/hyc123-/p/9190515.html