WordCount:
select rank, count(*) cnt from city group by rank;
Map与Reduce 都是一个JVM进程,可以理解为都是一个独立的应用程序
MapReduce 框架的作用就是自动帮你把map端的输出,经过shuffle后变为reduce端的输入
方法:
public void map(Text value,Context context);
public void reduce(Text key, Iterable<IntWritable> values, Context context)
Shuffle:
common join
Map Join
MapReduce
word count:
### 执行计划
explain select
a.name,a.entity_id
from
tangbao.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';
+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| a |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| a |
| TableScan |
| alias: a |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+
explain select
a.name,a.entity_id
from
tangbao.shop a
left join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';
+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| value expressions: name (type: string) |
| TableScan |
| alias: b |
| Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
| value expressions: pt (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38, _col117 |
| Statistics: Num rows: 5304174985 Data size: 312946324142 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (_col117 = '20180501') (type: boolean) |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+
explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;
+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string), fee (type: double) |
| outputColumnNames: entity_id, curr_date, kindpay_id, fee |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: avg(fee), max(fee) |
| keys: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+
set hive.map.aggr=false;
set hive.groupby.skewindata=true;
explain select
avg(fee),max(fee)
from
olap_pc_history_report.tmp_tp_p_join
group by
entity_id,curr_date,kindpay_id;
+-------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tmp_tp_p_join |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: curr_date (type: string), entity_id (type: string), fee (type: double), kindpay_id (type: string) |
| outputColumnNames: curr_date, entity_id, fee, kindpay_id |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
| sort order: +++ |
| Map-reduce partition columns: rand() (type: double) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: fee (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col0) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: partial1 |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: true |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0), max(VALUE._col1) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: final |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col3 (type: double), _col4 (type: double) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+-------------------------------------------------------------------------------------------------------------------------------+--+
set hive.optimize.skewjoin=true;
explain select
a.name,a.entity_id
from
ods_shop_org.shop a
join
ods_order_org.instancedetail b
on a.entity_id=b.entity_id
where b.pt='20180501';
+------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-7 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-4 depends on stages: Stage-1 , consists of Stage-8 |
| Stage-8 |
| Stage-3 depends on stages: Stage-8 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-7 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 443697 Data size: 88739406 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
| value expressions: name (type: string) |
| TableScan |
| alias: b |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: entity_id (type: string) |
| sort order: + |
| Map-reduce partition columns: entity_id (type: string) |
| Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| handleSkewJoin: true |
| keys: |
| 0 entity_id (type: string) |
| 1 entity_id (type: string) |
| outputColumnNames: _col9, _col38 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-4 |
| Conditional Operator |
| |
| Stage: Stage-8 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| 1 |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| 1 |
| TableScan |
| HashTable Sink Operator |
| keys: |
| 0 reducesinkkey0 (type: string) |
| 1 reducesinkkey0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Map Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 reducesinkkey0 (type: string) |
| 1 reducesinkkey0 (type: string) |
| outputColumnNames: _col9, _col38 |
| Select Operator |
| expressions: _col9 (type: string), _col38 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
+------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+------------------------------------------------------------------------------------------------------------------+--+
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+------------------------------------------------------------------------------------------------------------------+--+
explain select
a.name,a.entity_id
from
ods_order_org.instancedetail a
left semi join
tangbao.shop b
on a.entity_id=b.entity_id
where a.pt='20180501';
+--------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+--------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| b |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| b |
| TableScan |
| alias: b |
| filterExpr: entity_id is not null (type: boolean) |
| Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: entity_id (type: string) |
| outputColumnNames: entity_id |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: entity_id (type: string) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 entity_id (type: string) |
| 1 _col0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
| Statistics: Num rows: 82269701 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: entity_id is not null (type: boolean) |
| Statistics: Num rows: 41134851 Data size: 8226970212 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 entity_id (type: string) |
| 1 _col0 (type: string) |
| outputColumnNames: _col9, _col29 |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col9 (type: string), _col29 (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+--------------------------------------------------------------------------------------------------------------------+--+