• MapReduce执行计划及优化


    WordCount:

    select rank, count(*) cnt from city group by rank;

    Map与Reduce 都是一个JVM进程,可以理解为都是一个独立的应用程序

    MapReduce 框架的作用就是自动帮你把map端的输出,经过shuffle后变为reduce端的输入

    方法:

    public void map(Text value,Context context);

    public void reduce(Text key, Iterable<IntWritable> values, Context context)

    Shuffle:

     

    common join

    Map Join

     

    MapReduce

    word count:


    ### 执行计划
    explain select
    a.name,a.entity_id
    from
    tangbao.shop a
    join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage |
    | Stage-3 depends on stages: Stage-4 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | a |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | a |
    | TableScan |
    | alias: a |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | HashTable Sink Operator |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: b |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Map Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+

    explain select
    a.name,a.entity_id
    from
    tangbao.shop a
    left join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage , consists of Stage-1 |
    | Stage-1 |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Conditional Operator |
    | |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: name (type: string) |
    | TableScan |
    | alias: b |
    | Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
    | value expressions: pt (type: string) |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Outer Join0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38, _col117 |
    | Statistics: Num rows: 5304174985 Data size: 312946324142 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (_col117 = '20180501') (type: boolean) |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+

    explain select
    avg(fee),max(fee)
    from
    olap_pc_history_report.tmp_tp_p_join
    group by
    entity_id,curr_date,kindpay_id;

    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: tmp_tp_p_join |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string), fee (type: double) |
    | outputColumnNames: entity_id, curr_date, kindpay_id, fee |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | aggregations: avg(fee), max(fee) |
    | keys: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
    | mode: hash |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col1) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: mergepartial |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col3 (type: double), _col4 (type: double) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +-------------------------------------------------------------------------------------------------------------------------------+--+

    set hive.map.aggr=false;

    set hive.groupby.skewindata=true;

    explain select
    avg(fee),max(fee)
    from
    olap_pc_history_report.tmp_tp_p_join
    group by
    entity_id,curr_date,kindpay_id;

    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-2 depends on stages: Stage-1 |
    | Stage-0 depends on stages: Stage-2 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: tmp_tp_p_join |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: curr_date (type: string), entity_id (type: string), fee (type: double), kindpay_id (type: string) |
    | outputColumnNames: curr_date, entity_id, fee, kindpay_id |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: rand() (type: double) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: fee (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col0) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: partial1 |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: true |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    | |
    | Stage: Stage-2 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | Reduce Output Operator |
    | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col1) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: final |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col3 (type: double), _col4 (type: double) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +-------------------------------------------------------------------------------------------------------------------------------+--+

    set hive.optimize.skewjoin=true;

    explain select
    a.name,a.entity_id
    from
    ods_shop_org.shop a
    join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-7 is a root stage , consists of Stage-1 |
    | Stage-1 |
    | Stage-4 depends on stages: Stage-1 , consists of Stage-8 |
    | Stage-8 |
    | Stage-3 depends on stages: Stage-8 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-7 |
    | Conditional Operator |
    | |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 443697 Data size: 88739406 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: name (type: string) |
    | TableScan |
    | alias: b |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | handleSkewJoin: true |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-4 |
    | Conditional Operator |
    | |
    | Stage: Stage-8 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | 1 |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | 1 |
    | TableScan |
    | HashTable Sink Operator |
    | keys: |
    | 0 reducesinkkey0 (type: string) |
    | 1 reducesinkkey0 (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | Map Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | keys: |
    | 0 reducesinkkey0 (type: string) |
    | 1 reducesinkkey0 (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    +------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +------------------------------------------------------------------------------------------------------------------+--+
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +------------------------------------------------------------------------------------------------------------------+--+

    explain select
    a.name,a.entity_id
    from
    ods_order_org.instancedetail a
    left semi join
    tangbao.shop b
    on a.entity_id=b.entity_id
    where a.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage |
    | Stage-3 depends on stages: Stage-4 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | b |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | b |
    | TableScan |
    | alias: b |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: entity_id (type: string) |
    | outputColumnNames: entity_id |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: entity_id (type: string) |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | HashTable Sink Operator |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 _col0 (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 41134851 Data size: 8226970212 Basic stats: COMPLETE Column stats: NONE |
    | Map Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 _col0 (type: string) |
    | outputColumnNames: _col9, _col29 |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col29 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+



  • 相关阅读:
    POJ 1166 The Clocks 高斯消元 + exgcd(纯属瞎搞)
    防止登录页面出如今frame中
    android--显式跳转和隐式跳转的差别使用方法
    卫星照片
    poj 2586 Y2K Accounting Bug (贪心)
    【转】关于Python脚本开头两行的:#!/usr/bin/python和# -*- coding: utf-8 -*-的作用 – 指定文件编码类型
    【转】在Eclipse中使用PyDev进行Python开发
    【转】eclipse + Pydev 配置Python开发环境
    【转】Python自动化测试 (一) Eclipse+Pydev 搭建开发环境
    【转】Eclipse的启动问题【an error has occurred see the log file】
  • 原文地址:https://www.cnblogs.com/hyc123-/p/9190515.html
Copyright © 2020-2023  润新知