• Hive – How Group By Is Implemented


    Preparing the data

    SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    hive> SELECT * FROM logs;
    a	apple	5
    a	orange	3
    a	apple	2
    b	roast chicken	1
     
    hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    a	10
    b	1

    The computation process

    (Figure: hive-groupby-cal — the Group By computation flow)
    hive.map.aggr is set to true by default, so a first group-by pass runs on the mapper and the partial results are merged at the end, which reduces the amount of data the reducers have to process. Note that the mode shown in the explain output differs per side: the mapper's Group By Operator runs in hash mode, the reducer's in mergepartial. If hive.map.aggr=false, the group-by is deferred entirely to the reducer, and its mode is then complete.
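
    To make the effect concrete, here is a minimal, self-contained Java sketch (illustrative only, not Hive code; the class and method names are invented) of the same two-phase pattern: each mapper pre-aggregates its rows in a hash table, so only one (uid, partial sum) pair per distinct uid reaches the shuffle, and the reducer merges the partials.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapAggrSketch {
        // Map phase with hive.map.aggr=true (mode = hash): partially aggregate
        // in a hash table keyed by uid; each input row is {uid, item, count}.
        static Map<String, Long> mapSideAggregate(List<String[]> rows) {
            Map<String, Long> partial = new HashMap<>();
            for (String[] row : rows) {
                partial.merge(row[0], Long.parseLong(row[2]), Long::sum);
            }
            return partial;
        }

        // Reduce phase (mode = mergepartial): merge the per-mapper partial sums.
        static Map<String, Long> reduceMerge(List<Map<String, Long>> partials) {
            Map<String, Long> result = new HashMap<>();
            for (Map<String, Long> p : partials) {
                p.forEach((uid, sum) -> result.merge(uid, sum, Long::sum));
            }
            return result;
        }

        public static void main(String[] args) {
            List<String[]> mapper1 = List.of(new String[]{"a", "apple", "5"},
                                             new String[]{"a", "orange", "3"});
            List<String[]> mapper2 = List.of(new String[]{"a", "apple", "2"},
                                             new String[]{"b", "roast chicken", "1"});
            System.out.println(reduceMerge(
                List.of(mapSideAggregate(mapper1), mapSideAggregate(mapper2))));
            // prints {a=10, b=1} (iteration order may vary)
        }
    }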

    Operator

    (Figure: hive-groupby-op — the Group By operator tree)

    Explain

    hive> explain SELECT uid, sum(count) FROM logs group by uid;
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
     
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage
     
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            logs 
              TableScan // scan the table
                alias: logs
                Select Operator // select the needed columns
                  expressions:
                        expr: uid
                        type: string
                        expr: count
                        type: int
                  outputColumnNames: uid, count
                  Group By Operator // present because hive.map.aggr=true by default: the mapper aggregates first to shrink the data the reducer must handle
                    aggregations:
                          expr: sum(count) // the aggregate function
                    bucketGroup: false
                    keys: // the grouping keys
                          expr: uid
                          type: string
                    mode: hash // hash-based aggregation, see processHashAggr()
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator // emit key/value pairs to the reducer
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
          Reduce Operator Tree:
            Group By Operator
     
              aggregations:
                    expr: sum(VALUE._col0) // merge the partial aggregates
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: string
              mode: mergepartial // merge the partial values
              outputColumnNames: _col0, _col1
              Select Operator // select the columns
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator // write the result to a file
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
     
      Stage: Stage-0
        Fetch Operator
          limit: -1

    hive> select distinct value from src;
    hive> select max(key) from src;
    Because these queries have no grouping keys, only a single reducer is used.

     
      2.2 If the query has aggregate functions or a GROUP BY, do the following (the four combinations are summarized in the table after this list):
            Insert a Select Operator that selects all columns, to support the ColumnPruner optimization stage.
            2.2.1  hive.map.aggr is true (the default, i.e. enabled): partial aggregation is done on the map side
                  2.2.1.1 hive.groupby.skewindata is false (the default): the group-by data is not skewed.
                  Operators generated: GroupByOperator + ReduceSinkOperator + GroupByOperator.
      GroupByOperator + ReduceSinkOperator operate on the map side; the first GroupByOperator does the partial aggregation there, and the second GroupByOperator performs the GroupBy on the reduce side.
                  2.2.1.2 hive.groupby.skewindata is true
                  Operators generated: GroupByOperator + ReduceSinkOperator + GroupByOperator + ReduceSinkOperator + GroupByOperator
               GroupByOperator + ReduceSinkOperator (map phase of the first MapredTask)
               GroupByOperator (reduce phase of the first MapredTask)
               ReduceSinkOperator (map phase of the second MapredTask)
               GroupByOperator (reduce phase of the second MapredTask)
            2.2.2  hive.map.aggr is false
                   2.2.2.1 hive.groupby.skewindata is true
                   Operators generated: ReduceSinkOperator + GroupByOperator + ReduceSinkOperator + GroupByOperator
               ReduceSinkOperator (map phase of the first MapredTask)
               GroupByOperator (reduce phase of the first MapredTask)
               ReduceSinkOperator (map phase of the second MapredTask)
               GroupByOperator (reduce phase of the second MapredTask)
                   2.2.2.2 hive.groupby.skewindata is false
                   Operators generated: ReduceSinkOperator (runs in the map phase) + GroupByOperator (runs in the reduce phase)



    Case 1:
    set hive.map.aggr=false;
    set hive.groupby.skewindata=false;
    SemanticAnalyzer.genGroupByPlan1MR(){
      (1)ReduceSinkOperator: It will put all Group By keys and the distinct field (if any) in the map-reduce sort key, and all other fields in the map-reduce value.
      (2)GroupByOperator: GroupByDesc.Mode.COMPLETE. Reducer: iterate/terminate (mode = COMPLETE)
    }

    Case 2:
    set hive.map.aggr=true;
    set hive.groupby.skewindata=false;
    SemanticAnalyzer.genGroupByPlanMapAggr1MR(){
       (1)GroupByOperator: GroupByDesc.Mode.HASH. The aggregation evaluation functions are as follows: Mapper: iterate/terminatePartial (mode = HASH)
       (2)ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT; grouping + distinct key if DISTINCT
       (3)GroupByOperator: GroupByDesc.Mode.MERGEPARTIAL. Reducer: iterate/terminate if DISTINCT; merge/terminate if NO DISTINCT (mode = MERGEPARTIAL)
    }

    Case 3:
    set hive.map.aggr=false;
    set hive.groupby.skewindata=true;
    SemanticAnalyzer.genGroupByPlan2MR(){
        (1)ReduceSinkOperator: Partitioning Key: random() if no DISTINCT; grouping + distinct key if DISTINCT. Sorting Key: grouping key if no DISTINCT; grouping + distinct key if DISTINCT
        (2)GroupByOperator: GroupByDesc.Mode.PARTIAL1. Reducer: iterate/terminatePartial (mode = PARTIAL1)
        (3)ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT; grouping + distinct key if DISTINCT
        (4)GroupByOperator: GroupByDesc.Mode.FINAL. Reducer: merge/terminate (mode = FINAL)
    }

    Case 4:
    set hive.map.aggr=true;
    set hive.groupby.skewindata=true;
    SemanticAnalyzer.genGroupByPlanMapAggr2MR(){
         (1)GroupByOperator: GroupByDesc.Mode.HASH. Mapper: iterate/terminatePartial (mode = HASH)
         (2)ReduceSinkOperator: Partitioning Key: random() if no DISTINCT; grouping + distinct key if DISTINCT. Sorting Key: grouping key if no DISTINCT; grouping + distinct key if DISTINCT
         (3)GroupByOperator: GroupByDesc.Mode.PARTIALS. Reducer: iterate/terminatePartial if DISTINCT; merge/terminatePartial if NO DISTINCT (mode = PARTIALS)
         (4)ReduceSinkOperator: Partitioning Key: grouping key. Sorting Key: grouping key if no DISTINCT; grouping + distinct key if DISTINCT
         (5)GroupByOperator: GroupByDesc.Mode.FINAL. Reducer: merge/terminate (mode = FINAL)
    }
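
    The iterate/merge/terminatePartial/terminate pairs named in the four cases above are the aggregation evaluator's building blocks. Below is a minimal Java sketch (illustrative only; this is not Hive's GenericUDAFEvaluator, and the class name is invented) using sum to show which pair each GroupByDesc.Mode runs:

    import java.util.List;

    public class AggrModeSketch {
        static class Buffer { long sum; }  // per-group aggregation state

        static void iterate(Buffer b, long raw)   { b.sum += raw; }     // consume a raw row value
        static void merge(Buffer b, long partial) { b.sum += partial; } // consume a partial result
        static long terminatePartial(Buffer b)    { return b.sum; }     // emit an intermediate result
        static long terminate(Buffer b)           { return b.sum; }     // emit the final result

        // mode = HASH or PARTIAL1: iterate per raw row, then terminatePartial.
        static long partialStage(List<Long> rawRows) {
            Buffer b = new Buffer();
            for (long v : rawRows) iterate(b, v);
            return terminatePartial(b);
        }

        // mode = FINAL (and MERGEPARTIAL without DISTINCT): merge per partial,
        // then terminate. PARTIALS is the same but ends with terminatePartial;
        // with a DISTINCT column the inputs are still raw rows, so iterate
        // takes the place of merge.
        static long finalStage(List<Long> partials) {
            Buffer b = new Buffer();
            for (long p : partials) merge(b, p);
            return terminate(b);
        }

        public static void main(String[] args) {
            long p1 = partialStage(List.of(5L, 3L)); // mapper 1: uid=a rows
            long p2 = partialStage(List.of(2L));     // mapper 2: uid=a row
            System.out.println(finalStage(List.of(p1, p2))); // 10
        }
    }

    COMPLETE, used when the whole group-by happens in one reducer pass, is simply iterate per raw row followed by terminate.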


    ReduceSinkOperator's processOp(Object row, int tag) sets the key's hash value according to these rules. Take the first ReduceSinkOperator of case 4 (Partitioning Key: random() if no DISTINCT; grouping + distinct key if DISTINCT): if there is no DISTINCT column, the key's hash is set to a random number (random = new Random(12345);) just before OutputCollector.collect, which spreads a hot key across all reducers. If there is a DISTINCT column, the key's hash is derived from the grouping + distinct key.
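
    A sketch of that partition-key choice (the seed 12345 comes from the text above; the class, method, and hashing details are invented for illustration and are not the actual ReduceSinkOperator code):

    import java.util.Arrays;
    import java.util.Random;

    public class SkewPartitionSketch {
        private final Random random = new Random(12345); // same seed as in the text

        // Pick the shuffle partition for one output record of the first
        // ReduceSinkOperator when hive.groupby.skewindata=true.
        int choosePartition(Object[] groupingKey, Object[] distinctKey, int numReducers) {
            int hash;
            if (distinctKey == null || distinctKey.length == 0) {
                // no DISTINCT: spray rows randomly so one hot grouping key
                // is spread over all reducers in the first MR job
                hash = random.nextInt();
            } else {
                // DISTINCT present: partition on grouping + distinct key, so all
                // copies of one distinct value meet in the same reducer
                hash = Arrays.deepHashCode(new Object[]{groupingKey, distinctKey});
            }
            return (hash & Integer.MAX_VALUE) % numReducers;
        }
    }

    The second MR job then partitions on the grouping key alone, bringing each group's partial aggregates back together for the final merge.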



    GroupByOperator's main entry points:
    initializeOp(Configuration hconf)   // set up the hash table / aggregation buffers
    processOp(Object row, int tag)      // fold one input row into its group's buffer
    closeOp(boolean abort)              // flush the remaining groups when input ends
    forward(ArrayList<Object> keys, AggregationBuffer[] aggs)  // emit one group to the child operator


    groupby10.q   groupby11.q
    set hive.map.aggr=false;
    set hive.groupby.skewindata=false;

    EXPLAIN
    FROM INPUT
    INSERT OVERWRITE TABLE dest1 SELECT INPUT.key, count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5)) GROUP BY INPUT.key;

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            input
              TableScan
                alias: input
                Select Operator  // insertSelectAllPlanForGroupBy
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Reduce Output Operator
                    key expressions:
                          expr: key
                          type: int
                          expr: substr(value, 5)
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: key
                          type: int
                    tag: -1
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(KEY._col1:0._col0)
                    expr: count(DISTINCT KEY._col1:0._col0)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: complete
              outputColumnNames: _col0, _col1, _col2
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: bigint
                      expr: _col2
                      type: bigint
                outputColumnNames: _col0, _col1, _col2
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: UDFToInteger(_col1)
                        type: int
                        expr: UDFToInteger(_col2)
                        type: int
                  outputColumnNames: _col0, _col1, _col2
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dest1

      Stage: Stage-0
        Move Operator
          tables:
              replace: true
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1




    set hive.map.aggr=true;
    set hive.groupby.skewindata=false;

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            input
              TableScan
                alias: input
                Select Operator
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Group By Operator
                    aggregations:
                          expr: count(substr(value, 5))
                          expr: count(DISTINCT substr(value, 5))
                    bucketGroup: false
                    keys:
                          expr: key
                          type: int
                          expr: substr(value, 5)
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: string
                      sort order: ++
                      Map-reduce partition columns:
                            expr: _col0
                            type: int
                      tag: -1
                      value expressions:
                            expr: _col2
                            type: bigint
                            expr: _col3
                            type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(VALUE._col0)
                    expr: count(DISTINCT KEY._col1:0._col0)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: mergepartial
              outputColumnNames: _col0, _col1, _col2
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: bigint
                      expr: _col2
                      type: bigint
                outputColumnNames: _col0, _col1, _col2
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: UDFToInteger(_col1)
                        type: int
                        expr: UDFToInteger(_col2)
                        type: int
                  outputColumnNames: _col0, _col1, _col2
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dest1

      Stage: Stage-0
        Move Operator
          tables:
              replace: true
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1








    set hive.map.aggr=false;
    set hive.groupby.skewindata=true;

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-2 depends on stages: Stage-1
      Stage-0 depends on stages: Stage-2

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            input
              TableScan
                alias: input
                Select Operator
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Reduce Output Operator
                    key expressions:
                          expr: key
                          type: int
                          expr: substr(value, 5)
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: key
                          type: int
                    tag: -1
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(KEY._col1:0._col0)
                    expr: count(DISTINCT KEY._col1:0._col0)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: partial1
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

      Stage: Stage-2
        Map Reduce
          Alias -> Map Operator Tree:
            hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-48-26_387_7978992474997402829/-mr-10002
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
                        expr: _col2
                        type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(VALUE._col0)
                    expr: count(VALUE._col1)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: final
              outputColumnNames: _col0, _col1, _col2
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: bigint
                      expr: _col2
                      type: bigint
                outputColumnNames: _col0, _col1, _col2
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: UDFToInteger(_col1)
                        type: int
                        expr: UDFToInteger(_col2)
                        type: int
                  outputColumnNames: _col0, _col1, _col2
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dest1

      Stage: Stage-0
        Move Operator
          tables:
              replace: true
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1




    set hive.map.aggr=true;
    set hive.groupby.skewindata=true;

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-2 depends on stages: Stage-1
      Stage-0 depends on stages: Stage-2

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            input
              TableScan
                alias: input
                Select Operator
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Group By Operator
                    aggregations:
                          expr: count(substr(value, 5))
                          expr: count(DISTINCT substr(value, 5))
                    bucketGroup: false
                    keys:
                          expr: key
                          type: int
                          expr: substr(value, 5)
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: string
                      sort order: ++
                      Map-reduce partition columns:
                            expr: _col0
                            type: int
                      tag: -1
                      value expressions:
                            expr: _col2
                            type: bigint
                            expr: _col3
                            type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(VALUE._col0)
                    expr: count(DISTINCT KEY._col1:0._col0)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: partials
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

      Stage: Stage-2
        Map Reduce
          Alias -> Map Operator Tree:
            hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-49-25_899_4946067838822964610/-mr-10002
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
                        expr: _col2
                        type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(VALUE._col0)
                    expr: count(VALUE._col1)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: final
              outputColumnNames: _col0, _col1, _col2
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: bigint
                      expr: _col2
                      type: bigint
                outputColumnNames: _col0, _col1, _col2
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: UDFToInteger(_col1)
                        type: int
                        expr: UDFToInteger(_col2)
                        type: int
                  outputColumnNames: _col0, _col1, _col2
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dest1

      Stage: Stage-0
        Move Operator
          tables:
              replace: true
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1








    set hive.map.aggr=false;
    set hive.groupby.skewindata=false;

    EXPLAIN extended
    FROM INPUT
    INSERT OVERWRITE TABLE dest1 SELECT INPUT.key, count(substr(INPUT.value,5)), count(distinct substr(INPUT.value,5)) GROUP BY INPUT.key;

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            input
              TableScan
                alias: input
                Select Operator
                  expressions:
                        expr: key
                        type: int
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Reduce Output Operator
                    key expressions:
                          expr: key
                          type: int
                          expr: substr(value, 5)
                          type: string
                    sort order: ++
                    Map-reduce partition columns:
                          expr: key
                          type: int
                    tag: -1
          Needs Tagging: false
          Path -> Alias:
            hdfs://localhost:54310/user/hive/warehouse/input [input]
          Path -> Partition:
            hdfs://localhost:54310/user/hive/warehouse/input
              Partition
                base file name: input
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                properties:
                  bucket_count -1
                  columns key,value
                  columns.types int:string
                  file.inputformat org.apache.hadoop.mapred.TextInputFormat
                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  location hdfs://localhost:54310/user/hive/warehouse/input
                  name input
                  serialization.ddl struct input { i32 key, string value}
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  transient_lastDdlTime 1310523947
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
             
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    columns key,value
                    columns.types int:string
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location hdfs://localhost:54310/user/hive/warehouse/input
                    name input
                    serialization.ddl struct input { i32 key, string value}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    transient_lastDdlTime 1310523947
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: input
                name: input
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(KEY._col1:0._col0)
                    expr: count(DISTINCT KEY._col1:0._col0)
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: int
              mode: complete
              outputColumnNames: _col0, _col1, _col2
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: bigint
                      expr: _col2
                      type: bigint
                outputColumnNames: _col0, _col1, _col2
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: UDFToInteger(_col1)
                        type: int
                        expr: UDFToInteger(_col2)
                        type: int
                  outputColumnNames: _col0, _col1, _col2
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
                    NumFilesPerFileSink: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          bucket_count -1
                          columns key,val1,val2
                          columns.types int:int:int
                          file.inputformat org.apache.hadoop.mapred.TextInputFormat
                          file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          location hdfs://localhost:54310/user/hive/warehouse/dest1
                          name dest1
                          serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          transient_lastDdlTime 1310523946
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dest1
                    TotalFiles: 1
                    MultiFileSpray: false

      Stage: Stage-0
        Move Operator
          tables:
              replace: true
              source: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10000
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    columns key,val1,val2
                    columns.types int:int:int
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location hdfs://localhost:54310/user/hive/warehouse/dest1
                    name dest1
                    serialization.ddl struct dest1 { i32 key, i32 val1, i32 val2}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    transient_lastDdlTime 1310523946
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1
              tmp directory: hdfs://localhost:54310/tmp/hive-tianzhao/hive_2011-07-15_21-50-38_510_6852880850328147221/-ext-10001



    ABSTRACT SYNTAX TREE:
    (TOK_QUERY
    (TOK_FROM (TOK_TABREF INPUT))
    (TOK_INSERT
    (TOK_DESTINATION (TOK_TAB dest1))
    (TOK_SELECT
    (TOK_SELEXPR (. (TOK_TABLE_OR_COL INPUT) key))
    (TOK_SELEXPR (TOK_FUNCTION count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
    (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION substr (. (TOK_TABLE_OR_COL INPUT) value) 5)))
    )
    (TOK_GROUPBY (. (TOK_TABLE_OR_COL INPUT) key))
    )
    )



    SemanticAnalyzer.genBodyPlan(QB qb, Operator input){
           if (qbp.getAggregationExprsForClause(dest).size() != 0
                || getGroupByForClause(qbp, dest).size() > 0) { // if the query has aggregate functions or a group by, take this branch
              //multiple distincts is not supported with skew in data
              if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
                  .equalsIgnoreCase("true") &&
                 qbp.getDistinctFuncExprsForClause(dest).size() > 1) {
                throw new SemanticException(ErrorMsg.UNSUPPORTED_MULTIPLE_DISTINCTS.
                    getMsg());
              }
              // insert a select operator here used by the ColumnPruner to reduce
              // the data to shuffle
          curr = insertSelectAllPlanForGroupBy(dest, curr); // insert a SelectOperator that selects every column (selectStar=true)
              if (conf.getVar(HiveConf.ConfVars.HIVEMAPSIDEAGGREGATE)
                  .equalsIgnoreCase("true")) {
                if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
                    .equalsIgnoreCase("false")) {
                  curr = genGroupByPlanMapAggr1MR(dest, qb, curr);
                } else {
                  curr = genGroupByPlanMapAggr2MR(dest, qb, curr);
                }
              } else if (conf.getVar(HiveConf.ConfVars.HIVEGROUPBYSKEW)
                  .equalsIgnoreCase("true")) {
                curr = genGroupByPlan2MR(dest, qb, curr);
              } else {
                curr = genGroupByPlan1MR(dest, qb, curr);
              }
           }  
    }
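
    Note the guard at the top of the method: when hive.groupby.skewindata=true, a query with more than one DISTINCT aggregate is rejected with UNSUPPORTED_MULTIPLE_DISTINCTS, presumably because the first-stage shuffle of the skew rewrite can only be keyed on a single grouping + distinct key combination.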

    Tests involving DISTINCT:
    count.q.out
    groupby11.q.out
    groupby10.q.out
    nullgroup4_multi_distinct.q.out
    join18.q.out
    groupby_bigdata.q.out
    join18_multi_distinct.q.out
    nullgroup4.q.out
    auto_join18_multi_distinct.q.out
    auto_join18.q.out

    (1) Map-side partial aggregation, no data skew: one MR job.
    genGroupByPlanMapAggr1MR generates three operators:
    (1.1) GroupByOperator: map-side partial aggregation, generated by genGroupByPlanMapGroupByOperator:
           Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
           Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to groupByKeys and outputColumnNames.
           Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.HASH
      outputColumnNames:groupby+Distinct+Aggregation
      keys:groupby+Distinct
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default

    (1.2)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby + distinct
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (1.3)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.MERGEPARTIAL
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default
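
    Mapping case (1) back to the first explain output above (SELECT uid, sum(count) FROM logs GROUP BY uid): the map-side GroupByDesc gets keys = [uid], aggregators = [sum(count)], mode = HASH; the ReduceSinkDesc partitions and sorts on _col0 (uid) and carries _col1 (the partial sum) as its value; and the reduce-side GroupByDesc re-aggregates sum(VALUE._col0) with mode = MERGEPARTIAL — exactly the three operators shown in that plan.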

    (2) Map-side partial aggregation, data skew: two MR jobs.
    genGroupByPlanMapAggr2MR:
    (2.1) GroupByOperator: map-side partial aggregation, generated by genGroupByPlanMapGroupByOperator:
           Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
           Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to groupByKeys and outputColumnNames.
           Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.HASH
      outputColumnNames:groupby+Distinct+Aggregation
      keys:groupby+Distinct
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default

    (2.2)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby + distinct
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (2.3)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.PARTIALS
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default

    (2.4)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (2.5)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the aggregated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.FINAL
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default


    (3) No map-side partial aggregation, data skew: two MR jobs.
    genGroupByPlan2MR:

    (3.1)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby + distinct
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (3.2)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the generated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.PARTIAL1
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default

    (3.3)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (3.4)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the aggregated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.FINAL
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default


    (4) No map-side partial aggregation, no data skew: one MR job.
    genGroupByPlan1MR:
    (4.1)ReduceSinkOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to reduceKeys and outputKeyColumnNames.
          Handle DISTINCT in the select (getDistinctFuncExprsForClause): add the distinct columns to reduceKeys and outputKeyColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): add the columns to be aggregated to reduceValues and outputValueColumnNames.
    public ReduceSinkDesc(java.util.ArrayList<ExprNodeDesc> keyCols,
          int numDistributionKeys,
          java.util.ArrayList<ExprNodeDesc> valueCols,
          java.util.ArrayList<java.lang.String> outputKeyColumnNames,
          List<List<Integer>> distinctColumnIndices,
          java.util.ArrayList<java.lang.String> outputValueColumnNames, int tag,
          java.util.ArrayList<ExprNodeDesc> partitionCols, int numReducers,
          final TableDesc keySerializeInfo, final TableDesc valueSerializeInfo) {
        this.keyCols = keyCols; // reduceKeys: groupby + distinct
        this.numDistributionKeys = numDistributionKeys; // grpByExprs.size()
        this.valueCols = valueCols; // reduceValues: the aggregation functions
        this.outputKeyColumnNames = outputKeyColumnNames; //outputKeyColumnNames
        this.outputValueColumnNames = outputValueColumnNames; //outputValueColumnNames
        this.tag = tag; // -1
        this.numReducers = numReducers; // usually -1
        this.partitionCols = partitionCols; // groupby
        this.keySerializeInfo = keySerializeInfo;
        this.valueSerializeInfo = valueSerializeInfo;
        this.distinctColumnIndices = distinctColumnIndices;
      }

    (4.2)GroupByOperator
          Handle the group-by clause (getGroupByForClause): add the group-by columns to groupByKeys and outputColumnNames.
          Handle the aggregate functions (getAggregationExprsForClause): build an AggregationDesc for each and add it to aggregations; add the aggregated columns to outputColumnNames.
      public GroupByDesc(
          final Mode mode,
          final java.util.ArrayList<java.lang.String> outputColumnNames,
          final java.util.ArrayList<ExprNodeDesc> keys,
          final java.util.ArrayList<org.apache.hadoop.hive.ql.plan.AggregationDesc> aggregators,
          final boolean groupKeyNotReductionKey,float groupByMemoryUsage, float memoryThreshold) {
        this(mode, outputColumnNames, keys, aggregators, groupKeyNotReductionKey,
            false, groupByMemoryUsage, memoryThreshold);
      }
      mode:GroupByDesc.Mode.COMPLETE
      outputColumnNames:groupby+Aggregation
      keys:groupby
      aggregators:Aggregation
      groupKeyNotReductionKey:false
      groupByMemoryUsage: 0.5 by default
      memoryThreshold: 0.9 by default



    SemanticAnalyzer.genBodyPlan
      optimizeMultiGroupBy  (multi-group by with the same distinct)
      groupby10.q  groupby11.q
  • Original post: https://www.cnblogs.com/end/p/3077651.html