• Combiner


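    Use it selectively: it suits operations such as sum and max, but not avg.

    1. It is an optimization; the precondition is that it must not change the final result:

      a. It reduces the data sent over the network from the map side to the reduce side (network I/O).

      b. It reduces the output written by each map task (disk I/O).

    2. A combiner is in fact a Reducer; the combiner's output becomes the input of reduce.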
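    To illustrate why sum is safe to pre-aggregate while avg is not, here is a minimal sketch in plain Java with no Hadoop dependency. The class name CombinerSafetyDemo and the sample values are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
class CombinerSafetyDemo {

    // Reduce logic for "sum": also safe to reuse as a combiner.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Reduce logic for "avg": NOT safe to reuse as a combiner.
    static double avg(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Values for a single key, produced by two different map tasks.
        List<Long> map1 = Arrays.asList(1L, 2L, 3L);
        List<Long> map2 = Arrays.asList(10L);

        // sum: pre-aggregating per map task gives the same final result.
        long direct = sum(Arrays.asList(1L, 2L, 3L, 10L));
        long combined = sum(Arrays.asList(sum(map1), sum(map2)));
        System.out.println(direct + " == " + combined);      // 16 == 16

        // avg: averaging the per-task averages changes the result.
        double directAvg = avg(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        double combinedAvg = avg(Arrays.asList(
                avg(Arrays.asList(1.0, 2.0, 3.0)),
                avg(Arrays.asList(10.0))));
        System.out.println(directAvg + " != " + combinedAvg); // 4.0 != 6.0
    }
}
```

    An average can still benefit from a combiner if the combiner emits (sum, count) pairs and only the final reduce performs the division.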

    3. Question: what does the shuffle flow look like once a combiner is added?

      a. Shuffle flow without a combiner:

        InputFormat-->map function-->circular buffer-->partition-->sort (quickSort)-->spill-->merge-->sort (Collections.sort)-->fetch-->merge-->sort (Collections.sort)-->reduce function-->output

      b. Shuffle flow with a combiner:

        InputFormat-->map function-->circular buffer-->partition-->sort (quickSort)-->combiner-->spill-->merge-->sort (Collections.sort)-->combiner-->fetch-->merge-->sort (Collections.sort)-->combiner-->reduce function-->output
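        The first part of the flow above (partition, sort, optional combiner, spill) can be sketched as a small simulation in plain Java. Everything here is hypothetical: the class SpillSimulation and its methods are not Hadoop APIs, a TreeMap stands in for the quick sort on the buffer, and the "spill" is just an in-memory list of key=value lines for a word count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation; real Hadoop does this inside MapOutputBuffer.
class SpillSimulation {

    // Hash-style partitioning: key -> partition index.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Groups word-count records (word, 1) by partition, sorts them by key,
    // and optionally sums the values per key before the simulated spill.
    static List<List<String>> sortAndSpill(String[] words, int numPartitions,
                                           boolean useCombiner) {
        // One sorted map per partition (TreeMap replaces the quick sort here).
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String w : words) {
            partitions.get(partition(w, numPartitions))
                      .computeIfAbsent(w, k -> new ArrayList<>())
                      .add(1);
        }

        List<List<String>> spill = new ArrayList<>();
        for (TreeMap<String, List<Integer>> p : partitions) {
            List<String> out = new ArrayList<>();
            for (Map.Entry<String, List<Integer>> e : p.entrySet()) {
                if (useCombiner) {
                    // Combiner: one record per key, values summed.
                    int sum = 0;
                    for (int v : e.getValue()) {
                        sum += v;
                    }
                    out.add(e.getKey() + "=" + sum);
                } else {
                    // No combiner: every buffered record is written out.
                    for (int v : e.getValue()) {
                        out.add(e.getKey() + "=" + v);
                    }
                }
            }
            spill.add(out);
        }
        return spill;
    }

    // Total number of records across all partitions of a spill.
    static int countRecords(List<List<String>> spill) {
        int n = 0;
        for (List<String> part : spill) {
            n += part.size();
        }
        return n;
    }

    public static void main(String[] args) {
        String[] words = {"b", "a", "b", "a", "b"};
        System.out.println(sortAndSpill(words, 1, false)); // [[a=1, a=1, b=1, b=1, b=1]]
        System.out.println(sortAndSpill(words, 1, true));  // [[a=2, b=3]]
    }
}
```

        With the combiner enabled the spill holds 2 records instead of 5, which is the disk and network saving described in point 1.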

    4. Source code notes

    -----------------------------------Map-side combiner-------------------------------------------------------------------
    [MapTask.class$MapOutputBuffer.class] sortAndSpill()
    --> right after sorter.sort(MapOutputBuffer.this, mstart, mend, reporter):
    if (combinerRunner == null) {
      // spill directly
      DataInputBuffer key = new DataInputBuffer();
      ....
      }  // closes the elided block above
    } else {
      .....
      if (spstart != spindex) {
        // run the combiner over this partition's records before writing them
        combineCollector.setWriter(writer);
        RawKeyValueIterator kvIter =
            new MRResultIterator(spstart, spindex);
        combinerRunner.combine(kvIter, combineCollector);
      }
    }

    --> mergeParts()
    --> the combiner is invoked after Merger.merge():
    if (combinerRunner == null || numSpills < minSpillsForCombine) {
      Merger.writeFile(kvIter, writer, reporter, job);
    } else {
      combineCollector.setWriter(writer);
      combinerRunner.combine(kvIter, combineCollector);
    }
    Note: the combiner runs during this merge only when the number of spills is greater than or equal to 3; the threshold is controlled by the property mapreduce.map.combine.minspills
    ----------------------------------Map-side combiner----------------------------------------------------------------
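    The condition in mergeParts() can be restated as a small predicate. This is only a sketch: MergeCombinePolicy and runCombinerOnMerge are hypothetical names, not Hadoop APIs, and the constant assumes the threshold of 3 mentioned in the note above.

```java
// Hypothetical helper mirroring the mergeParts() condition; not Hadoop API.
class MergeCombinePolicy {

    // Assumed threshold, matching the value 3 cited for
    // mapreduce.map.combine.minspills in the note above.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // Negation of: combinerRunner == null || numSpills < minSpillsForCombine
    static boolean runCombinerOnMerge(boolean combinerConfigured,
                                      int numSpills,
                                      int minSpillsForCombine) {
        return combinerConfigured && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        int min = DEFAULT_MIN_SPILLS_FOR_COMBINE;
        System.out.println(runCombinerOnMerge(true, 2, min));  // false
        System.out.println(runCombinerOnMerge(true, 3, min));  // true
        System.out.println(runCombinerOnMerge(false, 5, min)); // false
    }
}
```

    With fewer spills than the threshold, the merged output is written directly with Merger.writeFile() and the combiner is skipped for that merge.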
    ----------------------------------Reduce-side combiner----------------------------------------------------------------
    --> [ReduceTask.class] run()
      --> ShuffleConsumerPlugin.Context shuffleContext = new ShuffleConsumerPlugin.Context(...)
      --> [ShuffleConsumerPlugin interface] loads the implementation class Shuffle.class
        --> [Shuffle.class] init()
          --> [Shuffle.class] createMergeManager(context)
            --> [Shuffle.class] new MergeManagerImpl()
              --> [MergeManagerImpl.class] merge(List<InMemoryMapOutput<K,V>> inputs)
                --> the combiner is invoked after Merger.merge():
    if (null == combinerClass) {
      Merger.writeFile(rIter, writer, reporter, jobConf);
    } else {
      combineCollector.setWriter(writer);
      combineAndSpill(rIter, reduceCombineInputCounter);
    }

    ----------------------------------Reduce-side combiner----------------------------------------------------------------

  • Original article: https://www.cnblogs.com/lyr999736/p/9381500.html