Hadoop MapReduce: The Combiner

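Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner's output then forms the input to the reduce function. Because the combiner is an optimization, Hadoop gives no guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.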
The contract for the combiner function constrains the type of function that may be used. An example illustrates this:

Suppose the first map produced the following output:
```
(1950, 0)
(1950, 20)
(1950, 10)
```
and the second map produced:
```
(1950, 25)
(1950, 15)
```
Without a combiner, the reduce function would be called with the input
```
(1950, [0, 20, 10, 25, 15])
```
and would output:
```
(1950, 25)
```
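If we use a combiner function that, like the reduce function, finds the maximum temperature in each map's output, the reduce function would instead be called with: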
```
(1950, [20, 25])
```
The reduce output is the same as before. The function calls on the temperature values can be expressed as follows:
```
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
```
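The identity above can be checked with a small simulation. This is only a sketch in plain Java, not Hadoop code: the class `CombinerMaxDemo` and its `shuffle` helper are hypothetical names invented for illustration, and the real framework decides on its own whether and when to run the combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java simulation; no Hadoop classes are involved.
class CombinerMaxDemo {

    // Plays the role of both the combiner and the reducer:
    // the maximum of a list of values for one key.
    static int max(List<Integer> values) {
        return Collections.max(values);
    }

    // Simulates the reduce input for one key. Each map's output is either
    // passed through untouched or collapsed to a single value by the combiner.
    static List<Integer> shuffle(List<List<Integer>> mapOutputs, boolean useCombiner) {
        List<Integer> reduceInput = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            if (useCombiner) {
                reduceInput.add(max(output));
            } else {
                reduceInput.addAll(output);
            }
        }
        return reduceInput;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(0, 20, 10),   // first map, key 1950
                Arrays.asList(25, 15));     // second map, key 1950

        List<Integer> without = shuffle(mapOutputs, false);
        List<Integer> with = shuffle(mapOutputs, true);

        System.out.println(without + " -> " + max(without)); // [0, 20, 10, 25, 15] -> 25
        System.out.println(with + " -> " + max(with));       // [20, 25] -> 25
    }
}
```

Both paths give 25, but with the combiner only two values reach the reducer instead of five.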
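Not all functions have this property. For example, if we were computing mean temperatures, we could not use the mean function as the combiner, because: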
```
mean(0, 20, 10, 25, 15) = 14
```
whereas:
```
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
```
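The same arithmetic can be reproduced in a few lines of plain Java. This is an illustrative sketch only: `CombinerMeanDemo` is a hypothetical class name, and nothing here uses the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of why mean cannot serve as a combiner.
class CombinerMeanDemo {

    // Arithmetic mean of a non-empty list.
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> first = Arrays.asList(0.0, 20.0, 10.0);  // first map output
        List<Double> second = Arrays.asList(25.0, 15.0);      // second map output

        // Reducer sees every value: the true mean.
        List<Double> all = new ArrayList<>(first);
        all.addAll(second);
        double direct = mean(all);

        // Reducer sees only per-map means: each map is weighted equally,
        // even though the maps emitted different numbers of records.
        double viaCombiner = mean(Arrays.asList(mean(first), mean(second)));

        System.out.println(direct);      // 14.0
        System.out.println(viaCombiner); // 15.0
    }
}
```

The two results differ because the mean of means ignores how many records each map contributed.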
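The combiner function does not replace the reducer: the reduce function is still needed to process records with the same key coming from different maps. It can, however, substantially cut the amount of data transferred between mappers and reducers.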

Specifying a combiner
```
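// Fragment from the driver's main method. Classes used here come from:
//   org.apache.hadoop.fs.Path
//   org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text
//   org.apache.hadoop.mapreduce.Job
//   org.apache.hadoop.mapreduce.lib.input.FileInputFormat
//   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
Job job = Job.getInstance();
job.setJarByClass(MaxTemperatureJob.class);
job.setJobName("max temperature");

// Why aren't these two method names consistent (add... vs. set...)?
// Were they written by different people?
FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
// Set the combiner. The reducer class is reused because max can be
// applied to partial results without changing the final answer.
job.setCombinerClass(MaxTemperatureReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// job.setInputFormatClass();

// Run the job and print 0 on success, 1 on failure.
System.out.println(job.waitForCompletion(true) ? 0 : 1);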
```
Original article: https://www.cnblogs.com/re-myself/p/5524494.html