MapReduce Study Notes: Combiner, Partitioner, and JobHistory


    1. Combiner

    In the MapReduce programming model there is an important component that sits between the Mapper and the Reducer; it is mainly used to relieve MR performance bottlenecks.


    The combiner is essentially an optimization: because network bandwidth is limited, the amount of data transferred between map and reduce should be reduced as much as possible. The combiner merges key-value pairs with the same key on the map side and aggregates them with the same logic as the reducer, so it can be viewed as a special Reducer that runs locally (a "local reduce"). Note that reusing the reduce logic as a combiner is only safe when that logic is commutative and associative (e.g., summing counts).
    The combiner only runs if the developer sets it in the program (a custom combiner is configured via job.setCombinerClass(myCombine.class)).

    In WordCount, the reducer (MyReducer) can be used directly as the combiner:

    // set the map-side combiner
        job.setCombinerClass(MyReducer.class);
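
    Because a combiner applies the same aggregation logic as the reducer, a combiner class is just another Reducer. Below is a minimal sketch of an equivalent standalone combiner for a WordCount-style job (the class name WordCountCombiner is illustrative; the snippet above simply reuses the existing MyReducer), assuming the mapper emits <Text word, LongWritable count> pairs:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Illustrative combiner: same aggregation as the reducer, but executed on the
     * map side so that only one record per key is shipped across the network.
     */
    public class WordCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            // sum the partial counts for this key locally
            for (LongWritable value : values) {
                sum += value.get();
            }
            // emit one aggregated record per key
            context.write(key, new LongWritable(sum));
        }
    }

    // It would then be wired into the job with: job.setCombinerClass(WordCountCombiner.class);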

    Reference: https://www.tuicool.com/articles/qAzUjav

    2. Partitioner

      The Partitioner is another important MR component. Its main functions are as follows:

        1) The Partitioner decides which ReduceTask processes the data output by each MapTask

        2) Default implementation: the key's hash value modulo the number of ReduceTasks

        which reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which yields the target reducer for the current record, as sketched below.
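
        This is essentially what Hadoop's built-in HashPartitioner does. A minimal sketch of the same logic (the class name HashLikePartitioner and the Text/LongWritable types are illustrative assumptions):

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        public class HashLikePartitioner extends Partitioner<Text, LongWritable> {

            @Override
            public int getPartition(Text key, LongWritable value, int numReduceTasks) {
                // mask off the sign bit so the result is non-negative,
                // then take the hash modulo the number of reduce tasks
                return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }
        }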

        Example:

    File contents:
                xiaomi 200
                huawei 500
                xiaomi 300
                huawei 700
                iphonex 100
                iphonex 30
                iphone7 60
    Partition the records above by phone brand and dispatch them to four reducers for aggregation:
     
    package rdb.com.hadoop01.mapreduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     
    /**
     * 
     * @author rdb
     *
     */
    public class PartitionerApp {
     
        /**
         * Mapper: reads the input file line by line
         * @author rdb
         *
         */
        public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
             
     
            @Override
            protected void map(LongWritable key, Text value,
                    Mapper<LongWritable, Text, Text, LongWritable>.Context context)
                    throws IOException, InterruptedException {
                // receive one line of data
                String line = value.toString();
                // split the line on spaces
                String[] words = line.split(" ");
                // emit the map result via the context
                context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
            }
        }
         
        /**
         * Reducer: merges and sums the values for each key
         * @author rdb
         *
         */
        public static class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable>{
             
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                    Reducer<Text, LongWritable, Text, LongWritable>.Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable value : values){
                    // accumulate the sum of values for this key
                    sum += value.get();
                }
                // emit the reduce result via the context
                context.write(key, new LongWritable(sum));
            }
        }
         
        /**
         * Custom Partitioner: routes each phone brand to its own reducer
         * @author rdb
         *
         */
        public static class MyPartitioner extends Partitioner<Text, LongWritable>{
     
            @Override
            public int getPartition(Text key, LongWritable value, int numPartitions) {
                if(key.toString().equals("xiaomi")){
                    return 0;
                }
                if(key.toString().equals("huawei")){
                    return 1;
                }
                if(key.toString().equals("iphonex")){
                    return 2;
                }
                return 3;
            }
             
        }
         
        /**
         * Custom driver: encapsulates all the information for the MapReduce job
         *@param args
         * @throws IOException 
         */
        public static void main(String[] args) throws Exception {
             
            // create the configuration
            Configuration configuration = new Configuration();
             
            // clean up the output directory if it already exists
            Path out = new Path(args[1]);
            FileSystem fileSystem = FileSystem.get(configuration);
            if(fileSystem.exists(out)){
                fileSystem.delete(out, true);
                System.out.println("output exists, but it has been deleted");
            }
             
            // create the job
            Job job = Job.getInstance(configuration,"WordCount");
             
            // set the job's processing class
            job.setJarByClass(PartitionerApp.class);
             
            // set the job's input path
            FileInputFormat.setInputPaths(job, new Path(args[0]));
             
            // set the map-related parameters
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
             
            // set the reduce-related parameters
            job.setReducerClass(MyReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
             
            // set the combiner class; its logic is the same as the reducer's
            //job.setCombinerClass(MyReduce.class);
             
            // set the job's partitioner
            job.setPartitionerClass(MyPartitioner.class);
            // use 4 reducers, one per partition
            job.setNumReduceTasks(4);
             
            // set the job's output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
             
            System.exit(job.waitForCompletion(true)? 0 : 1) ;
        }
    }
     
    After packaging, run: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.PartitionerApp 
    hdfs://hadoop01:8020/partitioner.txt  hdfs://hadoop01:8020/output/partitioner
     
    Result: -rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00000
          -rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00001
          -rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00002
          -rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00003
           
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00000
    18/05/09 06:36:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    xiaomi  500
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00001
    18/05/09 06:36:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    huawei  1200
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00002
    18/05/09 06:36:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    iphonex 130
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00003
    18/05/09 06:36:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    iphone7 60
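
    As a quick check against the input data: xiaomi 200 + 300 = 500, huawei 500 + 700 = 1200, iphonex 100 + 30 = 130, and iphone7 = 60, which matches the four per-partition outputs above.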

    3. JobHistory

     JobHistory records the run logs of MapReduce jobs that have finished; the log files are stored in an HDFS directory. The feature is disabled by default and must be configured explicitly.

    1) Configure hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop01:10020</value>
        <description>IPC address (host:port) of the MR JobHistory Server</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop01:19888</value>
        <description>Web UI address for viewing the records of completed MapReduce jobs on the history server; the JobHistory service must be started for it to be accessible</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
        <description>Location where the logs managed by the MR JobHistory Server are stored; default: /mr-history/done</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
        <description>Location where logs produced by running MapReduce jobs are staged before being moved to the done directory; default: /mr-history/tmp</description>
    </property>

    2) After configuring, restart YARN, then start the JobHistory service: hadoop-2.6.0-cdh5.7.0/sbin/mr-jobhistory-daemon.sh start historyserver

    [hadoop@hadoop01 sbin]$ jps
    24321 JobHistoryServer
    24353 Jps
    23957 NodeManager
    7880 DataNode
    8060 SecondaryNameNode
    23854 ResourceManager
    7791 NameNode
    [hadoop@hadoop01 sbin]$

    3) Access it in a browser: http://192.168.44.183:19888/

         Run a MapReduce job in the background: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.WordCountApp hdfs://hadoop01:8020/hello.txt  hdfs://hadoop01:8020/output/wc

    Refresh the browser and the log of the job that was just run appears:

    [Screenshot: JobHistory web UI listing the completed job]

    Click the logs link for the corresponding MR job on the page to see the detailed logs.

