MapReduce Study Notes: Combiner, Partitioner, and JobHistory


    1. Combiner

    In the MapReduce programming model there is an important component that sits between the Mapper and the Reducer; it is mainly used to relieve MR performance bottlenecks.


    The combiner is essentially an optimization: because network bandwidth is limited, the amount of data transferred between map and reduce should be reduced as much as possible. The combiner merges key-value pairs with the same key on the map side and aggregates them with the same logic as the reducer, so it can be viewed as a special Reducer that runs locally (a "local reduce"). Note that reusing the reduce logic as a combiner is only safe when that logic is commutative and associative (e.g., summing counts).
    The combiner only runs if the developer sets it in the program (a custom combiner is configured via job.setCombinerClass(myCombine.class)).

    In WordCount, the reducer (MyReducer) can be used directly as the combiner:

    // set the map-side combiner
        job.setCombinerClass(MyReducer.class);
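
    Because a combiner applies the same aggregation logic as the reducer, a combiner class is just another Reducer. Below is a minimal sketch of an equivalent standalone combiner for a WordCount-style job (the class name WordCountCombiner is illustrative; the snippet above simply reuses the existing MyReducer), assuming the mapper emits <Text word, LongWritable count> pairs:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Illustrative combiner: same aggregation as the reducer, but executed on the
     * map side so that only one record per key is shipped across the network.
     */
    public class WordCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            // sum the partial counts for this key locally
            for (LongWritable value : values) {
                sum += value.get();
            }
            // emit one aggregated record per key
            context.write(key, new LongWritable(sum));
        }
    }

    // It would then be wired into the job with: job.setCombinerClass(WordCountCombiner.class);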

    Reference: https://www.tuicool.com/articles/qAzUjav

    2. Partitioner

      The Partitioner is another important MR component. Its main functions are as follows:

        1) The Partitioner decides which ReduceTask processes the data output by each MapTask

        2) Default implementation: the key's hash value modulo the number of ReduceTasks

        which reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which yields the target reducer for the current record, as sketched below.
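
        This is essentially what Hadoop's built-in HashPartitioner does. A minimal sketch of the same logic (the class name HashLikePartitioner and the Text/LongWritable types are illustrative assumptions):

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        public class HashLikePartitioner extends Partitioner<Text, LongWritable> {

            @Override
            public int getPartition(Text key, LongWritable value, int numReduceTasks) {
                // mask off the sign bit so the result is non-negative,
                // then take the hash modulo the number of reduce tasks
                return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }
        }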

        Example:

    File contents:
                xiaomi 200
                huawei 500
                xiaomi 300
                huawei 700
                iphonex 100
                iphonex 30
                iphone7 60
    Partition the records above by phone brand and dispatch them to four reducers for aggregation:
     
    package rdb.com.hadoop01.mapreduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     
    /**
     * 
     * @author rdb
     *
     */
    public class PartitionerApp {
     
        /**
         * Mapper: reads the input file line by line
         * @author rdb
         *
         */
        public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
             
     
            @Override
            protected void map(LongWritable key, Text value,
                    Mapper<LongWritable, Text, Text, LongWritable>.Context context)
                    throws IOException, InterruptedException {
                // receive one line of data
                String line = value.toString();
                // split the line on spaces
                String[] words = line.split(" ");
                // emit the map result via the context
                context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
            }
        }
         
        /**
         * Reducer: merges and sums the values for each key
         * @author rdb
         *
         */
        public static class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable>{
             
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                    Reducer<Text, LongWritable, Text, LongWritable>.Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable value : values){
                    // accumulate the sum of values for this key
                    sum += value.get();
                }
                // emit the reduce result via the context
                context.write(key, new LongWritable(sum));
            }
        }
         
        /**
         * Custom Partitioner: routes each phone brand to its own reducer
         * @author rdb
         *
         */
        public static class MyPartitioner extends Partitioner<Text, LongWritable>{
     
            @Override
            public int getPartition(Text key, LongWritable value, int numPartitions) {
                if(key.toString().equals("xiaomi")){
                    return 0;
                }
                if(key.toString().equals("huawei")){
                    return 1;
                }
                if(key.toString().equals("iphonex")){
                    return 2;
                }
                return 3;
            }
             
        }
         
        /**
         * Custom driver: encapsulates all the information for the MapReduce job
         *@param args
         * @throws IOException 
         */
        public static void main(String[] args) throws Exception {
             
            // create the configuration
            Configuration configuration = new Configuration();
             
            // clean up the output directory if it already exists
            Path out = new Path(args[1]);
            FileSystem fileSystem = FileSystem.get(configuration);
            if(fileSystem.exists(out)){
                fileSystem.delete(out, true);
                System.out.println("output exists, but it has been deleted");
            }
             
            // create the job
            Job job = Job.getInstance(configuration,"WordCount");
             
            // set the job's processing class
            job.setJarByClass(PartitionerApp.class);
             
            // set the job's input path
            FileInputFormat.setInputPaths(job, new Path(args[0]));
             
            // set the map-related parameters
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
             
            // set the reduce-related parameters
            job.setReducerClass(MyReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
             
            // set the combiner class; its logic is the same as the reducer's
            //job.setCombinerClass(MyReduce.class);
             
            // set the job's partitioner
            job.setPartitionerClass(MyPartitioner.class);
            // use 4 reducers, one per partition
            job.setNumReduceTasks(4);
             
            // set the job's output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
             
            System.exit(job.waitForCompletion(true)? 0 : 1) ;
        }
    }
     
    After packaging, run: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.PartitionerApp 
    hdfs://hadoop01:8020/partitioner.txt  hdfs://hadoop01:8020/output/partitioner
     
    Result: -rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00000
          -rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00001
          -rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00002
          -rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00003
           
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00000
    18/05/09 06:36:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    xiaomi  500
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00001
    18/05/09 06:36:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    huawei  1200
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00002
    18/05/09 06:36:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    iphonex 130
    [hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00003
    18/05/09 06:36:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    iphone7 60
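
    As a quick check against the input data: xiaomi 200 + 300 = 500, huawei 500 + 700 = 1200, iphonex 100 + 30 = 130, and iphone7 = 60, which matches the four per-partition outputs above.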

    3. JobHistory

     JobHistory records the run logs of MapReduce jobs that have finished; the log files are stored in an HDFS directory. The feature is disabled by default and must be configured explicitly.

    1) Configure hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop01:10020</value>
        <description>IPC address (host:port) of the MR JobHistory Server</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop01:19888</value>
        <description>Web UI address for viewing the records of completed MapReduce jobs on the history server; the JobHistory service must be started for it to be accessible</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
        <description>Location where the logs managed by the MR JobHistory Server are stored; default: /mr-history/done</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
        <description>Location where logs produced by running MapReduce jobs are staged before being moved to the done directory; default: /mr-history/tmp</description>
    </property>

    2) After configuring, restart YARN, then start the JobHistory service: hadoop-2.6.0-cdh5.7.0/sbin/mr-jobhistory-daemon.sh start historyserver

    [hadoop@hadoop01 sbin]$ jps
    24321 JobHistoryServer
    24353 Jps
    23957 NodeManager
    7880 DataNode
    8060 SecondaryNameNode
    23854 ResourceManager
    7791 NameNode
    [hadoop@hadoop01 sbin]$

    3) Access it in a browser: http://192.168.44.183:19888/

         Run a MapReduce job in the background: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.WordCountApp hdfs://hadoop01:8020/hello.txt  hdfs://hadoop01:8020/output/wc

    Refresh the browser and the log of the job that was just run appears:

    [Screenshot: JobHistory web UI listing the completed job]

    Click the logs link for the corresponding MR job on the page to see the detailed logs.

