• MapReduce序列化及分区的java代码示例


    概述

      序列化(Serialization)是指把结构化对象转化为字节流

      反序列化(Deserialization)是序列化的逆过程。把字节流转为结构化对象。

      当要在进程间传递对象或持久化对象的时候,就需要序列化对象成字节流,反之当要将接收到或从磁盘读取的字节流转换为对象,就要进行反序列化。

      Java 的序列化(Serializable)是一个重量级序列化框架,一个对象被序列化后,会附带很多额外的信息(各种校验信息,header,继承体系…),不便于在网络中高效传输;所以,hadoop 自己开发了一套序列化机制( Writable),精简,高效。不用像 java 对象类一样传输多层的父子关系,需要哪个属性就传输哪个属性值,大大的减少网络传输的开销。

      Writable是Hadoop的序列化格式,hadoop定义了这样一个Writable接口。一个类要支持可序列化只需实现这个接口即可。

    public interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

      如需要将自定义的 bean 放在 key 中传输,则还需要实现 comparable 接口,因为 mapreduce 框中的 shuffle 过程一定会对 key 进行排序,此时,自定义的bean 实现的接口应该是:WritableComparable

    代码示例

      1 . 需求

        统计每一个用户(手机号)所耗费的总上行流量、下行流量,总流量结果的基础之上再加一个需求:将统计结果按照总流量倒序排序。

      准备数据

    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157995052     13826544101    5C-0E-8B-C7-F1-E0:CMCC    120.197.40.4            4    0    264    0    200
    1363157991076     13926435656    20-10-7A-28-CC-0A:CMCC    120.196.100.99            2    4    132    1512    200
    1363154400022     13926251106    5C-0E-8B-8B-B1-50:CMCC    120.197.40.4            4    0    240    0    200
    1363157993044     18211575961    94-71-AC-CD-E6-18:CMCC-EASY    120.196.100.99    iface.qiyi.com    视频网站    15    12    1527    2106    200
    1363157995074     84138413    5C-0E-8B-8C-E8-20:7DaysInn    120.197.40.4    122.72.52.12        20    16    4116    1432    200
    1363157993055     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            18    15    1116    954    200
    1363157995033     15920133257    5C-0E-8B-C7-BA-20:CMCC    120.197.40.4    sug.so.360.cn    信息安全    20    20    3156    2936    200
    1363157983019     13719199419    68-A1-B7-03-07-B1:CMCC-EASY    120.196.100.82            4    0    240    0    200
    1363157984041     13660577991    5C-0E-8B-92-5C-20:CMCC-EASY    120.197.40.4    s19.cnzz.com    站点统计    24    9    6960    690    200
    1363157973098     15013685858    5C-0E-8B-C7-F7-90:CMCC    120.197.40.4    rank.ie.sogou.com    搜索引擎    28    27    3659    3538    200
    1363157986029     15989002119    E8-99-C4-4E-93-E0:CMCC-EASY    120.196.100.99    www.umeng.com    站点统计    3    3    1938    180    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
    1363157984040     13602846565    5C-0E-8B-8B-B6-00:CMCC    120.197.40.4    2052.flash2-http.qq.com    综合门户    15    12    1938    2910    200
    1363157995093     13922314466    00-FD-07-A2-EC-BA:CMCC    120.196.100.82    img.qfc.cn        12    12    3008    3720    200
    1363157982040     13502468823    5C-0A-5B-6A-0B-D4:CMCC-EASY    120.196.100.99    y0.ifengimg.com    综合门户    57    102    7335    110349    200
    1363157986072     18320173382    84-25-DB-4F-10-1A:CMCC-EASY    120.196.100.99    input.shouji.sogou.com    搜索引擎    21    18    9531    2412    200
    1363157990043     13925057413    00-1F-64-E1-E6-9A:CMCC    120.196.100.55    t3.baidu.com    搜索引擎    69    63    11058    48243    200
    1363157988072     13760778710    00-FD-07-A4-7B-08:CMCC    120.196.100.82            2    2    120    120    200
    1363157985066     13726238888    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
    1363157993055     13560436666    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            18    15    1116    954    200

      2 . 分析

        实现自定义的bean 来封装流量信息,并将bean 作为 map 输出的 key 来传输

        MR 程序在处理数据的过程中会对数据排序(map 输出的 kv 对传输到 reduce之前,会排序),排序的依据是 map 输出的 key。所以,我们如果要实现自己需要的排序规则,则可以考虑将排序因素放到 key 中,让 key 实现接口:WritableComparable,然后重写 key 的 compareTo 方法。

      3 . 未排序的实现

        自定义JavaBean

    public class FlowBean implements WritableComparable<FlowBean>{
        private long upFlow;
        private long downFlow;
        private long sumFlow;
    
        public FlowBean() {
        }
        public FlowBean(long upFlow, long downFlow, long sumFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
            this.sumFlow = sumFlow;
        }
        public FlowBean(long upFlow, long downFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
            this.sumFlow = upFlow + downFlow;
        }
        public void set(long upFlow, long downFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
            this.sumFlow = upFlow + downFlow;
        }
    
        public long getUpFlow() {
            return upFlow;
        }
        public void setUpFlow(long upFlow) {
            this.upFlow = upFlow;
        }
        public long getDownFlow() {
            return downFlow;
        }
        public void setDownFlow(long downFlow) {
            this.downFlow = downFlow;
        }
        public long getSumFlow() {
            return sumFlow;
        }
        public void setSumFlow(long sumFlow) {
            this.sumFlow = sumFlow;
        }
    
        @Override
        public String toString() {
            return upFlow + "	" + downFlow + "	" + sumFlow;
        }
    
        /**
         * 序列化方法
         * @param out
         * @throws IOException
         */
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(upFlow);
            out.writeLong(downFlow);
            out.writeLong(sumFlow);
        }
    
        /**
         * 反序列化方法
         * 先序列化的先反序列化
         * @param in
         * @throws IOException
         */
        @Override
        public void readFields(DataInput in) throws IOException {
            this.upFlow = in.readLong();
            this.downFlow = in.readLong();
            this.sumFlow = in.readLong();
        }
    
        /**
         * 指定对象排序的方法
         *  如果指定的数与参数相等返回 0。
         *  如果指定的数小于参数返回 -1。
         *  如果指定的数大于参数返回 1。
         */
        @Override
        public int compareTo(FlowBean o) {
            return this.getSumFlow() > o.getSumFlow() ? -1 : 1 ;//按照指定的总流量的倒序排序
    //        return this.getSumFlow() > o.getSumFlow() ? 1 : -1 ;//按照指定的总流量的正序排序
        }
    }

        Mapper方法

    public class FlowSumMapper extends Mapper<LongWritable,Text,Text,FlowBean> {
        Text k = new Text();
        FlowBean v = new FlowBean();
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = line.split("	");
    
            String phoNum = fields[1];//提前目标文件中的手机号
            long upFlow = Long.parseLong(fields[fields.length-3]);//提取目标文件中的上行流量
            long downFlow = Long.parseLong(fields[fields.length-2]);//提取目标文件中的下行流量
    
            k.set(phoNum);
            v.set(upFlow,downFlow);
            context.write(k,v);
        }
    }

        Reducer方法

    public class FlowSumReducer extends Reducer<Text,FlowBean,Text,FlowBean> {
        FlowBean v = new FlowBean();
    
        @Override
        protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
            long sumUpFlow = 0;
            long sumDownFlowd = 0;
            for (FlowBean value : values) {
                sumUpFlow += value.getUpFlow();//获取每条记录的上行流量并计算总和
                sumDownFlowd += value.getDownFlow();//获取每条记录的下行流量并计算总和
            }
            v.set(sumUpFlow ,sumDownFlowd);
            context.write(key,v);
        }
    }

        主方法

    public class FlowSumRunner {
        public static void main(String[] args) throws Exception{
            Configuration conf = new Configuration();
            //指定mr程序使用本地模式模拟一套环境执行mr程序,一般用于本地代码测试
            conf.set("mapreduce.framework.name","local");
            //通过job方法获得mr程序运行的实例
            Job job = Job.getInstance(conf);
    
            //指定本次mr程序的运行主类
            job.setJarByClass(FlowSumRunner.class);
            //指定本次mr程序使用的mapper reduce
            job.setMapperClass(FlowSumMapper.class);
            job.setReducerClass(FlowSumReducer.class);
            //指定本次mr程序map输出的数据类型
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(FlowBean.class);
            //指定本次mr程序reduce输出的数据类型,也就是说最终的输出类型
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(FlowBean.class);
            //指定本次mr程序待处理数据目录   输出结果存放目录
            FileInputFormat.addInputPath(job,new Path("D:\flowsum\input"));
            FileOutputFormat.setOutputPath(job,new Path("D:\flowsum\output"));
    
            //提交本次mr程序
            boolean b = job.waitForCompletion(true);
            System.exit(b ? 0 : 1);//程序执行成功,退出状态码为0,退出程序,否则为1
        }
    }

        

       3 . 排序的实现

          使用上面的输出作为该需求的输入

        Mapper方法

    public class FlowSumSortMapper extends Mapper<LongWritable,Text,FlowBean,Text> {
        FlowBean k = new FlowBean();
        Text v = new Text();
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] fileds = line.split("	");
    
            String phoNum = fileds[0];
            long sumUpFlow = Long.parseLong(fileds[1]);
            long sumDownFlow = Long.parseLong(fileds[2]);
    
            v.set(phoNum);
            k.set(sumUpFlow,sumDownFlow);
            context.write(k,v);
        }
    }

        Reducer方法

    public class FlowSumSortReducer extends Reducer<FlowBean,Text,Text,FlowBean> {
        @Override
        protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Text phoNum = values.iterator().next();//iterator中只有一个值
            context.write(phoNum,key);
        }
    }

        主方法

     1 //得出上题结果的基础之上再加一个需求:将统计结果按照总流量倒序排序
     2 public class FlowSumSortDriver {
     3     public static void main(String[] args) throws Exception{
     4         Configuration conf = new Configuration();
     5         //指定mr程序使用本地模式模拟一套环境执行mr程序,一般用于本地代码测试
     6         conf.set("mapreduce.framework.name","local");
     7 
     8         //通过job方法获得mr程序运行的实例
     9         Job job = Job.getInstance(conf);
    10 
    11         //指定本次mr程序的运行主类
    12         job.setJarByClass(FlowSumSortDriver.class);
    13         //指定本次mr程序使用的mapper reduce
    14         job.setMapperClass(FlowSumSortMapper.class);
    15         job.setReducerClass(FlowSumSortReducer.class);
    16         //指定本次mr程序map输出的数据类型
    17         job.setMapOutputKeyClass(FlowBean.class);
    18         job.setMapOutputValueClass(Text.class);
    19         //指定本次mr程序reduce输出的数据类型,也就是说最终的输出类型
    20         job.setOutputKeyClass(Text.class);
    21         job.setOutputValueClass(FlowBean.class);
    22         //指定本次mr程序待处理数据目录   输出结果存放目录
    23         FileInputFormat.addInputPath(job,new Path("D:\flowsum\output"));
    24         FileOutputFormat.setOutputPath(job,new Path("D:\flowsum\outsortput"));
    25 
    26         //提交本次mr程序
    27         boolean b = job.waitForCompletion(true);
    28         System.exit(b ? 0 : 1);//程序执行成功,退出状态码为0,退出程序,否则为1
    29     }
    30 }

    Mapreduce的分区—Partitioner

    1 .  需求

        将流量汇总统计结果按照手机归属地不同省份输出到不同文件中。

    2 .  分析

        Mapreduce 中会将 map 输出的 kv 对,按照相同 key 分组,然后分发给不同的 reducetask。

        默认的分发规则为:根据 key 的 hashcode%reducetask 数来分发

        所以:如果要按照我们自己的需求进行分组,则需要改写数据分发(分组)组件 Partitioner,自定义一个 CustomPartitioner 继承抽象类:Partitioner,然后在job 对象中,设置自定义 partitioner: job.setPartitionerClass(CustomPartitioner.class)

    3 .  实现

        自定义partitioner类

    public class ProvincePartitioner extends Partitioner<Text, FlowBean> {   
        public static HashMap<String, Integer>  provinceMap = new HashMap<String, Integer>();   
        static{
            provinceMap.put("134", 0);
            provinceMap.put("135", 1);
            provinceMap.put("136", 2);
            provinceMap.put("137", 3);
            provinceMap.put("138", 4);
        }
    
        @Override
        public int getPartition(Text key, FlowBean value, int numPartitions) {
            Integer code = provinceMap.get(key.toString().substring(0, 3));       
            if (code != null) {
                return code;
            }      
            return 5;
        }
    }

        Mapper、Reducer及主方法

     1 public class FlowSumProvince {   
     2  public static class FlowSumProvinceMapper extends Mapper<LongWritable, Text, Text, FlowBean>{        
     3      Text k = new Text();
     4      FlowBean  v = new FlowBean();
     5      
     6      @Override
     7      protected void map(LongWritable key, Text value,Context context)
     8              throws IOException, InterruptedException {
     9              //拿取一行文本转为String
    10              String line = value.toString();
    11              //按照分隔符	进行分割
    12              String[] fileds = line.split("	");
    13              //获取用户手机号
    14              String phoneNum = fileds[1];
    15              
    16              long upFlow = Long.parseLong(fileds[fileds.length-3]);
    17              long downFlow = Long.parseLong(fileds[fileds.length-2]);
    18              
    19              k.set(phoneNum);
    20              v.set(upFlow, downFlow);            
    21              context.write(k,v);                
    22         }        
    23     }
    24        
    25     public static class FlowSumProvinceReducer extends Reducer<Text, FlowBean, Text, FlowBean>{        
    26         FlowBean  v  = new FlowBean();         
    27         @Override
    28         protected void reduce(Text key, Iterable<FlowBean> flowBeans,Context context) throws IOException, InterruptedException {            
    29             long upFlowCount = 0;
    30             long downFlowCount = 0;
    31             
    32             for (FlowBean flowBean : flowBeans) {                
    33                 upFlowCount += flowBean.getUpFlow();                
    34                 downFlowCount += flowBean.getDownFlow();                
    35             }
    36             v.set(upFlowCount, downFlowCount);            
    37             context.write(key, v);
    38     }
    39        
    40     public static void main(String[] args) throws Exception{       
    41         Configuration conf = new Configuration();
    42         Job job = Job.getInstance(conf);
    43 
    44         //指定我这个 job 所在的 jar包位置
    45         job.setJarByClass(FlowSumProvince.class);       
    46         //指定我们使用的Mapper是那个类  reducer是哪个类
    47         job.setMapperClass(FlowSumProvinceMapper.class);
    48         job.setReducerClass(FlowSumProvinceReducer.class);        
    49         // 设置我们的业务逻辑 Mapper 类的输出 key 和 value 的数据类型
    50         job.setMapOutputKeyClass(Text.class);
    51         job.setMapOutputValueClass(FlowBean.class);        
    52         // 设置我们的业务逻辑 Reducer 类的输出 key 和 value 的数据类型
    53         job.setOutputKeyClass(Text.class);
    54         job.setOutputValueClass(FlowBean.class);
    55                 
    56         //这里设置运行reduceTask的个数
    57         job.setNumReduceTasks(6);
    58                
    59         //这里指定使用我们自定义的分区组件
    60         job.setPartitionerClass(ProvincePartitioner.class);
    61                
    62         FileInputFormat.setInputPaths(job, new Path("D:\flowsum\input"));
    63         // 指定处理完成之后的结果所保存的位置
    64         FileOutputFormat.setOutputPath(job, new Path("D:\flowsum\outputProvince"));        
    65         boolean res = job.waitForCompletion(true);
    66         System.exit(res ? 0 : 1);       
    67     }
    68  }
    69 }
  • 相关阅读:
    从滴滴flinkCEP说起
    从零讲Java,给你一条清晰地学习道路!该学什么就学什么!
    从coalesce算子发散开的
    从架构理解价值-我的软件世界观
    为什么说Redis是单线程的?
    全排列_获取第几个排列_获取是第几个排列__康托展开,逆康托展开
    求排列是全排列的第几个,求第几个是全排列___康托展开,逆康托展开
    搜索入门_简单搜索bfs dfs大杂烩
    无根树同构_hash
    乘法逆元_三种方法
  • 原文地址:https://www.cnblogs.com/jifengblog/p/9277425.html
Copyright © 2020-2023  润新知