• MapReduce Programming: Word Deduplication


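    Implementing word deduplication relies on the NullWritable type.

    NullWritable:

    NullWritable is a special Writable type. Its serialization has zero length, so no bytes are written to or read from the stream, which makes it usable as a placeholder. In MapReduce, for example, a key or a value can be declared as NullWritable when that position is not needed, which efficiently stores a constant empty value.

    NullWritable is an immutable singleton; obtain the instance by calling NullWritable.get().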
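To make "zero-length serialization" concrete without a Hadoop installation, here is a minimal plain-Java sketch. `EmptyPlaceholder` and `PlaceholderDemo` are hypothetical stand-ins invented for illustration; they are not Hadoop classes and only mimic the behavior described above (a singleton whose write method emits no bytes).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class PlaceholderDemo {

    // Hypothetical stand-in for a zero-length placeholder value (not Hadoop's NullWritable).
    static final class EmptyPlaceholder {
        private static final EmptyPlaceholder INSTANCE = new EmptyPlaceholder();

        private EmptyPlaceholder() {
        }

        // Always returns the same shared instance.
        static EmptyPlaceholder get() {
            return INSTANCE;
        }

        // Serialization is zero-length: nothing is written to the stream.
        void write(DataOutputStream out) throws IOException {
        }
    }

    // Serializes a key followed by the placeholder value and returns the raw bytes.
    static byte[] serialize(String key) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(key);                  // 2-byte length prefix + encoded characters
            EmptyPlaceholder.get().write(out);  // adds no bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // "hadoop" takes 2 + 6 bytes; the placeholder contributes nothing.
        System.out.println(serialize("hadoop").length); // prints 8
    }
}
```

The record occupies only the bytes of the key, which is why a placeholder of this kind costs nothing when a key/value position is unused.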

    The final output of word deduplication has the form <word>, so the value can be declared as NullWritable.

    The code is as follows:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DistinctWord {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // To read the paths from the command line instead, use the line below
            // (it also needs an import of org.apache.hadoop.util.GenericOptionsParser).
            //String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
            String[] otherArgs = new String[]{"input", "output"};  // hard-coded input and output paths
            if (otherArgs.length < 2) {
                System.err.println("Usage: distinctword <in> [<in>...] <out>");
                System.exit(2);
            }

            Job job = Job.getInstance(conf, "distinct word");

            job.setJarByClass(DistinctWord.class);  // locate the jar that contains this class

            // Set the Mapper, Combiner and Reducer classes.
            // The reducer can double as the combiner because its input and output types match.
            job.setMapperClass(DistinctWordMapper.class);
            job.setCombinerClass(DistinctWordReducer.class);
            job.setReducerClass(DistinctWordReducer.class);

            // Output types of the map tasks
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);

            // Output types of the reduce tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // Every argument except the last is an input path
            for (int i = 0; i < otherArgs.length - 1; ++i) {
                FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
            }

            // The last argument is the output path
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

            // Submit the job and wait for it to finish
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }

        // The output value type is declared as NullWritable
        public static class DistinctWordMapper extends Mapper<Object, Text, Text, NullWritable> {
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());  // split the line on whitespace

                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, NullWritable.get());
                }
            }
        }

        public static class DistinctWordReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

            // Each call to reduce receives one group of identical words,
            // so writing the key once is enough.
            @Override
            public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
                context.write(key, NullWritable.get());
            }
        }
    }
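The job removes duplicates because the shuffle groups identical keys before reduce is called, so each distinct word reaches the reducer exactly once. The following plain-Java sketch imitates that flow locally, without Hadoop. The class `LocalDistinctSketch` and its method names are hypothetical, and a `TreeMap` is assumed here as a rough stand-in for the sort-and-group step of the shuffle; it is an illustration of the idea, not how Hadoop implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class LocalDistinctSketch {

    // "Map" phase: emit every whitespace-separated token as a key (values are empty).
    static List<String> mapPhase(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                keys.add(itr.nextToken());
            }
        }
        return keys;
    }

    // "Shuffle" stand-in: group identical keys in sorted order.
    // The count per key represents how many empty values arrive in that group.
    static Map<String, Integer> shuffle(List<String> keys) {
        Map<String, Integer> groups = new TreeMap<>();
        for (String key : keys) {
            groups.merge(key, 1, Integer::sum);
        }
        return groups;
    }

    // "Reduce" phase: write each group's key once and ignore its values.
    static List<String> reducePhase(Map<String, Integer> groups) {
        return new ArrayList<>(groups.keySet());
    }

    static List<String> distinctWords(List<String> lines) {
        return reducePhase(shuffle(mapPhase(lines)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(distinctWords(lines)); // prints [hadoop, hello, world]
    }
}
```

Because the reduce step never looks at the values, nothing useful would be carried in them, which is the reason the real job declares the value type as NullWritable.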
  • Original article: https://www.cnblogs.com/zyb993963526/p/10246987.html