MapReduce Programming: Word Deduplication

Implementing word deduplication makes use of the NullWritable type.

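NullWritable:

NullWritable is a special Writable type. Because its serialization has zero length, no bytes are written to or read from the stream, so it can serve as a placeholder. In MapReduce, for example, a key or value can be declared as NullWritable when that position is not needed, which efficiently stores a constant empty value.

The instance is obtained by calling NullWritable.get().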
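To make the "zero-length serialization" point concrete, here is a minimal sketch in plain Java. EmptyValue is a hypothetical stand-in written for illustration, not Hadoop's NullWritable class. It only mirrors the two properties described above: a single shared instance returned by get(), and a write method that emits no bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a zero-length placeholder value (assumption:
// this is NOT Hadoop's NullWritable, only an illustration of the idea).
final class EmptyValue {
    private static final EmptyValue INSTANCE = new EmptyValue();

    private EmptyValue() {
    }

    // Returns the single shared instance, in the spirit of NullWritable.get().
    static EmptyValue get() {
        return INSTANCE;
    }

    // Serialization deliberately writes nothing to the stream.
    void write(DataOutput out) throws IOException {
    }

    // Serializes n placeholder values and reports how many bytes were produced.
    static int serializedSize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                get().write(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

Under these assumptions, EmptyValue.serializedSize(1000) returns 0: however many placeholder values are emitted, they add nothing to the serialized stream.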

The final output of word deduplication has the form <word>, so the value can be declared as NullWritable.
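Before looking at the Hadoop job, it may help to see the same idea without a cluster. The sketch below is plain Java with no Hadoop dependency, and the class name LocalDistinctSketch is hypothetical. It tokenizes each line (the map step), groups tokens by key in a sorted map (a rough stand-in for the shuffle), and emits each key once (the reduce step). One simplification to be aware of: String ordering is used here, which matches Hadoop's byte-wise Text ordering for ASCII words but is not guaranteed to match for all Unicode input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local simulation of the map / shuffle / reduce steps.
final class LocalDistinctSketch {

    static List<String> distinctWords(List<String> lines) {
        // Shuffle stand-in: key -> number of (word, placeholder) pairs seen.
        TreeMap<String, Integer> groups = new TreeMap<>();

        // Map step: one pair per token; the value carries no information,
        // so only the number of occurrences is tracked here.
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                groups.merge(itr.nextToken(), 1, Integer::sum);
            }
        }

        // Reduce step: each group produces exactly one output, the key itself.
        return new ArrayList<>(groups.keySet());
    }
}
```

For the two input lines "hello world" and "hello hadoop", distinctWords returns [hadoop, hello, world]. The grouped values are never inspected, which is why a placeholder value type is enough.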

The code is as follows:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctWord {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // To take the paths from the command line instead, import
        // org.apache.hadoop.util.GenericOptionsParser and use:
        //String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        String[] otherArgs = new String[]{"input", "output"};  // hard-coded input and output paths
        if (otherArgs.length < 2) {
            System.err.println("Usage: distinctword <in> [<in>...] <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "distinct word");

        job.setJarByClass(DistinctWord.class);  // locate the job jar through this class

        // Specify the Mapper and Reducer classes.
        // The reducer also serves as the combiner: it only emits each key once,
        // so running it on partial map output does not change the final result.
        job.setMapperClass(DistinctWordMapper.class);
        job.setCombinerClass(DistinctWordReducer.class);
        job.setReducerClass(DistinctWordReducer.class);

        // Specify the output types of the map task
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // Specify the output types of the reduce task
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Specify the input paths (every argument except the last)
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }

        // Specify the output path (the last argument)
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // The output value type is declared as NullWritable
    public static class DistinctWordMapper extends Mapper<Object, Text, Text, NullWritable> {
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());  // tokenizer: splits the line on whitespace

            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, NullWritable.get());
            }
        }
    }

    public static class DistinctWordReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        // Each call to reduce receives one group of identical words,
        // so writing the key once is enough.
        @Override
        public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
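The job registers DistinctWordReducer as the combiner as well as the reducer. A rough way to see why that is safe: collapsing duplicates is idempotent, so applying it to each map task's partial output first and then again to the merged result gives the same answer as applying it once to everything. The sketch below checks this in plain Java. The class CombinerSketch and its methods are hypothetical, and the lists of words stand in for the output of separate map tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of running the same dedup step as combiner and reducer.
final class CombinerSketch {

    // Reduce step: collapse a list of words to its sorted distinct values.
    static List<String> reduce(List<String> words) {
        return new ArrayList<>(new TreeSet<>(words));
    }

    // Combiner applied to each split's output before the final reduce.
    static List<String> withCombiner(List<List<String>> splits) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            shuffled.addAll(reduce(split));
        }
        return reduce(shuffled);
    }

    // No combiner: every word from every split reaches the final reduce.
    static List<String> withoutCombiner(List<List<String>> splits) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            shuffled.addAll(split);
        }
        return reduce(shuffled);
    }
}
```

Both paths return the same list for any input, while the combiner path sends fewer records through the shuffle. This matters because Hadoop treats the combiner as an optimization that it may run zero or more times, so the job must produce the same result either way.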

Original article: https://www.cnblogs.com/zyb993963526/p/10246987.html