Hadoop Study Notes (8): Building an Inverted Index in Practice


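The inverted index is the most commonly used data structure in document retrieval systems. Instead of starting from a document and listing its words, it starts from a word and looks up the documents that contain it and how often it occurs in each, which is why it is called an inverted index. Structurally, it is a table keyed by word.

In this index table, each word maps to the list of documents in which it occurs, and the weight is the number of times the word occurs in that document. Suppose the input is the following list of files: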

T1: hello world hello china

T2: hello hadoop

T3: bye world bye hadoop bye bye

Given these files as input, we want to end up with the following index file:

bye    T3:4;

china    T1:1;

hadoop    T2:1;T3:1;

hello    T1:2;T2:1;

world    T1:1;T3:1;
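
Before bringing Hadoop into it, the target structure can be sketched in plain Java on a single machine. This is only an illustration of the input and output shown above, not the MapReduce implementation; the class name `InvertedIndexSketch` and its methods are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of the index described above.
class InvertedIndexSketch {
    // Builds word -> (document -> count); TreeMap keeps words and documents sorted.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Formats one posting list in the article's style, e.g. "T1:2;T2:1;".
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> p : postings.entrySet()) {
            sb.append(p.getKey()).append(':').append(p.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("T1", "hello world hello china");
        docs.put("T2", "hello hadoop");
        docs.put("T3", "bye world bye hadoop bye bye");
        build(docs).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

The nested map makes the two levels of the index explicit: the outer key is the word, the inner key is the document, and the inner value is the weight.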

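Next we need to work out how to use Hadoop to turn this input into that output. As the previous chapter showed, this comes down to customizing the individual steps of a Hadoop job so that they do what we need. The map and reduce phases matter most and everything else plays a supporting role, so we start by analyzing what map and reduce should do.

Start with the map phase. Its input is text, arriving one line record at a time. What about its output? It has to carry three things: the word, the file the word came from, and a count. Map output is a key-value pair, so which of these belongs in the key and which in the value? The count has to be accumulated, so it must go in the value, and the word goes in the key. What about the file? Occurrences of the same word in different files must not be added together, so the file belongs in the key as well. The key therefore holds two values, the word and the file, and the value is the default count of 1, which the later stages add up.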

The map output should therefore look like this:

Key value

hello:T1 1

world:T1 1

hello:T1 1

china:T1 1

hello:T2 1

…
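
The table above can be reproduced with a few lines of plain Java. This is a sketch of the map logic only, with no Hadoop classes involved; `MapPhaseSketch` is a hypothetical name, and the `word:file` string stands in for the composite key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map phase: one record per token.
class MapPhaseSketch {
    // Emits "word:file<TAB>1" for every whitespace-separated token in the line.
    static List<String> map(String fileName, String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken() + ":" + fileName + "\t1");
        }
        return out;
    }

    public static void main(String[] args) {
        map("T1", "hello world hello china").forEach(System.out::println);
        map("T2", "hello hadoop").forEach(System.out::println);
    }
}
```

Note that `hello:T1` is emitted twice with a count of 1 each time; nothing is summed at this stage.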

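Since this key is composite, the standard types no longer meet our needs, so we have to define a composite key. How to write one was covered in the previous chapter, so here is the code directly:

    // Imports needed at the top of the enclosing InvertedIndex.java:
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite key: the word plus the file it was found in.
    public static class MyType implements WritableComparable<MyType> {
        private String word;
        private String filePath;

        // Hadoop needs a no-argument constructor to deserialize keys.
        public MyType() {
        }

        public String Getword() { return word; }
        public void Setword(String value) { word = value; }

        public String GetfilePath() { return filePath; }
        public void SetfilePath(String value) { filePath = value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeUTF(filePath);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();
            filePath = in.readUTF();
        }

        // Order by word, then by file. Strings must be compared by value
        // (compareTo/equals), never by reference (== or !=).
        // With this ordering, keys for the same word in different files are
        // distinct, so reduce merges them into one line only if the job
        // groups keys by word alone (for example with a grouping comparator).
        @Override
        public int compareTo(MyType other) {
            int byWord = word.compareTo(other.word);
            return byWord != 0 ? byWord : filePath.compareTo(other.filePath);
        }

        // The default HashPartitioner uses hashCode(); hashing only the word
        // sends every file's entry for a word to the same reducer.
        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object o) {
            return o instanceof MyType && compareTo((MyType) o) == 0;
        }
    }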
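
To see what the serialization and comparison methods of a composite key do, here is a small self-contained sketch that uses only `java.io` streams rather than Hadoop. `CompositeKeySketch` is a hypothetical class that mirrors the two-field layout; it round-trips a key through `writeUTF`/`readUTF` and shows why the fields have to be compared by value: after deserialization the two `word` fields are separate `String` objects even when their contents match.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical two-field key, serialized the same way as a Writable would be.
class CompositeKeySketch implements Comparable<CompositeKeySketch> {
    final String word;
    final String file;

    CompositeKeySketch(String word, String file) {
        this.word = word;
        this.file = file;
    }

    // Writes both fields in a fixed order.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(word);
            out.writeUTF(file);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the fields back in the same order they were written.
    static CompositeKeySketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String word = in.readUTF();
            String file = in.readUTF();
            return new CompositeKeySketch(word, file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Value-based ordering: word first, file as the tie-breaker.
    @Override
    public int compareTo(CompositeKeySketch other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : file.compareTo(other.file);
    }

    public static void main(String[] args) {
        CompositeKeySketch a = fromBytes(new CompositeKeySketch("hello", "T1").toBytes());
        CompositeKeySketch b = fromBytes(new CompositeKeySketch("hello", "T2").toBytes());
        System.out.println("same reference: " + (a.word == b.word));
        System.out.println("same content:   " + a.word.equals(b.word));
        System.out.println("a before b:     " + (a.compareTo(b) < 0));
    }
}
```

A comparison written with `==` or `!=` on these fields would test object identity, so the file tie-breaker would never be reached for deserialized keys.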
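With the composite key defined, the map function is easy to write:

    // Additional imports for this class:
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public static class InvertedIndexMapper extends
            Mapper<Object, Text, MyType, Text> {

        @Override
        public void map(Object key, Text value, Context context)
                throws InterruptedException, IOException {
            // The input split tells us which file this line came from.
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();

            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Emit ((word, file), "1") for every token on the line.
                MyType keyInfo = new MyType();
                keyInfo.Setword(itr.nextToken());
                keyInfo.SetfilePath(fileName);
                context.write(keyInfo, new Text("1"));
            }
        }
    }

Note: the path reported by the split is a full path. To keep the output readable we drop the directory part and keep only the file name, which is what getName() returns.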

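With map in place, we can turn to Reduce, and to the Combine step that runs after map. The map output key type is MyType, so the input key type of both Reduce and Combine must be MyType as well.

If the map output were sent straight to Reduce, Reduce would still have a lot of work to do regrouping the words in the keys. So we add a Combine step before Reduce that merges the counts first. Be aware that Hadoop treats a combiner as an optional optimization and may run it zero, one, or several times for a given map output; this design assumes it runs exactly once, which Hadoop does not guarantee in general.

The Combine step will output the following values: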

Key value

bye    T3:4;

china    T1:1;

hadoop    T2:1;

hadoop    T3:1;

hello    T1:2;

hello    T2:1;

world    T1:1;

world    T3:1;

The code is as follows:

    // Additional import for this class:
    import org.apache.hadoop.mapreduce.Reducer;

    public static class InvertedIndexCombiner extends
            Reducer<MyType, Text, MyType, Text> {

        @Override
        public void reduce(MyType key, Iterable<Text> values, Context context)
                throws InterruptedException, IOException {
            // Add up the "1" values the mapper emitted for this key.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            // Emit "file:count" so the reducer only has to concatenate.
            context.write(key, new Text(key.GetfilePath() + ":" + sum));
        }
    }
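
The summing step can be tried out without a cluster. The following is a plain-Java sketch of what Combine does to the records of one file, under the assumption that all records passed in come from a single map task; `CombineSketch` and its record layout (`{word, file, count}`) are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Combine step.
class CombineSketch {
    // Sums the counts per (word, file) pair and emits "word<TAB>file:sum" records.
    static List<String> combine(String[][] mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] rec : mapOutput) { // rec = {word, file, count}
            sums.merge(rec[0] + "\t" + rec[1], Integer.parseInt(rec[2]), Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            out.add(e.getKey() + ":" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Map output for T3: bye world bye hadoop bye bye
        String[][] t3 = {
            {"bye", "T3", "1"}, {"world", "T3", "1"}, {"bye", "T3", "1"},
            {"hadoop", "T3", "1"}, {"bye", "T3", "1"}, {"bye", "T3", "1"},
        };
        combine(t3).forEach(System.out::println);
    }
}
```

The sketch also makes the fragility visible: feeding its own output back in would fail, because `file:sum` is no longer a plain integer. The same applies to a combiner written this way if the framework runs it more than once.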
With the Combine output above, Reduce is easy. It only has to concatenate the values:

    public static class InvertedIndexReducer extends
            Reducer<MyType, Text, Text, Text> {

        @Override
        public void reduce(MyType key, Iterable<Text> values, Context context)
                throws InterruptedException, IOException {
            // Join the "file:count" values into one list, e.g. "T1:2;T2:1;".
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            // The output key is the word alone, without the file.
            context.write(new Text(key.Getword()), new Text(fileList.toString()));
        }
    }
  
After this Reduce step we get the following result:

bye    T3:4;

china    T1:1;

hadoop    T2:1;T3:1;

hello    T1:2;T2:1;

world    T1:1;T3:1;
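
The final merge can likewise be sketched in plain Java. The example below takes the eight combined records listed earlier and groups them by word; `ReduceSketch` is a hypothetical name, and the sketch assumes the records arrive sorted by word and then file, as they would after the shuffle.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Reduce step.
class ReduceSketch {
    // Concatenates the "file:count" values of each word, each followed by ";".
    static Map<String, String> reduce(String[][] combined) {
        Map<String, StringBuilder> acc = new TreeMap<>();
        for (String[] rec : combined) { // rec = {word, "file:count"}
            acc.computeIfAbsent(rec[0], w -> new StringBuilder())
               .append(rec[1]).append(';');
        }
        Map<String, String> out = new TreeMap<>();
        acc.forEach((word, list) -> out.put(word, list.toString()));
        return out;
    }

    public static void main(String[] args) {
        String[][] combined = {
            {"bye", "T3:4"}, {"china", "T1:1"}, {"hadoop", "T2:1"}, {"hadoop", "T3:1"},
            {"hello", "T1:2"}, {"hello", "T2:1"}, {"world", "T1:1"}, {"world", "T3:1"},
        };
        reduce(combined).forEach((word, list) -> System.out.println(word + "\t" + list));
    }
}
```

The grouping here is by word only. In the real job the same effect depends on how Hadoop groups the composite keys before calling the reducer.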

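Finally, with the map and reduce functions written, we can wire them into a Job and run it.

    // Additional imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.WritableComparator, org.apache.hadoop.mapreduce.Job,
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Added helper (the name is our own): groups reducer input by word only,
    // so that all files of a word are handled by a single reduce() call.
    public static class WordGroupingComparator extends WritableComparator {
        public WordGroupingComparator() {
            super(MyType.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((MyType) a).Getword().compareTo(((MyType) b).Getword());
        }
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        System.out.println("url:" + conf.get("fs.default.name"));

        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(MyType.class);
        job.setMapOutputValueClass(Text.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setGroupingComparatorClass(WordGroupingComparator.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Delete a stale output directory so the job can be rerun while debugging.
        Path path = new Path("out");
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(path)) {
            hdfs.delete(path, true);
        }
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, path);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

Note: to make debugging easier, the in and out paths are hard-coded, so no command-line arguments are needed. Before each run the program checks whether the out directory exists and deletes it if it does. The grouping comparator makes the reducer treat all keys with the same word as one group, which the one-line-per-word output depends on.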
Original article: https://www.cnblogs.com/zjfstudio/p/3913549.html
