MapReduce Inverted Index

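An inverted index is mainly used for full-text search. It builds a word -> url structure, that is, a mapping from each word to the documents that contain it. For example, take three files:
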
file1:

MapReduce is simple

file2:

MapReduce is powerful is simple

file3:

Hello MapReduce bye MapReduce

After building the inverted index, the result is:

Hello     file3.txt:1;
MapReduce     file3.txt:2;file1.txt:1;file2.txt:1;
bye     file3.txt:1;
is     file1.txt:1;file2.txt:2;
powerful     file2.txt:1;
simple     file2.txt:1;file1.txt:1;
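
To make the target structure concrete before getting to MapReduce, here is a minimal single-machine sketch in plain Java that builds the same word -> file:count mapping in memory. The class name InvertedIndexSketch and its helper methods are my own illustration, not part of the job, and I assume the three files are named file1.txt, file2.txt and file3.txt as in the listing above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class InvertedIndexSketch {

    // Builds word -> (file name -> occurrence count) for the given documents.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            StringTokenizer st = new StringTokenizer(doc.getValue());
            while (st.hasMoreTokens()) {
                index.computeIfAbsent(st.nextToken(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Formats one posting list as "file:count;file:count;".
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : postings.entrySet()) {
            sb.append(e.getKey()).append(':').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    // The three sample files from the post (file names are assumed).
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "MapReduce is simple");
        docs.put("file2.txt", "MapReduce is powerful is simple");
        docs.put("file3.txt", "Hello MapReduce bye MapReduce");
        return docs;
    }

    public static void main(String[] args) {
        build(sampleDocs()).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

Because this sketch keeps the postings in a TreeMap, the files inside each list come out sorted by name, so the order within a line can differ from the job output shown above.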

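The main idea is to iterate over the contents of each file and emit key = word + filename, value = 1. The combiner then sums the values that share a key, which gives the number of times the word occurs in one file, and moves the filename out of the key into the value. Finally, the reducer joins the filename:count entries for each word. There is a small bug here, though. A combiner is normally a local reduce on a single node (more precisely, on the output of a single map task), but here I rely on it to count over a whole file. If a file is large, its contents may well be divided into several splits that are processed on different nodes. Each combiner then sees only part of the file, so the same file shows up several times for one word, each time with a partial count. This approach is therefore only suitable for small files.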
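
The split problem described above can be reproduced without a cluster. The following plain-Java sketch simulates two map tasks that each receive part of file2.txt, then compares a reducer that simply concatenates values (as in this post) with one that sums the partial counts per file. Everything here is an assumption made for illustration: the class SplitCountSketch, the method names, and the split boundary in the middle of the line (real Hadoop splits follow block sizes and line boundaries). The merging reducer is one possible fix, not something the original job does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class SplitCountSketch {

    // Stands in for one map task plus its combiner: counts each word in a
    // single split and produces word -> "filename:count".
    static Map<String, String> mapAndCombine(String filename, String splitText) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(splitText);
        while (st.hasMoreTokens()) {
            counts.merge(st.nextToken(), 1, Integer::sum);
        }
        Map<String, String> out = new TreeMap<>();
        counts.forEach((word, n) -> out.put(word, filename + ":" + n));
        return out;
    }

    // Reducer in the style of the post: concatenates whatever arrives.
    static String naiveReduce(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            sb.append(v).append(';');
        }
        return sb.toString();
    }

    // Hypothetical fix: sum the partial counts per file name in the reducer.
    static String mergingReduce(List<String> values) {
        Map<String, Integer> perFile = new TreeMap<>();
        for (String v : values) {
            int sep = v.lastIndexOf(':');
            perFile.merge(v.substring(0, sep),
                    Integer.parseInt(v.substring(sep + 1)), Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        perFile.forEach((file, n) -> sb.append(file).append(':').append(n).append(';'));
        return sb.toString();
    }

    // Collects the values one word receives from two made-up splits of file2.txt.
    static List<String> valuesFor(String word) {
        List<String> values = new ArrayList<>();
        for (String split : new String[] {"MapReduce is powerful", "is simple"}) {
            String v = mapAndCombine("file2.txt", split).get(word);
            if (v != null) {
                values.add(v);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> values = valuesFor("is");
        System.out.println("naive:   " + naiveReduce(values));
        System.out.println("merging: " + mergingReduce(values));
    }
}
```

With the naive reducer the word "is" gets two separate file2.txt entries, each with a partial count, while the merging variant reports a single entry with the total. Doing the final summation in the reducer removes the dependence on how files are split.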

PS: the file name is obtained with String filename = ((FileSplit) context.getInputSplit()).getPath().getName(); apart from that there does not seem to be anything else worth noting.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text ONE = new Text("1");
    private final Text outKey = new Text();

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        // Name of the file that the current input split belongs to.
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            // Emit ("word:filename", "1") for every token on the line.
            outKey.set(st.nextToken() + ":" + filename);
            context.write(outKey, ONE);
        }
    }
}

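import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyCombiner extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Every value is "1", so counting the values gives the number of
        // occurrences of the word in this file (as seen by this map task).
        int sum = 0;
        for (Text val : values) {
            sum++;
        }
        // Split "word:filename" and re-emit it as (word, "filename:count").
        // Caveat: this changes the key, so the job relies on the combiner
        // running exactly once on each map output, which Hadoop does not guarantee.
        StringTokenizer st = new StringTokenizer(_key.toString(), ":");
        String word = st.nextToken();
        String filename = st.nextToken();
        context.write(new Text(word), new Text(filename + ":" + sum));
    }
}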

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all "filename:count" entries for this word into one list.
        StringBuilder filelist = new StringBuilder();
        for (Text val : values) {
            filelist.append(val.toString()).append("; ");
        }
        context.write(_key, new Text(filelist.toString()));
    }
}

Original article: https://www.cnblogs.com/sunrye/p/4543365.html
