Reposted from: http://rangerwolf.iteye.com/blog/2119096
An aside:
Hadoop in Action is an excellent introduction to Hadoop, and I recommend reading it in English: the author's writing is simple and easy to follow, so anyone with basic English reading skills should find the English edition very approachable.
Now to the main topic. This exercise comes from Hadoop in Action: find the Top K values.
I threw together a quick input file:
- g 445
- a 1117
- b 222
- c 333
- d 444
- e 123
- f 345
- h 456
My approach:
For a Top K problem, each block/split must first produce its own local Top K. Because each task should emit its result only once, after it has seen all of its records, that output belongs in the cleanup method. For simplicity I use Java's TreeMap, since that data structure keeps its entries sorted by key. The Reducer then follows exactly the same pattern as the Mapper; the Map step alone is not enough, so a single Reduce step merges the per-split results into the global Top K.
The final output looks like this (K = 2):
- 1117 a
- 456 h
A map is needed because each count has to stay paired with its name; if only the counts themselves were required, a plain TreeSet would be more efficient.
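To make the TreeMap trick concrete, here is a minimal standalone sketch (plain Java, no Hadoop; the sample data is made up) showing how putting each pair and evicting firstKey() whenever the size exceeds K leaves exactly the K largest counts:

import java.util.TreeMap;

public class TopKSketch {
    public static void main(String[] args) {
        final int K = 2;
        // TreeMap keeps keys in ascending order, so firstKey() is the smallest.
        TreeMap<Integer, String> top = new TreeMap<Integer, String>();
        int[] counts = {445, 1117, 222, 456};
        String[] names = {"g", "a", "b", "h"};
        for (int i = 0; i < counts.length; i++) {
            top.put(counts[i], names[i]);
            if (top.size() > K) {
                top.remove(top.firstKey()); // evict the current smallest count
            }
        }
        // Prints {456=h, 1117=a}: only the two largest counts survive.
        System.out.println(top);
    }
}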
Here is the implementation:
package hadoop_in_action_exersice;

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopK {

    public static final int K = 2;

    public static class KMap extends Mapper<LongWritable, Text, IntWritable, Text> {

        // Sorted by count (ascending); holds this split's local Top K.
        // Note: two names with the same count collide on the same key,
        // so the later one overwrites the earlier one.
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.trim().length() > 0 && line.indexOf(" ") != -1) {
                String[] arr = line.split(" ", 2);
                String name = arr[0];
                Integer num = Integer.parseInt(arr[1]);
                map.put(num, name);
                if (map.size() > K) {
                    // firstKey() is the smallest count; evict it.
                    map.remove(map.firstKey());
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Runs once per map task, after all records: emit the local Top K.
            for (Integer num : map.keySet()) {
                context.write(new IntWritable(num), new Text(map.get(num)));
            }
        }
    }

    public static class KReduce extends Reducer<IntWritable, Text, IntWritable, Text> {

        // Merges the mappers' local Top Ks into the global Top K.
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();

        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Only the first value for each count is kept; ties are dropped.
            map.put(key.get(), values.iterator().next().toString());
            if (map.size() > K) {
                map.remove(map.firstKey());
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            for (Integer num : map.keySet()) {
                context.write(new IntWritable(num), new Text(map.get(num)));
            }
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            Job job = new Job(conf, "top k");
            job.setJarByClass(TopK.class);
            job.setMapperClass(KMap.class);
            // KReduce also works as a combiner: it just trims each mapper's
            // output down to K entries before the shuffle.
            job.setCombinerClass(KReduce.class);
            job.setReducerClass(KReduce.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path("/home/hadoop/DataSet/Hadoop/WordCount-Result"));
            FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/DataSet/Hadoop/TopK-output1"));
            System.out.println(job.waitForCompletion(true));
        } catch (IOException | ClassNotFoundException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
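One caveat with this implementation: TreeMap keys are unique, so two words that happen to share the same count overwrite each other and ties are silently lost (both in map, and in reduce, where only values.iterator().next() is read). A minimal sketch of one way to fix that (my own variant, not from the book: map each count to a list of names and track the total number of pairs held):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TopKWithTies {
    public static void main(String[] args) {
        final int K = 2;
        // Each count maps to ALL names that share it.
        TreeMap<Integer, List<String>> top = new TreeMap<Integer, List<String>>();
        int size = 0; // total number of (count, name) pairs currently held
        int[] counts = {445, 456, 456, 1117};
        String[] names = {"g", "h", "x", "a"};
        for (int i = 0; i < counts.length; i++) {
            List<String> bucket = top.get(counts[i]);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                top.put(counts[i], bucket);
            }
            bucket.add(names[i]);
            size++;
            if (size > K) {
                // Evict one pair from the smallest count's bucket.
                List<String> smallest = top.firstEntry().getValue();
                smallest.remove(smallest.size() - 1);
                if (smallest.isEmpty()) {
                    top.remove(top.firstKey());
                }
                size--;
            }
        }
        System.out.println(top); // prints {456=[h], 1117=[a]}
    }
}

The same bookkeeping would drop into the map and reduce bodies above in place of the single put/remove pair.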