MapReduce Example: Secondary Sort

Principle

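In the Map phase, the InputFormat set through job.setInputFormatClass divides the input data set into small chunks called splits and supplies a RecordReader implementation. This experiment uses TextInputFormat, whose RecordReader produces the byte offset of each line as the key and the text of that line as the value. That is why the custom Map takes <LongWritable, Text> as input. The framework then calls the map method of the custom Map once for each <LongWritable, Text> pair. The output must match the output types declared by the custom Map, <IntPair, IntWritable>, so the map phase ends up with a list of <IntPair, IntWritable> pairs. At the end of the map phase, the class set through job.setPartitionerClass partitions this list, and each partition is mapped to one reducer. Within each partition, the pairs are sorted with the key comparator class set through job.setSortComparatorClass. This is already a two-level ordering: first by partition, then by key. If no key comparator class is set through job.setSortComparatorClass, the compareTo method implemented by the key is used instead. This experiment relies on the compareTo method implemented by IntPair.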

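In the Reduce phase, once a reducer has received all the map output assigned to it, it again sorts all the pairs with the key comparator class set through job.setSortComparatorClass. It then builds a value iterator for each key, which is where grouping comes in: the grouping comparator class set through job.setGroupingComparatorClass is used. Whenever this comparator reports two keys as equal, they belong to the same group and their values are placed in the same value iterator; the key paired with that iterator is the first key of the group. Finally, the reduce method of the Reducer is called with each (key, value iterator) pair. As before, the input and output types must match those declared by the custom Reducer.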
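
The sort-then-group behavior described here can be imitated in plain Java without Hadoop. The sketch below is only an illustration under stated assumptions: the class ShuffleSketch and its method run are hypothetical names, keys are modeled as int[]{click_num, goods_id}, a single reducer is assumed, and a short dashed line marks the point where a new reduce call would begin. It stands in for the framework's shuffle rather than reproducing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java imitation of what a single reducer sees; this is not Hadoop code.
final class ShuffleSketch {

    // Sort comparator: click_num descending, then goods_id ascending.
    // Each key is an int[]{click_num, goods_id}.
    static final Comparator<int[]> SORT = new Comparator<int[]>() {
        @Override
        public int compare(int[] a, int[] b) {
            if (a[0] != b[0]) {
                return a[0] < b[0] ? 1 : -1;
            }
            return Integer.compare(a[1], b[1]);
        }
    };

    // Sorts the keys, then starts a new group whenever click_num changes.
    static List<String> run(List<int[]> keys) {
        List<int[]> sorted = new ArrayList<int[]>(keys);
        Collections.sort(sorted, SORT);
        List<String> out = new ArrayList<String>();
        Integer current = null;
        for (int[] key : sorted) {
            if (current == null || current.intValue() != key[0]) {
                out.add("----"); // a new group, i.e. a new reduce call, begins here
                current = key[0];
            }
            out.add(key[0] + "\t" + key[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> keys = Arrays.asList(
                new int[]{100, 1010037}, new int[]{104, 1010510},
                new int[]{97, 1010152}, new int[]{104, 1010280});
        for (String line : run(keys)) {
            System.out.println(line);
        }
    }
}
```

The two records with click_num 104 end up adjacent and in ascending goods_id order, which is the effect the sort comparator and the grouping comparator produce together.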

Environment

Linux Ubuntu 14.04

jdk-7u75-linux-x64

hadoop-2.6.0-cdh5.4.5

hadoop-2.6.0-eclipse-cdh5.4.5.jar

eclipse-java-juno-SR2-linux-gtk-x86_64

Task

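On an e-commerce site, each product page view by a user produces an access log entry that records which product was viewed. The table goods_visit2 has two fields, (goods_id, click_num), with the following data:
    goods_id click_num
    1010037 100
    1010102 100
    1010152 97
    1010178 96
    1010280 104
    1010320 103
    1010510 104
    1010603 96
    1010637 97
Write MapReduce code that sorts the products by click count (click_num) in descending order, then by goods_id in ascending order, and outputs all products.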

The expected output is:
    click_num goods_id
    ------------------------------------------------
    104 1010280
    104 1010510
    ------------------------------------------------
    103 1010320
    ------------------------------------------------
    100 1010037
    100 1010102
    ------------------------------------------------
    97 1010152
    97 1010637
    ------------------------------------------------
    96 1010178
    96 1010603
Procedure

1. Change to the /apps/hadoop/sbin directory and start Hadoop.
    cd /apps/hadoop/sbin
    ./start-all.sh
2. Create the /data/mapreduce8 directory on the local Linux file system.
    mkdir -p /data/mapreduce8
3. Change to the /data/mapreduce8 directory and use wget to download the text file goods_visit2 from http://192.168.1.100:60000/allfiles/mapreduce8/goods_visit2.
    cd /data/mapreduce8
    wget http://192.168.1.100:60000/allfiles/mapreduce8/goods_visit2
Then, in the same directory, use wget to download the dependency package used by the project from http://192.168.1.100:60000/allfiles/mapreduce8/hadoop2lib.tar.gz.
    wget http://192.168.1.100:60000/allfiles/mapreduce8/hadoop2lib.tar.gz
Extract hadoop2lib.tar.gz into the current directory.
    tar zxvf hadoop2lib.tar.gz
4. Create the /mymapreduce8/in directory on HDFS, then upload the goods_visit2 file from the local /data/mapreduce8 directory into /mymapreduce8/in on HDFS.
    hadoop fs -mkdir -p /mymapreduce8/in
    hadoop fs -put /data/mapreduce8/goods_visit2 /mymapreduce8/in
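5. Create a new Java Project named mapreduce8.

In the mapreduce8 project, create a package named mapreduce.

In the mapreduce package, create a class named SecondarySort.

6. Add the jar files the project depends on. Right-click mapreduce8 and create a folder named hadoop2lib to hold them.

Copy the jar files from the hadoop2lib directory under /data/mapreduce8 into the hadoop2lib folder of the mapreduce8 project in Eclipse.

Select all the jar files in the hadoop2lib folder and add them to the Build Path.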

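7. Write the Java code and describe its design

Secondary sort: in MapReduce every key must be comparable and sortable, and the ordering happens at two levels: keys are first distributed by the partitioner and then sorted by value within each partition. This example also needs two comparisons: sort by the first field, and when the first fields are equal, sort by the second field. To do this we build a composite key class IntPair with two fields. The partitioner routes keys by the first field, so records with the same click_num reach the same reducer, and the comparison inside each partition (IntPair's compareTo) orders keys by the first field and then by the second. The Java code has five parts: the custom key, the custom partitioner class, the grouping comparator class, the map part, and the reduce part.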

Code for the custom key:
    public static class IntPair implements WritableComparable<IntPair>
    {
        int first;  // first member variable (click_num in this example)
        int second; // second member variable (goods_id in this example)

        public void set(int left, int right)
        {
            first = left;
            second = right;
        }
        public int getFirst()
        {
            return first;
        }
        public int getSecond()
        {
            return second;
        }
        // Deserialization: rebuild the IntPair from the binary stream
        @Override
        public void readFields(DataInput in) throws IOException
        {
            first = in.readInt();
            second = in.readInt();
        }
        // Serialization: write the IntPair to the stream as binary
        @Override
        public void write(DataOutput out) throws IOException
        {
            out.writeInt(first);
            out.writeInt(second);
        }
        // Key comparison: first descending, then second ascending
        @Override
        public int compareTo(IntPair o)
        {
            if (first != o.first)
            {
                return first < o.first ? 1 : -1;
            }
            else if (second != o.second)
            {
                return second < o.second ? -1 : 1;
            }
            else
            {
                return 0;
            }
        }
        @Override
        public int hashCode()
        {
            return first * 157 + second;
        }
        @Override
        public boolean equals(Object right)
        {
            if (right == null)
                return false;
            if (this == right)
                return true;
            if (right instanceof IntPair)
            {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            }
            else
            {
                return false;
            }
        }
    }
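Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable, and must implement the interface's methods. The class contains the following methods: 1. deserialization, which rebuilds an IntPair from the binary stream: public void readFields(DataInput in) throws IOException; 2. serialization, which writes the IntPair to the stream as binary: public void write(DataOutput out) throws IOException; 3. key comparison: public int compareTo(IntPair o). A newly defined key class should also override two further methods: public int hashCode() and public boolean equals(Object right).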
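
To see what the write and readFields pair does, the round trip can be tried with the JDK's own stream classes. This is a minimal sketch, not the Hadoop class: PairRoundTrip is a hypothetical stand-in that copies the two method bodies of IntPair but does not implement WritableComparable, so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in for IntPair: same write/readFields bodies, no Hadoop interface.
final class PairRoundTrip {
    int first;
    int second;

    // Serialization: two 4-byte big-endian ints, first then second.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialization: fields must be read in the order they were written.
    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serializes a pair to bytes and reads it back into a fresh object.
    static PairRoundTrip roundTrip(int first, int second) {
        try {
            PairRoundTrip original = new PairRoundTrip();
            original.first = first;
            original.second = second;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PairRoundTrip copy = new PairRoundTrip();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            // In-memory streams are not expected to fail; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PairRoundTrip copy = roundTrip(104, 1010280);
        System.out.println(copy.first + " " + copy.second);
    }
}
```

The point to take away is that readFields must read the fields in exactly the order write emitted them; swapping the two reads would silently exchange click_num and goods_id.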

Code for the partitioner class:
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>
    {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions)
        {
            // Only the first field decides the partition
            return Math.abs(key.getFirst() * 127) % numPartitions;
        }
    }
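The key is partitioned by multiplying the first field of the custom key by 127, taking the absolute value, and taking the remainder modulo numPartitions. Because only first is used, all keys with the same first value fall into the same partition and therefore reach the same reducer. This is the first level of the two-level ordering: distribution by partition.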
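
The arithmetic can be checked by hand against the sample click counts. The sketch below is a plain-Java illustration: PartitionSketch is a hypothetical class that applies the same expression to a bare int, and the choice of three partitions is an assumption made only for the demonstration.

```java
// Plain-Java copy of the partition arithmetic; not the Hadoop Partitioner itself.
final class PartitionSketch {

    // Same expression as FirstPartitioner.getPartition, applied to a bare int.
    static int partition(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    public static void main(String[] args) {
        int[] clicks = {104, 103, 100, 97, 96};
        for (int c : clicks) {
            // With 3 partitions: 104 -> 2, 103 -> 1, 100 -> 1, 97 -> 1, 96 -> 0
            System.out.println(c + " -> partition " + partition(c, 3));
        }
    }
}
```

With a single reducer every key maps to partition 0 and the output file is fully ordered. With several reducers, each output file is ordered internally, but the files taken together are not globally ordered by click_num, since this partitioner distributes keys by a hash-like value rather than by range.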

Code for the grouping comparator class:
    public static class GroupingComparator extends WritableComparator
    {
        protected GroupingComparator()
        {
            super(IntPair.class, true);
        }
        // Compare two WritableComparables by their first field only
        @Override
        public int compare(WritableComparable w1, WritableComparable w2)
        {
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int l = ip1.getFirst();
            int r = ip2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }
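This is the grouping comparator class. In the reduce phase, when the value iterator for a key is built, all keys with the same first value belong to the same group and share one value iterator. Since this class is a comparator, it must extend WritableComparator.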
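
The difference between the sort order and the grouping test is easiest to see side by side. The following sketch is a plain-Java illustration with hypothetical names (GroupingSketch, sortCompare, groupCompare); keys are modeled as int[]{first, second} and the two methods mirror IntPair.compareTo and GroupingComparator.compare respectively.

```java
// Side-by-side view of the two comparisons; not Hadoop code.
final class GroupingSketch {

    // Mirrors IntPair.compareTo: first descending, then second ascending.
    static int sortCompare(int[] a, int[] b) {
        if (a[0] != b[0]) {
            return a[0] < b[0] ? 1 : -1;
        }
        if (a[1] != b[1]) {
            return a[1] < b[1] ? -1 : 1;
        }
        return 0;
    }

    // Mirrors GroupingComparator.compare: only the first field is examined.
    static int groupCompare(int[] a, int[] b) {
        return a[0] == b[0] ? 0 : (a[0] < b[0] ? -1 : 1);
    }

    public static void main(String[] args) {
        int[] x = {104, 1010280};
        int[] y = {104, 1010510};
        int[] z = {103, 1010320};
        System.out.println(sortCompare(x, y));  // -1: x sorts before y
        System.out.println(groupCompare(x, y)); // 0: same group
        System.out.println(groupCompare(y, z)); // 1: different groups
    }
}
```

For grouping, what matters is whether the result is zero for neighboring keys in the already sorted stream, so it does no harm that groupCompare orders first ascending while the sort orders it descending.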

Map code:
    public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
    {
        // Custom map: reused key and value objects
        private final IntPair intkey = new IntPair();
        private final IntWritable intvalue = new IntWritable();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;  // goods_id
            int right = 0; // click_num
            if (tokenizer.hasMoreTokens())
            {
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                    right = Integer.parseInt(tokenizer.nextToken());
                // Key is (click_num, goods_id); value is goods_id
                intkey.set(right, left);
                intvalue.set(left);
                context.write(intkey, intvalue);
            }
        }
    }
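In the map phase, the InputFormat set through job.setInputFormatClass divides the input into splits and provides a RecordReader. This example uses TextInputFormat, whose RecordReader gives the byte offset of each line (not its line number) as the key and the text of the line as the value, which is why the custom Map takes <LongWritable, Text> as input. The map method is called once per pair: it splits the line into goods_id and click_num, stores (click_num, goods_id) in the IntPair key, and emits goods_id as the value, matching the declared output types <IntPair, IntWritable>. At the end of the map phase the output is partitioned by the class set through job.setPartitionerClass, with each partition mapped to one reducer, and the pairs in each partition are sorted by the key comparator class set through job.setSortComparatorClass. If no such class is set, the key's own compareTo method is used; this example uses the compareTo method implemented by IntPair.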
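
One practical detail of the parsing step: Integer.parseInt throws NumberFormatException on a non-numeric token, so a header row such as "goods_id click_num" would fail the task if it were present in the input file. Whether the downloaded file actually contains such a row is not stated here, so the following is only a defensive sketch under that assumption. LineParseSketch and parse are hypothetical names, and the method returns null for lines it cannot interpret instead of emitting them.

```java
import java.util.StringTokenizer;

// Defensive variant of the line parsing done in map(); plain Java, no Hadoop.
final class LineParseSketch {

    // Returns {click_num, goods_id}, or null when the line does not hold
    // two integer fields (for example a header row or an empty line).
    static int[] parse(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (tokenizer.countTokens() < 2) {
            return null;
        }
        try {
            int goodsId = Integer.parseInt(tokenizer.nextToken());
            int clickNum = Integer.parseInt(tokenizer.nextToken());
            return new int[]{clickNum, goodsId};
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        int[] row = parse("1010037 100");
        System.out.println(row[0] + " " + row[1]);               // prints "100 1010037"
        System.out.println(parse("goods_id click_num") == null); // prints "true"
    }
}
```

In a mapper, a null result would simply mean skipping the context.write call for that line.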

Reduce code:
    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
    {
        private final Text left = new Text();
        private static final Text SEPARATOR = new Text("------------------------------------------------");

        @Override
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            // One separator line per group
            context.write(SEPARATOR, null);
            left.set(Integer.toString(key.getFirst()));
            System.out.println(left);
            // Emit click_num together with each goods_id in the group
            for (IntWritable val : values)
            {
                context.write(left, val);
            }
        }
    }
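In the reduce phase, once a reducer has received all the map output assigned to it, it again sorts all the pairs with the key comparator class set through job.setSortComparatorClass (here, IntPair's compareTo). It then builds a value iterator for each key using the grouping comparator class set through job.setGroupingComparatorClass. Keys that this comparator reports as equal belong to the same group, their values share one value iterator, and the key passed along with that iterator is the first key of the group. The reduce method is then called once per (key, value iterator) pair: here it writes a separator line and then one line per value, pairing click_num with each goods_id. As before, the input and output types must match those declared by the custom Reducer.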

Complete code:
    package mapreduce;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SecondarySort
    {
        // Composite key: first = click_num, second = goods_id
        public static class IntPair implements WritableComparable<IntPair>
        {
            int first;
            int second;

            public void set(int left, int right)
            {
                first = left;
                second = right;
            }
            public int getFirst()
            {
                return first;
            }
            public int getSecond()
            {
                return second;
            }
            // Deserialization
            @Override
            public void readFields(DataInput in) throws IOException
            {
                first = in.readInt();
                second = in.readInt();
            }
            // Serialization
            @Override
            public void write(DataOutput out) throws IOException
            {
                out.writeInt(first);
                out.writeInt(second);
            }
            // Key comparison: first descending, then second ascending
            @Override
            public int compareTo(IntPair o)
            {
                if (first != o.first)
                {
                    return first < o.first ? 1 : -1;
                }
                else if (second != o.second)
                {
                    return second < o.second ? -1 : 1;
                }
                else
                {
                    return 0;
                }
            }
            @Override
            public int hashCode()
            {
                return first * 157 + second;
            }
            @Override
            public boolean equals(Object right)
            {
                if (right == null)
                    return false;
                if (this == right)
                    return true;
                if (right instanceof IntPair)
                {
                    IntPair r = (IntPair) right;
                    return r.first == first && r.second == second;
                }
                else
                {
                    return false;
                }
            }
        }

        // Partition by the first field of the key
        public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>
        {
            @Override
            public int getPartition(IntPair key, IntWritable value, int numPartitions)
            {
                return Math.abs(key.getFirst() * 127) % numPartitions;
            }
        }

        // Group by the first field of the key
        public static class GroupingComparator extends WritableComparator
        {
            protected GroupingComparator()
            {
                super(IntPair.class, true);
            }
            // Compare two WritableComparables by their first field only
            @Override
            public int compare(WritableComparable w1, WritableComparable w2)
            {
                IntPair ip1 = (IntPair) w1;
                IntPair ip2 = (IntPair) w2;
                int l = ip1.getFirst();
                int r = ip2.getFirst();
                return l == r ? 0 : (l < r ? -1 : 1);
            }
        }

        public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
        {
            private final IntPair intkey = new IntPair();
            private final IntWritable intvalue = new IntWritable();

            @Override
            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
            {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                int left = 0;  // goods_id
                int right = 0; // click_num
                if (tokenizer.hasMoreTokens())
                {
                    left = Integer.parseInt(tokenizer.nextToken());
                    if (tokenizer.hasMoreTokens())
                        right = Integer.parseInt(tokenizer.nextToken());
                    // Key is (click_num, goods_id); value is goods_id
                    intkey.set(right, left);
                    intvalue.set(left);
                    context.write(intkey, intvalue);
                }
            }
        }

        public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
        {
            private final Text left = new Text();
            private static final Text SEPARATOR = new Text("------------------------------------------------");

            @Override
            public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
            {
                // One separator line per group
                context.write(SEPARATOR, null);
                left.set(Integer.toString(key.getFirst()));
                System.out.println(left);
                for (IntWritable val : values)
                {
                    context.write(left, val);
                }
            }
        }

        public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
        {
            Configuration conf = new Configuration();
            // Job.getInstance replaces the deprecated new Job(conf, name) constructor
            Job job = Job.getInstance(conf, "secondarysort");
            job.setJarByClass(SecondarySort.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            // Partitioner class
            job.setPartitionerClass(FirstPartitioner.class);
            // Grouping comparator class
            job.setGroupingComparatorClass(GroupingComparator.class);
            // Map output types
            job.setMapOutputKeyClass(IntPair.class);
            job.setMapOutputValueClass(IntWritable.class);
            // Reduce output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output formats
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            // Input and output paths are hard-coded for this experiment
            String[] otherArgs = new String[2];
            otherArgs[0] = "hdfs://localhost:9000/mymapreduce8/in/goods_visit2";
            otherArgs[1] = "hdfs://localhost:9000/mymapreduce8/out";
            FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
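8. In the SecondarySort class file, right-click and choose Run As => Run on Hadoop.

9. When the job finishes, open a terminal and view the results on HDFS under the output path specified in the Java code.
    hadoop fs -ls /mymapreduce8/out
    hadoop fs -cat /mymapreduce8/out/part-r-00000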
Original article: https://www.cnblogs.com/aishanyishi/p/10304854.html