I originally just wanted to practice on the Sogou data, but I ended up stumbling into the MapReduce topK problem. After a few twists and turns it now looks quite simple, but I learned a fair amount along the way.
The data is the search log available from Sogou Labs; the format looks roughly like this:
```
00:00:00 2982199073774412 [360安全卫士] 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00 07594220010824798 [哄抢救灾物资] 1 1 news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00 5228056822071097 [75810部队] 14 5 www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00 6140463203615646 [绳艺] 62 36 www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00 8561366108033201 [汶川地震原因] 3 2 www.big38.net/
00:00:00 23908140386148713 [莫衷一是的意思] 1 2 www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00 1797943298449139 [星梦缘全集在线观看] 8 5 www.6wei.net/dianshiju/????xa1xe9|????do=index
00:00:00 00717725924582846 [闪字吧] 1 2 www.shanziba.com/
```
I only need the search terms and ignore the rest; MapReduce then computes the top N most-searched terms (N is user-defined).
The overall project structure is simple: two classes, SEA.java and TopK.java, in the package org.admln.topK.
First, a class that extracts the search term from a record according to the log format.
SEA.java
```java
package org.admln.topK;

/**
 * @author admln
 */
public class SEA {

    private String seaWord;
    private boolean isValid;

    // Parse one log record; the query is the third space-separated field,
    // wrapped in square brackets.
    public static SEA parser(String line) {
        SEA sea = new SEA();
        String str = line.split(" ")[2];
        if (str.length() < 3) {
            sea.setValid(false);
        } else {
            sea.setValid(true);
            // strip the surrounding "[" and "]"
            sea.setSeaWord(str.substring(1, str.length() - 1));
        }
        return sea;
    }

    public String getSeaWord() {
        return seaWord;
    }

    public void setSeaWord(String seaWord) {
        this.seaWord = seaWord;
    }

    public boolean isValid() {
        return isValid;
    }

    public void setValid(boolean isValid) {
        this.isValid = isValid;
    }
}
```
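As a quick sanity check (a throwaway snippet of my own, not part of the project), feeding one record from the log above through SEA.parser() should pull out just the bracketed query:

```java
public class SEACheck {
    public static void main(String[] args) {
        // the query is the third space-separated field, wrapped in [ ]
        String line = "00:00:00 2982199073774412 [360安全卫士] 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html";
        SEA sea = SEA.parser(line);
        System.out.println(sea.isValid());    // true
        System.out.println(sea.getSeaWord()); // 360安全卫士
    }
}
```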
Then comes the MapReduce job itself.
```java
package org.admln.topK;

import java.io.IOException;
import java.util.Collections;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class TopK {

    public static class topKMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit <query, 1> for every valid log record
            SEA sea = SEA.parser(value.toString());
            if (sea.isValid()) {
                word.set(sea.getSeaWord());
                context.write(word, ONE);
            }
        }
    }

    public static class topKReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        int sum;
        int max;
        // Reverse-ordered TreeMap: firstKey() is the largest count, lastKey() the smallest.
        // Note that the key is the count, so two queries with identical counts overwrite each other.
        private static TreeMap<Integer, String> tree =
                new TreeMap<Integer, String>(Collections.reverseOrder());

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
            sum = 0;
            max = context.getConfiguration().getInt("topK", 10);
            for (IntWritable val : values) {
                sum += val.get();
            }
            tree.put(Integer.valueOf(sum), key.toString());
            // keep only the top max entries by dropping the smallest count
            if (tree.size() > max) {
                tree.remove(tree.lastKey());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // cleanup() runs once after all reduce() calls, so by now the tree holds the top K
            Set<Entry<Integer, String>> set = tree.entrySet();
            for (Entry<Integer, String> entry : set) {
                context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:8020/input/topK/");
        Path output = new Path("hdfs://hadoop:8020/output/topK/");

        Configuration conf = new Configuration();

        // K is passed on the command line and read back in the reducer via the Configuration
        conf.setInt("topK", Integer.valueOf(args[1]));

        Job job = new Job(conf, "topK");

        job.setJarByClass(TopK.class);

        job.setMapperClass(topKMapper.class);
        job.setReducerClass(topKReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
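One thing the reducer quietly relies on is that every key reaches the same reduce task, because each task keeps its own TreeMap; with more than one reducer you would only get per-partition top-K lists. A minimal safeguard (my own addition, not in the original code) is to pin the reducer count to one in main():

```java
// Force a single reducer so one TreeMap sees the total count of every query;
// with multiple reducers each would emit only its own local top K.
job.setNumReduceTasks(1);
```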
Then upload the data (note that the file has to be converted from GB2312 to UTF-8 first, because Hadoop works entirely in UTF-8; without the conversion the Chinese in the final results comes out garbled).
Debug locally or upload the jar to the Hadoop cluster and run it there.
Environment: CentOS 6.4, Hadoop 2.2.0, JDK 1.7.
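If you don't want to convert the file by hand, a small standalone snippet like the following can re-encode it before upload (a sketch of my own, assuming the source really is GB2312; the file names are placeholders):

```java
import java.io.*;
import java.nio.charset.Charset;

// Re-encode the Sogou log from GB2312 to UTF-8 before putting it on HDFS,
// so the Chinese queries are not garbled in the job output.
public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("sogou.gb2312.txt"), Charset.forName("GB2312")));
        PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("sogou.utf8.txt"), Charset.forName("UTF-8")));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);
        }
        in.close();
        out.close();
    }
}
```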
The result of the run:
Key takeaways:
1. TreeMap: although it's plain Java knowledge, it was a good refresher (see the sketch after this list);
2. cleanup(): you need to know when this overridden method actually runs, namely once, after all the reduce() calls have finished.
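To tie both points together, here is a tiny standalone sketch of my own (outside the MR job) showing how a reverse-ordered TreeMap keeps only the K largest counts; this is exactly what the reducer does before cleanup() finally writes the survivors out:

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Keep only the 3 largest counts: with Collections.reverseOrder(),
// firstKey() is the biggest and lastKey() the smallest, so removing
// lastKey() whenever the map grows past K leaves the top K behind.
public class TreeMapTopK {
    public static void main(String[] args) {
        TreeMap<Integer, String> tree = new TreeMap<Integer, String>(Collections.reverseOrder());
        int k = 3;
        int[] counts = {5, 12, 7, 3, 9};
        String[] words = {"a", "b", "c", "d", "e"};
        for (int i = 0; i < counts.length; i++) {
            tree.put(counts[i], words[i]);
            if (tree.size() > k) {
                tree.remove(tree.lastKey());
            }
        }
        for (Map.Entry<Integer, String> e : tree.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey()); // b 12, e 9, c 7
        }
    }
}
```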
Source code: http://pan.baidu.com/s/1i3y0rwL