Chinese Word Segmentation and Word Frequency Counting in Practice on Hadoop
First, a recommended reference: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece there, which ranks the popularity of character names in a wuxia novel, is great fun, so this post follows its lead and tries the same kind of experiment.
The differences from that post are:
0) That post uses Hadoop Streaming; here the Java MapReduce API is used instead.
1) A different Chinese word segmentation approach: IKAnalyzer is used here, with its homepage at http://code.google.com/p/ik-analyzer/ (a minimal standalone sketch of its API follows this list).
2) The corpus here is 《射雕英雄传》 (The Legend of the Condor Heroes). Haha, some variety is always welcome.
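Before wiring the analyzer into MapReduce, it can help to confirm that the IKAnalyzer jar segments text as expected. Below is a minimal, hypothetical smoke test (the class name and the sample sentence are made up for illustration); it relies only on the IKSegmenter/Lexeme API that also appears in the Mapper further down, assuming the IK Analyzer build whose core classes live in org.wltea.analyzer.core.

    import java.io.IOException;
    import java.io.StringReader;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    // Hypothetical standalone check that the IKAnalyzer jar is on the classpath
    // and segments Chinese text; it uses the same API as the Mapper below.
    public class IKSmokeTest {
        public static void main(String[] args) throws IOException {
            String sample = "郭靖和黄蓉在桃花岛上比武。"; // any short Chinese sentence will do
            IKSegmenter iks = new IKSegmenter(new StringReader(sample), true); // true = smart mode
            Lexeme lexeme;
            while ((lexeme = iks.next()) != null) {
                System.out.println(lexeme.getLexemeText());
            }
        }
    }

If the tokens print out sensibly, the jar is ready to be bundled with the job.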
0) Start from the stock WordCount example source and modify its Map step so that each input line is tokenized with IKAnalyzer.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Text reuses its internal buffer, so read only the first getLength() bytes,
            // and decode them as UTF-8 (the encoding Text uses).
            byte[] bt = value.getBytes();
            InputStream ip = new ByteArrayInputStream(bt, 0, value.getLength());
            Reader read = new InputStreamReader(ip, "UTF-8");
            // Segment the line with IKAnalyzer (true = smart mode) and emit <token, 1>.
            IKSegmenter iks = new IKSegmenter(read, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum up the counts for each token.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
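A note on the design: reusing IntSumReducer as the combiner is safe here because summing integer counts is associative and commutative, so map-side partial sums do not change the final totals; they only cut down the amount of data shuffled to the reducers.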
1) So, that is it; the run in the local plugin-simulated environment is OK. Package it up (bundling the analyzer jar) and throw it onto the cluster.
hadoop fs -put chinese_in.txt chinese_in.txt      # upload the novel text to HDFS
hadoop jar WordCount.jar chinese_in.txt out0      # run the job
...mapping reducing...
hadoop fs -ls ./out0                              # inspect the output directory
hadoop fs -get ./out0/part-r-00000 words.txt      # fetch the result file locally
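How the IKAnalyzer jar reaches the task nodes is not spelled out above; two common options (assumptions here, not something the original commands show) are to place the jar in a lib/ directory inside the job jar before packaging, or to pass it with the -libjars option, which the GenericOptionsParser used in main() already understands.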
2) Post-processing the data:
2.1) Sorting the data
head words.txt
tail words.txt
sort -k2 words.txt > 0.txt       # sort on the count column, but as plain text
head 0.txt
tail 0.txt
sort -k2r words.txt > 0.txt      # reverse the order, still a text sort
head 0.txt
tail 0.txt
sort -k2rn words.txt > 0.txt     # numeric sort, descending: the version we actually want
head -n 50 0.txt
2.2) Extracting the targets
awk '{if(length($1)>=2) print $0}' 0.txt > 1.txt   # keep only terms of at least two characters; single characters are mostly noise
2.3) Presenting the results
head -n 50 1.txt | sed = | sed 'N;s/\n//'   # sed = prints a rank number before each line; the second sed joins it onto that line
1郭靖 6427
2黄蓉 4621
3欧阳 1660
4甚么 1430
5说道 1287
6洪七公 1225
7笑道 1214
8自己 1193
9一个 1160
10师父 1080
11黄药师 1059
12心中 1046
13两人 1016
14武功 950
15咱们 925
16一声 912
17只见 827
18他们 782
19心想 780
20周伯通 771
21功夫 758
22不知 755
23欧阳克 752
24听得 741
25丘处机 732
26当下 668
27爹爹 664
28只是 657
29知道 654
30这时 639
31之中 621
32梅超风 586
33身子 552
34都是 540
35不是 534
36如此 531
37柯镇恶 528
38到了 523
39不敢 522
40裘千仞 521
41杨康 520
42你们 509
43这一 495
44却是 478
45众人 476
46二人 475
47铁木真 469
48怎么 464
49左手 452
50地下 448
Among the non-name entries there are quite a few interesting ones, e.g. #5 说道 ("said"), #7 笑道 ("said with a laugh"), #12 心中 ("in the heart"), #17 只见 ("saw only"), #22 不知 ("did not know"), #30 这时 ("at this moment") and #49 左手 ("left hand").