Requirements:
- Count how many times each word appears in each file, listed in descending order
- Output the word that appears most often across all files
Test files:
Any few articles found online will do.
Three files are used here:
news1:
don’t know what I do now is right, those are wrong, and when I finally Laosi when I know these.
So I can do now is to try to do well in everything, and then wait to die a natural death.Sometimes
I can be very happy to talk to everyone, can be very presumptuous, but no one knows, it is but very
deliberatelycamouflage, camouflage; I can make him very happy very happy,
but couldn’t find the source of happiness, just giggle.
news2:
If not to the sun for smiling, warm is still in the sun there, but wewill laugh more confident calm;
if turned to found his own shadow, appropriate escape, the sun will be through the heart,warm each place
behind the corner; if an outstretched palm cannot fall butterfly, then clenched waving arms, given power;
if I can’t have bright smile, it will face to the sunshine, and sunshine smile together, in full bloom.
news3:
Time is like a river, the left bank is unable to forget the memories, right is
worth grasp the youth, the middle of the fast flowing, is the sad young faint.
There are many good things, buttruly belong to own but not much. See the
courthouse blossom,honor or disgrace not Jing, hope heaven Yunjuanyunshu,
has no intention to stay. In this round the world, all can learn to use a
normal heart to treat all around, is also a kind of realm!
Test code:
Put all of the files in the same folder (I used word_test). Note: keep other files out of that folder so they don't interfere with the job.
package word_count;
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class MRMapper extends Mapper<LongWritable, Text, WordBean, NullWritable> {
public static HashMap<String, Integer> words1 = new HashMap<String, Integer>(); // key: "filename-word", value: count
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// recover "newsN.txt" from the split's path string; this assumes every input file name matches that pattern
String filename = context.getInputSplit().toString();
filename = filename.substring(filename.indexOf("news"), filename.indexOf(".txt")) + ".txt";
String[] split = value.toString().toLowerCase().replaceAll("n't", " not").trim().replaceAll("\\W", " ").replaceAll("\\s+", " ").split(" ");
for (String s : split) {
// build the "filename-word" key
String file_word = filename + "-" + s;
words1.put(file_word, words1.getOrDefault(file_word, 0) + 1);
}
}
protected void cleanup(Context context) throws IOException, InterruptedException {
for (HashMap.Entry<String, Integer> m : words1.entrySet()) {
String[] split = m.getKey().split("-");
context.write(new WordBean(split[0], split[1], m.getValue()), NullWritable.get());
}
words1.clear(); // cleanup() runs once the map task finishes its file, so clear the map after writing it out to avoid emitting duplicates
}
}
public static class MRReducer extends Reducer<WordBean, NullWritable, Text, NullWritable> {
public static HashMap<String, Integer> words = new HashMap<String, Integer>();
public static HashMap<String, String> file = new HashMap<String, String>();
public static String word_filename = "";
public static int max_word_num = 0;
protected void reduce(WordBean bean, Iterable<NullWritable> values, Context context)
throws IOException, InterruptedException {
Integer current = words.get(bean.getFilename());
if (current == null || bean.getNum() > current) {
words.put(bean.getFilename(), bean.getNum());
file.put(bean.getFilename(), bean.getWord() + "-" + bean.getNum());
}
if (bean.getNum() > max_word_num) {
word_filename = bean.getFilename();
max_word_num = bean.getNum();
}
context.write(new Text(bean.toString()), NullWritable.get());
}
protected void cleanup(Context context) throws IOException, InterruptedException {
for (HashMap.Entry<String, String> m : file.entrySet()) {
String str = "Most frequent word in " + m.getKey() + ": " + m.getValue().replaceAll("-", ", appearing ") + " times";
context.write(new Text(str), NullWritable.get());
}
String str = "Most frequent word across all files: " + file.get(word_filename).replaceAll("-", ", found in " + word_filename + ", appearing ") + " times";
context.write(new Text(str), NullWritable.get());
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://hadoop5:9000"); // fs.default.name is the deprecated spelling of this key
Job job = Job.getInstance(conf, "Word sort");
job.setJarByClass(WordCount.class);
job.setMapperClass(MRMapper.class);
job.setReducerClass(MRReducer.class);
job.setMapOutputKeyClass(WordBean.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path("/input/word_test/"));
FileOutputFormat.setOutputPath(job, new Path("/output/put1"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
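As an aside, the mapper's normalization chain can be tried on its own. This is a plain-JDK sketch (no Hadoop needed); the class name and sample sentence are made up for illustration:

```java
// Plain-JDK sketch of the mapper's cleanup chain: lowercase, expand "n't",
// strip punctuation, collapse whitespace, split into words.
public class NormalizeDemo {
    static String[] tokenize(String line) {
        return line.toLowerCase()
                .replaceAll("n't", " not")  // "don't" -> "do not"
                .trim()
                .replaceAll("\\W", " ")     // every non-word character becomes a space
                .replaceAll("\\s+", " ")    // collapse runs of spaces
                .split(" ");
    }
    public static void main(String[] args) {
        System.out.println(String.join(",", tokenize("Don't wait, do it now!")));
        // do,not,wait,do,it,now
    }
}
```

Note that the "n't" expansion runs before punctuation is stripped, which is why "can't" becomes "ca not" and the results below contain a standalone "ca" (assuming the source files use straight apostrophes).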
There are many ways to sort; a fairly simple, common one is used here:
a Java bean implementing the WritableComparable interface to define a custom ordering.
Note: the Writable half of the interface handles serialization, and the Comparable half's compareTo method defines the sort order.
package word_count;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class WordBean implements WritableComparable<WordBean> {
private String filename;
private String word;
private Integer num;
public void readFields(DataInput in) throws IOException {
this.filename = in.readUTF();
this.word = in.readUTF();
this.num = in.readInt();
}
public void write(DataOutput out) throws IOException {
out.writeUTF(filename);
out.writeUTF(word);
out.writeInt(num);
}
public int compareTo(WordBean o) {
// group by filename first (comparing the strings directly is safer than comparing hashCodes),
// then sort counts in descending order
int cmp = filename.compareTo(o.getFilename());
if (cmp != 0) {
return cmp;
}
if (!num.equals(o.getNum())) {
return Integer.compare(o.getNum(), num);
}
// tiebreaker so that distinct words with equal counts are not merged into one reduce group
return word.compareTo(o.getWord());
}
public String toString() {
return filename + " " + word + " " + num;
}
public WordBean() {
super();
}
public WordBean(String filename, String word, Integer num) {
this.filename = filename;
this.word = word;
this.num = num;
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public Integer getNum() {
return num;
}
public void setNum(Integer num) {
this.num = num;
}
}
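The ordering that compareTo encodes can also be sketched with a plain Comparator, so it can be checked without Hadoop on the classpath (the class name and sample rows are made up):

```java
import java.util.*;

// Sorts "filename word count" rows the same way WordBean.compareTo does:
// filename ascending, then count descending.
public class SortDemo {
    static String[] sorted(String[] rows) {
        String[] out = rows.clone();
        Arrays.sort(out, Comparator
                .comparing((String r) -> r.split(" ")[0])            // filename ascending
                .thenComparing(r -> Integer.parseInt(r.split(" ")[2]),
                               Comparator.reverseOrder()));          // count descending
        return out;
    }
    public static void main(String[] args) {
        String[] rows = {"news2.txt the 6", "news1.txt to 5", "news1.txt i 6"};
        for (String r : sorted(rows)) System.out.println(r);
        // news1.txt i 6
        // news1.txt to 5
        // news2.txt the 6
    }
}
```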
Test results:
news1.txt i 6
news1.txt very 5
news1.txt to 5
news1.txt can 4
news1.txt do 4
news1.txt happy 3
news1.txt but 3
news1.txt is 3
news1.txt now 2
news1.txt when 2
news1.txt know 2
news1.txt be 2
news1.txt and 2
news1.txt not 2
news1.txt a 1
news1.txt then 1
news1.txt the 1
news1.txt him 1
news1.txt well 1
news1.txt no 1
news1.txt talk 1
news1.txt laosi 1
news1.txt just 1
news1.txt knows 1
news1.txt of 1
news1.txt right 1
news1.txt what 1
news1.txt everything 1
news1.txt make 1
news1.txt happiness 1
news1.txt it 1
news1.txt those 1
news1.txt die 1
news1.txt wait 1
news1.txt so 1
news1.txt find 1
news1.txt sometimes 1
news1.txt death 1
news1.txt deliberatelycamouflage 1
news1.txt are 1
news1.txt source 1
news1.txt could 1
news1.txt natural 1
news1.txt in 1
news1.txt giggle 1
news1.txt one 1
news1.txt camouflage 1
news1.txt finally 1
news1.txt wrong 1
news1.txt everyone 1
news1.txt these 1
news1.txt presumptuous 1
news1.txt try 1
news2.txt the 6
news2.txt if 4
news2.txt to 3
news2.txt sun 3
news2.txt smile 2
news2.txt sunshine 2
news2.txt will 2
news2.txt in 2
news2.txt warm 2
news2.txt not 2
news2.txt and 1
news2.txt be 1
news2.txt there 1
news2.txt it 1
news2.txt bloom 1
news2.txt heart 1
news2.txt escape 1
news2.txt through 1
news2.txt but 1
news2.txt calm 1
news2.txt have 1
news2.txt butterfly 1
news2.txt is 1
news2.txt cannot 1
news2.txt waving 1
news2.txt own 1
news2.txt an 1
news2.txt found 1
news2.txt ca 1
news2.txt corner 1
news2.txt face 1
news2.txt more 1
news2.txt laugh 1
news2.txt for 1
news2.txt arms 1
news2.txt then 1
news2.txt confident 1
news2.txt clenched 1
news2.txt wewill 1
news2.txt power 1
news2.txt i 1
news2.txt shadow 1
news2.txt full 1
news2.txt turned 1
news2.txt place 1
news2.txt together 1
news2.txt given 1
news2.txt behind 1
news2.txt his 1
news2.txt each 1
news2.txt still 1
news2.txt bright 1
news2.txt appropriate 1
news2.txt palm 1
news2.txt fall 1
news2.txt smiling 1
news2.txt outstretched 1
news3.txt the 8
news3.txt to 5
news3.txt is 5
news3.txt a 3
news3.txt of 2
news3.txt all 2
news3.txt not 2
news3.txt bank 1
news3.txt belong 1
news3.txt disgrace 1
news3.txt there 1
news3.txt no 1
news3.txt has 1
news3.txt faint 1
news3.txt courthouse 1
news3.txt but 1
news3.txt flowing 1
news3.txt yunjuanyunshu 1
news3.txt treat 1
news3.txt also 1
news3.txt normal 1
news3.txt stay 1
news3.txt youth 1
news3.txt kind 1
news3.txt much 1
news3.txt intention 1
news3.txt unable 1
news3.txt around 1
news3.txt fast 1
news3.txt heart 1
news3.txt right 1
news3.txt honor 1
news3.txt jing 1
news3.txt things 1
news3.txt world 1
news3.txt many 1
news3.txt worth 1
news3.txt memories 1
news3.txt heaven 1
news3.txt forget 1
news3.txt hope 1
news3.txt time 1
news3.txt realm 1
news3.txt use 1
news3.txt round 1
news3.txt good 1
news3.txt grasp 1
news3.txt own 1
news3.txt river 1
news3.txt or 1
news3.txt can 1
news3.txt this 1
news3.txt sad 1
news3.txt see 1
news3.txt left 1
news3.txt are 1
news3.txt blossom 1
news3.txt like 1
news3.txt buttruly 1
news3.txt young 1
news3.txt learn 1
news3.txt middle 1
news3.txt in 1
Most frequent word in news3.txt: the, appearing 8 times
Most frequent word in news2.txt: the, appearing 6 times
Most frequent word in news1.txt: i, appearing 6 times
Most frequent word across all files: the, found in news3.txt, appearing 8 times
The test results above are provided for reference.
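For completeness, the per-file and overall maximum bookkeeping that the reducer performs can be sketched in plain Java (the class name and sample rows are made up; no Hadoop required):

```java
import java.util.*;

// Because rows arrive already sorted (filename ascending, count descending),
// the first row seen for each file is that file's most frequent word, which
// is why the reducer only has to remember the first/largest entry per file.
public class MaxDemo {
    static String summarize(String[] rows) {
        Map<String, String> perFile = new LinkedHashMap<>();
        String bestFile = "";
        int best = 0;
        for (String row : rows) {
            String[] p = row.split(" ");               // filename, word, count
            perFile.putIfAbsent(p[0], p[1] + " x" + p[2]); // first row per file wins
            int n = Integer.parseInt(p[2]);
            if (n > best) { best = n; bestFile = p[0]; }   // strict '>' keeps the first file on ties
        }
        return perFile + " | overall: " + bestFile;
    }
    public static void main(String[] args) {
        String[] rows = {"news1.txt i 6", "news1.txt to 5", "news2.txt the 6"};
        System.out.println(summarize(rows));
        // {news1.txt=i x6, news2.txt=the x6} | overall: news1.txt
    }
}
```

The strict greater-than comparison mirrors the reducer above: on a tie across files, the file whose rows arrive first keeps the overall maximum.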