II. Writing a MapReduce Program to Clean the Letter Content Data

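Overview of Data Cleaning

Data cleaning is the process of re-examining and validating data in order to remove duplicate information, correct existing errors, and ensure data consistency.

As the name suggests, data cleaning means "washing away" whatever is "dirty". It is the last procedure for finding and correcting identifiable errors in data files, and it includes checking data consistency and handling invalid and missing values. The data in a data warehouse is a collection of data oriented toward a particular subject. It is extracted from multiple business systems and includes historical data, so some of it is inevitably wrong and some of it conflicts with other data. Such erroneous or conflicting data is clearly unwanted and is called "dirty data". Washing out the dirty data according to defined rules is data cleaning. The task of data cleaning is to filter out data that does not meet the requirements and hand the filtered results to the responsible business department, which confirms whether the data should be discarded or corrected by the business unit and then extracted again. Data that does not meet the requirements falls into three main categories: incomplete data, erroneous data, and duplicate data. Unlike questionnaire review, cleaning of data after it has been entered is generally done by computer rather than by hand.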
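To make the three categories concrete, here is a minimal sketch in plain Java. The class DirtyDataFilter, the "name,age" row format, and the age range are all hypothetical and chosen only for illustration; they are not part of the letter data handled later in this section.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that sorts raw rows into clean rows and the three "dirty" categories. */
final class DirtyDataFilter {
    enum Kind { CLEAN, INCOMPLETE, ERRONEOUS, DUPLICATE }

    // Rows that have already been accepted, used to detect duplicates
    private final Set<String> seen = new HashSet<>();

    /** Rows are assumed to look like "name,age"; this format is an illustration only. */
    Kind classify(String row) {
        String[] parts = row.split(",", -1);
        // Incomplete: a column is missing or empty
        if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
            return Kind.INCOMPLETE;
        }
        // Erroneous: the age is not a number or lies outside an assumed plausible range
        try {
            int age = Integer.parseInt(parts[1].trim());
            if (age < 0 || age > 150) {
                return Kind.ERRONEOUS;
            }
        } catch (NumberFormatException e) {
            return Kind.ERRONEOUS;
        }
        // Duplicate: Set.add returns false when the row was already present
        if (!seen.add(row.trim())) {
            return Kind.DUPLICATE;
        }
        return Kind.CLEAN;
    }
}
```

In a real pipeline the rejected rows would be kept aside with their category, so that the business department can decide whether to drop or correct them.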

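Requirements Analysis

With the crawler we obtained the detail pages of the inquiries and complaints.

From each page we need to extract the information that is useful to us.

Whether a field is useful depends on the requirements: the fields collected here are exactly those that the later tasks will use.

In this section we use MapReduce to clean these web pages and obtain, from each page, the question type, title, sender, time, number of netizen comments, letter content, the agency that gave the official answer, the answer time, and the answer content.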
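The nine fields can be written down as an enum that fixes their order in the output line. This is only a sketch: the English constant names are assumptions introduced here for readability, while the strings are the variable names used in the map code later in this section.

```java
/** Output columns in the order the map task writes them (English names are assumptions). */
enum LetterField {
    TYPE("leixing"),           // question type
    TITLE("biaoti"),           // letter title
    SENDER("leixinren"),       // sender
    SENT_TIME("shijian"),      // time of the letter
    COMMENT_COUNT("number"),   // number of netizen comments
    CONTENT("problem"),        // letter content
    REPLY_AGENCY("offic"),     // agency that answered
    REPLY_TIME("officpt"),     // time of the answer
    REPLY_CONTENT("officp");   // content of the answer

    private final String variableName;

    LetterField(String variableName) {
        this.variableName = variableName;
    }

    /** Name of the corresponding local variable in the map method. */
    String variableName() {
        return variableName;
    }
}
```

The last three fields exist only for letters that have received an official answer, so an output line carries either six or nine fields.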

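Setting Up the Parsing Framework

1. Change to the /data/ directory and create a directory named edu2.
    cd /data/
    mkdir /data/edu2
2. Change to the /data/edu2 directory and use the wget command to download the lib package that the project depends on.
    cd /data/edu2
Unzip the pachongjar.zip archive.
    unzip pachongjar.zip
3. Open Eclipse and create a new Java Project.

Name the project qingxi2.

4. Right-click the project name and create a new directory named libs to hold the jar files the project depends on.

Copy all jar files from the /data/edu2/pachongjar directory into the libs directory of the project.

Select all jar files under libs, then click "Build Path" => "Add to Build Path".

5. Right-click src and click "New" => "Package" to create a new package.

Name the package my.mr.

Right-click the package name and click "New" => "Class".

Fill in the class name. This experiment needs three classes, named FileInput, FileRecordReader, and QingxiHtml.

This completes the framework for the cleaning process. Next we write the code that implements the functionality.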

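Writing the MapReduce Code

1. Run jps to check whether the Hadoop processes have been started.
    jps
If they are not running, start Hadoop.
    cd /apps/hadoop/sbin
    ./start-all.sh
2. Change to the /data/edu2 directory and use the wget command to download the crawled content of the citizens' letters to the Beijing municipal government.
    cd /data/edu2
    wget http://192.168.1.100:60000/allfiles/second/edu2/govhtml.tar.gz
Extract govhtml.tar.gz.
    tar xzvf govhtml.tar.gz
Create the directory /myedu2/in on HDFS and upload the data under /data/edu2/govhtml to it.
    hadoop fs -mkdir -p /myedu2/in
    hadoop fs -put /data/edu2/govhtml/* /myedu2/in
*You may also upload e-commerce review data that you crawled yourself to HDFS here.

3. (1) Open FileRecordReader and write the code that reads the page source. Its main purpose is to turn the entire source of one web page into a single line for MapReduce to read and analyze. MapReduce can then output the analysis result of one page as one line; that is, the fields extracted from each page form one line.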
package my.mr;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileRecordReader extends RecordReader<Text, Text> {

    private FileSplit fileSplit;
    private JobContext jobContext;
    // Key: file name; value: the whole page source as one line
    private Text currentKey = new Text();
    private Text currentValue = new Text();
    // Becomes true once the single record for this file has been produced
    private boolean finishConverting = false;

    @Override
    public void close() throws IOException {
        // Nothing to release here: the input stream is closed in nextKeyValue()
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return finishConverting ? 1 : 0;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.jobContext = context;
        String filename = fileSplit.getPath().getName();
        this.currentKey = new Text(filename);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!finishConverting) {
            Path file = fileSplit.getPath();
            // The FileSystem is a cached, shared instance, so it is not closed here
            FileSystem fs = file.getFileSystem(jobContext.getConfiguration());
            // Change the charset to match the actual page encoding, for example "gbk"
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader br = new BufferedReader(new InputStreamReader(in, "utf-8"))) {
                StringBuilder total = new StringBuilder();
                String line;
                // Join all lines of the page with a space so the page becomes one line
                while ((line = br.readLine()) != null) {
                    total.append(line).append(' ');
                }
                currentValue = new Text(total.toString());
            }
            finishConverting = true;
            return true;
        }
        return false;
    }
}
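The core of nextKeyValue(), folding a multi-line page into one line, can be tried without a Hadoop cluster. The following is a minimal sketch in plain Java; the class SingleLine and its method names are hypothetical and not part of the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-alone version of the line-folding loop in the record reader. */
final class SingleLine {

    /** Reads everything from the reader and appends one space after each line. */
    static String collapse(Reader source) throws IOException {
        StringBuilder total = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                total.append(line).append(' ');
            }
        }
        return total.toString();
    }

    /** Convenience wrapper for text that is already in memory. */
    static String collapseText(String html) {
        try {
            return collapse(new StringReader(html));
        } catch (IOException e) {
            // A StringReader is not expected to fail; rethrow unchecked just in case
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling SingleLine.collapseText on a page of several lines returns the same text with every line break replaced by a space, which is the shape of the value handed to the map task.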
(2) Open FileInput and write the code that makes the job use the methods overridden in FileRecordReader.
package my.mr;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FileInput extends FileInputFormat<Text, Text> {

    // The reader always reads a whole file, so a file must never be split;
    // otherwise a large page would be read once per split
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // One FileRecordReader per file: it returns the file as a single record
        return new FileRecordReader();
    }
}
(3) Open QingxiHtml and write the code. The requirement it implements is to parse the web pages with MapReduce and finally output formatted text files.

First, look at the general skeleton of a MapReduce program.
public class QingxiHtml {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
    }

    public static class doMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
        }
    }

    public static class doReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
        }
    }
}
Analysis shows that the Map task alone is enough to implement the required functionality here, so the Reduce task can be omitted.

4. The main function. The main function here also follows a general structure.

public static void main(String[] args) throws IOException,
        ClassNotFoundException, InterruptedException {
    Job job = Job.getInstance();
    job.setJobName("QingxiHtml");
    job.setJarByClass(QingxiHtml.class);
    job.setMapperClass(doMapper.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(FileInput.class);
    Path in = new Path("hdfs://localhost:9000/myedu2/in");
    Path out = new Path("hdfs://localhost:9000/myedu2/out/1");
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The general structure consists of these steps:

① Define the Job

② Set the Job parameters

③ Set the Map task

④ Set the Reduce task (not needed in this job, which only uses a Map task)

⑤ Define the output types of the job

⑥ Set the input and output directories of the job

⑦ Submit the job for execution

5. Now look at the Map task. To implement it, the class must extend org.apache.hadoop.mapreduce.Mapper and override its map method.

Through the FileInput class written above, the source of each web page is converted into a single line of input. The map task receives each line and parses the page source with the JXDocument class to obtain the fields in the page.

The fields are joined into one line with ' ' as the separator and finally written to HDFS with the context.write method.
@Override
protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String htmlStr = value.toString();
    JXDocument doc = new JXDocument(htmlStr);
    // Only pages that contain a letter title are parsed
    if (htmlStr.contains("mail_track_h2")) {
        try {
            // Question type
            String leixing = doc
                    .sel("//span[@class='font12 gray']/a[2]/text()")
                    .get(0).toString();
            // Title
            String biaoti = doc
                    .sel("//h2[@class='mail_track_h2']/text()")
                    .get(0).toString();
            // Sender
            String leixinren = doc
                    .sel("//p[@class='font12 gray time_mail']/span[1]/text()")
                    .get(0).toString().replaceAll("来信人：", "");
            // Time
            String shijian = doc
                    .sel("//p[@class='font12 gray time_mail']/span[2]/text()")
                    .get(0).toString().replaceAll("时间：", "");
            // Number of netizens asking the same question, or number of netizen ratings
            String number = doc
                    .sel("//p[@class='font12 gray time_mail']/span[3]/allText()")
                    .get(0).toString().replace("网友同问： ", "").replace("网友评价数： ", "");
            // Letter content
            String problem = doc
                    .sel("//span[@class='font14 mail_problem']/text()")
                    .get(0).toString();
            String dataout = leixing + " " + biaoti + " "
                    + leixinren + " " + shijian + " " + number
                    + " " + problem;
            // Pages with an official answer contain this style attribute
            if (htmlStr.contains("margin-bottom:31px")) {
                // Answering agency
                String offic = doc
                        .sel("//div[@class='con_left float_left']/div[2]/span[1]/text()")
                        .get(0).toString();
                // Answer time
                String officpt = doc
                        .sel("//div[@class='con_left float_left']/div[2]/span[2]/text()")
                        .get(0).toString();
                // Answer content
                String officp = doc
                        .sel("//div[@class='con_left float_left']/div[2]/p[1]/text()")
                        .get(0).toString();
                dataout = dataout + " " + offic + " " + officpt + " " + officp;
            }
            System.out.println(dataout);
            // The whole record is the key; the value is left empty
            context.write(new Text(dataout), new Text(""));
        } catch (XpathSyntaxErrorException e) {
            e.printStackTrace();
        }
    }
}
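Every field in the map method is taken with get(0), which throws an IndexOutOfBoundsException when an XPath expression matches nothing, for example on a page with a slightly different layout. The sketch below shows one possible way to guard against that and to keep the output line unambiguous. It assumes that sel returns a java.util.List, as the get(0) calls suggest; the class FieldCleaner and its methods are hypothetical helpers, and joining with a tab instead of the separator used in the listing is likewise an assumption, made because field text may itself contain spaces.

```java
import java.util.List;

/** Hypothetical helpers for turning XPath results into one delimited output line. */
final class FieldCleaner {

    /** Returns the first match as trimmed text, or "" when nothing was selected. */
    static String firstOrEmpty(List<?> matches) {
        if (matches == null || matches.isEmpty()) {
            return "";
        }
        return String.valueOf(matches.get(0)).trim();
    }

    /** Removes a leading label (such as the sender or time labels on the page) when present. */
    static String stripLabel(String value, String label) {
        return value.startsWith(label) ? value.substring(label.length()).trim() : value;
    }

    /** Joins fields with a tab after replacing embedded tabs and line breaks by a space. */
    static String toLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(fields.get(i).replaceAll("[\\t\\r\\n]+", " "));
        }
        return sb.toString();
    }
}
```

With helpers like these, a page that lacks one element would produce an empty column instead of failing the whole map call; whether that is preferable to skipping the page depends on how the cleaned data is used afterwards.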
The complete code is as follows.
package my.mr;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import cn.wanghaomiao.xpath.exception.XpathSyntaxErrorException;
import cn.wanghaomiao.xpath.model.JXDocument;

public class QingxiHtml {

    public static class doMapper extends Mapper<Object, Text, Text, Text> {

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String htmlStr = value.toString();
            JXDocument doc = new JXDocument(htmlStr);
            // Only pages that contain a letter title are parsed
            if (htmlStr.contains("mail_track_h2")) {
                try {
                    // Question type
                    String leixing = doc
                            .sel("//span[@class='font12 gray']/a[2]/text()")
                            .get(0).toString();
                    // Title
                    String biaoti = doc
                            .sel("//h2[@class='mail_track_h2']/text()")
                            .get(0).toString();
                    // Sender
                    String leixinren = doc
                            .sel("//p[@class='font12 gray time_mail']/span[1]/text()")
                            .get(0).toString().replaceAll("来信人：", "");
                    // Time
                    String shijian = doc
                            .sel("//p[@class='font12 gray time_mail']/span[2]/text()")
                            .get(0).toString().replaceAll("时间：", "");
                    // Number of netizens asking the same question, or number of netizen ratings
                    String number = doc
                            .sel("//p[@class='font12 gray time_mail']/span[3]/allText()")
                            .get(0).toString().replace("网友同问： ", "").replace("网友评价数： ", "");
                    // Letter content
                    String problem = doc
                            .sel("//span[@class='font14 mail_problem']/text()")
                            .get(0).toString();
                    String dataout = leixing + " " + biaoti + " "
                            + leixinren + " " + shijian + " " + number
                            + " " + problem;
                    // Pages with an official answer contain this style attribute
                    if (htmlStr.contains("margin-bottom:31px")) {
                        // Answering agency
                        String offic = doc
                                .sel("//div[@class='con_left float_left']/div[2]/span[1]/text()")
                                .get(0).toString();
                        // Answer time
                        String officpt = doc
                                .sel("//div[@class='con_left float_left']/div[2]/span[2]/text()")
                                .get(0).toString();
                        // Answer content
                        String officp = doc
                                .sel("//div[@class='con_left float_left']/div[2]/p[1]/text()")
                                .get(0).toString();
                        dataout = dataout + " " + offic + " " + officpt + " " + officp;
                    }
                    System.out.println(dataout);
                    // The whole record is the key; the value is left empty
                    context.write(new Text(dataout), new Text(""));
                } catch (XpathSyntaxErrorException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance();
        job.setJobName("QingxiHtml");
        job.setJarByClass(QingxiHtml.class);
        job.setMapperClass(doMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(FileInput.class);
        Path in = new Path("hdfs://localhost:9000/myedu2/in");
        Path out = new Path("hdfs://localhost:9000/myedu2/out/1");
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
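Running the Test

1. In the MapReduce class, right-click and choose Run As => Run on Hadoop to submit the job to Hadoop for execution.

2. Wait for the job to finish. Change to the /data/edu2/ directory and, on the command line, check whether there is output under /myedu2/out on HDFS.
    cd /data/edu2/
    hadoop fs -ls -R /myedu2/out
If there is output, download it from HDFS to the local Linux file system.
    hadoop fs -get /myedu2/out/1/*
View the downloaded files with vim or cat. The structure of the content is fairly clear.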
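The downloaded lines can also be checked programmatically. The following is a rough sketch under one important assumption: the fields are separated by tabs. If the job joins them with a different delimiter, the split pattern has to be changed accordingly. The class OutputLineParser and the English field names are hypothetical. Because the job writes an empty Text as the value, the default text output puts a trailing tab at the end of each line, which the sketch strips first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for one line of the job output (tab-separated fields assumed). */
final class OutputLineParser {
    // Assumed English names for the nine columns, in output order
    private static final String[] NAMES = {
        "type", "title", "sender", "time", "count", "content",
        "replyAgency", "replyTime", "replyContent"
    };

    /** Returns the fields by name: 6 entries without an official answer, 9 with one. */
    static Map<String, String> parse(String line) {
        // Drop the trailing tab that precedes the empty output value
        String trimmed = line.replaceAll("\t+$", "");
        String[] parts = trimmed.split("\t", -1);
        if (parts.length < 6) {
            throw new IllegalArgumentException(
                    "expected at least 6 fields, got " + parts.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length && i < parts.length; i++) {
            record.put(NAMES[i], parts[i].trim());
        }
        return record;
    }
}
```

Lines that raise the exception are candidates for a second look, since they point to pages where the extraction did not produce the expected columns.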

3. If no output appears on HDFS, use the logs to troubleshoot. Copy the file /apps/hadoop/etc/hadoop/log4j.properties into the root directory of the MapReduce project.

The execution output then appears in the Eclipse console.
Original article: https://www.cnblogs.com/gkl20173667/p/12297418.html