• hadoop MapReduce: ChainMapper


    Preface

    This chapter looks at ChainMapper, a class for writing MapReduce programs in the hadoop ecosystem. MapReduce is the default framework for processing data on a hadoop cluster. Any dataset inevitably contains some malformed records that we need to discard; this step is called data preprocessing. It therefore pays to separate the preprocessing module from the data-processing logic, so that the preprocessing module can be reused later. A common pattern for such a mapper is:

    • Discard corrupted or otherwise unusable records
    • Process the valid records and extract the fields of interest
    • Output the data of interest derived from those fields

    Preparation

    Dataset: ufo, 60,000 records. The dataset consists of UFO sighting reports containing the fields listed below; the fields of each record are separated by tab characters. The file is named ufo.tsv; no download link is provided here.

    • Sighting date: when the UFO sighting occurred
    • Recorded date: when the sighting was reported
    • Location: where the sighting took place
    • Shape: shape of the UFO
    • Duration: how long the sighting lasted
    • Description: a rough description of the sighting

    Example: 

    19950915 19950915 Redmond, WA 6 min. Young man w/ 2 co-workers witness tiny, distinctly white round disc drifting slowly toward NE. Flew in dir. 90 deg. to winds.

    ChainMapper overview

    Fully qualified name: org.apache.hadoop.mapred.lib.ChainMapper 

    Purpose: runs several mappers in sequence within a single map task; the output of the last mapper in the chain is passed on to the reducer.
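
    The key constraint when chaining is that each mapper's output key/value types must match the next mapper's input types, and the last mapper's output types become the reducer's input types. Below is a minimal sketch of the wiring in the old mapred API (ChainSketch, Mapper1, and Mapper2 are hypothetical placeholders; the real job for this post appears further down):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    // Hypothetical sketch: Mapper1 maps (LongWritable, Text) -> (LongWritable, Text),
    // Mapper2 maps (LongWritable, Text) -> (Text, LongWritable).
    public class ChainSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ChainSketch.class);
            ChainMapper.addMapper(conf, Mapper1.class,
                    LongWritable.class, Text.class,    // Mapper1 input types
                    LongWritable.class, Text.class,    // Mapper1 output types
                    true, new JobConf(false));         // true = pass key/values by value
            ChainMapper.addMapper(conf, Mapper2.class,
                    LongWritable.class, Text.class,    // must match Mapper1's output types
                    Text.class, LongWritable.class,    // becomes the reducer's input types
                    true, new JobConf(false));
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }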

    Using ChainMapper

    Task: use the ChainMapper class to check that the records in the dataset are valid, i.e., that each record can be split into exactly six strings (fields)

    • Upload ufo.tsv to hadoop
    hadoop dfs -put ufo.tsv ufo.tsv
    • Write UFORecordValidationMapper.java 
    import java.io.IOException;
    
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.*;
    
    public class UFORecordValidationMapper extends MapReduceBase implements Mapper<LongWritable, Text, LongWritable, Text> {
        // Pass through only those records that split into exactly six tab-separated fields.
        public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            if(validate(line)) {
                output.collect(key, value);
            }
        }
        
        // A record is valid when it contains exactly six tab-separated fields.
        private boolean validate(String str) {
            String[] parts = str.split("\t");
            return parts.length == 6;
        }
    }
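
    Before wiring this mapper into a job, it can help to check the six-field rule locally. The sketch below is standalone (ValidateCheck and both record strings are hypothetical tab-separated examples, not taken from the real file). Note that Java's String.split drops trailing empty strings, so a record whose final field is empty is also rejected:

    // Standalone sanity check of the validation logic (not part of the job).
    public class ValidateCheck {
        public static void main(String[] args) {
            // Hypothetical records: six tab-separated fields vs. only three.
            String good = "19950915\t19950915\tRedmond, WA\tdisc\t6 min.\tWhite round disc drifting NE.";
            String bad = "19950915\t19950915\tRedmond, WA";
            System.out.println(good.split("\t").length == 6);   // prints true
            System.out.println(bad.split("\t").length == 6);    // prints false
        }
    }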
    • Write UFOLocation.java 
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.regex.*;
    
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.*;
    
    public class UFOLocation {
        public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
            private final static LongWritable one = new LongWritable(1);
            // Matches the last run of exactly two letters (the state abbreviation),
            // optionally followed by trailing punctuation or whitespace.
            private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
    
            public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                String[] fields = line.split("\t");
                String location = fields[2].trim();   // the third field is the sighting location
                if(location.length() >= 2) {
                    Matcher matcher = locationPattern.matcher(location);
                    if(matcher.find()) {
                        int start = matcher.start();
                        String state = location.substring(start, start + 2);
                        // Emit (STATE, 1); LongSumReducer adds up the counts per state.
                        output.collect(new Text(state.toUpperCase()), one);
                    }
                }
            }
        }
    
        public static void main(String... args) throws Exception {
            Configuration config = new Configuration();
            JobConf conf = new JobConf(config, UFOLocation.class);
            conf.setJobName("UFOLocation");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
    
            // First link in the chain: drop records that do not have six fields.
            JobConf mapconf1 = new JobConf(false);
            ChainMapper.addMapper(conf, UFORecordValidationMapper.class, LongWritable.class, Text.class, LongWritable.class, Text.class, true, mapconf1);
            // Second link: extract the state abbreviation and emit (state, 1).
            // Its input types must match the first mapper's output types.
            JobConf mapconf2 = new JobConf(false);
            ChainMapper.addMapper(conf, MapClass.class, LongWritable.class, Text.class, Text.class, LongWritable.class, true, mapconf2);
            conf.setMapperClass(ChainMapper.class);
            conf.setCombinerClass(LongSumReducer.class);
            conf.setReducerClass(LongSumReducer.class);
    
            FileInputFormat.setInputPaths(conf, args[0]);
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }
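
    The regex in MapClass is anchored at the end of the location string: it grabs the last run of exactly two letters, allowing trailing punctuation after them, which is where U.S. records typically carry the state abbreviation. A standalone check of that behavior (LocationRegexCheck is a hypothetical helper, separate from the job):

    import java.util.regex.*;

    // Standalone check of the state-extraction regex used in MapClass.
    public class LocationRegexCheck {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
            // "Redmond, WA." shows why the [^a-zA-Z]* tail is needed:
            // it lets punctuation follow the abbreviation.
            for (String location : new String[] { "Redmond, WA", "Redmond, WA." }) {
                Matcher m = p.matcher(location);
                if (m.find()) {
                    int start = m.start();
                    System.out.println(location.substring(start, start + 2).toUpperCase()); // prints WA
                }
            }
        }
    }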
    • Compile the two files (the Hadoop jars must be on the classpath)
    javac -classpath $(hadoop classpath) UFORecordValidationMapper.java UFOLocation.java
    • Package the compiled class files into a jar
    jar cvf ufo.jar UFO*.class
    • Submit the jar to hadoop and run the job
    hadoop jar ufo.jar UFOLocation ufo.tsv output
    • Fetch the result from hadoop to the local machine
    hadoop dfs -get output/part-00000 ufo_result.txt
    • View the result
    more ufo_result.txt
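    Each output line should hold a two-letter state abbreviation and its sighting count separated by a tab, since the job uses the default TextOutputFormat, which writes the key, then a tab, then the value.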