• MapReduce: an Example of Chaining Multiple Jobs



    Requirement

    Given three files that each contain words, count how many times every word appears in each file.

    Input data
    (screenshot of the input files omitted)
    Expected output
    e.g. atguigu c.txt-->2 b.txt-->2 a.txt-->3

    Analysis

    When a single MR job cannot satisfy a requirement, split the requirement into several jobs and run them in dependency order.

    Job1
    Mapper: by default, one MapTask processes exactly one input split, and with the default split strategy a split belongs to exactly one file, so the current file name can be recorded once in setup().

    • keyin-valuein: atguigu pingping
    • keyout-valueout: atguigu-a.txt,1

    Reducer

    • keyin-valuein: atguigu-a.txt,1 (the mapper's output becomes the reducer's input)
    • keyout-valueout: atguigu-a.txt,3
      pingping-a.txt,2
      atguigu-b.txt,3
      pingping-b.txt,2

    Job2
    Mapper: again, one MapTask processes one split, and a split belongs to exactly one file.

    • keyin-valuein: pingping,a.txt-2 (the previous job's reducer output becomes this job's mapper input)
    • keyout-valueout: pingping,a.txt-2 (written through unchanged)

    Reducer

    • keyin-valuein:
      pingping,a.txt-2
      pingping,b.txt-2
    • keyout-valueout: pingping,a.txt-2 b.txt-2 (simply concatenate the values that share a key)
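    The two-phase flow above can be simulated outside Hadoop with plain Java collections. The sketch below is an illustration only: the file names and words are made-up sample data, and the sorted maps stand in for the shuffle's sort-and-group step.

```java
import java.util.*;

public class TwoPhaseSketch {

    // Phase 1 (Job1): count occurrences per "word-file" key.
    // Phase 2 (Job2): re-key on the word and concatenate "file count" pairs.
    static Map<String, String> buildIndex(Map<String, List<String>> files) {
        // Job1 output, kept sorted the way shuffle output would be
        Map<String, Integer> counts = new TreeMap<>();
        files.forEach((file, words) -> {
            for (String w : words) {
                counts.merge(w + "-" + file, 1, Integer::sum);
            }
        });

        // Job2: split each key on "-" (like KeyValueTextInputFormat would),
        // group by the word, and append "file count" to the value
        Map<String, StringBuilder> grouped = new TreeMap<>();
        counts.forEach((k, v) -> {
            int i = k.indexOf('-');
            grouped.computeIfAbsent(k.substring(0, i), x -> new StringBuilder())
                   .append(k.substring(i + 1)).append(" ").append(v).append(" ");
        });

        Map<String, String> index = new TreeMap<>();
        grouped.forEach((k, sb) -> index.put(k, sb.toString().trim()));
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("atguigu", "pingping", "atguigu"));
        files.put("b.txt", Arrays.asList("atguigu", "pingping"));
        buildIndex(files).forEach((w, v) -> System.out.println(w + "\t" + v));
        // atguigu	a.txt 2 b.txt 1
        // pingping	a.txt 1 b.txt 1
    }
}
```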

    Code

    Mapper1.java

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    
    /*
     * 1. Input
     * 		atguigu pingping
     * 2. Output
     * 		atguigu-a.txt,1
     */
    public class Example1Mapper1 extends Mapper<LongWritable, Text, Text, IntWritable>{
    	
    	private String filename;
    	private Text out_key=new Text();
    	private IntWritable out_value=new IntWritable(1);
    	
    	@Override
    	protected void setup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
    			throws IOException, InterruptedException {
    		
    		InputSplit inputSplit = context.getInputSplit();
    		
    		FileSplit split=(FileSplit)inputSplit;
    		
    		filename=split.getPath().getName();
    		
    	}
    	
    	@Override
    	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
    			throws IOException, InterruptedException {
    		
    		String[] words = value.toString().split(" ");
    		
    		for (String word : words) {
    			
    			out_key.set(word+"-"+filename);
    			
    			context.write(out_key, out_value);
    		}
    		
    	}
    
    }
    

    Reducer1.java

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    /*
     * 1. Input
     * 		atguigu-a.txt,1
     * 		atguigu-a.txt,1
     * 		atguigu-a.txt,1
     * 2. Output
     * 		atguigu-a.txt,3
     */
    public class Example1Reducer1 extends Reducer<Text, IntWritable, Text, IntWritable>{
    	
    	private IntWritable out_value=new IntWritable();
    	
    	@Override
    	protected void reduce(Text key, Iterable<IntWritable> values,
    			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
    		
    		int sum=0;
    		
    		for (IntWritable value : values) {
    			sum+=value.get();
    		}
    		
    		out_value.set(sum);
    		
    		context.write(key, out_value);
    		
    	}
    
    }
    

    Mapper2.java

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    /*
     * 1. Input
     * 		atguigu-a.txt	3
     * 		atguigu-b.txt	3
     * 		With KeyValueTextInputFormat, a configurable separator splits each line:
     * 		everything before the first separator becomes the key, the rest becomes the value.
     * 2. Output
     * 		atguigu,a.txt	3
     * 		atguigu,b.txt	3
     */
    public class Example1Mapper2 extends Mapper<Text, Text, Text, Text>{
    	// No map() override is needed: the inherited map() is an identity mapper
    	// that writes every input key-value pair through unchanged.
    }
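    As a plain-Java illustration of how KeyValueTextInputFormat splits a line on its separator (a simulation of the split logic only, not the real InputFormat class):

```java
public class KvSplitSketch {
    // Mimics KeyValueTextInputFormat: everything before the FIRST separator
    // is the key, everything after it is the value.
    static String[] split(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) {
            // no separator: the whole line becomes the key, the value is empty
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        // a Job1 output line, split on the "-" configured in the driver
        String[] kv = split("atguigu-a.txt\t3", '-');
        System.out.println(kv[0]); // atguigu
        System.out.println(kv[1]); // a.txt	3
    }
}
```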
    
    

    Reducer2.java

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    /*
     * 1. Input
     * 		atguigu,a.txt	3
     * 		atguigu,b.txt	3
     * 		
     * 2. Output
     * 		atguigu,a.txt	3 b.txt	3
     * 		
     */
    public class Example1Reducer2 extends Reducer<Text, Text, Text, Text>{
    	
    	private Text out_value=new Text();
    	
    	@Override
    	protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
    			throws IOException, InterruptedException {
    	
    		StringBuilder sb = new StringBuilder();
    		
    		// concatenate all values that share this key, separated by spaces
    		for (Text value : values) {
    			
    			sb.append(value.toString()).append(" ");
    			
    		}
    		
    		out_value.set(sb.toString().trim());
    		
    		context.write(key, out_value);
    		
    	}
    
    }
    

    Driver.java

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    /*
     * 1. Example1Driver submits two jobs.
     * 			Job2 depends on Job1: it may only run after Job1 has finished and written its output.
     * 
     * 2. JobControl: holds a group of MR jobs together with their dependency relationships.
     * 				Jobs are added to a JobControl with addJob(ControlledJob aJob).
     * 
     * 3. ControlledJob: a job wrapper that can carry dependencies.
     * 			addDependingJob(ControlledJob dependingJob): adds a job that the current job depends on
     * 			public ControlledJob(Configuration conf): builds a ControlledJob from a configuration
     * 
     */
    public class Example1Driver {
    	
    	public static void main(String[] args) throws Exception {
    		
    		// define the input and output paths
    		Path inputPath=new Path("e:/mrinput/index");
    		Path outputPath=new Path("e:/mroutput/index");
    		Path finalOutputPath=new Path("e:/mroutput/finalindex");
    		
    		// one Configuration per job
    		Configuration conf1 = new Configuration();
    		Configuration conf2 = new Configuration();
    		// Job2 reads Job1's output with KeyValueTextInputFormat; split each line on "-"
    		conf2.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "-");
    		
    		
    		// make sure the output directories do not already exist
    		FileSystem fs=FileSystem.get(conf1);
    		
    		if (fs.exists(outputPath)) {
    			
    			fs.delete(outputPath, true);
    			
    		}
    		
    		if (fs.exists(finalOutputPath)) {
    			
    			fs.delete(finalOutputPath, true);
    			
    		}
    		
    		// ① create the jobs
    		Job job1 = Job.getInstance(conf1);
    		Job job2 = Job.getInstance(conf2);
    		
    		// set the job names
    		job1.setJobName("index1");
    		job2.setJobName("index2");
    		
    		// ② configure Job1
    		job1.setMapperClass(Example1Mapper1.class);
    		job1.setReducerClass(Example1Reducer1.class);
    		
    		job1.setOutputKeyClass(Text.class);
    		job1.setOutputValueClass(IntWritable.class);
    		
    		// set Job1's input and output directories
    		FileInputFormat.setInputPaths(job1, inputPath);
    		FileOutputFormat.setOutputPath(job1, outputPath);
    		
    		// ③ configure Job2
    		job2.setMapperClass(Example1Mapper2.class);
    		job2.setReducerClass(Example1Reducer2.class);
    				
    		job2.setOutputKeyClass(Text.class);
    		job2.setOutputValueClass(Text.class);
    				
    		// Job2's input directory is Job1's output directory
    		FileInputFormat.setInputPaths(job2, outputPath);
    		FileOutputFormat.setOutputPath(job2, finalOutputPath);
    		
    		// set Job2's input format
    		job2.setInputFormatClass(KeyValueTextInputFormat.class);
    		
    		//--------------------------------------------------------
    		// build the JobControl
    		JobControl jobControl = new JobControl("index");
    		
    		// wrap each Job as a ControlledJob
    		ControlledJob controlledJob1 = new ControlledJob(job1.getConfiguration());
    		ControlledJob controlledJob2 = new ControlledJob(job2.getConfiguration());
    		
    		// declare the dependency: controlledJob2 runs only after controlledJob1 succeeds
    		controlledJob2.addDependingJob(controlledJob1);
    		
    		// register the jobs with the JobControl
    		jobControl.addJob(controlledJob1);
    		jobControl.addJob(controlledJob2);
    		
    		// run the JobControl on its own thread
    		Thread jobControlThread = new Thread(jobControl);
    		// mark it as a daemon thread so the JVM can exit once main returns
    		jobControlThread.setDaemon(true);
    		
    		jobControlThread.start();
    		
    		// poll until every job in the JobControl has finished
    		while(true) {
    			
    			if (jobControl.allFinished()) {
    				
    				System.out.println(jobControl.getSuccessfulJobList());
    				
    				return;
    				
    			}
    			
    			// sleep between polls to avoid busy-waiting
    			Thread.sleep(500);
    			
    		}
    	
    	}	
    }
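    JobControl can express arbitrary dependency graphs. For a strictly linear chain like this one, a simpler pattern (a sketch reusing the job1 and job2 objects configured in the driver above) is to submit the jobs one after the other with waitForCompletion, which blocks until the job ends and returns true only on success:

```java
// Sequential alternative to JobControl for a linear two-job chain.
// waitForCompletion(true) blocks until the job finishes (printing progress)
// and returns true only on success, so job2 never starts if job1 fails.
if (job1.waitForCompletion(true)) {
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
} else {
    System.exit(1);
}
```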
    

    Output

    Job1's output:
    (screenshot omitted)
    Final output:
    (screenshot omitted)

  • Original post: https://www.cnblogs.com/sunbr/p/13519126.html