• [Hadoop Offline Basics Summary] Building an Inverted Index with MapReduce


    Building an Inverted Index with MapReduce


    Counting how many times given words appear in each document

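    • Three documents (named a.txt, b.txt, and c.txt in the output listed at the end) have the contents below; count how many times each of the words hello, tom, and jerry appears in each document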
    hello tom
    hello jerry
    hello tom
    
    hello jerry
    hello jerry
    tom jerry
    
    hello jerry
    hello tom
    
    • Java implementation

    Define a Mapper class

    package cn.itcast.demo2;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    
    import java.io.IOException;
    
    public class IndexMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
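            // Get the input split; the explicit cast to FileSplit is required
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            // Get the name of the file this split belongs to
            String name = fileSplit.getPath().getName();
            // Split v1 (the input line) on single spaces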
            String[] split = value.toString().split(" ");
            for (String s : split) {
                context.write(new Text(s + "-" + name), new IntWritable(1));
            }
        }
    }
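
    The key-building step of the mapper can be tried without a cluster. The following is a minimal sketch in plain Java, with no Hadoop types; the class name MapStepSketch and the method emitKeys are hypothetical and only mirror the loop in IndexMapper.map. Each returned key would be written with the value 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexMapper.map; not part of the original job.
class MapStepSketch {
    // Returns the k2 keys ("word-fileName") that one input line would produce.
    static List<String> emitKeys(String fileName, String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.split(" ")) {
            keys.add(word + "-" + fileName);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Prints [hello-a.txt, tom-a.txt]
        System.out.println(emitKeys("a.txt", "hello tom"));
    }
}
```

    Putting the file name into the key is what makes the later sum a per-document count rather than a global word count.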
    

    Define a Reducer class

    package cn.itcast.demo2;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class IndexReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int i = 0;
            for (IntWritable value : values) {
                // Add up the occurrences recorded for this word-file key
                i += value.get();
            }
            context.write(key, new IntWritable(i));
        }
    }
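
    As a rough check of what shuffle plus reduce should produce, the sketch below simulates both steps in memory with a sorted map. It is plain Java, not Hadoop code; the names ReduceStepSketch and countByKey are hypothetical, and the assignment of the three sample documents to a.txt, b.txt, and c.txt is an assumption inferred from the output listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the map, shuffle, and reduce steps.
class ReduceStepSketch {
    // Builds "word-fileName" keys and sums a 1 for every occurrence, keeping keys sorted.
    static Map<String, Integer> countByKey(Map<String, String[]> linesByFile) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String[]> doc : linesByFile.entrySet()) {
            for (String line : doc.getValue()) {
                for (String word : line.split(" ")) {
                    counts.merge(word + "-" + doc.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed mapping of the sample documents to file names.
        Map<String, String[]> docs = new LinkedHashMap<>();
        docs.put("a.txt", new String[] {"hello tom", "hello jerry", "hello tom"});
        docs.put("b.txt", new String[] {"hello jerry", "hello jerry", "tom jerry"});
        docs.put("c.txt", new String[] {"hello jerry", "hello tom"});
        countByKey(docs).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```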
    

    Program entry point (main function)

    package cn.itcast.demo2;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class IndexMain extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Create the Job object
            Job job = Job.getInstance(super.getConf(), "getIndex");
            // Input: set the input format and the input path
            job.setInputFormatClass(TextInputFormat.class);
            TextInputFormat.setInputPaths(job, new Path("file:////Volumes/赵壮备份/大数据离线课程资料/5.大数据离线第五天/倒排索引/input"));
    
            // Custom map logic
            job.setMapperClass(IndexMapper.class);
            // Set the k2, v2 output types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            // Custom reduce logic
            job.setReducerClass(IndexReducer.class);
            // Set the k3, v3 output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // Output: set the output format and the output path
            job.setOutputFormatClass(TextOutputFormat.class);
            TextOutputFormat.setOutputPath(job, new Path("file:////Volumes/赵壮备份/大数据离线课程资料/5.大数据离线第五天/倒排索引/output"));
    
            // Submit the job and block until it completes
            boolean b = job.waitForCompletion(true);
            return b ? 0 : 1;
        }
    
        public static void main(String[] args) throws Exception {
            int run = ToolRunner.run(new Configuration(), new IndexMain(), args);
            System.exit(run);
        }
    }
    

    Output

    hello-a.txt	3
    hello-b.txt	2
    hello-c.txt	2
    jerry-a.txt	1
    jerry-b.txt	3
    jerry-c.txt	1
    tom-a.txt	2
    tom-b.txt	1
    tom-c.txt	1
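
    The job above emits one line per word-file pair. An inverted index is often presented with one line per word followed by its files and counts. The sketch below shows one possible regrouping of the output lines in plain Java; it is not part of the original job. The names PostingListSketch and group are hypothetical, and the code assumes file names contain no hyphen, since it splits each key at its last "-".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical post-processing of the job output; assumes file names contain no '-'.
class PostingListSketch {
    // Turns lines such as "hello-a.txt<TAB>3" into word -> "a.txt:3 b.txt:2 ...".
    static Map<String, String> group(List<String> outputLines) {
        Map<String, StringJoiner> postings = new TreeMap<>();
        for (String line : outputLines) {
            String[] keyAndCount = line.split("\t");
            String key = keyAndCount[0];
            int dash = key.lastIndexOf('-');
            String word = key.substring(0, dash);
            String file = key.substring(dash + 1);
            postings.computeIfAbsent(word, w -> new StringJoiner(" "))
                    .add(file + ":" + keyAndCount[1]);
        }
        Map<String, String> result = new TreeMap<>();
        postings.forEach((word, files) -> result.put(word, files.toString()));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello-a.txt\t3", "hello-b.txt\t2", "hello-c.txt\t2");
        // Prints {hello=a.txt:3 b.txt:2 c.txt:2}
        System.out.println(group(lines));
    }
}
```

    In Hadoop itself the same regrouping would need a second pass or a different choice of map output key; this sketch only illustrates the target layout.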
    
  • Original article: https://www.cnblogs.com/zzzsw0412/p/12772488.html