• Getting Started with MapReduce Programming (WordCount, TopN)


    After successfully setting up the Hadoop cluster, I tested MapReduce with the bundled WordCount example, which is essentially the Hello World of MapReduce programming: its structure is clear and easy to understand, and it is a convenient way to explain how MapReduce works. This post records how to get started writing simple MapReduce programs in Eclipse. Little of the code here is original; most of it is adapted from existing, well-written examples.

     Part 1. Hello 'MapReduce' World: the WordCount program

    1. Create a Java project named WordCount in Eclipse

      

    2. Import the required jars (you can create a User Library in Eclipse for these three jars to make them easier to reuse)

      ①commons-cli-1.2.jar

      ②hadoop-common-2.7.3.jar

      ③hadoop-mapreduce-client-core-2.7.3.jar

    3. Configure the Build Path and make sure the project references the three jars above

    4. Create a package named zmt.test, add a class named WordCount under it, and type in the official source code:

    package zmt.test;
    
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    
    public class WordCount {
        
        // Mapper: splits each input line into tokens and emits a (word, 1) pair per token.
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{
            
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException
            {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()){
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        
        }
        
        // Reducer: sums the 1s for each word; main() also registers it as the combiner.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
        {
            private IntWritable result = new IntWritable();
            
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{
                
                int sum = 0;
                for (IntWritable val : values){
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
        
        // Driver: parses the command line, configures the job, and submits it.
        public static void main(String[] args) throws Exception{
            Configuration conf = new Configuration();
            
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            
            if(otherArgs.length < 2){
                
                System.err.println("Usage: wordcount <in> [<in>...] <out>");
                System.exit(2);
                
            }
            
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            for (int i=0; i<otherArgs.length-1; ++i){
                FileInputFormat.addInputPath(job,  new Path(otherArgs[i]));
            }
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
            System.exit(job.waitForCompletion(true)?0:1);
        }
        
    }
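
      To make the data flow concrete: if the input were a small file with the two lines "hello world" and "hello hadoop", the mapper would emit (hello,1), (world,1), (hello,1), (hadoop,1); the shuffle groups the pairs by word, and the reducer sums each group, so part-r-00000 would read (tab-separated):

      hadoop	1
      hello	2
      world	1

      Note that IntSumReducer is also registered as the combiner (setCombinerClass above), which is safe here because integer addition is associative and commutative.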

    5. Right-click the project and export it as a jar file named WC.jar
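
      Alternatively, the jar can be built from the command line; a minimal sketch, assuming Eclipse compiled the classes into the project's bin directory:

      jar cf WC.jar -C bin .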

    6. Copy WC.jar to the Master host of the virtual machine; with VMware Tools installed you can copy it by drag and drop. Here it is placed in /home/admin/Documents/

    7. Prepare the text file whose word frequencies will be counted; here I reuse the README.txt from the earlier Hadoop setup.

      Upload the file to HDFS: hdfs dfs -put README.txt /data/input
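
      If the target directory does not exist yet, create it first:

      hdfs dfs -mkdir -p /data/input

      (Note that step 8 passes /data/input/WC as the input path; make sure the directory you upload to matches the path you hand to the job.)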

    8. Run the job:

      hadoop jar /home/admin/Documents/WC.jar zmt.test.WordCount /data/input/WC /data/output/WC

      Note the fully qualified name of the entry class, zmt.test.WordCount; in more complex projects you must point Hadoop at the right MapReduce entry point.
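
      As an aside, if the jar's manifest declares a Main-Class (the Eclipse export wizard can set one), hadoop jar falls back to it and the class name can be left out, e.g.:

      hadoop jar /home/admin/Documents/WC.jar /data/input/WC /data/output/WC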

    9. Check the results. The console shows the job's progress and counters; the word counts themselves are written to /data/output/WC/part-r-00000
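
      For example, to print the output file straight from HDFS:

      hdfs dfs -cat /data/output/WC/part-r-00000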

    10. Track the job: the ResourceManager web UI shows the status of running and finished MapReduce applications

      http://192.168.222.134:8088/cluster

     Part 2. The TopN problem: finding the N largest numbers

      The TopN problem is another good introductory example. It helps build a better picture of the MapReduce workflow and, more importantly, shows which parts of a program are fixed patterns and which parts can be changed and written differently.

      Steps that duplicate the WordCount workflow are not repeated; the key code and operations are given directly, starting with a short sketch of the core idiom below.
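
      The heart of both the Mapper and the Reducer below is a small TreeMap idiom: keep at most N entries and evict the smallest key whenever the map grows past N, so the N largest values survive. Here is a minimal standalone sketch of just that idiom (the class name TopNIdiom is mine, not part of the job); note that duplicate numbers collapse into a single TreeMap entry, a limitation the full job below shares:

    import java.util.TreeMap;
    
    public class TopNIdiom {
        
        public static void main(String[] args) {
            final int N = 3;
            int[] data = {5, 1, 9, 3, 7, 8, 2};
            // TreeMap keeps keys in ascending order, so firstKey() is the smallest.
            TreeMap<Integer, Integer> tree = new TreeMap<Integer, Integer>();
            for (int num : data) {
                tree.put(num, num);
                if (tree.size() > N) {
                    tree.remove(tree.firstKey()); // evict the current smallest
                }
            }
            System.out.println(tree.descendingKeySet()); // prints [9, 8, 7]
        }
    }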

    1. Generate the random numbers

      

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.util.Random;
    
    
    public class Num_Generator {
        
        public static void main(String[] args) {
            Random random = new Random();
            String filename = "random_num";
            
            // Write 10 files, each holding one million random ints, one per line.
            for (int i = 0; i < 10; i++) {
                File file = new File(filename + i + ".txt");
                BufferedWriter bw = null;
                try {
                    bw = new BufferedWriter(
                            new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
                    for (int j = 0; j < 1000000; j++) {
                        bw.write(Integer.toString(random.nextInt()));
                        bw.newLine();
                    }
                    bw.flush();
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    // The original never closed the stream; close it to release the file handle.
                    if (bw != null) {
                        try {
                            bw.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }
                System.out.println(i + ":Complete.");
            }
        }
    }

      This program generates 10 files, each containing one million random numbers in the Integer range. After generating them, copy the files to the virtual machine and upload them to HDFS.
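
      A sketch of the upload, assuming the generated files were copied to the Master host and should land under /data/input/test1, the input path used in step 3 below:

      hdfs dfs -mkdir -p /data/input/test1
      hdfs dfs -put random_num*.txt /data/input/test1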

    2. Write the TopN program (adapted from another blog post; embarrassingly, I have lost the link (;′⌒`))

    import java.io.IOException;
    import java.util.TreeMap;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    
    
    public class TopN {
        
        public static class MyMapper extends Mapper<Object, Text, NullWritable, IntWritable>
        {
            // Holds at most N numbers; whenever the map grows past N, the smallest
            // key is evicted, so the N largest values seen by this mapper survive.
            // Note that duplicate numbers collapse into a single TreeMap entry.
            private TreeMap<Integer, Integer> tree = new TreeMap<Integer, Integer>();
            
            @Override
            protected void setup(Context context) throws IOException,
                    InterruptedException {
                System.out.println("Mapper("+context.getConfiguration().getInt("N", 10)+"):in setup...");
            }
            
            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
                int num = Integer.parseInt(value.toString().trim());
                tree.put(num, num);
                if(tree.size() > context.getConfiguration().getInt("N", 10))
                    tree.remove(tree.firstKey()); // evict the current smallest
            }
            
            @Override
            protected void cleanup(Context context) throws IOException,
                    InterruptedException {
                // Emit this mapper's local top N only after all of its input is processed.
                System.out.println("Mapper("+context.getConfiguration().getInt("N", 10)+"):in cleanup...");
                for(Integer num : tree.values()){
                    context.write(NullWritable.get(), new IntWritable(num));
                }
            }
        }
        
        public static class MyReducer extends Reducer<NullWritable, IntWritable, NullWritable, IntWritable>
        {
            private TreeMap<Integer, Integer> tree = new TreeMap<Integer, Integer>();
            
            @Override
            public void reduce(NullWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{
                // Every mapper emits the same NullWritable key, so all candidates
                // arrive in this single reduce call; merge them with the same idiom.
                for (IntWritable value : values){
                    tree.put(value.get(), value.get());
                    if(tree.size() > context.getConfiguration().getInt("N", 10))
                    {
                        tree.remove(tree.firstKey());
                    }
                }
            }
            
            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                // Write the global top N in descending order.
                for(Integer val : tree.descendingKeySet()){
                    context.write(NullWritable.get(), new IntWritable(val));
                }
            }
        }
        
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if(otherArgs.length < 3){
                System.err.println("Usage: TopN <N> <in> [<in>...] <out>");
                System.exit(2);
            }
            
            conf.setInt("N", Integer.parseInt(otherArgs[0]));
            System.out.println("N:"+otherArgs[0]);
            
            Job job = Job.getInstance(conf, "TopN");
            job.setJarByClass(TopN.class);
            job.setMapperClass(MyMapper.class);
            // No combiner is set: each mapper already emits at most N values in cleanup().
            job.setMapOutputKeyClass(NullWritable.class);
            job.setMapOutputValueClass(IntWritable.class);
            
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(IntWritable.class);
            
            for (int i = 1; i < otherArgs.length-1; i++) {
                FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
            }
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
            
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
        
    }

    3. Run the test. The first argument is N (here 12), followed by the input and output paths. Note that the println calls in the Mapper and Reducer show up in the task (container) logs, not on this console:

      hadoop jar /home/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/TopN.jar TopN 12 /data/input/test1 /data/output/TT

    4. Check the results:

      hdfs dfs -cat /data/output/TT/part-r-00000

      

    [root@Master myscript]# hdfs dfs -cat /data/output/TT/part-r-00000
    2147483194
    2147483070
    2147483066
    2147482879
    2147482835
    2147482469
    2147482152
    2147481212
    2147481174
    2147480379
    2147479927
    2147479795