

    Atitit: Hadoop usage summary

     

    Contents

    1.1. Download: 300 MB; 800 MB after extraction

    1.2. Part 2: Required JAR files

    2. Demo code

    2.1. WCMapper

    2.2. WCReduce

    2.3. (3) Implementing the driver

    3. Run: setting HADOOP_HOME for Hadoop

    3.1. Input txt

    3.2. Run output console

    3.3. Result output .txt

    4. Part 4: Workflow in JAR mode

    5. Ref

     

     

      1. Download: 300 MB; 800 MB after extraction

     

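    HDFS is the distributed file system of the Hadoop big data platform. It provides data storage for upper-layer applications and for other big data components such as Hive, MapReduce, Spark, and HBase.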

     

     

      1. Part 2: Required JAR files

     

     

    hadoop-2.4.1\share\hadoop\common\hadoop-common-2.4.1.jar

    hadoop-2.4.1\share\hadoop\common\lib\ (all JAR files)

    hadoop-2.4.1\share\hadoop\mapreduce\lib\ (all JAR files)

     

     

    1. Demo code
      1. WCMapper 

    package hadoopDemo;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Signature of the base class: Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Map phase: split each input line into words.
        // 1) Extend Mapper and declare the input key type and input value type.
        // 2) Declare the output key type and output value type.
        // 3) Override map(). It receives the byte offset of the line within the file
        //    (not a line number), the text of the line, and the context used to write output.

        private static final IntWritable ONE = new IntWritable(1);

        // Reused for every token to avoid allocating a new Text per word.
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split on runs of whitespace and skip the empty token that leading whitespace produces.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
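    The map step can be tried without a cluster. The following is a minimal sketch in plain Java with no Hadoop dependency: the class name MapPhaseSketch and the helper mapLine are hypothetical, plain strings and integers stand in for Text and IntWritable, and the sketch splits on runs of whitespace.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WCMapper.map(): String and Integer replace Text and IntWritable.
class MapPhaseSketch {

    // Emits one (word, 1) pair per whitespace-separated token, as the mapper does for each input line.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample line taken from the input file used in this post.
        System.out.println(mapLine("aaa bbb ccc aaa")); // prints [aaa=1, bbb=1, ccc=1, aaa=1]
    }
}
```

    Note that the map step does no counting: the duplicate word aaa is emitted twice, and the two pairs are only combined later in the reduce step.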

     

      1. WCReduce 

     

    package hadoopDemo;

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    import com.alibaba.fastjson.JSON;
    import com.google.common.collect.Maps;

    public class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0; // counts how many times the word occurs
            for (IntWritable num : values) { // iterate over the values and accumulate the count
                sum += num.get();

                // Debug output only: print the running state as JSON (requires fastjson and Guava on the classpath).
                Map<String, Object> m = Maps.newConcurrentMap();
                m.put("key", key);
                m.put("num", num);
                m.put("sum_curr", sum);
                System.out.println(JSON.toJSONString(m));
            }
            context.write(key, new IntWritable(sum));
        }
    }
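    Between the mapper and the reducer, the framework groups all values that share a key and hands each group to reduce(). The following is a minimal local simulation of that flow in plain Java, not Hadoop code: the class WordCountSimulation and its methods are hypothetical, and a TreeMap stands in for the sorted grouping done by the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of map -> shuffle -> reduce; no Hadoop classes are involved.
class WordCountSimulation {

    // Shuffle: collect the 1s emitted by the map step under their word, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // a blank line yields one empty token; ignore it
            }
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: sum the grouped values of one key, mirroring the loop in WCReduce.reduce().
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the whole pipeline on one line of text and returns word -> count.
    static Map<String, Integer> wordCount(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        List<String> words = Arrays.asList(line.trim().split("\\s+"));
        for (Map.Entry<String, List<Integer>> e : shuffle(words).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("aaa bbb ccc aaa")); // prints {aaa=2, bbb=1, ccc=1}
    }
}
```

    The counts agree with the result file shown later in this post for the same input line.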

     

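      1. (3) Implementing the driver

    The purpose of the driver is to specify the user's Map class and Reduce class in the program and to configure the parameters used when the job is submitted to Hadoop. The core code of a word-count driver follows: first the WCDriver class used in this demo, then a reference version, MyWordCount.java.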

     

     

     

    package hadoopDemo;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WCDriver {

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

            // Windows only: load the native Hadoop library (path as on the author's machine).
            System.load("D:\\haddop\\hadoop-3.1.1\\bin\\hadoop.dll");

            // Create the job.
            Job job = Job.getInstance(new Configuration());

            // Set the driver class; Hadoop uses it to locate the JAR that contains the job.
            job.setJarByClass(WCDriver.class);

            // Set the mapper class and the reducer class.
            job.setMapperClass(WCMapper.class);
            job.setReducerClass(WCReduce.class);

            // Set the key type and value type of the map output.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            // Set the key type and value type of the reduce output.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Set the input file and the output path. The output path is created as a
            // directory (here named out.txt) and must not exist before the job runs.
            String path_ipt = "D:\\workspace\\hadoopDemo\\ipt.txt";
            FileInputFormat.setInputPaths(job, new Path(path_ipt));
            String path_out = "D:\\workspace\\hadoopDemo\\out.txt";
            FileOutputFormat.setOutputPath(job, new Path(path_out));

            // Submit the job and wait for it to finish.
            boolean result = job.waitForCompletion(true);
            System.out.println(result);

            // Exit with 0 on success and 1 on failure.
            System.exit(result ? 0 : 1);
        }
    }

     

     

     

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyWordCount {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(MyWordCount.class);
            job.setMapperClass(WordcountMapper.class);
            // The reducer doubles as a combiner, which pre-sums counts on the map side.
            job.setCombinerClass(WordcountReducer.class);
            job.setReducerClass(WordcountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // args[0] is the input path and args[1] is the output path.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

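    As the core code above shows, the main function must set the input and output path parameters. Submitting the job requires a Job object, on which the job name, the Map class, the Reduce class, the key and value types, and other parameters are set. Source: CUUG official website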
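    Because the driver takes its input and output paths as parameters, and Hadoop refuses to start a job whose output directory already exists, it can help to check the arguments before submitting. The following is a minimal sketch for local-mode runs only: the class DriverArgsCheck and its validate method are hypothetical, and the check uses java.nio against the local file system, so it does not apply to HDFS paths.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check for a driver that runs against the local file system.
class DriverArgsCheck {

    // Returns null when the arguments look usable, otherwise a short description of the problem.
    static String validate(String[] args) {
        if (args.length != 2) {
            return "usage: <input path> <output path>";
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        if (!Files.exists(input)) {
            return "input path does not exist: " + input;
        }
        if (Files.exists(output)) {
            return "output path already exists: " + output;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        System.out.println(problem == null ? "arguments look fine" : problem);
    }
}
```

    A driver could call such a check at the top of main and exit early with the message instead of waiting for the job submission to fail.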

     

    1. Run: setting HADOOP_HOME for Hadoop

    The Hadoop environment variable can be set by appending the following command to the ~/.bashrc file.

    export HADOOP_HOME=/usr/local/hadoop

    In Eclipse, the environment variable can only be set in the Run Configuration.
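    To confirm that the setting is visible to the JVM that Eclipse launches, a small check can be run with the same Run Configuration. This is a sketch under assumptions: the class HadoopHomeCheck is hypothetical, and the lookup order (the hadoop.home.dir system property first, then the HADOOP_HOME variable) is my understanding of how Hadoop resolves its home directory, not something stated in the source.

```java
// Hypothetical helper that reports where a Hadoop home directory would come from.
class HadoopHomeCheck {

    // Prefers the hadoop.home.dir system property and falls back to the HADOOP_HOME variable.
    static String resolve(String systemProperty, String envVariable) {
        if (systemProperty != null && !systemProperty.isEmpty()) {
            return systemProperty;
        }
        if (envVariable != null && !envVariable.isEmpty()) {
            return envVariable;
        }
        return null;
    }

    public static void main(String[] args) {
        String home = resolve(System.getProperty("hadoop.home.dir"), System.getenv("HADOOP_HOME"));
        System.out.println(home == null ? "Hadoop home is not set" : "Hadoop home: " + home);
    }
}
```

    If the program reports that the home is not set, the variable is missing from the Run Configuration even when it is defined in the shell.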

     

      1. Input txt 

     

    aaa bbb ccc aaa

     

      1. Run output console

    {"num":{},"sum_curr":1,"key":{"bytes":"YWFh","length":3}}

    {"num":{},"sum_curr":2,"key":{"bytes":"YWFh","length":3}}

    {"num":{},"sum_curr":1,"key":{"bytes":"YmJi","length":3}}

    {"num":{},"sum_curr":1,"key":{"bytes":"Y2Nj","length":3}}
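    In this console output the key is a Hadoop Text object, which fastjson serializes as a Base64-encoded byte array plus a length. The num field is empty, presumably because IntWritable exposes its value through get() rather than a bean-style getter. The following sketch decodes the bytes field back to the word; the class DecodeKeyBytes is hypothetical and uses only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Decodes the Base64 "bytes" field printed for a Text key in the debug output.
class DecodeKeyBytes {

    static String decode(String base64, int length) {
        byte[] raw = Base64.getDecoder().decode(base64);
        // Only the first `length` bytes are valid: Text may keep a larger backing array.
        return new String(raw, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("YWFh", 3)); // prints aaa
        System.out.println(decode("YmJi", 3)); // prints bbb
        System.out.println(decode("Y2Nj", 3)); // prints ccc
    }
}
```

    Read this way, the four lines show aaa reaching a running sum of 2, followed by bbb and ccc with a sum of 1 each.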

     

      1. Result output .txt

    Contents of the file D:\workspace\hadoopDemo\out.txt\part-r-00000:

    aaa 2

    bbb 1

    ccc 1

     

    1. Part 4: Workflow in JAR mode

     

    1. If using JAR mode, package the project as a JAR and upload it to the virtual machine.

     

    2. Run the JAR file.

     

     

    1. Ref

    MapReduce example: counting words (wordcount) - Tyshawn's blog - CSDN blog.html

    MapperReduce introductory Wordcount example - Xiao Liu's blog - CSDN blog.html

  • Original article: https://www.cnblogs.com/attilax/p/15197508.html