• Why MapReduce needs setMapOutputKeyClass and setMapOutputValueClass


    A typical MapReduce WordCount program looks like this:

    package com.daxin.blog;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WcMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            // Split the line on spaces and emit (word, 1) for every token.
            String[] words = value.toString().split(" ");
            for (int i = 0; i < words.length; i++) {
                ctx.write(new Text(words[i]), new LongWritable(1L));
            }
        }
    }
    

     

    package com.daxin.blog;

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WcReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        // Reused across calls to avoid allocating a new writable per key.
        private final LongWritable count = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx) throws IOException, InterruptedException {
            // Sum the 1s emitted for this word and write (word, total).
            Iterator<LongWritable> itr = values.iterator();
            long sum = 0L;
            while (itr.hasNext()) {
                sum = sum + itr.next().get();
            }
            count.set(sum);
            ctx.write(key, count);
        }
    }
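
    For a single input line such as "hello world hello", the mapper emits (hello, 1), (world, 1), (hello, 1); after the shuffle the reducer receives (hello, [1, 1]) and (world, [1]) and writes the following (TextOutputFormat separates key and value with a tab by default):

    hello	2
    world	1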
    

      

    The driver code for the job:

    package com.daxin.blog;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class JobClient {

        public static void main(String[] args) throws Exception {

            Job job = Job.getInstance();
            job.setJarByClass(JobClient.class);
            job.setMapperClass(WcMapper.class);
            job.setReducerClass(WcReducer.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // Note: the map output key/value classes are deliberately not set here;
            // that omission is what triggers the error below.
            job.setJobName("wordcount");
            FileInputFormat.addInputPath(job, new Path("/daxin/hadoop-mapreduce/words"));
            FileOutputFormat.setOutputPath(job, new Path("/daxin/hadoop-mapreduce/wordcount-result"));
            job.waitForCompletion(true);
        }
    }
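
    To reproduce the problem, package these classes into a jar and submit it with something like the following (the jar name is illustrative, not from the original post):

    hadoop jar wordcount.jar com.daxin.blog.JobClient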
    

      

    Submitting the job fails with:

    Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
    	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1072)
    	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
    	at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    	at com.daxin.blog.WcMapper.map(WcMapper.java:20)
    	at com.daxin.blog.WcMapper.map(WcMapper.java:13)
    	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:422)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) 

    From the exception we can locate the check in the Hadoop source: org.apache.hadoop.mapred.MapTask.MapOutputBuffer#collect. The key part:

    public synchronized void collect(K key, V value, final int partition)
        throws IOException {
      reporter.progress();
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      // ...
    }

    Here key.getClass() is clearly Text, so we need to find out what keyClass is. Note that the check uses != on Class objects, i.e. reference equality, so the emitted type must match keyClass exactly. Searching for where keyClass is assigned turns up:

     keyClass = (Class<K>)job.getMapOutputKeyClass();
    

    The source of getMapOutputKeyClass:

      public Class<?> getMapOutputKeyClass() {
        Class<?> retv = getClass(JobContext.MAP_OUTPUT_KEY_CLASS, null, Object.class);
        if (retv == null) {
          retv = getOutputKeyClass();
        }
        return retv;
      } 
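
    For reference, the constant behind this lookup is defined in the Hadoop 2.x source roughly as follows (quoted from memory, so treat as approximate):

    public static final String MAP_OUTPUT_KEY_CLASS = "mapreduce.job.map.output.key.class";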

    MAP_OUTPUT_KEY_CLASS names the map output key type in the configuration. Since our driver never set it, the lookup returns the supplied default, null, and getOutputKeyClass is called next:

      public Class<?> getOutputKeyClass() {
        return getClass(JobContext.OUTPUT_KEY_CLASS,
                        LongWritable.class, Object.class);
      }
    

     

     public static final String OUTPUT_KEY_CLASS = "mapreduce.job.output.key.class";
    

      

    getOutputKeyClass reads OUTPUT_KEY_CLASS, which is the key type of the job's final output. We did not set that either, so it falls back to its default, LongWritable. Our map task, however, actually emits Text keys, which is exactly the type mismatch reported above. The map value has the same problem (see the sketch after the fix below). The solution is to set the map output key and value types explicitly:

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
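
    The value side resolves the same way. For completeness, the corresponding JobConf accessors look roughly like this (a paraphrase of the Hadoop 2.x source from memory, so treat it as approximate); the map-output value type falls back to the job-output value type, whose default is Text:

    public Class<?> getMapOutputValueClass() {
        Class<?> retv = getClass(JobContext.MAP_OUTPUT_VALUE_CLASS, null, Object.class);
        if (retv == null) {
            retv = getOutputValueClass();
        }
        return retv;
    }

    public Class<?> getOutputValueClass() {
        // Default job-output value type: Text
        return getClass(JobContext.OUTPUT_VALUE_CLASS, Text.class, Object.class);
    }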
    

     

    Finally, analyzing org.apache.hadoop.mapred.ReduceTask#run leads to the same conclusion: when the map output key/value types are not set explicitly, the key type defaults to LongWritable and the value type to Text. The key lines that fetch the types:

     Class keyClass = job.getMapOutputKeyClass();
     Class valueClass = job.getMapOutputValueClass();
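
    A quick way to confirm these defaults is to ask a freshly created Job before configuring anything. A minimal sketch; the class name DefaultsProbe is mine:

    import org.apache.hadoop.mapreduce.Job;

    public class DefaultsProbe {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Nothing has been set, so the getters fall through to the job-output defaults.
            System.out.println(job.getMapOutputKeyClass());   // class org.apache.hadoop.io.LongWritable
            System.out.println(job.getMapOutputValueClass()); // class org.apache.hadoop.io.Text
        }
    }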
    

    One thing still puzzles me. When we write Mapper and Reducer tasks, both are generic classes, and the type arguments of a generic superclass are retained at runtime. Why, then, does the framework still require us to set the map output key and value types explicitly?

    My own analysis follows; it may contain mistakes, and corrections are welcome:

    Although a generic class retains its type arguments and they can be read back at runtime, what you get is the parameterized type as a whole, not each type argument as a separate piece. That is a bit abstract, so take the Mapper as an example, defined as:

    public class WcMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    
        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
    
         //......
        }
    }
    

      

    When we read its generic information at runtime, all we can obtain is:

    org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, org.apache.hadoop.io.Text, org.apache.hadoop.io.LongWritable>
    

    rather than directly an array of the four type arguments. My guess is that this is why MapReduce requires the output types to be set explicitly (the concrete types are needed here for serialization). Of course, one could parse the four type arguments out of this information, and that would work, but would the resulting code be all that elegant? (See the sketch below.)
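
    As a side note, the parsing just mentioned is actually straightforward with java.lang.reflect; here is a minimal sketch (the class name GenericProbe is mine, and it assumes WcMapper is available in the same package):

    import java.lang.reflect.ParameterizedType;
    import java.lang.reflect.Type;

    public class GenericProbe {
        public static void main(String[] args) {
            // The generic supertype as a whole -- exactly the string quoted above:
            ParameterizedType superType = (ParameterizedType) WcMapper.class.getGenericSuperclass();
            System.out.println(superType);

            // The four type arguments are individually accessible:
            for (Type arg : superType.getActualTypeArguments()) {
                System.out.println(arg.getTypeName());
            }
        }
    }

    Whether reflection like this would be robust enough for the framework is a separate question: erasure means the concrete types are unrecoverable when the arguments are passed through further generic indirection (e.g. a mapper declared as class MyMapper<K> extends Mapper<K, Text, Text, LongWritable>), which may be another reason for the explicit setters.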

  • Original post: https://www.cnblogs.com/leodaxin/p/8831092.html