• Custom comparable key classes and comparator classes in MapReduce (Part 2): a beginner's look at how they work, from the source code


    The Job class
        /**
         * Define the comparator that controls
         * how the keys are sorted before they
         * are passed to the {@link Reducer}.
         * @param cls the raw comparator
         * @see #setCombinerKeyGroupingComparatorClass(Class)
         */
        public void setSortComparatorClass(Class<? extends RawComparator> cls
                                           ) throws IllegalStateException {
          ensureState(JobState.DEFINE);
          conf.setOutputKeyComparatorClass(cls);
        }
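    As the Javadoc says, this method defines the comparator that controls how the keys are sorted before they are passed to the Reducer.
        <? extends RawComparator>
         is an upper-bounded wildcard: the argument must be the RawComparator type itself or one of its subtypes.
         RawComparator

             interface Comparator
                     - sub-interface RawComparator: "Compare two objects in binary."
                                        its compare method:
                                        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

                             - implementing class WritableComparator
         Since cls must be RawComparator or a subtype of it, a custom comparator class that extends WritableComparator is also acceptable.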
     
    The JobConf class
         Clicking through setOutputKeyComparatorClass leads to the JobConf class:
        /**
         * Set the {@link RawComparator} comparator used to compare keys.
         * @param theClass the {@link RawComparator} comparator used to
         * compare keys.
         * @see #setOutputValueGroupingComparator(Class)
         */
        public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass) {
          setClass(JobContext.KEY_COMPARATOR,
                   theClass, RawComparator.class);
        }
         It sets the comparator used to compare keys; the theClass parameter is that comparator.
        
         The method body is a single call:
        setClass(JobContext.KEY_COMPARATOR, theClass, RawComparator.class);
         About setClass, its Javadoc notes:
          * An exception is thrown if <code>theClass</code> does not implement the
          * interface <code>xface</code>.
         In other words, setClass stores theClass as the value of the property named by JobContext.KEY_COMPARATOR. The class must be RawComparator itself or a subtype of it; otherwise an exception is thrown. That is, theClass has to implement RawComparator.
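The check described above can be sketched without Hadoop on the classpath. This is only a minimal illustration under assumptions: MiniConf, ByLength, and the property name "demo.key.comparator" are hypothetical stand-ins, and java.util.Comparator plays the role of RawComparator. It is not the actual Configuration implementation.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a configuration object; not Hadoop code.
class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    // Hypothetical comparator used only for the demonstration.
    static class ByLength implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }
    }

    // Store theClass under `name`, but only if it is xface or a subtype of xface.
    void setClass(String name, Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        props.put(name, theClass.getName());
    }

    // Return the stored class name, or null if the property is not set.
    String get(String name) {
        return props.get(name);
    }

    // Helper: true if setClass refuses the given class for the given interface.
    static boolean rejects(Class<?> theClass, Class<?> xface) {
        try {
            new MiniConf().setClass("demo.key.comparator", theClass, xface);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("ByLength rejected: " + rejects(ByLength.class, Comparator.class));
        System.out.println("String rejected:   " + rejects(String.class, Comparator.class));
    }
}
```

The point of the sketch is only the isAssignableFrom test: a class that does not implement the required interface never makes it into the configuration.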
         Since there is a setOutputKeyComparatorClass, there is also a getOutputKeyComparator. It is found in the same JobConf class:


    /**
     * Get the {@link RawComparator} comparator used to compare keys.
     * @return the {@link RawComparator} comparator used to compare keys.
     */
    public RawComparator getOutputKeyComparator() {
      // Look up the configured comparator class; the result is null
      // if the KEY_COMPARATOR property has no value.
      Class<? extends RawComparator> theClass = getClass(
        JobContext.KEY_COMPARATOR, null, RawComparator.class);
      // If a class was configured, create an instance of it via reflection.
      if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
      // Otherwise, fall back to the default comparator for the map output key class.
      return WritableComparator.get(getMapOutputKeyClass().
        asSubclass(WritableComparable.class), this);
    }

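    • if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
          Suppose we have specified a comparator class with job.setSortComparatorClass(xxxS.class), where xxxS extends WritableComparator and overrides its compare method. Then theClass is not null, and an instance of xxxS is created here via reflection.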
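The "configured class if present, otherwise a default" logic can be sketched with plain JDK types. This is a hedged illustration, not Hadoop code: ComparatorLookup, resolve, and ByLength are hypothetical names, java.util.Comparator stands in for RawComparator, and Comparator.naturalOrder() stands in for the default comparator.

```java
import java.util.Comparator;

// Hypothetical sketch of "use the configured comparator, else the default".
class ComparatorLookup {

    // Hypothetical custom comparator: orders strings by length.
    public static class ByLength implements Comparator<String> {
        public ByLength() {}

        @Override
        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }
    }

    // `configured` may be null, meaning no comparator class was set.
    static Comparator<String> resolve(Class<? extends Comparator<String>> configured) {
        if (configured != null) {
            try {
                // Instantiate the configured class through its no-arg constructor.
                return configured.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        // Nothing configured: fall back to the keys' natural ordering (compareTo).
        return Comparator.naturalOrder();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null).compare("b", "aa"));
        System.out.println(resolve(ByLength.class).compare("b", "aa"));
    }
}
```

With no class configured, "b" sorts after "aa" (natural string order); with ByLength configured, "b" sorts before "aa" because it is shorter.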
     
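    The MapTask$MapOutputBuffer class
    At this point one question remains (for the obsessive among us): who actually calls getOutputKeyComparator?
    MapTask has an inner class, MapOutputBuffer:
         Field: private RawComparator<K> comparator;
        Where the field is assigned:
                   // k/v serialization
                   comparator = job.getOutputKeyComparator();
                  So the method is called, and the field assigned, in the code that sets up key/value serialization.
                  (Tip: Ctrl+Shift+P jumps to the matching bracket.)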
          Method: compare
        /**
         * Compare logical range, st i, j MOD offset capacity.
         * Compare by partition, then by key.
         * @see IndexedSortable#compare
         */
        public int compare(final int mi, final int mj) {
          final int kvi = offsetFor(mi % maxRec);
          final int kvj = offsetFor(mj % maxRec);
          final int kvip = kvmeta.get(kvi + PARTITION);
          final int kvjp = kvmeta.get(kvj + PARTITION);
          // sort by partition
          if (kvip != kvjp) {
            return kvip - kvjp;
          }
          // sort by key
          return comparator.compare(kvbuffer,
              kvmeta.get(kvi + KEYSTART),
              kvmeta.get(kvi + VALSTART) - kvmeta.get(kvi + KEYSTART),
              kvbuffer,
              kvmeta.get(kvj + KEYSTART),
              kvmeta.get(kvj + VALSTART) - kvmeta.get(kvj + KEYSTART));
        }
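The "partition first, then key" ordering used by this compare method can be shown on a small in-memory list. This is a simplified sketch under assumptions: Rec, compareBytes, and sorted are hypothetical names, and the real buffer works on offsets into kvbuffer and kvmeta rather than on objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order records by partition, then by serialized key bytes.
class PartitionThenKey {

    // Hypothetical record: a partition number plus the key's serialized bytes.
    static final class Rec {
        final int partition;
        final byte[] key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key.getBytes(StandardCharsets.UTF_8);
        }

        String keyText() {
            return new String(key, StandardCharsets.UTF_8);
        }
    }

    // Lexicographic comparison of unsigned bytes; stands in for the key comparator.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Same shape as the method above: partitions decide first, keys break ties.
    static final Comparator<Rec> ORDER = (r1, r2) -> {
        if (r1.partition != r2.partition) {
            return r1.partition - r2.partition;
        }
        return compareBytes(r1.key, r2.key);
    };

    // Sort a copy of the input and render each record as "partition:key".
    static List<String> sorted(List<Rec> recs) {
        List<Rec> copy = new ArrayList<>(recs);
        copy.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (Rec r : copy) {
            out.add(r.partition + ":" + r.keyText());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec(1, "a"));
        recs.add(new Rec(0, "z"));
        recs.add(new Rec(0, "b"));
        System.out.println(sorted(recs));
    }
}
```

Records in the same partition end up contiguous and key-sorted, which is why the key comparator only ever decides between keys headed for the same reducer.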
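    The compare method declared in RawComparator is:
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    So when we pass in xxxS, a subclass of WritableComparator, the method called here is the six-argument compare that xxxS inherits from WritableComparator. WritableComparator also has another, overloaded compare method.
    This is the six-argument compare in WritableComparator: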
        /** Optimization hook.  Override this to make SequenceFile.Sorter's scream.
         *
         * <p>The default implementation reads the data into two {@link
         * WritableComparable}s (using {@link
         * Writable#readFields(DataInput)}, then calls {@link
         * #compare(WritableComparable,WritableComparable)}.
         */
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
          try {
            buffer.reset(b1, s1, l1);                   // parse key1
            key1.readFields(buffer);

            buffer.reset(b2, s2, l2);                   // parse key2
            key2.readFields(buffer);

          } catch (IOException e) {
            throw new RuntimeException(e);
          }

          return compare(key1, key2);                   // compare them
        }
    From what I can tell, the first part reads the two keys, key1 and key2, out of the byte arrays.
    The call that finally does the comparison is compare(key1, key2):
        /** Compare two WritableComparables.
         * <p> The default implementation uses the natural ordering, calling {@link
         * Comparable#compareTo(Object)}. */
        @SuppressWarnings("unchecked")
        public int compare(WritableComparable a, WritableComparable b) {
          return a.compareTo(b);
        }
     
     
     
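    At this point the method being called is the key's own compareTo, declared through the WritableComparable interface, and that is the method we overrode.
    (Our custom class implements the WritableComparable interface and overrides its compareTo method.)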
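The whole chain (raw bytes, then deserialized keys, then compare(key1, key2), then compareTo) can be imitated with JDK streams only. This is a rough sketch under assumptions: RawCompareChain and IntKey are hypothetical stand-ins for WritableComparator and a WritableComparable key, and the real class reuses one input buffer instead of creating new streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for the default comparator's delegation chain.
class RawCompareChain {

    // Hypothetical key type standing in for a WritableComparable.
    static final class IntKey implements Comparable<IntKey> {
        int value;

        void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }

        @Override
        public int compareTo(IntKey o) {
            return Integer.compare(value, o.value);
        }
    }

    // Reused key instances, filled in on every byte-level comparison.
    private final IntKey key1 = new IntKey();
    private final IntKey key2 = new IntKey();

    // Byte-level entry point: deserialize both keys, then delegate.
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
            key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return compare(key1, key2);
    }

    // Object-level comparison: natural ordering through the key's compareTo.
    int compare(IntKey a, IntKey b) {
        return a.compareTo(b);
    }

    // Serialize ints back to back, 4 big-endian bytes each.
    static byte[] serialize(int... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] buf = serialize(7, 3);
        // Compare the key at offset 0 (7) with the key at offset 4 (3).
        System.out.println(new RawCompareChain().compare(buf, 0, 4, buf, 4, 4));
    }
}
```

Both keys live in the same byte array here, at different offsets, which mirrors how the sort buffer hands two slices of kvbuffer to the comparator.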
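    One more point: as mentioned earlier, the class passed to setSortComparatorClass must be RawComparator or a subtype of it. So there are two cases.

    (1)
    If we define a custom key class, keyxxxS, that implements the WritableComparable interface and overrides compareTo,
    then there is no need to call setSortComparatorClass.
    In that case getOutputKeyComparator will return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);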
        /**
         * Get the key class for the map output data. If it is not set, use the
         * (final) output key class. This allows the map output key class to be
         * different than the final output key class.
         *
         * @return the map output key class.
         */
        public Class<?> getMapOutputKeyClass() {
          Class<?> retv = getClass(JobContext.MAP_OUTPUT_KEY_CLASS, null, Object.class);
          if (retv == null) {
            retv = getOutputKeyClass();
          }
          return retv;
        }
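    As the name suggests, this gets the class of the key, i.e. the one passed to job.setMapOutputKeyClass(xxx.class): Text, for example, or our custom keyxxxS.

    How to define the custom key class keyxxxS

    The declaration of the WritableComparable interface:
        public interface WritableComparable<T> extends Writable, Comparable<T>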
    The Javadoc of Writable, which WritableComparable extends, says:
         * A serializable object which implements a simple, efficient, serialization
         * protocol, based on {@link DataInput} and {@link DataOutput}.
         *
         * <p>Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
         * framework implements this interface.</p>
    In other words, the type of every key and every value should implement this interface, as Text and IntWritable do.
    We can check the source of the Text class to confirm:
        public class Text extends BinaryComparable
            implements WritableComparable<BinaryComparable> {}

         * <p>Implementations typically implement a static <code>read(DataInput)</code>
         * method which constructs a new instance, calls {@link #readFields(DataInput)}
         * and returns the instance.</p>
    That is, an implementing class usually provides a static read method that constructs a new instance, calls readFields on it, and returns the instance.
    Below is a complete example given in the Javadoc. (As published, it compares a nonexistent value field and omits the semicolon after return result; here compareTo uses counter so that the class compiles.)
        public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
          // Some data
          private int counter;
          private long timestamp;

          public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
          }

          public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
          }

          public int compareTo(MyWritableComparable o) {
            int thisValue = this.counter;
            int thatValue = o.counter;
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
          }

          public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + counter;
            result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
            return result;
          }
        }
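The static read(DataInput) convention mentioned in the Javadoc does not appear in the example, so here is a small sketch of it together with a serialization round trip. It uses only java.io so that it runs without Hadoop; EventKey and roundTrip are hypothetical names, and a real key would additionally declare that it implements WritableComparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key class; same method shapes as a Writable, but JDK-only.
class EventKey implements Comparable<EventKey> {
    private int counter;
    private long timestamp;

    // A no-arg constructor is needed so instances can be created before readFields.
    EventKey() {}

    EventKey(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Fields must be read back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // The static factory: construct, populate via readFields, return.
    static EventKey read(DataInput in) throws IOException {
        EventKey k = new EventKey();
        k.readFields(in);
        return k;
    }

    // Order by counter first, then by timestamp.
    @Override
    public int compareTo(EventKey o) {
        int c = Integer.compare(counter, o.counter);
        return c != 0 ? c : Long.compare(timestamp, o.timestamp);
    }

    // Serialize a key to bytes and read it back as a new instance.
    static EventKey roundTrip(EventKey k) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            k.write(new DataOutputStream(bytes));
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EventKey original = new EventKey(1, 100L);
        System.out.println(original.compareTo(roundTrip(original)));
    }
}
```

A key that survives the round trip and compares equal to the original is a quick sanity check that write and readFields agree on field order.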
     
     
    (2)
    If instead we define a custom comparator class xxxS, it extends WritableComparator and overrides its compare method,
    and we must call job.setSortComparatorClass(xxxS.class).
    (getOutputKeyComparator again returns an implementation of RawComparator, and sorting goes through the overridden compare method, which may in turn call the key's compareTo.)

    How to define the custom comparator class xxxS
        public class WritableComparator implements RawComparator, Configurable
         * A Comparator for {@link WritableComparable}s.
         *
         * <p>This base implementation uses the natural ordering.  To define alternate
         * orderings, override {@link #compare(WritableComparable,WritableComparable)}.
         *
         * <p>One may optimize compare-intensive operations by overriding
         * {@link #compare(byte[],int,int,byte[],int,int)}.  Static utility methods are
         * provided to assist in optimized implementations of this method.
    WritableComparator is a comparator for WritableComparable objects.
    The base implementation sorts by natural ordering. To define a different ordering, override the compare method.
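The optimization the Javadoc hints at, comparing directly on the serialized bytes instead of deserializing first, might look roughly like the following for keys serialized as 4-byte big-endian ints. This is a JDK-only sketch: RawIntDescending and its local readInt are hypothetical stand-ins (readInt plays the role of the kind of static helper the Javadoc refers to), and the descending order is just an example of an alternate ordering.

```java
// Hypothetical byte-level comparator for int keys, sorting in descending order.
class RawIntDescending {

    // Decode a big-endian 4-byte int starting at index `start`.
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             |  (bytes[start + 3] & 0xff);
    }

    // Same parameter list as the six-argument compare; l1 and l2 are unused
    // because the keys have a fixed length of 4 bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int a = readInt(b1, s1);
        int b = readInt(b2, s2);
        // Arguments swapped on purpose: larger values sort first.
        return Integer.compare(b, a);
    }

    public static void main(String[] args) {
        byte[] five = {0, 0, 0, 5};
        byte[] nine = {0, 0, 0, 9};
        System.out.println(compare(five, 0, 4, nine, 0, 4));
    }
}
```

No key objects are created during the comparison, which is the whole point of overriding the byte-level method in compare-intensive jobs.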
     





  • Original article: https://www.cnblogs.com/xuanlvshu/p/5748098.html