Custom Comparable Key Classes and Comparator Classes in MapReduce (Part 2): A Beginner Reads the Source to See How They Work

The Job class
/**
 * Define the comparator that controls
 * how the keys are sorted before they
 * are passed to the {@link Reducer}.
 * @param cls the raw comparator
 * @see #setCombinerKeyGroupingComparatorClass(Class)
 */
public void setSortComparatorClass(Class<? extends RawComparator> cls
                                   ) throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setOutputKeyComparatorClass(cls);
}
As the Javadoc says, this method defines the comparator that controls how the keys are sorted before they are passed to the Reducer.

 <? extends RawComparator>

is an upper-bounded wildcard: the argument must be either RawComparator itself or a subtype of RawComparator.

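RawComparator

The interface Comparator
  - sub-interface RawComparator: "Compare two objects in binary."
    Its compare method:
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
  - implementing class WritableComparator

Since cls must be RawComparator or one of its subtypes, a custom comparator class that extends WritableComparator is also acceptable.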
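To make the bounded wildcard concrete, here is a minimal sketch in plain Java. It does not use Hadoop: ByteComparator, LengthComparator and register are hypothetical stand-ins for RawComparator, a WritableComparator-style implementation and setSortComparatorClass.

```java
import java.util.Comparator;

class BoundedWildcardDemo {
    // Hypothetical stand-in for RawComparator: a Comparator that can also compare raw bytes.
    interface ByteComparator<T> extends Comparator<T> {
        int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    }

    // Hypothetical implementation, playing the role of a WritableComparator subclass.
    static class LengthComparator implements ByteComparator<String> {
        @Override
        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return Integer.compare(l1, l2);
        }
    }

    // Accepts ByteComparator itself or any subtype, like the cls parameter discussed above.
    static String register(Class<? extends ByteComparator> cls) {
        return cls.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(register(LengthComparator.class));
        // register(String.class); // would not compile: String is not a ByteComparator
    }
}
```

The point of the bound is that a wrong argument is rejected by the compiler, before the job ever runs.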

The JobConf class

Following setOutputKeyComparatorClass takes us into the JobConf class:
/**
 * Set the {@link RawComparator} comparator used to compare keys.
 * @param theClass the {@link RawComparator} comparator used to
 * compare keys.
 * @see #setOutputValueGroupingComparator(Class)
 */
// Sets the comparator used to compare keys; the theClass parameter is that comparator.
public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass) {
  setClass(JobContext.KEY_COMPARATOR,
           theClass, RawComparator.class);
}
The whole body is a single call: setClass(JobContext.KEY_COMPARATOR, theClass, RawComparator.class).
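About setClass

Its Javadoc notes:
 * An exception is thrown if <code>theClass</code> does not implement the
 * interface <code>xface</code>.

So setClass records theClass in the configuration under the property JobContext.KEY_COMPARATOR, after checking that theClass is RawComparator itself or one of its subtypes; if it is not, an exception is thrown. In other words, theClass must implement RawComparator.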
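The check described in that Javadoc can be sketched in plain Java as follows. This is not Hadoop's Configuration class: SetClassSketch, its map-backed storage and the property name demo.key.comparator are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

class SetClassSketch {
    // Hypothetical stand-in for a configuration: property name -> class name.
    private final Map<String, String> props = new HashMap<>();

    // Record theClass under name, but only if it implements or extends xface.
    void setClass(String name, Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        props.put(name, theClass.getName());
    }

    String get(String name) {
        return props.get(name);
    }

    // Hypothetical comparator used as a valid argument.
    static class ReverseOrder implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            return b.compareTo(a);
        }
    }

    public static void main(String[] args) {
        SetClassSketch conf = new SetClassSketch();
        conf.setClass("demo.key.comparator", ReverseOrder.class, Comparator.class);
        System.out.println(conf.get("demo.key.comparator"));
        try {
            // String does not implement Comparator, so this call is rejected.
            conf.setClass("demo.key.comparator", String.class, Comparator.class);
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```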

Where there is a setOutputKeyComparatorClass there is also a getOutputKeyComparator, and it is found in the same JobConf class:
/**
 * Get the {@link RawComparator} comparator used to compare keys.
 *
 * @return the {@link RawComparator} comparator used to compare keys.
 */
public RawComparator getOutputKeyComparator() {
  // If the KEY_COMPARATOR property has no value, getClass returns null.
  Class<? extends RawComparator> theClass = getClass(
    JobContext.KEY_COMPARATOR, null, RawComparator.class);
  // If a comparator class was configured, create an instance of it by reflection.
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
  // Otherwise fall back to the default comparator for the map output key class.
  return WritableComparator.get(getMapOutputKeyClass().
    asSubclass(WritableComparable.class), this);
}
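The branch
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
is the one taken when we specify a comparator class ourselves, that is, when we call job.setSortComparatorClass(xxxS.class), where xxxS extends WritableComparator and overrides its compare method.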
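The "configured class if present, default otherwise" logic can be sketched in plain Java like this. It is a simplified stand-in rather than Hadoop's code: ComparatorLookupSketch, resolve and ReverseOrder are hypothetical names, and natural ordering plays the role of the default comparator.

```java
import java.util.Comparator;

class ComparatorLookupSketch {
    // Hypothetical custom comparator: orders strings in reverse.
    static class ReverseOrder implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            return b.compareTo(a);
        }
    }

    // Returns an instance of the configured class, or natural ordering when none is configured.
    static Comparator<String> resolve(Class<? extends Comparator<String>> configured) {
        if (configured != null) {
            try {
                // Create the comparator reflectively through its no-argument constructor.
                return configured.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return Comparator.naturalOrder();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null).compare("a", "b") < 0);               // default ordering
        System.out.println(resolve(ReverseOrder.class).compare("a", "b") > 0); // custom ordering
    }
}
```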

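The MapTask$MapOutputBuffer class

At this point a question comes up (for the obsessive among us): who actually calls getOutputKeyComparator?
The MapTask class has an inner class, MapOutputBuffer.

Field: private RawComparator<K> comparator;

Where the field is assigned:

// k/v serialization
comparator = job.getOutputKeyComparator();

So the comparator is fetched and assigned in the part of MapOutputBuffer that sets up key/value serialization.

(Tip: Ctrl+Shift+P jumps to the matching bracket.)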

Method: compare
/**
 * Compare logical range, st i, j MOD offset capacity.
 * Compare by partition, then by key.
 * @see IndexedSortable#compare
 */
public int compare(final int mi, final int mj) {
  final int kvi = offsetFor(mi % maxRec);
  final int kvj = offsetFor(mj % maxRec);
  final int kvip = kvmeta.get(kvi + PARTITION);
  final int kvjp = kvmeta.get(kvj + PARTITION);
  // sort by partition
  if (kvip != kvjp) {
    return kvip - kvjp;
  }
  // sort by key
  return comparator.compare(kvbuffer,
      kvmeta.get(kvi + KEYSTART),
      kvmeta.get(kvi + VALSTART) - kvmeta.get(kvi + KEYSTART),
      kvbuffer,
      kvmeta.get(kvj + KEYSTART),
      kvmeta.get(kvj + VALSTART) - kvmeta.get(kvj + KEYSTART));
}
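The "partition first, then key bytes" ordering used by this method can be tried out with a small plain-Java sketch. Everything here is an assumption for illustration: Rec is a hypothetical record type, and compareBytes is a simple unsigned lexicographic comparison standing in for whatever comparator the job has configured.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PartitionThenKeySketch {
    // Hypothetical record: a partition number plus the serialized key bytes.
    static final class Rec {
        final int partition;
        final byte[] key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Byte-level key comparison: unsigned lexicographic order over two ranges.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    // Sort by partition first; only records in the same partition compare keys.
    static int compare(Rec x, Rec y) {
        if (x.partition != y.partition) {
            return Integer.compare(x.partition, y.partition);
        }
        return compareBytes(x.key, 0, x.key.length, y.key, 0, y.key.length);
    }

    // Returns the records as "partition:key" strings in sorted order.
    static List<String> sorted(List<Rec> recs) {
        List<Rec> copy = new ArrayList<>(recs);
        copy.sort(PartitionThenKeySketch::compare);
        List<String> out = new ArrayList<>();
        for (Rec r : copy) {
            out.add(r.partition + ":" + new String(r.key, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec(1, "b"));
        recs.add(new Rec(0, "z"));
        recs.add(new Rec(1, "a"));
        recs.add(new Rec(0, "c"));
        System.out.println(sorted(recs)); // [0:c, 0:z, 1:a, 1:b]
    }
}
```

Records end up grouped by partition, and the key comparator only decides the order inside each partition.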
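The comparator.compare call in that method matches the signature declared in RawComparator:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

So when we pass in xxxS, a subclass of WritableComparator, the method invoked here is the byte-level compare that xxxS inherits from WritableComparator. WritableComparator also has a second, overloaded compare method.

Here is the byte-level compare in the WritableComparator class: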
/** Optimization hook. Override this to make SequenceFile.Sorter's scream.
 *
 * The default implementation reads the data into two {@link
 * WritableComparable}s (using {@link
 * Writable#readFields(DataInput)}, then calls {@link
 * #compare(WritableComparable,WritableComparable)}.
 */
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    buffer.reset(b1, s1, l1);                   // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2);                   // parse key2
    key2.readFields(buffer);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2);                   // compare them
}
As far as I can tell, the first part reads the two keys, key1 and key2, out of the byte arrays.
The call that finally does the work is compare(key1, key2):
/** Compare two WritableComparables.
 *
 * The default implementation uses the natural ordering, calling {@link
 * Comparable#compareTo(Object)}. */
@SuppressWarnings("unchecked")
public int compare(WritableComparable a, WritableComparable b) {
  return a.compareTo(b);
}
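At this point the compareTo method declared by WritableComparable is called, and that is the method we have overridden.

(The custom key class implements the WritableComparable interface and overrides its compareTo method.)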
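The chain "bytes, then deserialized keys, then compareTo" can be reproduced with java.io alone. The sketch below is a simplified stand-in, not Hadoop's WritableComparator: IntKey, serialize and the two compare overloads are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class DeserializeThenCompareSketch {
    // Hypothetical key holding one int, with a natural ordering.
    static final class IntKey implements Comparable<IntKey> {
        int value;

        void write(DataOutput out) throws IOException {
            out.writeInt(value);
        }

        void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }

        @Override
        public int compareTo(IntKey o) {
            return Integer.compare(value, o.value);
        }
    }

    // Serializes an int key to its 4-byte form.
    static byte[] serialize(int v) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            IntKey key = new IntKey();
            key.value = v;
            key.write(new DataOutputStream(bos));
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Byte-level entry point: rebuild both keys, then delegate to the object-level compare.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        IntKey key1 = new IntKey();
        IntKey key2 = new IntKey();
        try {
            key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
            key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return compare(key1, key2);
    }

    // Object-level compare: natural ordering through compareTo.
    static int compare(IntKey a, IntKey b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        byte[] a = serialize(-1);
        byte[] b = serialize(1);
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

The keys -1 and 1 are chosen on purpose: read as unsigned bytes, -1 (FF FF FF FF) would sort after 1, so the result shows that the ordering really comes from compareTo on the rebuilt keys.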

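One more point: didn't we say earlier that whatever is passed to setSortComparatorClass must be RawComparator or one of its subtypes?

(1)

Suppose we define a custom key class, keyxxxS, that implements the WritableComparable interface and overrides compareTo.

In that case there is no need to call setSortComparatorClass at all.

getOutputKeyComparator then takes its last branch: return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);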
/**
 * Get the key class for the map output data. If it is not set, use the
 * (final) output key class. This allows the map output key class to be
 * different than the final output key class.
 *
 * @return the map output key class.
 */
public Class<?> getMapOutputKeyClass() {
  Class<?> retv = getClass(JobContext.MAP_OUTPUT_KEY_CLASS, null, Object.class);
  if (retv == null) {
    retv = getOutputKeyClass();
  }
  return retv;
}
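As the name suggests, this returns the class of the map output key, that is, the one passed to job.setMapOutputKeyClass(xxx.class), such as Text or our custom keyxxxS. If it was not set, the final output key class is used instead.
How to define the custom key class keyxxxS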

Declaration of the WritableComparable interface:
public interface WritableComparable<T> extends Writable, Comparable<T>

The Javadoc of Writable describes it as follows:
 * A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
 *
 * Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
 * framework implements this interface.

In other words, the type of every key and every value should implement this interface, for example Text and IntWritable.

We can check the source of the Text class to confirm:

public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {}

The Javadoc continues:
 * Implementations typically implement a static <code>read(DataInput)</code>
 * method which constructs a new instance, calls {@link #readFields(DataInput)}
 * and returns the instance.

That is, an implementing class usually provides a static read method that builds a new instance, calls readFields on it, and returns it.
The comments also give a complete example:
Example:
 * <blockquote><pre>
 *     public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
 *       // Some data
 *       private int counter;
 *       private long timestamp;
 *
 *       public void write(DataOutput out) throws IOException {
 *         out.writeInt(counter);
 *         out.writeLong(timestamp);
 *       }
 *
 *       public void readFields(DataInput in) throws IOException {
 *         counter = in.readInt();
 *         timestamp = in.readLong();
 *       }
 *
 *       public int compareTo(MyWritableComparable o) {
 *         int thisValue = this.value;
 *         int thatValue = o.value;
 *         return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
 *       }
 *
 *       public int hashCode() {
 *         final int prime = 31;
 *         int result = 1;
 *         result = prime * result + counter;
 *         result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
 *         return result
 *       }
 *     }
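Note that the quoted example compares a field named value that it never declares. The following plain-Java sketch shows the same shape of key class in a form that compiles and can be round-tripped through java.io. It is an illustration only: MyKeySketch does not implement Hadoop's WritableComparable, and comparing counter first and timestamp second is an assumption, not something the Javadoc specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class MyKeySketch implements Comparable<MyKeySketch> {
    private int counter;
    private long timestamp;

    MyKeySketch() {
    }

    MyKeySketch(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Read the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Static factory in the style the Javadoc describes: new instance, readFields, return.
    static MyKeySketch read(DataInput in) throws IOException {
        MyKeySketch key = new MyKeySketch();
        key.readFields(in);
        return key;
    }

    // Assumed ordering: by counter, then by timestamp.
    @Override
    public int compareTo(MyKeySketch o) {
        int byCounter = Integer.compare(counter, o.counter);
        return byCounter != 0 ? byCounter : Long.compare(timestamp, o.timestamp);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof MyKeySketch)) {
            return false;
        }
        MyKeySketch other = (MyKeySketch) obj;
        return counter == other.counter && timestamp == other.timestamp;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }

    // Writes the key to bytes and reads it back, to check write and readFields agree.
    static MyKeySketch roundTrip(MyKeySketch key) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bos));
            return read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        MyKeySketch key = new MyKeySketch(7, 1234L);
        System.out.println(roundTrip(key).equals(key));
    }
}
```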
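(2)

If we instead write a custom comparator class xxxS, it extends WritableComparator and overrides its compare method,

and we must also call job.setSortComparatorClass(xxxS.class).

(This too yields an implementation of RawComparator: the inherited byte-level compare deserializes the two keys and then calls our overridden compare, which can in turn use the keys' compareTo.)

How to define the custom comparator class xxxS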
class WritableComparator implements RawComparator, Configurable

 * A Comparator for {@link WritableComparable}s.
 *
 * This base implementation uses the natural ordering. To define alternate
 * orderings, override {@link #compare(WritableComparable,WritableComparable)}.
 *
 * One may optimize compare-intensive operations by overriding
 * {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
 * provided to assist in optimized implementations of this method.

WritableComparator is a comparator for WritableComparable objects.

This base implementation sorts by natural ordering. To define a different ordering, override compare(WritableComparable, WritableComparable); overriding the byte-level compare is an optional optimization.
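How an overridden object-level compare changes the result of the inherited byte-level compare can be sketched without Hadoop. BaseComparator and DescendingComparator below are hypothetical stand-ins for WritableComparator and a custom xxxS, fixed to int keys for brevity.

```java
import java.nio.ByteBuffer;

class CustomComparatorSketch {
    // Hypothetical stand-in for WritableComparator, fixed to 4-byte int keys.
    static class BaseComparator {
        // Byte-level entry point: decode both keys, then delegate to the object-level hook.
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int k1 = ByteBuffer.wrap(b1, s1, l1).getInt();
            int k2 = ByteBuffer.wrap(b2, s2, l2).getInt();
            return compare(k1, k2);
        }

        // Object-level hook: natural ordering by default.
        public int compare(int a, int b) {
            return Integer.compare(a, b);
        }
    }

    // Hypothetical custom comparator: overrides only the hook to sort in descending order.
    static class DescendingComparator extends BaseComparator {
        @Override
        public int compare(int a, int b) {
            return Integer.compare(b, a);
        }
    }

    // Encodes an int as 4 big-endian bytes.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        byte[] one = encode(1);
        byte[] two = encode(2);
        System.out.println(new BaseComparator().compare(one, 0, 4, two, 0, 4) < 0);
        System.out.println(new DescendingComparator().compare(one, 0, 4, two, 0, 4) > 0);
    }
}
```

The caller only ever uses the byte-level method; dynamic dispatch is what routes it to the overridden ordering.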

Original article: https://www.cnblogs.com/xuanlvshu/p/5748098.html