Custom Grouping Classes in MapReduce (Part 3)

The Job class
/**
 * Define the comparator that controls which keys are grouped together
 * for a single call to
 * {@link Reducer#reduce(Object, Iterable,
 *                       org.apache.hadoop.mapreduce.Reducer.Context)}
 * @param cls the raw comparator to use
 * @throws IllegalStateException if the job is submitted
 * @see #setCombinerKeyGroupingComparatorClass(Class)
 */
public void setGroupingComparatorClass(Class<? extends RawComparator> cls
                                       ) throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setOutputValueGroupingComparator(cls);
}
The JobConf class

The setOutputValueGroupingComparator method in JobConf:
/**
 * Set the user defined {@link RawComparator} comparator for
 * grouping keys in the input to the reduce.
 *
 * This comparator should be provided if the equivalence rules for keys
 * for sorting the intermediates are different from those for grouping keys
 * before each call to
 * {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
 *
 * For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
 * in a single call to the reduce function if K1 and K2 compare as equal.
 *
 * Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
 * how keys are sorted, this can be used in conjunction to simulate
 * secondary sort on values.
 *
 * Note: This is not a guarantee of the reduce sort being
 * stable in any sense. (In any case, with the order of available
 * map-outputs to the reduce being non-deterministic, it wouldn't make
 * that much sense.)
 *
 * @param theClass the comparator class to be used for grouping keys.
 *                 It should implement <code>RawComparator</code>.
 * @see #setOutputKeyComparatorClass(Class)
 * @see #setCombinerKeyGroupingComparator(Class)
 */
public void setOutputValueGroupingComparator(
    Class<? extends RawComparator> theClass) {
  setClass(JobContext.GROUP_COMPARATOR_CLASS,
           theClass, RawComparator.class);
}
Press Ctrl+O

and find getOutputValueGroupingComparator:
/**
 * Get the user defined {@link WritableComparable} comparator for
 * grouping keys of inputs to the reduce.
 *
 * @return comparator set by the user for grouping values.
 * @see #setOutputValueGroupingComparator(Class) for details.
 */
public RawComparator getOutputValueGroupingComparator() {
  Class<? extends RawComparator> theClass = getClass(
      JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
  if (theClass == null) {
    return getOutputKeyComparator();
  }
  return ReflectionUtils.newInstance(theClass, this);
}
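So who calls getOutputValueGroupingComparator?

The ReduceTask class

In ReduceTask:

(No comparator field is declared here; the return value is simply received into a local variable.)
RawComparator comparator = job.getOutputValueGroupingComparator();
When a grouping class has been set, the comparator obtained here is our custom xxxG.

Next, look for where comparator is used: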
if (useNewApi) {
  runNewReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
} else {
  runOldReducer(job, umbilical, reporter, rIter, comparator,
                keyClass, valueClass);
}
The two branches exist because there is a new API and an old API.
Here we follow the runNewReducer method:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reporter.setProgress(rawIter.getProgress().getProgress());
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
        getTaskID(), reporter);
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
    (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
    new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
  job.setBoolean("mapred.skip.on", isSkipping());
  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  org.apache.hadoop.mapreduce.Reducer.Context
       reducerContext = createReduceContext(reducer, job, getTaskID(),
                                             rIter, reduceInputKeyCounter,
                                             reduceInputValueCounter,
                                             trackedRW,
                                             committer,
                                             reporter, comparator, keyClass,
                                             valueClass);
  try {
    reducer.run(reducerContext);
  } finally {
    trackedRW.close(reducerContext);
  }
}
runNewReducer receives the comparator parameter and passes it on to createReduceContext.
The Task class

The createReduceContext method in Task:
@SuppressWarnings("unchecked")
protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
createReduceContext(org.apache.hadoop.mapreduce.Reducer
                      <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                    Configuration job,
                    org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                    RawKeyValueIterator rIter,
                    org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                    org.apache.hadoop.mapreduce.Counter inputValueCounter,
                    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                    org.apache.hadoop.mapreduce.OutputCommitter committer,
                    org.apache.hadoop.mapreduce.StatusReporter reporter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass, Class<INVALUE> valueClass
) throws IOException, InterruptedException {
  org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
  reduceContext =
    new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
                                                            rIter,
                                                            inputKeyCounter,
                                                            inputValueCounter,
                                                            output,
                                                            committer,
                                                            reporter,
                                                            comparator,
                                                            keyClass,
                                                            valueClass);
  // (the excerpt stops here; the rest of the method is not shown)
The ReduceContextImpl class
In ReduceContextImpl, find the constructor:

public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
                         RawKeyValueIterator input,
                         Counter inputKeyCounter,
                         Counter inputValueCounter,
                         RecordWriter<KEYOUT,VALUEOUT> output,
                         OutputCommitter committer,
                         StatusReporter reporter,
                         RawComparator<KEYIN> comparator,
                         Class<KEYIN> keyClass,
                         Class<VALUEIN> valueClass
                         ) throws InterruptedException, IOException {
  super(conf, taskid, output, committer, reporter);
  this.input = input;
  this.inputKeyCounter = inputKeyCounter;
  this.inputValueCounter = inputValueCounter;
  this.comparator = comparator;
  this.serializationFactory = new SerializationFactory(conf);
  this.keyDeserializer = serializationFactory.getDeserializer(keyClass);
  this.keyDeserializer.open(buffer);
  this.valueDeserializer = serializationFactory.getDeserializer(valueClass);
  this.valueDeserializer.open(buffer);
  hasMore = input.next();
  this.keyClass = keyClass;
  this.valueClass = valueClass;
  this.conf = conf;
  this.taskid = taskid;
}

Search for comparator inside ReduceContextImpl:
/**
 * Advance to the next key/value pair.
 */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  if (!hasMore) {
    key = null;
    value = null;
    return false;
  }
  firstValue = !nextKeyIsSame;
  DataInputBuffer nextKey = input.getKey();
  currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
                    nextKey.getLength() - nextKey.getPosition());
  buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
  key = keyDeserializer.deserialize(key);
  DataInputBuffer nextVal = input.getValue();
  buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
      - nextVal.getPosition());
  value = valueDeserializer.deserialize(value);

  currentKeyLength = nextKey.getLength() - nextKey.getPosition();
  currentValueLength = nextVal.getLength() - nextVal.getPosition();

  if (isMarked) {
    backupStore.write(nextKey, nextVal);
  }

  hasMore = input.next();
  if (hasMore) {
    nextKey = input.getKey();
    nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                       currentRawKey.getLength(),
                                       nextKey.getData(),
                                       nextKey.getPosition(),
                                       nextKey.getLength() - nextKey.getPosition()
                                       ) == 0;
  } else {
    nextKeyIsSame = false;
  }
  inputValueCounter.increment(1);
  return true;
}

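The compare method called here is the one declared in the RawComparator interface:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

Common key types such as Text and IntWritable generally come with an implementation of this method (through their comparator classes).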
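To make the role of nextKeyIsSame concrete, here is a minimal plain-Java sketch of the same idea, with no Hadoop dependency. The names AdjacentGroupingSketch, RawCompare, FULL_KEY and group are hypothetical and exist only for this illustration; RawCompare merely mirrors the parameter shape of the byte-level compare method, and keys are assumed to be UTF-8 strings that are already sorted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the grouping loop; no Hadoop classes are used.
class AdjacentGroupingSketch {

    // Hypothetical stand-in with the same shape as the byte-level compare method.
    interface RawCompare {
        int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    }

    // Unsigned lexicographic comparison over the full serialized key.
    static final RawCompare FULL_KEY = (b1, s1, l1, b2, s2, l2) -> {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    };

    // Walks keys that are already sorted and closes the current group whenever
    // the comparator reports that the next key differs from the current one.
    static List<List<String>> group(List<String> sortedKeys, RawCompare comparator) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i++) {
            current.add(sortedKeys.get(i));
            boolean hasMore = i + 1 < sortedKeys.size();
            boolean nextKeyIsSame = false;
            if (hasMore) {
                byte[] cur = sortedKeys.get(i).getBytes(StandardCharsets.UTF_8);
                byte[] next = sortedKeys.get(i + 1).getBytes(StandardCharsets.UTF_8);
                nextKeyIsSame =
                    comparator.compare(cur, 0, cur.length, next, 0, next.length) == 0;
            }
            if (!nextKeyIsSame) {
                groups.add(current);          // one reduce() call would end here
                current = new ArrayList<>();
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "c", "c");
        System.out.println(group(keys, FULL_KEY)); // prints [[a, a], [b], [c, c]]
    }
}
```

Only adjacent keys are ever compared, which is why the grouping rule has to be compatible with the order produced by the sort.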

(1) No grouping comparator set
if (theClass == null) {
  return getOutputKeyComparator();
}
/**
 * Get the {@link RawComparator} comparator used to compare keys.
 *
 * @return the {@link RawComparator} comparator used to compare keys.
 */
public RawComparator getOutputKeyComparator() {
  Class<? extends RawComparator> theClass = getClass(
      JobContext.KEY_COMPARATOR, null, RawComparator.class);
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
  return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
}
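Without job.setGroupingComparatorClass(xxxG.class), the default applies: grouping uses the comparison defined for the map output key class, for example the one in Text.

So by default, grouping calls the sort comparator (more precisely, its compare method).

(There are two kinds of comparator here:

1 the compareTo method of the key's class

2 the compare method of a custom comparator class

)

Whichever of 1 or 2 we use, grouping and sorting then follow the same set of rules, which is often not what we want.

So we generally define a custom grouping class in addition to a custom comparator class.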
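The difference between sharing one rule and using a separate grouping rule can be sketched in plain Java. This is an illustration only, not Hadoop code: PairKey, SORT, GROUP_BY_ID and sortAndGroup are hypothetical names, and java.util.Comparator stands in for the sort and grouping comparators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of sort rules versus grouping rules; not Hadoop code.
class SecondarySortSketch {

    // Hypothetical composite key: a natural id plus the value to be ordered.
    static final class PairKey {
        final String id;
        final int value;
        PairKey(String id, int value) { this.id = id; this.value = value; }
        @Override public String toString() { return id + ":" + value; }
    }

    // Sort rule: by id, then by value ascending.
    static final Comparator<PairKey> SORT =
        Comparator.comparing((PairKey k) -> k.id).thenComparingInt(k -> k.value);

    // Grouping rule: by id only.
    static final Comparator<PairKey> GROUP_BY_ID =
        Comparator.comparing((PairKey k) -> k.id);

    // Sorts with one comparator, then splits into groups with another
    // by comparing each key with the last key of the current group.
    static List<List<PairKey>> sortAndGroup(List<PairKey> keys,
                                            Comparator<PairKey> sort,
                                            Comparator<PairKey> grouping) {
        List<PairKey> sorted = new ArrayList<>(keys);
        sorted.sort(sort);
        List<List<PairKey>> groups = new ArrayList<>();
        for (PairKey k : sorted) {
            List<PairKey> last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last != null && grouping.compare(last.get(last.size() - 1), k) == 0) {
                last.add(k);
            } else {
                List<PairKey> fresh = new ArrayList<>();
                fresh.add(k);
                groups.add(fresh);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<PairKey> keys = Arrays.asList(
            new PairKey("b", 2), new PairKey("a", 3),
            new PairKey("a", 1), new PairKey("b", 1));
        // Same rule for both: prints [[a:1], [a:3], [b:1], [b:2]]
        System.out.println(sortAndGroup(keys, SORT, SORT));
        // Separate grouping rule: prints [[a:1, a:3], [b:1, b:2]]
        System.out.println(sortAndGroup(keys, SORT, GROUP_BY_ID));
    }
}
```

With a shared rule every distinct (id, value) pair forms its own group; with the id-only grouping rule each id forms one group whose members arrive already ordered by value.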

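(2) Grouping comparator set
return ReflectionUtils.newInstance(theClass, this);
If we call job.setGroupingComparatorClass(xxxG.class), an instance of our custom grouping class xxxG is created.

xxxG has to implement RawComparator; the usual way is to extend WritableComparator and override the compare method.

For example:

public static class SelfGroupComparator extends WritableComparator {
  // override the compare method here
}

It is then invoked through the same compare call traced above.

I recommend approach 2 (the compare method of a custom comparator class).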
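As a rough idea of what an overridden byte-level compare might look at, the sketch below compares only a leading id field of a serialized key. It is plain Java and deliberately not a WritableComparator subclass: PrefixGroupingSketch, serialize and compareIdOnly are hypothetical names, and the key layout (two big-endian ints, id then value) is an assumption made for this example.

```java
import java.nio.ByteBuffer;

// Plain-Java sketch of a byte-level grouping rule; not a Hadoop class.
class PrefixGroupingSketch {

    // Serializes a hypothetical composite key as two big-endian ints: id, then value.
    static byte[] serialize(int id, int value) {
        return ByteBuffer.allocate(8).putInt(id).putInt(value).array();
    }

    // Same parameter shape as the byte-level compare method, but only
    // the leading 4-byte id takes part in the comparison.
    static int compareIdOnly(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int id1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int id2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(id1, id2);
    }

    public static void main(String[] args) {
        byte[] k1 = serialize(7, 100);
        byte[] k2 = serialize(7, 200);
        byte[] k3 = serialize(8, 100);
        // Same id, different value: treated as the same group.
        System.out.println(compareIdOnly(k1, 0, k1.length, k2, 0, k2.length)); // prints 0
        // Different id: a new group starts.
        System.out.println(compareIdOnly(k1, 0, k1.length, k3, 0, k3.length)); // prints -1
    }
}
```

Keys that differ only in the trailing value compare as equal here, so their values would be handed to a single reduce call.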
Alt+Left Arrow returns to the previously viewed source location.

Original article: https://www.cnblogs.com/xuanlvshu/p/5748428.html