  • Analysis of ReduceTask execution in hadoop-mapreduce


    ReduceTask execution

    The reduce side performs three kinds of processing:

    1. copy: fetch data from each map.

    2. sort: sort the fetched data.

    3. reduce: run the business logic.

    ReduceTask execution also starts from its run method. The shuffle plugin is configured through
    mapreduce.job.reduce.shuffle.consumer.plugin.class; the default is the Shuffle class, which
    implements the ShuffleConsumerPlugin interface. A Shuffle instance is created and its init
    function is called to initialize it.

    Class<? extends ShuffleConsumerPlugin> clazz =
        job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
    shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
    LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);

    ShuffleConsumerPlugin.Context shuffleContext =
        new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,
            super.lDirAlloc, reporter, codec,
            combinerClass, combineCollector,
            spilledRecordsCounter, reduceCombineInputCounter,
            shuffledMapsCounter,
            reduceShuffleBytes, failedShuffleCounter,
            mergedMapOutputsCounter,
            taskStatus, copyPhase, sortPhase, this,
            mapOutputFile, localMapFiles);
    shuffleConsumerPlugin.init(shuffleContext);
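    For illustration only, here is a minimal sketch (not from the original post; the plugin class
    name is hypothetical) of selecting a different shuffle plugin through the same configuration
    key the code above reads:

    import org.apache.hadoop.conf.Configuration;

    public class ShufflePluginConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The default is org.apache.hadoop.mapreduce.task.reduce.Shuffle.
        conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
                 "com.example.MyShuffleConsumerPlugin");  // hypothetical implementation
        System.out.println(conf.get("mapreduce.job.reduce.shuffle.consumer.plugin.class"));
      }
    }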

    The shuffle plugin's run function is then executed to obtain a RawKeyValueIterator instance:

    rIter = shuffleConsumerPlugin.run();


    The Shuffle.run function is defined as follows:

    .....................................

    int eventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH,
        MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());
    int maxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);

    A map-completion event fetcher thread is created and started. It fetches all completed map
    events of this job from the AM and, through the ShuffleSchedulerImpl instance, records each
    completed map's host, map id, etc. into the mapLocations container. The thread performs one
    fetch per second.

    // Start the map-completion events fetcher thread
    final EventFetcher<K, V> eventFetcher =
        new EventFetcher<K, V>(reduceId, umbilical, scheduler, this,
            maxEventsToFetch);
    eventFetcher.start();

    Next, let's look at how EventFetcher.run executes; in the code below only the main body is kept.

    ...................
    EventFetcher.run:

    public void run() {
      int failures = 0;
      ........................
      int numNewMaps = getMapCompletionEvents();
      ..................................
      }
      ......................
    }

    EventFetcher.getMapCompletionEvents:

    ..................................
    MapTaskCompletionEventsUpdate update =
        umbilical.getMapCompletionEvents(
            (org.apache.hadoop.mapred.JobID) reduce.getJobID(),
            fromEventIdx,
            maxEventsToFetch,
            (org.apache.hadoop.mapred.TaskAttemptID) reduce);
    events = update.getMapTaskCompletionEvents();
    .....................
    for (TaskCompletionEvent event : events) {
      scheduler.resolve(event);
      if (TaskCompletionEvent.Status.SUCCEEDED == event.getTaskStatus()) {
        ++numNewMaps;
      }
    }

    scheduler is an instance of ShuffleSchedulerImpl.

    ShuffleSchedulerImpl.resolve:

    case SUCCEEDED:
      URI u = getBaseURI(reduceId, event.getTaskTrackerHttp());
      addKnownMapOutput(u.getHost() + ":" + u.getPort(),
          u.toString(),
          event.getTaskAttemptId());
      maxMapRuntime = Math.max(maxMapRuntime, event.getTaskRunTime());
      break;
    .......

    The ShuffleSchedulerImpl.addKnownMapOutput function adds the map id and its host to the
    mapLocations container:

    MapHost host = mapLocations.get(hostName);
    if (host == null) {
      host = new MapHost(hostName, hostUrl);
      mapLocations.put(hostName, host);
    }

    At this point the host's state is set to PENDING:

    host.addKnownMap(mapId);

    The host is also added to the pendingHosts container, and the Fetcher copy threads are notified:

    // Mark the host as pending
    if (host.getState() == State.PENDING) {
      pendingHosts.add(host);
      notifyAll();
    }

    .....................


    Back in Shuffle.run, execution continues:

    // Start the map-output fetcher threads
    boolean isLocal = localMapFiles != null;

    The number of threads that fetch map data is configured by mapreduce.reduce.shuffle.parallelcopies
    (default 5); Fetcher thread instances are created and started. The connection timeout is
    configured by mapreduce.reduce.shuffle.connect.timeout (default 180000 ms) and the read timeout
    by mapreduce.reduce.shuffle.read.timeout (default 180000 ms).

    final int numFetchers = isLocal ? 1 :
        jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
    Fetcher<K, V>[] fetchers = new Fetcher[numFetchers];
    if (isLocal) {
      fetchers[0] = new LocalFetcher<K, V>(jobConf, reduceId, scheduler,
          merger, reporter, metrics, this, reduceTask.getShuffleSecret(),
          localMapFiles);
      fetchers[0].start();
    } else {
      for (int i = 0; i < numFetchers; ++i) {
        fetchers[i] = new Fetcher<K, V>(jobConf, reduceId, scheduler, merger,
            reporter, metrics, this,
            reduceTask.getShuffleSecret());
        fetchers[i].start();
      }
    }

    .........................
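    As a side note, here is a minimal, hedged sketch (not from the original post; the values are
    illustrative, not recommendations) of setting these shuffle knobs in a driver configuration:

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // number of parallel Fetcher threads per reduce (the code above falls back to 5)
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // connect/read timeouts in milliseconds (default 180000)
        conf.setInt("mapreduce.reduce.shuffle.connect.timeout", 180000);
        conf.setInt("mapreduce.reduce.shuffle.read.timeout", 180000);
        System.out.println(conf.get("mapreduce.reduce.shuffle.parallelcopies"));
      }
    }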


    Next, inside the Fetcher thread, let's look at the execution flow of Fetcher.run:

    ..........................
    MapHost host = null;
    try {
      // If merge is on, block
      merger.waitForResource();

    Take a MapHost instance from the ShuffleScheduler:

      // Get a host to shuffle from
      host = scheduler.getHost();
      metrics.threadBusy();

    Perform the shuffle:

      // Shuffle
      copyFromHost(host);
    } finally {
      if (host != null) {
        scheduler.freeHost(host);
        metrics.threadFree();
      }
    }

    Next, the getHost function of the ShuffleScheduler:

    ........
    If pendingHosts is empty, it waits until the EventFetcher thread fetches data and calls notify:

    while (pendingHosts.isEmpty()) {
      wait();
    }

    MapHost host = null;
    Iterator<MapHost> iter = pendingHosts.iterator();

    A MapHost is picked at random from pendingHosts and returned to the caller:

    int numToPick = random.nextInt(pendingHosts.size());
    for (int i = 0; i <= numToPick; ++i) {
      host = iter.next();
    }

    pendingHosts.remove(host);
    ........................
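    The hand-off between the EventFetcher and the Fetcher threads is a classic wait/notify
    producer-consumer pattern. A standalone sketch (plain Java with assumed names, not the Hadoop
    classes) of that pattern:

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class PendingHosts<H> {
      private final Queue<H> pending = new ArrayDeque<H>();

      public synchronized void addKnownHost(H host) {
        pending.add(host);
        notifyAll();                 // wake any fetcher blocked in getHost()
      }

      public synchronized H getHost() throws InterruptedException {
        while (pending.isEmpty()) {
          wait();                    // parked until addKnownHost() publishes a host
        }
        return pending.poll();
      }
    }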

    Once a MapHost is obtained, copyFromHost is executed to copy the data. At this point the host
    URL for a task looks roughly like this:

    host:port/mapOutput?job=xxx&reduce=123(the current reduce partition id)&map=

    The code of copyFromHost:

    .....
    List<TaskAttemptID> maps = scheduler.getMapsForHost(host);
    .....
    Set<TaskAttemptID> remaining = new HashSet<TaskAttemptID>(maps);
    .....

    After this step, the URL carries many map ids after map=, separated by commas:

    URL url = getMapOutputURL(host, maps);

    Here an HTTP connection is opened for the URL; if mapreduce.shuffle.ssl.enabled is set to true
    (default false), an SSL connection is used.

    openConnection(url);
    .....

    The connection timeout, headers, read timeout, etc. are set, and the HTTP connection is opened:

    // put url hash into http header
    connection.addRequestProperty(
        SecureShuffleUtils.HTTP_HEADER_URL_HASH, encHash);
    // set the read timeout
    connection.setReadTimeout(readTimeout);
    // put shuffle version into http header
    connection.addRequestProperty(ShuffleHeader.HTTP_HEADER_NAME,
        ShuffleHeader.DEFAULT_HTTP_HEADER_NAME);
    connection.addRequestProperty(ShuffleHeader.HTTP_HEADER_VERSION,
        ShuffleHeader.DEFAULT_HTTP_HEADER_VERSION);
    connect(connection, connectionTimeout);
    .....
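    To make the URL shape above concrete, here is a hedged sketch (host, port, and attempt ids are
    made up; the real logic lives in Fetcher.getMapOutputURL) of how such a fetch URL is assembled:

    public class MapOutputUrlSketch {
      public static void main(String[] args) {
        String base = "http://nodemanager-host:13562/mapOutput"
            + "?job=job_1400000000000_0001&reduce=3&map=";
        StringBuilder url = new StringBuilder(base);
        String sep = "";
        String[] mapAttempts = {
            "attempt_1400000000000_0001_m_000000_0",
            "attempt_1400000000000_0001_m_000001_0"};
        for (String mapId : mapAttempts) {
          url.append(sep).append(mapId);  // comma-separated list of map attempt ids
          sep = ",";
        }
        System.out.println(url);
      }
    }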

    The file copy is then performed iteratively: each iteration reads one map's output and removes
    one entry from remaining, until everything in remaining has been read.

    TaskAttemptID[] failedTasks = null;
    while (!remaining.isEmpty() && failedTasks == null) {

    copyMapOutput reads one map id at a time. Based on the reserve function of MergeManagerImpl:

    1. It checks whether the map output is too large to be shuffled into memory. The total shuffle
    memory is configured by mapreduce.reduce.memory.totalbytes, whose default is the current
    Runtime maxMemory * mapreduce.reduce.shuffle.input.buffer.percent (the buffer percent defaults
    to 0.90 in this code). If the map output is too large, an OnDiskMapOutput instance is created.

    2. Otherwise, an InMemoryMapOutput instance is created.

      failedTasks = copyMapOutput(host, input, remaining);
    }

    copyMapOutput first calls MergeManagerImpl.reserve:

    if (!canShuffleToMemory(requestedSize)) {
      .....
      return new OnDiskMapOutput<K, V>(mapId, reduceId, this, requestedSize,
          jobConf, mapOutputFile, fetcher, true);
    }
    .....
    if (usedMemory > memoryLimit) {

    ....., the memory currently in use already exceeds the configured limit; null is returned and
    the host is later re-added to the shuffleScheduler's pendingHosts queue.

      return null;
    }
    return unconditionalReserve(mapId, requestedSize, true);

    unconditionalReserve creates an InMemoryMapOutput and adds the size of this map output to usedMemory:

    private synchronized InMemoryMapOutput<K, V> unconditionalReserve(
        TaskAttemptID mapId, long requestedSize, boolean primaryMapOutput) {
      usedMemory += requestedSize;
      return new InMemoryMapOutput<K, V>(jobConf, mapId, this, (int) requestedSize,
          codec, primaryMapOutput);
    }


    Below is the handling when usedMemory exceeds the specified limit: the host is put back onto
    the queue. In copyMapOutput:

    if (mapOutput == null) {
      LOG.info("fetcher#" + id + " - MergeManager returned status WAIT ...");
      // Not an error but wait to process data.
      return EMPTY_ATTEMPT_ID_ARRAY;
    }

    At this point the host still has unprocessed map outputs, so in Fetcher.run the host is
    re-added to the queue:

    if (host != null) {
      scheduler.freeHost(host);
      metrics.threadFree();
    }

    .........

    Still in copyMapOutput, the shuffle function of the mapOutput (the instance returned by
    merger.reserve) is invoked. If the mapOutput is an InMemoryMapOutput, shuffle writes the map
    output directly into memory; if it is an OnDiskMapOutput, shuffle writes it into a local
    temporary file.
    ....
    Finally, ShuffleScheduler.copySucceeded is executed to finish the copy; it calls mapOutput.commit:

    scheduler.copySucceeded(mapId, host, compressedLength,
        endTime - startTime, mapOutput);

    The processed map id is then removed from remaining.


    Next, the MapOutput.commit implementations:

      a. InMemoryMapOutput.commit:

      public void commit() throws IOException {
        merger.closeInMemoryFile(this);
      }

    This calls MergeManagerImpl.closeInMemoryFile:

    public synchronized void closeInMemoryFile(InMemoryMapOutput<K, V> mapOutput) {

    The mapOutput instance is added to the inMemoryMapOutputs list:

      inMemoryMapOutputs.add(mapOutput);
      LOG.info("closeInMemoryFile -> map-output of size: " + mapOutput.getSize()
          + ", inMemoryMapOutputs.size() -> " + inMemoryMapOutputs.size()
          + ", commitMemory -> " + commitMemory + ", usedMemory ->" + usedMemory);

    commitMemory is increased by the size of the incoming map output:

      commitMemory += mapOutput.getSize();

    Then it checks whether the merge threshold has been reached. The threshold is the
    mapreduce.reduce.memory.totalbytes value * mapreduce.reduce.shuffle.merge.percent, which by
    default works out to the current Runtime memory * 0.90 * 0.90 (for example, with a 1 GB reduce
    heap this is roughly 1024 MB * 0.90 * 0.90 ≈ 829 MB). In other words, as new map outputs keep
    arriving, this condition will sooner or later be met.

      // Can hang if mergeThreshold is really low.
      if (commitMemory >= mergeThreshold) {
        .......

    The map outputs already merged in memory are added back in, and a merge is started:

        inMemoryMapOutputs.addAll(inMemoryMergedMapOutputs);
        inMemoryMergedMapOutputs.clear();
        inMemoryMerger.startMerge(inMemoryMapOutputs);
        commitMemory = 0L;  // Reset commitMemory.
      }

    If mapreduce.reduce.merge.memtomem.enabled is set to true (default false) and the number of map
    outputs in inMemoryMapOutputs has reached the value of mapreduce.reduce.merge.memtomem.threshold
    (which defaults to the mapreduce.task.io.sort.factor value, 100 in this code), a memory-to-memory
    merge is started:

      if (memToMemMerger != null) {
        if (inMemoryMapOutputs.size() >= memToMemMergeOutputsThreshold) {
          memToMemMerger.startMerge(inMemoryMapOutputs);
        }
      }
    }
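    A minimal, hedged sketch (not from the original post; the values are illustrative only) of the
    merge-related settings discussed above:

    import org.apache.hadoop.conf.Configuration;

    public class MergeTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // fraction of the reduce heap usable for shuffled map outputs
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // fraction of that buffer at which the in-memory merge is triggered
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // optional memory-to-memory merge and its threshold
        conf.setBoolean("mapreduce.reduce.merge.memtomem.enabled", false);
        conf.setInt("mapreduce.reduce.merge.memtomem.threshold", 1000);
        System.out.println(conf.get("mapreduce.reduce.shuffle.merge.percent"));
      }
    }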


    The MergeManagerImpl.InMemoryMerger.merge function: after inMemoryMerger.startMerge(inMemoryMapOutputs)
    is executed, this thread is notified and its merge function runs:

    public void merge(List<InMemoryMapOutput<K, V>> inputs) throws IOException {
      if (inputs == null || inputs.size() == 0) {
        return;
      }
      ....................
      TaskAttemptID mapId = inputs.get(0).getMapId();
      TaskID mapTaskId = mapId.getTaskID();

      List<Segment<K, V>> inMemorySegments = new ArrayList<Segment<K, V>>();

    InMemoryReader-backed segments are created, the incoming container is cleared, and the generated
    segments are placed into inMemorySegments:

      long mergeOutputSize =
          createInMemorySegments(inputs, inMemorySegments, 0);
      int noInMemorySegments = inMemorySegments.size();

    An output file path is generated:

      Path outputPath =
          mapOutputFile.getInputFileForWrite(mapTaskId,
              mergeOutputSize).suffix(
                  Task.MERGED_OUTPUT_PREFIX);

    A Writer instance is created for this temporary output file:

      Writer<K, V> writer =
          new Writer<K, V>(jobConf, rfs, outputPath,
              (Class<K>) jobConf.getMapOutputKeyClass(),
              (Class<V>) jobConf.getMapOutputValueClass(),
              codec, null);


      RawKeyValueIterator rIter = null;
      CompressAwarePath compressAwarePath;
      try {
        LOG.info("Initiating in-memory merge with " + noInMemorySegments +
            " segments...");

    This part is no different from the map-side output: Merger.merge returns an iterator over the
    segments backed by a priority heap, and every next() yields the smallest key/value among all
    segments (a standalone sketch of this k-way merge pattern is given after the closeOnDiskFile
    definition below).

        rIter = Merger.merge(jobConf, rfs,
            (Class<K>) jobConf.getMapOutputKeyClass(),
            (Class<V>) jobConf.getMapOutputValueClass(),
            inMemorySegments, inMemorySegments.size(),
            new Path(reduceId.toString()),
            (RawComparator<K>) jobConf.getOutputKeyComparator(),
            reporter, spilledRecordsCounter, null, null);

    If there is no combiner, the data is written directly to the file; otherwise the combiner is run first:

        if (null == combinerClass) {
          Merger.writeFile(rIter, writer, reporter, jobConf);
        } else {
          combineCollector.setWriter(writer);
          combineAndSpill(rIter, reduceCombineInputCounter);
        }
        writer.close();

    Here is where it differs from the map-side output: no spill index file is written. Instead a
    CompressAwarePath is created that records the output path together with the raw and compressed
    lengths:

        compressAwarePath = new CompressAwarePath(outputPath,
            writer.getRawLength(), writer.getCompressedLength());

        LOG.info(reduceId +
            " Merge of the " + noInMemorySegments +
            " files in-memory complete." +
            " Local file is " + outputPath + " of size " +
            localFS.getFileStatus(outputPath).getLen());
      } catch (IOException e) {
        // make sure that we delete the ondisk file that we created
        // earlier when we invoked cloneFileAttributes
        localFS.delete(outputPath, true);
        throw e;
      }

    Here the generated file is added to the onDiskMapOutputs field, and the container is checked to
    see whether the number of on-disk files has reached the threshold derived from
    mapreduce.task.io.sort.factor (2 * ioSortFactor - 1, as the code below shows); if so, a disk
    merge is started.

      // Note the output of the merge
      closeOnDiskFile(compressAwarePath);
    }

    }

    The full definition behind the last call above is:

    public synchronized void closeOnDiskFile(CompressAwarePath file) {
      onDiskMapOutputs.add(file);
      if (onDiskMapOutputs.size() >= (2 * ioSortFactor - 1)) {
        onDiskMerger.startMerge(onDiskMapOutputs);
      }
    }
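    As promised above, here is a standalone sketch (plain Java over integers, with assumed names;
    not Hadoop's Merger) of the k-way merge idea: a priority heap over the segment heads always
    yields the globally smallest element next.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMergeSketch {
      public static List<Integer> merge(List<List<Integer>> sortedSegments) {
        // heap entries are {value, segment index}, ordered by value
        PriorityQueue<int[]> heap =
            new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
        List<Iterator<Integer>> iterators = new ArrayList<Iterator<Integer>>();
        for (int i = 0; i < sortedSegments.size(); i++) {
          Iterator<Integer> it = sortedSegments.get(i).iterator();
          iterators.add(it);
          if (it.hasNext()) {
            heap.add(new int[] {it.next(), i});
          }
        }
        List<Integer> out = new ArrayList<Integer>();
        while (!heap.isEmpty()) {
          int[] head = heap.poll();                    // smallest head across all segments
          out.add(head[0]);
          Iterator<Integer> it = iterators.get(head[1]);
          if (it.hasNext()) {
            heap.add(new int[] {it.next(), head[1]});  // refill from the same segment
          }
        }
        return out;
      }
    }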


    b. OnDiskMapOutput.commit: the tmp file is renamed into the target directory, a CompressAwarePath
    instance is created, and the handler described above is invoked.

    public void commit() throws IOException {
      fs.rename(tmpOutputPath, outputPath);
      CompressAwarePath compressAwarePath = new CompressAwarePath(outputPath,
          getSize(), this.compressedSize);
      merger.closeOnDiskFile(compressAwarePath);
    }


    The MergeManagerImpl.OnDiskMerger.merge function: there is not much new to explain here; just
    note that after a merge, the path of the merged file is added back into the onDiskMapOutputs
    container.

    public void merge(List<CompressAwarePath> inputs) throws IOException {
      // sanity check
      if (inputs == null || inputs.isEmpty()) {
        LOG.info("No ondisk files to merge...");
        return;
      }
      long approxOutputSize = 0;
      int bytesPerSum =
          jobConf.getInt("io.bytes.per.checksum", 512);
      LOG.info("OnDiskMerger: We have " + inputs.size() +
          " map outputs on disk. Triggering merge...");

      // 1. Prepare the list of files to be merged.
      for (CompressAwarePath file : inputs) {
        approxOutputSize += localFS.getFileStatus(file).getLen();
      }

      // add the checksum length
      approxOutputSize +=
          ChecksumFileSystem.getChecksumLength(approxOutputSize, bytesPerSum);

      // 2. Start the on-disk merge process
      Path outputPath =
          localDirAllocator.getLocalPathForWrite(inputs.get(0).toString(),
              approxOutputSize, jobConf).suffix(Task.MERGED_OUTPUT_PREFIX);
      Writer<K, V> writer =
          new Writer<K, V>(jobConf, rfs, outputPath,
              (Class<K>) jobConf.getMapOutputKeyClass(),
              (Class<V>) jobConf.getMapOutputValueClass(),
              codec, null);
      RawKeyValueIterator iter = null;
      CompressAwarePath compressAwarePath;
      Path tmpDir = new Path(reduceId.toString());
      try {
        iter = Merger.merge(jobConf, rfs,
            (Class<K>) jobConf.getMapOutputKeyClass(),
            (Class<V>) jobConf.getMapOutputValueClass(),
            codec, inputs.toArray(new Path[inputs.size()]),
            true, ioSortFactor, tmpDir,
            (RawComparator<K>) jobConf.getOutputKeyComparator(),
            reporter, spilledRecordsCounter, null,
            mergedMapOutputsCounter, null);

        Merger.writeFile(iter, writer, reporter, jobConf);
        writer.close();
        compressAwarePath = new CompressAwarePath(outputPath,
            writer.getRawLength(), writer.getCompressedLength());
      } catch (IOException e) {
        localFS.delete(outputPath, true);
        throw e;
      }

      closeOnDiskFile(compressAwarePath);

      LOG.info(reduceId +
          " Finished merging " + inputs.size() +
          " map output files on disk of total-size " +
          approxOutputSize + "." +
          " Local output file is " + outputPath + " of size " +
          localFS.getFileStatus(outputPath).getLen());
    }
    }


    OK, the map copy part is now complete. Back in the ShuffleConsumerPlugin run method, i.e.
    Shuffle.run, continuing the analysis from where we left off: here it waits for all copies to
    finish.

    // Wait for shuffle to complete successfully
    while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
      reporter.progress();
      synchronized (this) {
        if (throwable != null) {
          throw new ShuffleError("error in shuffle in " + throwingThreadName,
              throwable);
        }
      }
    }

    If execution reaches this point, all map copies have finished; the thread that queries map
    completion status and the threads performing the copies are shut down.

    // Stop the event-fetcher thread
    eventFetcher.shutDown();
    // Stop the map-output fetcher threads
    for (Fetcher<K, V> fetcher : fetchers) {
      fetcher.shutDown();
    }
    // stop the scheduler
    scheduler.close();

    A status update is sent to the AM, telling it that the sort phase is about to run:

    copyPhase.complete();  // copy is already complete
    taskStatus.setPhase(TaskStatus.Phase.SORT);
    reduceTask.statusUpdate(umbilical);


    The final merge is then executed; in fact, merging all on-disk files with the in-memory data
    also performs the sort.

    // Finish the on-going merges...
    RawKeyValueIterator kvIter = null;
    try {
      kvIter = merger.close();
    } catch (Throwable e) {
      throw new ShuffleError("Error while doing final merge ", e);
    }

    // Sanity check
    synchronized (this) {
      if (throwable != null) {
        throw new ShuffleError("error in shuffle in " + throwingThreadName,
            throwable);
      }
    }

    Finally, the merged iterator instance is returned:

    return kvIter;


    merger here is the MergeManagerImpl; its close function:

    public RawKeyValueIterator close() throws Throwable {

    The merge threads are closed; closing waits for any ongoing merge to finish.

      // Wait for on-going merges to complete
      if (memToMemMerger != null) {
        memToMemMerger.close();
      }
      inMemoryMerger.close();
      onDiskMerger.close();
      List<InMemoryMapOutput<K, V>> memory =
          new ArrayList<InMemoryMapOutput<K, V>>(inMemoryMergedMapOutputs);
      inMemoryMergedMapOutputs.clear();
      memory.addAll(inMemoryMapOutputs);
      inMemoryMapOutputs.clear();
      List<CompressAwarePath> disk = new ArrayList<CompressAwarePath>(onDiskMapOutputs);
      onDiskMapOutputs.clear();

    The final merge is then performed:

      return finalMerge(jobConf, rfs, memory, disk);
    }

    The final merge operation:

    private RawKeyValueIterator finalMerge(JobConf job, FileSystem fs,
        List<InMemoryMapOutput<K, V>> inMemoryMapOutputs,
        List<CompressAwarePath> onDiskMapOutputs
        ) throws IOException {
      LOG.info("finalMerge called with " +
          inMemoryMapOutputs.size() + " in-memory map-outputs and " +
          onDiskMapOutputs.size() + " on-disk map-outputs");
      final float maxRedPer =
          job.getFloat(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT, 0f);
      if (maxRedPer > 1.0 || maxRedPer < 0.0) {
        throw new IOException(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT +
            maxRedPer);
      }

    Compute how much data may stay cached in memory; the ratio is configured by
    mapreduce.reduce.input.buffer.percent.

      int maxInMemReduce = (int) Math.min(
          Runtime.getRuntime().maxMemory() * maxRedPer, Integer.MAX_VALUE);


      // merge config params
      Class<K> keyClass = (Class<K>) job.getMapOutputKeyClass();
      Class<V> valueClass = (Class<V>) job.getMapOutputValueClass();
      boolean keepInputs = job.getKeepFailedTaskFiles();
      final Path tmpDir = new Path(reduceId.toString());
      final RawComparator<K> comparator =
          (RawComparator<K>) job.getOutputKeyComparator();

      // segments required to vacate memory
      List<Segment<K, V>> memDiskSegments = new ArrayList<Segment<K, V>>();
      long inMemToDiskBytes = 0;
      boolean mergePhaseFinished = false;
      if (inMemoryMapOutputs.size() > 0) {
        TaskID mapId = inMemoryMapOutputs.get(0).getMapId().getTaskID();

    Based on how much can be cached in memory, the part that cannot stay in memory is turned into
    InMemoryReader-backed segments and added to the memDiskSegments container.

        inMemToDiskBytes = createInMemorySegments(inMemoryMapOutputs,
            memDiskSegments,
            maxInMemReduce);
        final int numMemDiskSegments = memDiskSegments.size();

    The surplus in-memory map output data is written to a file, and the file path is added to the
    onDiskMapOutputs container.

        if (numMemDiskSegments > 0 &&
            ioSortFactor > onDiskMapOutputs.size()) {
          ...........

    This part mainly spills the surplus in-memory map outputs to disk.

          mergePhaseFinished = true;
          // must spill to disk, but can't retain in-mem for intermediate merge
          final Path outputPath =
              mapOutputFile.getInputFileForWrite(mapId,
                  inMemToDiskBytes).suffix(
                      Task.MERGED_OUTPUT_PREFIX);
          final RawKeyValueIterator rIter = Merger.merge(job, fs,
              keyClass, valueClass, memDiskSegments, numMemDiskSegments,
              tmpDir, comparator, reporter, spilledRecordsCounter, null,
              mergePhase);
          Writer<K, V> writer = new Writer<K, V>(job, fs, outputPath,
              keyClass, valueClass, codec, null);
          try {
            Merger.writeFile(rIter, writer, reporter, job);
            writer.close();
            onDiskMapOutputs.add(new CompressAwarePath(outputPath,
                writer.getRawLength(), writer.getCompressedLength()));
            writer = null;
            // add to list of final disk outputs.
          } catch (IOException e) {
            if (null != outputPath) {
              try {
                fs.delete(outputPath, true);
              } catch (IOException ie) {
                // NOTHING
              }
            }
            throw e;
          } finally {
            if (null != writer) {
              writer.close();
            }
          }
          LOG.info("Merged " + numMemDiskSegments + " segments, " +
              inMemToDiskBytes + " bytes to disk to satisfy " +
              "reduce memory limit");
          inMemToDiskBytes = 0;
          memDiskSegments.clear();
        } else if (inMemToDiskBytes != 0) {
          LOG.info("Keeping " + numMemDiskSegments + " segments, " +
              inMemToDiskBytes + " bytes in memory for " +
              "intermediate, on-disk merge");
        }
      }


      // segments on disk
      List<Segment<K, V>> diskSegments = new ArrayList<Segment<K, V>>();
      long onDiskBytes = inMemToDiskBytes;
      long rawBytes = inMemToDiskBytes;

    An onDisk array of all map output paths currently on disk is built.

      CompressAwarePath[] onDisk = onDiskMapOutputs.toArray(
          new CompressAwarePath[onDiskMapOutputs.size()]);
      for (CompressAwarePath file : onDisk) {
        long fileLength = fs.getFileStatus(file).getLen();
        onDiskBytes += fileLength;
        rawBytes += (file.getRawDataLength() > 0) ?
            file.getRawDataLength() : fileLength;

        LOG.debug("Disk file: " + file + " Length is " + fileLength);

    Every map output that the reduce side has received and stored on disk becomes a Segment that is
    added to the diskSegments container.

        diskSegments.add(new Segment<K, V>(job, fs, file, codec, keepInputs,
            (file.toString().endsWith(
                Task.MERGED_OUTPUT_PREFIX) ?
                    null : mergedMapOutputsCounter), file.getRawDataLength()
            ));
      }
      LOG.info("Merging " + onDisk.length + " files, " +
          onDiskBytes + " bytes from disk");

    The diskSegments container is sorted by segment length in ascending order.

      Collections.sort(diskSegments, new Comparator<Segment<K, V>>() {
        public int compare(Segment<K, V> o1, Segment<K, V> o2) {
          if (o1.getLength() == o2.getLength()) {
            return 0;
          }
          return o1.getLength() < o2.getLength() ? -1 : 1;
        }
      });

    All map outputs still held in memory are turned into segments and added to the finalSegments
    container.

      // build final list of segments from merged backed by disk + in-mem
      List<Segment<K, V>> finalSegments = new ArrayList<Segment<K, V>>();
      long inMemBytes = createInMemorySegments(inMemoryMapOutputs,
          finalSegments, 0);
      LOG.info("Merging " + finalSegments.size() + " segments, " +
          inMemBytes + " bytes from memory into reduce");
      if (0 != onDiskBytes) {
        final int numInMemSegments = memDiskSegments.size();
        diskSegments.addAll(0, memDiskSegments);
        memDiskSegments.clear();
        // Pass mergePhase only if there is a going to be intermediate
        // merges. See comment where mergePhaseFinished is being set
        Progress thisPhase = (mergePhaseFinished) ? null : mergePhase;

    This part builds an iterator over the map outputs currently on disk.

        RawKeyValueIterator diskMerge = Merger.merge(
            job, fs, keyClass, valueClass, codec, diskSegments,
            ioSortFactor, numInMemSegments, tmpDir, comparator,
            reporter, false, spilledRecordsCounter, null, thisPhase);
        diskSegments.clear();
        if (0 == finalSegments.size()) {
          return diskMerge;
        }

    The on-disk iterator is likewise added to the finalSegments container. At this point the
    container effectively holds two heap-ordered sources, and every next() must find the smallest
    key/value between memory and disk.

        finalSegments.add(new Segment<K, V>(
            new RawKVIteratorReader(diskMerge, onDiskBytes), true, rawBytes));
      }
      return Merger.merge(job, fs, keyClass, valueClass,
          finalSegments, finalSegments.size(), tmpDir,
          comparator, reporter, spilledRecordsCounter, null,
          null);
    }


    The shuffle part is now entirely done. Back in ReduceTask.run, the analysis continues:

    rIter = shuffleConsumerPlugin.run();
    ............
    RawComparator comparator = job.getOutputValueGroupingComparator();
    if (useNewApi) {
      runNewReducer(job, umbilical, reporter, rIter, comparator,
          keyClass, valueClass);
    } else {
      runOldReducer........
    }
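    The comparator fetched by getOutputValueGroupingComparator() above decides which consecutive
    keys are grouped into one reduce() call. As a hedged aside (not from the original post; the
    composite-key format "natural#secondary" is assumed), a typical grouping comparator for a
    secondary-sort job might look like this:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;

    public class NaturalKeyGroupingComparator extends WritableComparator {
      protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        // group composite "natural#secondary" keys by their natural part only
        String ka = a.toString().split("#", 2)[0];
        String kb = b.toString().split("#", 2)[0];
        return ka.compareTo(kb);
      }

      public static void configure(Job job) {
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
      }
    }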

    Calling runNewReducer there mainly ends up executing the reducer's run function:

    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
        new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
            getTaskID(), reporter);
    // make a reducer
    org.apache.hadoop.mapreduce.Reducer<INKEY, INVALUE, OUTKEY, OUTVALUE> reducer =
        (org.apache.hadoop.mapreduce.Reducer<INKEY, INVALUE, OUTKEY, OUTVALUE>)
            ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY, OUTVALUE> trackedRW =
        new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
    job.setBoolean("mapred.skip.on", isSkipping());
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.Reducer.Context
        reducerContext = createReduceContext(reducer, job, getTaskID(),
            rIter, reduceInputKeyCounter,
            reduceInputValueCounter,
            trackedRW,
            committer,
            reporter, comparator, keyClass,
            valueClass);
    try {
      reducer.run(reducerContext);
    } finally {
      trackedRW.close(reducerContext);
    }


    The code above creates the Context in which the Reducer runs and then executes reducer.run.
    Part of the createReduceContext definition:

    org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
        reduceContext =
            new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
                rIter,
                inputKeyCounter,
                inputValueCounter,
                output,
                committer,
                reporter,
                comparator,
                keyClass,
                valueClass);

    org.apache.hadoop.mapreduce.Reducer<INKEY, INVALUE, OUTKEY, OUTVALUE>.Context
        reducerContext =
            new WrappedReducer<INKEY, INVALUE, OUTKEY, OUTVALUE>().getReducerContext(
                reduceContext);

    ReduceContextImpl mainly handles reading the data from the RawKeyValueIterator.

    The Reducer.run function:

    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      try {
        while (context.nextKey()) {
          reduce(context.getCurrentKey(), context.getValues(), context);
          // If a back up store is used, reset it
          Iterator<VALUEIN> iter = context.getValues().iterator();
          if (iter instanceof ReduceContext.ValueIterator) {
            ((ReduceContext.ValueIterator<VALUEIN>) iter).resetBackupStore();
          }
        }
      } finally {
        cleanup(context);
      }
    }

    In run, context.nextKey advances to the next key; this work is mainly done in ReduceContextImpl.
    nextKey calls nextKeyValue:

    public boolean nextKeyValue() throws IOException, InterruptedException {
      if (!hasMore) {
        key = null;
        value = null;
        return false;
      }

    This checks whether we are at the first value under a key: firstValue is set to !nextKeyIsSame.
    When nextKeyIsSame is true, the record read by next() still belongs to the current key; when it
    is false, the key has changed.

      firstValue = !nextKeyIsSame;

    Ask the RawKeyValueIterator (i.e. the merge queue) for the current smallest key:

      DataInputBuffer nextKey = input.getKey();

    The key bytes are put into the buffer so that keyDeserializer can read a key value from it:

      currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
          nextKey.getLength() - nextKey.getPosition());
      buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());

    The key value is then read from the buffer and stored into key. This deserves attention; first
    look at how these pieces are defined:

    .........................

    A keyDeserializer instance is created:

    this.keyDeserializer = serializationFactory.getDeserializer(keyClass);

    and the buffer is used as the keyDeserializer's InputStream:

    this.keyDeserializer.open(buffer);

    The deserialize function of the Deserializer is defined as follows. From this definition you can
    see that only a single instance is created per key/value, mainly to reduce object creation for
    performance: every record is produced by calling readFields to repopulate the same Writable
    instance. This is why many people run into incorrect data when they keep references to value
    objects in reduce: the object is still the same object, but its content is that of the last
    record, so the stored data looks wrong (a short reducer sketch illustrating this follows after
    the nextKeyValue walkthrough).

    public Writable deserialize(Writable w) throws IOException {
      Writable writable;
      if (w == null) {
        writable
            = (Writable) ReflectionUtils.newInstance(writableClass, getConf());
      } else {
        writable = w;
      }
      writable.readFields(dataIn);
      return writable;
    }
    .........................

    Read the key content:

      key = keyDeserializer.deserialize(key);

    The current value is obtained the same way as the key:

      DataInputBuffer nextVal = input.getValue();
      buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
          - nextVal.getPosition());
      value = valueDeserializer.deserialize(value);

      currentKeyLength = nextKey.getLength() - nextKey.getPosition();
      currentValueLength = nextVal.getLength() - nextVal.getPosition();

    isMarked is false here, and the backupStore field is null:

      if (isMarked) {
        backupStore.write(nextKey, nextVal);
      }

    A next is performed on input, which again finds the smallest key/value among all files and memory:

      hasMore = input.next();
      if (hasMore) {

    Compare whether the next key equals the current key; if so, the records belong to the same group:

        nextKey = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
            currentRawKey.getLength(),
            nextKey.getData(),
            nextKey.getPosition(),
            nextKey.getLength() - nextKey.getPosition()
            ) == 0;
      } else {
        nextKeyIsSame = false;
      }
      inputValueCounter.increment(1);
      return true;
    }
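    As promised above, here is a minimal sketch (assumed Text/IntWritable types; not from the
    original post) of the object-reuse pitfall: the framework refills the same value instance on
    every iteration, so anything kept across iterations must be deep-copied.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CopyValuesReducer extends Reducer<Text, Text, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> buffered = new ArrayList<Text>();
        for (Text v : values) {
          // buffered.add(v) would leave N references to one reused object;
          // copy the contents instead.
          buffered.add(new Text(v));
        }
        context.write(key, new IntWritable(buffered.size()));
      }
    }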


    Next the reduce function is called; context.getValues hands all values of the key to reduce.
    context.getValues is shown below:

    ReduceContextImpl.getValues():

    public
    Iterable<VALUEIN> getValues() throws IOException, InterruptedException {
      return iterable;
    }

    This directly returns the iterable instance, which is created when the ReduceContextImpl
    instance is constructed:

    private ValueIterable iterable = new ValueIterable();

    This class is an inner class of ReduceContextImpl:

    protected class ValueIterable implements Iterable<VALUEIN> {
      private ValueIterator iterator = new ValueIterator();
      @Override
      public Iterator<VALUEIN> iterator() {
        return iterator;
      }
    }

    It references a ValueIterator, also an inner class. Each step of the iteration goes through
    ValueIterator.next to fetch one record:

    public VALUEIN next() {

    inReset defaults to false, so the code inside the inReset check does not run; in fact
    backupStore itself is null, and to use the backupStore its mark function has to be invoked first.

      if (inReset) {
        .................  (this branch is not analyzed here)
      }

    If this is the first value under the key, firstValue is reset to false (the next call will no
    longer be the first value) and the current value is returned.

      // if this is the first record, we don't need to advance
      if (firstValue) {
        firstValue = false;
        return value;
      }
      // if this isn't the first record and the next key is different, they
      // can't advance it here.
      if (!nextKeyIsSame) {
        throw new NoSuchElementException("iterate past last value");
      }
      // otherwise, go to the next key/value pair
      try {

    When it is not the first value (firstValue is false), nextKeyValue is executed once to obtain
    the current value, which is then returned.

        nextKeyValue();
        return value;
      } catch (IOException ie) {
        throw new RuntimeException("next value iterator failed", ie);
      } catch (InterruptedException ie) {
        // this is bad, but we can't modify the exception list of java.util
        throw new RuntimeException("next value iterator interrupted", ie);
      }
    }


    The output after reduce finishes is the same as the map-side output when there is no reduce:
    it is written out directly.


  • Original post: https://www.cnblogs.com/gccbuaa/p/6806708.html