Study Notes on the Reduce Task

Reposted from: http://blog.csdn.net/androidlushangderen/article/details/41243505

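I am now more than halfway through analyzing the five major processes of MapReduce. The earlier analysis of the Map process took a lot of my time, but it paid off. This time I used the same approach to work through the Reduce process, so I have now covered the two most important ideas in MapReduce. Before starting, I looked through books and online material on the Reduce process and found most of it much the same, with little reference to the source code, so I think I can explain some details more clearly by following the source. The overall flows of Map and Reduce are very similar, so if you have read my earlier analysis of the Map Task, you should be able to follow this one quickly. The Reduce process is embodied in the Reduce Task. As with the Map Task, there are four kinds of task: Job-setup Task, Job-cleanup Task, Task-cleanup Task, and the Reduce Task proper. I mainly analyze the last one. A Reduce Task consists of 5 phases: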

Shuffle------------------->Merge------------------->Sort------------------->Reduce------------------->Write

The first 3 phases are the most important, and I spend most of the time describing them.

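Shuffle phase. At the very start, a Reduce task has to receive the intermediate results output by the Map tasks. Each key-value pair is assigned to a Reduce task by a partitioning algorithm, so the Reduce task must copy the intermediate data from the remote Map outputs. This is the Shuffle phase, also called the Copy phase. During it, a GetMapEventsThread periodically sends RPC requests to get the list of completed Map Tasks and records their output locations in mapLocations.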
    ....
        // GetMapEventsThread obtains the list of completed Map tasks through a remote call
        int numNewMaps = getMapCompletionEvents();
        if (LOG.isDebugEnabled()) {
          if (numNewMaps > 0) {
            LOG.debug(reduceTask.getTaskID() + ": " +
                "Got " + numNewMaps + " new map-outputs");
          }
        }
        Thread.sleep(SLEEP_TIME);
      }
Stepping into getMapCompletionEvents:
    ...
    for (TaskCompletionEvent event : events) {
      switch (event.getTaskStatus()) {
        case SUCCEEDED:
        {
          URI u = URI.create(event.getTaskTrackerHttp());
          String host = u.getHost();
          TaskAttemptID taskId = event.getTaskAttemptId();
          URL mapOutputLocation = new URL(event.getTaskTrackerHttp() +
                                  "/mapOutput?job=" + taskId.getJobID() +
                                  "&map=" + taskId +
                                  "&reduce=" + getPartition());
          List<MapOutputLocation> loc = mapLocations.get(host);
          if (loc == null) {
            loc = Collections.synchronizedList
              (new LinkedList<MapOutputLocation>());
            mapLocations.put(host, loc);
          }
          // Add the newly completed mapOutputLocation to loc; mapLocations is shared globally
          loc.add(new MapOutputLocation(taskId, host, mapOutputLocation));
          numNewMaps ++;
        }
        break;
        ....
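The bookkeeping in this excerpt (build a fetch URL, then file it under its host) can be tried in isolation. The following is a minimal sketch, not Hadoop code: the class HostGroupingSketch, its methods, and the sample host names and IDs are all made up for illustration, and plain strings stand in for TaskCompletionEvent and MapOutputLocation.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapLocations bookkeeping; not a Hadoop class.
class HostGroupingSketch {

    // Builds the fetch URL the same way the excerpt does: tracker address + query string.
    static String fetchUrl(String trackerHttp, String jobId, String mapId, int partition) {
        return trackerHttp + "/mapOutput?job=" + jobId + "&map=" + mapId + "&reduce=" + partition;
    }

    // Groups fetch URLs by host. Each event is {tracker HTTP address, map attempt id}.
    static Map<String, List<String>> groupByHost(List<String[]> events, String jobId, int partition) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String[] e : events) {
            String host = URI.create(e[0]).getHost();
            // computeIfAbsent replaces the "if (loc == null) { create; put }" idiom.
            byHost.computeIfAbsent(host, h -> new ArrayList<>())
                  .add(fetchUrl(e[0], jobId, e[1], partition));
        }
        return byHost;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
            new String[] {"http://node1:50060", "m_000000_0"},
            new String[] {"http://node2:50060", "m_000001_0"},
            new String[] {"http://node1:50060", "m_000002_0"});
        System.out.println(groupByHost(events, "job_1", 3));
    }
}
```

Grouping by host is what later makes it possible to treat one host's outputs as a unit when deciding what to fetch.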
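To avoid network hot spots, the Reduce Task shuffles the output locations and then stores them in scheduledCopies; all later copy operations work from this list. The variable lives in a class called ReduceCopier. Determining where to copy from is only the first half of the Shuffle phase, so let us now see where execution begins. Back to run(), the entry point of the Reduce Task: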
    public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
      throws IOException, InterruptedException, ClassNotFoundException {
      this.umbilical = umbilical;
      job.setBoolean("mapred.skip.on", isSkipping());

      if (isMapOrReduce()) {
        // Register the phases of this task for progress tracking
        copyPhase = getProgress().addPhase("copy");
        sortPhase = getProgress().addPhase("sort");
        reducePhase = getProgress().addPhase("reduce");
      }
      // start thread that will handle communication with parent
      // Create the task reporter, which communicates with the parent process
      TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
          jvmContext);
      reporter.startCommunicationThread();
      // Check whether the new API is in use
      boolean useNewApi = job.getUseNewReducer();
      initialize(job, getJobID(), reporter, useNewApi);

      // check if it is a cleanupJobTask
      // As with map tasks, there are 4 kinds of task: Job-setup Task, Job-cleanup Task, Task-cleanup Task and ReduceTask
      if (jobCleanup) {
        // Run the Job-cleanup Task
        runJobCleanupTask(umbilical, reporter);
        return;
      }
      if (jobSetup) {
        // Run the Job-setup Task
        runJobSetupTask(umbilical, reporter);
        return;
      }
      if (taskCleanup) {
        // Run the Task-cleanup Task
        runTaskCleanupTask(umbilical, reporter);
        return;
      }

      /* Everything below runs the Reduce Task itself */

      // Initialize the codec
      codec = initCodec();

      boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local"));
      if (!isLocal) {
        reduceCopier = new ReduceCopier(umbilical, job, reporter);
        if (!reduceCopier.fetchOutputs()) {
          ......
We have to pause at reduceCopier.fetchOutputs(), because both the Shuffle phase and the Merge phase are implemented here:
    /**
     * Start n threads to copy the Map output data remotely
     * @return
     * @throws IOException
     */
    public boolean fetchOutputs() throws IOException {
      int totalFailures = 0;
      int numInFlight = 0, numCopied = 0;
      DecimalFormat mbpsFormat = new DecimalFormat("0.00");
      final Progress copyPhase =
        reduceTask.getProgress().phase();
      // Separate thread that periodically merges files on local disk
      LocalFSMerger localFSMergerThread = null;
      // Separate thread that periodically merges files held in memory
      InMemFSMergeThread inMemFSMergeThread = null;
      GetMapEventsThread getMapEventsThread = null;

      for (int i = 0; i < numMaps; i++) {
        copyPhase.addPhase();       // add sub-phase per file
      }

      // Create the container that holds the copier threads
      copiers = new ArrayList<MapOutputCopier>(numCopiers);

      // start all the copying threads
      for (int i=0; i < numCopiers; i++) {
        // Create the copier threads and start them one by one
        MapOutputCopier copier = new MapOutputCopier(conf, reporter,
            reduceTask.getJobTokenSecret());
        copiers.add(copier);
        // The copier has been added to the list; now start it
        copier.start();
      }

      //start the on-disk-merge thread
      localFSMergerThread = new LocalFSMerger((LocalFileSystem)localFileSys);
      //start the in memory merger thread
      inMemFSMergeThread = new InMemFSMergeThread();
      // The 2 periodic merge threads are started too, so the copy and merge phases run in parallel
      localFSMergerThread.start();
      inMemFSMergeThread.start();

      // start the map events thread
      getMapEventsThread = new GetMapEventsThread();
      getMapEventsThread.start();
      .....
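The start-up pattern in fetchOutputs() (create each worker, keep it in a list, start it) can be sketched on its own. This is a minimal illustration under stated assumptions: CopierStartupSketch and startAndJoin are hypothetical names, the workers only bump a counter instead of copying data, and the join() calls exist only so the demo can report a result. In the excerpt above the copier threads are not joined at this point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "create, remember, start" for a pool of worker threads.
class CopierStartupSketch {

    // Starts numCopiers threads, keeps them in a list, waits for all of them,
    // and returns how many ran to completion.
    static int startAndJoin(int numCopiers) {
        AtomicInteger finished = new AtomicInteger();
        List<Thread> copiers = new ArrayList<>(numCopiers);
        for (int i = 0; i < numCopiers; i++) {
            // Stand-in for MapOutputCopier: the "work" is a single counter increment.
            Thread copier = new Thread(finished::incrementAndGet, "copier-" + i);
            copiers.add(copier);   // keep a handle so the thread can be managed later
            copier.start();
        }
        for (Thread t : copiers) {
            try {
                t.join();          // demo only: wait so the count is final
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println("copiers finished: " + startAndJoin(5));
    }
}
```

Keeping the threads in a list is what allows the owner to interrupt or inspect them later, which is presumably why the original code stores them in copiers.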
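The code above defines several unfamiliar Thread types; set those aside for now. Note that getMapEventsThread is started here to fetch the latest output locations. Acting on those locations takes many copy threads, here called MapOutputCopier threads, which the author puts into a list and starts one by one. Let us look at the implementation to see how the remote copy is done.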
    @Override
    public void run() {
      while (true) {
        try {
          MapOutputLocation loc = null;
          long size = -1;

          synchronized (scheduledCopies) {
            // Get the location of a map task's output from the scheduledCopies list
            while (scheduledCopies.isEmpty()) {
              // Wait while scheduledCopies is empty
              scheduledCopies.wait();
            }
            // Take the first entry in the list as the location to copy from
            loc = scheduledCopies.remove(0);
          }

          CopyOutputErrorType error = CopyOutputErrorType.OTHER_ERROR;
          readError = false;
          try {
            shuffleClientMetrics.threadBusy();
            // Mark loc as the map output location currently being copied
            start(loc);
            // Perform the main copy operation; returns the number of bytes copied
            size = copyOutput(loc);
            shuffleClientMetrics.successFetch();
            // Reaching this point means the copy succeeded, so set error to NO_ERROR
            error = CopyOutputErrorType.NO_ERROR;
          } catch (IOException e) {
            // An exception was thrown; handle it
            ....
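The hand-off between the thread that schedules locations and the copier threads that consume them is a classic wait/notify queue. Below is a small self-contained sketch of that pattern, including the shuffling of locations mentioned earlier. ScheduledCopiesSketch, schedule, take and demo are hypothetical names, strings stand in for MapOutputLocation, and the fixed random seed is only there to make the demo repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a scheduledCopies-style queue guarded by wait/notifyAll.
class ScheduledCopiesSketch {
    private final List<String> scheduledCopies = new ArrayList<>();

    // Producer side: shuffle the locations, publish them, wake up waiting copiers.
    void schedule(List<String> locations, long seed) {
        List<String> shuffled = new ArrayList<>(locations);
        Collections.shuffle(shuffled, new Random(seed));
        synchronized (scheduledCopies) {
            scheduledCopies.addAll(shuffled);
            scheduledCopies.notifyAll();
        }
    }

    // Consumer side: block until a location is available, then take the first one.
    String take() {
        synchronized (scheduledCopies) {
            try {
                while (scheduledCopies.isEmpty()) {   // loop guards against spurious wakeups
                    scheduledCopies.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return scheduledCopies.remove(0);
        }
    }

    // Runs one consumer thread against one producer call and returns what was taken.
    static List<String> demo() {
        ScheduledCopiesSketch queue = new ScheduledCopiesSketch();
        List<String> taken = Collections.synchronizedList(new ArrayList<>());
        Thread copier = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                taken.add(queue.take());
            }
        });
        copier.start();
        queue.schedule(List.of("node1", "node2", "node3"), 42L);
        try {
            copier.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The while loop around wait() matters: a woken thread must re-check the condition, because another copier may already have taken the entry.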
A location is taken from the list and then copied. The core method is copyOutput(); tracing further in:
    .....
    // Copy the map output
    // Remote copy from loc, the output location of the Map task, via its URL
    MapOutput mapOutput = getMapOutput(loc, tmpMapOutput,
                                       reduceId.getTaskID().getId());
Going further in:
    private MapOutput getMapOutput(MapOutputLocation mapOutputLoc,
                                   Path filename, int reduce)
    throws IOException, InterruptedException {
      // Connect
      // Open a connection to the URL
      URL url = mapOutputLoc.getOutputLocation();
      URLConnection connection = url.openConnection();

      // Get the input stream for the remote data
      InputStream input = setupSecureConnection(mapOutputLoc, connection);

      ......
      //We will put a file in memory if it meets certain criteria:
      //1. The size of the (decompressed) file should be less than 25% of
      //    the total inmem fs
      //2. There is space available in the inmem fs

      // Check if this map-output can be saved in-memory
      // Ask ShuffleRamManager whether the copied data fits in memory; if not, it goes to disk
      boolean shuffleInMemory = ramManager.canFitInMemory(decompressedLength);

      // Shuffle
      MapOutput mapOutput = null;
      if (shuffleInMemory) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Shuffling " + decompressedLength + " bytes (" +
              compressedLength + " raw bytes) " +
              "into RAM from " + mapOutputLoc.getTaskAttemptId());
        }

        // It fits in memory, so read the data from the input stream into memory
        mapOutput = shuffleInMemory(mapOutputLoc, connection, input,
                                    (int)decompressedLength,
                                    (int)compressedLength);
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Shuffling " + decompressedLength + " bytes (" +
              compressedLength + " raw bytes) " +
              "into Local-FS from " + mapOutputLoc.getTaskAttemptId());
        }

        // It does not fit, so write it to a file
        mapOutput = shuffleToDisk(mapOutputLoc, input, filename,
            compressedLength);
      }

      return mapOutput;
    }
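Here we can see that Hadoop obtains a remote input stream through a URL and reads from it. When copying to the local side it handles 2 cases: if the data fits in the memory currently available it is kept in memory, otherwise it is written to disk. ShuffleRamManager also shows up here. That completes the Shuffle phase, which goes quite deep, layer after layer.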
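The memory-or-disk decision can be illustrated with a toy budget that applies the 2 criteria quoted in the code comment: a single output must be smaller than 25% of the in-memory budget, and there must be room left. This is a deliberately simplified sketch, not the behavior of ShuffleRamManager: the class ShuffleTargetSketch and its methods are invented here, and the real manager may handle a full buffer differently (for example by making the copier wait rather than spilling to disk).

```java
// Hypothetical, simplified model of choosing between RAM and disk for one map output.
class ShuffleTargetSketch {
    private final long maxInMemBytes;  // assumed total in-memory budget
    private long usedBytes;            // bytes currently reserved in memory

    ShuffleTargetSketch(long maxInMemBytes) {
        this.maxInMemBytes = maxInMemBytes;
    }

    // Criterion 1: a single output must be smaller than 25% of the total budget.
    boolean smallEnough(long size) {
        return size * 4 < maxInMemBytes;
    }

    // Criterion 2: there must be space left in the in-memory budget.
    boolean hasRoom(long size) {
        return usedBytes + size <= maxInMemBytes;
    }

    // Reserves memory and returns "RAM" if both criteria hold, otherwise returns "DISK".
    String place(long size) {
        if (smallEnough(size) && hasRoom(size)) {
            usedBytes += size;
            return "RAM";
        }
        return "DISK";
    }

    public static void main(String[] args) {
        ShuffleTargetSketch budget = new ShuffleTargetSketch(100);
        long[] sizes = {20, 30, 24, 24, 24, 24};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + budget.place(size));
        }
    }
}
```

With a budget of 100, the 30-byte output fails the 25% rule and the last 24-byte output fails the free-space rule, so both are sent to disk in this model.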

Merge phase. The Merge phase actually runs in parallel with the Shuffle phase. As we just saw in fetchOutputs, the relevant threads are all started together:
    public boolean fetchOutputs() throws IOException {
      int totalFailures = 0;
      int numInFlight = 0, numCopied = 0;
      DecimalFormat mbpsFormat = new DecimalFormat("0.00");
      final Progress copyPhase =
        reduceTask.getProgress().phase();
      // Separate thread that periodically merges files on local disk
      LocalFSMerger localFSMergerThread = null;
      // Separate thread that periodically merges files held in memory
      InMemFSMergeThread inMemFSMergeThread = null;
      ....
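The main job of Merge is to combine data: when there are many files in memory or on disk, the small files are merged into larger ones. Take one of the run methods: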
    ....
    public void run() {
      LOG.info(reduceTask.getTaskID() + " Thread started: " + getName());
      try {
        boolean exit = false;
        do {
          exit = ramManager.waitForDataToMerge();
          if (!exit) {
            // Perform the in-memory merge
            doInMemMerge();
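The purpose is clear: it performs the merge. This is the run method of the in-memory merge thread; LocalFSMerger is similar, so I will not analyze it. This merge processing runs in parallel with the Shuffle phase. That completes both phases, which are somewhat complex. Below is a class relationship diagram; the main thing is to understand what each of the 4 threads (GetMapEventsThread, MapOutputCopier, LocalFSMerger and InMemFSMergeThread) does.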

All 4 threads are started in ReduceCopier.fetchOutputs(). After the Shuffle and Merge phases comes the Sort phase.
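Before moving on, it may help to see what "merging small sorted files into a larger one" means in code. The sketch below is a generic k-way merge over in-memory lists using a priority queue. It illustrates the technique only and is not Hadoop's Merger: KWayMergeSketch and merge are hypothetical names, and integers stand in for serialized key-value records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge of already sorted runs.
class KWayMergeSketch {

    // Merges several individually sorted runs into one sorted list.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Each heap entry is {value, run index, offset within run}, ordered by value.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();          // smallest head among all runs
            out.add(top[0]);
            List<Integer> run = runs.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) {          // advance within the run that was just consumed
                heap.add(new int[] {run.get(next), top[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = List.of(List.of(1, 4, 7), List.of(2, 5), List.of(3, 6, 8));
        System.out.println(merge(runs));
    }
}
```

Because every input run is already sorted, the merged output is sorted too, which is why a final merge can double as the sort step.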

Sort phase. Its task is light: it performs one overall merge of the data in memory and on disk, sorting it along the way.
    ....
    // Mark the copy operation as complete
    copyPhase.complete();                         // copy is already complete
    setPhase(TaskStatus.Phase.SORT);
    statusUpdate(umbilical);

    // Run the overall merge of in-memory and on-disk data; the Sort happens inside it
    final FileSystem rfs = FileSystem.getLocal(job).getRaw();
    RawKeyValueIterator rIter = isLocal
      ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
          job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
          !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
          new Path(getTaskID().toString()), job.getOutputKeyComparator(),
          reporter, spilledRecordsCounter, null)
      : reduceCopier.createKVIterator(job, rfs, reporter);
So where does the Sort happen? In createKVIterator, at the bottom:
    private RawKeyValueIterator createKVIterator(
        JobConf job, FileSystem fs, Reporter reporter) throws IOException {

      .....
      // Order the on-disk segments by length, shortest first, ahead of the final merge
      Collections.sort(diskSegments, new Comparator<Segment<K,V>>() {
        public int compare(Segment<K, V> o1, Segment<K, V> o2) {
          if (o1.getLength() == o2.getLength()) {
            return 0;
          }
          return o1.getLength() < o2.getLength() ? -1 : 1;
        }
      });

      // build final list of segments from merged backed by disk + in-mem
      List<Segment<K,V>> finalSegments = new ArrayList<Segment<K,V>>();
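Note that the comparator in this excerpt orders segments by their length, not by key. The same ordering can be written more compactly with Comparator.comparingLong, as in the sketch below. SegmentOrderSketch and Seg are hypothetical stand-ins that carry only a name and a length; the likely motivation, merging the smallest segments first so that less data is rewritten in intermediate passes, is my reading and is not stated in the excerpt.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order segments by length, shortest first.
class SegmentOrderSketch {

    // Minimal stand-in for Segment<K,V>; only the fields needed for ordering.
    static final class Seg {
        final String name;
        final long length;

        Seg(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns the segment names sorted by ascending length, leaving the input untouched.
    static List<String> orderByLength(List<Seg> segments) {
        List<Seg> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingLong(s -> s.length));
        List<String> names = new ArrayList<>();
        for (Seg s : sorted) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Seg> segments = List.of(new Seg("a", 30), new Seg("b", 10), new Seg("c", 20));
        System.out.println(orderByLength(segments));
    }
}
```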
That is all the Sort phase has to do. Next, look at the main execution flow of the first 3 phases, which together form the core of the Reduce Task.

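Reduce phase. Following the direction of the flow diagram, the next step is to run the reduce() function on the key-value pairs. Indeed, the code loops over the key-value pairs and calls this function: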
    ....
    // Check whether the new API or the old API is in use
    if (useNewApi) {
      runNewReducer(job, umbilical, reporter, rIter, comparator,
                    keyClass, valueClass);
    } else {
      runOldReducer(job, umbilical, reporter, rIter, comparator,
                    keyClass, valueClass);
    }
What runs here is one of the runReducer methods; let us follow the old API:
    private <INKEY,INVALUE,OUTKEY,OUTVALUE>
    void runOldReducer(JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       final TaskReporter reporter,
                       RawKeyValueIterator rIter,
                       RawComparator<INKEY> comparator,
                       Class<INKEY> keyClass,
                       Class<INVALUE> valueClass) throws IOException {
      Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
        ReflectionUtils.newInstance(job.getReducerClass(), job);
      // make output collector
      String finalName = getOutputName(getPartition());

      // Create the writer for the output key-value pairs
      final RecordWriter<OUTKEY, OUTVALUE> out = new OldTrackingRecordWriter<OUTKEY, OUTVALUE>(
          reduceOutputCounter, job, reporter, finalName);

      OutputCollector<OUTKEY,OUTVALUE> collector =
        new OutputCollector<OUTKEY,OUTVALUE>() {
          public void collect(OUTKEY key, OUTVALUE value)
            throws IOException {
            // Write the processed key-value pair to the output stream; it ends up in HDFS as the final result
            out.write(key, value);
            // indicate that progress update needs to be sent
            reporter.progress();
          }
        };

      // apply reduce function
      try {
        //increment processed counter only if skipping feature is enabled
        boolean incrProcCount = SkipBadRecords.getReducerMaxSkipGroups(job)>0 &&
          SkipBadRecords.getAutoIncrReducerProcCount(job);

        // Check whether bad-record skipping mode is on
        ReduceValuesIterator<INKEY,INVALUE> values = isSkipping() ?
            new SkippingReduceValuesIterator<INKEY,INVALUE>(rIter,
                comparator, keyClass, valueClass,
                job, reporter, umbilical) :
            new ReduceValuesIterator<INKEY,INVALUE>(rIter,
            job.getOutputValueGroupingComparator(), keyClass, valueClass,
            job, reporter);
        values.informReduceProgress();
        while (values.more()) {
          reduceInputKeyCounter.increment(1);
          // For each key taken from the record iterator, run the user-defined reduce function; this is the Reduce phase
          reducer.reduce(values.getKey(), values, collector, reporter);
          if(incrProcCount) {
            reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
                SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS, 1);
          }
          // Move on to the next key and its values
          values.nextKey();
          values.informReduceProgress();
        }
        //...
This is very similar to the Map Task process and, as expected, it iterates in a loop. That is the Reduce phase.
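The shape of that loop (walk sorted input, gather the values of one key, call reduce once per key) can be shown without any Hadoop types. In this sketch ReduceLoopSketch and reduceSorted are hypothetical names, the input is an in-memory list that is assumed to be sorted by key already, and summing is just an example reduce function.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a reduce loop over key-sorted records.
class ReduceLoopSketch {

    // Sums the values of each run of equal keys; the input must already be sorted by key.
    static Map<String, Integer> reduceSorted(List<Map.Entry<String, Integer>> sorted) {
        Map<String, Integer> output = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {                      // plays the role of values.more()
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume every record that shares the current key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            output.put(key, sum);                        // plays the role of collector.collect()
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = List.of(
            Map.entry("apple", 1), Map.entry("apple", 2), Map.entry("pear", 5));
        System.out.println(reduceSorted(sorted));
    }
}
```

Grouping by adjacent equal keys only works because the preceding Sort phase guarantees that all records for one key arrive together.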

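Write phase. The Write phase is the last one. In a user-defined reduce, the user normally calls collector.collect, and that is where the writing happens: the final results are written to HDFS. The collect method of the OutputCollector is defined first: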
    OutputCollector<OUTKEY,OUTVALUE> collector =
      new OutputCollector<OUTKEY,OUTVALUE>() {
        public void collect(OUTKEY key, OUTVALUE value)
          throws IOException {
          // Write the processed key-value pair to the output stream; it ends up in HDFS as the final result
          out.write(key, value);
          // indicate that progress update needs to be sent
          reporter.progress();
        }
      };
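The collector here is a thin wrapper: it forwards each pair to a writer and then signals progress. A stripped-down version of that wrapper is sketched below. The interfaces Writer and Collector and the class CollectorSketch are invented for this example and are not Hadoop's RecordWriter or OutputCollector; the writer appends to a list instead of writing to HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a collector that wraps a writer and reports progress.
class CollectorSketch {

    interface Writer<K, V> {
        void write(K key, V value);
    }

    interface Collector<K, V> {
        void collect(K key, V value);
    }

    // Wraps a writer so that every collected pair is written and then reported.
    static <K, V> Collector<K, V> wrap(Writer<K, V> out, Runnable progress) {
        return (key, value) -> {
            out.write(key, value);   // the actual write
            progress.run();          // tell the framework the task is still alive
        };
    }

    // Collects two pairs and returns the written lines plus the progress count.
    static List<String> demo() {
        List<String> lines = new ArrayList<>();
        int[] ticks = {0};
        Writer<String, Integer> out = (k, v) -> lines.add(k + "\t" + v);
        Collector<String, Integer> collector = wrap(out, () -> ticks[0]++);
        collector.collect("a", 3);
        collector.collect("b", 5);
        lines.add("progress=" + ticks[0]);
        return lines;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Reporting progress on every write is what keeps a long-running reduce from looking stalled to whoever monitors the task.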
That completes all the phases of the Reduce task. Below is a sequence diagram to aid understanding:

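Understanding the core implementation of the Map and Reduce processes gives a much better grasp of how a Hadoop job runs from start to finish. The analysis may be a bit dry at times, but there is joy in the grind.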
Original URL: https://www.cnblogs.com/cxzdy/p/5043630.html