Spark Source Code Analysis, Part 4: Stage Submission

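Dear readers, the previous article, "Spark Source Code Analysis: Stage Division", described in detail how Spark divides a job into Stages. We now move on to the third phase: Stage submission.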

The Stage submission phase has one main goal: to turn each Stage into a group of Tasks, i.e. a TaskSet. The processing flow is shown in the figure below:

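As with Stage division, we start from the handleJobSubmitted() method. By the end of the division phase, the final ResultStage and all the ShuffleMapStages before it have been created, so the natural next step is to submit them. The last two lines of handleJobSubmitted() handle Stage submission:

```scala
// Submit the final stage
submitStage(finalStage)
// Submit any other stages that are waiting
submitWaitingStages()
```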
From this code we can see that, logically, Stages are submitted from back to front: the final Stage (the ResultStage) is submitted first, followed by its parent stages. But is the actual physical order the same? Let us first look at how finalStage is submitted. The submitStage() method is as follows:

```scala
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  // Look up the id of the active job this stage belongs to
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // Only proceed if the stage is in none of waitingStages, runningStages or
    // failedStages, i.e. it has not been handled yet
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Get the parent stages of this stage that are still missing
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        // No missing parents: either the stage has no parents or they are all
        // available, so submit its tasks via submitMissingTasks()
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        // Otherwise some parents are still missing: recursively submit each of them
        for (parent <- missing) {
          submitStage(parent)
        }
        // Then add this stage to waitingStages
        waitingStages += stage
      }
    }
  } else {
    // No active job: abort the stage
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
```
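The logic is fairly simple. The method first looks up the jobId for the stage. If jobId is undefined, the stage does not belong to any active job, so abortStage() is called to abandon it. If jobId is defined, the method checks whether the stage is already in waitingStages, runningStages, or failedStages; if it is in any of the three, the stage is skipped. As the names suggest, waitingStages holds stages that are waiting to be processed. Spark walks the stage graph from back to front, visiting a child stage before its parents, but a stage can only run once its parents have produced their output, so a stage in waitingStages is one whose parent stages are not yet done and it has to wait. runningStages holds stages that are currently running; running means already submitted, so there is no need to submit them again. Finally, failedStages holds stages that have failed. In the DAGScheduler source these are stages that must be resubmitted after fetch failures, and that resubmission goes through a separate path, so submitStage() leaves them alone here.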

If the stage is in none of these three data structures, the submission flow continues. What happens next?

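First, getMissingParentStages() is called to obtain the parent stages that are still missing, stored in missing. If missing is empty, the stage either has no parent stage or all of its parent stages are already available, so the stage itself can be submitted. The method that does this, submitMissingTasks(), is analyzed later.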

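If missing is not empty, the stage still has parent stages that are not available. In that case we iterate over missing, recursively submit each parent stage, and add the current stage to waitingStages, where it waits until its parent stages are done before being submitted.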
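To make the difference between the logical and the physical order concrete, here is a minimal sketch of the same recursion on a toy model. ToyStage, ToyScheduler and SubmitOrderDemo are hypothetical names invented for this illustration, and a simple finishedStages set stands in for Spark's isAvailable check; it is not the DAGScheduler itself.

```scala
import scala.collection.mutable

// Hypothetical toy model: a stage is just an id plus its parent stages.
final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

class ToyScheduler {
  val waitingStages = mutable.LinkedHashSet.empty[ToyStage]
  val runningStages = mutable.LinkedHashSet.empty[ToyStage]
  // Assumption: stands in for "the parent's output is available".
  val finishedStages = mutable.Set.empty[ToyStage]
  // Records the order in which stages are actually handed over for execution.
  val submissionOrder = mutable.ArrayBuffer.empty[Int]

  private def missingParents(stage: ToyStage): List[ToyStage] =
    stage.parents.filterNot(finishedStages).sortBy(_.id)

  def submitStage(stage: ToyStage): Unit = {
    if (!waitingStages(stage) && !runningStages(stage)) {
      val missing = missingParents(stage)
      if (missing.isEmpty) {
        // No missing parents: the stage really starts now.
        runningStages += stage
        submissionOrder += stage.id
      } else {
        // Parents first, then park the child.
        missing.foreach(submitStage)
        waitingStages += stage
      }
    }
  }
}

object SubmitOrderDemo {
  /** Returns (ids in physical submission order, ids left waiting). */
  def run(): (List[Int], Set[Int]) = {
    val s0 = ToyStage(0)
    val s1 = ToyStage(1)
    val finalStage = ToyStage(2, List(s0, s1))
    val scheduler = new ToyScheduler
    scheduler.submitStage(finalStage)
    (scheduler.submissionOrder.toList, scheduler.waitingStages.map(_.id).toSet)
  }
}
```

Although submitStage() is called on the final stage first, the two parent stages are the ones that actually start, and the final stage ends up in waitingStages.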

Let us first see how missing is obtained. The getMissingParentStages() method is as follows:

```scala
private def getMissingParentStages(stage: Stage): List[Stage] = {
  // Parent stages that are still missing; this is the result returned at the end
  val missing = new HashSet[Stage]
  // RDDs that have already been visited
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]

  def visit(rdd: RDD[_]) {
    // Skip RDDs that have already been visited
    if (!visited(rdd)) {
      // Mark as visited so it is not processed again
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        // Loop over the RDD's dependencies
        for (dep <- rdd.dependencies) {
          dep match {
            // Wide (shuffle) dependency: look up the corresponding ShuffleMapStage
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // Narrow dependency: push the parent RDD onto the waitingForVisit stack
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }

  // Push the stage's RDD onto the top of waitingForVisit
  waitingForVisit.push(stage.rdd)
  // Pop RDDs off the stack and visit each one until the stack is empty
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }

  // Return the missing stages as a list
  missing.toList
}
```
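Does this look familiar? It has the same structure as the getParentStages() and getAncestorShuffleDependencies() methods from "Spark Source Code Analysis: Stage Division": three data structures plus a visit() function. The three data structures are:

1. missing: of type HashSet[Stage]; stores the parent stages that are still missing and is returned as the final result;

2. visited: of type HashSet[RDD[_]]; the set of RDDs that have already been processed, so that no RDD is processed twice;

3. waitingForVisit: of type Stack[RDD[_]]; the stack of RDDs waiting to be processed, last in, first out.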

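The logic of visit() is also fairly simple, roughly as follows:

It checks whether the RDD is already in visited. If not, it adds the RDD to visited and, provided the RDD has uncached partitions (getCacheLocs(rdd).contains(Nil)), loops over the RDD's dependencies. For a wide dependency (ShuffleDependency), it calls getShuffleMapStage() to obtain the ShuffleMapStage. This call simply fetches a stage that already exists, because the division phase has created all stages. It then checks the stage's isAvailable flag; if it is false, the stage is treated as not yet submitted and added to the missing set. For a narrow dependency (NarrowDependency), the parent RDD is pushed onto the waitingForVisit stack for later processing. RDDs linked by narrow dependencies belong to the same stage, so pushing them onto waitingForVisit only serves to keep walking up the DAG.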

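With that, the way missing is obtained becomes clear: the RDD of the stage (initially the final stage, i.e. the ResultStage) is pushed onto the top of waitingForVisit, and the stack is processed in a loop until it is empty.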
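The traversal can be sketched on a toy lineage graph. Everything below (ToyRdd, Narrow, Shuffle, MissingParentsDemo) is a hypothetical model made up for this illustration: a shuffle dependency carries the id of its map stage and an available flag in place of isAvailable, and the cache check is left out for brevity.

```scala
import scala.collection.mutable

// Hypothetical toy model of an RDD lineage graph.
sealed trait ToyDep { def rdd: ToyRdd }
final case class Narrow(rdd: ToyRdd) extends ToyDep
// mapStageId names the stage producing the shuffle; available mimics isAvailable.
final case class Shuffle(rdd: ToyRdd, mapStageId: Int, available: Boolean) extends ToyDep
final case class ToyRdd(id: Int, deps: List[ToyDep] = Nil)

object MissingParentsDemo {
  /** Returns the ids of parent map stages whose output is not yet available. */
  def missingParentStages(root: ToyRdd): List[Int] = {
    val missing = mutable.HashSet.empty[Int]
    val visited = mutable.HashSet.empty[Int]
    // An explicit LIFO stack instead of recursion, as in the original method.
    val waitingForVisit = mutable.ArrayBuffer.empty[ToyRdd]

    waitingForVisit += root
    while (waitingForVisit.nonEmpty) {
      val rdd = waitingForVisit.remove(waitingForVisit.length - 1)
      // add returns true only the first time an id is seen
      if (visited.add(rdd.id)) {
        rdd.deps.foreach {
          // Shuffle boundary: record the parent stage, do not walk past it.
          case Shuffle(_, stageId, available) => if (!available) missing += stageId
          // Same stage: keep walking up the lineage.
          case Narrow(parent) => waitingForVisit += parent
        }
      }
    }
    missing.toList.sorted
  }
}
```

Note that the walk stops at every shuffle boundary, so only the direct parent stages are reported, not their ancestors.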

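At this point you may be wondering: why can the isAvailable flag of a ShuffleMapStage decide whether the stage has already been submitted? I will keep you in suspense for now and come back to it later.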

That completes the analysis of submitStage(). Let us go back to handleJobSubmitted(). After submitStage() has been called on finalStage, only the earliest ancestor stages have actually been submitted; all the child stages are stored in waitingStages. Next, submitWaitingStages() is called to submit the stages stored there:

```scala
/**
 * Check for waiting or failed stages which are now eligible for resubmission.
 * Ordinarily run on every iteration of the event loop.
 */
private def submitWaitingStages() {
  // TODO: We might want to run this less often, when we are sure that something has become
  // runnable that wasn't before.
  logTrace("Checking for newly runnable parent stages")
  logTrace("running: " + runningStages)
  logTrace("waiting: " + waitingStages)
  logTrace("failed: " + failedStages)
  // Copy waitingStages into an array
  val waitingStagesCopy = waitingStages.toArray
  // Clear waitingStages
  waitingStages.clear()
  // Loop over the copy and call submitStage() on each stage in turn
  for (stage <- waitingStagesCopy.sortBy(_.firstJobId)) {
    submitStage(stage)
  }
}
```
This is simple. Since the order of the stages has already been sorted out, waitingStages is copied into the array waitingStagesCopy, and submitStage() is called on each stage in turn.
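One detail worth spelling out is why the set is copied and cleared before the loop: submitStage() may put a stage straight back into waitingStages if its parents are still missing, so the loop must not iterate over the set it is modifying. A minimal sketch of that copy-then-clear pattern, with plain Int ids and a hypothetical ready set standing in for "all parents are available" (the real method sorts by firstJobId rather than by id):

```scala
import scala.collection.mutable

object WaitingStagesDemo {
  /**
   * Drains `waiting`, returning the ids that could be submitted in this pass.
   * Ids that are not ready yet are put back into `waiting`.
   */
  def drain(waiting: mutable.Set[Int], ready: Set[Int]): List[Int] = {
    val submitted = mutable.ListBuffer.empty[Int]
    val snapshot = waiting.toArray // copy first ...
    waiting.clear()                // ... then clear the live set
    for (stage <- snapshot.sorted) {
      if (ready(stage)) submitted += stage
      else waiting += stage        // safe: we iterate over the snapshot
    }
    submitted.toList
  }
}
```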

Remember the question I left open? Why can the isAvailable flag of a ShuffleMapStage decide whether the stage has already been submitted? Let us solve that puzzle now. First, here is how isAvailable is defined in ShuffleMapStage:

```scala
/**
 * Returns true if the map stage is ready, i.e. all partitions have shuffle outputs.
 * This should be the same as `outputLocs.contains(Nil)`.
 */
def isAvailable: Boolean = _numAvailableOutputs == numPartitions
```
It compares _numAvailableOutputs with numPartitions to decide whether the stage is ready, that is, whether every partition already has shuffle output, which is why such a stage no longer counts as missing. numPartitions is easy to understand: it is the total number of partitions in the stage. So what is _numAvailableOutputs?

```scala
private[this] var _numAvailableOutputs: Int = 0

/**
 * Number of partitions that have shuffle outputs.
 * When this reaches [[numPartitions]], this map stage is ready.
 * This should be kept consistent as `outputLocs.filter(!_.isEmpty).size`.
 */
def numAvailableOutputs: Int = _numAvailableOutputs
```
As we can see, _numAvailableOutputs is the number of partitions that have shuffle outputs. When numAvailableOutputs reaches numPartitions, the map stage is ready.

_numAvailableOutputs starts at 0, so when is it updated? Reading through the ShuffleMapStage source, only two methods modify its value:

```scala
def addOutputLoc(partition: Int, status: MapStatus): Unit = {
  val prevList = outputLocs(partition)
  outputLocs(partition) = status :: prevList
  // First output for this partition: one more partition is available
  if (prevList == Nil) {
    _numAvailableOutputs += 1
  }
}

def removeOutputLoc(partition: Int, bmAddress: BlockManagerId): Unit = {
  val prevList = outputLocs(partition)
  val newList = prevList.filterNot(_.location == bmAddress)
  outputLocs(partition) = newList
  // Last output for this partition is gone: one fewer partition is available
  if (prevList != Nil && newList == Nil) {
    _numAvailableOutputs -= 1
  }
}
```
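The bookkeeping can be tried out on a simplified model. ToyMapStage below is a hypothetical stand-in written for this article: a location is just an executor name (a String) rather than Spark's MapStatus or BlockManagerId, and counterMatchesLocs is an extra helper added here to check the invariant stated in the doc comment.

```scala
// Hypothetical simplification of ShuffleMapStage's output tracking.
class ToyMapStage(val numPartitions: Int) {
  // One list of known output locations per partition; Nil means "no output yet".
  private val outputLocs = Array.fill[List[String]](numPartitions)(Nil)
  private var _numAvailableOutputs = 0

  def numAvailableOutputs: Int = _numAvailableOutputs
  def isAvailable: Boolean = _numAvailableOutputs == numPartitions

  def addOutputLoc(partition: Int, location: String): Unit = {
    val prev = outputLocs(partition)
    outputLocs(partition) = location :: prev
    // Only the first location for a partition changes the counter.
    if (prev.isEmpty) _numAvailableOutputs += 1
  }

  def removeOutputLoc(partition: Int, location: String): Unit = {
    val prev = outputLocs(partition)
    val next = prev.filterNot(_ == location)
    outputLocs(partition) = next
    // Only losing the last location for a partition changes the counter.
    if (prev.nonEmpty && next.isEmpty) _numAvailableOutputs -= 1
  }

  // The counter should always equal the number of partitions with some output.
  def counterMatchesLocs: Boolean =
    _numAvailableOutputs == outputLocs.count(_.nonEmpty)
}
```

Adding a second location for a partition that already has one leaves the counter unchanged, which is why the counter counts partitions rather than outputs.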
When is addOutputLoc() called? One answer lies in the newOrUsedShuffleStage() method of DAGScheduler. Its main logic is as follows:

```scala
if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
  // The shuffle is already known to mapOutputTracker:
  // fetch the serialized map output statuses by shuffleId
  val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
  // Deserialize them
  val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
  (0 until locs.length).foreach { i =>
    if (locs(i) ne null) {
      // locs(i) will be null if missing
      // Record the recovered output location on the stage
      stage.addOutputLoc(i, locs(i))
    }
  }
} else {
  // The shuffle is not yet known to mapOutputTracker: register it
  // Kind of ugly: need to register RDDs with the cache and map output tracker here
  // since we can't do it in the RDD constructor because # of partitions is unknown
  logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
  // Registered with: 1. the shuffleId from shuffleDep; 2. the number of partitions in the rdd
  mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}
```
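The first time this method runs for a shuffle, during stage division, mapOutputTracker has no information registered for it, so the else branch is taken and registerShuffle() registers the shuffle. When the method runs again for a shuffle that is already registered in mapOutputTracker, the serialized map output statuses are fetched by shuffleId, deserialized, and passed one by one to the stage's addOutputLoc(), which updates outputLocs and increments _numAvailableOutputs. Note that this is not the only caller: in the DAGScheduler source, addOutputLoc() is also invoked from handleTaskCompletion() each time a ShuffleMapTask finishes successfully, which is how the counter grows during normal execution. That settles the open question; any remaining doubts can be looked at later.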
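The register-or-recover branching can be sketched as follows. ToyTracker and RegisterOrRecoverDemo are hypothetical names for this illustration only; the toy tracker keeps an optional location per partition in memory, whereas the real MapOutputTracker stores serialized MapStatus objects.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the tracker: shuffleId -> per-partition location (None = missing).
class ToyTracker {
  private val shuffles = mutable.Map.empty[Int, Array[Option[String]]]

  def containsShuffle(id: Int): Boolean = shuffles.contains(id)

  def registerShuffle(id: Int, numPartitions: Int): Unit =
    shuffles(id) = Array.fill[Option[String]](numPartitions)(None)

  def registerOutput(id: Int, partition: Int, loc: String): Unit =
    shuffles(id)(partition) = Some(loc)

  def outputs(id: Int): Array[Option[String]] = shuffles(id)
}

object RegisterOrRecoverDemo {
  /** Returns how many partition outputs were recovered from the tracker. */
  def newOrUsedStage(tracker: ToyTracker, shuffleId: Int, numPartitions: Int): Int =
    if (tracker.containsShuffle(shuffleId)) {
      // Known shuffle: each recorded location would become an addOutputLoc() call.
      tracker.outputs(shuffleId).count(_.isDefined)
    } else {
      // Unknown shuffle: register it; nothing to recover yet.
      tracker.registerShuffle(shuffleId, numPartitions)
      0
    }
}
```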

Now it is time to analyze submitMissingTasks(), the method that actually submits a stage. Do not panic, take it slowly. The code is as follows:

```scala
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingPartitions.clear()

  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Create internal accumulators if the stage has no accumulators initialized.
  // Reset internal accumulators only if this stage is not partially submitted
  // Otherwise, we may override existing accumulator values from some tasks
  if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
    stage.resetInternalAccumulators()
  }

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  // Add the stage to runningStages
  runningStages += stage

  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  // When a stage starts, outputCommitCoordinator.stageStart() must be called
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }

  // Build taskIdToLocations: for each partition to compute, look up its preferred
  // locations and store them as id -> Seq[TaskLocation]
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e ${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  // Mark a new stage attempt
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // Post a SparkListenerStageSubmitted event
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage: ResultStage =>
        closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
    }

    // Broadcast the serialized task through sc
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage

      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e ${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  // Build one task per partition to compute: a ShuffleMapStage produces
  // ShuffleMapTasks, a ResultStage produces ResultTasks
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          // Preferred locations for this partition
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          // Create the ShuffleMapTask, including its location information
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.internalAccumulators)
        }
      case stage: ResultStage =>
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          // Create the ResultTask
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, stage.internalAccumulators)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e ${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  // If there are tasks, submit them via taskScheduler.submitTasks();
  // otherwise mark the stage as finished
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    // Record the pending partitions
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    // Submit the tasks as a TaskSet
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    // Record the submission time
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
  }
}
```
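The main job of submitMissingTasks() is to generate a group of Tasks, i.e. a TaskSet, for the stage and to submit them by calling the TaskScheduler's submitTasks() method. It does the following:

1. Clears the stage's pendingPartitions;

2. Determines the indexes of the partitions the stage still needs to compute and stores them in partitionsToCompute;

3. Adds the stage to runningStages, marking it as running, which matches the explanation above;

4. Calls outputCommitCoordinator's stageStart() method, as required whenever a stage starts;

5. Creates the Map taskIdToLocations, which looks up the preferred locations of each partition to compute in the stage's RDD and stores them as id -> Seq[TaskLocation];

6. Marks a new stage attempt and posts a SparkListenerStageSubmitted event;

7. Serializes and broadcasts the stage: for a ShuffleMapStage, the rdd and shuffleDep; for a ResultStage, the rdd and func;

8. Most importantly, builds a task for each partition to compute: a ShuffleMapStage produces ShuffleMapTasks, and a ResultStage produces ResultTasks;

9. If there are tasks, submits them with taskScheduler.submitTasks(); otherwise marks the stage as finished.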
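Steps 2, 5, 8 and 9 above can be condensed into a small sketch. ToyTask, ToyTaskSet and TaskSetDemo are hypothetical types invented for this illustration; Spark's real Task and TaskSet carry much more state (the broadcast task binary, accumulators, properties and so on), and the locations function is an assumed stand-in for getPreferredLocs().

```scala
// Hypothetical toy types for illustration only.
final case class ToyTask(stageId: Int, partitionId: Int, preferredLocs: Seq[String])
final case class ToyTaskSet(tasks: Seq[ToyTask], stageId: Int, attempt: Int)

object TaskSetDemo {
  /**
   * Builds one task per missing partition, or returns None when nothing is
   * left to compute (the case where the scheduler marks the stage as finished).
   */
  def buildTaskSet(
      stageId: Int,
      attempt: Int,
      numPartitions: Int,
      donePartitions: Set[Int],
      locations: Int => Seq[String]): Option[ToyTaskSet] = {
    // Step 2: partitions that still need to be computed
    val partitionsToCompute = (0 until numPartitions).filterNot(donePartitions)
    // Steps 5 and 8: one task per partition, carrying its preferred locations
    val tasks = partitionsToCompute.map(id => ToyTask(stageId, id, locations(id)))
    // Step 9: submit as a set, or report that there is nothing to run
    if (tasks.nonEmpty) Some(ToyTaskSet(tasks, stageId, attempt)) else None
  }
}
```

Because only missing partitions produce tasks, resubmitting a partially completed stage yields a smaller TaskSet rather than repeating finished work.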

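That completes the analysis of the main flow of stage submission. Task scheduling and execution will be covered in later articles, and some details of stage submission that were skipped or missed here, especially around task creation, are also left for a closer look another time.

Good night!

Original blog post: http://blog.csdn.net/lipeng_bigdata/article/details/50679842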
Source: https://www.cnblogs.com/jirimutu01/p/5274457.html