• Spark Source Code Analysis: DAG Splitting and Submission in DAGScheduler


    1. Spark Runtime Architecture

    The Spark runtime architecture is shown in the figure below:
    RDDs depend on one another, and these dependencies form a directed acyclic graph (DAG). The DAGScheduler splits this DAG into stages, using a simple rule: traverse backwards from the final RDD; a narrow dependency joins the current stage, while a wide (shuffle) dependency cuts a stage boundary. Once the stages are split, the DAGScheduler generates a TaskSet for each stage and submits it to the TaskScheduler. The TaskScheduler handles the concrete task scheduling and launches tasks on the Worker nodes.
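The splitting rule above can be sketched on a toy lineage. This is a minimal illustration with hypothetical types (`Node`, `splitStages`), not Spark's actual classes: walking backwards, narrow dependencies stay in the current stage, and each wide dependency seeds a parent stage.

```scala
// Minimal sketch of stage splitting (hypothetical types, not Spark's):
// walk the lineage backwards from the final RDD; a narrow dependency keeps the
// parent in the current stage, a wide (shuffle) dependency starts a parent stage.
case class Node(name: String, deps: List[(String, Node)]) // dep kind: "narrow" | "wide"

def splitStages(finalNode: Node): List[List[String]] = {
  // Collect this stage's members and the seeds of its parent stages
  def collect(n: Node): (List[String], List[Node]) =
    n.deps.foldLeft((List(n.name), List.empty[Node])) {
      case ((ms, ps), ("narrow", p)) =>
        val (m2, p2) = collect(p); (ms ++ m2, ps ++ p2)
      case ((ms, ps), (_, p)) => (ms, ps :+ p)   // "wide": cut a stage boundary here
    }
  def go(seed: Node): List[List[String]] = {
    val (stage, parents) = collect(seed)
    parents.flatMap(go) :+ stage                 // parent stages come before their child
  }
  go(finalNode)
}

val a = Node("A", Nil)
val b = Node("B", List(("narrow", a)))
val c = Node("C", List(("wide", b)))             // shuffle between B and C
val d = Node("D", List(("narrow", c)))
println(splitStages(d))                          // List(List(B, A), List(D, C))
```

With a shuffle between B and C, the lineage A ← B ← C ← D splits into a parent stage {A, B} and a final stage {C, D}, which is exactly the backward-traversal rule described above.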




    2. Source Code Analysis: DAG Splitting in DAGScheduler
        When an action (e.g. collect) is triggered on an RDD, SparkContext.runJob is executed. SparkContext.runJob calls DAGScheduler.runJob, which ultimately calls DAGScheduler.submitJob:
    def submitJob[T, U](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        callSite: CallSite,
        resultHandler: (Int, U) => Unit,
        properties: Properties): JobWaiter[U] = {
      // Check to make sure we are not launching a task on a partition that does not exist.
      val maxPartitions = rdd.partitions.length
      partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
        throw new IllegalArgumentException(
          "Attempting to access a non-existent partition: " + p + ". " +
            "Total number of partitions: " + maxPartitions)
      }
      val jobId = nextJobId.getAndIncrement()
      if (partitions.size == 0) {
        // Return immediately if the job is running 0 tasks
        return new JobWaiter[U](this, jobId, 0, resultHandler)
      }
      assert(partitions.size > 0)
      val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
      val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
      // Post a JobSubmitted message to eventProcessLoop
      eventProcessLoop.post(JobSubmitted(
        jobId, rdd, func2, partitions.toArray, callSite, waiter,
        SerializationUtils.clone(properties)))
      waiter
    }

    In its submitJob method, the DAGScheduler posts a JobSubmitted message to the eventProcessLoop object. eventProcessLoop is an instance of DAGSchedulerEventProcessLoop:

    private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

    DAGSchedulerEventProcessLoop receives all kinds of messages and processes them; the dispatch logic lives in its doOnReceive method:

    private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
      // Job submission
      case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
        dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
      case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
        dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
      case StageCancelled(stageId) =>
        dagScheduler.handleStageCancellation(stageId)
      case JobCancelled(jobId) =>
        dagScheduler.handleJobCancellation(jobId)
      case JobGroupCancelled(groupId) =>
        dagScheduler.handleJobGroupCancelled(groupId)
      case AllJobsCancelled =>
        dagScheduler.doCancelAllJobs()
      case ExecutorAdded(execId, host) =>
        dagScheduler.handleExecutorAdded(execId, host)
      case ExecutorLost(execId) =>
        dagScheduler.handleExecutorLost(execId, fetchFailed = false)
      case BeginEvent(task, taskInfo) =>
        dagScheduler.handleBeginEvent(task, taskInfo)
      case GettingResultEvent(taskInfo) =>
        dagScheduler.handleGetTaskResult(taskInfo)
      case completion: CompletionEvent =>
        dagScheduler.handleTaskCompletion(completion)
      case TaskSetFailed(taskSet, reason, exception) =>
        dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
      case ResubmitFailedStages =>
        dagScheduler.resubmitFailedStages()
    }

    You can think of DAGSchedulerEventProcessLoop as the DAGScheduler's external interface: it hides the internal implementation details. Whether a message originates inside or outside the DAGScheduler, it goes through the same handling code, which keeps the logic clear and the processing uniform.
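Under the hood, DAGSchedulerEventProcessLoop extends Spark's internal EventLoop: events posted from any thread land in a queue and are consumed by a single worker thread, which dispatches each one to onReceive. A minimal sketch of that pattern (a simplified `MiniEventLoop`, not Spark's actual EventLoop class, which also handles errors and lifecycle):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Minimal sketch of the post-then-consume pattern behind DAGSchedulerEventProcessLoop
// (simplified; Spark's real EventLoop adds error handling and lifecycle management).
abstract class MiniEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false
  private val thread = new Thread(name) {
    override def run(): Unit =
      while (!stopped) {
        val event = queue.take()   // block until an event arrives
        onReceive(event)           // dispatch, like doOnReceive's pattern match
      }
  }
  thread.setDaemon(true)           // daemon thread: does not keep the JVM alive

  def onReceive(event: E): Unit    // subclasses implement the dispatch logic
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = stopped = true
}
```

Because a single thread consumes the queue, handlers like handleJobSubmitted never run concurrently with each other, which is why the DAGScheduler's internal state needs no locking in those paths.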

    Next we analyze how the DAGScheduler splits stages. handleJobSubmitted first creates the ResultStage:

    try {
      // Creating a new stage may throw, e.g. if an HDFS file the job depends on has been deleted
      finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }

    It then calls submitStage, which drives the stage splitting.




    Starting from finalRDD, the scheduler examines each parent dependency and checks its type: if it is a narrow dependency, the parent RDD is pushed onto a stack; if it is a wide dependency, the parent becomes a parent stage.

    Here is the concrete process in the source:

    private def getMissingParentStages(stage: Stage): List[Stage] = {
      val missing = new HashSet[Stage]    // parent stages to return
      val visited = new HashSet[RDD[_]]   // RDDs already visited
      // Use an explicit stack to avoid stack overflow from recursive calls
      val waitingForVisit = new Stack[RDD[_]]

      def visit(rdd: RDD[_]) {
        if (!visited(rdd)) {
          visited += rdd
          val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
          if (rddHasUncachedPartitions) {
            for (dep <- rdd.dependencies) {
              dep match {
                case shufDep: ShuffleDependency[_, _, _] =>
                  val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
                  if (!mapStage.isAvailable) {
                    missing += mapStage   // wide dependency: add as a parent stage
                  }
                case narrowDep: NarrowDependency[_] =>
                  waitingForVisit.push(narrowDep.rdd)   // narrow dependency: push onto the stack
              }
            }
          }
        }
      }

      // Push the RDD where the backtracking starts
      waitingForVisit.push(stage.rdd)
      while (waitingForVisit.nonEmpty) {
        visit(waitingForVisit.pop())
      }
      missing.toList
    }
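The traversal above can be illustrated on plain data structures. This is a sketch with a hypothetical `Rdd` type (not Spark's): narrow edges are walked via an explicit stack, while each wide (shuffle) edge seeds a parent stage and is not traversed further.

```scala
import scala.collection.mutable

// Illustration of the explicit-stack traversal used by getMissingParentStages,
// on a plain dependency graph (hypothetical Rdd type, not Spark's).
case class Rdd(id: Int, narrowParents: List[Rdd], shuffleParents: List[Rdd])

def missingShuffleAncestors(root: Rdd): Set[Int] = {
  val missing = mutable.Set[Int]()            // stage-boundary parents found
  val visited = mutable.Set[Int]()            // RDDs already visited
  val stack = mutable.Stack[Rdd](root)        // explicit stack instead of recursion
  while (stack.nonEmpty) {
    val rdd = stack.pop()
    if (!visited(rdd.id)) {
      visited += rdd.id
      rdd.shuffleParents.foreach(p => missing += p.id)  // wide: cut a stage here
      rdd.narrowParents.foreach(stack.push)             // narrow: keep walking back
    }
  }
  missing.toSet
}

val a = Rdd(1, Nil, Nil)
val b = Rdd(2, List(a), Nil)
val c = Rdd(3, Nil, List(b))   // shuffle between b and c
val d = Rdd(4, List(c), Nil)
println(missingShuffleAncestors(d))   // prints Set(2): only b seeds a parent stage
```

Note that, like the real code, the traversal stops at a shuffle edge: what lies behind b belongs to the parent stage and will be handled when that stage is itself split.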

    getMissingParentStages takes the current stage and returns its parent stages. The parent stages themselves are created via getShuffleMapStage, which ultimately calls newOrUsedShuffleStage to return a ShuffleMapStage:

    private def newOrUsedShuffleStage(
        shuffleDep: ShuffleDependency[_, _, _],
        firstJobId: Int): ShuffleMapStage = {
      val rdd = shuffleDep.rdd
      val numTasks = rdd.partitions.length
      val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
      if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
        // The stage has already been computed; fetch its results from the MapOutputTracker
        val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
        val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
        (0 until locs.length).foreach { i =>
          if (locs(i) ne null) {
            // locs(i) will be null if missing
            stage.addOutputLoc(i, locs(i))
          }
        }
      } else {
        // Kind of ugly: need to register RDDs with the cache and map output tracker here
        // since we can't do it in the RDD constructor because # of partitions is unknown
        logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
        mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
      }
      stage
    }
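The reuse branch above can be sketched with plain data structures. These are hypothetical types (`TrackerSketch`, `StageSketch`), not Spark's: if the tracker already holds outputs for the shuffle, the new stage object is pre-filled with them instead of being recomputed; otherwise the shuffle is registered for the first time.

```scala
import scala.collection.mutable

// Sketch of the reuse logic in newOrUsedShuffleStage (hypothetical types, not Spark's).
class TrackerSketch {
  val outputs = mutable.Map[Int, Array[String]]()   // shuffleId -> map output locations
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    outputs.getOrElseUpdate(shuffleId, new Array[String](numMaps))
}

class StageSketch(numTasks: Int) {
  val outputLocs = new Array[String](numTasks)
  def isAvailable: Boolean = outputLocs.forall(_ != null)  // all partitions computed?
}

def newOrUsedStage(shuffleId: Int, numTasks: Int, tracker: TrackerSketch): StageSketch = {
  val stage = new StageSketch(numTasks)
  tracker.outputs.get(shuffleId) match {
    case Some(locs) =>
      for (i <- locs.indices if locs(i) != null)
        stage.outputLocs(i) = locs(i)               // reuse previously computed outputs
    case None =>
      tracker.registerShuffle(shuffleId, numTasks)  // first time: just register the shuffle
  }
  stage
}

val tracker = new TrackerSketch
tracker.outputs(0) = Array("hostA", "hostB")        // pretend shuffle 0 already ran
println(newOrUsedStage(0, 2, tracker).isAvailable)  // true: outputs were reused
println(newOrUsedStage(1, 2, tracker).isAvailable)  // false: nothing computed yet
```

A fully "available" stage is exactly one that getMissingParentStages will skip, which is how a re-run job avoids recomputing shuffles it has already produced.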

    Now that the parent stages have been created, let's look at the stage submission logic:

    /** Submits stage, but first recursively submits any missing parents. */
    private def submitStage(stage: Stage) {
      val jobId = activeJobForStage(stage)
      if (jobId.isDefined) {
        logDebug("submitStage(" + stage + ")")
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          val missing = getMissingParentStages(stage).sortBy(_.id)
          logDebug("missing: " + missing)
          if (missing.isEmpty) {
            logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
            // No missing parent stages: submit the current stage
            submitMissingTasks(stage, jobId.get)
          } else {
            for (parent <- missing) {
              // Missing parent stages: recursively submit them first
              submitStage(parent)
            }
            waitingStages += stage
          }
        }
      } else {
        abortStage(stage, "No active job for stage " + stage.id, None)
      }
    }

    The submission process is straightforward: the current stage first looks up its missing parent stages. If there are none, the stage is handed to submitMissingTasks; if some parents are missing, each of them is submitted recursively via submitStage, and the current stage is added to waitingStages.
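That recursion can be sketched as follows. This uses a hypothetical `Stg` type and a simplified "parent already submitted" check, not Spark's actual bookkeeping:

```scala
import scala.collection.mutable

// Sketch of submitStage's recursion (hypothetical Stg type, not Spark's): a stage
// is submitted only when no parent stage is missing; otherwise its parents are
// submitted first and the stage itself is parked in the waiting set.
case class Stg(id: Int, parents: List[Stg])

val submitted = mutable.ListBuffer[Int]()   // stages handed to submitMissingTasks
val waiting = mutable.Set[Int]()            // stages parked until their parents finish

def submit(stage: Stg): Unit = {
  val missing = stage.parents.filterNot(p => submitted.contains(p.id))
  if (missing.isEmpty) {
    submitted += stage.id        // no missing parents: run this stage now
  } else {
    missing.foreach(submit)      // recursively submit the parents first
    waiting += stage.id          // revisited later, once a parent completes
  }
}

val shuffleStage = Stg(0, Nil)
val resultStage = Stg(1, List(shuffleStage))
submit(resultStage)
println(submitted.toList)   // List(0): only the parentless stage runs immediately
println(waiting)            // the result stage waits for its parent
```

In the real scheduler, a waiting stage is resubmitted when a task-completion event marks its parent as available, which is how the DAG eventually executes front to back even though submission starts from the final stage.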

    This concludes DAG splitting and submission in the DAGScheduler. Next time we will look at how these stages are packaged into TaskSets and handed to the TaskScheduler, and at the TaskScheduler's scheduling process.

















  • Original post: https://www.cnblogs.com/zhouyf/p/5687071.html