• DAGScheduler


    A job starts when an action triggers SparkContext.runJob, which calls DAGScheduler.runJob and then submitJob. submitJob wraps the request in a JobSubmitted event and places it on a queue; the event loop's doOnReceive then dispatches it to dagScheduler.handleJobSubmitted.

    1: An action triggers the job submission.

    2: SparkContext submits the job (SparkContext.runJob).

    3: DAGScheduler.runJob is called to submit the job.

    4: DAGScheduler.runJob calls DAGScheduler.submitJob.

    5: submitJob creates a JobSubmitted event and adds it to the event queue via DAGSchedulerEventProcessLoop.post.

    6: DAGSchedulerEventProcessLoop runs doOnReceive, which calls handleJobSubmitted.
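
    As a concrete illustration, here is a minimal, self-contained driver sketch (the app name and input path are made up); the collect() call at the end is the action that kicks off the whole chain above:

    import org.apache.spark.{SparkConf, SparkContext}

    object SubmitDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("submit-demo").setMaster("local[2]"))

        val counts = sc.textFile("input.txt")    // hypothetical input path
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)                    // introduces a ShuffleDependency

        // collect() is the action: SparkContext.runJob -> DAGScheduler.runJob -> submitJob
        // -> JobSubmitted posted to DAGSchedulerEventProcessLoop -> doOnReceive -> handleJobSubmitted
        counts.collect().foreach(println)

        sc.stop()
      }
    }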

    Stage

    handleJobSubmitted is where the stages are divided.

    First, a look at the Stage source code:

    private[scheduler] abstract class Stage(
        val id: Int,
        val rdd: RDD[_],
        val numTasks: Int,
        val parents: List[Stage],
        val firstJobId: Int,
        val callSite: CallSite)

    Stage has two subclasses, ResultStage and ShuffleMapStage.

    Comparing their source code shows that:

    ResultStage adds val func: (TaskContext, Iterator[_]) => _, which stores the processing function of the action.

    ShuffleMapStage adds val shuffleDep: ShuffleDependency[_, _, _], which stores the shuffle dependency information.
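
    For reference, a condensed sketch of the two subclasses' constructors (paraphrased from the Spark 2.x source; parameter lists may differ slightly between versions):

    // ResultStage carries the function that the action applies to the final RDD's partitions.
    private[spark] class ResultStage(
        id: Int,
        rdd: RDD[_],
        val func: (TaskContext, Iterator[_]) => _,   // the action's processing function
        val partitions: Array[Int],
        parents: List[Stage],
        firstJobId: Int,
        callSite: CallSite)
      extends Stage(id, rdd, partitions.length, parents, firstJobId, callSite)

    // ShuffleMapStage carries the ShuffleDependency whose map-side output it produces.
    private[spark] class ShuffleMapStage(
        id: Int,
        rdd: RDD[_],
        numTasks: Int,
        parents: List[Stage],
        firstJobId: Int,
        callSite: CallSite,
        val shuffleDep: ShuffleDependency[_, _, _])  // the shuffle dependency information
      extends Stage(id, rdd, numTasks, parents, firstJobId, callSite)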

    Stage division

    Stage division is a key step in Spark job scheduling. The scheduler uses the DAG of RDD dependencies to cut the dependency chain into stages: tasks within a stage can run in parallel, and the stages of a job execute in dependency order until the whole job completes. The RDD dependency graph of a real job can be very complex, so dividing stages directly from it would be hard; Spark instead relies on the dependency types introduced earlier. The scheduler starts at the end of the DAG and walks the dependency chain backwards: whenever it meets a ShuffleDependency (a wide dependency) it cuts the chain there, and whenever it meets a NarrowDependency it pulls the parent into the current stage. The number of tasks in a stage is determined by the number of partitions of the stage's last RDD, since RDD transformations are coarse-grained, partition-based computations; the result of executing a stage is the RDD made up of those partitions.
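
    One quick way to observe these boundaries yourself (an illustrative snippet, reusing the word-count pipeline from the sketch above): RDD.toDebugString prints the lineage and indents at every shuffle, i.e. at every stage boundary.

    // `counts` ends with a reduceByKey, so its lineage contains one ShuffleDependency.
    // Running an action on it therefore produces two stages:
    //   a ShuffleMapStage (the map side, up to the shuffle) and a ResultStage (after it).
    println(counts.toDebugString)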

    Back to DAGScheduler.handleJobSubmitted. Because the RDD graph is traversed in reverse, the first stage created is a ResultStage named finalStage.

     var finalStage: ResultStage = null
        try {
          // New stage creation may throw an exception if, for example, jobs are run on a
          // HadoopRDD whose underlying HDFS files have been deleted.
          finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
        } catch {
          case e: Exception =>
            logWarning("Creating new stage failed due to exception - job: " + jobId, e)
            listener.jobFailed(e)
            return
        }

    The key code for stage division:

    /**
       * Returns shuffle dependencies that are immediate parents of the given RDD.
       *
       * This function will not return more distant ancestors.  For example, if C has a shuffle
       * dependency on B which has a shuffle dependency on A:
       *
       * A <-- B <-- C
       *
       * calling this function with rdd C will only return the B <-- C dependency.
       *
       * This function is scheduler-visible for the purpose of unit testing.
       */
      private[scheduler] def getShuffleDependencies(
          rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
        val parents = new HashSet[ShuffleDependency[_, _, _]]
        val visited = new HashSet[RDD[_]]
        val waitingForVisit = new Stack[RDD[_]]
        waitingForVisit.push(rdd)
        while (waitingForVisit.nonEmpty) {
          val toVisit = waitingForVisit.pop()
          if (!visited(toVisit)) {
            visited += toVisit
            toVisit.dependencies.foreach {
              case shuffleDep: ShuffleDependency[_, _, _] =>
            parents += shuffleDep  // a wide (shuffle) dependency: record it and stop descending this branch
              case dependency =>
                waitingForVisit.push(dependency.rdd)
            }
          }
        }
        parents
      }

    When a wide (shuffle) dependency is found, the ShuffleDependency itself is added to parents and traversal stops along that branch; these parents mark the boundary points between stages. Note that this does not yet give the parent stages of each stage; those are actually resolved when submitStage runs.
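
    To make the traversal concrete, consider a hypothetical lineage (the names below are illustrative; getShuffleDependencies is private[scheduler], so this only describes what it would return rather than calling it directly):

    val a = sc.parallelize(1 to 100).map(i => (i % 10, i))   // narrow dependency
    val b = a.reduceByKey(_ + _)                             // ShuffleDependency: a -> b
    val c = b.map { case (k, v) => (v, k) }                  // narrow dependency
    val d = c.sortByKey()                                    // ShuffleDependency: c -> d
    val e = d.map(identity)                                  // narrow dependency

    // Calling getShuffleDependencies on e walks back through e's narrow map dependency,
    // stops at the c -> d shuffle, and returns only that one ShuffleDependency;
    // the a -> b shuffle is found later, when the scheduler resolves the parents of the
    // ShuffleMapStage created for the c -> d dependency.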

    Every job produces a finalStage, which is used to generate and submit tasks. In addition, some simple jobs, which have no dependencies and only one partition, can be handled by a local thread rather than being submitted to the TaskScheduler.

    After finalStage has been created, submitStage() is called. It derives the stage DAG from the dependencies between stages and processes the stages in dependency order:

    /** Submits stage, but first recursively submits any missing parents. */
      private def submitStage(stage: Stage) {
        val jobId = activeJobForStage(stage)
        if (jobId.isDefined) {
          logDebug("submitStage(" + stage + ")")
          if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
            val missing = getMissingParentStages(stage).sortBy(_.id)
            logDebug("missing: " + missing)
            if (missing.isEmpty) {
              logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
              submitMissingTasks(stage, jobId.get)
            } else {
              for (parent <- missing) {
                submitStage(parent)
              }
              waitingStages += stage
            }
          }
        } else {
          abortStage(stage, "No active job for stage " + stage.id, None)
        }
      }

    For a newly submitted job, finalStage's parent stages are not yet known, so submitStage calls getMissingParentStages() to obtain the dependencies:

    private def getMissingParentStages(stage: Stage): List[Stage] = {
        val missing = new HashSet[Stage]
        val visited = new HashSet[RDD[_]]
        // We are manually maintaining a stack here to prevent StackOverflowError
        // caused by recursively visiting
        val waitingForVisit = new Stack[RDD[_]]
        def visit(rdd: RDD[_]) {
          if (!visited(rdd)) {
            visited += rdd
            val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
            if (rddHasUncachedPartitions) {
              for (dep <- rdd.dependencies) {
                dep match {
                  case shufDep: ShuffleDependency[_, _, _] =>
                    val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
                    if (!mapStage.isAvailable) {
                      missing += mapStage
                    }
                  case narrowDep: NarrowDependency[_] =>
                    waitingForVisit.push(narrowDep.rdd)
                }
              }
            }
          }
        }
        waitingForVisit.push(stage.rdd)
        while (waitingForVisit.nonEmpty) {
          visit(waitingForVisit.pop())
        }
        missing.toList
      }

    Here the parent stages are obtained by recursively traversing the RDD dependencies. For a wide dependency, i.e. a ShuffleDependency, Spark creates a new mapStage as a parent of finalStage; for a narrow dependency, Spark does not create a new stage. This is exactly the division rule described above, so for the two kinds of job mentioned at the start of this section, the first kind produces only a finalStage, while the second produces a finalStage plus one or more mapStages.

    Once the stage DAG has been produced, tasks need to be generated for each stage to execute it, so submitMissingTasks() is called:

     /** Called when stage's parents are available and we can now do its task. */
      private def submitMissingTasks(stage: Stage, jobId: Int) {
        logDebug("submitMissingTasks(" + stage + ")")
        // Get our pending tasks and remember them in our pendingTasks entry
        stage.pendingPartitions.clear()
    
        // First figure out the indexes of partition ids to compute.
        val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
    
        // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
        // with this Stage
        val properties = jobIdToActiveJob(jobId).properties
    
        runningStages += stage
        // SparkListenerStageSubmitted should be posted before testing whether tasks are
        // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
        // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
        // event.
        stage match {
          case s: ShuffleMapStage =>
            outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
          case s: ResultStage =>
            outputCommitCoordinator.stageStart(
              stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
        }
        val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
          stage match {
            case s: ShuffleMapStage =>
              partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
            case s: ResultStage =>
              partitionsToCompute.map { id =>
                val p = s.partitions(id)
                (id, getPreferredLocs(stage.rdd, p))
              }.toMap
          }
        } catch {
          case NonFatal(e) =>
            stage.makeNewStageAttempt(partitionsToCompute.size)
            listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
            abortStage(stage, s"Task creation failed: $e
    ${Utils.exceptionString(e)}", Some(e))
            runningStages -= stage
            return
        }
    
        stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
    
        // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
        // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
        // the serialized copy of the RDD and for each task we will deserialize it, which means each
        // task gets a different copy of the RDD. This provides stronger isolation between tasks that
        // might modify state of objects referenced in their closures. This is necessary in Hadoop
        // where the JobConf/Configuration object is not thread-safe.
        var taskBinary: Broadcast[Array[Byte]] = null
        try {
          // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
          // For ResultTask, serialize and broadcast (rdd, func).
          val taskBinaryBytes: Array[Byte] = stage match {
            case stage: ShuffleMapStage =>
              JavaUtils.bufferToArray(
                closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
            case stage: ResultStage =>
              JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
          }
    
          taskBinary = sc.broadcast(taskBinaryBytes)
        } catch {
          // In the case of a failure during serialization, abort the stage.
          case e: NotSerializableException =>
            abortStage(stage, "Task not serializable: " + e.toString, Some(e))
            runningStages -= stage
    
            // Abort execution
            return
          case NonFatal(e) =>
            abortStage(stage, s"Task serialization failed: $e
    ${Utils.exceptionString(e)}", Some(e))
            runningStages -= stage
            return
        }
    
        val tasks: Seq[Task[_]] = try {
          stage match {
            case stage: ShuffleMapStage =>
              partitionsToCompute.map { id =>
                val locs = taskIdToLocations(id)
                val part = stage.rdd.partitions(id)
                new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
                  taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
                  Option(sc.applicationId), sc.applicationAttemptId)
              }
    
            case stage: ResultStage =>
              partitionsToCompute.map { id =>
                val p: Int = stage.partitions(id)
                val part = stage.rdd.partitions(p)
                val locs = taskIdToLocations(id)
                new ResultTask(stage.id, stage.latestInfo.attemptId,
                  taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
                  Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
              }
          }
        } catch {
          case NonFatal(e) =>
            abortStage(stage, s"Task creation failed: $e
    ${Utils.exceptionString(e)}", Some(e))
            runningStages -= stage
            return
        }
    
        if (tasks.size > 0) {
          logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
          stage.pendingPartitions ++= tasks.map(_.partitionId)
          logDebug("New pending partitions: " + stage.pendingPartitions)
          taskScheduler.submitTasks(new TaskSet(
            tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
          stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
        } else {
          // Because we posted SparkListenerStageSubmitted earlier, we should mark
          // the stage as completed here in case there are no tasks to run
          markStageAsFinished(stage, None)
    
          val debugString = stage match {
            case stage: ShuffleMapStage =>
              s"Stage ${stage} is actually done; " +
                s"(available: ${stage.isAvailable}," +
                s"available outputs: ${stage.numAvailableOutputs}," +
                s"partitions: ${stage.numPartitions})"
            case stage : ResultStage =>
              s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
          }
          logDebug(debugString)
    
          submitWaitingChildStages(stage)
        }
      }

    First, based on the partitions of the RDD that the stage depends on, one task is created per partition, and tasks are placed according to partition locality. Second, a finalStage and a mapStage produce different kinds of task (ResultTask vs ShuffleMapTask). Finally, all the tasks are wrapped in a TaskSet and submitted to the TaskScheduler for execution.
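
    A small illustration of the first point (a sketch; the partition counts are arbitrary): the number of tasks in a stage follows the partition count of the stage's last RDD, so repartitioning changes how many tasks end up in the TaskSet.

    val data = sc.parallelize(1 to 1000, numSlices = 4)   // 4 partitions
    data.count()                                          // the ResultStage for this job runs 4 tasks

    val wider = data.repartition(8)                       // shuffle into 8 partitions
    wider.count()                                         // now a 4-task ShuffleMapStage plus an 8-task ResultStage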

    At this point the job's startup inside the DAGScheduler is complete, and execution is handed over to the TaskScheduler. When a task finishes, its result is returned to the DAGScheduler, which calls handleTaskCompletion() to process it:

    private def handleTaskCompletion(event: CompletionEvent) {
      val task = event.task
      val stage = idToStage(task.stageId)
      def markStageAsFinished(stage: Stage) = {
        val serviceTime = stage.submissionTime match {
          case Some(t) => "%.03f".format((System.currentTimeMillis() - t) / 1000.0)
          case _ => "Unkown"
        }
        logInfo("%s (%s) finished in %s s".format(stage, stage.origin, serviceTime))
        running -= stage
      }
      event.reason match {
        case Success =>
            ...
          task match {
            case rt: ResultTask[_, _] =>
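              // (elided in the original post) roughly: mark this partition as finished for the
              // ActiveJob; once all partitions are done, mark the job finished and notify its listener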
              ...
            case smt: ShuffleMapTask =>
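              // (elided) roughly: register the task's map output status with the ShuffleMapStage;
              // when all map outputs are available, mark the stage finished and submit waiting child stages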
              ...
          }
        case Resubmitted =>
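          // (elided) the task needs to run again, so its partition goes back into the stage's pending set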
          ...
        case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
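          // (elided) mark the lost map output(s) as unavailable and resubmit the affected stages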
          ...
        case other =>
          abortStage(idToStage(task.stageId), task + " failed: " + other)
      }
    }

    Every task that finishes returns its result to the DAGScheduler, and the DAGScheduler takes further action based on that result.

  • Original article: https://www.cnblogs.com/eryuan/p/7219136.html