• Spark应用程序-任务的划分


    任务的划分

    ​ DAGScheduler类的handleJobSubmitted方法中,有一个提交阶段的的方法:

    var finalStage: ResultStage = null
    	……
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
    	……
    submitStage(finalStage)
    

    ​ submitStage方法用于提交最终的ResultStage阶段,由于在最终的ResultStage可能包含了多个上级阶段,所以此处就相当于是提交整个应用程序的全部阶段。查看一下该方法的源码:

    private def submitStage(stage: Stage): Unit = {
      val jobId = activeJobForStage(stage)
      if (jobId.isDefined) {
        logDebug(s"submitStage($stage (name=${stage.name};" +
          s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          val missing = getMissingParentStages(stage).sortBy(_.id)
          logDebug("missing: " + missing)
          if (missing.isEmpty) {
            logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
            submitMissingTasks(stage, jobId.get)
          } else {
            for (parent <- missing) {
              submitStage(parent)
            }
            waitingStages += stage
          }
        }
      } else {
        abortStage(stage, "No active job for stage " + stage.id, None)
      }
    }
    

    ​ 该方法的内部核心逻辑是先获取当前阶段的的所有父级阶段,如果其父级阶段为空那么直接执行submitMissingTasks方法,如果不为空,那么递归执行submitStage方法,只不过传入的参数是当前阶段的父级阶段,一直递归直到找到没有上级阶段的阶段,最终没有上级阶段的那个阶段会执行submitMissingTasks方法。下面查看一下该方法的核心源码部分:

    private def submitMissingTasks(stage: Stage, jobId: Int): Unit = {
        	 ……
      val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
    		 ……
      val tasks: Seq[Task[_]] = try {
        val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
        stage match {
          case stage: ShuffleMapStage =>
            stage.pendingPartitions.clear()
            partitionsToCompute.map { id =>
              val locs = taskIdToLocations(id)
              val part = partitions(id)
              stage.pendingPartitions += id
              new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
                taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
                Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
            }
    
          case stage: ResultStage =>
            partitionsToCompute.map { id =>
              val p: Int = stage.partitions(id)
              val part = partitions(p)
              val locs = taskIdToLocations(id)
              new ResultTask(stage.id, stage.latestInfo.attemptNumber,
                taskBinary, part, locs, id, properties, serializedTaskMetrics,
                Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
                stage.rdd.isBarrier())
            }
        }
      } catch {
        case NonFatal(e) =>
          abortStage(stage, s"Task creation failed: $e
    ${Utils.exceptionString(e)}", Some(e))
          runningStages -= stage
          return
      }
    
     	……
    }
    

    ​ 核心代码的逻辑在于根据传入的stage进行模式匹配,会根据不同类型的Satge创建的不同的Task,那么首先会计算分区得到分区索引集合,然后使用map方法将根据分区id创建xxxMapTask对象,有几个分区id就创建几个xxxMapTask对象。partitionsToCompute是stage.findMissingPartitions()的返回值,那么查看其源码,stage是一个抽象类的引用,调用的这个方法具体的实现在具体的xxxMapStage类中。分别查看一下在resultstage和中的源码:

    ResultStage:

    override def findMissingPartitions(): Seq[Int] = {
      val job = activeJob.get
      (0 until job.numPartitions).filter(id => !job.finished(id))
    }
    

    ShuffleMapStage:

    override def findMissingPartitions(): Seq[Int] = {
      mapOutputTrackerMaster
        .findMissingPartitions(shuffleDep.shuffleId)
        .getOrElse(0 until numPartitions)
    }
    

    ​ 所以可以看出,partitionsToCompute就是一个分区索引的集合。ResultStage和ShuffleMapStage的numPartitions的值计算方式一样,都是来自于它们所处阶段的最后一个rdd的分区数量值:

    job.numPartitions值:

    val numPartitions = finalStage match {
      case r: ResultStage => r.partitions.length
      case m: ShuffleMapStage => m.rdd.partitions.length
    }
    

    numPartitions:

    val numPartitions = rdd.partitions.length
    

    所以总结一下,应用程序的总任务数量等于每个阶段的最后一个rdd的分区数量之和。

  • 相关阅读:
    文本文件、二进制文件
    trunc()
    字符集、编码
    windows注册表:扫盲
    decode() & sign()
    移动前端工作的那些事前端制作之自适应制作篇
    css hack知识详解
    IE6兼容性大全
    JS正则匹配入门基础!
    [转载]Javascript中批量定义CSS样式 cssText属性
  • 原文地址:https://www.cnblogs.com/yxym2016/p/14254225.html
Copyright © 2020-2023  润新知