• Spark Task Division & Job Submission


    1. Stage and Task Division

    DAG (Directed Acyclic Graph): a DAG is a topological graph made of vertices and edges; it has direction and contains no cycles. A chain of transformations on the original RDDs forms a DAG, and the DAG is split into Stages according to the dependencies between RDDs. For a narrow dependency, the partition-level transformations are computed within a single Stage. For a wide dependency, a Shuffle is involved, so downstream computation can only start after the parent RDD has been fully processed; wide dependencies are therefore the basis for splitting Stages. In short, the DAG records both the RDD transformation chain and the stage boundaries of the job.

    RDD task division breaks down into: Application, Job, Stage and Task

    (1) Application: initializing a SparkContext creates one Application;
    (2) Job: each Action operator triggers one Job;
    (3) Stage: the number of Stages equals the number of wide dependencies plus 1 (see the sketch after this list);
    (4) Task: within a Stage, the number of Tasks equals the number of partitions of the last RDD in that Stage.
    Note: each step of Application -> Job -> Stage -> Task is a 1-to-n relationship.
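
    To make the counting rules above concrete, here is a rough sketch (not Spark's actual DAGScheduler code; the helper name countStages is made up) that walks an RDD's lineage and counts the wide (Shuffle) dependencies:

    import org.apache.spark.ShuffleDependency
    import org.apache.spark.rdd.RDD

    // Count the ShuffleDependency edges in the lineage and add 1.
    // This matches "Stages = wide dependencies + 1" for a linear lineage;
    // branched lineages with shared parents are not handled here.
    def countStages(rdd: RDD[_]): Int = {
      def countShuffles(r: RDD[_]): Int =
        r.dependencies.map { dep =>
          val here = dep match {
            case _: ShuffleDependency[_, _, _] => 1 // wide dependency: stage boundary
            case _                             => 0 // narrow dependency: same stage
          }
          here + countShuffles(dep.rdd)
        }.sum
      countShuffles(rdd) + 1
    }

    // e.g. for the word-count pipeline used below:
    // countStages(sc.makeRDD(List("hello spark", "hello")).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)) == 2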

    Code example

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Job scheduling
     * Concepts:
     * Cluster: YARN || Standalone
     * Application: a Spark program; creating a SparkContext generally means creating one application,
     * and one cluster can run multiple applications.
     * Job: a Spark application can run multiple Jobs concurrently; every action operator submits one Job,
     * so one Spark application can have multiple Jobs.
     * Stage: split by the number of wide dependencies in a Job; number of Stages = number of wide dependencies + 1.
     * Task: each Stage consists of multiple tasks; the number of partitions of the last RDD in a Stage
     * is the number of tasks of that Stage.
     */
    object Spark02_Task {
      def main(args: Array[String]): Unit = {
        // create the configuration object
        val conf: SparkConf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
        // create the context object
        val sc: SparkContext = new SparkContext(conf)
        val rdd: RDD[String] = sc.makeRDD(List("hello spark", "flink", "hello", "word hello"))
        // RDD lineage and dependencies
        println(rdd.toDebugString)
        println(rdd.dependencies)
        println("**************************************")
        val flatMap: RDD[String] = rdd.flatMap(_.split(" "))
        println(flatMap.toDebugString)
        println(flatMap.dependencies)
        println("**************************************")
        val map: RDD[(String, Int)] = flatMap.map((_, 1))
        println(map.toDebugString)
        println(map.dependencies)
        println("**************************************")
        val resRDD: RDD[(String, Int)] = map.reduceByKey(_ + _)
        /*println(resRDD.toDebugString)
        println(resRDD.dependencies)*/
        resRDD.collect().foreach(println)
        resRDD.saveAsTextFile("E:\\workspace_idea\\spark0520\\output")
        // keep the driver alive so the Spark UI at http://localhost:4040/ stays available
        Thread.sleep(900000000)
        // shut down
        sc.stop()
      }
    }
    
    Run the program and open http://localhost:4040/ to inspect the jobs and stages.

    Number of Jobs: the two actions (collect and saveAsTextFile) generate Job 0 and Job 1.

    Number of Stages per Job:

    Looking at the Stages of Job 0: since there is only one Shuffle, the number of Stages is 2.

    Looking at the Stages of Job 1: since there is likewise only one Shuffle, the number of Stages is 2.

    Number of Tasks: looking at the number of Tasks of Stage 0 of Job 0.

    Looking at the number of Tasks of Stage 1 of Job 0.

    Note: when a shuffle is involved, Spark caches the shuffle output automatically, and stages that can reuse it are shown as skipped in the UI (here Job 1 reuses the shuffle written by Job 0, so its shuffle-map stage is skipped).

    2. Job Submission Source Code

    1. Take collect() as the example for tracing the job submission flow; Ctrl-click collect():
    resRDD.collect().foreach(println)
    
    2. This lands in the collect() method of the RDD abstract class; it calls sc.runJob, so click runJob:
    /**
     * Return an array that contains all of the elements in this RDD.
     *
     * @note This method should only be used if the resulting array is expected to be small, as
     * all the data is loaded into the driver's memory.
     */
    def collect(): Array[T] = withScope {
      val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
      Array.concat(results: _*)
    }
    
    3. Now we are inside SparkContext's runJob method, which simply calls another runJob overload; click runJob again:
    /**
     * Run a job on all partitions in an RDD and return the results in an array.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @return in-memory collection with a result of the job (each collection element will contain
     * a result from one partition)
     */
    def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
      runJob(rdd, func, 0 until rdd.partitions.length)
    }
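
    Although collect() is the usual entry point, this overload is public and can be called directly to run a function on every partition. A minimal sketch (it assumes an active SparkContext named sc, as in the program above):

    val nums = sc.parallelize(1 to 10, numSlices = 2)
    // one result per partition, computed on the executors and returned to the driver
    val perPartitionSums: Array[Int] = sc.runJob(nums, (it: Iterator[Int]) => it.sum)
    println(perPartitionSums.mkString(", ")) // "15, 40" for partitions [1..5] and [6..10]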
    
    4. This calls yet another SparkContext runJob overload; click this runJob:
    /**
     * Run a function on a given set of partitions in an RDD and return the results as an array.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @param partitions set of partitions to run on; some jobs may not want to compute on all
     * partitions of the target RDD, e.g. for operations like `first()`
     * @return in-memory collection with a result of the job (each collection element will contain
     * a result from one partition)
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: Iterator[T] => U,
        partitions: Seq[Int]): Array[U] = {
      val cleanedFunc = clean(func)
      runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
    }
    
    5. Yet another runJob overload; keep clicking through:
    /**
     * Run a function on a given set of partitions in an RDD and return the results as an array.
     * The function that is run against each partition additionally takes `TaskContext` argument.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @param partitions set of partitions to run on; some jobs may not want to compute on all
     * partitions of the target RDD, e.g. for operations like `first()`
     * @return in-memory collection with a result of the job (each collection element will contain
     * a result from one partition)
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int]): Array[U] = {
      val results = new Array[U](partitions.size)
      runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
      results
    }
    
    6. This is still another runJob inside SparkContext, but now we can see where it leads: in the end it calls dagScheduler.runJob. Next, let's look at what runJob inside DAGScheduler looks like.

    /**
     * Run a function on a given set of partitions in an RDD and pass the results to the given
     * handler function. This is the main entry point for all actions in Spark.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @param partitions set of partitions to run on; some jobs may not want to compute on all
     * partitions of the target RDD, e.g. for operations like `first()`
     * @param resultHandler callback to pass each result to
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        resultHandler: (Int, U) => Unit): Unit = {
      if (stopped.get()) {
        throw new IllegalStateException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite.shortForm)
      if (conf.getBoolean("spark.logLineage", false)) {
        logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
      }
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
      progressBar.foreach(_.finishAll())
      rdd.doCheckpoint()
    }
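
    A side note on the last line above: rdd.doCheckpoint() is what actually materializes a checkpoint after an action finishes. A small sketch of that behaviour (the checkpoint directory path is made up for this example):

    sc.setCheckpointDir("E:\\workspace_idea\\spark0520\\checkpoint") // hypothetical directory
    val words = sc.makeRDD(List("hello spark", "hello")).flatMap(_.split(" "))
    words.checkpoint()            // only marks the RDD; nothing is written yet
    words.collect()               // the action runs, then runJob calls rdd.doCheckpoint()
    println(words.isCheckpointed) // true once the checkpoint data has been written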
    
    7. Below is the runJob method of DAGScheduler. The key point to note here is the job submission: val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    /**
     * Run an action job on the given RDD and pass all the results to the resultHandler function as
     * they arrive.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @param partitions set of partitions to run on; some jobs may not want to compute on all
     *   partitions of the target RDD, e.g. for operations like first()
     * @param callSite where in the user program this job was called
     * @param resultHandler callback to pass each result to
     * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
     *
     * @note Throws `Exception` when the job fails
     */
    def runJob[T, U](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        callSite: CallSite,
        resultHandler: (Int, U) => Unit,
        properties: Properties): Unit = {
      val start = System.nanoTime
      val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
      ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
      waiter.completionFuture.value.get match {
        case scala.util.Success(_) =>
          logInfo("Job %d finished: %s, took %f s".format
            (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        case scala.util.Failure(exception) =>
          logInfo("Job %d failed: %s, took %f s".format
            (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
          // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
          val callerStackTrace = Thread.currentThread().getStackTrace.tail
          exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
          throw exception
      }
    }
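
    submitJob returns a JobWaiter, and runJob then blocks on waiter.completionFuture via ThreadUtils.awaitReady. The class below is not Spark's actual JobWaiter, only a minimal sketch of the pattern using the Scala standard library (the name SimpleJobWaiter is made up):

    import scala.concurrent.Promise

    // Delivers one result per partition through resultHandler and completes a future
    // once all of them have arrived, so the caller can block until the job finishes.
    class SimpleJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
      private var finished = 0
      private val promise = Promise[Unit]()
      def completionFuture = promise.future

      // the scheduler would call this once per successfully finished partition
      def taskSucceeded(index: Int, result: U): Unit = synchronized {
        resultHandler(index, result)
        finished += 1
        if (finished == totalTasks) promise.success(())
      }

      def jobFailed(e: Exception): Unit = promise.tryFailure(e)
    }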
    
    8. The job submission logic: note eventProcessLoop.post(JobSubmitted(...)):

    def submitJob[T, U](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        callSite: CallSite,
        resultHandler: (Int, U) => Unit,
        properties: Properties): JobWaiter[U] = {
      // Check to make sure we are not launching a task on a partition that does not exist.
      val maxPartitions = rdd.partitions.length
      partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
        throw new IllegalArgumentException(
          "Attempting to access a non-existent partition: " + p + ". " +
            "Total number of partitions: " + maxPartitions)
      }
    
      val jobId = nextJobId.getAndIncrement()
      if (partitions.isEmpty) {
        val clonedProperties = Utils.cloneProperties(properties)
        if (sc.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION) == null) {
          clonedProperties.setProperty(SparkContext.SPARK_JOB_DESCRIPTION, callSite.shortForm)
        }
        val time = clock.getTimeMillis()
        listenerBus.post(
          SparkListenerJobStart(jobId, time, Seq.empty, clonedProperties))
        listenerBus.post(
          SparkListenerJobEnd(jobId, time, JobSucceeded))
        // Return immediately if the job is running 0 tasks
        return new JobWaiter[U](this, jobId, 0, resultHandler)
      }
    
      assert(partitions.nonEmpty)
      val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
      val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
      eventProcessLoop.post(JobSubmitted(
        jobId, rdd, func2, partitions.toArray, callSite, waiter,
        Utils.cloneProperties(properties)))
      waiter
    }
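
    Note the early-return branch above: a job over zero partitions never posts a JobSubmitted event; a JobWaiter for 0 tasks is returned right away. A quick way to hit that branch (assuming an active SparkContext sc):

    val empty = sc.emptyRDD[Int]
    println(empty.partitions.length) // 0
    println(empty.collect().length)  // 0, returned without launching any task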
    
    9. The event is submitted to the queue via eventQueue.put(event); look at the post logic:

    /**
     * Put the event into the event queue. The event thread will process it later.
     */
    def post(event: E): Unit = {
      if (!stopped.get) {
        if (eventThread.isAlive) {
          eventQueue.put(event)
        } else {
          onError(new IllegalStateException(s"$name has already been stopped accidentally."))
        }
      }
    }
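
    post only enqueues the event; a dedicated event thread takes events off the queue and hands them to onReceive. The class below is not Spark's EventLoop, only a minimal sketch of the same pattern:

    import java.util.concurrent.LinkedBlockingDeque

    abstract class MiniEventLoop[E](name: String) {
      private val eventQueue = new LinkedBlockingDeque[E]()
      @volatile private var stopped = false

      // single consumer thread: take an event and dispatch it to onReceive
      private val eventThread = new Thread(name) {
        setDaemon(true)
        override def run(): Unit =
          while (!stopped) onReceive(eventQueue.take())
      }

      def start(): Unit = eventThread.start()
      def stop(): Unit = stopped = true           // sketch only: take() may still block once
      def post(event: E): Unit =
        if (!stopped) eventQueue.put(event)       // same idea as the post method shown above

      protected def onReceive(event: E): Unit     // subclasses dispatch events, cf. doOnReceive
    }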
    
    10. Inside EventLoop, the event thread takes events off this queue and calls onReceive(event), which is abstract here, so press Ctrl+H to find its implementation class, DAGSchedulerEventProcessLoop:
    private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
      extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {
    
      private[this] val timer = dagScheduler.metricsSource.messageProcessingTimer
    
      /**
       * The main event loop of the DAG scheduler.
       */
      override def onReceive(event: DAGSchedulerEvent): Unit = {
        val timerContext = timer.time()
        try {
          doOnReceive(event)
        } finally {
          timerContext.stop()
        }
      }
    

    The implementation's onReceive (shown above) simply delegates each event to doOnReceive; follow doOnReceive to see how the event is dispatched.

    11. Look at the doOnReceive logic: it pattern-matches on the event type, and a JobSubmitted event is handed to dagScheduler.handleJobSubmitted:
    private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
      case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
        dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
    
    12. handleJobSubmitted creates an ActiveJob internally and then submits the final stage via submitStage(finalStage). This completes the job submission flow traced above: collect() -> SparkContext.runJob -> DAGScheduler.runJob -> submitJob -> JobSubmitted event -> onReceive/doOnReceive -> handleJobSubmitted.
    private[scheduler] def handleJobSubmitted(jobId: Int,
        finalRDD: RDD[_],
        func: (TaskContext, Iterator[_]) => _,
        partitions: Array[Int],
        callSite: CallSite,
        listener: JobListener,
        properties: Properties): Unit = {
      var finalStage: ResultStage = null
      try {
        // New stage creation may throw an exception if, for example, jobs are run on a
        // HadoopRDD whose underlying HDFS files have been deleted.
        finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
      } catch {
        case e: BarrierJobSlotsNumberCheckFailed =>
          // If jobId doesn't exist in the map, Scala coverts its value null to 0: Int automatically.
          val numCheckFailures = barrierJobIdToNumTasksCheckFailures.compute(jobId,
            (_: Int, value: Int) => value + 1)
    
          logWarning(s"Barrier stage in job $jobId requires ${e.requiredConcurrentTasks} slots, " +
            s"but only ${e.maxConcurrentTasks} are available. " +
            s"Will retry up to ${maxFailureNumTasksCheck - numCheckFailures + 1} more times")
    
          if (numCheckFailures <= maxFailureNumTasksCheck) {
            messageScheduler.schedule(
              new Runnable {
                override def run(): Unit = eventProcessLoop.post(JobSubmitted(jobId, finalRDD, func,
                  partitions, callSite, listener, properties))
              },
              timeIntervalNumTasksCheck,
              TimeUnit.SECONDS
            )
            return
          } else {
            // Job failed, clear internal data.
            barrierJobIdToNumTasksCheckFailures.remove(jobId)
            listener.jobFailed(e)
            return
          }
    
        case e: Exception =>
          logWarning("Creating new stage failed due to exception - job: " + jobId, e)
          listener.jobFailed(e)
          return
      }
      // Job submitted, clear internal data.
      barrierJobIdToNumTasksCheckFailures.remove(jobId)
    
      val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
      clearCacheLocs()
      logInfo("Got job %s (%s) with %d output partitions".format(
        job.jobId, callSite.shortForm, partitions.length))
      logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
      logInfo("Parents of final stage: " + finalStage.parents)
      logInfo("Missing parents: " + getMissingParentStages(finalStage))
    
      val jobSubmissionTime = clock.getTimeMillis()
      jobIdToActiveJob(jobId) = job
      activeJobs += job
      finalStage.setActiveJob(job)
      val stageIds = jobIdToStageIds(jobId).toArray
      val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
      submitStage(finalStage)
    }
    

  • Original post: https://www.cnblogs.com/wdh01/p/16319945.html