• Spark RDD Resilience Explained


    1. Automatic switching of data storage between memory and disk

    Spark puts data in memory first; when memory is insufficient, the data is spilled to disk. If the actual data volume is larger than the available memory, the placement strategy and optimization algorithms need to be considered.
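
    As an illustration (not from the original post), here is a minimal sketch of requesting memory-first storage with disk spill through persist(StorageLevel.MEMORY_AND_DISK); the application name and dataset are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object MemoryDiskSpillSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("MemoryDiskSpillSketch").setMaster("local[*]"))

        // With MEMORY_AND_DISK, partitions that do not fit in memory are spilled to
        // disk instead of being recomputed from the lineage on the next access.
        val rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
        rdd.persist(StorageLevel.MEMORY_AND_DISK)

        println(rdd.count()) // first action materializes and stores the partitions
        println(rdd.count()) // second action reuses the cached (in-memory or spilled) partitions

        sc.stop()
      }
    }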

    2. Lineage-based fault tolerance

    Lineage is built on the dependency relationships between Spark RDDs: each operation only needs to reference its parent operations, the partitions are independent of one another, and when an error occurs only the affected portion of a single partition needs to be recovered.
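
    For illustration only, a small sketch (assuming an existing SparkContext `sc`, as in the sketch above, and a hypothetical input path) showing how the lineage can be inspected; if a partition of `counts` is lost, Spark recomputes just that partition from its parents:

    // The input path is hypothetical.
    val words  = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split("\\s+"))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // toDebugString prints the lineage (the chain of parent RDDs) that Spark uses
    // to recompute only the lost partitions after a failure.
    println(counts.toDebugString)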

    3. Retrying failed Tasks

    The default number of retries is 4 (spark.task.maxFailures). The TaskSchedulerImpl source is shown below, followed by a configuration sketch:

    /**
     * Schedules tasks for multiple types of clusters by acting through a SchedulerBackend.
     * It can also work with a local setup by using a `LocalSchedulerBackend` and setting
     * isLocal to true. It handles common logic, like determining a scheduling order across jobs, waking
     * up to launch speculative tasks, etc.
     *
     * Clients should first call initialize() and start(), then submit task sets through the
     * runTasks method.
     *
     * THREADING: [[SchedulerBackend]]s and task-submitting clients can call this class from multiple
     * threads, so it needs locks in public API methods to maintain its state. In addition, some
     * [[SchedulerBackend]]s synchronize on themselves when they want to send events here, and then
     * acquire a lock on us, so we need to make sure that we don't try to lock the backend while
     * we are holding a lock on ourselves.
     */
    private[spark] class TaskSchedulerImpl private[scheduler](// constructor access restricted to the [scheduler] package
        val sc: SparkContext,
        val maxTaskFailures: Int,
        private[scheduler] val blacklistTrackerOpt: Option[BlacklistTracker],// blacklist tracker, tracks problematic executors and nodes
        isLocal: Boolean = false)
      extends TaskScheduler with Logging {
    
      import TaskSchedulerImpl._
    
      def this(sc: SparkContext) = {
        this(
          sc,
          sc.conf.get(config.MAX_TASK_FAILURES),
          TaskSchedulerImpl.maybeCreateBlacklistTracker(sc))
      }
    
      def this(sc: SparkContext, maxTaskFailures: Int, isLocal: Boolean) = {
        this(
          sc,
          maxTaskFailures,
          TaskSchedulerImpl.maybeCreateBlacklistTracker(sc),
          isLocal = isLocal)
      }
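
      // ... (remainder of TaskSchedulerImpl omitted)

    The maxTaskFailures argument above is populated from the spark.task.maxFailures setting. A minimal configuration sketch of overriding it (the value 8 is arbitrary; the master is expected to come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.task.maxFailures raises the per-task retry limit (default 4) before the job fails.
    // Note: with a plain "local" or "local[N]" master, task retries are effectively disabled;
    // use the local[N, maxFailures] form or a cluster master for this setting to take effect.
    val conf = new SparkConf()
      .setAppName("TaskRetrySketch")
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)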
    

    4. Retrying failed Stages

    The default number of retries is also 4, and the failed Stage can be rerun directly, recomputing only the failed data partitions. The Stage.scala source is as follows:

    /**
     * A stage is a set of parallel tasks all computing the same function that need to run as part
     * of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
     * by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
     * DAGScheduler runs these stages in topological order.
     *
     * Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
     * other stage(s), or a result stage, in which case its tasks directly compute a Spark action
     * (e.g. count(), save(), etc) by running a function on an RDD. For shuffle map stages, we also
     * track the nodes that each output partition is on.
     *
     * Each Stage also has a firstJobId, identifying the job that first submitted the stage.  When FIFO
     * scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
     * faster on failure.
     *
     * Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that
     * case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI.
     * The latest one will be accessible through latestInfo.
     *
     * @param id Unique stage ID
     * @param rdd RDD that this stage runs on: for a shuffle map stage, it's the RDD we run map tasks
     *   on, while for a result stage, it's the target RDD that we ran an action on
     * @param numTasks Total number of tasks in stage; result stages in particular may not need to
     *   compute all partitions, e.g. for first(), lookup(), and take().
     * @param parents List of stages that this stage depends on (through shuffle dependencies).
     * @param firstJobId ID of the first job this stage was part of, for FIFO scheduling.
     * @param callSite Location in the user program associated with this stage: either where the target
     *   RDD was created, for a shuffle map stage, or where the action for a result stage was called.
     */
    private[scheduler] abstract class Stage(
        val id: Int,
        val rdd: RDD[_],
        val numTasks: Int,
        val parents: List[Stage],
        val firstJobId: Int,
        val callSite: CallSite)
      extends Logging {
    
      // number of partitions
      val numPartitions = rdd.partitions.length
    
      /** Set of jobs that this stage belongs to. */
      val jobIds = new HashSet[Int]
    
      /** The ID to use for the next new attempt for this stage. */
      private var nextAttemptId: Int = 0
    
      val name: String = callSite.shortForm
      val details: String = callSite.longForm
    
      /**
       * Pointer to the [[StageInfo]] object for the most recent attempt. This needs to be initialized
       * here, before any attempts have actually been created, because the DAGScheduler uses this
       * StageInfo to tell SparkListeners when a job starts (which happens before any stage attempts
       * have been created).
       */
      private var _latestInfo: StageInfo = StageInfo.fromStage(this, nextAttemptId)
    
      /**
       * Set of stage attempt IDs that have failed with a FetchFailure. We keep track of these
       * failures in order to avoid endless retries if a stage keeps failing with a FetchFailure.
       * We keep track of each attempt ID that has failed to avoid recording duplicate failures if
       * multiple tasks from the same stage attempt fail (SPARK-5945).
       */
      val fetchFailedAttemptIds = new HashSet[Int]
    
      private[scheduler] def clearFailures() : Unit = {
        fetchFailedAttemptIds.clear()
      }
    
      /** Creates a new attempt for this stage by creating a new StageInfo with a new attempt ID. */
      def makeNewStageAttempt(
          numPartitionsToCompute: Int,
          taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit = {
        val metrics = new TaskMetrics
        metrics.register(rdd.sparkContext)
        _latestInfo = StageInfo.fromStage(
          this, nextAttemptId, Some(numPartitionsToCompute), metrics, taskLocalityPreferences)
        nextAttemptId += 1
      }
    
      /** Returns the StageInfo for the most recent attempt for this stage. */
      def latestInfo: StageInfo = _latestInfo
    
      override final def hashCode(): Int = id
    
      override final def equals(other: Any): Boolean = other match {
        case stage: Stage => stage != null && stage.id == id
        case _ => false
      }
    
      /** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
      def findMissingPartitions(): Seq[Int]
    }
    

    The retry limit for Stages is defined in DAGScheduler.scala (a configuration sketch follows the excerpt):

    private[spark] object DAGScheduler {
      // The time, in millis, to wait for fetch failure events to stop coming in after one is detected;
      // this is a simplistic way to avoid resubmitting tasks in the non-fetchable map stage one by one
      // as more failure events come in
      val RESUBMIT_TIMEOUT = 200
    
      // Number of consecutive stage attempts allowed before a stage is aborted
      val DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS = 4
    }
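
    In recent Spark versions, DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS can be overridden with the spark.stage.maxConsecutiveAttempts setting. A minimal sketch (the value 10 is arbitrary; the master is expected to come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow up to 10 consecutive attempts of a stage (e.g. repeated FetchFailed errors)
    // before the stage, and therefore the job, is aborted.
    val conf = new SparkConf()
      .setAppName("StageRetrySketch")
      .set("spark.stage.maxConsecutiveAttempts", "10")
    val sc = new SparkContext(conf)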
    

    5. Checkpointing and persistence can be triggered actively or passively

    checkpoint marks an RDD, produces a set of files, and is the end point of the entire lineage.

    cache internally calls persist and is equivalent to persist(StorageLevel.MEMORY_ONLY); persist accepts a storage level as a parameter.

    If an RDD will be reused, it can be cached. The available storage levels are as follows (a usage sketch follows the selection guidance below):

      val NONE = new StorageLevel(false, false, false, false)
      val DISK_ONLY = new StorageLevel(true, false, false, false)
      val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
      val MEMORY_ONLY = new StorageLevel(false, true, false, true)
      val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
      val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
      val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
      val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
      val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
      val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
      val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
      val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
    
    

    How should a storage level be chosen?

    1. If the RDD fits in memory with the default storage level, use the default (MEMORY_ONLY).
    2. If it does not fit, use a faster serialization library with MEMORY_ONLY_SER.
    3. Choose the disk-backed levels only when the data volume is very large.
    4. If faster recovery from failures is desired, use the replicated storage levels (the _2 variants).
    5. The advantage of OFF_HEAP is a shared memory pool and reduced GC pressure.
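
    A short usage sketch (not from the original post) contrasting cache, persist with an explicit level, and checkpoint; it assumes an existing SparkContext `sc`, and the checkpoint directory is hypothetical:

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical directory

    val cached = sc.parallelize(1 to 1000000).map(_ * 2)
    cached.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)

    val spilled = sc.parallelize(1 to 1000000).map(_ * 3)
    spilled.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilled to disk if needed

    val stable = sc.parallelize(1 to 1000000).map(_ * 5)
    stable.checkpoint()   // marks the RDD; the checkpoint files are written when an action runs
    stable.count()        // triggers the job, materializes the checkpoint, and truncates the lineage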

    6. Data scheduling elasticity

    Spark abstracts the execution model as a DAG, so tasks from multiple Stages can be chained or run in parallel without writing intermediate Stage results to HDFS, and if a node fails its work can be taken over by another node. A small multi-stage example is sketched below.
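
    A minimal sketch (assuming an existing SparkContext `sc`) of a two-stage job: reduceByKey introduces a shuffle dependency, so the map side forms one stage and the reduce side another, and the intermediate shuffle data is managed by Spark rather than written to HDFS:

    // Stage 1: parallelize + flatMap + map (everything up to the shuffle boundary)
    // Stage 2: the reduce side of the shuffle plus the final action
    val lines  = sc.parallelize(Seq("a b a", "b c", "a c c"))
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle dependency => stage boundary
    counts.collect().foreach(println)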

    7. Data partitioning elasticity

    The coalesce operator can be used to repartition an RDD (a usage sketch follows the excerpt). The source is as follows:

    /**
       * Return a new RDD that is reduced into `numPartitions` partitions.
       *
       * This results in a narrow dependency, e.g. if you go from 1000 partitions
       * to 100 partitions, there will not be a shuffle, instead each of the 100
       * new partitions will claim 10 of the current partitions. If a larger number
       * of partitions is requested, it will stay at the current number of partitions.
       *
       * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
       * this may result in your computation taking place on fewer nodes than
       * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
       * you can pass shuffle = true. This will add a shuffle step, but means the
       * current upstream partitions will be executed in parallel (per whatever
       * the current partitioning is).
       *
       * @note With shuffle = true, you can actually coalesce to a larger number
       * of partitions. This is useful if you have a small number of partitions,
       * say 100, potentially with a few partitions being abnormally large. Calling
       * coalesce(1000, shuffle = true) will result in 1000 partitions with the
       * data distributed using a hash partitioner. The optional partition coalescer
       * passed in must be serializable.
       */
      def coalesce(numPartitions: Int, shuffle: Boolean = false,
                   partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
                  (implicit ord: Ordering[T] = null)
          : RDD[T] = withScope {
        require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
        if (shuffle) {
          /** Distributes elements evenly across output partitions, starting from a random partition. */
          val distributePartition = (index: Int, items: Iterator[T]) => {
            var position = (new Random(index)).nextInt(numPartitions)
            items.map { t =>
              // Note that the hash code of the key will just be the key itself. The HashPartitioner
              // will mod it with the number of total partitions.
              position = position + 1
              (position, t)
            }
          } : Iterator[(Int, T)]
    
          // include a shuffle step so that our upstream tasks are still distributed
          new CoalescedRDD(
            new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
            new HashPartitioner(numPartitions)),
            numPartitions,
            partitionCoalescer).values
        } else {
          new CoalescedRDD(this, numPartitions, partitionCoalescer)
        }
      }
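
    A usage sketch (not from the original post), assuming an existing RDD[String] named `data` with roughly 1000 partitions:

    val narrowed = data.coalesce(100)                  // narrow dependency, no shuffle
    val widened  = data.coalesce(2000, shuffle = true) // increasing partitions requires a shuffle
    val same     = data.repartition(2000)              // shorthand for coalesce(2000, shuffle = true)

    println(narrowed.getNumPartitions) // 100
    println(widened.getNumPartitions)  // 2000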
    
  • Original post: https://www.cnblogs.com/jordan95225/p/13457674.html