• Spark RDD: Wide and Narrow Dependencies Explained


    1. Narrow dependencies

    A narrow dependency means that each Partition of a parent RDD is used by at most one Partition of the child RDD.

    There are two kinds of narrow dependencies:

    One is a one-to-one dependency, such as map and filter (OneToOneDependency)

    The other is a range dependency, such as union (RangeDependency)

    The OneToOneDependency source code is as follows:

    /**
     * :: DeveloperApi ::
     * Represents a one-to-one dependency between partitions of the parent and child RDDs.
     */
    @DeveloperApi
    class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
      override def getParents(partitionId: Int): List[Int] = List(partitionId)
    }
    

    The getParents method takes a partitionId, which means that when the child RDD uses this method it looks up the partition with the same partitionId in the parent.
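
    As a quick sanity check, the dependency type can be inspected directly on a transformed RDD. The snippet below is only a sketch: it assumes a spark-shell session where sc is the SparkContext provided by the shell, and the RDD names are illustrative.

    // Assumed spark-shell session: sc is the SparkContext provided by the shell
    val parent = sc.parallelize(1 to 100, 4)   // parent RDD with 4 partitions
    val child  = parent.map(_ * 2)             // map is a one-to-one (narrow) transformation

    // The child's only dependency is a OneToOneDependency on parent
    println(child.dependencies.head)

    // getParents returns the same partition index it is given
    val dep = child.dependencies.head.asInstanceOf[org.apache.spark.OneToOneDependency[Int]]
    println(dep.getParents(2))                 // List(2)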

    The RangeDependency source code is as follows:

    /**
     * :: DeveloperApi ::
     * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
     * @param rdd the parent RDD
     * @param inStart the start of the range in the parent RDD
     * @param outStart the start of the range in the child RDD
     * @param length the length of the range
     */
    @DeveloperApi
    class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
      extends NarrowDependency[T](rdd) {
    
      override def getParents(partitionId: Int): List[Int] = {
        if (partitionId >= outStart && partitionId < outStart + length) {
          List(partitionId - outStart + inStart)
        } else {
          Nil
        }
      }
    }
    

    When the child RDD looks up the corresponding parent Partition via getParents, it takes the partitionId, subtracts the start index in the child (outStart), and adds the start index in the parent (inStart). In short, the parent RDDs' Partitions are appended to the child RDD one after another in partitionId order.
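
    This can be seen with union, which keeps one RangeDependency per parent. The snippet below is a sketch under the same assumptions as above (a spark-shell session with sc available; illustrative RDD names):

    // Assumed spark-shell session: sc is the SparkContext provided by the shell
    val a = sc.parallelize(1 to 10, 2)     // parent with 2 partitions
    val b = sc.parallelize(11 to 40, 3)    // parent with 3 partitions
    val u = a.union(b)                     // child with 2 + 3 = 5 partitions

    // union creates one RangeDependency per parent RDD
    u.dependencies.foreach(println)

    // The dependency on b has inStart = 0, outStart = 2, length = 3,
    // so child partition 3 maps to parent partition 3 - 2 + 0 = 1 of b
    val rangeDep = u.dependencies(1).asInstanceOf[org.apache.spark.RangeDependency[Int]]
    println(rangeDep.getParents(3))        // List(1)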

    2. Wide dependencies

    A wide dependency means that a single Partition of the parent RDD is used by multiple Partitions of the child RDD, which forces a shuffle during computation, for example in groupByKey (a short example follows the source listing below).

    The source code lives in ShuffleDependency:

    /**
     * :: DeveloperApi ::
     * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
     * the RDD is transient since we don't need it on the executor side.
     *
     * @param _rdd the parent RDD
     * @param partitioner partitioner used to partition the shuffle output
     * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
     *                   explicitly then the default serializer, as specified by `spark.serializer`
     *                   config option, will be used.
     * @param keyOrdering key ordering for RDD's shuffles
     * @param aggregator map/reduce-side aggregator for RDD's shuffle
     * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
     */
    @DeveloperApi
    class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
        @transient private val _rdd: RDD[_ <: Product2[K, V]],
        val partitioner: Partitioner,
        val serializer: Serializer = SparkEnv.get.serializer,
        val keyOrdering: Option[Ordering[K]] = None,
        val aggregator: Option[Aggregator[K, V, C]] = None,
        val mapSideCombine: Boolean = false)
      extends Dependency[Product2[K, V]] {
    
      override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
    
      private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
      private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
      // Note: It's possible that the combiner class tag is null, if the combineByKey
      // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
      private[spark] val combinerClassName: Option[String] =
        Option(reflect.classTag[C]).map(_.runtimeClass.getName)
    
      // Generate a new shuffleId
      val shuffleId: Int = _rdd.context.newShuffleId()
    
      // Register this shuffle with the shuffleManager
      val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
        shuffleId, _rdd.partitions.length, this)
    
      _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
    }
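
    To see a ShuffleDependency in practice, the sketch below (assuming a spark-shell session where sc is the SparkContext provided by the shell) inspects the dependency created by groupByKey:

    // Assumed spark-shell session: sc is the SparkContext provided by the shell
    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val grouped = pairs.groupByKey(4)      // wide transformation with 4 output partitions

    // The grouped RDD depends on its parent through a ShuffleDependency
    val shuffleDep = grouped.dependencies.head
      .asInstanceOf[org.apache.spark.ShuffleDependency[String, Int, _]]

    println(shuffleDep.shuffleId)                  // a new shuffleId, e.g. 0 for the first shuffle in the app
    println(shuffleDep.partitioner.numPartitions)  // 4, the partitioner passed to groupByKey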
    
  • Original article: https://www.cnblogs.com/jordan95225/p/13457897.html