• Scala at Light Speed, Day 2


    Spark RDD Source Code Reading Notes


    RDD

    Before diving into RDD's methods, we should first get familiar with some basic facts about the RDD class.

    First, let's take a look at RDD's constructors:

    abstract class RDD[T: ClassTag](
        @transient private var _sc: SparkContext,
        @transient private var deps: Seq[Dependency[_]]
      ) extends Serializable with Logging {
     
      if (classOf[RDD[_]].isAssignableFrom(elementClassTag.runtimeClass)) {
        // This is a warning instead of an exception in order to avoid breaking user programs that
        // might have defined nested RDDs without running jobs with them.
        logWarning("Spark does not support nested RDDs (see SPARK-5063)")
      }
       
      /** Construct an RDD with just a one-to-one dependency on one parent */
      def this(@transient oneParent: RDD[_]) =
        this(oneParent.context , List(new OneToOneDependency(oneParent)))
    }

    Here we can see that an RDD is created with a SparkContext and its Dependency instances. The Dependency class, described in the paper mentioned earlier, holds a reference to the current RDD's parent RDD along with enough information to recover a lost partition from that parent.
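
    As a quick illustration (a usage sketch against the public API; the variable names are invented for this example), the dependencies of a derived RDD can be inspected from the driver:

    // Assumes an existing SparkContext named sc (e.g. in spark-shell)
    val base   = sc.parallelize(1 to 10)   // a source RDD with no parent
    val mapped = base.map(_ * 2)           // derived RDD with a one-to-one dependency on base

    println(base.dependencies)     // List() -- no parent RDD
    println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)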

    Next, let's look at the abstract methods that RDD requires its subclasses to implement:

    // Implemented by subclasses to compute a given Partition
    def compute(split: Partition, context: TaskContext): Iterator[T]
     
    // Implemented by subclasses to return the set of Partitions of this RDD
    protected def getPartitions: Array[Partition]
     
    // Implemented by subclasses to return this RDD's set of Dependencies
    protected def getDependencies: Seq[Dependency[_]] = deps
     
    // May be overridden by subclasses to provide a preferred placement policy for Partitions
    protected def getPreferredLocations(split: Partition): Seq[String] = Nil
     
    // May be overridden by subclasses to change how the RDD is partitioned
    @transient val partitioner: Option[Partitioner] = None

    These methods are essentially what drives Spark's computation, and they include two of the three main RDD interfaces described in the paper, namely getPartitions and getPreferredLocations. Two of them, compute and getPartitions, must be implemented by every subclass. It is worth remembering what they do so that we are not caught off guard when they reappear in the subclasses below.
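
    To make the contract concrete, here is a hedged sketch of a toy subclass (not from the Spark codebase; the class names and slicing scheme are invented for illustration) that serves an in-memory Seq split across a given number of partitions:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Hypothetical partition type that only needs to carry its index
    class SeqPartition(override val index: Int) extends Partition

    // A toy RDD serving an in-memory Seq (Spark's own ParallelCollectionRDD plays this role for real)
    class SeqRDD[T: ClassTag](sc: SparkContext, data: Seq[T], numSlices: Int)
      extends RDD[T](sc, Nil) {   // Nil: no parent RDDs, hence no dependencies

      // Required: describe this RDD's partitions
      override def getPartitions: Array[Partition] =
        Array.tabulate(numSlices)(i => new SeqPartition(i): Partition)

      // Required: produce the contents of one partition as an iterator
      // (elements are dealt out round-robin by index, so each lands in exactly one partition)
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        data.iterator.zipWithIndex
          .collect { case (elem, i) if i % numSlices == split.index => elem }
    }

    A driver program could then call, for instance, new SeqRDD(sc, Seq(1, 2, 3, 4, 5), 2).collect() and get all five elements back, grouped by partition.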

    Moving further down, we see that besides the SparkContext and its Dependencies, an RDD also carries its own id and name:

    // The SparkContext that created this RDD
    def sparkContext: SparkContext = sc
     
    // A unique ID within this SparkContext
    val id: Int = sc.newRddId()
     
    // The name of this RDD
    @transient var name: String = null
     
    // Assign a new name to this RDD
    def setName(_name: String): this.type = {
      name = _name
      this
    }
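
    For instance (a brief usage sketch), the name set here shows up in the Spark UI and in toDebugString output:

    val numbers = sc.parallelize(1 to 100).setName("numbers")
    println(numbers.name)   // "numbers"
    println(numbers.id)     // an integer unique within this SparkContext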

    Continuing further down, we reach RDD's public API.


    RDD Actions

    RDD provides a large number of APIs for us to use. A glance through the RDD ScalaDoc shows dozens of public methods, not to mention the non-public ones we are about to encounter. Jumping straight into RDD.scala and reading it top to bottom is therefore not a sensible approach, so I will take a different route here.

    As the Spark paper describes, not every RDD API call starts a Spark computation. Operations known as Transformations derive a new RDD from an existing one, but they are lazy: they are not computed immediately. Only when you actually trigger a computation are all the Transformations you have submitted executed in order, after Spark has had a chance to optimize them. So which operations trigger the computation?

    The RDD operations known as Actions are the ones that trigger Spark's computation. As shown in the figure from the paper, the Actions include count, collect, reduce, lookup, and save (since renamed to saveAsTextFile and saveAsObjectFile). It is easy to see that, apart from save, the other four operations all pull results back into the driver program, so it is quite natural that they are the ones that start the computation.
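
    As a quick illustration of this laziness (a usage sketch with made-up data), the map below does nothing on its own; each subsequent Action triggers a job:

    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs = words.map(w => (w, 1))   // Transformation: nothing runs yet

    pairs.count()        // Action: runs a job, returns 4
    pairs.collect()      // Action: Array((a,1), (b,1), (a,1), (c,1))
    pairs.lookup("a")    // Action (from PairRDDFunctions): Seq(1, 1)
    words.reduce(_ + _)  // Action: concatenates the strings into one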

    Let's start by looking at the source of these functions:

    // RDD.scala
     
    def collect(): Array[T] = withScope {
      val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
      Array.concat(results: _*)
    }
     
    // Return the number of elements in this RDD
    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
     
    // Reduce this RDD using the given binary operator
    def reduce(f: (T, T) => T): T = withScope {
      // Clean the user-supplied closure to prepare it for serialization
      val cleanF = sc.clean(f)
      // The reduce function applied within each partition, analogous to a combiner in Hadoop MapReduce
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF)) // Compute the per-partition result with Iterator#reduceLeft
        } else {
          None
        }
      }
       
      var jobResult: Option[T] = None
      // Merge the reduce results from all partitions
      val mergeResult = (index: Int, taskResult: Option[T]) => {
        if (taskResult.isDefined) {
          jobResult = jobResult match {
            case Some(value) => Some(f(value, taskResult.get))
            case None => taskResult
          }
        }
      }
      // Launch the Spark job
      sc.runJob(this, reducePartition, mergeResult)
     
      jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
     
    // PairRDDFunctions.scala
     
    // Look up the Seq of values corresponding to the given key in this RDD.
    // If this RDD has a known Partitioner, the method first uses getPartition to locate the
    // partition holding the key and searches only there, which makes the lookup much cheaper.
    def lookup(key: K): Seq[V] = self.withScope {
      self.partitioner match {
        case Some(p) => // A specific Partitioner exists
          val index = p.getPartition(key)  // Locate the partition that holds the key
          val process = (it: Iterator[(K, V)]) => {
            val buf = new ArrayBuffer[V]
            for (pair <- it if pair._1 == key) {
              buf += pair._2
            }
            buf
          } : Seq[V]
          // Run the job on that single partition only
          val res = self.context.runJob(self, process, Array(index), false)
          res(0)
        case None =>
          // Without a known Partitioner, fall back to scanning with RDD#filter
          self.filter(_._1 == key).map(_._2).collect()
      }
    }

    The four functions above share one trait: they all obtain their results by calling SparkContext#runJob, directly or indirectly. This method is clearly the entry point that launches a Spark computation. Let's note it down and come back to it when we read the SparkContext source.
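
    To see what runJob looks like from the caller's side, here is a hedged sketch using one of the simpler public overloads (the per-partition function is just an example); it computes roughly what RDD#count does by hand:

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // One result per partition, collected into an Array on the driver
    val partitionSizes: Array[Long] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)

    val total = partitionSizes.sum   // same idea as RDD#count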


    RDD Transformations

    With the Actions covered, it is naturally the Transformations' turn. There are a lot of them, so let's go through the commonly used ones one by one.

    map

    Let's start with the most heavily used one and go straight to the source:

    /**
     * Return a new RDD by applying a function to all elements of this RDD.
     */
    def map[U: ClassTag](f: T => U): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
    }

    Just as the paper says, map builds a MapPartitionsRDD out of the current RDD and the user-supplied function. Unsurprisingly, this class extends RDD. Let's look at its source:

    private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
        prev: RDD[T],
        f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
        preservesPartitioning: Boolean = false)
      extends RDD[U](prev) {
     
      override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
     
      override def getPartitions: Array[Partition] = firstParent[T].partitions
     
      override def compute(split: Partition, context: TaskContext): Iterator[U] =
        f(context, split.index, firstParent[T].iterator(split, context))
    }

    As we can see, MapPartitionsRDD implements the getPartitions and compute methods.

    getPartitions simply returns the partitions of its firstParent. In fact, a MapPartitionsRDD only ever has a single parent, namely the prev passed to its constructor.

    compute simply applies the function f passed in through the constructor. Looking back at RDD#map, the function passed in is (context, pid, iter) => iter.map(cleanF); combined with the MapPartitionsRDD source, the implementation is easy to see. It is worth remembering what the anonymous function's parameters stand for: context is the TaskContext, pid is the Partition's id (its index), and iter is that Partition's iterator. Keeping these in mind will save some head-scratching when they show up again later.

    Note also that MapPartitionsRDD overrides the partitioner field, whose value depends on the preservesPartitioning constructor parameter; that parameter defaults to false, and RDD#map does not set it.
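
    As a small illustration of the resulting lineage (a usage sketch with invented data), chaining a couple of maps just stacks MapPartitionsRDDs on top of the original RDD, which toDebugString makes visible:

    val base    = sc.parallelize(1 to 4, 2)
    val doubled = base.map(_ * 2).map(_ + 1)   // two MapPartitionsRDDs, each with a one-to-one dependency

    println(doubled.toDebugString)
    // (2) MapPartitionsRDD[...] at map ...
    //  |  MapPartitionsRDD[...] at map ...
    //  |  ParallelCollectionRDD[...] at parallelize ...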

    withScope

    Going back to RDD#map, notice that it also calls a function named withScope. This function appears very frequently; you will find it in many RDD APIs. Let's look at its source:

    // RDD.scala
     
    private[spark] def withScope[U](body: => U): U = RDDOperationScope.withScope[U](sc)(body)
     
    // RDDOperationScope.scala
     
    /**
     * A general, named code block representing an operation that instantiates RDDs.
     *
     * All RDDs instantiated in the corresponding code block will store a pointer to this object.
     * Examples include, but will not be limited to, existing RDD operations, such as textFile,
     * reduceByKey, and treeAggregate.
     *
     * An operation scope may be nested in other scopes. For instance, a SQL query may enclose
     * scopes associated with the public RDD APIs it uses under the hood.
     *
     * There is no particular relationship between an operation scope and a stage or a job.
     * A scope may live inside one stage (e.g. map) or span across multiple jobs (e.g. take).
     */
    @JsonInclude(Include.NON_NULL)
    @JsonPropertyOrder(Array("id", "name", "parent"))
    private[spark] class RDDOperationScope(
        val name: String,
        val parent: Option[RDDOperationScope] = None,
        val id: String = RDDOperationScope.nextScopeId().toString) {
     
      def toJson: String = {
        RDDOperationScope.jsonMapper.writeValueAsString(this)
      }
     
      /**
       * Return a list of scopes that this scope is a part of, including this scope itself.
       * The result is ordered from the outermost scope (eldest ancestor) to this scope.
       */
      @JsonIgnore
      def getAllScopes: Seq[RDDOperationScope] = {
        parent.map(_.getAllScopes).getOrElse(Seq.empty) ++ Seq(this)
      }
     
      override def equals(other: Any): Boolean = {
        other match {
          case s: RDDOperationScope =>
            id == s.id && name == s.name && parent == s.parent
          case _ => false
        }
      }
     
      override def toString: String = toJson
    }
     
    /**
     * A collection of utility methods to construct a hierarchical representation of RDD scopes.
     * An RDD scope tracks the series of operations that created a given RDD.
     */
    private[spark] object RDDOperationScope extends Logging {
      private val jsonMapper = new ObjectMapper().registerModule(DefaultScalaModule)
      private val scopeCounter = new AtomicInteger(0)
     
      def fromJson(s: String): RDDOperationScope = {
        jsonMapper.readValue(s, classOf[RDDOperationScope])
      }
     
      /** Return a globally unique operation scope ID. */
      def nextScopeId(): Int = scopeCounter.getAndIncrement
     
      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       * The name of the scope will be the first method name in the stack trace that is not the
       * same as this method's.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          allowNesting: Boolean = false)(body: => T): T = {
        val ourMethodName = "withScope"
        val callerMethodName = Thread.currentThread.getStackTrace()
          .dropWhile(_.getMethodName != ourMethodName)  // Drop every stack frame above the first withScope frame
          .find(_.getMethodName != ourMethodName)   // Find the method that called withScope, e.g. RDD#map
          .map(_.getMethodName)
          .getOrElse {
            // Log a warning just in case, but this should almost certainly never happen
            logWarning("No valid method name for this RDD operation scope!")
            "N/A"
          }
        withScope[T](sc, callerMethodName, allowNesting, ignoreParent = false)(body)
      }
     
      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       *
       * If nesting is allowed, any subsequent calls to this method in the given body will instantiate
       * child scopes that are nested within our scope. Otherwise, these calls will take no effect.
       *
       * Additionally, the caller of this method may optionally ignore the configurations and scopes
       * set by the higher level caller. In this case, this method will ignore the parent caller's
       * intention to disallow nesting, and the new scope instantiated will not have a parent. This
       * is useful for scoping physical operations in Spark SQL, for instance.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          name: String,
          allowNesting: Boolean,
          ignoreParent: Boolean)(body: => T): T = {
        // Save the old scope to restore it later
        val scopeKey = SparkContext.RDD_SCOPE_KEY
        val noOverrideKey = SparkContext.RDD_SCOPE_NO_OVERRIDE_KEY
        val oldScopeJson = sc.getLocalProperty(scopeKey)
        val oldScope = Option(oldScopeJson).map(RDDOperationScope.fromJson)
        val oldNoOverride = sc.getLocalProperty(noOverrideKey)
        try {
          if (ignoreParent) {
            // Ignore all parent settings and scopes and start afresh with our own root scope
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name).toJson)
          } else if (sc.getLocalProperty(noOverrideKey) == null) {
            // Otherwise, set the scope only if the higher level caller allows us to do so
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name, oldScope).toJson)
          }
          // Optionally disallow the child body to override our scope
          if (!allowNesting) {
            sc.setLocalProperty(noOverrideKey, "true")
          }
          // A new RDDOperationScope has been set on sc above; now run the caller's body
          body
        } finally {
          // Restore the previous values once the body has finished
          // Remember to restore any state that was modified before exiting
          sc.setLocalProperty(scopeKey, oldScopeJson)
          sc.setLocalProperty(noOverrideKey, oldNoOverride)
        }
      }
    }

    For now, note that the local properties withScope touches are scopeKey and noOverrideKey. At our current altitude we are unlikely to need the details of how these two are used, so let's leave them for when we dig into SparkContext.

    filter

    def filter(f: T => Boolean): RDD[T] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[T, T](
        this,
        (context, pid, iter) => iter.filter(cleanF),
        preservesPartitioning = true)
    }

    As we can see, filter is essentially just another kind of map.
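
    One practical difference is worth noting (a usage sketch with made-up data): because filter passes preservesPartitioning = true while map does not, filtering a hash-partitioned pair RDD keeps its partitioner, whereas mapping it drops it:

    import org.apache.spark.HashPartitioner
    // (on very old Spark versions you may also need: import org.apache.spark.SparkContext._)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
                  .partitionBy(new HashPartitioner(4))

    println(pairs.filter(_._2 > 1).partitioner)          // Some(org.apache.spark.HashPartitioner@...)
    println(pairs.map(kv => (kv._1, kv._2)).partitioner) // None -- map cannot promise the keys are unchanged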

    flatMap

    def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
    }

    Essentially the same as above.

    sample

    /**
     * Return a sampled subset of this RDD.
     *
     * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
     * @param fraction expected size of the sample as a fraction of this RDD's size
     *  without replacement: probability that each element is chosen; fraction must be [0, 1]
     *  with replacement: expected number of times each element is chosen; fraction must be >= 0
     * @param seed seed for the random number generator
     */
    def sample(
        withReplacement: Boolean,
        fraction: Double,
        seed: Long = Utils.random.nextLong): RDD[T] = withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }

    As we can see, sample creates a PartitionwiseSampledRDD, passing in either a PoissonSampler or a BernoulliSampler depending on the withReplacement flag. As their names suggest, these correspond to the Poisson and Bernoulli distributions; they are simply two different random samplers, so we will not dissect them here. Let's look at PartitionwiseSampledRDD instead:

    private[spark]
    class PartitionwiseSampledRDDPartition(val prev: Partition, val seed: Long)
      extends Partition with Serializable {
      override val index: Int = prev.index
    }
     
    /**
     * A RDD sampled from its parent RDD partition-wise. For each partition of the parent RDD,
     * a user-specified [[org.apache.spark.util.random.RandomSampler]] instance is used to obtain
     * a random sample of the records in the partition. The random seeds assigned to the samplers
     * are guaranteed to have different values.
     *
     * @param prev RDD to be sampled
     * @param sampler a random sampler
     * @param preservesPartitioning whether the sampler preserves the partitioner of the parent RDD
     * @param seed random seed
     * @tparam T input RDD item type
     * @tparam U sampled RDD item type
     */
    private[spark] class PartitionwiseSampledRDD[T: ClassTag, U: ClassTag](
        prev: RDD[T],
        sampler: RandomSampler[T, U],
        @transient preservesPartitioning: Boolean,
        @transient seed: Long = Utils.random.nextLong)
      extends RDD[U](prev) {
     
      @transient override val partitioner = if (preservesPartitioning) prev.partitioner else None
     
      override def getPartitions: Array[Partition] = {
        val random = new Random(seed)
        firstParent[T].partitions.map(x => new PartitionwiseSampledRDDPartition(x, random.nextLong()))
      }
     
      override def getPreferredLocations(split: Partition): Seq[String] =
        firstParent[T].preferredLocations(split.asInstanceOf[PartitionwiseSampledRDDPartition].prev)
     
      override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
        val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
        val thisSampler = sampler.clone
        thisSampler.setSeed(split.seed)
        thisSampler.sample(firstParent[T].iterator(split.prev, context))
      }
    }

    The logic is quite direct: getPartitions shows that PartitionwiseSampledRDD reuses its parent RDD's partitions as its own (wrapping each with a per-partition random seed), and compute shows that it samples the parent partition's iterator by calling the RandomSampler's sample method.
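
    For reference, a quick usage sketch of sample itself (the fraction and seed are arbitrary):

    val data = sc.parallelize(1 to 10000)

    // Without replacement: each element is kept with probability ~0.01 (BernoulliSampler)
    val small = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)

    // With replacement: each element is drawn ~0.01 times on average (PoissonSampler)
    val withRep = data.sample(withReplacement = true, fraction = 0.01, seed = 42L)

    println(small.count())   // roughly 100; the exact value depends on the seed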

    cartesian

    /**
     * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
     * elements (a, b) where a is in `this` and b is in `other`.
     */
    def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
      new CartesianRDD(sc, this, other)
    }

    It builds a CartesianRDD from the two RDDs, which seems reasonable enough. Let's take a look at CartesianRDD:

    private[spark] class CartesianPartition(
        idx: Int,
        @transient rdd1: RDD[_],
        @transient rdd2: RDD[_],
        s1Index: Int,
        s2Index: Int
      ) extends Partition {
      var s1 = rdd1.partitions(s1Index)
      var s2 = rdd2.partitions(s2Index)
      override val index: Int = idx
     
      // Java serialization hook (writeObject): refresh s1 and s2 when the task is serialized
      @throws(classOf[IOException])
      private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
        // Update the reference to parent split at the time of task serialization
        s1 = rdd1.partitions(s1Index)
        s2 = rdd2.partitions(s2Index)
        oos.defaultWriteObject()
      }
    }
     
    private[spark]
    class CartesianRDD[T: ClassTag, U: ClassTag](
        sc: SparkContext,
        var rdd1 : RDD[T],
        var rdd2 : RDD[U])
      extends RDD[Pair[T, U]](sc, Nil)
      with Serializable {
     
      val numPartitionsInRdd2 = rdd2.partitions.length
     
      // Build this RDD's partitions as the cross product of rdd1's and rdd2's partitions
      override def getPartitions: Array[Partition] = {
        // create the cross product split
        val array = new Array[Partition](rdd1.partitions.length * rdd2.partitions.length)
        for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
          val idx = s1.index * numPartitionsInRdd2 + s2.index
          array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
        }
        array
      }
     
      // preferredLocations is derived from the preferredLocations of rdd1 and rdd2
      override def getPreferredLocations(split: Partition): Seq[String] = {
        val currSplit = split.asInstanceOf[CartesianPartition]
        (rdd1.preferredLocations(currSplit.s1) ++ rdd2.preferredLocations(currSplit.s2)).distinct
      }
     
      // Compute the result directly from the iterators of rdd1 and rdd2
      override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
        val currSplit = split.asInstanceOf[CartesianPartition]
        for (x <- rdd1.iterator(currSplit.s1, context);
             y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
      }
     
      // Declare that this RDD depends on both rdd1 and rdd2
      override def getDependencies: Seq[Dependency[_]] = List(
        new NarrowDependency(rdd1) {
          def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
        },
        new NarrowDependency(rdd2) {
          def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
        }
      )
     
      override def clearDependencies() {
        super.clearDependencies()
        rdd1 = null
        rdd2 = null
      }
    }

    Also fairly intuitive: each output partition pairs one partition of rdd1 with one partition of rdd2, so the result has rdd1.partitions.length * rdd2.partitions.length partitions.
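
    A small usage sketch (toy data) showing that the partition count follows directly from getPartitions above:

    val a = sc.parallelize(Seq(1, 2), numSlices = 2)
    val b = sc.parallelize(Seq("x", "y", "z"), numSlices = 3)

    val prod = a.cartesian(b)
    println(prod.partitions.length)   // 6 = 2 * 3, one partition per (s1, s2) pair
    println(prod.collect().toSeq)     // all (Int, String) pairs: (1,x), (1,y), (1,z), (2,x), ...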

    distinct

    /**
     * Return a new RDD containing the distinct elements in this RDD.
     */
    def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
      map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
    }

    distinct is implemented on top of reduceByKey, which makes sense: mapping each element x to the pair (x, null) turns duplicate elements into duplicate keys, reduceByKey then collapses each key to a single pair, and the final map strips the null back off.
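
    Traced on a tiny example (made-up data):

    val xs = sc.parallelize(Seq(1, 1, 2, 3, 3, 3))

    // map(x => (x, null))      => (1,null), (1,null), (2,null), (3,null), (3,null), (3,null)
    // reduceByKey((x, y) => x) => (1,null), (2,null), (3,null)
    // map(_._1)                => 1, 2, 3
    println(xs.distinct().collect().toSeq)   // Seq(1, 2, 3), in some order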

    groupBy

    def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
      groupBy[K](f, defaultPartitioner(this))
    }
     
    def groupBy[K](f: T => K, numPartitions: Int)
                  (implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
      groupBy(f, new HashPartitioner(numPartitions))
    }
     
    def groupBy[K](f: T => K, p: Partitioner)
                  (implicit kt: ClassTag[K], ord: Ordering[K] = null) : RDD[(K, Iterable[T])] = withScope {
      val cleanF = sc.clean(f)
      // Key each record with the supplied f, then delegate to groupByKey
      this.map(t => (cleanF(t), t)).groupByKey(p)
    }
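
    A short usage sketch (toy data): grouping integers by their remainder mod 3 keys every element with cleanF and then hands off to groupByKey:

    val nums = sc.parallelize(1 to 9)
    val grouped = nums.groupBy(_ % 3)   // RDD[(Int, Iterable[Int])]

    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.toList}") }
    // 0 -> List(3, 6, 9)
    // 1 -> List(1, 4, 7)
    // 2 -> List(2, 5, 8)      (group and element order may differ)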

    Summary

    At this point the overall picture should be clear: an RDD Transformation constructs a new RDD with the original RDD as its parent, so repeatedly applying Transformations produces a chain of RDD operations, but the whole pipeline is deferred until an RDD Action runs; the Action then launches the entire pipeline by calling SparkContext#runJob.
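
    Putting it all together (a usage sketch; the input path is just an example), the lineage built by a chain of Transformations is visible on the final RDD, and only the Action at the end starts a job:

    val lengths = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
                    .flatMap(_.split("\\s+"))
                    .filter(_.nonEmpty)
                    .map(_.length)                       // each step above just adds another RDD to the chain

    println(lengths.toDebugString)   // prints the RDD lineage; still no job has run
    println(lengths.count())         // Action: SparkContext#runJob fires here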

  • Original article: https://www.cnblogs.com/qq852667211/p/5096890.html