• Scala at Light Speed, Day 2


    Spark RDD Source Code Reading Notes


    RDD

    Before diving into RDD's methods, we should first get familiar with some basic facts about the RDD class.

    First, let's take a look at RDD's constructors:

    abstract class RDD[T: ClassTag](
        @transient private var _sc: SparkContext,
        @transient private var deps: Seq[Dependency[_]]
      ) extends Serializable with Logging {
     
      if (classOf[RDD[_]].isAssignableFrom(elementClassTag.runtimeClass)) {
        // This is a warning instead of an exception in order to avoid breaking user programs that
        // might have defined nested RDDs without running jobs with them.
        logWarning("Spark does not support nested RDDs (see SPARK-5063)")
      }
       
      /** Construct an RDD with just a one-to-one dependency on one parent */
      def this(@transient oneParent: RDD[_]) =
        this(oneParent.context , List(new OneToOneDependency(oneParent)))
    }

    Here we can see that an RDD is created with a SparkContext and its Dependency instances. The Dependency class, described in the paper mentioned earlier, holds a reference to the current RDD's parent RDD along with enough information to recover a lost partition from that parent.
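
    As a quick illustration (a usage sketch against the public API; the variable names are invented for this example), the dependencies of a derived RDD can be inspected from the driver:

    // Assumes an existing SparkContext named sc (e.g. in spark-shell)
    val base   = sc.parallelize(1 to 10)   // a source RDD with no parent
    val mapped = base.map(_ * 2)           // derived RDD with a one-to-one dependency on base

    println(base.dependencies)     // List() -- no parent RDD
    println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)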

    Next, let's look at the abstract methods that RDD requires its subclasses to implement:

    // Implemented by subclasses to compute a given Partition
    def compute(split: Partition, context: TaskContext): Iterator[T]
     
    // Implemented by subclasses to return the set of Partitions of this RDD
    protected def getPartitions: Array[Partition]
     
    // Implemented by subclasses to return this RDD's set of Dependencies
    protected def getDependencies: Seq[Dependency[_]] = deps
     
    // May be overridden by subclasses to provide a preferred placement policy for Partitions
    protected def getPreferredLocations(split: Partition): Seq[String] = Nil
     
    // May be overridden by subclasses to change how the RDD is partitioned
    @transient val partitioner: Option[Partitioner] = None

    These methods are essentially what drives Spark's computation, and they include two of the three main RDD interfaces described in the paper, namely getPartitions and getPreferredLocations. Two of them, compute and getPartitions, must be implemented by every subclass. It is worth remembering what they do so that we are not caught off guard when they reappear in the subclasses below.
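
    To make the contract concrete, here is a hedged sketch of a toy subclass (not from the Spark codebase; the class names and slicing scheme are invented for illustration) that serves an in-memory Seq split across a given number of partitions:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Hypothetical partition type that only needs to carry its index
    class SeqPartition(override val index: Int) extends Partition

    // A toy RDD serving an in-memory Seq (Spark's own ParallelCollectionRDD plays this role for real)
    class SeqRDD[T: ClassTag](sc: SparkContext, data: Seq[T], numSlices: Int)
      extends RDD[T](sc, Nil) {   // Nil: no parent RDDs, hence no dependencies

      // Required: describe this RDD's partitions
      override def getPartitions: Array[Partition] =
        Array.tabulate(numSlices)(i => new SeqPartition(i): Partition)

      // Required: produce the contents of one partition as an iterator
      // (elements are dealt out round-robin by index, so each lands in exactly one partition)
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        data.iterator.zipWithIndex
          .collect { case (elem, i) if i % numSlices == split.index => elem }
    }

    A driver program could then call, for instance, new SeqRDD(sc, Seq(1, 2, 3, 4, 5), 2).collect() and get all five elements back, grouped by partition.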

    Moving further down, we see that besides the SparkContext and its Dependencies, an RDD also carries its own id and name:

    // The SparkContext that created this RDD
    def sparkContext: SparkContext = sc
     
    // A unique ID within this SparkContext
    val id: Int = sc.newRddId()
     
    // The name of this RDD
    @transient var name: String = null
     
    // Assign a new name to this RDD
    def setName(_name: String): this.type = {
      name = _name
      this
    }
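
    For instance (a brief usage sketch), the name set here shows up in the Spark UI and in toDebugString output:

    val numbers = sc.parallelize(1 to 100).setName("numbers")
    println(numbers.name)   // "numbers"
    println(numbers.id)     // an integer unique within this SparkContext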

    Continuing further down, we reach RDD's public API.


    RDD Actions

    RDD provides a large number of APIs for us to use. A glance through the RDD ScalaDoc shows dozens of public methods, not to mention the non-public ones we are about to encounter. Jumping straight into RDD.scala and reading it top to bottom is therefore not a sensible approach, so I will take a different route here.

    As the Spark paper describes, not every RDD API call starts a Spark computation. Operations known as Transformations derive a new RDD from an existing one, but they are lazy: they are not computed immediately. Only when you actually trigger a computation are all the Transformations you have submitted executed in order, after Spark has had a chance to optimize them. So which operations trigger the computation?

    The RDD operations known as Actions are the ones that trigger Spark's computation. As shown in the figure from the paper, the Actions include count, collect, reduce, lookup, and save (since renamed to saveAsTextFile and saveAsObjectFile). It is easy to see that, apart from save, the other four operations all pull results back into the driver program, so it is quite natural that they are the ones that start the computation.
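
    As a quick illustration of this laziness (a usage sketch with made-up data), the map below does nothing on its own; each subsequent Action triggers a job:

    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs = words.map(w => (w, 1))   // Transformation: nothing runs yet

    pairs.count()        // Action: runs a job, returns 4
    pairs.collect()      // Action: Array((a,1), (b,1), (a,1), (c,1))
    pairs.lookup("a")    // Action (from PairRDDFunctions): Seq(1, 1)
    words.reduce(_ + _)  // Action: concatenates the strings into one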

    Let's start by looking at the source of these functions:

    // RDD.scala
     
    def collect(): Array[T] = withScope {
      val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
      Array.concat(results: _*)
    }
     
    // Return the number of elements in this RDD
    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
     
    // Reduce this RDD using the given binary operator
    def reduce(f: (T, T) => T): T = withScope {
      // Clean the user-supplied closure to prepare it for serialization
      val cleanF = sc.clean(f)
      // The reduce function applied within each partition, analogous to a combiner in Hadoop MapReduce
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF)) // Compute the per-partition result with Iterator#reduceLeft
        } else {
          None
        }
      }
       
      var jobResult: Option[T] = None
      // Merge the reduce results from all partitions
      val mergeResult = (index: Int, taskResult: Option[T]) => {
        if (taskResult.isDefined) {
          jobResult = jobResult match {
            case Some(value) => Some(f(value, taskResult.get))
            case None => taskResult
          }
        }
      }
      // Launch the Spark job
      sc.runJob(this, reducePartition, mergeResult)
     
      jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
     
    // PairRDDFunctions.scala
     
    // Look up the Seq of values corresponding to the given key in this RDD.
    // If this RDD has a known Partitioner, the method first uses getPartition to locate the
    // partition holding the key and searches only there, which makes the lookup much cheaper.
    def lookup(key: K): Seq[V] = self.withScope {
      self.partitioner match {
        case Some(p) => // A specific Partitioner exists
          val index = p.getPartition(key)  // Locate the partition that holds the key
          val process = (it: Iterator[(K, V)]) => {
            val buf = new ArrayBuffer[V]
            for (pair <- it if pair._1 == key) {
              buf += pair._2
            }
            buf
          } : Seq[V]
          // Run the job on that single partition only
          val res = self.context.runJob(self, process, Array(index), false)
          res(0)
        case None =>
          // Without a known Partitioner, fall back to scanning with RDD#filter
          self.filter(_._1 == key).map(_._2).collect()
      }
    }

    The four functions above share one trait: they all obtain their results by calling SparkContext#runJob, directly or indirectly. This method is clearly the entry point that launches a Spark computation. Let's note it down and come back to it when we read the SparkContext source.
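
    To see what runJob looks like from the caller's side, here is a hedged sketch using one of the simpler public overloads (the per-partition function is just an example); it computes roughly what RDD#count does by hand:

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // One result per partition, collected into an Array on the driver
    val partitionSizes: Array[Long] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)

    val total = partitionSizes.sum   // same idea as RDD#count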


    RDD Transformations

    With the Actions covered, it is naturally the Transformations' turn. There are a lot of them, so let's go through the commonly used ones one by one.

    map

    Let's start with the most heavily used one and go straight to the source:

    /**
     * Return a new RDD by applying a function to all elements of this RDD.
     */
    def map[U: ClassTag](f: T => U): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
    }

    Just as the paper says, map builds a MapPartitionsRDD out of the current RDD and the user-supplied function. Unsurprisingly, this class extends RDD. Let's look at its source:

    private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
        prev: RDD[T],
        f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
        preservesPartitioning: Boolean = false)
      extends RDD[U](prev) {
     
      override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
     
      override def getPartitions: Array[Partition] = firstParent[T].partitions
     
      override def compute(split: Partition, context: TaskContext): Iterator[U] =
        f(context, split.index, firstParent[T].iterator(split, context))
    }

    As we can see, MapPartitionsRDD implements the getPartitions and compute methods.

    getPartitions simply returns the partitions of its firstParent. In fact, a MapPartitionsRDD only ever has a single parent, namely the prev passed to its constructor.

    compute simply applies the function f passed in through the constructor. Looking back at RDD#map, the function passed in is (context, pid, iter) => iter.map(cleanF); combined with the MapPartitionsRDD source, the implementation is easy to see. It is worth remembering what the anonymous function's parameters stand for: context is the TaskContext, pid is the Partition's id (its index), and iter is that Partition's iterator. Keeping these in mind will save some head-scratching when they show up again later.

    Note also that MapPartitionsRDD overrides the partitioner field, whose value depends on the preservesPartitioning constructor parameter; that parameter defaults to false, and RDD#map does not set it.
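
    As a small illustration of the resulting lineage (a usage sketch with invented data), chaining a couple of maps just stacks MapPartitionsRDDs on top of the original RDD, which toDebugString makes visible:

    val base    = sc.parallelize(1 to 4, 2)
    val doubled = base.map(_ * 2).map(_ + 1)   // two MapPartitionsRDDs, each with a one-to-one dependency

    println(doubled.toDebugString)
    // (2) MapPartitionsRDD[...] at map ...
    //  |  MapPartitionsRDD[...] at map ...
    //  |  ParallelCollectionRDD[...] at parallelize ...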

    withScope

    Going back to RDD#map, notice that it also calls a function named withScope. This function appears very frequently; you will find it in many RDD APIs. Let's look at its source:

    // RDD.scala
     
    private[spark] def withScope[U](body: => U): U = RDDOperationScope.withScope[U](sc)(body)
     
    // RDDOperationScope.scala
     
    /**
     * A general, named code block representing an operation that instantiates RDDs.
     *
     * All RDDs instantiated in the corresponding code block will store a pointer to this object.
     * Examples include, but will not be limited to, existing RDD operations, such as textFile,
     * reduceByKey, and treeAggregate.
     *
     * An operation scope may be nested in other scopes. For instance, a SQL query may enclose
     * scopes associated with the public RDD APIs it uses under the hood.
     *
     * There is no particular relationship between an operation scope and a stage or a job.
     * A scope may live inside one stage (e.g. map) or span across multiple jobs (e.g. take).
     */
    @JsonInclude(Include.NON_NULL)
    @JsonPropertyOrder(Array("id", "name", "parent"))
    private[spark] class RDDOperationScope(
        val name: String,
        val parent: Option[RDDOperationScope] = None,
        val id: String = RDDOperationScope.nextScopeId().toString) {
     
      def toJson: String = {
        RDDOperationScope.jsonMapper.writeValueAsString(this)
      }
     
      /**
       * Return a list of scopes that this scope is a part of, including this scope itself.
       * The result is ordered from the outermost scope (eldest ancestor) to this scope.
       */
      @JsonIgnore
      def getAllScopes: Seq[RDDOperationScope] = {
        parent.map(_.getAllScopes).getOrElse(Seq.empty) ++ Seq(this)
      }
     
      override def equals(other: Any): Boolean = {
        other match {
          case s: RDDOperationScope =>
            id == s.id && name == s.name && parent == s.parent
          case _ => false
        }
      }
     
      override def toString: String = toJson
    }
     
    /**
     * A collection of utility methods to construct a hierarchical representation of RDD scopes.
     * An RDD scope tracks the series of operations that created a given RDD.
     */
    private[spark] object RDDOperationScope extends Logging {
      private val jsonMapper = new ObjectMapper().registerModule(DefaultScalaModule)
      private val scopeCounter = new AtomicInteger(0)
     
      def fromJson(s: String): RDDOperationScope = {
        jsonMapper.readValue(s, classOf[RDDOperationScope])
      }
     
      /** Return a globally unique operation scope ID. */
      def nextScopeId(): Int = scopeCounter.getAndIncrement
     
      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       * The name of the scope will be the first method name in the stack trace that is not the
       * same as this method's.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          allowNesting: Boolean = false)(body: => T): T = {
        val ourMethodName = "withScope"
        val callerMethodName = Thread.currentThread.getStackTrace()
          .dropWhile(_.getMethodName != ourMethodName)  // Drop every stack frame above the first withScope frame
          .find(_.getMethodName != ourMethodName)   // Find the method that called withScope, e.g. RDD#map
          .map(_.getMethodName)
          .getOrElse {
            // Log a warning just in case, but this should almost certainly never happen
            logWarning("No valid method name for this RDD operation scope!")
            "N/A"
          }
        withScope[T](sc, callerMethodName, allowNesting, ignoreParent = false)(body)
      }
     
      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       *
       * If nesting is allowed, any subsequent calls to this method in the given body will instantiate
       * child scopes that are nested within our scope. Otherwise, these calls will take no effect.
       *
       * Additionally, the caller of this method may optionally ignore the configurations and scopes
       * set by the higher level caller. In this case, this method will ignore the parent caller's
       * intention to disallow nesting, and the new scope instantiated will not have a parent. This
       * is useful for scoping physical operations in Spark SQL, for instance.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          name: String,
          allowNesting: Boolean,
          ignoreParent: Boolean)(body: => T): T = {
        // Save the old scope to restore it later
        val scopeKey = SparkContext.RDD_SCOPE_KEY
        val noOverrideKey = SparkContext.RDD_SCOPE_NO_OVERRIDE_KEY
        val oldScopeJson = sc.getLocalProperty(scopeKey)
        val oldScope = Option(oldScopeJson).map(RDDOperationScope.fromJson)
        val oldNoOverride = sc.getLocalProperty(noOverrideKey)
        try {
          if (ignoreParent) {
            // Ignore all parent settings and scopes and start afresh with our own root scope
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name).toJson)
          } else if (sc.getLocalProperty(noOverrideKey) == null) {
            // Otherwise, set the scope only if the higher level caller allows us to do so
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name, oldScope).toJson)
          }
          // Optionally disallow the child body to override our scope
          if (!allowNesting) {
            sc.setLocalProperty(noOverrideKey, "true")
          }
          // A new RDDOperationScope has been set on sc above; now run the caller's body
          body
        } finally {
          // Restore the previous values once the body has finished
          // Remember to restore any state that was modified before exiting
          sc.setLocalProperty(scopeKey, oldScopeJson)
          sc.setLocalProperty(noOverrideKey, oldNoOverride)
        }
      }
    }

    For now, note that the local properties withScope touches are scopeKey and noOverrideKey. At our current altitude we are unlikely to need the details of how these two are used, so let's leave them for when we dig into SparkContext.

    filter

    def filter(f: T => Boolean): RDD[T] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[T, T](
        this,
        (context, pid, iter) => iter.filter(cleanF),
        preservesPartitioning = true)
    }

    As we can see, filter is essentially just another kind of map.
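
    One practical difference is worth noting (a usage sketch with made-up data): because filter passes preservesPartitioning = true while map does not, filtering a hash-partitioned pair RDD keeps its partitioner, whereas mapping it drops it:

    import org.apache.spark.HashPartitioner
    // (on very old Spark versions you may also need: import org.apache.spark.SparkContext._)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
                  .partitionBy(new HashPartitioner(4))

    println(pairs.filter(_._2 > 1).partitioner)          // Some(org.apache.spark.HashPartitioner@...)
    println(pairs.map(kv => (kv._1, kv._2)).partitioner) // None -- map cannot promise the keys are unchanged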

    flatMap

    def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
    }

    Essentially the same as above.

    sample

    /**
     * Return a sampled subset of this RDD.
     *
     * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
     * @param fraction expected size of the sample as a fraction of this RDD's size
     *  without replacement: probability that each element is chosen; fraction must be [0, 1]
     *  with replacement: expected number of times each element is chosen; fraction must be >= 0
     * @param seed seed for the random number generator
     */
    def sample(
        withReplacement: Boolean,
        fraction: Double,
        seed: Long = Utils.random.nextLong): RDD[T] = withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }

    As we can see, sample creates a PartitionwiseSampledRDD, passing in either a PoissonSampler or a BernoulliSampler depending on the withReplacement flag. As their names suggest, these correspond to the Poisson and Bernoulli distributions; they are simply two different random samplers, so we will not dissect them here. Let's look at PartitionwiseSampledRDD instead:

    private[spark]
    class PartitionwiseSampledRDDPartition(val prev: Partition, val seed: Long)
      extends Partition with Serializable {
      override val index: Int = prev.index
    }
     
    /**
     * A RDD sampled from its parent RDD partition-wise. For each partition of the parent RDD,
     * a user-specified [[org.apache.spark.util.random.RandomSampler]] instance is used to obtain
     * a random sample of the records in the partition. The random seeds assigned to the samplers
     * are guaranteed to have different values.
     *
     * @param prev RDD to be sampled
     * @param sampler a random sampler
     * @param preservesPartitioning whether the sampler preserves the partitioner of the parent RDD
     * @param seed random seed
     * @tparam T input RDD item type
     * @tparam U sampled RDD item type
     */
    private[spark] class PartitionwiseSampledRDD[T: ClassTag, U: ClassTag](
        prev: RDD[T],
        sampler: RandomSampler[T, U],
        @transient preservesPartitioning: Boolean,
        @transient seed: Long = Utils.random.nextLong)
      extends RDD[U](prev) {
     
      @transient override val partitioner = if (preservesPartitioning) prev.partitioner else None
     
      override def getPartitions: Array[Partition] = {
        val random = new Random(seed)
        firstParent[T].partitions.map(x => new PartitionwiseSampledRDDPartition(x, random.nextLong()))
      }
     
      override def getPreferredLocations(split: Partition): Seq[String] =
        firstParent[T].preferredLocations(split.asInstanceOf[PartitionwiseSampledRDDPartition].prev)
     
      override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
        val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
        val thisSampler = sampler.clone
        thisSampler.setSeed(split.seed)
        thisSampler.sample(firstParent[T].iterator(split.prev, context))
      }
    }

    The logic is quite direct: getPartitions shows that PartitionwiseSampledRDD reuses its parent RDD's partitions as its own (wrapping each with a per-partition random seed), and compute shows that it samples the parent partition's iterator by calling the RandomSampler's sample method.
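
    For reference, a quick usage sketch of sample itself (the fraction and seed are arbitrary):

    val data = sc.parallelize(1 to 10000)

    // Without replacement: each element is kept with probability ~0.01 (BernoulliSampler)
    val small = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)

    // With replacement: each element is drawn ~0.01 times on average (PoissonSampler)
    val withRep = data.sample(withReplacement = true, fraction = 0.01, seed = 42L)

    println(small.count())   // roughly 100; the exact value depends on the seed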

    cartesian

    /**
     * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
     * elements (a, b) where a is in `this` and b is in `other`.
     */
    def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
      new CartesianRDD(sc, this, other)
    }

    It builds a CartesianRDD from the two RDDs, which seems reasonable enough. Let's take a look at CartesianRDD:

    private[spark] class CartesianPartition(
        idx: Int,
        @transient rdd1: RDD[_],
        @transient rdd2: RDD[_],
        s1Index: Int,
        s2Index: Int
      ) extends Partition {
      var s1 = rdd1.partitions(s1Index)
      var s2 = rdd2.partitions(s2Index)
      override val index: Int = idx
     
      // Java serialization hook (writeObject): refresh s1 and s2 when the task is serialized
      @throws(classOf[IOException])
      private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
        // Update the reference to parent split at the time of task serialization
        s1 = rdd1.partitions(s1Index)
        s2 = rdd2.partitions(s2Index)
        oos.defaultWriteObject()
      }
    }
     
    private[spark]
    class CartesianRDD[T: ClassTag, U: ClassTag](
        sc: SparkContext,
        var rdd1 : RDD[T],
        var rdd2 : RDD[U])
      extends RDD[Pair[T, U]](sc, Nil)
      with Serializable {
     
      val numPartitionsInRdd2 = rdd2.partitions.length
     
      // Build this RDD's partitions as the cross product of rdd1's and rdd2's partitions
      override def getPartitions: Array[Partition] = {
        // create the cross product split
        val array = new Array[Partition](rdd1.partitions.length * rdd2.partitions.length)
        for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
          val idx = s1.index * numPartitionsInRdd2 + s2.index
          array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
        }
        array
      }
     
      // preferredLocations is derived from the preferredLocations of rdd1 and rdd2
      override def getPreferredLocations(split: Partition): Seq[String] = {
        val currSplit = split.asInstanceOf[CartesianPartition]
        (rdd1.preferredLocations(currSplit.s1) ++ rdd2.preferredLocations(currSplit.s2)).distinct
      }
     
      // Compute the result directly from the iterators of rdd1 and rdd2
      override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
        val currSplit = split.asInstanceOf[CartesianPartition]
        for (x <- rdd1.iterator(currSplit.s1, context);
             y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
      }
     
      // Declare that this RDD depends on both rdd1 and rdd2
      override def getDependencies: Seq[Dependency[_]] = List(
        new NarrowDependency(rdd1) {
          def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
        },
        new NarrowDependency(rdd2) {
          def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
        }
      )
     
      override def clearDependencies() {
        super.clearDependencies()
        rdd1 = null
        rdd2 = null
      }
    }

    Also fairly intuitive: each output partition pairs one partition of rdd1 with one partition of rdd2, so the result has rdd1.partitions.length * rdd2.partitions.length partitions.
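
    A small usage sketch (toy data) showing that the partition count follows directly from getPartitions above:

    val a = sc.parallelize(Seq(1, 2), numSlices = 2)
    val b = sc.parallelize(Seq("x", "y", "z"), numSlices = 3)

    val prod = a.cartesian(b)
    println(prod.partitions.length)   // 6 = 2 * 3, one partition per (s1, s2) pair
    println(prod.collect().toSeq)     // all (Int, String) pairs: (1,x), (1,y), (1,z), (2,x), ...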

    distinct

    /**
     * Return a new RDD containing the distinct elements in this RDD.
     */
    def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
      map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
    }

    distinct is implemented on top of reduceByKey, which makes sense: mapping each element x to the pair (x, null) turns duplicate elements into duplicate keys, reduceByKey then collapses each key to a single pair, and the final map strips the null back off.
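
    Traced on a tiny example (made-up data):

    val xs = sc.parallelize(Seq(1, 1, 2, 3, 3, 3))

    // map(x => (x, null))      => (1,null), (1,null), (2,null), (3,null), (3,null), (3,null)
    // reduceByKey((x, y) => x) => (1,null), (2,null), (3,null)
    // map(_._1)                => 1, 2, 3
    println(xs.distinct().collect().toSeq)   // Seq(1, 2, 3), in some order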

    groupBy

    def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
      groupBy[K](f, defaultPartitioner(this))
    }
     
    def groupBy[K](f: T => K, numPartitions: Int)
                  (implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
      groupBy(f, new HashPartitioner(numPartitions))
    }
     
    def groupBy[K](f: T => K, p: Partitioner)
                  (implicit kt: ClassTag[K], ord: Ordering[K] = null) : RDD[(K, Iterable[T])] = withScope {
      val cleanF = sc.clean(f)
      // Key each record with the supplied f, then delegate to groupByKey
      this.map(t => (cleanF(t), t)).groupByKey(p)
    }
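
    A short usage sketch (toy data): grouping integers by their remainder mod 3 keys every element with cleanF and then hands off to groupByKey:

    val nums = sc.parallelize(1 to 9)
    val grouped = nums.groupBy(_ % 3)   // RDD[(Int, Iterable[Int])]

    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.toList}") }
    // 0 -> List(3, 6, 9)
    // 1 -> List(1, 4, 7)
    // 2 -> List(2, 5, 8)      (group and element order may differ)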

    Summary

    At this point the overall picture should be clear: an RDD Transformation constructs a new RDD with the original RDD as its parent, so repeatedly applying Transformations produces a chain of RDD operations, but the whole pipeline is deferred until an RDD Action runs; the Action then launches the entire pipeline by calling SparkContext#runJob.
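
    Putting it all together (a usage sketch; the input path is just an example), the lineage built by a chain of Transformations is visible on the final RDD, and only the Action at the end starts a job:

    val lengths = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
                    .flatMap(_.split("\\s+"))
                    .filter(_.nonEmpty)
                    .map(_.length)                       // each step above just adds another RDD to the chain

    println(lengths.toDebugString)   // prints the RDD lineage; still no job has run
    println(lengths.count())         // Action: SparkContext#runJob fires here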

  • Original article: https://www.cnblogs.com/qq852667211/p/5096890.html