Spark Source Code Analysis -- RDD


    For details about RDDs, see the Spark paper; below we look at the source code.
    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
    Represents an immutable, partitioned collection of elements that can be operated on in parallel.

    * Internally, each RDD is characterized by five main properties:
    *  - A list of partitions
    *  - A function for computing each split
    *  - A list of dependencies on other RDDs
    *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

    RDDs fall into the following categories:

    basic(org.apache.spark.rdd.RDD): This class contains the basic operations available on all RDDs, such as `map`, `filter`, and `persist`.

    org.apache.spark.rdd.PairRDDFunctions: contains operations available only on RDDs of key-value pairs, such as `groupByKey` and `join`

    org.apache.spark.rdd.DoubleRDDFunctions: contains operations available only on RDDs of Doubles

    org.apache.spark.rdd.SequenceFileRDDFunctions: contains operations available on RDDs that can be saved as SequenceFiles
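
    A quick usage sketch (assuming an existing SparkContext sc; in this generation of Spark the extra operations on pair and double RDDs are pulled in by the implicit conversions in SparkContext._):

    import org.apache.spark.SparkContext._    // implicit conversions to PairRDDFunctions etc.

    val nums    = sc.parallelize(1 to 10)     // basic RDD[Int]: map, filter, persist, ...
    val pairs   = nums.map(n => (n % 2, n))   // RDD[(Int, Int)]
    val grouped = pairs.groupByKey()          // from PairRDDFunctions
    val doubles = nums.map(_.toDouble)
    val mean    = doubles.mean()              // from DoubleRDDFunctions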

     

    RDD is first of all a generic class: T is the type of the elements it holds, and all data processing is based on Iterator[T].
    It is constructed with a SparkContext and a dependency list (Seq deps).
    From the interfaces RDD exposes, you can roughly tell what an RDD is:
    1. An RDD is a chunk of data, possibly a large one, so it cannot be assumed to fit in a single machine's memory; it therefore has to be split into partitions spread across the memory of the cluster's machines.
    Hence getPartitions, partitioner (how to partition), and getPreferredLocations (how to take data locality into account).

    The definition of Partition is very simple: it only carries an id and holds no data.

    trait Partition extends Serializable {
      /**
       * Get the split's index within its parent RDD
       */
      def index: Int
      // A better default implementation of HashCode
      override def hashCode(): Int = index
    }
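
    As a hypothetical minimal example (not taken from the Spark source), a concrete Partition usually just stores its index plus whatever bookkeeping its RDD needs:

    // Hypothetical sketch: the simplest possible concrete Partition, holding only its index
    private[spark] class SimplePartition(val idx: Int) extends Partition {
      override def index: Int = idx
    }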

    2. RDDs are related to each other: an RDD can turn its parent RDDs' data into its own data via the compute logic, so there is a lineage between RDDs,
    and getDependencies returns all of the dependencies.

    3. An RDD can be persisted; the common case is cache, i.e. StorageLevel.MEMORY_ONLY.

    4. An RDD can be checkpointed to make failover more efficient; when the RDD chain is long, relying purely on replay is inefficient.

    5. RDD.iterator produces the Iterator[T] used to iterate over the actual data.

    6. Various transformations and actions can be applied to an RDD (see the sketch after this list).
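
    A minimal sketch of points 3 and 6 (assuming an existing SparkContext sc; the input path is hypothetical): transformations only build up the lineage, and nothing runs until an action calls runJob.

    val lines   = sc.textFile("hdfs://namenode/path/input.txt") // hypothetical input path
    val lengths = lines.map(_.length).filter(_ > 0)             // transformations: lazy, only record lineage
    lengths.cache()                                             // mark for MEMORY_ONLY persistence (point 3)
    val n = lengths.count()                                     // action: triggers sc.runJob and actually computes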

    abstract class RDD[T: ClassManifest](
        @transient private var sc: SparkContext, // @transient: does not need to be serialized
        @transient private var deps: Seq[Dependency[_]]
      ) extends Serializable with Logging {
      
      /** Auxiliary constructor, used for RDDs with a one-to-one dependency on a single parent --
       * a very common case (filter, map, ...).
       *
       * Construct an RDD with just a one-to-one dependency on one parent */
      def this(@transient oneParent: RDD[_]) =
        this(oneParent.context, List(new OneToOneDependency(oneParent))) // since there is only one parent here, the parent RDD object is passed in directly

      // =======================================================================
      // Methods that should be implemented by subclasses of RDD
      // =======================================================================
      /** Implemented by subclasses to compute a given partition. */
      def compute(split: Partition, context: TaskContext): Iterator[T]
    
      /**
       * Implemented by subclasses to return the set of partitions in this RDD. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       */
      protected def getPartitions: Array[Partition]
    
      /**
       * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       */
      protected def getDependencies: Seq[Dependency[_]] = deps
    
      /** Optionally overridden by subclasses to specify placement preferences. */
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    
      /** Optionally overridden by subclasses to specify how they are partitioned. */
      val partitioner: Option[Partitioner] = None
    
      // =======================================================================
      // Methods and fields available on all RDDs
      // =======================================================================
    
      /** The SparkContext that created this RDD. */
      def sparkContext: SparkContext = sc
    
      /** A unique ID for this RDD (within its SparkContext). */
      val id: Int = sc.newRddId()
    
      /** A friendly name for this RDD */
      var name: String = null
    
      /**
       * Set this RDD's storage level to persist its values across operations after the first time
       * it is computed. This can only be used to assign a new storage level if the RDD does not
       * have a storage level set yet.
       */
      def persist(newLevel: StorageLevel): RDD[T] = {
        // TODO: Handle changes of StorageLevel
        if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
          throw new UnsupportedOperationException(
            "Cannot change storage level of an RDD after it was already assigned a level")
        }
        storageLevel = newLevel
        // Register the RDD with the SparkContext
        sc.persistentRdds(id) = this
        this
      }
    
      /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
      def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY)
    
      /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
      def cache(): RDD[T] = persist()
    
    
      /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
      def getStorageLevel = storageLevel
    
      // Our dependencies and partitions will be gotten by calling subclass's methods below, and will
      // be overwritten when we're checkpointed
      private var dependencies_ : Seq[Dependency[_]] = null  
      @transient private var partitions_ : Array[Partition] = null 
    
      /** An Option holding our checkpoint RDD, if we are checkpointed.
       * A checkpoint writes the RDD out to files on disk to make failover more efficient (replay
       * is also an option, but inefficient for long RDD chains).
       * In the RDD implementation, if a checkpointRDD exists, the data can be read from it
       * directly, without calling compute. */
      private def checkpointRDD: Option[RDD[T]] = checkpointData.flatMap(_.checkpointRDD)
      
      /**
       * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
       * This should ''not'' be called by users directly, but is available for implementors of custom
       * subclasses of RDD.
       */
      /** This is the core of how an RDD accesses its data: a Partition inside an RDD only carries
       * an id, not the actual data.
       * So how is the data obtained? See the storage module: cacheManager.getOrCompute maps the
       * RDD and partition id to the corresponding block and reads the data from it. */
      final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
        if (storageLevel != StorageLevel.NONE) { // a StorageLevel other than NONE means this RDD has been persisted and can be read back directly
          SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
        } else {
          computeOrReadCheckpoint(split, context) // never persisted, so recompute or read from the checkpoint
        }
      }

      // Transformations (return a new RDD)
      // ...... the various transformation interfaces: map, union, ...
      /**
       * Return a new RDD by applying a function to all elements of this RDD.
       */
      def map[U: ClassManifest](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
      // Actions (launch a job to return a value to the user program)
      // ...... the various action interfaces: count, collect, ...
      /**
       * Return the number of elements in the RDD.
       */
      def count(): Long = { // runJob is only actually invoked inside actions, which is why transformations are lazy
        sc.runJob(this, (iter: Iterator[T]) => {
          var result = 0L
          while (iter.hasNext) {
            result += 1L
            iter.next()
          }
          result
        }).sum
      }
      // =======================================================================
      // Other internal methods and fields
      // =======================================================================  
      /** Returns the first parent RDD */
      protected[spark] def firstParent[U: ClassManifest] = {
        dependencies.head.rdd.asInstanceOf[RDD[U]]
      }
      //................
    }
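
    A small usage sketch (assuming an existing SparkContext sc) of how persist and iterator interact: after cache(), the first action computes each partition and stores it via the CacheManager; later actions read the cached blocks instead of recomputing.

    val data = sc.parallelize(1 to 1000000).map(_ * 2)
    data.cache()              // persist(StorageLevel.MEMORY_ONLY); registers the RDD in sc.persistentRdds
    val c1 = data.count()     // iterator() misses the cache, computes the partitions and stores the blocks
    val c2 = data.count()     // iterator() now hits the cache via cacheManager.getOrCompute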

     

    Here we only discuss some basic RDDs; pair RDDs will be covered separately.

    FilteredRDD

    One-to-one Dependency, FilteredRDD

    filter creates a FilteredRDD with the current RDD as the first argument and the function f as the second; the return value is the filtered RDD.

      /**
       * Return a new RDD containing only the elements that satisfy a predicate.
       */
      def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

    In compute, the filter operation is applied to the parent RDD's Iterator[T].

    private[spark] class FilteredRDD[T: ClassManifest]( // filter is a typical one-to-one dependency, so the auxiliary constructor is used
        prev: RDD[T],    // parent RDD
        f: T => Boolean) // f, the predicate function
      extends RDD[T](prev) {
      // firstParent takes the first RDD out of deps, i.e. the prev RDD passed in; with a one-to-one dependency, parent and child share the same partition information
      override def getPartitions: Array[Partition] = firstParent[T].partitions
    
      override val partitioner = prev.partitioner    // Since filter cannot change a partition's keys
    
      override def compute(split: Partition, context: TaskContext) =
        firstParent[T].iterator(split, context).filter(f) // compute is the logic that actually produces the RDD's data
    }
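
    MappedRDD (used by map above) follows the same one-to-one pattern; here is a sketch modeled on the FilteredRDD structure (not a verbatim copy of the source):

    private[spark] class MappedRDD[U: ClassManifest, T: ClassManifest](
        prev: RDD[T],
        f: T => U)
      extends RDD[U](prev) {            // one-to-one dependency via the auxiliary constructor

      // reuse the parent's partitions, exactly as FilteredRDD does
      override def getPartitions: Array[Partition] = firstParent[T].partitions

      // compute simply maps f over the parent's iterator
      override def compute(split: Partition, context: TaskContext) =
        firstParent[T].iterator(split, context).map(f)
    }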

     

    UnionRDD

    Range Dependency, which is still a narrow dependency

    First look at how union is used: the UnionRDD constructor's second argument is an array of the two RDDs, and the return value is the new RDD produced by the union.

      /**
       * Return the union of this RDD and another one. Any identical elements will appear multiple
       * times (use `.distinct()` to eliminate them).
       */
      def union(other: RDD[T]): RDD[T] = new UnionRDD(sc, Array(this, other))

     

    UnionPartition is defined first. The characteristic of union is that it merely merges the partitions of several RDDs into one RDD; the partitions themselves are unchanged, so the parent partitions can be reused directly.

    Three parameters:
    idx, the partition id, i.e. the index within the current UnionRDD
    rdd, the parent RDD
    splitIndex, the id of the parent partition

    private[spark] class UnionPartition[T: ClassManifest](idx: Int, rdd: RDD[T], splitIndex: Int)
      extends Partition {
    
      var split: Partition = rdd.partitions(splitIndex) // take the corresponding partition from the parent RDD and reuse it
    
      def iterator(context: TaskContext) = rdd.iterator(split, context) // the iterator can be reused as well
    
      def preferredLocations() = rdd.preferredLocations(split)
    
      override val index: Int = idx // the partition id is new, since the indices necessarily change once several RDDs are merged
    }

    The definition of UnionRDD:

    class UnionRDD[T: ClassManifest](
        sc: SparkContext,
        @transient var rdds: Seq[RDD[T]])  //parent RDD Seq
      extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
    
      override def getPartitions: Array[Partition] = {
        val array = new Array[Partition](rdds.map(_.partitions.size).sum) // the UnionRDD's partition count is the sum of the partition counts of all parent RDDs
        var pos = 0
        for (rdd <- rdds; split <- rdd.partitions) {
          array(pos) = new UnionPartition(pos, rdd, split.index) // create all the UnionPartitions
          pos += 1
        }
        array
      }
    
      override def getDependencies: Seq[Dependency[_]] = {
        val deps = new ArrayBuffer[Dependency[_]]
        var pos = 0
        for (rdd <- rdds) { 
          deps += new RangeDependency(rdd, 0, pos, rdd.partitions.size) // create a RangeDependency
          pos += rdd.partitions.size // since this is a RangeDependency, pos advances by the size of the whole range
        }
        deps
      }
    
      override def compute(s: Partition, context: TaskContext): Iterator[T] =
        s.asInstanceOf[UnionPartition[T]].iterator(context) // Union's compute is trivial: it just delegates to the parent partition's iterator
    
      override def getPreferredLocations(s: Partition): Seq[String] =
        s.asInstanceOf[UnionPartition[T]].preferredLocations()
    }
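
    A usage sketch (assuming an existing SparkContext sc): union simply concatenates the parents' partitions without a shuffle, so the partition count of the result is the sum of the parents' counts.

    val a = sc.parallelize(1 to 10, 2)    // 2 partitions
    val b = sc.parallelize(11 to 20, 3)   // 3 partitions
    val u = a.union(b)                    // UnionRDD with 2 + 3 = 5 partitions
    println(u.partitions.size)            // 5
    println(u.count())                    // 20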