• spark streaming 2: DStream


    DStream是类似于RDD概念,是对数据的抽象封装。它是一序列的RDD,事实上,它大部分的操作都是对RDD支持的操作的封装,不同的是,每次DStream都要遍历它内部所有的RDD执行这些操作。它可以由StreamingContext通过流数据产生或者其他DStream使用map方法产生(与RDD一样)
    time属性对DStream而言非常重要,DStream里面的RDD就是通过某个时间间隔产生的,而且以产生的时间为索引。所以在访问DStream的某个RDD时,实际上是访问它在某个时间点的RDD。




    /**
    * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
    * sequence of RDDs (of the same type) representing a continuous stream of data
    (see
    * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
    * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
    * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
    * transforming existing DStreams
    using operations such as `map`,
    * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
    * periodically generates a RDD, either from live data or by transforming the RDD generated by a
    * parent DStream.
    *
    * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
    * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
    * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
    * `join`. These operations are automatically available on any DStream of pairs
    * (e.g., DStream[(Int, Int)] through implicit conversions when
    * `org.apache.spark.streaming.StreamingContext._` is imported.
    *
    * DStreams internally is characterized by a few basic properties:
    * - A list of other DStreams that the DStream depends on
    * - A time interval at which the DStream generates an RDD
    * - A function that is used to generate an RDD after each time interval
    */

    abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
    )
    extends Serializable with Logging {
    重要属性:
    // =======================================================================
    // Methods that should be implemented by subclasses of DStream
    // =======================================================================
    /** Time interval after which the DStream generates a RDD */
    def slideDuration: Duration
    /** List of parent DStreams on which this DStream depends on */
    def dependencies: List[DStream[_]]
    /** Method that generates a RDD for the given time */
    def compute (validTime: Time): Option[RDD[T]]
    当前已经产生了的RDD,以产生的时间为索引
    // =======================================================================
    // Methods and fields available on all DStreams
    // =======================================================================

    // RDDs generated, marked as private[streaming] so that testsuites can access it
    @transient
    private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()
    为某个时间点产生一个RDD
    /**
    * Get the RDD corresponding to the given time; either retrieve it from cache
    * or compute-and-cache it.
    */
    private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {


















  • 相关阅读:
    窗口
    DataTemplateSelector
    CompositeCollection
    Drawing
    模板
    集合视图
    绑定
    动画
    【数据结构初学】(java实现篇)——队列(转)
    慕课学习手记!(完成查找书籍小程序~)
  • 原文地址:https://www.cnblogs.com/zwCHAN/p/4274804.html
Copyright © 2020-2023  润新知