• The relationship between batchDuration, slideDuration, and windowDuration in Spark Streaming


    batchDuration: the interval at which Spark Streaming attempts to submit a job. Note that this is only an attempt. The relevant code:

      /** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
      private[streaming] def isTimeValid(time: Time): Boolean = {
        if (!isInitialized) {
          throw new SparkException (this + " has not been initialized")
        } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {
          logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
            " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
          false
        } else {
          logDebug("Time " + time + " is valid")
          true
        }
      }

    Suppose slideDuration is N times batchDuration. Then the first N-1 attempts fail to create a job.

    Only on the Nth attempt is a job actually submitted; a small sketch of the check follows.
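
    A minimal sketch of that check, reusing Spark Streaming's own Time and Duration arithmetic (hypothetical numbers: batch attempts every 5 s, slideDuration = 15 s, so N = 3):

      import org.apache.spark.streaming.{Seconds, Time}

      val zeroTime = Time(0)
      val slide    = Seconds(15)
      // the same test that isTimeValid performs for each attempted batch time
      for (t <- Seq(Time(5000), Time(10000), Time(15000))) {
        val valid = t > zeroTime && (t - zeroTime).isMultipleOf(slide)
        println(s"$t -> $valid")   // 5000 ms -> false, 10000 ms -> false, 15000 ms -> true
      }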

    By default, batchDuration and slideDuration are equal, so every attempt succeeds. The reason is that InputDStream defines slideDuration as the batch duration, and derived DStreams such as MappedDStream simply inherit it from their parent:

    InputDStream
      override def slideDuration: Duration = {
        if (ssc == null) throw new Exception("ssc is null")
        if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")
        ssc.graph.batchDuration
      }
    
    MappedDStream
      override def slideDuration: Duration = parent.slideDuration
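
    As a sketch of this default case (a hypothetical local setup), every DStream in a plain map pipeline therefore ends up with slideDuration == batchDuration:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("durations")
      val ssc  = new StreamingContext(conf, Seconds(5))   // batchDuration = 5 s

      // socketTextStream returns an InputDStream: slideDuration == batchDuration == 5 s
      val lines = ssc.socketTextStream("localhost", 9999)

      // flatMap creates a derived DStream that inherits the parent's slideDuration
      val words = lines.flatMap(_.split(" "))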

    There is one exception, though: once a window is involved, the situation changes.

      def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
        new WindowedDStream(this, windowDuration, slideDuration)
      }
    
      /**
       * Return a new DStream in which each RDD has a single element generated by reducing all
       * elements in a sliding window over this DStream.
       * @param reduceFunc associative reduce function
       * @param windowDuration width of the window; must be a multiple of this DStream's
       *                       batching interval
       * @param slideDuration  sliding interval of the window (i.e., the interval after which
       *                       the new DStream will generate RDDs); must be a multiple of this
       *                       DStream's batching interval
       */
      def reduceByWindow(
          reduceFunc: (T, T) => T,
          windowDuration: Duration,
          slideDuration: Duration
        ): DStream[T] = ssc.withScope {
        this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
      }

    As you can see, methods that take a window let you supply a custom slideDuration, which may be any multiple of batchDuration. The override only affects the DStreams downstream of the window; the upstream DStreams keep a slideDuration equal to batchDuration, as the sketch below illustrates.
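
    Continuing the hypothetical 5 s-batch setup from above, the following creates a WindowedDStream whose slideDuration is 10 s while its parent keeps 5 s:

      // windowDuration = 30 s, slideDuration = 10 s; both are multiples of the 5 s batch interval
      val windowed = words.window(Seconds(30), Seconds(10))
      // words.slideDuration    == Seconds(5)    (unchanged)
      // windowed.slideDuration == Seconds(10)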

    So how does such a job actually get executed?

    Every DStream has a compute method:

     override def compute(validTime: Time): Option[RDD[T]]

    For example, MappedDStream implements it as:

      override def compute(validTime: Time): Option[RDD[U]] = {
        parent.getOrCompute(validTime).map(_.map[U](mapFunc))
      }

    It is simple: call the parent DStream's getOrCompute, then apply the map function to the resulting RDD. The calls recurse level by level until a DStream with no parent is reached, as the toy model below illustrates.
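
    A toy model of that recursive chain (plain Scala, not the actual Spark classes):

      // Each level asks its parent for data, then applies its own transformation.
      trait DStreamLike[T] { def compute(time: Long): Option[List[T]] }

      class Input(data: Long => List[Int]) extends DStreamLike[Int] {
        // the root of the chain produces records directly
        def compute(time: Long): Option[List[Int]] = Some(data(time))
      }

      class Mapped[T, U](parent: DStreamLike[T], f: T => U) extends DStreamLike[U] {
        // mirrors MappedDStream: delegate to the parent, then map the result
        def compute(time: Long): Option[List[U]] = parent.compute(time).map(_.map(f))
      }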

    We know that slideDuration is only overridden in WindowedDStream, so what does its implementation look like?

      override def compute(validTime: Time): Option[RDD[T]] = {
        val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
        val rddsInWindow = parent.slice(currentWindow)
        val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
          logDebug("Using partition aware union for windowing at " + validTime)
          new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
        } else {
          logDebug("Using normal union for windowing at " + validTime)
          new UnionRDD(ssc.sc, rddsInWindow)
        }
        Some(windowRDD)
      }

    As we can see, WindowedDStream first computes the interval covered by this window:

    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)

    and then calls the parent DStream's slice method:

    val rddsInWindow = parent.slice(currentWindow)

    Keep in mind that slice runs on the parent DStream, where slideDuration is back to being equal to batchDuration; the custom slideDuration lives only on the WindowedDStream itself.

    The implementation of slice is:

      /**
       * Return all the RDDs between 'fromTime' to 'toTime' (both included)
       */
      def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {
        if (!isInitialized) {
          throw new SparkException(this + " has not been initialized")
        }
    
        val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {
          toTime
        } else {
          logWarning("toTime (" + toTime + ") is not a multiple of slideDuration ("
              + slideDuration + ")")
          toTime.floor(slideDuration, zeroTime)
        }
    
        val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {
          fromTime
        } else {
          logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
          + slideDuration + ")")
          fromTime.floor(slideDuration, zeroTime)
        }
    
        logInfo("Slicing from " + fromTime + " to " + toTime +
          " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
    
        alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
          if (time >= zeroTime) getOrCompute(time) else None
        })
      }

    We only need to look at the last part:

     alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
          if (time >= zeroTime) getOrCompute(time) else None
        })

    This walks over the times in the window range in steps of slideDuration, i.e. the parent's batchDuration, generating one batch per step to compute (a worked sketch follows). For each of these times, getOrCompute runs the isTimeValid check shown below.
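
    A worked sketch of the enumeration (hypothetical numbers: batchDuration = 5 s, windowDuration = 30 s, validTime = 60 s):

      import org.apache.spark.streaming.{Seconds, Time}

      val validTime = Time(60000)                       // t = 60 s
      val from = validTime - Seconds(30) + Seconds(5)   // t - windowDuration + parent.slideDuration = 35 s
      val times = from.to(validTime, Seconds(5))
      // times == Seq(35000 ms, 40000 ms, 45000 ms, 50000 ms, 55000 ms, 60000 ms) -- six batches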

      /** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
      private[streaming] def isTimeValid(time: Time): Boolean = {
        if (!isInitialized) {
          throw new SparkException (this + " has not been initialized")
        } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {
          logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
            " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
          false
        } else {
          logDebug("Time " + time + " is valid")
          true
        }
      }

    Because every enumerated time is aligned to the parent's slideDuration (which equals batchDuration), this check returns true for each batch.

    Once all the per-batch results have been returned, WindowedDStream combines them:

        val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
          logDebug("Using partition aware union for windowing at " + validTime)
          new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
        } else {
          logDebug("Using normal union for windowing at " + validTime)
          new UnionRDD(ssc.sc, rddsInWindow)
        }
        Some(windowRDD)

    If all the RDDs in the window share the same partitioner, a PartitionerAwareUnionRDD is used; otherwise a plain UnionRDD is built. Either way, this union is the final RDD we want.
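
    Putting it all together, a minimal end-to-end sketch (hypothetical socket source; batch interval 5 s, window 30 s sliding every 10 s):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("window-demo")
      val ssc  = new StreamingContext(conf, Seconds(5))        // batchDuration = 5 s

      val counts = ssc.socketTextStream("localhost", 9999)
        .flatMap(_.split(" "))                                 // slideDuration = 5 s here
        .map((_, 1))
        .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) // slideDuration = 10 s from here on

      counts.print()   // a job covering the last 30 s is submitted every 10 s (N = 2)
      ssc.start()
      ssc.awaitTermination()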

  • Original article: https://www.cnblogs.com/luckuan/p/5217585.html