• The relationship between batchDuration, slideDuration, and windowDuration in Spark Streaming


    batchDuration: the interval at which Spark Streaming attempts to submit a job. Note that this is only an attempt; a tick does not always produce a job. The relevant code is:

      /** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
      private[streaming] def isTimeValid(time: Time): Boolean = {
        if (!isInitialized) {
          throw new SparkException (this + " has not been initialized")
        } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {
          logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
            " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
          false
        } else {
          logDebug("Time " + time + " is valid")
          true
        }
      }

    Suppose slideDuration is N times batchDuration. Then the first N-1 attempts fail to create a job.

    Only the Nth attempt actually submits one.

    By default, batchDuration and slideDuration are equal, so every attempt succeeds.
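
    To make this concrete, here is a minimal sketch (assuming spark-streaming is on the classpath; the Time and Duration arithmetic is Spark's own API, but the concrete values are made up) of how the check plays out when slideDuration is three times batchDuration:

      import org.apache.spark.streaming.{Seconds, Time}

      object SlideCheck extends App {
        val zeroTime = Time(0L)    // when the StreamingContext started
        val batch    = Seconds(2)  // batchDuration: an attempt every 2s
        val slide    = Seconds(6)  // slideDuration: N = 3

        // The scheduler ticks every batchDuration; isTimeValid only accepts
        // times whose offset from zeroTime is a multiple of slideDuration.
        for (i <- 1 to 6) {
          val t = zeroTime + batch * i
          println(s"attempt $i at $t -> valid = ${(t - zeroTime).isMultipleOf(slide)}")
        }
        // Only attempts 3 and 6 (at 6s and 12s) print valid = true.
      }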

    InputDStream defines slideDuration as the graph's batchDuration:
      override def slideDuration: Duration = {
        if (ssc == null) throw new Exception("ssc is null")
        if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")
        ssc.graph.batchDuration
      }
    
    and MappedDStream simply inherits it from its parent:
      override def slideDuration: Duration = parent.slideDuration
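
    So in a plain chain with no windows, every DStream reports the same slideDuration. A quick sketch (the socket source and port are placeholders):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object DefaultSlide extends App {
        val conf   = new SparkConf().setAppName("DefaultSlide").setMaster("local[2]")
        val ssc    = new StreamingContext(conf, Seconds(5))   // batchDuration = 5s
        val mapped = ssc.socketTextStream("localhost", 9999).map(_.length)
        println(mapped.slideDuration) // 5000 ms: inherited all the way from the input
      }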

    There is one exception, though: once a window enters the picture, things change.

      def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
        new WindowedDStream(this, windowDuration, slideDuration)
      }
    
      /**
       * Return a new DStream in which each RDD has a single element generated by reducing all
       * elements in a sliding window over this DStream.
       * @param reduceFunc associative reduce function
       * @param windowDuration width of the window; must be a multiple of this DStream's
       *                       batching interval
       * @param slideDuration  sliding interval of the window (i.e., the interval after which
       *                       the new DStream will generate RDDs); must be a multiple of this
       *                       DStream's batching interval
       */
      def reduceByWindow(
          reduceFunc: (T, T) => T,
          windowDuration: Duration,
          slideDuration: Duration
        ): DStream[T] = ssc.withScope {
        this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
      }
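
    As a usage sketch (host and port are placeholders): with a 5-second batch interval, the stream below is reduced over a 30-second window that slides every 10 seconds, so the windowed DStream's slideDuration becomes 10s while everything upstream stays at 5s:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object WindowUsage extends App {
        val conf = new SparkConf().setAppName("WindowUsage").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))      // batchDuration = 5s
        ssc.socketTextStream("localhost", 9999)
          .map(_.length)                                       // slideDuration still 5s
          .reduceByWindow(_ + _, Seconds(30), Seconds(10))     // slideDuration now 10s
          .print()
        ssc.start()
        ssc.awaitTermination()
      }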

    As you can see, any window-based method lets you pass a custom slideDuration, which can be any multiple of batchDuration (WindowedDStream enforces with require checks that both windowDuration and slideDuration are multiples of the parent's slideDuration). The change only affects DStreams downstream of the window; upstream DStreams keep a slideDuration equal to batchDuration.

    So how does the job actually get executed?

    Every DStream implements a compute method:

     override def compute(validTime: Time): Option[RDD[T]]

    For example, MappedDStream implements it as:

      override def compute(validTime: Time): Option[RDD[U]] = {
        parent.getOrCompute(validTime).map(_.map[U](mapFunc))
      }

    It is straightforward: call the parent DStream's getOrCompute, then apply the map function to the result. This recurses up the chain, level by level, until a DStream with no parent is reached.
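
    A simplified, self-contained model of that recursion (these Mini* types are invented for illustration; they are not Spark classes):

      // Each node asks its parent for the data at time t, then transforms it.
      trait MiniDStream[T] {
        def compute(t: Long): Option[Seq[T]]
      }

      // The source has no parent: it either has data for time t or it does not.
      class MiniSource(data: Map[Long, Seq[Int]]) extends MiniDStream[Int] {
        def compute(t: Long): Option[Seq[Int]] = data.get(t)
      }

      // Mirrors MappedDStream.compute: parent.getOrCompute(validTime).map(_.map(mapFunc))
      class MiniMapped[T, U](parent: MiniDStream[T], f: T => U) extends MiniDStream[U] {
        def compute(t: Long): Option[Seq[U]] = parent.compute(t).map(_.map(f))
      }

      object ChainDemo extends App {
        val source = new MiniSource(Map(1000L -> Seq(1, 2, 3)))
        val mapped = new MiniMapped[Int, Int](source, _ * 10)
        println(mapped.compute(1000L)) // Some(List(10, 20, 30))
        println(mapped.compute(2000L)) // None: the source has no data at that time
      }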

    We know that slideDuration only gets changed in WindowedDStream, so how does it implement compute?

      override def compute(validTime: Time): Option[RDD[T]] = {
        val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
        val rddsInWindow = parent.slice(currentWindow)
        val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
          logDebug("Using partition aware union for windowing at " + validTime)
          new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
        } else {
          logDebug("Using normal union for windowing at " + validTime)
          new UnionRDD(ssc.sc, rddsInWindow)
        }
        Some(windowRDD)
      }

    As we can see, WindowedDStream first works out the interval the window covers:

    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)

    and then calls the parent DStream's slice method:

    val rddsInWindow = parent.slice(currentWindow)

    A reminder: at this point we are back in the parent DStream, where the slideDuration in effect is once again equal to batchDuration; the custom slide belongs only to the WindowedDStream itself.
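
    For example (values chosen purely for illustration): with a 5-second parent slide, a 30-second window, and validTime = 30s, the interval works out to [5s, 30s]:

      import org.apache.spark.streaming.{Seconds, Time}

      object WindowBounds extends App {
        val validTime      = Time(30000)  // 30s after zeroTime
        val windowDuration = Seconds(30)
        val parentSlide    = Seconds(5)   // parent's slideDuration = batchDuration

        // Mirrors: validTime - windowDuration + parent.slideDuration
        val begin = validTime - windowDuration + parentSlide
        println(s"window = [$begin, $validTime]") // window = [5000 ms, 30000 ms]
      }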

    The implementation of slice is:

      /**
       * Return all the RDDs between 'fromTime' to 'toTime' (both included)
       */
      def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {
        if (!isInitialized) {
          throw new SparkException(this + " has not been initialized")
        }
    
        val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {
          toTime
        } else {
          logWarning("toTime (" + toTime + ") is not a multiple of slideDuration ("
            + slideDuration + ")")
          toTime.floor(slideDuration, zeroTime)
        }

        val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {
          fromTime
        } else {
          logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
            + slideDuration + ")")
          fromTime.floor(slideDuration, zeroTime)
        }
    
        logInfo("Slicing from " + fromTime + " to " + toTime +
          " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
    
        alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
          if (time >= zeroTime) getOrCompute(time) else None
        })
      }

    We only need to look at the last part:

     alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
          if (time >= zeroTime) getOrCompute(time) else None
        })

    The times in the window's range are stepped through at slideDuration (i.e. batchDuration) intervals, producing one batch to compute per step.
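
    A sketch of that enumeration, using Spark's Time.to (the concrete values are assumed): with zeroTime = 0 and a 5-second slide, slicing [5s, 30s] yields six batch times:

      import org.apache.spark.streaming.{Seconds, Time}

      object SliceTimes extends App {
        val fromTime = Time(5000)
        val toTime   = Time(30000)
        val slide    = Seconds(5)
        // Mirrors: alignedFromTime.to(alignedToTime, slideDuration)
        println(fromTime.to(toTime, slide)) // 5000 ms, 10000 ms, ..., 30000 ms
      }

    Each of those times then goes through getOrCompute, which re-runs the isTimeValid check we saw at the beginning: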

      /** Checks whether the 'time' is valid wrt slideDuration for generating RDD */
      private[streaming] def isTimeValid(time: Time): Boolean = {
        if (!isInitialized) {
          throw new SparkException (this + " has not been initialized")
        } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {
          logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
            " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
          false
        } else {
          logDebug("Time " + time + " is valid")
          true
        }
      }

    This time the check returns true for every generated time, because each one is a multiple of the parent's slideDuration.

    Once all the per-batch RDDs have been computed, WindowedDStream combines the results:

        val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
          logDebug("Using partition aware union for windowing at " + validTime)
          new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
        } else {
          logDebug("Using normal union for windowing at " + validTime)
          new UnionRDD(ssc.sc, rddsInWindow)
        }
        Some(windowRDD)

    In the end we get the RDD we want: a partition-aware union when all the RDDs in the window share the same partitioner, and a plain UnionRDD otherwise.

• Original article: https://www.cnblogs.com/luckuan/p/5217585.html