• Spark Streaming window operations


    For a long time I didn't really understand window operations: Spark Streaming already computes in small batches, so what else would a window be for?

    The simplest scenario for understanding windows is computing, every M units of time, the trending searches over the last N units of time. When M = N, as said above, a window buys you little; once M != N, the value of window computation becomes apparent.

    Windowed computation in Storm was a real hassle; Spark Streaming makes it much simpler.

    Borrowing the figure and the example from the official documentation:

    Put simply: every 10 seconds, count the words received in the last 30 seconds.

    Two parameters:
    window length - The duration of the window (3 in the figure). -> N
    sliding interval - The interval at which the window operation is performed (2 in the figure). -> M
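
    As a quick orientation, here is a minimal sketch of those two parameters in use, in the style of the official word-count example (hypothetical socket source and local master; the actual Kafka-based job is further below):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
        val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

        // hypothetical line source; any DStream of text lines behaves the same way
        val lines = ssc.socketTextStream("localhost", 9999)
        val words = lines.flatMap(_.split(" ")).map((_, 1))

        // every 10 s (sliding interval, M) emit the counts over the last 30 s (window length, N);
        // both durations must be multiples of the batch interval
        val counts = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }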

    Input for each successive 10-second batch:

    1:
    sdf sdfsd sdf

    2:
    sdf sdfsd sdf

    3:
    sdf sdfsd sdf
    sdf sdfsd sdf 234

    Expected results:

    1:
    ----------------------------------------
    sdf : 2
    sdfsd : 1
    ----------------------------------------

    2:
    ----------------------------------------
    sdf : 4
    sdfsd : 2
    ----------------------------------------

    3:
    ----------------------------------------
    sdf : 8
    sdfsd : 4
    234 : 1
    ----------------------------------------

    4:
    ----------------------------------------
    sdf : 6
    sdfsd : 3
    234 : 1
    ----------------------------------------

    5:
    ----------------------------------------
    sdf : 4
    sdfsd : 2
    234 : 1
    ----------------------------------------

    6:
    ----------------------------------------

    The code is as follows:

    // Imports assumed for this snippet (Spark 1.x direct Kafka 0.8 API); note that KafkaCluster
    // is private[spark], so in practice a local copy of that class is typically used.
    import kafka.common.{OffsetAndMetadata, TopicAndPartition}
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}

    // Positional arguments with defaults
    var maxRatePerPartition = "3700"
    if (args.length > 1) maxRatePerPartition = args(0)
    var master = "local"
    if (args.length > 2) master = args(1)
    var duration = 10
    if (args.length > 3) duration = args(2).toInt
    var group = "brokers"
    if (args.length > 4) group = args(3)
    var topic = "test"
    if (args.length > 5) topic = args(4)
    var brokerlist = "master:9092"
    if (args.length > 6) brokerlist = args(5)

    val sparkConf = new SparkConf()
      .setMaster(master)
      .setAppName("window")
      .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition) // per-partition read rate limit

    val ssc = new StreamingContext(sparkConf, Seconds(duration))
    val topicsSet = topic.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerlist,
      "group.id" -> group,
      "enable.auto.commit" -> "false")
    val kc = new KafkaCluster(kafkaParams)
    val topicAndPartition2 = kc.getPartitions(topicsSet)
    var topicAndPartition: Set[TopicAndPartition] = Set[TopicAndPartition]()

    var fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long]()

    var bool = true

    // Resolve the starting offsets: prefer the consumer group's committed offsets,
    // otherwise fall back to the latest leader offsets (or 0 if those are unavailable).
    if (topicAndPartition2.isRight) topicAndPartition = topicAndPartition2.right.get
    else {
      val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
      if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
      else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
      bool = false
    }

    if (bool) {
      val temp = kc.getConsumerOffsets(group, topicAndPartition)
      if (temp.isRight) {
        fromOffsets = temp.right.get
      } else {
        val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
        if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
        else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
      }
    }

    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)

    // Capture the offset ranges of each batch on the driver
    var offsetsList = Array[OffsetRange]()
    val msgContent = messages.transform(x => {
      offsetsList = x.asInstanceOf[HasOffsetRanges].offsetRanges
      x
    })

    val data = msgContent.flatMap(x => x._2.split(" "))

    // Every 10 seconds, reduce the (word, 1) pairs over the last 30 seconds
    val data2 = data.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))

    data2.foreachRDD(x => {
      x.foreach(x => {
        println(x._1 + " = " + x._2)
      })
      // Commit offsets after the batch has been processed; doing this inside foreachRDD
      // ensures offsetsList has actually been populated (a one-off loop before ssc.start()
      // would run before any batch exists and commit nothing).
      for (offsets <- offsetsList) {
        val topicAndPartition = TopicAndPartition(topic, offsets.partition)
        val o = kc.setConsumerOffsetMetadata(group, Map[TopicAndPartition, OffsetAndMetadata](topicAndPartition -> OffsetAndMetadata(offsets.untilOffset)))
        if (o.isLeft) {
          println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
        }
      }
    })

    ssc.start()
    ssc.awaitTermination()
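
    The test input has to be pushed into the Kafka topic somehow; a minimal producer sketch along these lines would do (it assumes the Kafka client library is on the classpath and uses the topic/broker defaults from above, test and master:9092):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object TestProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "master:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // send one line per 10-second batch, matching the input listed above
        producer.send(new ProducerRecord[String, String]("test", "sdf sdfsd sdf"))
        producer.close()
      }
    }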

    The output matches the expected results.

    reduceByKeyAndWindow comes in two overloads. Both produce the same result; the difference is the invReduceFunc. Overload 1 recomputes the whole window each time; overload 2 builds the new window from the previous one by adding the new data and subtracting the old (union plus difference), which is supposed to be more efficient.

    Overload 1
    reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, numPartitions)
    reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
    Overload 2
    reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, numPartitions, filterFunc)
    reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(30), Seconds(10))
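
    One practical difference worth noting (it shows up as mustCheckpoint = true in the ReducedWindowedDStream source quoted below): the overload that takes an invReduceFunc requires checkpointing to be enabled, roughly like this (the checkpoint path is just an example):

    ssc.checkpoint("/tmp/window-checkpoint")   // required by the invReduceFunc overload

    val counts = data.map(x => (x, 1)).reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // reduce: add counts entering the window
      (a: Int, b: Int) => a - b,   // inverse reduce: subtract counts leaving the window
      Seconds(30), Seconds(10))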

    Again, take the figure above as the example.

    According to zhouzhihubeyond's blog, the implementation works like this:

    Overload 1 (full recomputation):
    window3 : time1 + time2 + time3
    window5 : time3 + time4 + time5

    Overload 2 (incremental):
    window3 : (time1 + time2) + time3, then checkpoint -> (time1 + time2) | window3
    window5 : window3 + time4 + time5 - (time1 + time2)   only time4 + time5 need computing, not time3

    But this explanation has a problem: to make window7 easy to compute, time3 + time4 would have to be computed and cached at window5, so time3 still gets computed; and the amount that must be cached (for window3: (time1 + time2) plus window3 itself = 2*time1 + 2*time2 + time3 = 5/3 * window3, assuming equal-sized slices) is 5/3 of the window. That would actually be more complicated than overload 1, and use more resources.

    First ask whether that is really the case, then ask why.

    Time to read the source:

    package org.apache.spark.streaming.dstream

    import scala.reflect.ClassTag

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.Duration

    private[streaming]
    class WindowedDStream[T: ClassTag](
        parent: DStream[T],
        _windowDuration: Duration,
        _slideDuration: Duration)
      extends DStream[T](parent.ssc) {

      if (!_windowDuration.isMultipleOf(parent.slideDuration)) {
        throw new Exception("The window duration of windowed DStream (" + _windowDuration + ") " +
        "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
      }

      if (!_slideDuration.isMultipleOf(parent.slideDuration)) {
        throw new Exception("The slide duration of windowed DStream (" + _slideDuration + ") " +
        "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
      }

      // Persist parent level by default, as those RDDs are going to be obviously reused.
      parent.persist(StorageLevel.MEMORY_ONLY_SER)

      def windowDuration: Duration = _windowDuration

      override def dependencies: List[DStream[_]] = List(parent)

      override def slideDuration: Duration = _slideDuration

      override def parentRememberDuration: Duration = rememberDuration + windowDuration

      override def persist(level: StorageLevel): DStream[T] = {
        // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
        // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
        // Instead control the persistence of the parent DStream.
        parent.persist(level)
        this
      }

      override def compute(validTime: Time): Option[RDD[T]] = {
        val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
        val rddsInWindow = parent.slice(currentWindow)
        Some(ssc.sc.union(rddsInWindow))
      }
    }
    Key source for overload 1: WindowedDStream
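
    How does the no-inverse overload end up in WindowedDStream? Roughly (this is a paraphrase of PairDStreamFunctions, not a verbatim quote), it reduces each batch, windows the reduced batches, and re-reduces the union:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // approximately what reduceByKeyAndWindow(reduceFunc, Seconds(30), Seconds(10)) expands to
    def fullWindowCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] = {
      pairs
        .reduceByKey(_ + _)                  // reduce within each batch
        .window(Seconds(30), Seconds(10))    // union the batch RDDs in the window (a WindowedDStream)
        .reduceByKey(_ + _)                  // re-reduce across the unioned batches
    }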
    package org.apache.spark.streaming.dstream
    
    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag
    
    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.{CoGroupedRDD, RDD}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Duration, Interval, Time}
    
    private[streaming]
    class ReducedWindowedDStream[K: ClassTag, V: ClassTag](
        parent: DStream[(K, V)],
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        filterFunc: Option[((K, V)) => Boolean],
        _windowDuration: Duration,
        _slideDuration: Duration,
        partitioner: Partitioner
      ) extends DStream[(K, V)](parent.ssc) {
    
      require(_windowDuration.isMultipleOf(parent.slideDuration),
        "The window duration of ReducedWindowedDStream (" + _windowDuration + ") " +
          "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
      )
    
      require(_slideDuration.isMultipleOf(parent.slideDuration),
        "The slide duration of ReducedWindowedDStream (" + _slideDuration + ") " +
          "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
      )
    
      // Reduce each batch of data using reduceByKey which will be further reduced by window
      // by ReducedWindowedDStream
      private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)
    
      // Persist RDDs to memory by default as these RDDs are going to be reused.
      super.persist(StorageLevel.MEMORY_ONLY_SER)
      reducedStream.persist(StorageLevel.MEMORY_ONLY_SER)
    
      def windowDuration: Duration = _windowDuration
    
      override def dependencies: List[DStream[_]] = List(reducedStream)
    
      override def slideDuration: Duration = _slideDuration
    
      override val mustCheckpoint = true
    
      override def parentRememberDuration: Duration = rememberDuration + windowDuration
    
      override def persist(storageLevel: StorageLevel): DStream[(K, V)] = {
        super.persist(storageLevel)
        reducedStream.persist(storageLevel)
        this
      }
    
      override def checkpoint(interval: Duration): DStream[(K, V)] = {
        super.checkpoint(interval)
        // reducedStream.checkpoint(interval)
        this
      }
    
      override def compute(validTime: Time): Option[RDD[(K, V)]] = {
        val reduceF = reduceFunc
        val invReduceF = invReduceFunc
    
        val currentTime = validTime
        val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
          currentTime)
        val previousWindow = currentWindow - slideDuration
    
        logDebug("Window time = " + windowDuration)
        logDebug("Slide time = " + slideDuration)
        logDebug("Zero time = " + zeroTime)
        logDebug("Current window = " + currentWindow)
        logDebug("Previous window = " + previousWindow)
    
        //  _____________________________
        // |  previous window   _________|___________________
        // |___________________|       current window        |  --------------> Time
        //                     |_____________________________|
        //
        // |________ _________|          |________ _________|
        //          |                             |
        //          V                             V
        //       old RDDs                     new RDDs
        //
    
        // Get the RDDs of the reduced values in "old time steps"
        val oldRDDs =
          reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
        logDebug("# old RDDs = " + oldRDDs.size)
    
        // Get the RDDs of the reduced values in "new time steps"
        val newRDDs =
          reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
        logDebug("# new RDDs = " + newRDDs.size)
    
        // Get the RDD of the reduced value of the previous window
        val previousWindowRDD =
          getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))
    
        // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
        val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs
    
        // Cogroup the reduced RDDs and merge the reduced values
        val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
          partitioner)
        // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _
    
        val numOldValues = oldRDDs.size
        val numNewValues = newRDDs.size
    
        val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
          if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
            throw new Exception("Unexpected number of sequences of reduced values")
          }
          // Getting reduced values "old time steps" that will be removed from current window
          val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
          // Getting reduced values "new time steps"
          val newValues =
            (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)
    
          if (arrayOfValues(0).isEmpty) {
            // If previous window's reduce value does not exist, then at least new values should exist
            if (newValues.isEmpty) {
              throw new Exception("Neither previous window has value for key, nor new values found. " +
                "Are you sure your key class hashes consistently?")
            }
            // Reduce the new values
            newValues.reduce(reduceF) // return
          } else {
            // Get the previous window's reduced value
            var tempValue = arrayOfValues(0).head
            // If old values exists, then inverse reduce then from previous value
            if (!oldValues.isEmpty) {
              tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
            }
            // If new values exists, then reduce them with previous value
            if (!newValues.isEmpty) {
              tempValue = reduceF(tempValue, newValues.reduce(reduceF))
            }
            tempValue // return
          }
        }
    
        val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
          .mapValues(mergeValues)
    
        if (filterFunc.isDefined) {
          Some(mergedValuesRDD.filter(filterFunc.get))
        } else {
          Some(mergedValuesRDD)
        }
      }
    }
     
    Key source for overload 2: ReducedWindowedDStream

    As you can see, the key difference between them lies in the compute function.

    Overload 1's compute is trivial: it derives the current window's interval from the two durations, slices the parent DStream for the RDDs falling inside it, and unions them.

    override def compute(validTime: Time): Option[RDD[T]] = {
      val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
      val rddsInWindow = parent.slice(currentWindow)
      Some(ssc.sc.union(rddsInWindow))
    }
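
    Plugging in the numbers from the example (10 s batches, 30 s window), for the batch ending at t = 60 s:

    validTime = 60 s, windowDuration = 30 s, parent.slideDuration = 10 s
    currentWindow = [60 - 30 + 10, 60] = [40 s, 60 s]

    parent.slice then returns the three batch RDDs ending at 40 s, 50 s and 60 s, and ssc.sc.union simply concatenates them into one RDD covering the window.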

    Overload 2's compute is considerably more involved:

    override def compute(validTime: Time): Option[RDD[(K, V)]] = {
      val reduceF = reduceFunc
      val invReduceF = invReduceFunc

      val currentTime = validTime
      val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
        currentTime)
      val previousWindow = currentWindow - slideDuration

      logDebug("Window time = " + windowDuration)
      logDebug("Slide time = " + slideDuration)
      logDebug("Zero time = " + zeroTime)
      logDebug("Current window = " + currentWindow)
      logDebug("Previous window = " + previousWindow)

      //  _____________________________
      // |  previous window   _________|___________________
      // |___________________|       current window        |  --------------> Time
      //                     |_____________________________|
      //
      // |________ _________|          |________ _________|
      //          |                             |
      //          V                             V
      //       old RDDs                     new RDDs
      //

      // Get the RDDs of the reduced values in "old time steps"
      val oldRDDs =
        reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
      logDebug("# old RDDs = " + oldRDDs.size)

      // Get the RDDs of the reduced values in "new time steps"
      val newRDDs =
        reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
      logDebug("# new RDDs = " + newRDDs.size)

      // Get the RDD of the reduced value of the previous window
      val previousWindowRDD =
        getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

      // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
      val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

      // Cogroup the reduced RDDs and merge the reduced values
      val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
        partitioner)
      // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

      val numOldValues = oldRDDs.size
      val numNewValues = newRDDs.size

      val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
        if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
          throw new Exception("Unexpected number of sequences of reduced values")
        }
        // Getting reduced values "old time steps" that will be removed from current window
        val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
        // Getting reduced values "new time steps"
        val newValues =
          (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

        if (arrayOfValues(0).isEmpty) {
          // If previous window's reduce value does not exist, then at least new values should exist
          if (newValues.isEmpty) {
            throw new Exception("Neither previous window has value for key, nor new values found. " +
              "Are you sure your key class hashes consistently?")
          }
          // Reduce the new values
          newValues.reduce(reduceF) // return
        } else {
          // Get the previous window's reduced value
          var tempValue = arrayOfValues(0).head
          // If old values exists, then inverse reduce then from previous value
          if (!oldValues.isEmpty) {
            tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
          }
          // If new values exists, then reduce them with previous value
          if (!newValues.isEmpty) {
            tempValue = reduceF(tempValue, newValues.reduce(reduceF))
          }
          tempValue // return
        }
      }

      val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
        .mapValues(mergeValues)

      if (filterFunc.isDefined) {
        Some(mergedValuesRDD.filter(filterFunc.get))
      } else {
        Some(mergedValuesRDD)
      }
    }

    As shown below, I have added a keyRDDs label to the official diagram:

        //  _____________________________
        // |  previous window   _________|___________________
        // |___________________|       current window        |  --------------> Time
        //                     |_____________________________|
        //
        // |________ _________|     |     |________ _________|
        //          |               |              |
        //          V               V              V
        //       old RDDs        keyRDDs         new RDDs
        //

    The code boils down to the following operations:

    previous window - old RDDs = key RDDs
    key RDDs + new RDDs = current window

    // Get the previous window's reduced value
    var tempValue = arrayOfValues(0).head
    // If old values exist, inverse-reduce them away from the previous window's value
    if (!oldValues.isEmpty) {
      tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))   // user-supplied inverse function: subtract
    }
    // If new values exist, reduce them into the previous window's value
    if (!newValues.isEmpty) {
      tempValue = reduceF(tempValue, newValues.reduce(reduceF))      // user-supplied function: add
    }
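
    Traced against the word counts from the example (key sdf, 10 s batches, 30 s window), the step from window 3 to window 4 would work out as:

    previous window (batches 1-3) : sdf -> 8
    old RDDs (batch 1, leaving)   : sdf -> 2
    new RDDs (batch 4)            : empty
    tempValue = invReduceF(8, 2) = 8 - 2 = 6   // matches the expected "sdf : 6" for batch 4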

    The previous window's result is then taken from the DStream's cache (or recomputed if it is missing):

    // Get the RDD of the reduced value of the previous window
    val previousWindowRDD =
      getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

    So, after the first window, each computation only reduces the old RDDs and the new RDDs against the previous window's cached result. It still looks fairly involved; is it really faster? That remains to be tested.
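
    A simple way to check is to register both variants over the same pair stream (reusing data and ssc from the job above; the checkpoint path is a placeholder) and compare the per-batch processing times in the Streaming UI:

    ssc.checkpoint("/tmp/window-checkpoint")   // the incremental variant requires checkpointing

    val pairs = data.map(x => (x, 1))

    val full = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    val incremental = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(30), Seconds(10))

    full.foreachRDD(rdd => println(s"full: ${rdd.count()} distinct words"))
    incremental.foreachRDD(rdd => println(s"incremental: ${rdd.count()} distinct words"))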

  • Original post: https://www.cnblogs.com/eryuan/p/6958009.html