For a long time I did not really understand window operations: Spark Streaming already processes data in small batches, so why would you need windows on top of that?
The simplest scenario that makes windows easy to grasp is computing, every M seconds, the hot searches of the last N seconds. When M = N the window operation adds little, as noted above; but when M != N the advantage of windowed computation becomes obvious.
Windowed computation in Storm used to be a real hassle; Spark Streaming makes it much simpler.
Borrowing the figure and example from the official documentation:
In short: every 10 seconds, count the words seen in the last 30 seconds.
Two parameters:
window length - The duration of the window (3 in the figure). -> N
sliding interval - The interval at which the window operation is performed (2 in the figure). -> M
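In code, N and M are simply the two Duration arguments of the window operation. A minimal sketch (the socket source, host and port are hypothetical, used only to keep the example small; the actual Kafka-based code is further down):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
val ssc = new StreamingContext(conf, Seconds(10))       // batch interval
val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical text source

val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // N = 30s, M = 10s

wordCounts.print()
ssc.start()
ssc.awaitTermination()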
Input in each successive 10-second batch:
1: sdf sdfsd sdf
2: sdf sdfsd sdf
3: sdf sdfsd sdf sdf sdfsd sdf 234
Expected results:
1:
----------------------------------------
sdf : 2
sdfsd : 1
----------------------------------------
2:
----------------------------------------
sdf : 4
sdfsd : 2
----------------------------------------
3:
----------------------------------------
sdf : 8
sdfsd : 4
234 : 1
----------------------------------------
4:
----------------------------------------
sdf : 6
sdfsd : 3
234 : 1
----------------------------------------
5:
----------------------------------------
sdf : 4
sdfsd : 2
234 : 1
----------------------------------------
6:
----------------------------------------
(empty)
----------------------------------------
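These numbers follow directly from the window definition: with a 30-second window sliding every 10 seconds, each output covers the three most recent batches. Window 3 aggregates batches 1-3 (sdf: 2 + 2 + 4 = 8), window 4 aggregates batches 2-4 with batch 4 empty (sdf: 2 + 4 = 6), window 5 aggregates batches 3-5 (sdf: 4), and by window 6 every non-empty batch has slid out of the window.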
The code:
import kafka.common.{OffsetAndMetadata, TopicAndPartition}
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}

// Note: KafkaCluster is private[spark], so in practice you keep a copy of that class
// in your own project (or place this code under the org.apache.spark package).

// Command-line arguments with defaults
var maxRatePerPartition = "3700"
if (args.length > 1) maxRatePerPartition = args(0)
var master = "local"
if (args.length > 2) master = args(1)
var duration = 10
if (args.length > 3) duration = args(2).toInt
var group = "brokers"
if (args.length > 4) group = args(3)
var topic = "test"
if (args.length > 5) topic = args(4)
var brokerlist = "master:9092"
if (args.length > 6) brokerlist = args(5)

val sparkConf = new SparkConf()
  .setMaster(master)
  .setAppName("window")
  .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition) // per-partition rate limit

// Batch interval: 10 seconds by default
val ssc = new StreamingContext(sparkConf, Seconds(duration))
val topicsSet = topic.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokerlist,
  "group.id" -> group,
  "enable.auto.commit" -> "false")
val kc = new KafkaCluster(kafkaParams)

// Work out the starting offsets: prefer the consumer group's committed offsets,
// otherwise fall back to the latest leader offsets (or 0).
val topicAndPartition2 = kc.getPartitions(topicsSet)
var topicAndPartition: Set[TopicAndPartition] = Set[TopicAndPartition]()
var fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long]()
var bool = true

if (topicAndPartition2.isRight) topicAndPartition = topicAndPartition2.right.get
else {
  val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
  if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
  else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  bool = false
}

if (bool) {
  val temp = kc.getConsumerOffsets(group, topicAndPartition)
  if (temp.isRight) {
    fromOffsets = temp.right.get
  } else {
    val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
    if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
    else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  }
}

val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

// Capture each batch's offset ranges on the driver
var offsetsList = Array[OffsetRange]()
val msgContent = messages.transform(x => {
  offsetsList = x.asInstanceOf[HasOffsetRanges].offsetRanges
  x
})

val data = msgContent.flatMap(x => x._2.split(" "))

// Count words over a 30-second window, sliding every 10 seconds
val data2 = data.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

data2.foreachRDD(x => {
  x.foreach(x => {
    println(x._1 + " = " + x._2)
  })

  // Commit the consumed offsets back to Kafka after each batch's output.
  // This must run inside foreachRDD: at graph-construction time offsetsList is still empty.
  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition(topic, offsets.partition)
    val o = kc.setConsumerOffsetMetadata(group,
      Map[TopicAndPartition, OffsetAndMetadata](topicAndPartition -> OffsetAndMetadata(offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
})

ssc.start()
ssc.awaitTermination()
The output matches the expected results.
reduceByKeyAndWindow comes in two overloads. They produce the same result; the difference is the invReduceFunc parameter. Overload 1 recomputes the whole window every time, while overload 2 updates the previous window incrementally (adding the batches that slide in and subtracting the ones that slide out), which is supposed to be more efficient.
Overload 1
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, numPartitions)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
Overload 2
reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, numPartitions, filterFunc)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(30), Seconds(10))
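A minimal sketch of the two call styles, reusing the ssc and lines from the sketch near the top of the post. Note that the invReduceFunc variant requires a checkpoint directory (the path below is only a placeholder), because ReducedWindowedDStream forces mustCheckpoint = true, as the source further down shows:

// Assuming `ssc` and `lines` as in the earlier sketch.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Overload 1: recompute the whole 30-second window on every 10-second slide.
val full = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,        // reduceFunc
  Seconds(30), Seconds(10))

// Overload 2: update the previous window incrementally; needs checkpointing.
ssc.checkpoint("/tmp/window-checkpoint")   // hypothetical checkpoint directory
val incremental = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,        // reduceFunc: fold in batches entering the window
  (a: Int, b: Int) => a - b,        // invReduceFunc: fold out batches leaving the window
  Seconds(30), Seconds(10))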
Again, take the figure above as the example.
According to zhouzhihubeyond's blog, the principle is as follows:
Overload 1 (full recomputation):
window3 : time1+time2+time3
window5 : time3+time4+time5
Overload 2 (incremental computation):
window3 : (time1+time2)+time3, then checkpoint -> (time1+time2)|window3
window5 : window3 + time4 + time5 - (time1+time2); only time4+time5 needs computing, time3 does not
But this explanation raises a problem: to make the later window7 easy to compute, at window5 you would have to compute time3+time4 and cache it, so time3 still has to be computed anyway. Moreover, the amount that must be cached (for window3, for example: (time1+time2) + (window3 = time1+time2+time3) = 2*time1 + 2*time2 + time3 = 5/3 * window3) is 5/3 of the window itself. Under this reading the scheme is actually more complicated than overload 1, and uses more resources.
First ask whether that is actually true, then ask why.
Let's look at the source:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration

private[streaming]
class WindowedDStream[T: ClassTag](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {

  if (!_windowDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The window duration of windowed DStream (" + _windowDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  if (!_slideDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The slide duration of windowed DStream (" + _slideDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  // Persist parent level by default, as those RDDs are going to be obviously reused.
  parent.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = _slideDuration

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(level: StorageLevel): DStream[T] = {
    // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
    // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
    // Instead control the persistence of the parent DStream.
    parent.persist(level)
    this
  }

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    val rddsInWindow = parent.slice(currentWindow)
    Some(ssc.sc.union(rddsInWindow))
  }
}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{CoGroupedRDD, RDD}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Interval, Time}

private[streaming]
class ReducedWindowedDStream[K: ClassTag, V: ClassTag](
    parent: DStream[(K, V)],
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    filterFunc: Option[((K, V)) => Boolean],
    _windowDuration: Duration,
    _slideDuration: Duration,
    partitioner: Partitioner
  ) extends DStream[(K, V)](parent.ssc) {

  require(_windowDuration.isMultipleOf(parent.slideDuration),
    "The window duration of ReducedWindowedDStream (" + _windowDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  require(_slideDuration.isMultipleOf(parent.slideDuration),
    "The slide duration of ReducedWindowedDStream (" + _slideDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  // Reduce each batch of data using reduceByKey which will be further reduced by window
  // by ReducedWindowedDStream
  private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)

  // Persist RDDs to memory by default as these RDDs are going to be reused.
  super.persist(StorageLevel.MEMORY_ONLY_SER)
  reducedStream.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(reducedStream)

  override def slideDuration: Duration = _slideDuration

  override val mustCheckpoint = true

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(storageLevel: StorageLevel): DStream[(K, V)] = {
    super.persist(storageLevel)
    reducedStream.persist(storageLevel)
    this
  }

  override def checkpoint(interval: Duration): DStream[(K, V)] = {
    super.checkpoint(interval)
    // reducedStream.checkpoint(interval)
    this
  }

  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    val reduceF = reduceFunc
    val invReduceF = invReduceFunc

    val currentTime = validTime
    val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
      currentTime)
    val previousWindow = currentWindow - slideDuration

    logDebug("Window time = " + windowDuration)
    logDebug("Slide time = " + slideDuration)
    logDebug("Zero time = " + zeroTime)
    logDebug("Current window = " + currentWindow)
    logDebug("Previous window = " + previousWindow)

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --------------> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                     new RDDs
    //

    // Get the RDDs of the reduced values in "old time steps"
    val oldRDDs =
      reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
    logDebug("# old RDDs = " + oldRDDs.size)

    // Get the RDDs of the reduced values in "new time steps"
    val newRDDs =
      reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
    logDebug("# new RDDs = " + newRDDs.size)

    // Get the RDD of the reduced value of the previous window
    val previousWindowRDD =
      getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

    // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
    val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

    // Cogroup the reduced RDDs and merge the reduced values
    val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
      partitioner)
    // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

    val numOldValues = oldRDDs.size
    val numNewValues = newRDDs.size

    val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
      if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
        throw new Exception("Unexpected number of sequences of reduced values")
      }
      // Getting reduced values "old time steps" that will be removed from current window
      val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
      // Getting reduced values "new time steps"
      val newValues =
        (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

      if (arrayOfValues(0).isEmpty) {
        // If previous window's reduce value does not exist, then at least new values should exist
        if (newValues.isEmpty) {
          throw new Exception("Neither previous window has value for key, nor new values found. " +
            "Are you sure your key class hashes consistently?")
        }
        // Reduce the new values
        newValues.reduce(reduceF) // return
      } else {
        // Get the previous window's reduced value
        var tempValue = arrayOfValues(0).head
        // If old values exists, then inverse reduce then from previous value
        if (!oldValues.isEmpty) {
          tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
        }
        // If new values exists, then reduce them with previous value
        if (!newValues.isEmpty) {
          tempValue = reduceF(tempValue, newValues.reduce(reduceF))
        }
        tempValue // return
      }
    }

    val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
      .mapValues(mergeValues)

    if (filterFunc.isDefined) {
      Some(mergedValuesRDD.filter(filterFunc.get))
    } else {
      Some(mergedValuesRDD)
    }
  }
}
As you can see, the key difference between them is the compute function.
Overload 1's compute is unremarkable: it derives the current window interval from two points in time, slices the parent DStream over that interval, and unions the resulting RDDs.
override def compute(validTime: Time): Option[RDD[T]] = {
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  Some(ssc.sc.union(rddsInWindow))
}
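To make the interval concrete: with a 10-second batch interval and a 30-second window, at validTime = 30s the computed interval is Interval(30s - 30s + 10s, 30s) = [10s, 30s], so parent.slice returns the three parent RDDs for the batches ending at 10s, 20s and 30s, and the result is simply their union, recomputed in full on every slide.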
Overload 2's compute function is considerably more complex:
override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  val reduceF = reduceFunc
  val invReduceF = invReduceFunc

  val currentTime = validTime
  val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
    currentTime)
  val previousWindow = currentWindow - slideDuration

  logDebug("Window time = " + windowDuration)
  logDebug("Slide time = " + slideDuration)
  logDebug("Zero time = " + zeroTime)
  logDebug("Current window = " + currentWindow)
  logDebug("Previous window = " + previousWindow)

  //  _____________________________
  // |  previous window   _________|___________________
  // |___________________|       current window        |  --------------> Time
  //                     |_____________________________|
  //
  // |________ _________|          |________ _________|
  //          |                             |
  //          V                             V
  //       old RDDs                     new RDDs
  //

  // Get the RDDs of the reduced values in "old time steps"
  val oldRDDs =
    reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
  logDebug("# old RDDs = " + oldRDDs.size)

  // Get the RDDs of the reduced values in "new time steps"
  val newRDDs =
    reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
  logDebug("# new RDDs = " + newRDDs.size)

  // Get the RDD of the reduced value of the previous window
  val previousWindowRDD =
    getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

  // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
  val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

  // Cogroup the reduced RDDs and merge the reduced values
  val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
    partitioner)
  // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

  val numOldValues = oldRDDs.size
  val numNewValues = newRDDs.size

  val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
    if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
      throw new Exception("Unexpected number of sequences of reduced values")
    }
    // Getting reduced values "old time steps" that will be removed from current window
    val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
    // Getting reduced values "new time steps"
    val newValues =
      (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

    if (arrayOfValues(0).isEmpty) {
      // If previous window's reduce value does not exist, then at least new values should exist
      if (newValues.isEmpty) {
        throw new Exception("Neither previous window has value for key, nor new values found. " +
          "Are you sure your key class hashes consistently?")
      }
      // Reduce the new values
      newValues.reduce(reduceF) // return
    } else {
      // Get the previous window's reduced value
      var tempValue = arrayOfValues(0).head
      // If old values exists, then inverse reduce then from previous value
      if (!oldValues.isEmpty) {
        tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
      }
      // If new values exists, then reduce them with previous value
      if (!newValues.isEmpty) {
        tempValue = reduceF(tempValue, newValues.reduce(reduceF))
      }
      tempValue // return
    }
  }

  val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
    .mapValues(mergeValues)

  if (filterFunc.isDefined) {
    Some(mergedValuesRDD.filter(filterFunc.get))
  } else {
    Some(mergedValuesRDD)
  }
}
As shown below, I have added a keyRDDs label to the official diagram:
//  _____________________________
// |  previous window   _________|___________________
// |___________________|       current window        |  --------------> Time
//                     |_____________________________|
//
// |________ _________|         |     |________ _________|
//          |                   |              |
//          V                   V              V
//       old RDDs            keyRDDs        new RDDs
//
From the code, the following operations are performed:
previous window - old RDDs = keyRDDs
keyRDDs + new RDDs = current window
// Get the previous window's reduced value
var tempValue = arrayOfValues(0).head
// If old values exist, inverse-reduce them out of the previous value
if (!oldValues.isEmpty) {
  tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))   // user-supplied function: subtract
}
// If new values exist, reduce them into the previous value
if (!newValues.isEmpty) {
  tempValue = reduceF(tempValue, newValues.reduce(reduceF))      // user-supplied function: add
}
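Stripped of the DStream and cogroup machinery, the per-key merge can be sketched as a plain function (the names below are mine, for illustration only):

// prevWindow: the previous window's reduced value for this key, if any
// oldValues:  per-batch reduced values that have slid out of the window (old RDDs)
// newValues:  per-batch reduced values that have slid into the window (new RDDs)
def mergeForKey(prevWindow: Option[Int], oldValues: Seq[Int], newValues: Seq[Int]): Int = {
  val add = (a: Int, b: Int) => a + b   // reduceFunc
  val sub = (a: Int, b: Int) => a - b   // invReduceFunc
  prevWindow match {
    case None => newValues.reduce(add)  // first window: only new batches exist
    case Some(prev) =>
      var tmp = prev
      if (oldValues.nonEmpty) tmp = sub(tmp, oldValues.reduce(add)) // previous window - old RDDs = keyRDDs
      if (newValues.nonEmpty) tmp = add(tmp, newValues.reduce(add)) // keyRDDs + new RDDs = current window
      tmp
  }
}

For the sdf key in the earlier example, window 4 corresponds to mergeForKey(Some(8), Seq(2), Seq.empty) == 6: the previous window's 8, minus batch 1's 2, with nothing new to add.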
The previous window's reduced result is then kept as the cache and retrieved via getOrCompute:
// Get the RDD of the reduced value of the previous window
val previousWindowRDD =
  getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))
So after the first window, each computation only has to process the old RDDs and the new RDDs, plus the cogroup with the previous window's result. It still looks fairly involved; is it really faster in practice? That remains to be tested.
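One way to test it would be to drive both overloads with the same Kafka input and compare per-batch processing times; a minimal sketch using a StreamingListener (the class name is mine):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs how long each batch took to process, so the two reduceByKeyAndWindow
// overloads can be compared under identical input.
class BatchTimeLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// ssc.addStreamingListener(new BatchTimeLogger())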