For a long time I did not really understand window operations: Spark Streaming already processes data in small batches, so why would you need windows on top of that?
The simplest scenario that makes windows easy to grasp is computing, every M seconds, the hot searches of the last N seconds. When M = N the window operation adds little, as noted above; but when M != N the advantage of windowed computation becomes obvious.
Windowed computation in Storm used to be a real hassle; Spark Streaming makes it much simpler.
Borrowing the figure and example from the official documentation:
In short: every 10 seconds, count the words seen in the last 30 seconds.
Two parameters:
window length - The duration of the window (3 in the figure). -> N
sliding interval - The interval at which the window operation is performed (2 in the figure). -> M
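In code, N and M are simply the two Duration arguments of the window operation. A minimal sketch (the socket source, host and port are hypothetical, used only to keep the example small; the actual Kafka-based code is further down):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
val ssc = new StreamingContext(conf, Seconds(10))       // batch interval
val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical text source

val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // N = 30s, M = 10s

wordCounts.print()
ssc.start()
ssc.awaitTermination()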
Input in each successive 10-second batch:
1: sdf sdfsd sdf
2: sdf sdfsd sdf
3: sdf sdfsd sdf sdf sdfsd sdf 234
Expected results:
1:
----------------------------------------
sdf : 2
sdfsd : 1
----------------------------------------
2:
----------------------------------------
sdf : 4
sdfsd : 2
----------------------------------------
3:
----------------------------------------
sdf : 8
sdfsd : 4
234 : 1
----------------------------------------
4:
----------------------------------------
sdf : 6
sdfsd : 3
234 : 1
----------------------------------------
5:
----------------------------------------
sdf : 4
sdfsd : 2
234 : 1
----------------------------------------
6:
----------------------------------------
(empty)
----------------------------------------
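These numbers follow directly from the window definition: with a 30-second window sliding every 10 seconds, each output covers the three most recent batches. Window 3 aggregates batches 1-3 (sdf: 2 + 2 + 4 = 8), window 4 aggregates batches 2-4 with batch 4 empty (sdf: 2 + 4 = 6), window 5 aggregates batches 3-5 (sdf: 4), and by window 6 every non-empty batch has slid out of the window.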
The code:
import kafka.common.{OffsetAndMetadata, TopicAndPartition}
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}

// Note: KafkaCluster is private[spark], so in practice you keep a copy of that class
// in your own project (or place this code under the org.apache.spark package).

// Command-line arguments with defaults
var maxRatePerPartition = "3700"
if (args.length > 1) maxRatePerPartition = args(0)
var master = "local"
if (args.length > 2) master = args(1)
var duration = 10
if (args.length > 3) duration = args(2).toInt
var group = "brokers"
if (args.length > 4) group = args(3)
var topic = "test"
if (args.length > 5) topic = args(4)
var brokerlist = "master:9092"
if (args.length > 6) brokerlist = args(5)

val sparkConf = new SparkConf()
  .setMaster(master)
  .setAppName("window")
  .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition) // per-partition rate limit

// Batch interval: 10 seconds by default
val ssc = new StreamingContext(sparkConf, Seconds(duration))
val topicsSet = topic.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokerlist,
  "group.id" -> group,
  "enable.auto.commit" -> "false")
val kc = new KafkaCluster(kafkaParams)

// Work out the starting offsets: prefer the consumer group's committed offsets,
// otherwise fall back to the latest leader offsets (or 0).
val topicAndPartition2 = kc.getPartitions(topicsSet)
var topicAndPartition: Set[TopicAndPartition] = Set[TopicAndPartition]()
var fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long]()
var bool = true

if (topicAndPartition2.isRight) topicAndPartition = topicAndPartition2.right.get
else {
  val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
  if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
  else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  bool = false
}

if (bool) {
  val temp = kc.getConsumerOffsets(group, topicAndPartition)
  if (temp.isRight) {
    fromOffsets = temp.right.get
  } else {
    val lateOffsets = kc.getLatestLeaderOffsets(topicAndPartition)
    if (lateOffsets.isLeft) { topicAndPartition.foreach { x => fromOffsets += (x -> 0) } }
    else { lateOffsets.right.get.foreach(x => fromOffsets += (x._1 -> x._2.offset)) }
  }
}

val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

// Capture each batch's offset ranges on the driver
var offsetsList = Array[OffsetRange]()
val msgContent = messages.transform(x => {
  offsetsList = x.asInstanceOf[HasOffsetRanges].offsetRanges
  x
})

val data = msgContent.flatMap(x => x._2.split(" "))

// Count words over a 30-second window, sliding every 10 seconds
val data2 = data.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

data2.foreachRDD(x => {
  x.foreach(x => {
    println(x._1 + " = " + x._2)
  })

  // Commit the consumed offsets back to Kafka after each batch's output.
  // This must run inside foreachRDD: at graph-construction time offsetsList is still empty.
  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition(topic, offsets.partition)
    val o = kc.setConsumerOffsetMetadata(group,
      Map[TopicAndPartition, OffsetAndMetadata](topicAndPartition -> OffsetAndMetadata(offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
})

ssc.start()
ssc.awaitTermination()
The output matches the expected results.
reduceByKeyAndWindow comes in two overloads. They produce the same result; the difference is the invReduceFunc parameter. Overload 1 recomputes the whole window every time, while overload 2 updates the previous window incrementally (adding the batches that slide in and subtracting the ones that slide out), which is supposed to be more efficient.
Overload 1
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, numPartitions)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
Overload 2
reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, numPartitions, filterFunc)
reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(30), Seconds(10))
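A minimal sketch of the two call styles, reusing the ssc and lines from the sketch near the top of the post. Note that the invReduceFunc variant requires a checkpoint directory (the path below is only a placeholder), because ReducedWindowedDStream forces mustCheckpoint = true, as the source further down shows:

// Assuming `ssc` and `lines` as in the earlier sketch.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Overload 1: recompute the whole 30-second window on every 10-second slide.
val full = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,        // reduceFunc
  Seconds(30), Seconds(10))

// Overload 2: update the previous window incrementally; needs checkpointing.
ssc.checkpoint("/tmp/window-checkpoint")   // hypothetical checkpoint directory
val incremental = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,        // reduceFunc: fold in batches entering the window
  (a: Int, b: Int) => a - b,        // invReduceFunc: fold out batches leaving the window
  Seconds(30), Seconds(10))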
Again, take the figure above as the example.
According to zhouzhihubeyond's blog, the principle is as follows:
Overload 1 (full recomputation):
window3 : time1+time2+time3
window5 : time3+time4+time5
Overload 2 (incremental computation):
window3 : (time1+time2)+time3, then checkpoint -> (time1+time2)|window3
window5 : window3 + time4 + time5 - (time1+time2); only time4+time5 needs computing, time3 does not
But this explanation raises a problem: to make the later window7 easy to compute, at window5 you would have to compute time3+time4 and cache it, so time3 still has to be computed anyway. Moreover, the amount that must be cached (for window3, for example: (time1+time2) + (window3 = time1+time2+time3) = 2*time1 + 2*time2 + time3 = 5/3 * window3) is 5/3 of the window itself. Under this reading the scheme is actually more complicated than overload 1, and uses more resources.
First ask whether that is actually true, then ask why.
Let's look at the source:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration

private[streaming]
class WindowedDStream[T: ClassTag](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {

  if (!_windowDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The window duration of windowed DStream (" + _windowDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  if (!_slideDuration.isMultipleOf(parent.slideDuration)) {
    throw new Exception("The slide duration of windowed DStream (" + _slideDuration + ") " +
      "must be a multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
  }

  // Persist parent level by default, as those RDDs are going to be obviously reused.
  parent.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = _slideDuration

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(level: StorageLevel): DStream[T] = {
    // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
    // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
    // Instead control the persistence of the parent DStream.
    parent.persist(level)
    this
  }

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    val rddsInWindow = parent.slice(currentWindow)
    Some(ssc.sc.union(rddsInWindow))
  }
}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.streaming.dstream

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{CoGroupedRDD, RDD}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Interval, Time}

private[streaming]
class ReducedWindowedDStream[K: ClassTag, V: ClassTag](
    parent: DStream[(K, V)],
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    filterFunc: Option[((K, V)) => Boolean],
    _windowDuration: Duration,
    _slideDuration: Duration,
    partitioner: Partitioner
  ) extends DStream[(K, V)](parent.ssc) {

  require(_windowDuration.isMultipleOf(parent.slideDuration),
    "The window duration of ReducedWindowedDStream (" + _windowDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  require(_slideDuration.isMultipleOf(parent.slideDuration),
    "The slide duration of ReducedWindowedDStream (" + _slideDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")"
  )

  // Reduce each batch of data using reduceByKey which will be further reduced by window
  // by ReducedWindowedDStream
  private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)

  // Persist RDDs to memory by default as these RDDs are going to be reused.
  super.persist(StorageLevel.MEMORY_ONLY_SER)
  reducedStream.persist(StorageLevel.MEMORY_ONLY_SER)

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(reducedStream)

  override def slideDuration: Duration = _slideDuration

  override val mustCheckpoint = true

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def persist(storageLevel: StorageLevel): DStream[(K, V)] = {
    super.persist(storageLevel)
    reducedStream.persist(storageLevel)
    this
  }

  override def checkpoint(interval: Duration): DStream[(K, V)] = {
    super.checkpoint(interval)
    // reducedStream.checkpoint(interval)
    this
  }

  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    val reduceF = reduceFunc
    val invReduceF = invReduceFunc

    val currentTime = validTime
    val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
      currentTime)
    val previousWindow = currentWindow - slideDuration

    logDebug("Window time = " + windowDuration)
    logDebug("Slide time = " + slideDuration)
    logDebug("Zero time = " + zeroTime)
    logDebug("Current window = " + currentWindow)
    logDebug("Previous window = " + previousWindow)

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --------------> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                     new RDDs
    //

    // Get the RDDs of the reduced values in "old time steps"
    val oldRDDs =
      reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
    logDebug("# old RDDs = " + oldRDDs.size)

    // Get the RDDs of the reduced values in "new time steps"
    val newRDDs =
      reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
    logDebug("# new RDDs = " + newRDDs.size)

    // Get the RDD of the reduced value of the previous window
    val previousWindowRDD =
      getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

    // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
    val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

    // Cogroup the reduced RDDs and merge the reduced values
    val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
      partitioner)
    // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

    val numOldValues = oldRDDs.size
    val numNewValues = newRDDs.size

    val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
      if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
        throw new Exception("Unexpected number of sequences of reduced values")
      }
      // Getting reduced values "old time steps" that will be removed from current window
      val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
      // Getting reduced values "new time steps"
      val newValues =
        (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

      if (arrayOfValues(0).isEmpty) {
        // If previous window's reduce value does not exist, then at least new values should exist
        if (newValues.isEmpty) {
          throw new Exception("Neither previous window has value for key, nor new values found. " +
            "Are you sure your key class hashes consistently?")
        }
        // Reduce the new values
        newValues.reduce(reduceF) // return
      } else {
        // Get the previous window's reduced value
        var tempValue = arrayOfValues(0).head
        // If old values exists, then inverse reduce then from previous value
        if (!oldValues.isEmpty) {
          tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
        }
        // If new values exists, then reduce them with previous value
        if (!newValues.isEmpty) {
          tempValue = reduceF(tempValue, newValues.reduce(reduceF))
        }
        tempValue // return
      }
    }

    val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
      .mapValues(mergeValues)

    if (filterFunc.isDefined) {
      Some(mergedValuesRDD.filter(filterFunc.get))
    } else {
      Some(mergedValuesRDD)
    }
  }
}
As you can see, the key difference between them is the compute function.
Overload 1's compute is unremarkable: it derives the current window interval from two points in time, slices the parent DStream over that interval, and unions the resulting RDDs.
override def compute(validTime: Time): Option[RDD[T]] = {
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  Some(ssc.sc.union(rddsInWindow))
}
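To make the interval concrete: with a 10-second batch interval and a 30-second window, at validTime = 30s the computed interval is Interval(30s - 30s + 10s, 30s) = [10s, 30s], so parent.slice returns the three parent RDDs for the batches ending at 10s, 20s and 30s, and the result is simply their union, recomputed in full on every slide.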
Overload 2's compute function is considerably more complex:
override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  val reduceF = reduceFunc
  val invReduceF = invReduceFunc

  val currentTime = validTime
  val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
    currentTime)
  val previousWindow = currentWindow - slideDuration

  logDebug("Window time = " + windowDuration)
  logDebug("Slide time = " + slideDuration)
  logDebug("Zero time = " + zeroTime)
  logDebug("Current window = " + currentWindow)
  logDebug("Previous window = " + previousWindow)

  //  _____________________________
  // |  previous window   _________|___________________
  // |___________________|       current window        |  --------------> Time
  //                     |_____________________________|
  //
  // |________ _________|          |________ _________|
  //          |                             |
  //          V                             V
  //       old RDDs                     new RDDs
  //

  // Get the RDDs of the reduced values in "old time steps"
  val oldRDDs =
    reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
  logDebug("# old RDDs = " + oldRDDs.size)

  // Get the RDDs of the reduced values in "new time steps"
  val newRDDs =
    reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
  logDebug("# new RDDs = " + newRDDs.size)

  // Get the RDD of the reduced value of the previous window
  val previousWindowRDD =
    getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

  // Make the list of RDDs that needs to cogrouped together for reducing their reduced values
  val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

  // Cogroup the reduced RDDs and merge the reduced values
  val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],
    partitioner)
  // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

  val numOldValues = oldRDDs.size
  val numNewValues = newRDDs.size

  val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {
    if (arrayOfValues.length != 1 + numOldValues + numNewValues) {
      throw new Exception("Unexpected number of sequences of reduced values")
    }
    // Getting reduced values "old time steps" that will be removed from current window
    val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)
    // Getting reduced values "new time steps"
    val newValues =
      (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

    if (arrayOfValues(0).isEmpty) {
      // If previous window's reduce value does not exist, then at least new values should exist
      if (newValues.isEmpty) {
        throw new Exception("Neither previous window has value for key, nor new values found. " +
          "Are you sure your key class hashes consistently?")
      }
      // Reduce the new values
      newValues.reduce(reduceF) // return
    } else {
      // Get the previous window's reduced value
      var tempValue = arrayOfValues(0).head
      // If old values exists, then inverse reduce then from previous value
      if (!oldValues.isEmpty) {
        tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))
      }
      // If new values exists, then reduce them with previous value
      if (!newValues.isEmpty) {
        tempValue = reduceF(tempValue, newValues.reduce(reduceF))
      }
      tempValue // return
    }
  }

  val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]
    .mapValues(mergeValues)

  if (filterFunc.isDefined) {
    Some(mergedValuesRDD.filter(filterFunc.get))
  } else {
    Some(mergedValuesRDD)
  }
}
As shown below, I have added a keyRDDs label to the official diagram:
//  _____________________________
// |  previous window   _________|___________________
// |___________________|       current window        |  --------------> Time
//                     |_____________________________|
//
// |________ _________|         |     |________ _________|
//          |                   |              |
//          V                   V              V
//       old RDDs            keyRDDs        new RDDs
//
From the code, the following operations are performed:
previous window - old RDDs = keyRDDs
keyRDDs + new RDDs = current window
// Get the previous window's reduced value
var tempValue = arrayOfValues(0).head
// If old values exist, inverse-reduce them out of the previous value
if (!oldValues.isEmpty) {
  tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))   // user-supplied function: subtract
}
// If new values exist, reduce them into the previous value
if (!newValues.isEmpty) {
  tempValue = reduceF(tempValue, newValues.reduce(reduceF))      // user-supplied function: add
}
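Stripped of the DStream and cogroup machinery, the per-key merge can be sketched as a plain function (the names below are mine, for illustration only):

// prevWindow: the previous window's reduced value for this key, if any
// oldValues:  per-batch reduced values that have slid out of the window (old RDDs)
// newValues:  per-batch reduced values that have slid into the window (new RDDs)
def mergeForKey(prevWindow: Option[Int], oldValues: Seq[Int], newValues: Seq[Int]): Int = {
  val add = (a: Int, b: Int) => a + b   // reduceFunc
  val sub = (a: Int, b: Int) => a - b   // invReduceFunc
  prevWindow match {
    case None => newValues.reduce(add)  // first window: only new batches exist
    case Some(prev) =>
      var tmp = prev
      if (oldValues.nonEmpty) tmp = sub(tmp, oldValues.reduce(add)) // previous window - old RDDs = keyRDDs
      if (newValues.nonEmpty) tmp = add(tmp, newValues.reduce(add)) // keyRDDs + new RDDs = current window
      tmp
  }
}

For the sdf key in the earlier example, window 4 corresponds to mergeForKey(Some(8), Seq(2), Seq.empty) == 6: the previous window's 8, minus batch 1's 2, with nothing new to add.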
The previous window's reduced result is then kept as the cache and retrieved via getOrCompute:
// Get the RDD of the reduced value of the previous window
val previousWindowRDD =
  getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))
So after the first window, each computation only has to process the old RDDs and the new RDDs, plus the cogroup with the previous window's result. It still looks fairly involved; is it really faster in practice? That remains to be tested.
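One way to test it would be to drive both overloads with the same Kafka input and compare per-batch processing times; a minimal sketch using a StreamingListener (the class name is mine):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs how long each batch took to process, so the two reduceByKeyAndWindow
// overloads can be compared under identical input.
class BatchTimeLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// ssc.addStreamingListener(new BatchTimeLogger())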