一、基本概念
1.窗口分类
TimeWindow:按照时间生成 Window。对于 TimeWindow,可以根据窗口实现原理的不同分成三类:滚动窗口(TumblingWindow)、滑动窗口(Sliding Window)和会话窗口(Session Window)。
2.时间分类
Event Time:是事件创建的时间。它通常由事件中的时间戳描述,例如采集的日志数据中,每一条日志都会记录自己的生成时间,Flink 通过时间戳分配器访问事件时间戳。
Ingestion Time:是数据进入 Flink 的时间。
二、案例演示
案例1:按Processing Time划分滚动时间窗口
import org.apache.flink.streaming.api.scala._ import org.apache.flink.streaming.api.windowing.time.Time object WindowTest { def main(args: Array[String]): Unit = { val env = StreamExecutionEnvironment.getExecutionEnvironment env.setParallelism(1) val socketStream = env.socketTextStream("hadoop102",7777) val dataStream: DataStream[SensorReading] = socketStream.map(d => { val arr = d.split(",") SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble) }) //统计10秒内的最小温度 val minTemperatureStream = dataStream.map(data=>(data.id,data.temperature)) .keyBy(_._1) .timeWindow(Time.seconds(10)) //10秒滚动窗口,不指定时间特性,默认为ProcessingTime .reduce((data1, data2)=>(data1._1,data1._2.min(data2._2))) //打印原始的dataStream dataStream.print("data stream") //打印窗口数据流 minTemperatureStream.print("min temperature") env.execute("window test") } }
测试:
连续输入两条数据
[atguigu@hadoop102 ~]$ nc -lk 7777 sensor_1, 1547718200, 30.8 sensor_1, 1547718201, 40.8
在一个10秒的滚动窗口内,窗口流minTemperatureStream 只输出了一条数据。此时触发TimeWindow去计算的时机就是第一条数据来的10秒过后。
data stream> SensorReading(sensor_1,1547718200,30.8) data stream> SensorReading(sensor_1,1547718201,40.8) min temperature> (sensor_1,30.8)
案例2:按EventTime划分带水位的滚动时间窗口
知识点:
①通过env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)设定窗口的时间特性为事件时间。
②在assignTimestampsAndWatermarks()方法中,传递一个BoundedOutOfOrdernessTimestampExtractor类实现对象,构造器参数就是容忍的延迟时间,实现方法,指明时间戳用哪个字段。
③env.getConfig.setAutoWatermarkInterval(300) //周期性的生成watermark:系统会周期性的将watermark插入到流中(水位线也是一种特殊的事件!)。默认周期是200毫秒。
④如果在keyBy之前设置水位线,则所有分区共用一个watermark。但是输出还是根据keyBy的结果分开输出。
object WindowTest { def main(args: Array[String]): Unit = { val env = StreamExecutionEnvironment.getExecutionEnvironment env.setParallelism(1) env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) val socketStream = env.socketTextStream("hadoop102",7777) val dataStream: DataStream[SensorReading] = socketStream .map(d => { val arr = d.split(",") SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble) }) .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(2)) { override def extractTimestamp(t: SensorReading): Long = t.timestamp * 1000 }) //.assignAscendingTimestamps(_.timestamp) //升序数据添指定时间戳 //统计5秒内的最小温度 val minTemperatureStream = dataStream.map(data=>(data.id,data.temperature)) .keyBy(_._1) .timeWindow(Time.seconds(5)) //5秒滚动窗口 .reduce((data1, data2)=>(data1._1,data1._2.min(data2._2))) //打印原始的dataStream dataStream.print("data stream") //打印窗口数据流 minTemperatureStream.print("min temperature") env.execute("window test") } }
测试:
sockt输入数据如下
[atguigu@hadoop102 ~]$ nc -lk 7777 sensor_1, 1547718199, 29 sensor_1, 1547718200, 30 sensor_1, 1547718201, 31 sensor_1, 1547718202, 32 sensor_1, 1547718203, 33
控制台打印如下:
data stream> SensorReading(sensor_1,1547718199,29.0) data stream> SensorReading(sensor_1,1547718200,30.0) data stream> SensorReading(sensor_1,1547718201,31.0) data stream> SensorReading(sensor_1,1547718202,32.0) min temperature> (sensor_1,29.0) data stream> SensorReading(sensor_1,1547718203,33.0)
滚动窗口的第一个窗口的起始时间如何确定?
//滚动窗口,添加初始窗口的源码 public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) { if (timestamp > -9223372036854775808L) { long start = TimeWindow.getWindowStartWithOffset(timestamp, this.offset, this.size); return Collections.singletonList(new TimeWindow(start, start + this.size)); } else { throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?"); } } //计算窗口起始点,时区偏移offset默认为0。windowSize实际上是滑动长度。加一下再取模也没有意义。
public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) { return timestamp - (timestamp - offset + windowSize) % windowSize; }
对于我们的测试数据,窗口起始点start = 1547718199 - (1547718199 - 0 + 5) % 5 = 1547718195,所以第一个窗口是1547718195到1547718200,如果没有水位线设置,会在接收到数据时间戳为1547718200时关闭窗口,但是由于设定了延迟2秒,所以当接收到时间戳为1547718202的数据时才打印输出。
也可以自定义watermark从事件数据中抽取时间戳。方式一,周期式,继承AssignerWithPeriodicWatermarks,默认200毫秒抽取一次。方式二,间断式,继承AssignerWithPunctuatedWatermarks,根据需要自定义筛选条件。周期式举例:
class MyAssigner extends AssignerWithPeriodicWatermarks[SensorReading]{ // 1分钟延迟 val bound = 60 * 1000 // 记录数据的最大时间戳 var maxTs = Long.MinValue // 水位线等于最大时间戳减延迟 override def getCurrentWatermark: Watermark = new Watermark(maxTs - bound) override def extractTimestamp(t: SensorReading, l: Long): Long = { //更新最大的时间戳 maxTs = maxTs.max(t.timestamp * 1000) //时间戳单位毫秒 t.timestamp * 1000 } }
案例3:滑动时间窗口
滑动窗口和滚动窗口特性类似,滚动窗口可以看作一种特殊的滑动窗口,其窗口长度与滑动长度一样。在.timeWindow(Time.seconds(10),Time.seconds(5)) 方法中,设定了窗口的长度为10,滑动长度为5。窗口长度决定了窗口计算的数据的范围有多大,而滑动长度决定了窗口计算并关闭的时机。
def main(args: Array[String]): Unit = { val env = StreamExecutionEnvironment.getExecutionEnvironment env.setParallelism(1) env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) val socketStream = env.socketTextStream("hadoop102",7777) val dataStream: DataStream[SensorReading] = socketStream .map(d => { val arr = d.split(",") SensorReading(arr(0).trim, arr(1).trim.toLong, arr(2).toDouble) }) .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(0)) {//简单起见,去掉水位设置 override def extractTimestamp(t: SensorReading): Long = t.timestamp * 1000 }) //统计15秒内的最小温度,5秒输出一次 val minTemperatureStream = dataStream.map(data=>(data.id,data.temperature)) .keyBy(_._1) .timeWindow(Time.seconds(15), Time.seconds(5)) //滑动窗口 .reduce((data1, data2)=>(data1._1,data1._2.min(data2._2))) //打印原始的dataStream dataStream.print("data stream") //打印窗口数据流 minTemperatureStream.print("min temperature") env.execute("window test") }
socket数据
[atguigu@hadoop102 ~]$ nc -lk 7777 sensor_1, 1547718199, 29 sensor_1, 1547718200, 30 sensor_1, 1547718201, 31
控制台打印
data stream> SensorReading(sensor_1,1547718199,29.0) data stream> SensorReading(sensor_1,1547718200,30.0) min temperature> (sensor_1,29.0) data stream> SensorReading(sensor_1,1547718201,31.0)
滑动窗口的第一个窗口如何确定?与滚动窗口不同,由于滑动长度与窗口长度不一样,所以会设置多个初始窗口。
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) { if (timestamp <= -9223372036854775808L) { throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?"); } else { List<TimeWindow> windows = new ArrayList((int)(this.size / this.slide)); long lastStart = TimeWindow.getWindowStartWithOffset(timestamp, this.offset, this.slide); for(long start = lastStart; start > timestamp - this.size; start -= this.slide) { windows.add(new TimeWindow(start, start + this.size)); } return windows; } }
对于这个案例,lastStart = 1547718199 - (1547718199 - 0 + 5) % 5 = 1547718195,只要start > 1547718199 -15 = 1547718184就会一直循环添加窗口,在第一次for循环中,添加第一个窗口1547718195到1547718210,第二次添加1547718190到1547718205,第三次添加1547718185到1547718200。生成的窗口数为Math.ceil(15.0 / 5.0),所以在输入第二条数据,时间戳为1547718200,就触发了第三个窗口关闭。