CONNECT
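union can merge multiple data streams, but with one restriction: all of the streams must have the same data type. connect provides functionality similar to union for connecting two data streams. It differs from union in the following ways:
- connect can only connect two data streams, whereas union can connect more than two.
- The two streams joined by connect may have different data types, whereas the streams joined by union must share the same data type.
- After connect, the two DataStream objects become a ConnectedStreams. ConnectedStreams applies a different processing method to each stream's data, and state can be shared between the two streams.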
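connect is often used when one data stream is controlled by another, as shown in the figure below. The control stream can carry thresholds, rules, machine learning models, or other parameters.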
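The control-stream pattern can be modeled without Flink at all. The sketch below is plain Java, not the Flink API: `ThresholdFilter`, `onControl`, and `onData` are hypothetical names chosen for illustration. One callback updates a threshold from the control stream, and the other checks each data value against the latest threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a control stream steering a data stream.
// All names here are hypothetical; this is not Flink code.
class ThresholdFilter {
    private double threshold;                          // latest value from the control stream
    private final List<Double> output = new ArrayList<>();

    ThresholdFilter(double initialThreshold) {
        this.threshold = initialThreshold;
    }

    // Called for each element of the control stream.
    void onControl(double newThreshold) {
        this.threshold = newThreshold;
    }

    // Called for each element of the data stream; keeps values above the threshold.
    void onData(double value) {
        if (value > threshold) {
            output.add(value);
        }
    }

    List<Double> output() {
        return output;
    }

    public static void main(String[] args) {
        ThresholdFilter f = new ThresholdFilter(10.0);
        f.onData(12.0);      // 12 > 10, emitted
        f.onControl(20.0);   // the control stream raises the threshold
        f.onData(12.0);      // 12 <= 20, dropped
        f.onData(25.0);      // 25 > 20, emitted
        System.out.println(f.output()); // prints [12.0, 25.0]
    }
}
```

The same value (12.0) is emitted once and dropped once, which shows that the result depends on which control element arrived before it.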
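Combining the events of two streams is a very common stream-processing requirement. Consider monitoring a forest and raising a high-risk fire alarm. The alerting application receives two streams: readings from temperature sensors and readings from smoke sensors. It raises an alarm when both streams exceed their respective thresholds.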
The DataStream API provides the connect operation for this kind of scenario. DataStream.connect() takes another DataStream and returns a ConnectedStreams object that represents the two connected streams.
scala version
val first = ...
val second = ...
val connected = first.connect(second)
java version
// first stream
DataStream<Integer> first = ...
// second stream
DataStream<String> second = ...

// connect streams
ConnectedStreams<Integer, String> connected = first.connect(second);
ConnectedStreams provides map() and flatMap() methods, which take a CoMapFunction and a CoFlatMapFunction as their argument, respectively.
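Both functions are parameterized by the event type of the first stream, the event type of the second stream, and the event type of the output stream. Each also defines two methods, one per input stream: map1() and flatMap1() are called for elements of the first stream, and map2() and flatMap2() are called for elements of the second stream.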
// IN1: event type of the first stream
// IN2: event type of the second stream
// OUT: event type of the output stream
CoMapFunction[IN1, IN2, OUT]
    > map1(IN1): OUT
    > map2(IN2): OUT

CoFlatMapFunction[IN1, IN2, OUT]
    > flatMap1(IN1, Collector[OUT]): Unit
    > flatMap2(IN2, Collector[OUT]): Unit
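The function cannot choose which stream to read from, and we cannot control the order in which its two methods are called. Whenever an element arrives on one of the streams, the corresponding method is invoked.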
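To make the two-callback shape concrete, here is a minimal plain-Java sketch. The `CoMap` interface is a local stand-in declared only so the example compiles without Flink; it mirrors the map1/map2 shape described above but is not Flink's class. `CoMapDemo` and `run` are likewise hypothetical. The driver simply invokes whichever method matches the next arriving element.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in that mirrors the shape of a co-map function.
// Declared here so the sketch compiles without Flink on the classpath.
interface CoMap<IN1, IN2, OUT> {
    OUT map1(IN1 value);   // invoked for elements of the first stream
    OUT map2(IN2 value);   // invoked for elements of the second stream
}

class CoMapDemo {
    // Feeds events to the function in arrival order. In this sketch, Integer
    // events stand for the first stream and String events for the second.
    static List<String> run(List<Object> arrivals, CoMap<Integer, String, String> fn) {
        List<String> out = new ArrayList<>();
        for (Object event : arrivals) {
            if (event instanceof Integer) {
                out.add(fn.map1((Integer) event));
            } else {
                out.add(fn.map2((String) event));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CoMap<Integer, String, String> fn = new CoMap<Integer, String, String>() {
            @Override public String map1(Integer value) { return "int:" + value; }
            @Override public String map2(String value) { return "str:" + value; }
        };
        // The function does not pick an input; the arrival order decides.
        System.out.println(run(List.<Object>of(1, "a", 2), fn)); // prints [int:1, str:a, int:2]
        System.out.println(run(List.<Object>of("a", 1, 2), fn)); // prints [str:a, int:1, int:2]
    }
}
```

Both inputs have different types, yet both callbacks produce the same output type, which is what lets the result be a single stream.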
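Joining two streams usually requires that their events be routed deterministically, based on some condition, to the same parallel instance of the operator. By default, connect() establishes no relationship between the events of the two streams, so events from both streams are distributed to downstream operator instances arbitrarily. This leads to nondeterministic results, which is usually not what we want. To get deterministic transformations on ConnectedStreams, connect() can be combined with keyBy() or broadcast(). We look at keyBy() first.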
scala version
val one = ...
val two = ...

val keyedConnect1 = one.connect(two).keyBy(0, 0)

val keyedConnect2 = one.keyBy(0).connect(two.keyBy(0))
java version
DataStream<Tuple2<Integer, Long>> one = ...
DataStream<Tuple2<Integer, String>> two = ...

// keyBy two connected streams
ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect1 = one
    .connect(two)
    .keyBy(0, 0); // key both input streams on first attribute

// alternative: connect two keyed streams
ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect2 = one
    .keyBy(0)
    .connect(two.keyBy(0));
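Whether you apply keyBy() to a ConnectedStreams or use connect() on two KeyedStreams, all events with the same key from both streams are sent to the same operator instance. The keys of the two streams must have the same type and value for their events to meet, much like a JOIN in SQL. Operators applied to connected, keyed streams have access to keyed state.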
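The routing guarantee can be illustrated with a simplified model. The sketch below is plain Java and does not reproduce Flink's actual partitioner, which assigns keys to key groups using its own hashing; taking hashCode modulo the parallelism is an assumption made purely for illustration, and `KeyRouting` and `instanceFor` are hypothetical names. The property it shows is that equal keys map to the same instance, regardless of which stream they came from.

```java
// Simplified model of key-based routing; not Flink's actual partitioner.
// Plain hashCode modulo parallelism is an assumption made for illustration.
class KeyRouting {
    // Returns the index of the parallel instance that would receive this key.
    static int instanceFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        Integer[] firstStreamKeys = {1, 2, 1};   // keys taken from events of the first stream
        Integer[] secondStreamKeys = {2, 1};     // keys taken from events of the second stream
        for (Integer k : firstStreamKeys) {
            System.out.println("first stream key " + k + " -> instance " + instanceFor(k, parallelism));
        }
        for (Integer k : secondStreamKeys) {
            System.out.println("second stream key " + k + " -> instance " + instanceFor(k, parallelism));
        }
    }
}
```

Because the instance depends only on the key, an event with key 1 from the first stream and an event with key 1 from the second stream land on the same instance, which is what makes a shared per-key state meaningful.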
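Example 1:
For example, we combine the stock price stream from earlier with a stream of media sentiment, grouping both by stock symbol.
If a DataStream is not grouped by key, its data is distributed across TaskSlots arbitrarily, whereas in most cases we want to analyze and process data for a particular key. Flink lets us combine connect with keyBy or broadcast.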
// connect the two streams first, then keyBy
val keyByConnect1: ConnectedStreams[StockPrice, Media] = stockPriceRawStream
  .connect(mediaStatusStream)
  .keyBy(0, 0)

// keyBy first, then connect
val keyByConnect2: ConnectedStreams[StockPrice, Media] = stockPriceRawStream.keyBy(0)
  .connect(mediaStatusStream.keyBy(0))
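Whether we keyBy first or connect first, data with the same key is forwarded to the same downstream operator instance. This resembles a join in SQL. Flink also provides a join operator, which works mainly over time windows; connect is more general by comparison. join is covered in a later article.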
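The code below shows how to combine stock prices with positive and negative media sentiment: when the media sentiment is positive and the stock price is above a threshold, it emits a positive signal. The complete code is on my GitHub: https://github.com/luweizheng/flink-tutorials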
package com.flink.tutorials.demos.stock

import java.util.Calendar

import com.flink.tutorials.demos.stock.StockPriceDemo.{StockPrice, StockPriceSource, StockPriceTimeAssigner}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.functions.source.RichSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

import scala.util.Random

object StockMediaConnectedDemo {

  def main(args: Array[String]): Unit = {
    // set up the execution environment
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // generate a watermark every 5 seconds
    env.getConfig.setAutoWatermarkInterval(5000L)

    // stock price stream
    val stockPriceRawStream: DataStream[StockPrice] = env
      // randomly generated by the StockPriceSource class
      .addSource(new StockPriceSource)
      // assign timestamps and watermarks
      .assignTimestampsAndWatermarks(new StockPriceTimeAssigner)

    val mediaStatusStream: DataStream[Media] = env
      .addSource(new MediaSource)

    // connect the two streams first, then keyBy
    val keyByConnect1: ConnectedStreams[StockPrice, Media] = stockPriceRawStream
      .connect(mediaStatusStream)
      .keyBy(0, 0)

    // keyBy first, then connect
    val keyByConnect2: ConnectedStreams[StockPrice, Media] = stockPriceRawStream.keyBy(0)
      .connect(mediaStatusStream.keyBy(0))

    val alerts1 = keyByConnect1.flatMap(new AlertFlatMap).print()
    val alerts2 = keyByConnect2.flatMap(new AlertFlatMap).print()

    // run the program
    env.execute("connect stock price with media status")
  }

  /** Media sentiment
    *
    * symbol    stock symbol
    * timestamp timestamp
    * status    sentiment: positive / normal / negative
    */
  case class Media(symbol: String, timestamp: Long, status: String)

  class MediaSource extends RichSourceFunction[Media] {

    var isRunning: Boolean = true
    val rand = new Random()
    var stockId = 0

    override def run(srcCtx: SourceContext[Media]): Unit = {
      while (isRunning) {
        // pick one stock at random on each iteration
        stockId = rand.nextInt(5)
        var status: String = "NORMAL"
        if (rand.nextGaussian() > 0.9) {
          status = "POSITIVE"
        } else if (rand.nextGaussian() < 0.05) {
          status = "NEGATIVE"
        }
        val curTime = Calendar.getInstance.getTimeInMillis
        srcCtx.collect(Media(stockId.toString, curTime, status))
        Thread.sleep(rand.nextInt(100))
      }
    }

    override def cancel(): Unit = {
      isRunning = false
    }
  }

  case class Alert(symbol: String, timestamp: Long, alert: String)

  class AlertFlatMap extends RichCoFlatMapFunction[StockPrice, Media, Alert] {

    // price threshold per stock, indexed by stock id
    var priceMaxThreshold: List[Double] = List(101.0d, 201.0d, 301.0d, 401.0d, 501.0d)
    // plain field, not keyed state: shared by every key this parallel instance handles
    var mediaLevel: String = "NORMAL"

    // first stream: emit an alert when sentiment is positive and the price exceeds the threshold
    override def flatMap1(stock: StockPrice, collector: Collector[Alert]): Unit = {
      val stockId = stock.symbol.toInt
      if ("POSITIVE".equals(mediaLevel) && stock.price > priceMaxThreshold(stockId)) {
        collector.collect(Alert(stock.symbol, stock.timestamp, "POSITIVE"))
      }
    }

    // second stream: remember the latest media status
    override def flatMap2(media: Media, collector: Collector[Alert]): Unit = {
      mediaLevel = media.status
    }
  }
}
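One detail of AlertFlatMap is worth noting: mediaLevel is an ordinary field, so a single parallel instance shares it across every stock symbol routed to it, and the latest media status of one symbol can affect alerts for another. The sketch below shows per-symbol tracking in plain Java, with a HashMap standing in for keyed state. It is an illustrative model under that assumption, not the author's implementation and not Flink code; `PerSymbolAlert`, `onMedia`, and `onPrice` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of per-key state; a HashMap stands in for keyed state.
// Class and method names are hypothetical.
class PerSymbolAlert {
    private final Map<String, String> mediaLevelBySymbol = new HashMap<>();
    private final List<String> alerts = new ArrayList<>();

    // Second input: remember the latest media status for this symbol only.
    void onMedia(String symbol, String status) {
        mediaLevelBySymbol.put(symbol, status);
    }

    // First input: alert when this symbol's own status is POSITIVE and the
    // price exceeds the given threshold.
    void onPrice(String symbol, double price, double threshold) {
        String level = mediaLevelBySymbol.getOrDefault(symbol, "NORMAL");
        if ("POSITIVE".equals(level) && price > threshold) {
            alerts.add(symbol);
        }
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        PerSymbolAlert op = new PerSymbolAlert();
        op.onMedia("0", "POSITIVE");
        op.onMedia("1", "NEGATIVE");
        op.onPrice("1", 500.0, 201.0);  // symbol 1 is NEGATIVE: no alert
        op.onPrice("0", 150.0, 101.0);  // symbol 0 is POSITIVE and above its threshold
        System.out.println(op.alerts()); // prints [0]
    }
}
```

With a single shared field, the NEGATIVE status recorded for symbol 1 would have overwritten the POSITIVE status of symbol 0 and suppressed the alert.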
Example 2:
The following example shows how to connect a DataStream with a broadcast stream.
scala version
val one = ...
val two = ...

val keyedConnect = one.connect(two.broadcast)
java version
DataStream<Tuple2<Integer, Long>> one = ...
DataStream<Tuple2<Integer, String>> two = ...

// connect streams with broadcast
ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect = one
    // broadcast second input stream
    .connect(two.broadcast());
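Every element of a broadcast stream is replicated and sent to all parallel instances of the downstream operator, while elements of the non-broadcast stream are simply forwarded. Each parallel instance therefore sees every element of the broadcast stream and can process it together with its share of the other stream.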
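The difference between broadcast and ordinary delivery can be modeled in a few lines. This is a plain-Java sketch rather than Flink code; `BroadcastModel`, `broadcast`, and `sendTo` are hypothetical names, and each list plays the role of one parallel instance's input.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of broadcast versus single-instance delivery.
// Names are hypothetical; each inner list is one parallel instance's inbox.
class BroadcastModel {
    final List<List<String>> inboxes = new ArrayList<>();

    BroadcastModel(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            inboxes.add(new ArrayList<>());
        }
    }

    // A broadcast element is copied to every parallel instance.
    void broadcast(String element) {
        for (List<String> inbox : inboxes) {
            inbox.add(element);
        }
    }

    // A non-broadcast element goes to exactly one instance.
    void sendTo(int instance, String element) {
        inboxes.get(instance).add(element);
    }

    public static void main(String[] args) {
        BroadcastModel model = new BroadcastModel(3);
        model.broadcast("rule");      // every instance receives the rule
        model.sendTo(1, "event");     // only instance 1 receives the event
        System.out.println(model.inboxes); // prints [[rule], [rule, event], [rule]]
    }
}
```

Whichever instance an event ends up on, the broadcast element is already there, so no key agreement between the two streams is needed.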
Example:
The alert class:
scala version
case class Alert(message: String, timestamp: Long)
java version
public class Alert {

    public String message;
    public long timestamp;

    public Alert() { }

    public Alert(String message, long timestamp) {
        this.message = message;
        this.timestamp = timestamp;
    }

    public String toString() {
        return "(" + message + ", " + timestamp + ")";
    }
}
The smoke sensor reading type:
public enum SmokeLevel { LOW, HIGH }
A custom source that produces smoke sensor readings:
import java.util.Random;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class SmokeLevelSource implements SourceFunction<SmokeLevel> {

    private boolean running = true;

    @Override
    public void run(SourceContext<SmokeLevel> srcCtx) throws Exception {
        Random rand = new Random();
        // emit one reading per second until the source is cancelled
        while (running) {
            if (rand.nextGaussian() > 0.8) {
                srcCtx.collect(SmokeLevel.HIGH);
            } else {
                srcCtx.collect(SmokeLevel.LOW);
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
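Monitor a forest and raise a high-risk fire alarm. The alerting application receives two streams: readings from temperature sensors and readings from smoke sensors. It raises an alarm when both streams exceed their respective thresholds.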
scala version
object MultiStreamTransformations {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val tempReadings = env.addSource(new SensorSource)

    val smokeReadings = env
      .addSource(new SmokeLevelSource)
      .setParallelism(1)

    val keyedTempReadings = tempReadings
      .keyBy(r => r.id)

    val alerts = keyedTempReadings
      .connect(smokeReadings.broadcast)
      .flatMap(new RaiseAlertFlatMap)

    alerts.print()

    env.execute("Multi-Stream Transformations Example")
  }

  class RaiseAlertFlatMap extends CoFlatMapFunction[SensorReading, SmokeLevel, Alert] {

    // latest smoke level seen by this parallel instance
    private var smokeLevel: SmokeLevel = SmokeLevel.LOW

    override def flatMap1(tempReading: SensorReading, out: Collector[Alert]): Unit = {
      // high smoke level and high temperature: risk of fire
      if (smokeLevel == SmokeLevel.HIGH && tempReading.temperature > 100) {
        out.collect(Alert("Risk of fire! " + tempReading, tempReading.timestamp))
      }
    }

    override def flatMap2(sl: SmokeLevel, out: Collector[Alert]): Unit = {
      // update smoke level
      smokeLevel = sl
    }
  }
}
java version
public class MultiStreamTransformations {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<SensorReading> tempReadings = env
            .addSource(new SensorSource());

        DataStream<SmokeLevel> smokeReadings = env
            .addSource(new SmokeLevelSource())
            .setParallelism(1);

        KeyedStream<SensorReading, String> keyedTempReadings = tempReadings
            .keyBy(r -> r.id);

        DataStream<Alert> alerts = keyedTempReadings
            .connect(smokeReadings.broadcast())
            .flatMap(new RaiseAlertFlatMap());

        alerts.print();

        env.execute("Multi-Stream Transformations Example");
    }

    public static class RaiseAlertFlatMap implements CoFlatMapFunction<SensorReading, SmokeLevel, Alert> {

        private SmokeLevel smokeLevel = SmokeLevel.LOW;

        @Override
        public void flatMap1(SensorReading tempReading, Collector<Alert> out) throws Exception {
            // high chance of fire => true
            if (this.smokeLevel == SmokeLevel.HIGH && tempReading.temperature > 100) {
                out.collect(new Alert("Risk of fire! " + tempReading, tempReading.timestamp));
            }
        }

        @Override
        public void flatMap2(SmokeLevel smokeLevel, Collector<Alert> out) {
            // update smoke level
            this.smokeLevel = smokeLevel;
        }
    }
}
Reference link
https://cloud.tencent.com/developer/article/1560680