• Flink Examples (46): Operators (7) Multi-Stream Transformation Operators (2): CONNECT, COMAP, and COFLATMAP


    CONNECT

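    Although union can merge several data streams, it has one restriction: all of the streams must have the same data type. connect offers a similar capability for joining two data streams. It differs from union in three ways:

    1. connect can join only two streams, whereas union can join several.
    2. The two streams joined by connect may have different data types, whereas the streams joined by union must all share one type.
    3. connect turns the two DataStreams into a ConnectedStreams. A ConnectedStreams applies a different processing method to each of the two streams, and the two streams can share state.

    connect is often used when one stream controls how another stream is processed. The control stream can carry thresholds, rules, machine-learning models, or other parameters.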

     

     

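    Combining events from two streams is a very common stream-processing requirement. For example, an application that monitors a forest and raises high-risk fire alerts receives two streams: one with readings from temperature sensors and one with readings from smoke sensors. It raises an alert when both streams exceed their respective thresholds.

    The DataStream API provides the connect operation for this kind of scenario. DataStream.connect() takes a DataStream and returns a ConnectedStreams object that represents the two connected streams.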

    scala version

    val first = ...
    val second = ...
    val connected = first.connect(second)

    java version

    // first stream
    DataStream<Integer> first = ...
    // second stream
    DataStream<String> second = ...
    
    // connect streams
    ConnectedStreams<Integer, String> connected = first.connect(second);

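    ConnectedStreams provides the map() and flatMap() methods, which take a CoMapFunction and a CoFlatMapFunction as their argument, respectively.

    Both functions are parameterized by the event type of the first stream, the event type of the second stream, and the event type of the output stream. Each also defines two methods, one per input stream: map1() and flatMap1() are called for elements of the first stream, and map2() and flatMap2() are called for elements of the second stream.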

    // IN1: event type of the first stream
    // IN2: event type of the second stream
    // OUT: event type of the output stream
    CoMapFunction[IN1, IN2, OUT]
        > map1(IN1): OUT
        > map2(IN2): OUT

    CoFlatMapFunction[IN1, IN2, OUT]
        > flatMap1(IN1, Collector[OUT]): Unit
        > flatMap2(IN2, Collector[OUT]): Unit

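    A function cannot choose which stream to read from, and we cannot control the order in which its two methods are called. Whenever an element arrives on one of the streams, the corresponding method is called.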
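    To make the dispatch rule concrete, here is a minimal plain-Java model of a two-input function. It does not use Flink: the TwoInputFn interface, the Gate class, and the run and demo helpers are hypothetical stand-ins that only mirror the shape of CoFlatMapFunction. The order array plays the role of arrival order, which the function itself cannot influence, so the same two inputs can produce different outputs under different interleavings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class CoFlatMapDispatchSketch {

    // Hypothetical stand-in that mirrors the shape of a two-input flatMap function.
    interface TwoInputFn<IN1, IN2, OUT> {
        void flatMap1(IN1 value, Consumer<OUT> out);
        void flatMap2(IN2 value, Consumer<OUT> out);
    }

    // Feeds elements to the function in the given arrival order:
    // 1 means "next element of the first stream", 2 means "next element of the second stream".
    static <IN1, IN2, OUT> List<OUT> run(List<IN1> first, List<IN2> second,
                                         int[] order, TwoInputFn<IN1, IN2, OUT> fn) {
        List<OUT> out = new ArrayList<>();
        Iterator<IN1> it1 = first.iterator();
        Iterator<IN2> it2 = second.iterator();
        for (int which : order) {
            if (which == 1) {
                fn.flatMap1(it1.next(), out::add);
            } else {
                fn.flatMap2(it2.next(), out::add);
            }
        }
        return out;
    }

    // Emits a reading only while the most recent control message was "on".
    static class Gate implements TwoInputFn<Integer, String, String> {
        private boolean open = false;

        @Override
        public void flatMap1(Integer reading, Consumer<String> out) {
            if (open) {
                out.accept("reading=" + reading);
            }
        }

        @Override
        public void flatMap2(String control, Consumer<String> out) {
            open = control.equals("on");
        }
    }

    // Same inputs every time; only the interleaving differs.
    static List<String> demo(int[] order) {
        return run(List.of(1, 2, 3), List.of("on", "off"), order, new Gate());
    }

    public static void main(String[] args) {
        System.out.println(demo(new int[] {1, 2, 1, 2, 1}));
        System.out.println(demo(new int[] {2, 1, 1, 1, 2}));
    }
}
```

    The two calls in main use identical inputs but different arrival orders and yield different results, which is why the routing of the two streams matters.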

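    Processing two streams jointly usually requires that events from both streams are routed deterministically, based on some condition, to the same parallel instance of the operator. By default, connect() establishes no relationship between the events of the two streams, so events from both streams are assigned to downstream operator instances arbitrarily. This produces non-deterministic results, which is clearly not what we want. To obtain deterministic transformations on ConnectedStreams, connect() can be combined with keyBy() or broadcast(). We first look at an example that uses keyBy().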

    scala version

    val one = ...
    val two = ...
    
    val keyedConnect1 = one.connect(two).keyBy(0, 0)
    
    val keyedConnect2 = one.keyBy(0).connect(two.keyBy(0))

    java version

    DataStream<Tuple2<Integer, Long>> one = ...
    DataStream<Tuple2<Integer, String>> two = ...

    // keyBy two connected streams
    ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect1 = one
      .connect(two)
      .keyBy(0, 0); // key both input streams on first attribute

    // alternative: connect two keyed streams
    ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect2 = one
      .keyBy(0)
      .connect(two.keyBy(0));

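    Whether you apply keyBy() to a ConnectedStreams or use connect() to join two KeyedStreams, connect() sends all events with the same key from both streams to the same operator instance. The keys of the two streams must have the same type and value, as in a SQL JOIN. Operators applied to a connected and keyed stream have access to keyed state.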
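    The routing guarantee can be illustrated with a small plain-Java model. It is a sketch under simplifying assumptions: the instanceFor helper is hypothetical and uses a plain hash modulo the parallelism, which is not Flink's actual key-to-instance assignment. It only shows that when both streams are partitioned by the same function on the same key, events with equal keys end up on the same instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedRoutingSketch {

    // Hypothetical partitioner: the same key always maps to the same instance index,
    // regardless of which stream the event came from.
    static int instanceFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups "stream:key" labels by the instance that would receive them.
    static Map<Integer, List<String>> route(List<Integer> keysOfOne,
                                            List<Integer> keysOfTwo,
                                            int parallelism) {
        Map<Integer, List<String>> byInstance = new TreeMap<>();
        for (Integer key : keysOfOne) {
            byInstance.computeIfAbsent(instanceFor(key, parallelism), i -> new ArrayList<>())
                      .add("one:" + key);
        }
        for (Integer key : keysOfTwo) {
            byInstance.computeIfAbsent(instanceFor(key, parallelism), i -> new ArrayList<>())
                      .add("two:" + key);
        }
        return byInstance;
    }

    public static void main(String[] args) {
        // Keys 1 and 3 appear in both streams and are co-located on one instance.
        System.out.println(route(List.of(1, 2, 3), List.of(3, 1), 2));
    }
}
```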

     

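    Example 1:

    For example, we combine the stock price stream from earlier with a stream of media sentiment, grouped by stock symbol.

    As we know, if a DataStream is not grouped by key, its data is distributed arbitrarily across TaskSlots. In most cases, however, we want to analyze and process data per key, so Flink lets us combine connect with keyBy or broadcast.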

    // connect the two streams first, then keyBy
    val keyByConnect1: ConnectedStreams[StockPrice, Media] = stockPriceRawStream
      .connect(mediaStatusStream)
      .keyBy(0,0)
    // keyBy first, then connect
    val keyByConnect2: ConnectedStreams[StockPrice, Media] = stockPriceRawStream.keyBy(0)
      .connect(mediaStatusStream.keyBy(0))

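    Whether we keyBy first or connect first, data with the same key is forwarded to the same downstream operator instance. This is somewhat like a join in SQL. Flink also provides a join operator; join works mainly over time windows, while connect is more general. join will be covered in a later article.

    The following code shows how to combine stock prices with positive and negative media sentiment: when media sentiment is positive and the stock price exceeds a threshold, it emits a positive signal. The full code is on my GitHub: https://github.com/luweizheng/flink-tutorials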

    package com.flink.tutorials.demos.stock
    import java.util.Calendar
    import com.flink.tutorials.demos.stock.StockPriceDemo.{StockPrice, StockPriceSource, StockPriceTimeAssigner}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction
    import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector
    import scala.util.Random
    object StockMediaConnectedDemo {
      def main(args: Array[String]) {
        // Set up the execution environment
        val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // Emit a watermark every 5 seconds
        env.getConfig.setAutoWatermarkInterval(5000L)
        // Stock price stream
        val stockPriceRawStream: DataStream[StockPrice] = env
          // This stream is generated randomly by the StockPriceSource class
          .addSource(new StockPriceSource)
          // Assign timestamps and watermarks
          .assignTimestampsAndWatermarks(new StockPriceTimeAssigner)
        val mediaStatusStream: DataStream[Media] = env
          .addSource(new MediaSource)
        // connect the two streams first, then keyBy
        val keyByConnect1: ConnectedStreams[StockPrice, Media] = stockPriceRawStream
          .connect(mediaStatusStream)
          .keyBy(0,0)
        // keyBy first, then connect
        val keyByConnect2: ConnectedStreams[StockPrice, Media] = stockPriceRawStream.keyBy(0)
          .connect(mediaStatusStream.keyBy(0))
        val alerts1 = keyByConnect1.flatMap(new AlertFlatMap).print()
        val alerts2 = keyByConnect2.flatMap(new AlertFlatMap).print()
        // Run the program
        env.execute("connect stock price with media status")
      }
      /** Media sentiment
        *
        * symbol    stock symbol
        * timestamp timestamp
        * status    sentiment: POSITIVE / NORMAL / NEGATIVE
        */
      case class Media(symbol: String, timestamp: Long, status: String)
      class MediaSource extends RichSourceFunction[Media]{
        var isRunning: Boolean = true
        val rand = new Random()
        var stockId = 0
        override def run(srcCtx: SourceContext[Media]): Unit = {
          while (isRunning) {
            // Pick a stock at random on each iteration
            stockId = rand.nextInt(5)
            var status: String = "NORMAL"
            if (rand.nextGaussian() > 0.9) {
              status = "POSITIVE"
            } else if (rand.nextGaussian() < 0.05) {
              status = "NEGATIVE"
            }
            val curTime = Calendar.getInstance.getTimeInMillis
            srcCtx.collect(Media(stockId.toString, curTime, status))
            Thread.sleep(rand.nextInt(100))
          }
        }
        override def cancel(): Unit = {
          isRunning = false
        }
      }
      case class Alert(symbol: String, timestamp: Long, alert: String)
      class AlertFlatMap extends RichCoFlatMapFunction[StockPrice, Media, Alert] {
        var priceMaxThreshold: List[Double] = List(101.0d, 201.0d, 301.0d, 401.0d, 501.0d)
        var mediaLevel: String = "NORMAL"
        override def flatMap1(stock: StockPrice, collector: Collector[Alert]) : Unit = {
          val stockId = stock.symbol.toInt
          if ("POSITIVE".equals(mediaLevel) && stock.price > priceMaxThreshold(stockId)) {
            collector.collect(Alert(stock.symbol, stock.timestamp, "POSITIVE"))
          }
        }
        override def flatMap2(media: Media, collector: Collector[Alert]): Unit = {
          mediaLevel = media.status
        }
      }
    }
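
    One caveat about AlertFlatMap above: mediaLevel is an ordinary field, so it belongs to the parallel instance of the function rather than to a key. Several stock symbols can be routed to the same instance, in which case a media event for one symbol overwrites the level seen by the others. The plain-Java sketch below models the difference. It does not use Flink; the classes and method names are hypothetical, and the Map merely stands in for keyed state.

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeyStateSketch {

    // Models the function as written: one field shared by every key on this instance.
    static class SharedField {
        private String mediaLevel = "NORMAL";

        void onMedia(String symbol, String status) {
            mediaLevel = status; // the symbol is ignored, as in an instance field
        }

        boolean isPositive(String symbol) {
            return "POSITIVE".equals(mediaLevel);
        }
    }

    // Models per-key state: one value per symbol (the Map stands in for keyed state).
    static class PerKey {
        private final Map<String, String> mediaLevel = new HashMap<>();

        void onMedia(String symbol, String status) {
            mediaLevel.put(symbol, status);
        }

        boolean isPositive(String symbol) {
            return "POSITIVE".equals(mediaLevel.getOrDefault(symbol, "NORMAL"));
        }
    }

    // A POSITIVE media event arrives for eventSymbol; is querySymbol then seen as positive?
    static boolean sharedFieldSeesPositive(String eventSymbol, String querySymbol) {
        SharedField fn = new SharedField();
        fn.onMedia(eventSymbol, "POSITIVE");
        return fn.isPositive(querySymbol);
    }

    static boolean perKeySeesPositive(String eventSymbol, String querySymbol) {
        PerKey fn = new PerKey();
        fn.onMedia(eventSymbol, "POSITIVE");
        return fn.isPositive(querySymbol);
    }

    public static void main(String[] args) {
        // Symbol "1" received no media event, yet the shared field reports it as positive.
        System.out.println("shared field, symbol 1: " + sharedFieldSeesPositive("0", "1"));
        System.out.println("per-key map,  symbol 1: " + perKeySeesPositive("0", "1"));
    }
}
```

    In a real job the per-key variant would typically be expressed with keyed state, which the text above notes is available to operators on a connected and keyed stream.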

    Example 2:

    The following example shows how to connect a DataStream with a broadcast stream.

    scala version

    val one = ...
    val two = ...
    
    val keyedConnect = one.connect(two.broadcast())

    java version

    DataStream<Tuple2<Integer, Long>> one = ...
    DataStream<Tuple2<Integer, String>> two = ...

    // connect streams with broadcast
    ConnectedStreams<Tuple2<Integer, Long>, Tuple2<Integer, String>> keyedConnect = one
      // broadcast second input stream
      .connect(two.broadcast());

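    All elements of a broadcast stream are replicated and sent to every parallel instance of the downstream operator. Elements of the non-broadcast stream are simply forwarded. The elements of both streams can therefore be processed together.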
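
    The effect of broadcast() can be modeled in a few lines of plain Java. This is a sketch, not Flink code: the Instance class, the modulo-based forwarding of the regular stream, and all names are assumptions made for illustration. Every broadcast element is delivered to all parallel instances, so whichever instance later receives a regular element already holds the latest broadcast value.

```java
import java.util.ArrayList;
import java.util.List;

public class BroadcastConnectSketch {

    // One parallel instance of a hypothetical two-input operator.
    static class Instance {
        private String rule = "none"; // latest broadcast value seen by this instance
        final List<String> output = new ArrayList<>();

        void onBroadcast(String newRule) {
            rule = newRule;
        }

        void onElement(int value) {
            output.add(value + "@" + rule);
        }
    }

    // Returns the output of each instance, in instance order.
    static List<List<String>> run(int parallelism) {
        List<Instance> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new Instance());
        }

        // Broadcast side: every instance receives its own copy of the element.
        for (Instance instance : instances) {
            instance.onBroadcast("rule-A");
        }

        // Regular side: each element goes to exactly one instance
        // (assumption for this sketch: value modulo parallelism).
        for (int value = 0; value < 4; value++) {
            instances.get(value % parallelism).onElement(value);
        }

        List<List<String>> outputs = new ArrayList<>();
        for (Instance instance : instances) {
            outputs.add(instance.output);
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(run(2));
    }
}
```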

    Example:

    Alert class:

    scala version

    case class Alert(message: String, timestamp: Long)

    java version

    public class Alert {
    
        public String message;
        public long timestamp;
    
        public Alert() { }
    
        public Alert(String message, long timestamp) {
            this.message = message;
            this.timestamp = timestamp;
        }
    
        public String toString() {
            return "(" + message + ", " + timestamp + ")";
        }
    }

    Smoke sensor reading type:

    public enum SmokeLevel {
        LOW,
        HIGH
    }

    A custom source that produces smoke sensor readings:

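    import java.util.Random;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class SmokeLevelSource implements SourceFunction<SmokeLevel> {

        // volatile because cancel() is called from a different thread than run()
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<SmokeLevel> srcCtx) throws Exception {

            Random rand = new Random();

            while (running) {

                // emit HIGH for a minority of readings, LOW otherwise
                if (rand.nextGaussian() > 0.8) {
                    srcCtx.collect(SmokeLevel.HIGH);
                } else {
                    srcCtx.collect(SmokeLevel.LOW);
                }

                // one reading per second
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            this.running = false;
        }
    }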

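    The application monitors a forest and raises high-risk fire alerts. It receives two streams, one with temperature sensor readings and one with smoke sensor readings, and raises an alert when both exceed their respective thresholds.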

    scala version

    object MultiStreamTransformations {
        def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment
            val tempReadings = env.addSource(new SensorSource)
            val smokeReadings = env
                    .addSource(new SmokeLevelSource)
                    .setParallelism(1)
            val keyedTempReadings = tempReadings
                    .keyBy(r => r.id)
            val alerts = keyedTempReadings
                    .connect(smokeReadings.broadcast())
                    .flatMap(new RaiseAlertFlatMap)

            alerts.print()

            env.execute("Multi-Stream Transformations Example")
        }

        class RaiseAlertFlatMap extends CoFlatMapFunction[SensorReading, SmokeLevel, Alert] {
            // most recent smoke level received from the broadcast stream
            private var smokeLevel: SmokeLevel = SmokeLevel.LOW

            override def flatMap1(tempReading: SensorReading, out: Collector[Alert]): Unit = {
                // high chance of fire => emit an alert
                if (smokeLevel == SmokeLevel.HIGH && tempReading.temperature > 100) {
                    out.collect(Alert("Risk of fire! " + tempReading, tempReading.timestamp))
                }
            }

            override def flatMap2(sl: SmokeLevel, out: Collector[Alert]): Unit = {
                // update smoke level
                smokeLevel = sl
            }
        }
    }

    java version

    public class MultiStreamTransformations {
    
        public static void main(String[] args) throws Exception {
    
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    
            DataStream<SensorReading> tempReadings = env
                    .addSource(new SensorSource());
    
            DataStream<SmokeLevel> smokeReadings = env
                    .addSource(new SmokeLevelSource())
                    .setParallelism(1);
    
            KeyedStream<SensorReading, String> keyedTempReadings = tempReadings
                    .keyBy(r -> r.id);
    
            DataStream<Alert> alerts = keyedTempReadings
                    .connect(smokeReadings.broadcast())
                    .flatMap(new RaiseAlertFlatMap());
    
            alerts.print();
    
            env.execute("Multi-Stream Transformations Example");
        }
    
        public static class RaiseAlertFlatMap implements CoFlatMapFunction<SensorReading, SmokeLevel, Alert> {
    
            private SmokeLevel smokeLevel = SmokeLevel.LOW;
    
            @Override
            public void flatMap1(SensorReading tempReading, Collector<Alert> out) throws Exception {
                // high chance of fire => true
                if (this.smokeLevel == SmokeLevel.HIGH && tempReading.temperature > 100) {
                    out.collect(new Alert("Risk of fire! " + tempReading, tempReading.timestamp));
                }
            }
    
            @Override
            public void flatMap2(SmokeLevel smokeLevel, Collector<Alert> out) {
                // update smoke level
                this.smokeLevel = smokeLevel;
            }
        }
    }

    Reference link

    https://cloud.tencent.com/developer/article/1560680

  • Original article: https://www.cnblogs.com/qiu-hua/p/13796600.html