• [Source Code] The Transformation Process of the Join Operator


    While looking at the Transformation process of the Join operator, I noticed that it uses union and coGroup, which is rather special, so let's walk through it carefully.

    Join demo code. Two streams can only be joined within a window: when joining unbounded data, a window must be specified to turn the unbounded data into bounded data. Flink state caches a portion of both streams' data to perform the join, and clears data that falls outside the join window (by time, count, or the assigned window), so the state size stays within a reasonable range instead of growing until it exceeds its limits and fails.

    val join = process.join(map)
          .where(str => str)
          .equalTo(str => str)
          .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
          .apply(new JoinFunction[String, String, String] {
            override def join(first: String, second: String): String = {
              first + ";" + second
            }
          })

    First, input1.join(input2) creates a JoinedStreams from the two streams; input1 and input2 are the left and right streams respectively.

    DataStream.scala

    def join[T2](otherStream: DataStream[T2]): JoinedStreams[T, T2] = {
        new JoinedStreams(this, otherStream)
      }

    where, equalTo, and window do not contain much, so we skip them.

    The apply method is the key part of the join operator.

    First, the JoinedStreams' input1 and input2 are used to create a JavaJoinedStreams (a very literal name; as noted in the previous post, Flink's functionality is implemented in Java, and the Scala API is essentially a shell that delegates to the Java implementation).

    Then the corresponding Join calls are made on the JavaJoinedStreams object; trigger, evictor, and allowedLateness are null if they were not specified.

    The result is passed to asScalaStream, which converts the Java DataStream into a Scala DataStream for subsequent use.

    JoinedStreams.scala

    def apply[T: TypeInformation](function: JoinFunction[T1, T2, T]): DataStream[T] = {
          // create a JavaJoinedStreams
              val join = new JavaJoinedStreams[T1, T2](input1.javaStream, input2.javaStream)
    
              asScalaStream(join
                .where(keySelector1)
                .equalTo(keySelector2)
                .window(windowAssigner)
                .trigger(trigger)
                .evictor(evictor)
                .allowedLateness(allowedLateness)
                // apply join
                .apply(clean(function), implicitly[TypeInformation[T]]))
            }

    Now look at the apply method of the Java JoinedStreams. It converts the JoinedStreams into a CoGroupedStreams to implement the join operator: input, where, equalTo, and the rest are carried over directly, and finally the apply method of CoGroupedStreams is called.

    JoinedStreams.java

    public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function, TypeInformation<T> resultType) {
          //clean the closure
          function = input1.getExecutionEnvironment().clean(function);
      // join becomes coGroup; input1 and input2 are unchanged
          coGroupedWindowedStream = input1.coGroup(input2)
            .where(keySelector1)
            .equalTo(keySelector2)
            .window(windowAssigner)
            .trigger(trigger)
            .evictor(evictor)
            .allowedLateness(allowedLateness);
      // delegate to coGroupedWindowedStream's apply
          return coGroupedWindowedStream
              .apply(new JoinCoGroupFunction<>(function), resultType);
        }
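    Note the JoinCoGroupFunction passed in above: it is the adapter that turns coGroup semantics back into join semantics, pairing every element of the first input with every element of the second within a window (an inner join). A minimal, self-contained sketch of that nested-loop pairing (plain Java, not Flink's actual class):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class JoinCoGroupSketch {
    // Pairs every element of `first` with every element of `second`,
    // mimicking how JoinCoGroupFunction adapts a JoinFunction to coGroup.
    static <A, B, O> List<O> coGroupAsJoin(
            Iterable<A> first, Iterable<B> second, BiFunction<A, B, O> join) {
        List<O> out = new ArrayList<>();
        for (A a : first) {                // outer loop over the left side
            for (B b : second) {           // inner loop over the right side
                out.add(join.apply(a, b)); // emit one joined result per pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = List.of("a", "b");
        List<String> right = List.of("1", "2");
        System.out.println(coGroupAsJoin(left, right, (l, r) -> l + ";" + r));
    }
}
```

    This cartesian pairing is also why a window join emits nothing for a key that appears on only one side: one of the two loops is empty.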

    The apply method of CoGroupedStreams is where it gets interesting. It converts input1 and input2 into DataStreams of type TaggedUnion<T1, T2> by calling map(new Input1Tagger<T1, T2>()) and map(new Input2Tagger<T1, T2>()), so the two streams end up with the same type; when a record is emitted, only the side it came from holds a value, and the other side is simply null.

    It also combines the two streams' KeySelectors into a unionKeySelector.

    The unioned stream is used to create a KeyedStream, passing in the unionKeySelector and specifying the PartitionTransformation for partitioning, and the window is then created.

    Finally, the windowedStream.apply method is called.

    CoGroupedStreams.java

    public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {
          //clean the closure
          function = input1.getExecutionEnvironment().clean(function);
    
      // define the union's UnionTypeInfo, combining the two input types
          UnionTypeInfo<T1, T2> unionType = new UnionTypeInfo<>(input1.getType(), input2.getType());
      // define the union's KeySelector, wrapping the two KeySelectors
          UnionKeySelector<T1, T2, KEY> unionKeySelector = new UnionKeySelector<>(keySelector1, keySelector2);
    
      // turn input1 into a DataStream<TaggedUnion<T1, T2>> with return type unionType
          DataStream<TaggedUnion<T1, T2>> taggedInput1 = input1
              .map(new Input1Tagger<T1, T2>())
              .setParallelism(input1.getParallelism())
              .returns(unionType);
      // turn input2 into a DataStream<TaggedUnion<T1, T2>> with return type unionType
          DataStream<TaggedUnion<T1, T2>> taggedInput2 = input2
              .map(new Input2Tagger<T1, T2>())
              .setParallelism(input2.getParallelism())
              .returns(unionType);
      // union the two streams; both now have the same type: DataStream<TaggedUnion<T1, T2>>
          DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);
    
          // we explicitly create the keyed stream to manually pass the key type information in
      // create the KeyedStream from the unioned stream, specifying the PartitionTransformation for partitioning
      // then call window to produce the windowedStream
          windowedStream =
              new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
              .window(windowAssigner);
    
          if (trigger != null) {
            windowedStream.trigger(trigger);
          }
          if (evictor != null) {
            windowedStream.evictor(evictor);
          }
          if (allowedLateness != null) {
            windowedStream.allowedLateness(allowedLateness);
          }
      // call windowedStream's apply with a CoGroupWindowFunction
      return windowedStream.apply(new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType);
    }
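    The CoGroupWindowFunction used above does the reverse of the taggers: when a window fires, it walks the window's unioned elements and splits them back into two groups by tag before handing them to the user's CoGroupFunction. A self-contained sketch of that splitting step (the Tagged class here is a hypothetical stand-in for Flink's TaggedUnion):

```java
import java.util.ArrayList;
import java.util.List;

public class CoGroupSplitSketch {
    // Minimal stand-in for TaggedUnion: exactly one side is non-null.
    static class Tagged<A, B> {
        final A one; final B two;
        Tagged(A one, B two) { this.one = one; this.two = two; }
    }

    // Splits a window's tagged elements back into the two original inputs,
    // mimicking what CoGroupWindowFunction does before calling coGroup.
    static <A, B> List<List<Object>> split(Iterable<Tagged<A, B>> values) {
        List<Object> oneValues = new ArrayList<>();
        List<Object> twoValues = new ArrayList<>();
        for (Tagged<A, B> v : values) {
            if (v.one != null) oneValues.add(v.one); else twoValues.add(v.two);
        }
        return List.of(oneValues, twoValues);
    }

    public static void main(String[] args) {
        List<Tagged<String, Integer>> window = List.of(
            new Tagged<>("a", null), new Tagged<>(null, 1), new Tagged<>("b", null));
        System.out.println(split(window)); // left and right elements, separated by tag
    }
}
```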

    The map methods of Input1Tagger / Input2Tagger:

    private static class Input1Tagger<T1, T2> implements MapFunction<T1, TaggedUnion<T1, T2>> {
        private static final long serialVersionUID = 1L;
    
        @Override
        public TaggedUnion<T1, T2> map(T1 value) throws Exception {
          return TaggedUnion.one(value);
        }
      }
    
      private static class Input2Tagger<T1, T2> implements MapFunction<T2, TaggedUnion<T1, T2>> {
        private static final long serialVersionUID = 1L;
    
        @Override
        public TaggedUnion<T1, T2> map(T2 value) throws Exception {
          return TaggedUnion.two(value);
        }
      }

    The one/two methods of TaggedUnion:

    public static <T1, T2> TaggedUnion<T1, T2> one(T1 one) {
      // one: the right side is null
          return new TaggedUnion<>(one, null);
        }
    
        public static <T1, T2> TaggedUnion<T1, T2> two(T2 two) {
      // two: the left side is null
          return new TaggedUnion<>(null, two);
        }
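    Combining one/two with the UnionKeySelector mentioned earlier: the selector checks which side of the TaggedUnion is populated and dispatches to the corresponding original KeySelector, so records from both inputs with the same key land in the same partition and window. A simplified, self-contained sketch (using plain java.util.function.Function in place of Flink's KeySelector interface):

```java
import java.util.function.Function;

public class UnionKeySelectorSketch {
    // Minimal TaggedUnion: one/two build a value with the other side null.
    static class TaggedUnion<T1, T2> {
        final T1 one; final T2 two;
        private TaggedUnion(T1 one, T2 two) { this.one = one; this.two = two; }
        static <T1, T2> TaggedUnion<T1, T2> one(T1 v) { return new TaggedUnion<>(v, null); }
        static <T1, T2> TaggedUnion<T1, T2> two(T2 v) { return new TaggedUnion<>(null, v); }
        boolean isOne() { return one != null; }
    }

    // Dispatches to keySelector1 or keySelector2 depending on which input
    // the record came from, like Flink's UnionKeySelector.
    static <T1, T2, K> K getKey(TaggedUnion<T1, T2> value,
                                Function<T1, K> keySelector1,
                                Function<T2, K> keySelector2) {
        return value.isOne() ? keySelector1.apply(value.one) : keySelector2.apply(value.two);
    }

    public static void main(String[] args) {
        TaggedUnion<String, Integer> left = TaggedUnion.one("abc");
        TaggedUnion<String, Integer> right = TaggedUnion.two(12345);
        System.out.println(getKey(left, String::toUpperCase, Object::toString));  // key from side 1
        System.out.println(getKey(right, String::toUpperCase, Object::toString)); // key from side 2
    }
}
```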

    The KeyedStream constructor:

    public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
        this(
          dataStream,
          new PartitionTransformation<>(
            dataStream.getTransformation(),
            new KeyGroupStreamPartitioner<>(keySelector, StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM)),
          keySelector,
          keyType);
      }
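    The KeyGroupStreamPartitioner created here is what routes each record to a parallel subtask: the key is hashed into one of maxParallelism key groups (the default lower bound is 128, per StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM), and each key group maps onto one subtask. A hedged sketch of that assignment logic; Flink's real implementation additionally murmur-hashes key.hashCode(), which is omitted here for simplicity:

```java
public class KeyGroupSketch {
    // Assigns a key to a key group. Flink murmur-hashes key.hashCode()
    // before taking the modulo; this sketch uses the raw hashCode.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Maps a key group onto an operator subtask: key groups are ranged
    // evenly over the operator's parallel instances.
    static int subtaskFor(int keyGroupId, int maxParallelism, int parallelism) {
        return keyGroupId * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // default lower bound for max parallelism
        int parallelism = 4;
        int group = keyGroupFor("hello", maxParallelism);
        System.out.println("keyGroup=" + group
            + " subtask=" + subtaskFor(group, maxParallelism, parallelism));
    }
}
```

    Because both tagged streams go through the same unionKeySelector, equal keys from either input hash to the same key group and thus the same subtask, which is what makes the join correct.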


    The windowedStream.apply method:

    public <R> SingleOutputStreamOperator<R> apply(WindowFunction<T, R, K, W> function, TypeInformation<R> resultType) {
        function = input.getExecutionEnvironment().clean(function);
        return apply(new InternalIterableWindowFunction<>(function), resultType, function);
      }

    apply again. At this point we see the familiar pattern: first the operator is created, then the Transformation.

    private <R> SingleOutputStreamOperator<R> apply(InternalWindowFunction<Iterable<T>, R, K, W> function, TypeInformation<R> resultType, Function originalFunction) {
    
        // operatorName
        final String opName = generateOperatorName(windowAssigner, trigger, evictor, originalFunction, null);
    // the keySelector here is the UnionKeySelector from before
        KeySelector<T, K> keySel = input.getKeySelector();
    
        WindowOperator<K, T, Iterable<T>, R, W> operator;
    
        if (evictor != null) {
          @SuppressWarnings({"unchecked", "rawtypes"})
          TypeSerializer<StreamRecord<T>> streamRecordSerializer =
              (TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(input.getType().createSerializer(getExecutionEnvironment().getConfig()));
    
          ListStateDescriptor<StreamRecord<T>> stateDesc =
              new ListStateDescriptor<>("window-contents", streamRecordSerializer);
    
          operator =
            new EvictingWindowOperator<>(windowAssigner,
              windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
              keySel,
              input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
              stateDesc,
              function,
              trigger,
              evictor,
              allowedLateness,
              lateDataOutputTag);
    
        } else {
          ListStateDescriptor<T> stateDesc = new ListStateDescriptor<>("window-contents",
            input.getType().createSerializer(getExecutionEnvironment().getConfig()));
      // create the window's WindowOperator
          operator =
            new WindowOperator<>(windowAssigner,
              windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
              keySel,
              input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
              stateDesc,
              function,
              trigger,
              allowedLateness,
              lateDataOutputTag);
        }
    // call transform to generate the Transformation
        return input.transform(opName, resultType, operator);
      }

    In the join operator's doTransform step, a OneInputTransformation is generated first, and then a SingleOutputStreamOperator is created from it and returned, so the end result is a SingleOutputStreamOperator DataStream.

    @PublicEvolving
      public <R> SingleOutputStreamOperator<R> transform(
          String operatorName,
          TypeInformation<R> outTypeInfo,
          OneInputStreamOperatorFactory<T, R> operatorFactory) {
    
        return doTransform(operatorName, outTypeInfo, operatorFactory);
      }
    
      protected <R> SingleOutputStreamOperator<R> doTransform(
          String operatorName,
          TypeInformation<R> outTypeInfo,
          StreamOperatorFactory<R> operatorFactory) {
    
        // read the output type of the input Transform to coax out errors about MissingTypeInfo
        transformation.getOutputType();
    
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation,
            operatorName,
            operatorFactory,
            outTypeInfo,
            environment.getParallelism());
    
        @SuppressWarnings({"unchecked", "rawtypes"})
        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
    
        getExecutionEnvironment().addOperator(resultTransform);
    
        return returnStream;
      }

    A quick summary of the intermediate conversions:

    * 1 The join first becomes a Scala JoinedStreams, then a Java JoinedStreams
    * 2 The Java JoinedStreams is converted into a CoGroupedStreams
    * 3 The CoGroupedStreams is converted into a union (UnionTransformation)
    * 4 which becomes a KeyedStream (PartitionTransformation)
    * 5 and finally a SingleOutputStreamOperator (OneInputTransformation)

    It looks somewhat involved, but overall it is much the same as what we have seen before.

    Feel free to follow the Flink菜鸟 WeChat official account, which occasionally publishes posts on Flink development.

  • Original article: https://www.cnblogs.com/Springmoon-venn/p/13931838.html