• Flink Runtime Architecture


     1. System Architecture

    The Flink runtime consists of two major components: the JobManager (job manager) and the TaskManager (task manager).

    JobManager: the actual manager (master), responsible for management and scheduling. Without high availability there is only one.

    TaskManager: can be understood as the worker (slave). There can be one or more.

    The system during job submission and task processing looks like this:

      The client is not part of the processing system; it is only responsible for submitting the job. It calls the program's main method, converts the code into a dataflow graph (Dataflow Graph), finally generates the job graph (JobGraph), and sends it to the JobManager. It can also fetch the job's execution status and results from the JobManager. Once a TaskManager starts, the JobManager establishes a connection to it, converts the job graph (JobGraph) into an executable execution graph (ExecutionGraph), and distributes it to the available TaskManagers.

    1. JobManager

        The task management and scheduling center of a cluster. The JobManager itself contains three components:

    1. JobMaster

      Responsible for handling a single job. The JobMaster receives the application to execute, including the JAR package, the dataflow graph, and the job graph (JobGraph). The JobMaster converts the JobGraph into a physical-level dataflow graph called the execution graph (ExecutionGraph), which contains all tasks that can be executed concurrently. The JobMaster then asks the resource manager (ResourceManager) for the resources needed to execute the job; once enough resources are acquired, it distributes the execution graph to the TaskManagers that actually run it.

      During execution, the JobMaster is responsible for all operations that require central coordination, such as coordinating checkpoints.
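
      For example, checkpointing is driven centrally by the JobMaster's checkpoint coordinator, but it is switched on from user code. A minimal sketch (the class name and the 10-second interval here are just illustrative):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Take a checkpoint every 10 seconds; at runtime the JobMaster's
            // checkpoint coordinator triggers and tracks these checkpoints.
            env.enableCheckpointing(10_000L);
        }
    }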

    2. ResourceManager

      Mainly responsible for allocating and managing resources. Here "resources" chiefly means the TaskManagers' task slots. A task slot is the unit of resource scheduling in a Flink cluster; it contains a set of CPU and memory resources the machine uses for computation. Every task must be assigned to a slot for execution.

    3. Dispatcher

      Mainly provides a REST interface for submitting applications, and starts a new JobMaster component for each newly submitted job. The Dispatcher also starts a Web UI for conveniently displaying and monitoring information about job execution.
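
      As an illustration, the REST endpoints served here can also be queried for job status. A minimal sketch using Java's built-in HTTP client, assuming a cluster whose REST port is the default 8081 on localhost (/jobs/overview is part of Flink's monitoring REST API):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestOverviewExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Query the Dispatcher's REST API for an overview of all jobs
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8081/jobs/overview"))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON listing each job and its state
        }
    }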

    2. TaskManager

      The worker process in Flink, also called a worker. A cluster contains one or more TaskManagers, and each TaskManager holds a certain number of task slots. The slot count limits how many tasks a TaskManager can process in parallel.

      After starting, the TaskManager registers its slots with the ResourceManager; upon instruction from the ResourceManager, it offers one or more slots to the JobMaster, which then assigns tasks to them.

      During execution, a TaskManager can buffer data and exchange data with other TaskManagers running the same application.

    2. Job Submission Flow

       It can be represented by the figure below.

    (1) The client app submits the job to the JobManager through the REST interface provided by the Dispatcher.

    (2) The Dispatcher starts a JobMaster and hands the job (including the JobGraph) over to it.

    (3) The JobMaster parses the JobGraph into an ExecutionGraph, works out how many resources are needed, and requests those resources (slots) from the ResourceManager.

    (4) The ResourceManager coordinates the resources.

    (5) After starting, each TaskManager registers its available slots with the ResourceManager.

    (6) The ResourceManager instructs the TaskManagers to offer slots for the new job.

    (7) The TaskManagers connect to the corresponding JobMaster and offer their slots.

    (8) The JobMaster distributes the tasks to be executed to the TaskManagers.

    (9) The TaskManagers execute the tasks and can exchange data with each other.

    3. Key Concepts

      These core concepts answer the following questions:

    1) How does a Flink program turn into tasks?

    2) How many tasks does a stream processing program actually contain?

    3) How many slots are needed to finally execute those tasks?

    1. Dataflow Graph

      Flink is a streaming computation framework, and a program's structure is really a definition of a chain of operations; each incoming record flows through every step in turn. Each operation is called an "operator", so you can think of the program as a pipeline built from a string of operators, with the data flowing through it in order like water.

      All programs consist of three parts: Source (source operators, which read data), Transformation (transformation operators, which process data), and Sink (sink operators, which output data).

      At runtime, a Flink program is mapped to a graph in which all operators are connected in logical order; this graph is called the logical dataflow, or dataflow graph. A dataflow graph can be an arbitrary directed acyclic graph (DAG). Each dataflow in the graph starts with one or more sources and ends with one or more sinks.

      In code, apart from sources and sinks, an API call whose return type is SingleOutputStreamOperator counts as an operator; otherwise it does not (it is only an intermediate transformation). For example, keyBy returns a KeyedStream and is therefore not an operator, whereas org.apache.flink.streaming.api.datastream.KeyedStream#sum(int) is one.
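
      A small sketch of this distinction in code (host and port are placeholders):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class OperatorVsTransformation {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // map returns SingleOutputStreamOperator, so it is an operator
            SingleOutputStreamOperator<Tuple2<String, Long>> mapped = env
                    .socketTextStream("localhost", 7777)
                    .map(word -> Tuple2.of(word, 1L))
                    .returns(Types.TUPLE(Types.STRING, Types.LONG));
            // keyBy returns KeyedStream: only an intermediate transformation, not an operator
            KeyedStream<Tuple2<String, Long>, String> keyed = mapped.keyBy(t -> t.f0);
            // sum returns SingleOutputStreamOperator again, so it is an operator
            SingleOutputStreamOperator<Tuple2<String, Long>> summed = keyed.sum(1);
            summed.print();
            env.execute("operator-vs-transformation");
        }
    }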

      Common operators:

    source: reading a text file, a socket, custom input, etc.

    transformation: processing operations such as flatMap, map, filter, and process; aggregations such as sum, max, maxBy, min, and minBy are operators too (all of them named "Keyed Aggregation")

    sink: print, printToErr, writeAsText, writeAsCsv, etc.

    The source of org.apache.flink.streaming.api.datastream.DataStream shows that each operator carries a specific name:

    //
    // Source code recreated from a .class file by IntelliJ IDEA
    // (powered by FernFlower decompiler)
    //
    
    package org.apache.flink.streaming.api.datastream;
    
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import org.apache.flink.annotation.Experimental;
    import org.apache.flink.annotation.Internal;
    import org.apache.flink.annotation.Public;
    import org.apache.flink.annotation.PublicEvolving;
    import org.apache.flink.api.common.ExecutionConfig;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.common.io.OutputFormat;
    import org.apache.flink.api.common.operators.Keys;
    import org.apache.flink.api.common.operators.ResourceSpec;
    import org.apache.flink.api.common.operators.Keys.ExpressionKeys;
    import org.apache.flink.api.common.serialization.SerializationSchema;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.common.typeutils.TypeSerializer;
    import org.apache.flink.api.connector.sink.Sink;
    import org.apache.flink.api.dag.Transformation;
    import org.apache.flink.api.java.Utils;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.io.CsvOutputFormat;
    import org.apache.flink.api.java.io.TextOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.typeutils.InputTypeConfigurable;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.core.execution.JobClient;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.core.fs.FileSystem.WriteMode;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction;
    import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.api.functions.sink.SocketClientSink;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperatorFactory;
    import org.apache.flink.streaming.api.operators.ProcessOperator;
    import org.apache.flink.streaming.api.operators.SimpleOperatorFactory;
    import org.apache.flink.streaming.api.operators.StreamFilter;
    import org.apache.flink.streaming.api.operators.StreamFlatMap;
    import org.apache.flink.streaming.api.operators.StreamMap;
    import org.apache.flink.streaming.api.operators.StreamOperatorFactory;
    import org.apache.flink.streaming.api.operators.StreamSink;
    import org.apache.flink.streaming.api.operators.collect.ClientAndIterator;
    import org.apache.flink.streaming.api.operators.collect.CollectResultIterator;
    import org.apache.flink.streaming.api.operators.collect.CollectSinkOperator;
    import org.apache.flink.streaming.api.operators.collect.CollectSinkOperatorFactory;
    import org.apache.flink.streaming.api.operators.collect.CollectStreamSink;
    import org.apache.flink.streaming.api.transformations.OneInputTransformation;
    import org.apache.flink.streaming.api.transformations.PartitionTransformation;
    import org.apache.flink.streaming.api.transformations.TimestampsAndWatermarksTransformation;
    import org.apache.flink.streaming.api.transformations.UnionTransformation;
    import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
    import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
    import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.streaming.api.windowing.windows.Window;
    import org.apache.flink.streaming.runtime.operators.util.AssignerWithPeriodicWatermarksAdapter.Strategy;
    import org.apache.flink.streaming.runtime.partitioner.BroadcastPartitioner;
    import org.apache.flink.streaming.runtime.partitioner.CustomPartitionerWrapper;
    import org.apache.flink.streaming.runtime.partitioner.ForwardPartitioner;
    import org.apache.flink.streaming.runtime.partitioner.GlobalPartitioner;
    import org.apache.flink.streaming.runtime.partitioner.RebalancePartitioner;
    import org.apache.flink.streaming.runtime.partitioner.RescalePartitioner;
    import org.apache.flink.streaming.runtime.partitioner.ShufflePartitioner;
    import org.apache.flink.streaming.runtime.partitioner.StreamPartitioner;
    import org.apache.flink.streaming.util.keys.KeySelectorUtil;
    import org.apache.flink.util.CloseableIterator;
    import org.apache.flink.util.Preconditions;
    
    @Public
    public class DataStream<T> {
        protected final StreamExecutionEnvironment environment;
        protected final Transformation<T> transformation;
    
        public DataStream(StreamExecutionEnvironment environment, Transformation<T> transformation) {
            this.environment = (StreamExecutionEnvironment)Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
            this.transformation = (Transformation)Preconditions.checkNotNull(transformation, "Stream Transformation must not be null.");
        }
    
        @Internal
        public int getId() {
            return this.transformation.getId();
        }
    
        public int getParallelism() {
            return this.transformation.getParallelism();
        }
    
        @PublicEvolving
        public ResourceSpec getMinResources() {
            return this.transformation.getMinResources();
        }
    
        @PublicEvolving
        public ResourceSpec getPreferredResources() {
            return this.transformation.getPreferredResources();
        }
    
        public TypeInformation<T> getType() {
            return this.transformation.getOutputType();
        }
    
        protected <F> F clean(F f) {
            return this.getExecutionEnvironment().clean(f);
        }
    
        public StreamExecutionEnvironment getExecutionEnvironment() {
            return this.environment;
        }
    
        public ExecutionConfig getExecutionConfig() {
            return this.environment.getConfig();
        }
    
        @SafeVarargs
        public final DataStream<T> union(DataStream<T>... streams) {
            List<Transformation<T>> unionedTransforms = new ArrayList();
            unionedTransforms.add(this.transformation);
            DataStream[] var3 = streams;
            int var4 = streams.length;
    
            for(int var5 = 0; var5 < var4; ++var5) {
                DataStream<T> newStream = var3[var5];
                if (!this.getType().equals(newStream.getType())) {
                    throw new IllegalArgumentException("Cannot union streams of different types: " + this.getType() + " and " + newStream.getType());
                }
    
                unionedTransforms.add(newStream.getTransformation());
            }
    
            return new DataStream(this.environment, new UnionTransformation(unionedTransforms));
        }
    
        public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream) {
            return new ConnectedStreams(this.environment, this, dataStream);
        }
    
        @PublicEvolving
        public <R> BroadcastConnectedStream<T, R> connect(BroadcastStream<R> broadcastStream) {
            return new BroadcastConnectedStream(this.environment, this, (BroadcastStream)Preconditions.checkNotNull(broadcastStream), broadcastStream.getBroadcastStateDescriptors());
        }
    
        public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
            Preconditions.checkNotNull(key);
            return new KeyedStream(this, (KeySelector)this.clean(key));
        }
    
        public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key, TypeInformation<K> keyType) {
            Preconditions.checkNotNull(key);
            Preconditions.checkNotNull(keyType);
            return new KeyedStream(this, (KeySelector)this.clean(key), keyType);
        }
    
        /** @deprecated */
        @Deprecated
        public KeyedStream<T, Tuple> keyBy(int... fields) {
            return !(this.getType() instanceof BasicArrayTypeInfo) && !(this.getType() instanceof PrimitiveArrayTypeInfo) ? this.keyBy((Keys)(new ExpressionKeys(fields, this.getType()))) : this.keyBy((KeySelector)KeySelectorUtil.getSelectorForArray(fields, this.getType()));
        }
    
        /** @deprecated */
        @Deprecated
        public KeyedStream<T, Tuple> keyBy(String... fields) {
            return this.keyBy((Keys)(new ExpressionKeys(fields, this.getType())));
        }
    
        private KeyedStream<T, Tuple> keyBy(Keys<T> keys) {
            return new KeyedStream(this, (KeySelector)this.clean(KeySelectorUtil.getSelectorForKeys(keys, this.getType(), this.getExecutionConfig())));
        }
    
        /** @deprecated */
        @Deprecated
        public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field) {
            ExpressionKeys<T> outExpressionKeys = new ExpressionKeys(new int[]{field}, this.getType());
            return this.partitionCustom(partitioner, (Keys)outExpressionKeys);
        }
    
        /** @deprecated */
        @Deprecated
        public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, String field) {
            ExpressionKeys<T> outExpressionKeys = new ExpressionKeys(new String[]{field}, this.getType());
            return this.partitionCustom(partitioner, (Keys)outExpressionKeys);
        }
    
        public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector) {
            return this.setConnectionType(new CustomPartitionerWrapper((Partitioner)this.clean(partitioner), (KeySelector)this.clean(keySelector)));
        }
    
        private <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, Keys<T> keys) {
            KeySelector<T, K> keySelector = KeySelectorUtil.getSelectorForOneKey(keys, partitioner, this.getType(), this.getExecutionConfig());
            return this.setConnectionType(new CustomPartitionerWrapper((Partitioner)this.clean(partitioner), (KeySelector)this.clean(keySelector)));
        }
    
        public DataStream<T> broadcast() {
            return this.setConnectionType(new BroadcastPartitioner());
        }
    
        @PublicEvolving
        public BroadcastStream<T> broadcast(MapStateDescriptor<?, ?>... broadcastStateDescriptors) {
            Preconditions.checkNotNull(broadcastStateDescriptors);
            DataStream<T> broadcastStream = this.setConnectionType(new BroadcastPartitioner());
            return new BroadcastStream(this.environment, broadcastStream, broadcastStateDescriptors);
        }
    
        @PublicEvolving
        public DataStream<T> shuffle() {
            return this.setConnectionType(new ShufflePartitioner());
        }
    
        public DataStream<T> forward() {
            return this.setConnectionType(new ForwardPartitioner());
        }
    
        public DataStream<T> rebalance() {
            return this.setConnectionType(new RebalancePartitioner());
        }
    
        @PublicEvolving
        public DataStream<T> rescale() {
            return this.setConnectionType(new RescalePartitioner());
        }
    
        @PublicEvolving
        public DataStream<T> global() {
            return this.setConnectionType(new GlobalPartitioner());
        }
    
        @PublicEvolving
        public IterativeStream<T> iterate() {
            return new IterativeStream(this, 0L);
        }
    
        @PublicEvolving
        public IterativeStream<T> iterate(long maxWaitTimeMillis) {
            return new IterativeStream(this, maxWaitTimeMillis);
        }
    
        public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {
            TypeInformation<R> outType = TypeExtractor.getMapReturnTypes((MapFunction)this.clean(mapper), this.getType(), Utils.getCallLocationName(), true);
            return this.map(mapper, outType);
        }
    
        public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
            return this.transform("Map", outputType, (OneInputStreamOperator)(new StreamMap((MapFunction)this.clean(mapper))));
        }
    
        public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
            TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes((FlatMapFunction)this.clean(flatMapper), this.getType(), Utils.getCallLocationName(), true);
            return this.flatMap(flatMapper, outType);
        }
    
        public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> outputType) {
            return this.transform("Flat Map", outputType, (OneInputStreamOperator)(new StreamFlatMap((FlatMapFunction)this.clean(flatMapper))));
        }
    
        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction) {
            TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(processFunction, ProcessFunction.class, 0, 1, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
            return this.process(processFunction, outType);
        }
    
        @Internal
        public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
            ProcessOperator<T, R> operator = new ProcessOperator((ProcessFunction)this.clean(processFunction));
            return this.transform("Process", outputType, (OneInputStreamOperator)operator);
        }
    
        public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
            return this.transform("Filter", this.getType(), (OneInputStreamOperator)(new StreamFilter((FilterFunction)this.clean(filter))));
        }
    
        @PublicEvolving
        public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes) {
            return (new StreamProjection(this, fieldIndexes)).projectTupleX();
        }
    
        public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
            return new CoGroupedStreams(this, otherStream);
        }
    
        public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
            return new JoinedStreams(this, otherStream);
        }
    
        /** @deprecated */
        @Deprecated
        public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size) {
            return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.windowAll(TumblingProcessingTimeWindows.of(size)) : this.windowAll(TumblingEventTimeWindows.of(size));
        }
    
        /** @deprecated */
        @Deprecated
        public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size, Time slide) {
            return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.windowAll(SlidingProcessingTimeWindows.of(size, slide)) : this.windowAll(SlidingEventTimeWindows.of(size, slide));
        }
    
        public AllWindowedStream<T, GlobalWindow> countWindowAll(long size) {
            return this.windowAll(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
        }
    
        public AllWindowedStream<T, GlobalWindow> countWindowAll(long size, long slide) {
            return this.windowAll(GlobalWindows.create()).evictor(CountEvictor.of(size)).trigger(CountTrigger.of(slide));
        }
    
        @PublicEvolving
        public <W extends Window> AllWindowedStream<T, W> windowAll(WindowAssigner<? super T, W> assigner) {
            return new AllWindowedStream(this, assigner);
        }
    
        public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(WatermarkStrategy<T> watermarkStrategy) {
            WatermarkStrategy<T> cleanedStrategy = (WatermarkStrategy)this.clean(watermarkStrategy);
            int inputParallelism = this.getTransformation().getParallelism();
            TimestampsAndWatermarksTransformation<T> transformation = new TimestampsAndWatermarksTransformation("Timestamps/Watermarks", inputParallelism, this.getTransformation(), cleanedStrategy);
            this.getExecutionEnvironment().addOperator(transformation);
            return new SingleOutputStreamOperator(this.getExecutionEnvironment(), transformation);
        }
    
        /** @deprecated */
        @Deprecated
        public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks<T> timestampAndWatermarkAssigner) {
            AssignerWithPeriodicWatermarks<T> cleanedAssigner = (AssignerWithPeriodicWatermarks)this.clean(timestampAndWatermarkAssigner);
            WatermarkStrategy<T> wms = new Strategy(cleanedAssigner);
            return this.assignTimestampsAndWatermarks((WatermarkStrategy)wms);
        }
    
        /** @deprecated */
        @Deprecated
        public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks<T> timestampAndWatermarkAssigner) {
            AssignerWithPunctuatedWatermarks<T> cleanedAssigner = (AssignerWithPunctuatedWatermarks)this.clean(timestampAndWatermarkAssigner);
            WatermarkStrategy<T> wms = new org.apache.flink.streaming.runtime.operators.util.AssignerWithPunctuatedWatermarksAdapter.Strategy(cleanedAssigner);
            return this.assignTimestampsAndWatermarks((WatermarkStrategy)wms);
        }
    
        @PublicEvolving
        public DataStreamSink<T> print() {
            PrintSinkFunction<T> printFunction = new PrintSinkFunction();
            return this.addSink(printFunction).name("Print to Std. Out");
        }
    
        @PublicEvolving
        public DataStreamSink<T> printToErr() {
            PrintSinkFunction<T> printFunction = new PrintSinkFunction(true);
            return this.addSink(printFunction).name("Print to Std. Err");
        }
    
        @PublicEvolving
        public DataStreamSink<T> print(String sinkIdentifier) {
            PrintSinkFunction<T> printFunction = new PrintSinkFunction(sinkIdentifier, false);
            return this.addSink(printFunction).name("Print to Std. Out");
        }
    
        @PublicEvolving
        public DataStreamSink<T> printToErr(String sinkIdentifier) {
            PrintSinkFunction<T> printFunction = new PrintSinkFunction(sinkIdentifier, true);
            return this.addSink(printFunction).name("Print to Std. Err");
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public DataStreamSink<T> writeAsText(String path) {
            return this.writeUsingOutputFormat(new TextOutputFormat(new Path(path)));
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public DataStreamSink<T> writeAsText(String path, WriteMode writeMode) {
            TextOutputFormat<T> tof = new TextOutputFormat(new Path(path));
            tof.setWriteMode(writeMode);
            return this.writeUsingOutputFormat(tof);
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public DataStreamSink<T> writeAsCsv(String path) {
            return this.writeAsCsv(path, (WriteMode)null, "\n", CsvOutputFormat.DEFAULT_FIELD_DELIMITER);
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode) {
            return this.writeAsCsv(path, writeMode, "\n", CsvOutputFormat.DEFAULT_FIELD_DELIMITER);
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public <X extends Tuple> DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode, String rowDelimiter, String fieldDelimiter) {
            Preconditions.checkArgument(this.getType().isTupleType(), "The writeAsCsv() method can only be used on data streams of tuples.");
            CsvOutputFormat<X> of = new CsvOutputFormat(new Path(path), rowDelimiter, fieldDelimiter);
            if (writeMode != null) {
                of.setWriteMode(writeMode);
            }
    
            return this.writeUsingOutputFormat(of);
        }
    
        @PublicEvolving
        public DataStreamSink<T> writeToSocket(String hostName, int port, SerializationSchema<T> schema) {
            DataStreamSink<T> returnStream = this.addSink(new SocketClientSink(hostName, port, schema, 0));
            returnStream.setParallelism(1);
            return returnStream;
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public DataStreamSink<T> writeUsingOutputFormat(OutputFormat<T> format) {
            return this.addSink(new OutputFormatSinkFunction(format));
        }
    
        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
            return this.doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator));
        }
    
        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperatorFactory<T, R> operatorFactory) {
            return this.doTransform(operatorName, outTypeInfo, operatorFactory);
        }
    
        protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
            this.transformation.getOutputType();
            OneInputTransformation<T, R> resultTransform = new OneInputTransformation(this.transformation, operatorName, operatorFactory, outTypeInfo, this.environment.getParallelism());
            SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(this.environment, resultTransform);
            this.getExecutionEnvironment().addOperator(resultTransform);
            return returnStream;
        }
    
        protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
            return new DataStream(this.getExecutionEnvironment(), new PartitionTransformation(this.getTransformation(), partitioner));
        }
    
        public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
            this.transformation.getOutputType();
            if (sinkFunction instanceof InputTypeConfigurable) {
                ((InputTypeConfigurable)sinkFunction).setInputType(this.getType(), this.getExecutionConfig());
            }
    
            StreamSink<T> sinkOperator = new StreamSink((SinkFunction)this.clean(sinkFunction));
            DataStreamSink<T> sink = new DataStreamSink(this, sinkOperator);
            this.getExecutionEnvironment().addOperator(sink.getTransformation());
            return sink;
        }
    
        @Experimental
        public DataStreamSink<T> sinkTo(Sink<T, ?, ?, ?> sink) {
            this.transformation.getOutputType();
            return new DataStreamSink(this, sink);
        }
    
        public CloseableIterator<T> executeAndCollect() throws Exception {
            return this.executeAndCollect("DataStream Collect");
        }
    
        public CloseableIterator<T> executeAndCollect(String jobExecutionName) throws Exception {
            return this.executeAndCollectWithClient(jobExecutionName).iterator;
        }
    
        public List<T> executeAndCollect(int limit) throws Exception {
            return this.executeAndCollect("DataStream Collect", limit);
        }
    
        public List<T> executeAndCollect(String jobExecutionName, int limit) throws Exception {
            Preconditions.checkState(limit > 0, "Limit must be greater than 0");
            ClientAndIterator<T> clientAndIterator = this.executeAndCollectWithClient(jobExecutionName);
            Throwable var4 = null;
    
            try {
                ArrayList results;
                for(results = new ArrayList(limit); clientAndIterator.iterator.hasNext() && limit > 0; --limit) {
                    results.add(clientAndIterator.iterator.next());
                }
    
                ArrayList var6 = results;
                return var6;
            } catch (Throwable var15) {
                var4 = var15;
                throw var15;
            } finally {
                if (clientAndIterator != null) {
                    if (var4 != null) {
                        try {
                            clientAndIterator.close();
                        } catch (Throwable var14) {
                            var4.addSuppressed(var14);
                        }
                    } else {
                        clientAndIterator.close();
                    }
                }
    
            }
        }
    
        ClientAndIterator<T> executeAndCollectWithClient(String jobExecutionName) throws Exception {
            TypeSerializer<T> serializer = this.getType().createSerializer(this.getExecutionEnvironment().getConfig());
            String accumulatorName = "dataStreamCollect_" + UUID.randomUUID().toString();
            StreamExecutionEnvironment env = this.getExecutionEnvironment();
            CollectSinkOperatorFactory<T> factory = new CollectSinkOperatorFactory(serializer, accumulatorName);
            CollectSinkOperator<T> operator = (CollectSinkOperator)factory.getOperator();
            CollectResultIterator<T> iterator = new CollectResultIterator(operator.getOperatorIdFuture(), serializer, accumulatorName, env.getCheckpointConfig());
            CollectStreamSink<T> sink = new CollectStreamSink(this, factory);
            sink.name("Data stream collect sink");
            env.addOperator(sink.getTransformation());
            JobClient jobClient = env.executeAsync(jobExecutionName);
            iterator.setJobClient(jobClient);
            return new ClientAndIterator(jobClient, iterator);
        }
    
        @Internal
        public Transformation<T> getTransformation() {
            return this.transformation;
        }
    }

    The aggregation operator APIs of org.apache.flink.streaming.api.datastream.KeyedStream:

    //
    // Source code recreated from a .class file by IntelliJ IDEA
    // (powered by FernFlower decompiler)
    //
    
    package org.apache.flink.streaming.api.datastream;
    
    import java.util.ArrayList;
    import java.util.Stack;
    import java.util.UUID;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.flink.annotation.Internal;
    import org.apache.flink.annotation.Public;
    import org.apache.flink.annotation.PublicEvolving;
    import org.apache.flink.api.common.InvalidProgramException;
    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.common.state.ReducingStateDescriptor;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.Utils;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.typeutils.EnumTypeInfo;
    import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
    import org.apache.flink.api.java.typeutils.PojoTypeInfo;
    import org.apache.flink.api.java.typeutils.TupleTypeInfoBase;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.streaming.api.functions.aggregation.AggregationFunction;
    import org.apache.flink.streaming.api.functions.aggregation.ComparableAggregator;
    import org.apache.flink.streaming.api.functions.aggregation.SumAggregator;
    import org.apache.flink.streaming.api.functions.aggregation.AggregationFunction.AggregationType;
    import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
    import org.apache.flink.streaming.api.functions.query.QueryableAppendingStateOperator;
    import org.apache.flink.streaming.api.functions.query.QueryableValueStateOperator;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.api.operators.KeyedProcessOperator;
    import org.apache.flink.streaming.api.operators.LegacyKeyedProcessOperator;
    import org.apache.flink.streaming.api.operators.StreamOperatorFactory;
    import org.apache.flink.streaming.api.operators.co.IntervalJoinOperator;
    import org.apache.flink.streaming.api.transformations.OneInputTransformation;
    import org.apache.flink.streaming.api.transformations.PartitionTransformation;
    import org.apache.flink.streaming.api.transformations.ReduceTransformation;
    import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
    import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
    import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.streaming.api.windowing.windows.Window;
    import org.apache.flink.streaming.runtime.partitioner.KeyGroupStreamPartitioner;
    import org.apache.flink.streaming.runtime.partitioner.StreamPartitioner;
    import org.apache.flink.util.Preconditions;
    
    @Public
    public class KeyedStream<T, KEY> extends DataStream<T> {
        private final KeySelector<T, KEY> keySelector;
        private final TypeInformation<KEY> keyType;
    
        public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
            this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
        }
    
        public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
            this(dataStream, new PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)), keySelector, keyType);
        }
    
        @Internal
        KeyedStream(DataStream<T> stream, PartitionTransformation<T> partitionTransformation, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
            super(stream.getExecutionEnvironment(), partitionTransformation);
            this.keySelector = (KeySelector)this.clean(keySelector);
            this.keyType = this.validateKeyType(keyType);
        }
    
        private TypeInformation<KEY> validateKeyType(TypeInformation<KEY> keyType) {
            Stack<TypeInformation<?>> stack = new Stack();
            stack.push(keyType);
            ArrayList unsupportedTypes = new ArrayList();
    
            while(true) {
                TypeInformation typeInfo;
                do {
                    if (stack.isEmpty()) {
                        if (!unsupportedTypes.isEmpty()) {
                            throw new InvalidProgramException("Type " + keyType + " cannot be used as key. Contained UNSUPPORTED key types: " + StringUtils.join(unsupportedTypes, ", ") + ". Look at the keyBy() documentation for the conditions a type has to satisfy in order to be eligible for a key.");
                        }
    
                        return keyType;
                    }
    
                    typeInfo = (TypeInformation)stack.pop();
                    if (!this.validateKeyTypeIsHashable(typeInfo)) {
                        unsupportedTypes.add(typeInfo);
                    }
                } while(!(typeInfo instanceof TupleTypeInfoBase));
    
                for(int i = 0; i < typeInfo.getArity(); ++i) {
                    stack.push(((TupleTypeInfoBase)typeInfo).getTypeAt(i));
                }
            }
        }
    
        private boolean validateKeyTypeIsHashable(TypeInformation<?> type) {
            try {
                return type instanceof PojoTypeInfo ? !type.getTypeClass().getMethod("hashCode").getDeclaringClass().equals(Object.class) : !isArrayType(type) && !isEnumType(type);
            } catch (NoSuchMethodException var3) {
                return false;
            }
        }
    
        private static boolean isArrayType(TypeInformation<?> type) {
            return type instanceof PrimitiveArrayTypeInfo || type instanceof BasicArrayTypeInfo || type instanceof ObjectArrayTypeInfo;
        }
    
        private static boolean isEnumType(TypeInformation<?> type) {
            return type instanceof EnumTypeInfo;
        }
    
        @Internal
        public KeySelector<T, KEY> getKeySelector() {
            return this.keySelector;
        }
    
        @Internal
        public TypeInformation<KEY> getKeyType() {
            return this.keyType;
        }
    
        protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
            throw new UnsupportedOperationException("Cannot override partitioning for KeyedStream.");
        }
    
        protected <R> SingleOutputStreamOperator<R> doTransform(String operatorName, TypeInformation<R> outTypeInfo, StreamOperatorFactory<R> operatorFactory) {
            SingleOutputStreamOperator<R> returnStream = super.doTransform(operatorName, outTypeInfo, operatorFactory);
            OneInputTransformation<T, R> transform = (OneInputTransformation)returnStream.getTransformation();
            transform.setStateKeySelector(this.keySelector);
            transform.setStateKeyType(this.keyType);
            return returnStream;
        }
    
        public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {
            DataStreamSink<T> result = super.addSink(sinkFunction);
            result.getTransformation().setStateKeySelector(this.keySelector);
            result.getTransformation().setStateKeyType(this.keyType);
            return result;
        }
    
        /** @deprecated */
        @Deprecated
        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction) {
            TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(processFunction, ProcessFunction.class, 0, 1, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
            return this.process(processFunction, outType);
        }
    
        /** @deprecated */
        @Deprecated
        @Internal
        public <R> SingleOutputStreamOperator<R> process(ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {
            LegacyKeyedProcessOperator<KEY, T, R> operator = new LegacyKeyedProcessOperator((ProcessFunction)this.clean(processFunction));
            return this.transform("Process", outputType, operator);
        }
    
        @PublicEvolving
        public <R> SingleOutputStreamOperator<R> process(KeyedProcessFunction<KEY, T, R> keyedProcessFunction) {
            TypeInformation<R> outType = TypeExtractor.getUnaryOperatorReturnType(keyedProcessFunction, KeyedProcessFunction.class, 1, 2, TypeExtractor.NO_INDEX, this.getType(), Utils.getCallLocationName(), true);
            return this.process(keyedProcessFunction, outType);
        }
    
        @Internal
        public <R> SingleOutputStreamOperator<R> process(KeyedProcessFunction<KEY, T, R> keyedProcessFunction, TypeInformation<R> outputType) {
            KeyedProcessOperator<KEY, T, R> operator = new KeyedProcessOperator((KeyedProcessFunction)this.clean(keyedProcessFunction));
            return this.transform("KeyedProcess", outputType, operator);
        }
    
        @PublicEvolving
        public <T1> KeyedStream.IntervalJoin<T, T1, KEY> intervalJoin(KeyedStream<T1, KEY> otherStream) {
            return new KeyedStream.IntervalJoin(this, otherStream);
        }
    
        /** @deprecated */
        @Deprecated
        public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
            return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.window(TumblingProcessingTimeWindows.of(size)) : this.window(TumblingEventTimeWindows.of(size));
        }
    
        /** @deprecated */
        @Deprecated
        public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide) {
            return this.environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime ? this.window(SlidingProcessingTimeWindows.of(size, slide)) : this.window(SlidingEventTimeWindows.of(size, slide));
        }
    
        public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
            return this.window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size)));
        }
    
        public WindowedStream<T, KEY, GlobalWindow> countWindow(long size, long slide) {
            return this.window(GlobalWindows.create()).evictor(CountEvictor.of(size)).trigger(CountTrigger.of(slide));
        }
    
        @PublicEvolving
        public <W extends Window> WindowedStream<T, KEY, W> window(WindowAssigner<? super T, W> assigner) {
            return new WindowedStream(this, assigner);
        }
    
        public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> reducer) {
            ReduceTransformation<T, KEY> reduce = new ReduceTransformation("Keyed Reduce", this.environment.getParallelism(), this.transformation, (ReduceFunction)this.clean(reducer), this.keySelector, this.getKeyType());
            this.getExecutionEnvironment().addOperator(reduce);
            return new SingleOutputStreamOperator(this.getExecutionEnvironment(), reduce);
        }
    
        public SingleOutputStreamOperator<T> sum(int positionToSum) {
            return this.aggregate(new SumAggregator(positionToSum, this.getType(), this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> sum(String field) {
            return this.aggregate(new SumAggregator(field, this.getType(), this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> min(int positionToMin) {
            return this.aggregate(new ComparableAggregator(positionToMin, this.getType(), AggregationType.MIN, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> min(String field) {
            return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MIN, false, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> max(int positionToMax) {
            return this.aggregate(new ComparableAggregator(positionToMax, this.getType(), AggregationType.MAX, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> max(String field) {
            return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MAX, false, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> minBy(String field, boolean first) {
            return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> maxBy(String field, boolean first) {
            return this.aggregate(new ComparableAggregator(field, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> minBy(int positionToMinBy) {
            return this.minBy(positionToMinBy, true);
        }
    
        public SingleOutputStreamOperator<T> minBy(String positionToMinBy) {
            return this.minBy(positionToMinBy, true);
        }
    
        public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first) {
            return this.aggregate(new ComparableAggregator(positionToMinBy, this.getType(), AggregationType.MINBY, first, this.getExecutionConfig()));
        }
    
        public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy) {
            return this.maxBy(positionToMaxBy, true);
        }
    
        public SingleOutputStreamOperator<T> maxBy(String positionToMaxBy) {
            return this.maxBy(positionToMaxBy, true);
        }
    
        public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first) {
            return this.aggregate(new ComparableAggregator(positionToMaxBy, this.getType(), AggregationType.MAXBY, first, this.getExecutionConfig()));
        }
    
        protected SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregate) {
            return this.reduce(aggregate).name("Keyed Aggregation");
        }
    
        @PublicEvolving
        public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName) {
            ValueStateDescriptor<T> valueStateDescriptor = new ValueStateDescriptor(UUID.randomUUID().toString(), this.getType());
            return this.asQueryableState(queryableStateName, valueStateDescriptor);
        }
    
        @PublicEvolving
        public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName, ValueStateDescriptor<T> stateDescriptor) {
            this.transform("Queryable state: " + queryableStateName, this.getType(), new QueryableValueStateOperator(queryableStateName, stateDescriptor));
            stateDescriptor.initializeSerializerUnlessSet(this.getExecutionConfig());
            return new QueryableStateStream(queryableStateName, stateDescriptor, this.getKeyType().createSerializer(this.getExecutionConfig()));
        }
    
        @PublicEvolving
        public QueryableStateStream<KEY, T> asQueryableState(String queryableStateName, ReducingStateDescriptor<T> stateDescriptor) {
            this.transform("Queryable state: " + queryableStateName, this.getType(), new QueryableAppendingStateOperator(queryableStateName, stateDescriptor));
            stateDescriptor.initializeSerializerUnlessSet(this.getExecutionConfig());
            return new QueryableStateStream(queryableStateName, stateDescriptor, this.getKeyType().createSerializer(this.getExecutionConfig()));
        }
    
        @PublicEvolving
        public static class IntervalJoined<IN1, IN2, KEY> {
            private final KeyedStream<IN1, KEY> left;
            private final KeyedStream<IN2, KEY> right;
            private final long lowerBound;
            private final long upperBound;
            private final KeySelector<IN1, KEY> keySelector1;
            private final KeySelector<IN2, KEY> keySelector2;
            private boolean lowerBoundInclusive;
            private boolean upperBoundInclusive;
    
            public IntervalJoined(KeyedStream<IN1, KEY> left, KeyedStream<IN2, KEY> right, long lowerBound, long upperBound, boolean lowerBoundInclusive, boolean upperBoundInclusive) {
                this.left = (KeyedStream)Preconditions.checkNotNull(left);
                this.right = (KeyedStream)Preconditions.checkNotNull(right);
                this.lowerBound = lowerBound;
                this.upperBound = upperBound;
                this.lowerBoundInclusive = lowerBoundInclusive;
                this.upperBoundInclusive = upperBoundInclusive;
                this.keySelector1 = left.getKeySelector();
                this.keySelector2 = right.getKeySelector();
            }
    
            @PublicEvolving
            public KeyedStream.IntervalJoined<IN1, IN2, KEY> upperBoundExclusive() {
                this.upperBoundInclusive = false;
                return this;
            }
    
            @PublicEvolving
            public KeyedStream.IntervalJoined<IN1, IN2, KEY> lowerBoundExclusive() {
                this.lowerBoundInclusive = false;
                return this;
            }
    
            @PublicEvolving
            public <OUT> SingleOutputStreamOperator<OUT> process(ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction) {
                Preconditions.checkNotNull(processJoinFunction);
                TypeInformation<OUT> outputType = TypeExtractor.getBinaryOperatorReturnType(processJoinFunction, ProcessJoinFunction.class, 0, 1, 2, TypeExtractor.NO_INDEX, this.left.getType(), this.right.getType(), Utils.getCallLocationName(), true);
                return this.process(processJoinFunction, outputType);
            }
    
            @PublicEvolving
            public <OUT> SingleOutputStreamOperator<OUT> process(ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction, TypeInformation<OUT> outputType) {
                Preconditions.checkNotNull(processJoinFunction);
                Preconditions.checkNotNull(outputType);
                ProcessJoinFunction<IN1, IN2, OUT> cleanedUdf = (ProcessJoinFunction)this.left.getExecutionEnvironment().clean(processJoinFunction);
                IntervalJoinOperator<KEY, IN1, IN2, OUT> operator = new IntervalJoinOperator(this.lowerBound, this.upperBound, this.lowerBoundInclusive, this.upperBoundInclusive, this.left.getType().createSerializer(this.left.getExecutionConfig()), this.right.getType().createSerializer(this.right.getExecutionConfig()), cleanedUdf);
                return this.left.connect(this.right).keyBy(this.keySelector1, this.keySelector2).transform("Interval Join", outputType, operator);
            }
        }
    
        @PublicEvolving
        public static class IntervalJoin<T1, T2, KEY> {
            private final KeyedStream<T1, KEY> streamOne;
            private final KeyedStream<T2, KEY> streamTwo;
            private KeyedStream.IntervalJoin.TimeBehaviour timeBehaviour;
    
            IntervalJoin(KeyedStream<T1, KEY> streamOne, KeyedStream<T2, KEY> streamTwo) {
                this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.EventTime;
                this.streamOne = (KeyedStream)Preconditions.checkNotNull(streamOne);
                this.streamTwo = (KeyedStream)Preconditions.checkNotNull(streamTwo);
            }
    
            public KeyedStream.IntervalJoin<T1, T2, KEY> inEventTime() {
                this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.EventTime;
                return this;
            }
    
            public KeyedStream.IntervalJoin<T1, T2, KEY> inProcessingTime() {
                this.timeBehaviour = KeyedStream.IntervalJoin.TimeBehaviour.ProcessingTime;
                return this;
            }
    
            @PublicEvolving
            public KeyedStream.IntervalJoined<T1, T2, KEY> between(Time lowerBound, Time upperBound) {
                if (this.timeBehaviour != KeyedStream.IntervalJoin.TimeBehaviour.EventTime) {
                    throw new UnsupportedTimeCharacteristicException("Time-bounded stream joins are only supported in event time");
                } else {
                    Preconditions.checkNotNull(lowerBound, "A lower bound needs to be provided for a time-bounded join");
                    Preconditions.checkNotNull(upperBound, "An upper bound needs to be provided for a time-bounded join");
                    return new KeyedStream.IntervalJoined(this.streamOne, this.streamTwo, lowerBound.toMilliseconds(), upperBound.toMilliseconds(), true, true);
                }
            }
    
            static enum TimeBehaviour {
                ProcessingTime,
                EventTime;
    
                private TimeBehaviour() {
                }
            }
        }
    }

    Take the earlier socket example: in the "show plan" view, each box can be understood as one task (a task may have multiple subtasks, and the number of subtasks can be understood as the parallelism). Why Flink merges some of them into one box will become clear after reading about operator chaining (one-to-one operators with the same parallelism get merged into a chain).

    (1) With the parallelism set to 2, the show plan looks like this:

     (2) With the parallelism set to 1, the show plan looks like this:

    2. Parallelism

    1. What Is Parallel Computing

      What we really want is "data parallelism": when multiple records arrive at once, we can read them in simultaneously and run flatMap and other operations on them on different nodes at the same time.

    2. Parallel Subtasks and Parallelism

      To achieve this, we replicate an operator's logic into multiple copies on multiple nodes; an arriving record is processed by any one of them. An operator task is thereby split into several parallel subtasks that are distributed to different nodes, which gives true parallel computation.

      During Flink execution, each operator can consist of one or more subtasks, and these subtasks run completely independently of each other in different threads, on different physical machines, or in different containers.

      The number of subtasks of a particular operator is called its parallelism. The parallelism of a whole stream program can be taken to be the largest parallelism among all of its operators; different operators in the same program may have different parallelisms.

    As shown in the figure below:

      The dataflow contains four operators: source, map, window, and sink. Except for the final sink (parallelism 1), each has a parallelism of 2, so the program contains 2 + 2 + 2 + 1 = 7 subtasks. It needs at least two partitions to execute, and the program's parallelism can be considered to be 2.

    3. Setting the Parallelism

    Settings are matched from nearest to farthest: the setting closest to the operator takes effect first.

    (1) In code

    // Global setting
    executionEnvironment.setParallelism(3);
    
    // Setting for a single operator
    txtDataSource
                    .flatMap((String line, Collector<String> words) -> {
                        Arrays.stream(line.split(" ")).forEach(words::collect);
                    }).setParallelism(3)

     (2) At submission time (can also be set in the Web UI)

    ./flink-1.13.0/bin/flink run -c cn.qz.SocketStreamWordCount -p 2 ./study-flink-1.0-SNAPSHOT.jar

    (3) Change the default parallelism directly in the cluster configuration file flink-conf.yaml:

    parallelism.default: 1

      None of these settings is mandatory; they are matched from nearest to farthest (per-operator setting, then the env-level setting, then -p, then the config default). Note that some operators ignore a configured parallelism; for instance, the operator that reads a socket text stream simply does not support parallel execution. In a development environment the default parallelism is the number of CPU cores of the current machine (the default number of task slots is the CPU core count as well).

    4. A Test Example

      The socket stream serves as the example again.

    (1) Submit with parallelism 2 and inspect the job:

       As shown above. "Name" is the name of each operator; we saw in the source code earlier where these operator names are assigned. The number of subtasks is shown alongside.

    (2) Type input on port 7777:

    [root@k8smaster01 conf]# nc -l 7777
    hello china and beijing
    what is your name?
    my name is qz.

    (3) Inspect the detailed information of the output task:

     Inspect the subtask information:

     (4) Look at the standard output on the machines hosting the two subtasks: note the number prefixed to each output line (it can be understood as the partition number, i.e., the task slot index).

    Output on the machine running the first subtask:

     The second machine:

      My own understanding here: the maximum parallelism is the largest number of resources (task slots) a job can obtain. Tasks run in parallel at the same time, potentially on different machines (how many threads run in parallel on each machine is examined later under task slots; for now each machine has exactly one slot).

    Addendum: a test for questions 1 and 2 above

      Take the following program:

    package cn.qz;
    
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStreamSource;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;
    
    import java.util.Arrays;
    
    public class SocketStreamWordCount {
    
        public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (the stream processing environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the socket text stream
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("192.168.13.107", 7777);
        // 3. Transform the data format
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // the lambda uses generics; due to type erasure the type information must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        System.out.println("========");
        // 7. Max
        SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = singleOutputStreamOperator.keyBy(t -> t.f0).max(1);
        tuple2SingleOutputStreamOperator.printToErr();
        // 8. Execute
        executionEnvironment.execute();
        }
    }

      Debugging and inspecting the relevant objects shows the default parallelism and the associated transformations.

    3. Operator Chains

      Looking at the plan the Web UI draws, we notice that its nodes do not correspond one-to-one to the operators in the code; some nodes link several tasks together and merge them into one larger task. The reasons are explained below.

    1. Data Transfer Between Operators

      A data stream can travel between operators either in one-to-one forwarding mode or in a shuffled redistributing mode; which one applies depends on the kind of operator.

    (1) One-to-one forwarding

      In this mode the stream preserves the partitioning and the order of its elements. Take the source and map operators in the figure: once source has read a record, it can hand it straight to map without repartitioning and without reordering. A map subtask therefore sees exactly the same elements, in the same order, as the corresponding source subtask produced: a strict one-to-one relationship. Operators such as map, filter, and flatMap all exhibit it.

    (2) Redistributing

      In this mode the partitioning of the stream changes, as it does between the map operator and the following keyBy/window/apply operators, and between the keyBy/window operators and the sink.

      Each operator subtask sends data to different downstream target tasks according to its transfer strategy. keyBy, for example, is a grouping operation that repartitions by hashing the key; and going from a window operator with parallelism 2 to a sink with parallelism 1, the transfer mode is rebalance, which spreads records evenly across the downstream subtasks. All of these transfer modes cause a redistribution (redistribute).
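
      The transfer strategy can also be picked explicitly through the partitioning APIs visible in the DataStream source above (forward, shuffle, rebalance, broadcast, and so on). A minimal sketch, with placeholder host and port:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PartitioningExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> source = env.socketTextStream("localhost", 7777);
            // forward() keeps the one-to-one relationship; shuffle() redistributes randomly;
            // rebalance() redistributes round-robin, spreading records evenly downstream.
            source.rebalance().print().setParallelism(1);
            env.execute("partitioning-example");
        }
    }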

    2. Merging Operator Chains

      One-to-one operators with the same parallelism can be linked directly into one big task (task); the original operators then become parts of that merged task, and each task is executed by one thread. This is operator chaining. After merging, the graph looks like this:

     

    After merging there are five tasks, executed in parallel by five threads. Chaining operators reduces hand-offs between threads and improves throughput.

    Flink applies the chaining rules by default; to forbid chaining, or to customize it, apply specific settings to operators in code:

    // Disable chaining for this operator
    SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                    .flatMap((String line, Collector<String> words) -> {
                        Arrays.stream(line.split(" ")).forEach(words::collect);
                    }).disableChaining()
    
    // Start a new chain beginning with this operator
    SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                    .flatMap((String line, Collector<String> words) -> {
                        Arrays.stream(line.split(" ")).forEach(words::collect);
                    }).startNewChain()

    4. Job Graph and Execution Graph

    The graphs along which a Flink job is scheduled and executed come in four layers, in order:

    logical stream graph -> job graph -> execution graph -> physical graph

    Taking the socket program as an example, the transformation proceeds as follows:

    1. Logical Stream Graph

      Its nodes generally correspond to the operator operations. It is produced on the client.
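
      The client can print a JSON rendering of this logical graph (viewable in the Flink plan visualizer) before anything is submitted. A minimal sketch:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PlanExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.socketTextStream("localhost", 7777).print();
            // getExecutionPlan() renders the logical stream graph as JSON on the client side
            System.out.println(env.getExecutionPlan());
        }
    }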

    2. Job Graph

      The job graph is the dataflow graph after optimization. The main optimization is to connect eligible nodes together and merge them into single task nodes, forming operator chains. It is also produced on the client and is passed to the JobMaster when the job is submitted.

    3. Execution Graph

      After receiving the JobGraph, the JobMaster uses it to generate the execution graph. The ExecutionGraph is the parallelized version of the JobGraph and the core data structure in scheduling. It differs from the job graph in that tasks are split into subtasks and the way data is passed between tasks is made explicit.

    4. Physical Graph

      After generating the execution graph, the JobMaster distributes it to the TaskManagers. The TaskManagers deploy the tasks according to the execution graph, and the resulting physical execution process forms the physical graph.

      The physical graph builds on the execution graph and further pins down where data is stored and exactly how it is sent and received.

    5. Tasks and Task Slots

       In the earlier tests, our three TaskManagers provided 3 task slots in total. When cn.qz.SocketStreamWordCount was submitted with parallelism 2, the job showed 5 tasks (1 + 2 + 2), yet it occupied only two task slots. Here is why.

    1. Task Slots

    In Flink, a worker (TaskManager) is a JVM process, and as a process it can start multiple independent threads to execute multiple subtasks (subtask).

    The number of independent task-executing threads in a TaskManager is its number of task slots. The default is 1 and it can be changed: after modifying flink-conf.yaml as below, each node has 4 slots, so 3 nodes provide 12 slots in total.

    taskmanager.numberOfTaskSlots: 4

    Note that slots currently isolate only memory; CPU is not isolated. In practice the slot count should be tuned to the number of CPU cores.

    2. Slot Sharing Between Tasks

      By default, Flink allows subtasks to share slots, so the job above can finish with just two slots (the largest subtask count of any single task).

      Subtasks of different task nodes may share one slot; put the other way round, the multiple subtasks of one and the same task must be placed in different slots. With parallelism 2, one possible placement is:

      At this point a question may arise: if the goal is to use the computing resources as fully as possible, why run multiple tasks concurrently inside one task slot (one slot doing several jobs)?

      The reason: different tasks load the resources differently. Source, map, and sink may take very little processing time, whereas window and other transformations take long (they are resource-intensive). With one slot per task, the upstream source (blocked waiting for notifications from the downstream window task, effectively backpressure) and the downstream sink could sit idle for long stretches while the window tasks are overloaded, leaving utilization unbalanced. Slot sharing fixes this: putting resource-intensive and lightweight tasks into the same slot lets them divide the slot's resources among themselves.

      If you want a task to occupy a slot exclusively, or only some operators to share slots, you can configure sharing groups: only subtasks that belong to the same slot-sharing group share slots, and tasks in different groups must be assigned to different slots.

            SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                    .flatMap((String line, Collector<String> words) -> {
                        Arrays.stream(line.split(" ")).forEach(words::collect);
                    }).slotSharingGroup("1")

    3. Task Slots vs. Parallelism

      The parallelism of the whole stream program is the largest parallelism among all of its operators, and that is also the number of slots the program needs (when no sharing groups are specified).
