Flink DataStream Operators (Java)


    ==Official Documentation==

    https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/

    ==The nc Tool==

    【Windows】

    On Windows, download nc.exe from the web (https://eternallybored.org/misc/netcat/).

    Start nc listening on the port the example uses (e.g. nc -L -p 9001 -v).

    【Linux】

    On Linux you can use:

    nc -lk 9000

    ==Sample Code on GitHub==

    All of the sample code below has been uploaded to GitHub for reference:

    https://github.com/quchunhui/demo-macket/tree/master/flink

    ==Reference Blog==

    Parts of the tests below draw on this blog post:

    https://blog.csdn.net/chybin500/article/details/87260869

    ==Operator Relationships==

    ==Map==

    [DataStream -> DataStream]

    map can be understood as a mapping: each element is transformed and mapped to exactly one new element.

    It applies a user-defined MapFunction to a DataStream[T] and produces a new DataStream[T]; the element type may change in the process. It is commonly used to clean and transform records in a data set.

    package demo;
    
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class MapDemo {
        private static int index = 1;
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the source: listen for socket messages on port 9001
            DataStream<String> textStream = env.socketTextStream("localhost", 9001, "\n");
            //3. The map operation.
            DataStream<String> result = textStream.map(s -> (index++) + ". You entered: " + s);
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    }
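
    The lambda above is equivalent to supplying a MapFunction explicitly. A minimal sketch of that form (the counter is omitted for brevity; it additionally needs an import of org.apache.flink.api.common.functions.MapFunction):

    //Equivalent map with an explicit MapFunction instead of a lambda
    DataStream<String> result = textStream.map(new MapFunction<String, String>() {
        @Override
        public String map(String s) {
            return "You entered: " + s;
        }
    });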

    Socket input:

    Console output:

    ==FlatMap==

    [DataStream -> DataStream]

    flatMap can be understood as flattening: each element can be turned into zero, one, or many elements.

    package demo;
    
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;
    
    public class FlatMapDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the source: listen for socket messages on the configured port
            DataStream<String> textStream = env.socketTextStream("localhost", Constants.PORT, "\n");
            //3. The flatMap operation: split every input line
            DataStream<String> result = textStream.flatMap(new Splitter())
            //Note: flatMap is one of the operators whose function parameter is generic.
            //With a lambda expression you would have to declare the parameter types explicitly
            //and call returns() to specify the result type.
            //See the Flink docs: https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/java_lambdas.html
            .returns(Types.STRING);
    
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    
        public static class Splitter implements FlatMapFunction<String, String> {
            @Override
            public void flatMap(String sentence, Collector<String> out) {
                for (String word: sentence.split("")) {
                    out.collect(word);
                }
            }
        }
    }

    Socket input:

    Console output:

    ==Filter==

    [DataStream -> DataStream]

    This operator filters the input data set according to a condition: elements that satisfy the condition are emitted, and the rest are dropped.

    package demo;
    
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class FilterDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the source: listen for socket messages on the configured port
            DataStream<String> textStream =
                env.socketTextStream("localhost", Constants.PORT, "\n");
            //3. The filter operation: keep only non-empty lines.
            DataStream<String> result = textStream.filter(line -> !line.trim().equals(""));
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    }

    Socket input:

    Console output:

    ==keyBy==

    [DataStream -> KeyedStream]

    This operator converts a DataStream[T] into a KeyedStream[T] according to the specified key; that is, it partitions the data set so that records with the same key value land in the same partition. Simply put, it is like GROUP BY in SQL.

    Logically it partitions the stream by the specified key, using the hash value of the key.

    package demo;
    
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    
    public class KeyByDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            //2. Define the source: listen for socket messages on the configured port
            DataStream<String> textStream =
                env.socketTextStream("localhost", Constants.PORT, "\n");

            //3. Build the pipeline.
            DataStream<Tuple2<String, Integer>> result = textStream
                //map turns each line (one word) into a Tuple2
                .map(line -> Tuple2.of(line.trim(), 1))
                //with a lambda, Tuple2 is generic, so returns() must declare the type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                //keyBy partitions by the first field, i.e. by word
                .keyBy(0)
                //a tumbling window that fires every 10 seconds
                .timeWindow(Time.of(10, TimeUnit.SECONDS))
                //sum the second field, i.e. the counts
                .sum(1);
    
            //4. Print sink
            result.print();

            //5. Execute
            env.execute();
        }
    }
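
    Note: newer Flink releases deprecate keyBy(int) and timeWindow(...) in favor of a KeySelector and an explicit window assigner. A minimal sketch of the equivalent pipeline, assuming the same textStream as above plus an import of org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows:

    //Same word count with a KeySelector and an explicit window assigner
    DataStream<Tuple2<String, Integer>> result = textStream
        .map(line -> Tuple2.of(line.trim(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        //a KeySelector replaces the deprecated keyBy(0)
        .keyBy(t -> t.f0)
        //an explicit window assigner replaces the deprecated timeWindow(...)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .sum(1);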

    Socket input:

    Console output:

    ==Reduce==

    [KeyedStream -> DataStream]

    This operator follows essentially the same principle as Reduce in MapReduce: it performs rolling aggregation on the input KeyedStream using a user-defined ReduceFunction. The ReduceFunction must be associative and commutative.

    package demo;
    
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    
    public class ReduceDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            //2. Define the source: listen for socket messages on the configured port
            DataStream<String> textStream =
                env.socketTextStream("localhost", Constants.PORT, "\n");

            //3. Build the pipeline.
            DataStream<Tuple2<String, Integer>> result = textStream
                //map turns each line (one word) into a Tuple2
                .map(line -> Tuple2.of(line.trim(), 1))
                //with a lambda, Tuple2 is generic, so returns() must declare the type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                //keyBy partitions by the first field, i.e. by word
                .keyBy(0)
                //a tumbling window that fires every 10 seconds
                .timeWindow(Time.of(10, TimeUnit.SECONDS))
                //merge the elements of each group pairwise: the first with the second,
                //that result with the third, and so on
                .reduce((Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) -> Tuple2.of(t1.f0, t1.f1 + t2.f1));
    
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    }

    Socket input:

    Console output:

    ==Fold==

    Given an initial value, fold merges the elements into it one by one. It turns a KeyedStream into a DataStream.

    Note that this API is deprecated and not recommended; a sketch of the usual replacement follows the example below.

    package demo;
    
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    
    public class FoldDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the source: listen for socket messages on the configured port
            DataStream<String> textStream =
                env.socketTextStream("localhost", Constants.PORT, "\n");
            //3. Build the pipeline.
            DataStream<String> result = textStream
                //map turns each line (one word) into a Tuple2
                .map(line -> Tuple2.of(line.trim(), 1))
                //with a lambda, Tuple2 is generic, so returns() must declare the type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                //keyBy partitions by the first field, i.e. by word
                .keyBy(0)
                //a tumbling window that fires every 10 seconds
                .timeWindow(Time.of(10, TimeUnit.SECONDS))
                //starting from an initial value, merge the elements of each group one by one
                .fold("Result: ", (String current, Tuple2<String, Integer> t2) -> current + t2.f0 + ",");
    
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    }
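
    Since fold is deprecated (and removed in later Flink releases), aggregate with an AggregateFunction is the usual replacement. A minimal sketch under that assumption, reproducing the concatenation above as a drop-in replacement for the .fold(...) call (it needs an import of org.apache.flink.api.common.functions.AggregateFunction):

    //Replaces the .fold("Result: ", ...) call on the windowed stream above
    .aggregate(new AggregateFunction<Tuple2<String, Integer>, String, String>() {
        @Override
        public String createAccumulator() {
            //plays the role of fold's initial value
            return "Result: ";
        }

        @Override
        public String add(Tuple2<String, Integer> value, String accumulator) {
            return accumulator + value.f0 + ",";
        }

        @Override
        public String getResult(String accumulator) {
            return accumulator;
        }

        @Override
        public String merge(String a, String b) {
            return a + b;
        }
    });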

    Socket input:

    Console output:

    ==Union==

    [DataStream -> DataStream]

    Merges two or more input data sets into one. All inputs must have the same format, and the output keeps that same format.

    package demo;
    
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class UnionDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the sources: listen for socket messages on ports 9000, 9001 and 9002
            DataStream<String> textStream9000 = env.socketTextStream("localhost", 9000, "\n");
            DataStream<String> textStream9001 = env.socketTextStream("localhost", 9001, "\n");
            DataStream<String> textStream9002 = env.socketTextStream("localhost", 9002, "\n");

            DataStream<String> mapStream9000 = textStream9000.map(s -> "from port 9000: " + s);
            DataStream<String> mapStream9001 = textStream9001.map(s -> "from port 9001: " + s);
            DataStream<String> mapStream9002 = textStream9002.map(s -> "from port 9002: " + s);

            //3. union merges the data of two or more streams into a single stream
            DataStream<String> result = mapStream9000.union(mapStream9001, mapStream9002);
    
            //4. Print sink
            result.print();

            //5. Execute
            env.execute();
        }
    }

    Socket 1 (9000) input: a

    Socket 2 (9001) input: b

    Socket 3 (9002) input: c

    Console output:

    ==Join==

    Joins two streams on a specified key.

    package demo;
    
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    
    public class JoinDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            //2. Define the sources: listen for socket messages on ports 9000 and 9001
            DataStream<String> textStream9000 = env.socketTextStream("localhost", 9000, "\n");
            DataStream<String> textStream9001 = env.socketTextStream("localhost", 9001, "\n");
            //Convert each input line into a Tuple2
            DataStream<Tuple2<String,String>> mapStream9000 = textStream9000
                .map(new MapFunction<String, Tuple2<String,String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s, "来自9000端口:" + s);
                    }
                });
    
            DataStream<Tuple2<String,String>> mapStream9001 = textStream9001
                .map(new MapFunction<String, Tuple2<String,String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s, "来自9001端口:" + s);
                    }
                });
    
            //3. Join the two streams; this is an inner join, so only matching elements are kept
            DataStream<String> result = mapStream9000.join(mapStream9001)
                //join condition: key on field 0 (the raw string each source received)
                .where(t1 -> t1.getField(0)).equalTo(t2 -> t2.getField(0))
                //tumbling processing-time window of 10 seconds
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                //emit the joined result
                .apply((t1, t2) -> t1.getField(1) + "|" + t2.getField(1));
    
            //4. Print sink
            result.print();

            //5. Execute
            env.execute();
        }
    }

    Socket 9000 input: a

    Socket 9001 input: a

    Console output:

    If the inputs on the two sockets differ, nothing is printed.

    ==CoGroup==

    Joins two streams but also keeps the elements that do not match.

    package demo;
    
    import org.apache.flink.api.common.functions.CoGroupFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;
    
    public class CoGroupDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            //2. Define the sources: listen for socket messages on ports 9000 and 9001
            DataStream<String> textStream9000 = env.socketTextStream("localhost", 9000, "\n");
            DataStream<String> textStream9001 = env.socketTextStream("localhost", 9001, "\n");
            //Convert each input line into a Tuple2
            DataStream<Tuple2<String, String>> mapStream9000 = textStream9000
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s, "来自9000端口:" + s);
                    }
                });
    
            DataStream<Tuple2<String, String>> mapStream9001 = textStream9001
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s, "来自9001端口:" + s);
                    }
                });
    
            //3. coGroup the two streams; elements that do not match are kept as well, making it more powerful than join
            DataStream<String> result = mapStream9000.coGroup(mapStream9001)
                //join condition: key on field 0 (the raw string each source received)
                .where(t1 -> t1.getField(0)).equalTo(t2 -> t2.getField(0))
                //tumbling processing-time window of 10 seconds
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                //emit the grouped result
                .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> iterable, Iterable<Tuple2<String, String>> iterable1, Collector<String> collector) throws Exception {
                        StringBuilder builder = new StringBuilder();
                        builder.append("stream from 9000: ");
                        for (Tuple2<String, String> item : iterable) {
                            builder.append(item.f1).append(",");
                        }
                        builder.append("stream from 9001: ");
                        for (Tuple2<String, String> item : iterable1) {
                            builder.append(item.f1).append(",");
                        }
                        collector.collect(builder.toString());
                    }
                });
    
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    }

    Socket 9000 input: abc

    Socket 9001 input: def

    Console output:

    Socket 9000 input: 111

    Socket 9001 input: 111

    Console output:

    ==Connect, CoMap, CoFlatMap==

    [DataStream -> DataStream]

    Reference: https://www.jianshu.com/p/5b0574d466f8

    Connects two streams while letting each keep its own element type.

    The Connect operator combines two streams of (possibly) different data types; after connecting, each side retains its original type. Connected streams also allow state sharing, meaning the two inputs can operate on and observe the same state (see the sketch at the end of this section).

    package demo;
    
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoMapFunction;
    
    public class ConnectDemo {
        public static void main(String[] args) throws Exception {
            //1. Obtain the execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            //2. Define the sources: listen for socket messages on ports 9000 and 9001
            DataStream<String> textStream9000 = env.socketTextStream("localhost", 9000, "\n");
            DataStream<String> textStream9001 = env.socketTextStream("localhost", 9001, "\n");
            //Convert the first stream to an Integer stream
            DataStream<Integer> intStream = textStream9000.filter(s -> isNumeric(s)).map(s -> Integer.valueOf(s));
            //3. Connect the streams, process each side separately, and emit a single common type
            SingleOutputStreamOperator<Tuple2<Integer, String>> result = intStream.connect(textStream9001)
                .map(new CoMapFunction<Integer, String, Tuple2<Integer, String>>() {
                    @Override
                    public Tuple2<Integer, String> map1(Integer value) throws Exception {
                        return Tuple2.of(value, "");
                    }
    
                    @Override
                    public Tuple2<Integer, String> map2(String value) throws Exception {
                        return Tuple2.of(null, value);
                    }
                });
            //4. Print sink
            result.print();
            //5. Execute
            env.execute();
        }
    
        private static boolean isNumeric(String str) {
            Pattern pattern = Pattern.compile("[0-9]*");
            Matcher isNum = pattern.matcher(str);
            return isNum.matches();
        }
    }

    Some arbitrary input was entered on sockets 9000 and 9001.

    Console output:
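
    The example above does not exercise the shared-state capability mentioned at the start of this section. A rough sketch of that idea, assuming the same intStream and textStream9001 as above: a RichCoFlatMapFunction on a keyed, connected stream can read and update the same keyed state from both inputs. The key selectors below are purely illustrative, and the sketch additionally needs imports of org.apache.flink.api.common.state.ValueState, org.apache.flink.api.common.state.ValueStateDescriptor, org.apache.flink.configuration.Configuration, org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction and org.apache.flink.util.Collector.

    //Both sides of the connected stream share the same keyed ValueState
    DataStream<String> shared = intStream.connect(textStream9001)
        //hypothetical key selectors, chosen only so both sides are keyed the same way
        .keyBy(i -> i % 2, s -> s.length() % 2)
        .flatMap(new RichCoFlatMapFunction<Integer, String, String>() {
            private transient ValueState<Long> count;

            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
            }

            @Override
            public void flatMap1(Integer value, Collector<String> out) throws Exception {
                Long c = count.value();
                count.update(c == null ? 1L : c + 1);
                out.collect("int side, elements seen for this key: " + count.value());
            }

            @Override
            public void flatMap2(String value, Collector<String> out) throws Exception {
                Long c = count.value();
                count.update(c == null ? 1L : c + 1);
                out.collect("string side, elements seen for this key: " + count.value());
            }
        });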

    ==Split-Select (deprecated)==

    The Split/Select API is deprecated; use the Side-Output mechanism described next.

    ==Side-Output (now recommended)==

    Reference: https://blog.csdn.net/wangpei1949/article/details/99698868

    Notes:
    Side-Output has been available since Flink 1.3.0 and supports more flexible multi-way output.
    A side output is a secondary stream that may have a different type from the main stream, carrying data that matches specific conditions, abnormal data, late data, and so on.
    Side-Output sends data to a side OutputTag from within a ProcessFunction.

    package demo;
    
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.datastream.DataStreamSource;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;
    
    /**
     * Notes:
     * Side-Output has been available since Flink 1.3.0 and supports more flexible multi-way output.
     * A side output is a secondary stream that may have a different type from the main stream,
     * carrying data that matches specific conditions, abnormal data, late data, and so on.
     * Side-Output sends data to a side OutputTag from within a ProcessFunction.
     */
    public class SplitStreamBySideOutput {
        public static void main(String[] args) throws Exception{
            //Execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    
            //Input data source
            DataStreamSource<Tuple3<String, String, String>> source = env.fromElements(
                new Tuple3<>("productID1", "click", "user_1"),
                new Tuple3<>("productID1", "click", "user_2"),
                new Tuple3<>("productID1", "browse", "user_1"),
                new Tuple3<>("productID2", "browse", "user_1"),
                new Tuple3<>("productID2", "click", "user_2"),
                new Tuple3<>("productID2", "click", "user_1")
            );
    
            //1. Define the OutputTag
            OutputTag<Tuple3<String, String, String>>
                sideOutputTag = new OutputTag<Tuple3<String, String, String>>("side-output-tag"){};
    
            //2. Handle the main stream and the side stream inside a ProcessFunction
            SingleOutputStreamOperator<Tuple3<String, String, String>> processedStream =
                source.process(new ProcessFunction<Tuple3<String, String, String>, Tuple3<String, String, String>>() {
                @Override
                public void processElement(Tuple3<String, String, String> value, Context ctx,
                    Collector<Tuple3<String, String, String>> out) throws Exception {
                    //side stream: emit only records matching the condition
                    if (value.f0.equals("productID1")) {
                        ctx.output(sideOutputTag, value);
                    //main stream
                    } else {
                        out.collect(value);
                    }
                }
            });
    
            //print the main stream
            processedStream.print();
            //print the side stream
            processedStream.getSideOutput(sideOutputTag).print();
            //execute
            env.execute();
        }
    }

    --END--
