Flink has a very flexible layered API design, and the core layer is the DataSet/DataStream API. In recent versions the DataSet API has been deprecated; the DataStream API alone now covers unified stream and batch processing.
DataStream is itself a class in Flink that represents a collection of data. The Flink code we write is essentially a series of operations on this data type, which is why the core API is named after DataStream. Both stream processing and batch processing can be implemented with this one API.
A Flink program is essentially a set of transformations applied to DataStreams. Concretely, the code is made up of the following parts:
1. Obtain the execution environment (execution environment) 2. Read the data source (source) 3. Define transformations on the data (transformation) 4. Define where the results are written (sink) 5. Trigger program execution (execute)
Obtaining the execution environment and triggering execution can both be regarded as operations on the execution environment. The following sections look at four parts: execution environment, source, transformation, and sink.
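Before digging into each part, here is a minimal sketch of how the five steps line up in code (the class name, host, and port are illustrative assumptions, not taken from the examples below):

package cn.qz;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SkeletonJob {

    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read a source (host and port are illustrative assumptions)
        DataStreamSource<String> source = env.socketTextStream("localhost", 7777);
        // 3. Define a transformation
        SingleOutputStreamOperator<String> upper = source.map(String::toUpperCase);
        // 4. Define the sink
        upper.print();
        // 5. Trigger execution
        env.execute("skeleton-job");
    }
}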
1. Execution environment
1. Creating the execution environment
The first step in writing a Flink program is to create an execution environment. What we need to obtain is a StreamExecutionEnvironment.
There are three ways to create an environment:
1. Creating a local environment
LocalStreamEnvironment localEnvironment = StreamExecutionEnvironment.createLocalEnvironment();
// The default parallelism is the number of CPU cores
// localEnvironment.setParallelism(3);
Looking at the source code, the default parallelism is the number of CPU cores: org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#createLocalEnvironment()
private static int defaultLocalParallelism = Runtime.getRuntime().availableProcessors();

public static LocalStreamEnvironment createLocalEnvironment() {
    return createLocalEnvironment(defaultLocalParallelism);
}
A local environment can also be created together with a web UI:
(1) Add the web dependency to the pom
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
(2) Create the web UI in code; the default port is 8081
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
public static StreamExecutionEnvironment createLocalEnvironmentWithWebUI(Configuration conf) {
    checkNotNull(conf, "conf");

    if (!conf.contains(RestOptions.PORT)) {
        // explicitly set this option so that it's not set to 0 later
        conf.setInteger(RestOptions.PORT, RestOptions.PORT.defaultValue());
    }

    return createLocalEnvironment(conf);
}
2. Creating a remote environment
package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (remote environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.createRemoteEnvironment("192.168.13.107", 8081, "E:/study-flink-1.0-SNAPSHOT.jar");
        // 2. Read from a socket
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("192.168.13.107", 7777);
        // 3. Transform the data
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                }).slotSharingGroup("1")
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas with generic types lose type information to erasure, so the type must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        // 7. Execute
        executionEnvironment.execute();
    }
}
In this case the job is actually submitted to the remote address. The location of the jar must be specified, and the code in between cannot be omitted: there must be operators for the job to execute.
3. getExecutionEnvironment chooses the environment automatically (the API generally used)
This method chooses based on the current context: if the program runs in a cluster it returns the cluster environment; otherwise it creates a local environment. See the source at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#getExecutionEnvironment(org.apache.flink.configuration.Configuration):
public static StreamExecutionEnvironment getExecutionEnvironment(Configuration configuration) {
    return Utils.resolveFactory(threadLocalContextEnvironmentFactory, contextEnvironmentFactory)
            .map(factory -> factory.createExecutionEnvironment(configuration))
            .orElseGet(() -> StreamExecutionEnvironment.createLocalEnvironment(configuration));
}
2. Execution mode
The default execution mode of the DataStream API is streaming (STREAMING). Batch mode (BATCH) can also be chosen, and automatic mode (AUTOMATIC) selects the execution mode based on whether the source is bounded. See the class org.apache.flink.api.common.RuntimeExecutionMode. Note that the mode cannot be chosen arbitrarily: an unbounded stream cannot run in BATCH mode.
STREAMING is the default, so the focus here is on how to configure BATCH mode.
(1) Specify it on the command line when submitting the job
bin/flink run -Dexecution.runtime-mode=BATCH ...
(2) Specify it in code
executionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
Test:
File contents:
hello world
hello flink
hello java
java nb
java pl
For example, with the following program:
package cn.qz;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class BoundedStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the file
        DataStreamSource<String> txtDataSource = executionEnvironment.readTextFile("file/words.txt");
        // 3. Transform the data
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas with generic types lose type information to erasure, so the type must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum (sum, min, max accept either a field name or a field position)
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        // 7. Execute
        executionEnvironment.execute();
    }
}
Default result (STREAMING):
2> (java,1)
2> (pl,1)
4> (nb,1)
3> (hello,1)
5> (world,1)
2> (java,2)
3> (hello,2)
3> (hello,3)
2> (java,3)
7> (flink,1)
BATCH result:
7> (flink,1)
3> (hello,3)
2> (pl,1)
2> (java,3)
5> (world,1)
4> (nb,1)
AUTOMATIC result:
4> (nb,1)
7> (flink,1)
2> (pl,1)
3> (hello,3)
2> (java,3)
5> (world,1)
2. When to choose BATCH
A simple rule of thumb: use BATCH for bounded data and STREAMING for unbounded data. With bounded data, batch mode produces the result more efficiently; with unbounded data, STREAMING is the only option.
3. Program execution
Flink is event-driven: computation only happens when data actually arrives, which is also known as deferred or lazy execution. We must explicitly call the execute() method to trigger program execution. execute() blocks until the job finishes and then returns a JobExecutionResult (which contains the JobID and other information).
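A small sketch of inspecting that result (the class and job names are illustrative assumptions); executeAsync() is the non-blocking counterpart that returns a JobClient immediately:

package cn.qz;

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecuteResultDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").print();

        // execute() blocks until the job finishes and returns a JobExecutionResult
        JobExecutionResult result = env.execute("execute-result-demo");
        System.out.println("JobID: " + result.getJobID());
        System.out.println("Net runtime (ms): " + result.getNetRuntime());

        // A non-blocking alternative is executeAsync(), which returns a JobClient right away:
        // JobClient client = env.executeAsync("execute-result-demo");
    }
}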
2. Source
Flink can ingest data from a variety of sources and build a DataStream from them for transformation. The input is generally called the data source, and the operator that reads it is the source operator. The generic way to add a source in Flink is to call the execution environment's addSource method, whose argument is a SourceFunction and whose return value is a DataStreamSource.
1. Custom data sources
Reference: org.apache.flink.streaming.api.functions.source.FromElementsFunction
Two core methods:
run: emits data downstream through the SourceContext.
cancel: uses a flag to break out of the loop and thereby stop the source.
1. Non-parallel (serial) data source
The source:
package cn.qz.source;

import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

@Slf4j
public class MySourceFunction implements SourceFunction<String> {

    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        log.info("cn.qz.source.MySourceFunction.run start");
        for (int index = 0; index < 100 && isRunning; index++) {
            Thread.sleep(2 * 1000);
            ctx.collect(RandomStringUtils.randomAlphanumeric(4));
        }
        log.info("cn.qz.source.MySourceFunction.run end");
    }

    /**
     * Called when the job is cancelled from the web UI
     */
    @Override
    public void cancel() {
        isRunning = false;
        log.info("cn.qz.source.MySourceFunction.cancel");
    }
}
The test class:
package cn.qz.source;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        executionEnvironment.setParallelism(1);

        DataStreamSource<String> stringDataStreamSource = executionEnvironment.addSource(new MySourceFunction());
        stringDataStreamSource.print();

        executionEnvironment.execute();
    }
}
The parallelism of this kind of source can only be 1; otherwise the following error is thrown:
Exception in thread "main" java.lang.IllegalArgumentException: The parallelism of non parallel operator must be 1.
	at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
	at org.apache.flink.api.common.operators.util.OperatorValidationUtils.validateParallelism(OperatorValidationUtils.java:35)
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:114)
2. Parallel data source
The source:
package cn.qz.source;

import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

@Slf4j
public class MySourceFunction2 implements ParallelSourceFunction<User> {

    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<User> ctx) throws Exception {
        log.info("cn.qz.source.MySourceFunction2.run start");
        for (int index = 0; index < 3 && isRunning; index++) {
            String s = RandomStringUtils.randomAlphabetic(4);
            User user = new User(s, s, 100 + index);
            ctx.collect(user);
        }
        log.info("cn.qz.source.MySourceFunction2.run end");
    }

    @Override
    public void cancel() {
        log.info("cn.qz.source.MySourceFunction2.run cancel");
        isRunning = false;
    }
}
The test class:
package cn.qz.source;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomSource2 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        executionEnvironment.setParallelism(1);

        DataStreamSource<User> stringDataStreamSource = executionEnvironment.addSource(new MySourceFunction2());
        stringDataStreamSource.setParallelism(3);
        stringDataStreamSource.print();

        executionEnvironment.execute();
    }
}
Result:
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (1/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run start
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (3/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run start
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (2/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run start
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (3/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run end
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (1/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run end
2022-06-23 19:28:36 [Legacy Source Thread - Source: Custom Source (2/3)#0] [cn.qz.source.MySourceFunction2]-[INFO] cn.qz.source.MySourceFunction2.run end
User(username=jCxn, fullname=jCxn, age=100)
User(username=SNjM, fullname=SNjM, age=101)
User(username=Wfwj, fullname=Wfwj, age=102)
User(username=bMnT, fullname=bMnT, age=100)
User(username=YkrF, fullname=YkrF, age=101)
User(username=CXsf, fullname=CXsf, age=102)
User(username=STOo, fullname=STOo, age=100)
User(username=kuCI, fullname=kuCI, age=101)
User(username=EJMx, fullname=EJMx, age=102)
2. Source from a collection
The POJO:
package cn.qz.source;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * User info. Requirements for a Flink POJO:
 * 1. the class is public
 * 2. it has a public no-arg constructor
 * 3. all fields are public (or accessible via getters/setters)
 * 4. the types of all fields are serializable
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class User {

    public String username;
    public String fullname;
    public int age;
}
Test code:
package cn.qz.source;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class Client1 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromCollection(listUsers());
        dataStreamSource.print();

        executionEnvironment.execute();
    }

    private static List<User> listUsers() {
        List<User> users = new ArrayList<>();
        users.add(new User("zs", "张三", 22));
        users.add(new User("ls", "李四", 23));
        users.add(new User("ww", "王五", 21));
        return users;
    }
}
3. Reading from a file
package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class BoundedStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the file
        DataStreamSource<String> txtDataSource = executionEnvironment.readTextFile("file/words.txt");
        // 3. Transform the data
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas with generic types lose type information to erasure, so the type must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum (sum, min, max accept either a field name or a field position)
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        // 7. Execute
        executionEnvironment.execute();
    }
}
4. Reading from a socket
package cn.qz;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment (stream execution environment)
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read from a socket
        DataStreamSource<String> txtDataSource = executionEnvironment.socketTextStream("localhost", 7777);
        // 3. Transform the data
        SingleOutputStreamOperator<Tuple2<String, Long>> singleOutputStreamOperator = txtDataSource
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)); // lambdas with generic types lose type information to erasure, so the type must be declared explicitly
        // 4. Group by key
        KeyedStream<Tuple2<String, Long>, String> tuple2StringKeyedStream = singleOutputStreamOperator.keyBy(t -> t.f0);
        // 5. Sum
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = tuple2StringKeyedStream.sum(1);
        // 6. Print
        sum.print();
        // 7. Execute
        executionEnvironment.execute();
    }
}
5. Supported data types
To make data handling convenient, Flink has a complete type system. Flink uses "type information" (TypeInformation) to represent data types uniformly. TypeInformation is the base class of all type descriptors in Flink. It captures basic properties of a type and generates a dedicated serializer, deserializer, and comparator for each data type.
1. Data types supported by Flink
Simply put, Flink supports all common Java and Scala data types. Internally, Flink groups the supported types into categories, which can be found in the org.apache.flink.api.common.typeinfo.Types utility class:
(1) Basic types
All Java primitive types and their wrapper classes, plus Void, String, Date, BigDecimal, and BigInteger.
(2) Array types
Primitive arrays (PRIMITIVE_ARRAY) and object arrays (OBJECT_ARRAY).
(3) Composite types
Java tuples (TUPLE): Flink's built-in tuple types, part of the Java API. At most 25 fields, i.e. Tuple0 through Tuple25; null fields are not supported.
Scala case classes and Scala tuples: null fields are not supported.
Row type (ROW): can be thought of as a tuple with an arbitrary number of fields, with support for null fields.
POJO classes: classes that follow Flink's JavaBean-like conventions. Requirements:
the class is public and standalone (no non-static inner class);
the class has a public no-arg constructor;
all fields are public and non-final, or have public getters and setters that follow JavaBean naming conventions.
(4) Auxiliary types
Option, Either, List, Map, and so on.
(5) Generic types
Types not covered above are treated by Flink as generic types. Flink handles generic types as black boxes: it cannot access their internal fields, and they are serialized not by Flink itself but by Kryo.
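As a rough sketch of how some of these categories are expressed with the Types utility class (the class name is an assumption; User is the POJO defined earlier):

package cn.qz;

import cn.qz.source.User;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.types.Row;

public class TypesDemo {

    public static void main(String[] args) {
        TypeInformation<String> basic = Types.STRING;                                        // basic type
        TypeInformation<?> primitiveArray = Types.PRIMITIVE_ARRAY(Types.INT);                // primitive array (int[])
        TypeInformation<Tuple2<String, Long>> tuple = Types.TUPLE(Types.STRING, Types.LONG); // Java tuple
        TypeInformation<Row> row = Types.ROW(Types.STRING, Types.INT);                       // row type, null fields allowed
        TypeInformation<User> pojo = Types.POJO(User.class);                                 // POJO type

        System.out.println(basic + ", " + primitiveArray + ", " + tuple + ", " + row + ", " + pojo);
    }
}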
2. Type hints
Flink also has a type extraction system that analyzes a function's input and return types to obtain type information automatically, and with it the serializers and deserializers. Because of type erasure, however, the type information sometimes has to be supplied explicitly for the program to work correctly or to perform well. To solve this, the Java API provides type hints.
(1) Option one:
.returns(Types.TUPLE(Types.STRING, Types.LONG))
(2) Option two:
.returns(new TypeHint<Tuple2<String, Long>>() {})
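A minimal sketch of where the hint attaches (the class name is an assumption): the lambda's Tuple2 output loses its type parameters to erasure, so without the returns(...) call Flink typically fails with an InvalidTypesException.

package cn.qz;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TypeHintDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("hello", "flink")
                .map(word -> Tuple2.of(word, 1L))
                // either hint style works; without it the generic output type of the lambda cannot be extracted
                .returns(new TypeHint<Tuple2<String, Long>>() {})
                .print();

        env.execute();
    }
}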
3. Transformation operators
The core of a Flink program is its transformations; they define the processing logic. The following sections cover the common transformation operators.
1. Common transformations
map / filter / flatMap
package cn.qz.transformation;

import cn.qz.source.User;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class MapTrans {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs", "张三", 22),
                new User("ls", "李四", 23),
                new User("ww", "王五", 21),
                new User("ww", "王五", 22),
                new User("zl", "赵六", 27)
        );

        // map: one-to-one mapping
        SingleOutputStreamOperator<String> map = dataStreamSource.map(User::getUsername);

        // filter
        SingleOutputStreamOperator<String> filter = map.filter(str -> "ww".equals(str));

        // flatMap: splits a whole (usually a collection-like value) into individual elements.
        // It takes a FlatMapFunction: the first type parameter is the input type, the second the output type; the Collector collects the results.
        SingleOutputStreamOperator<String> stringSingleOutputStreamOperator = filter.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                out.collect(value);
            }
        });

        executionEnvironment.execute();
    }
}
2. Aggregations
(1) Keying the stream (keyBy)
keyBy is an operator that must be used before any aggregation. By specifying a key, keyBy partitions the data into different partitions. These partitions are in fact the parallel subtasks, i.e. they map to different task slots.
Internally, keyBy computes the hash code of the key and takes it modulo the number of partitions. If the key is a POJO, it must therefore override hashCode().
Note that the result of keyBy is no longer a plain DataStream: keyBy turns the DataStream into a KeyedStream (a keyed, or partitioned, stream). KeyedStream extends DataStream, so its operations are still based on the DataStream API. It is an important data structure: it enables subsequent aggregations such as sum and reduce, and it also partitions the subtask's state by key, so state is located per key and only applies to the current key. keyBy can take field positions (int) or field names (String) as varargs, or a KeySelector.
// Using a method reference
KeyedStream<User, String> userStringKeyedStream = dataStreamSource.keyBy(User::getUsername);

// Passing a KeySelector
/*
KeyedStream<User, String> userStringKeyedStream = dataStreamSource.keyBy(new KeySelector<User, String>() {
    @Override
    public String getKey(User value) throws Exception {
        return value.getUsername();
    }
});
*/
(2) Simple aggregations
sum: sums the specified field over the input stream
min: computes the minimum of the specified field over the input stream
max: computes the maximum of the specified field over the input stream
minBy: similar to min. The difference is that min only computes the minimum of the specified field and keeps the other fields from the first record seen, whereas minBy returns the whole record that contains the minimum value of the field
maxBy: similar to max; the difference is the same as above
These methods need the field to be specified, either by position or by name. (For POJO types only the field name can be used, not the position.)
package cn.qz.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MapTrans {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Tuple2> dataStreamSource = executionEnvironment.fromElements(
                Tuple2.of("a", 1),
                Tuple2.of("b", 5),
                Tuple2.of("c", 3),
                Tuple2.of("d", 4),
                Tuple2.of("d", 7)
        );

        // dataStreamSource.keyBy(r -> r.f0).sum(1).print();
        /**
         * 5> (d,4)
         * 5> (d,11)
         * 4> (c,3)
         * 2> (b,5)
         * 6> (a,1)
         */

        // dataStreamSource.keyBy(r -> r.f0).sum("f1").print(); // same result as above

        // dataStreamSource.keyBy(r -> r.f0).maxBy("f1").print();
        /**
         * 4> (c,3)
         * 2> (b,5)
         * 5> (d,4)
         * 6> (a,1)
         * 5> (d,7)
         */

        // dataStreamSource.keyBy(r -> r.f0).max(1).print();
        /**
         * 4> (c,3)
         * 5> (d,4)
         * 2> (b,5)
         * 5> (d,7)
         * 6> (a,1)
         */

        // dataStreamSource.keyBy(r -> r.f0).min(1).print();
        /**
         * 6> (a,1)
         * 2> (b,5)
         * 5> (d,4)
         * 4> (c,3)
         * 5> (d,4)
         */

        dataStreamSource.keyBy(r -> r.f0).minBy(1).print();
        /**
         * 2> (b,5)
         * 6> (a,1)
         * 5> (d,4)
         * 4> (c,3)
         * 5> (d,4)
         */

        executionEnvironment.execute();
    }
}
An aggregation operator keeps an aggregated value for every key; in Flink this is called "state". So whenever a new record arrives, the operator updates the stored aggregation result and sends an event with the updated value to the downstream operators.
(3) Reduce aggregation (reduce)
@Public
@FunctionalInterface
public interface ReduceFunction<T> extends Function, Serializable {

    /**
     * The core method of ReduceFunction, combining two values into one value of the same type. The
     * reduce function is consecutively applied to all values of a group until only a single value
     * remains.
     *
     * @param value1 The first value to combine.
     * @param value2 The second value to combine.
     * @return The combined value of both input values.
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the
     *     operation to fail and may trigger recovery.
     */
    T reduce(T value1, T value2) throws Exception;
}
package cn.qz.transformation;

import cn.qz.source.User;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Reduce {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        executionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs", "张三", 22),
                new User("ls", "李四", 23),
                new User("ww", "王五1", 21),
                new User("ww", "王五", 22),
                new User("zl", "赵六", 27)
        );

        dataStreamSource
                // First grouping: group by username
                .keyBy(User::getUsername)
                // First reduce: sum the ages within each group; the sum is written onto the first user of the group
                .reduce(new ReduceFunction<User>() {
                    @Override
                    public User reduce(User value1, User value2) throws Exception {
                        value1.setAge(value1.getAge() + value2.getAge());
                        return value1;
                    }
                })
                // Assign the same key to every record so they all end up in one group
                .keyBy(r -> true)
                // Find the user with the maximum age; maxBy could be used
                // .maxBy("age")
                // or the same can be done with another reduce
                .reduce((User u1, User u2) -> {
                    if (u1.getAge() > u2.getAge()) {
                        return u1;
                    }
                    return u2;
                })
                // Print
                .print();

        executionEnvironment.execute();
    }
}
3. User-defined functions (UDF: user-defined function)
The programming style of Flink's DataStream API is consistent throughout: everything is named operator name + Function, for example SourceFunction, MapFunction, FilterFunction, and so on, all of which extend Function (an empty interface).
For example:
package cn.qz.transformation;

import cn.qz.source.User;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UDFTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(1);
        executionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs", "张三", 22),
                new User("ls", "李四", 23),
                new User("ww", "王五1", 21),
                new User("ww", "王五", 22),
                new User("zl", "赵六", 27)
        );

        // the constructor argument is the username that should be masked
        dataStreamSource.map(new MyMapFunction("ww")).print();

        executionEnvironment.execute();
    }

    private static class MyMapFunction implements MapFunction<User, String> {

        private String username;

        public MyMapFunction(String username) {
            this.username = username;
        }

        @Override
        public String map(User value) throws Exception {
            if (value.getUsername().equals(username)) {
                return "******";
            }
            return value.getUsername();
        }
    }
}
4. Rich function classes
Rich function classes are also function interfaces provided by the DataStream API; every Flink function class has a Rich version. Rich function classes come as abstract classes, for example RichMapFunction, RichReduceFunction, and so on. The main difference from the regular function classes is that a rich function can access the runtime context and has lifecycle methods, so it can implement more complex functionality.
RichFunction has the notion of a lifecycle. The typical lifecycle methods are:
open: the initialization method of a RichFunction, which starts the operator's lifecycle. It is called before the actual working method (e.g. map or filter) is invoked. It can be used for things like opening database connections.
close: the last method called in the lifecycle, generally used for cleanup.
package cn.qz.transformation;

import cn.qz.source.User;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RichMapTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(2);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs", "张三", 22),
                new User("ls", "李四", 23),
                new User("ww", "王五1", 21),
                new User("ww", "王五", 22),
                new User("zl", "赵六", 27)
        );

        SingleOutputStreamOperator<String> map = dataStreamSource.map(new RichMapFunction<User, String>() {
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                System.out.println("任务索引为(start):" + getRuntimeContext().getIndexOfThisSubtask());
            }

            @Override
            public String map(User value) throws Exception {
                return value.getUsername();
            }

            @Override
            public void close() throws Exception {
                super.close();
                System.out.println("任务索引为(close):" + getRuntimeContext().getIndexOfThisSubtask());
            }
        });
        map.print();

        executionEnvironment.execute();
    }
}
Result:
任务索引为(start):1
任务索引为(start):0
2> ls
1> zs
2> ww
1> ww
1> zl
任务索引为(close):0
任务索引为(close):1
4. Sinks
Every operator can be implemented with a custom function, so in principle interaction with an external system could be coded into any operator as long as a client for that system is available. For example, a RichMapFunction could open the connection in open() and release resources in close(). The drawback of this approach is that Flink's consistency checkpoints cannot be leveraged effectively, which makes recovery after a failure difficult.
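A sketch of that pattern for illustration only (the class name, JDBC URL, and credentials are assumptions; the custom-sink example later in this article shows the same idea as a proper sink):

package cn.qz.sink;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.sql.Connection;
import java.sql.DriverManager;

// Opens an external connection in open() and releases it in close();
// note that writes done this way bypass Flink's checkpointing guarantees.
public class ExternalLookupMapFunction extends RichMapFunction<String, String> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/flink", "root", "123456");
    }

    @Override
    public String map(String value) throws Exception {
        // interact with the external system here, e.g. run a lookup query using 'connection'
        return value;
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (connection != null) {
            connection.close();
        }
    }
}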
Flink's DataStream API provides a dedicated method for writing data to external systems: addSink. It takes a SinkFunction as its argument. The familiar print(), for example, is implemented with a SinkFunction.
The source of org.apache.flink.streaming.api.datastream.DataStream#print() is as follows:
public DataStreamSink<T> print() {
    PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
    return addSink(printFunction).name("Print to Std. Out");
}

package org.apache.flink.api.common.functions.util;

import org.apache.flink.annotation.Internal;

import java.io.PrintStream;
import java.io.Serializable;

/** Print sink output writer for DataStream and DataSet print API. */
@Internal
public class PrintSinkOutputWriter<IN> implements Serializable {

    private static final long serialVersionUID = 1L;

    private static final boolean STD_OUT = false;
    private static final boolean STD_ERR = true;

    private final boolean target;
    private transient PrintStream stream;
    private final String sinkIdentifier;
    private transient String completedPrefix;

    public PrintSinkOutputWriter() {
        this("", STD_OUT);
    }

    public PrintSinkOutputWriter(final boolean stdErr) {
        this("", stdErr);
    }

    public PrintSinkOutputWriter(final String sinkIdentifier, final boolean stdErr) {
        this.target = stdErr;
        this.sinkIdentifier = (sinkIdentifier == null ? "" : sinkIdentifier);
    }

    public void open(int subtaskIndex, int numParallelSubtasks) {
        // get the target stream
        stream = target == STD_OUT ? System.out : System.err;

        completedPrefix = sinkIdentifier;

        if (numParallelSubtasks > 1) {
            if (!completedPrefix.isEmpty()) {
                completedPrefix += ":";
            }
            completedPrefix += (subtaskIndex + 1);
        }

        if (!completedPrefix.isEmpty()) {
            completedPrefix += "> ";
        }
    }

    public void write(IN record) {
        stream.println(completedPrefix + record.toString());
    }

    @Override
    public String toString() {
        return "Print to " + (target == STD_OUT ? "System.out" : "System.err");
    }
}

// The output is then handed to java.io.PrintStream
For streaming systems such as Kafka, Flink offers full integration: connectors exist for both source and sink, so data can be read and written. For Elasticsearch, the file system, JDBC, and the like, only sink (write) connectors are provided. In addition, Apache Bahir offers, for example, a Redis sink and an ActiveMQ source. Beyond those, a custom sink connector has to be implemented.
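For example, reading from Kafka looks roughly like the sketch below, assuming the flink-connector-kafka_${scala.binary.version} dependency is on the classpath; the class name, topic, broker address, and group id are illustrative assumptions:

package cn.qz.source;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaSourceDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka consumer properties; the addresses and group id are assumptions
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "flink-demo");

        // FlinkKafkaConsumer is used as the source; the topic name is an assumption
        env.addSource(new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), properties))
                .print();

        env.execute();
    }
}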
1. Writing to files
Flink originally provided simple file output methods such as writeAsText() and writeAsCsv(), but those sinks could only run with a parallelism of 1 and offered no guarantees for failure recovery, so they have been deprecated. The replacement is the StreamingFileSink:
public class StreamingFileSink<IN> extends RichSinkFunction<IN> implements CheckpointedFunction, CheckpointListener {
Test code:
package cn.qz.sink;

import cn.qz.source.User;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.util.concurrent.TimeUnit;

public class FileSink {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(2);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs1", "张三", 22),
                new User("ls2", "李四", 23),
                new User("ww3", "王五1", 21),
                new User("ww4", "王五", 22),
                new User("ww5", "王五", 22),
                new User("ww6", "王五", 22),
                new User("ww7", "王五", 22),
                new User("zl8", "赵六", 27)
        );

        StreamingFileSink<String> sink = StreamingFileSink
                // specify the output directory and the string encoder
                .<String>forRowFormat(new Path("./output"), new SimpleStringEncoder<>("UTF-8"))
                /**
                 * Rolling policy: a new part file is started when any of the following holds:
                 * - the file contains at least 15 minutes of data
                 * - no new data has been received for the last 5 minutes
                 * - the file size has reached 1 GB
                 */
                .withRollingPolicy(DefaultRollingPolicy.builder()
                        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                        .withMaxPartSize(1024 * 1024 * 1024)
                        .build())
                .build();

        dataStreamSource.map(User::toString).addSink(sink);

        executionEnvironment.execute();
    }
}
Result: two files are produced, one per parallel subtask (task slot), with the records split evenly between them.
To write everything into a single file, global() can be used to bring the partition count down to 1 before the sink:
dataStreamSource.map(User::toString).global().addSink(sink);
2. Writing to Redis
(1) Add to the pom
<!-- redis -->
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
(2) Test code
package cn.qz.sink;

import cn.qz.source.User;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

/**
 * @author 乔利强
 * @date 2022/6/30 19:42
 * @description
 */
public class RedisSinkTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(2);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs1", "张三", 22),
                new User("ls2", "李四", 23),
                new User("ww3", "王五1", 21),
                new User("ww4", "王五", 22),
                new User("ww5", "王五", 22),
                new User("ww6", "王五", 22),
                new User("ww7", "王五", 22),
                new User("zl8", "赵六", 27)
        );

        // create the Redis connection configuration
        FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder()
                .setHost("localhost")
                .build();

        dataStreamSource.addSink(new RedisSink<User>(conf, new MyRedisMapper()));

        executionEnvironment.execute();
    }

    private static class MyRedisMapper implements RedisMapper<User> {

        /**
         * Describes the Redis command to use: store all users into the "users" hash via HSET
         */
        @Override
        public RedisCommandDescription getCommandDescription() {
            return new RedisCommandDescription(RedisCommand.HSET, "users");
        }

        /**
         * The key within the hash
         */
        @Override
        public String getKeyFromData(User data) {
            return data.getUsername();
        }

        /**
         * The value to store
         */
        @Override
        public String getValueFromData(User data) {
            return data.getFullname();
        }
    }
}
(3) Check in Redis
127.0.0.1:6379> keys *
1) "users"
127.0.0.1:6379> type users
hash
127.0.0.1:6379> hgetall users
 1) "ls2"
 2) "\xe6\x9d\x8e\xe5\x9b\x9b"
 3) "zs1"
 4) "\xe5\xbc\xa0\xe4\xb8\x89"
 5) "ww3"
 6) "\xe7\x8e\x8b\xe4\xba\x941"
 7) "ww4"
 8) "\xe7\x8e\x8b\xe4\xba\x94"
 9) "ww5"
10) "\xe7\x8e\x8b\xe4\xba\x94"
11) "ww6"
12) "\xe7\x8e\x8b\xe4\xba\x94"
13) "ww7"
14) "\xe7\x8e\x8b\xe4\xba\x94"
15) "zl8"
16) "\xe8\xb5\xb5\xe5\x85\xad"
3. Writing to MySQL
(1) Add to the pom
<!-- mysql -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
(2) Test code
package cn.qz.sink;

import cn.qz.source.User;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MysqlSink {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(2);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs1", "张三", 22),
                new User("ls2", "李四", 23),
                new User("ww3", "王五1", 21),
                new User("ww4", "王五", 22),
                new User("ww5", "王五", 22),
                new User("ww6", "王五", 22),
                new User("ww7", "王五", 22),
                new User("zl8", "赵六", 27)
        );

        dataStreamSource.addSink(
                JdbcSink.sink(
                        "INSERT INTO user (username, fullname, age) VALUES (?, ?, ?)",
                        (statement, r) -> {
                            statement.setString(1, r.getUsername());
                            statement.setString(2, r.getFullname());
                            statement.setInt(3, r.getAge());
                        },
                        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                                .withUrl("jdbc:mysql://localhost:3306/flink")
                                .withDriverName("com.mysql.jdbc.Driver")
                                .withUsername("root")
                                .withPassword("123456")
                                .build()
                )
        );

        executionEnvironment.execute();
    }
}
4. Writing to MySQL with a custom sink
package cn.qz.sink;

import cn.qz.source.User;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CustomSink {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setParallelism(2);

        DataStreamSource<User> dataStreamSource = executionEnvironment.fromElements(
                new User("zs1", "张三", 22),
                new User("ls2", "李四", 23),
                new User("ww3", "王五1", 21),
                new User("ww4", "王五", 22),
                new User("ww5", "王五", 22),
                new User("ww6", "王五", 22),
                new User("ww7", "王五", 22),
                new User("zl8", "赵六", 27)
        );

        dataStreamSource.addSink(new RichSinkFunction<User>() {

            Connection connection = null;
            Statement statement = null;
            private volatile int age = 0;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                Class<?> aClass = Class.forName("com.mysql.jdbc.Driver");
                connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/flink", "root", "123456");
                statement = connection.createStatement();
                System.out.println(Thread.currentThread().getName() + "\topen");
            }

            @Override
            public void invoke(User value, Context context) throws Exception {
                super.invoke(value, context);
                ++age;
                statement.execute(String.format("INSERT INTO USER VALUES('%s', '%s', %s)", "username" + age, "用户全名" + age, age));
            }

            @Override
            public void close() throws Exception {
                super.close();
                statement.close();
                connection.close();
                System.out.println(Thread.currentThread().getName() + "\tclose");
            }
        }).setParallelism(2);

        executionEnvironment.execute();
    }
}
Result:
Sink: Unnamed (2/2)#0	open
Sink: Unnamed (1/2)#0	open
Sink: Unnamed (2/2)#0	close
Sink: Unnamed (1/2)#0	close