Background
In the previous post we looked at filter, the Kafka Streams operation for filtering messages. Today we demonstrate an operation that transforms message keys, again walking through a concrete example. Transforming the key means replacing the key of each record in the stream, which makes downstream groupByKey operations much more convenient.
Demo Overview
This post demonstrates selectKey, which transforms the key of each record according to a key-selection function you supply. The input topic for today uses messages in the following format:
ID | First Name | Last Name | Phone Number
For example:
3 | San | Zhang | 13910010000
Our goal is to extract the phone number prefix (e.g., 1391) as the new key of each message and write the results to a new Kafka topic.
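To make the key-selection logic concrete before we build the project, here is a minimal sketch of the extraction, using the sample record above. Note that split takes a regular expression, so the pipe separator must be escaped:

// Minimal sketch of the key extraction that selectKey will apply later.
String value = "3 | San | Zhang | 13910010000";
String[] fields = value.split("\\|");              // split on the pipe separator (regex-escaped)
String newKey = fields[fields.length - 1].trim()   // the last field is the phone number
                      .substring(0, 4);            // keep the first 4 digits, e.g. "1391"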
Initializing the Project
Create the project directory:
mkdir selectKey-streams
cd selectKey-streams/
Configuring the Project
Create a build.gradle file under the selectKey-streams directory with the following content:
buildscript {
repositories {
jcenter()
}
dependencies {
classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
}
}
plugins {
id 'java'
}
apply plugin: 'com.github.johnrengelman.shadow'
repositories {
mavenCentral()
jcenter()
maven {
url 'http://packages.confluent.io/maven'
}
}
group 'huxihx.kafkastreams'
sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'
dependencies {
implementation 'org.slf4j:slf4j-simple:1.7.26'
implementation 'org.apache.kafka:kafka-streams:2.3.0'
testCompile group: 'junit', name: 'junit', version: '4.12'
}
jar {
manifest {
attributes(
'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
'Main-Class': 'huxihx.kafkastreams.SelectKeyStreamsApp'
)
}
}
shadowJar {
archiveName = "kstreams-transform-standalone-${version}.${extension}"
}
Then run the following command to download the Gradle wrapper:
gradle wrapper
Next, create a folder named configuration under selectKey-streams to hold our configuration files:
mkdir configuration
Create a file named dev.properties in it:
application.id=selectKey-app
bootstrap.servers=localhost:9092
input.topic.name=nonkeyed-records
input.topic.partitions=1
input.topic.replication.factor=1
output.topic.name=keyed-records
output.topic.partitions=1
output.topic.replication.factor=1
Developing the Main Application
Create the src/main/java/huxihx/kafkastreams directory, and create the SelectKeyStreamsApp.java file in it:
package huxihx.kafkastreams;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

public class SelectKeyStreamsApp {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("Environment configuration file must be specified.");
        }

        SelectKeyStreamsApp app = new SelectKeyStreamsApp();
        Properties envProps = app.loadEnvProperties(args[0]);
        Properties streamProps = app.buildStreamsProperties(envProps);
        app.preCreateTopics(envProps);

        Topology topology = app.buildTopology(envProps);
        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        // Close the Streams instance cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread("streams-jvm-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Exception e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final String inputTopic = envProps.getProperty("input.topic.name");
        final String outputTopic = envProps.getProperty("output.topic.name");

        builder.stream(inputTopic, Consumed.with(Serdes.String(), Serdes.String()))
                .selectKey((noKey, value) -> {
                    // split takes a regex, so the pipe must be escaped as \\|
                    String[] fields = value.split("\\|");
                    // The last field is the phone number; its first 4 digits become the new key.
                    return fields[fields.length - 1].trim().substring(0, 4);
                })
                .to(outputTopic);
        return builder.build();
    }

    private Properties buildStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

    // Pre-create the input and output topics if they do not exist yet.
    private void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put("bootstrap.servers", envProps.getProperty("bootstrap.servers"));
        try (AdminClient client = AdminClient.create(config)) {
            Set<String> existingTopics = client.listTopics().names().get();
            List<NewTopic> topics = new ArrayList<>();

            String inputTopic = envProps.getProperty("input.topic.name");
            if (!existingTopics.contains(inputTopic)) {
                topics.add(new NewTopic(inputTopic,
                        Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));
            }

            String outputTopic = envProps.getProperty("output.topic.name");
            if (!existingTopics.contains(outputTopic)) {
                topics.add(new NewTopic(outputTopic,
                        Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));
            }
            // Block until topic creation finishes so the Streams app starts against existing topics.
            client.createTopics(topics).all().get();
        }
    }

    private Properties loadEnvProperties(String filePath) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(filePath)) {
            envProps.load(input);
        }
        return envProps;
    }
}
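If you want to sanity-check the topology offline before touching a real cluster, the following is a minimal, self-contained sketch using TopologyTestDriver. It assumes you add the test artifact org.apache.kafka:kafka-streams-test-utils:2.3.0 to the dependencies; the class name SelectKeySmokeTest is illustrative:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.test.ConsumerRecordFactory;

public class SelectKeySmokeTest {
    public static void main(String[] args) {
        // Rebuild the same selectKey topology inline so the sketch is self-contained.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("nonkeyed-records", Consumed.with(Serdes.String(), Serdes.String()))
                .selectKey((noKey, value) -> {
                    String[] fields = value.split("\\|");
                    return fields[fields.length - 1].trim().substring(0, 4);
                })
                .to("keyed-records");
        Topology topology = builder.build();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "selectKey-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // not contacted by the test driver
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        TopologyTestDriver driver = new TopologyTestDriver(topology, props);
        try {
            ConsumerRecordFactory<String, String> factory =
                    new ConsumerRecordFactory<>(new StringSerializer(), new StringSerializer());
            driver.pipeInput(factory.create("nonkeyed-records", null, "3 | San | Zhang | 13910010000"));
            ProducerRecord<String, String> out =
                    driver.readOutput("keyed-records", new StringDeserializer(), new StringDeserializer());
            System.out.println(out.key() + " -> " + out.value()); // expected key: 1391
        } finally {
            driver.close();
        }
    }
}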
Testing
First, build the project with:
./gradlew clean shadowJar
Then start the Kafka cluster, and run the Kafka Streams application:
java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties
Next, start a console consumer to check whether the keys in the output topic are really set to the phone number prefixes:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic keyed-records --property print.key=true
Finally, start a console producer and send messages in the agreed format:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic nonkeyed-records
>1 | Wang | Wu | 18601234567
>2 | Li | Si | 13901234567
>3 | Zhang | San | 13921234567
>4 | Alice | Joe | 13901234568
If everything works, you should see output like this from the console consumer:
1860 1 | Wang | Wu | 18601234567
1390 2 | Li | Si | 13901234567
1392 3 | Zhang | San | 13921234567
1390 4 | Alice | Joe | 13901234568
The leading 4 digits are the phone number prefixes we extracted. From here you can use this key for all kinds of groupBy operations, for example counting the number of users in each prefix, as sketched below.
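As a rough sketch of such a follow-up (the output topic name prefix-counts is illustrative and not part of this example; Grouped and Produced come from org.apache.kafka.streams.kstream), counting records per prefix could look like this:

// Sketch of a follow-up topology: count records per phone-number prefix.
private Topology buildPrefixCountTopology() {
    final StreamsBuilder builder = new StreamsBuilder();
    builder.stream("keyed-records", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count()                    // KTable<String, Long>: prefix -> record count
            .toStream()
            .to("prefix-counts", Produced.with(Serdes.String(), Serdes.Long()));
    return builder.build();
}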
Summary
In many scenarios we need to change the keys of the original messages to make downstream aggregations easier. This example showed how to use selectKey to transform message keys conveniently.