1. Background
The previous post demonstrated the split operator. Today we look at its inverse: merge. The merge operator combines multiple real-time message streams into a single stream.
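Before diving into the full project, here is a minimal sketch of what merging looks like in the Kafka Streams DSL. The topic names, String values, and class name here are placeholders for illustration, not the ones used later in this post:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class MergePreview {
    public static Topology buildPreviewTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        // Two independent source streams read from two input topics
        KStream<String, String> left = builder.stream("input-a");
        KStream<String, String> right = builder.stream("input-b");
        // merge() interleaves the records of both streams into one KStream;
        // it gives no ordering guarantee between records from different source streams
        KStream<String, String> merged = left.merge(right);
        merged.to("output");
        return builder.build();
    }
}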
2. Demo Overview
Suppose we have multiple Kafka topics, each holding songs of a particular genre, such as rock and classical. In this post we demonstrate how to use Kafka Streams to merge these songs into a single Kafka topic. As before, we use Protocol Buffers to serialize and deserialize the songs. You can roughly think of a song as being represented in the following format:
{"artist": "Metallica", "title": "Fade to Black"}
{"artist": "Smashing Pumpkins", "title": "Today"}
3. Initialize the Project
First, create the project directory:
$ mkdir merge-streams
$ cd merge-streams/
4. Configure the Project
Create the Gradle build file, build.gradle:
buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
    }
}

plugins {
    id 'java'
    id "com.google.protobuf" version "0.8.10"
}

apply plugin: 'com.github.johnrengelman.shadow'

repositories {
    mavenCentral()
    jcenter()

    maven {
        url 'http://packages.confluent.io/maven'
    }
}

group 'huxihx.kafkastreams'

sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'

dependencies {
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.kafka:kafka-streams:2.3.0'
    implementation 'com.google.protobuf:protobuf-java:3.9.1'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

protobuf {
    generatedFilesBaseDir = "$projectDir/src/"
    protoc {
        artifact = 'com.google.protobuf:protoc:3.0.0'
    }
}

jar {
    manifest {
        attributes(
                'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                'Main-Class': 'huxihx.kafkastreams.MergeStreams'
        )
    }
}

shadowJar {
    archiveName = "kstreams-transform-standalone-${version}.${extension}"
}
Save the file, then run the following command to download the Gradle wrapper:
$ gradle wrapper
Next, create a folder named configuration under the merge-streams directory to hold our configuration file, dev.properties:
application.id=merging-app
bootstrap.servers=localhost:9092

input.rock.topic.name=rock-song-events
input.rock.topic.partitions=1
input.rock.topic.replication.factor=1

input.classical.topic.name=classical-song-events
input.classical.topic.partitions=1
input.classical.topic.replication.factor=1

output.topic.name=all-song-events
output.topic.partitions=1
output.topic.replication.factor=1
Here we configure two input topics, one for rock songs and one for classical songs, plus an output topic that holds the merged song stream.
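If you want to sanity-check the file before wiring it up, a throwaway snippet like the one below (the class name ConfigPeek is just for illustration) reads it back with java.util.Properties, which is also how the main program will load it later:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ConfigPeek {
    public static void main(String[] args) throws IOException {
        Properties envProps = new Properties();
        // Load the configuration file created above
        try (FileInputStream input = new FileInputStream("configuration/dev.properties")) {
            envProps.load(input);
        }
        System.out.println(envProps.getProperty("input.rock.topic.name"));       // rock-song-events
        System.out.println(envProps.getProperty("input.classical.topic.name"));  // classical-song-events
        System.out.println(envProps.getProperty("output.topic.name"));           // all-song-events
    }
}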
5. Create the Message Schema
Next we create the schema for the topics we use. Under merge-streams, run this command to create the folder that will hold the schema:
$ mkdir -p src/main/proto
Then, in the proto folder, create a file named song_event.proto with the following content:
syntax = "proto3"; package huxihx.kafkastreams.proto; message SongEvent { string name = 1; string title = 2; }
After saving, run the gradlew build in the merge-streams directory:
$ ./gradlew build
At this point you should see the generated Java class SongEventOuterClass under merge-streams/src/main/java/huxihx/kafkastreams/proto.
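As a quick sanity check of the generated code, you can build a SongEvent with the generated builder. This is a throwaway snippet, not part of the project:

import huxihx.kafkastreams.proto.SongEventOuterClass.SongEvent;

public class SongEventDemo {
    public static void main(String[] args) {
        // Build an event via the builder generated from song_event.proto.
        // Note that the "name" field holds the artist.
        SongEvent song = SongEvent.newBuilder()
                .setName("Metallica")
                .setTitle("Fade to Black")
                .build();
        System.out.println(song);                      // text form: name: "Metallica" / title: "Fade to Black"
        System.out.println(song.toByteArray().length); // compact protobuf wire format
    }
}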
6. Create the Serdes
In this step we create the Serdes for the topic messages. First, run the command below to create the corresponding directory:
$ mkdir -p src/main/java/huxihx/kafkastreams/serdes
Then, inside the newly created serdes folder, create ProtobufSerializer.java:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import org.apache.kafka.common.serialization.Serializer;

public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? new byte[0] : data.toByteArray();
    }
}
Next, ProtobufDeserializer.java:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

import java.util.Map;

public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {

    private Parser<T> parser;

    // The protobuf parser is injected through the "parser" config entry (see ProtobufSerdes)
    @SuppressWarnings("unchecked")
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        parser = (Parser<T>) configs.get("parser");
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return parser.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
        }
    }
}
Finally, ProtobufSerdes.java:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

import java.util.HashMap;
import java.util.Map;

public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {

    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;

    public ProtobufSerdes(Parser<T> parser) {
        serializer = new ProtobufSerializer<>();
        deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<T>> config = new HashMap<>();
        config.put("parser", parser);
        deserializer.configure(config, false);
    }

    @Override
    public Serializer<T> serializer() {
        return serializer;
    }

    @Override
    public Deserializer<T> deserializer() {
        return deserializer;
    }
}
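To convince yourself that the Serde is lossless, a quick round-trip check (again a throwaway snippet, not part of the project) serializes a SongEvent and parses it back:

import huxihx.kafkastreams.proto.SongEventOuterClass.SongEvent;
import huxihx.kafkastreams.serdes.ProtobufSerdes;

public class SerdesRoundTrip {
    public static void main(String[] args) {
        // The constructor wires the SongEvent parser into the deserializer
        ProtobufSerdes<SongEvent> serdes = new ProtobufSerdes<>(SongEvent.parser());

        SongEvent in = SongEvent.newBuilder().setName("Pink Floyd").setTitle("Time").build();
        byte[] bytes = serdes.serializer().serialize("any-topic", in);
        SongEvent out = serdes.deserializer().deserialize("any-topic", bytes);

        System.out.println(in.equals(out)); // true: serialize + deserialize round-trips cleanly
    }
}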
7. Develop the Main Flow
Create MergeStreams.java under src/main/java/huxihx/kafkastreams:
package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.SongEventOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class MergeStreams {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("Config file path must be specified.");
        }

        MergeStreams app = new MergeStreams();
        Properties envProps = app.loadEnvProperties(args[0]);
        Properties streamProps = app.createStreamsProperties(envProps);
        Topology topology = app.buildTopology(envProps);

        app.preCreateTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Exception e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();

        final String rockEvents = envProps.getProperty("input.rock.topic.name");
        final String classicalEvents = envProps.getProperty("input.classical.topic.name");
        final String allEvents = envProps.getProperty("output.topic.name");

        KStream<String, SongEventOuterClass.SongEvent> rockStreams =
                builder.stream(rockEvents, Consumed.with(Serdes.String(), songEventProtobufSerdes()));
        KStream<String, SongEventOuterClass.SongEvent> classicalStreams =
                builder.stream(classicalEvents, Consumed.with(Serdes.String(), songEventProtobufSerdes()));
        KStream<String, SongEventOuterClass.SongEvent> allStreams = rockStreams.merge(classicalStreams);

        allStreams.to(allEvents, Produced.with(Serdes.String(), songEventProtobufSerdes()));
        return builder.build();
    }

    private Properties createStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

    private void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        String inputTopic1 = envProps.getProperty("input.rock.topic.name");
        String inputTopic2 = envProps.getProperty("input.classical.topic.name");
        String outputTopic = envProps.getProperty("output.topic.name");
        try (AdminClient client = AdminClient.create(config)) {
            Collection<TopicListing> existingTopics = client.listTopics().listings().get();

            List<NewTopic> topics = new ArrayList<>();
            List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
            if (!topicNames.contains(inputTopic1))
                topics.add(new NewTopic(
                        envProps.getProperty("input.rock.topic.name"),
                        Integer.parseInt(envProps.getProperty("input.rock.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.rock.topic.replication.factor"))));

            if (!topicNames.contains(inputTopic2))
                topics.add(new NewTopic(
                        envProps.getProperty("input.classical.topic.name"),
                        Integer.parseInt(envProps.getProperty("input.classical.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.classical.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic))
                topics.add(new NewTopic(
                        envProps.getProperty("output.topic.name"),
                        Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));

            if (!topics.isEmpty())
                client.createTopics(topics).all().get();
        }
    }

    private Properties loadEnvProperties(String fileName) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(fileName)) {
            envProps.load(input);
        }
        return envProps;
    }

    private static ProtobufSerdes<SongEventOuterClass.SongEvent> songEventProtobufSerdes() {
        return new ProtobufSerdes<>(SongEventOuterClass.SongEvent.parser());
    }
}
The core logic lives in the buildTopology method: we call KStream's merge method to combine the two input streams into a single output stream.
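If you would like to verify the merge logic without a running cluster, Kafka Streams ships a test driver for exactly this purpose. The sketch below rebuilds the same topology inline (buildTopology is private) and assumes you add the org.apache.kafka:kafka-streams-test-utils:2.3.0 dependency, which is not in the build.gradle above:

import huxihx.kafkastreams.proto.SongEventOuterClass.SongEvent;
import huxihx.kafkastreams.serdes.ProtobufDeserializer;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.test.ConsumerRecordFactory;

import java.util.Collections;
import java.util.Properties;

public class MergeTopologyCheck {
    public static void main(String[] args) {
        // Same topology as buildTopology, with the topic names hard-coded
        ProtobufSerdes<SongEvent> serdes = new ProtobufSerdes<>(SongEvent.parser());
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, SongEvent>stream("rock-song-events", Consumed.with(Serdes.String(), serdes))
                .merge(builder.stream("classical-song-events", Consumed.with(Serdes.String(), serdes)))
                .to("all-song-events", Produced.with(Serdes.String(), serdes));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "merge-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted by the test driver

        ProtobufDeserializer<SongEvent> valueDeserializer = new ProtobufDeserializer<>();
        valueDeserializer.configure(Collections.singletonMap("parser", SongEvent.parser()), false);

        try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
            ConsumerRecordFactory<String, SongEvent> factory =
                    new ConsumerRecordFactory<>(new StringSerializer(), new ProtobufSerializer<>());
            driver.pipeInput(factory.create("rock-song-events", null,
                    SongEvent.newBuilder().setName("Metallica").setTitle("One").build()));
            driver.pipeInput(factory.create("classical-song-events", null,
                    SongEvent.newBuilder().setName("Antonio Vivaldi").setTitle("Spring").build()));

            // Both records should come out of the single output topic
            ProducerRecord<String, SongEvent> first =
                    driver.readOutput("all-song-events", new StringDeserializer(), valueDeserializer);
            ProducerRecord<String, SongEvent> second =
                    driver.readOutput("all-song-events", new StringDeserializer(), valueDeserializer);
            System.out.println(first.value().getName());  // Metallica
            System.out.println(second.value().getName()); // Antonio Vivaldi
        }
    }
}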
8. Write the Test Producer and Consumer
As in the earlier posts of this series, we write TestProducer and TestConsumer classes under src/main/java/huxihx/kafkastreams/tests, in TestProducer.java and TestConsumer.java respectively:
TestProducer.java
package huxihx.kafkastreams.tests;

import huxihx.kafkastreams.proto.SongEventOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TestProducer {

    // Test input set 1: rock songs
    private static final List<SongEventOuterClass.SongEvent> TEST_SONG_EVENTS1 = Arrays.asList(
            SongEventOuterClass.SongEvent.newBuilder().setName("Metallica").setTitle("Fade to Black").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Smashing Pumpkins").setTitle("Today").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Pink Floyd").setTitle("Another Brick in the Wall").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Van Halen").setTitle("Jump").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Led Zeppelin").setTitle("Kashmir").build()
    );

    // Test input set 2: classical songs
    private static final List<SongEventOuterClass.SongEvent> TEST_SONG_EVENTS2 = Arrays.asList(
            SongEventOuterClass.SongEvent.newBuilder().setName("Wolfgang Amadeus Mozart").setTitle("The Magic Flute").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Johann Pachelbel").setTitle("Canon").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Ludwig van Beethoven").setTitle("Symphony No. 5").build(),
            SongEventOuterClass.SongEvent.newBuilder().setName("Edward Elgar").setTitle("Pomp and Circumstance").build()
    );

    public static void main(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("Must specify a test set (1 or 2).");
        }
        int choice = Integer.parseInt(args[0]);
        if (choice != 1 && choice != 2) {
            throw new IllegalArgumentException("Must specify a test set (1 or 2).");
        }

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, new ProtobufSerializer<SongEventOuterClass.SongEvent>().getClass());

        try (final Producer<String, SongEventOuterClass.SongEvent> producer = new KafkaProducer<>(props)) {
            if (choice == 1) {
                TEST_SONG_EVENTS1.stream()
                        .map(song -> new ProducerRecord<String, SongEventOuterClass.SongEvent>("rock-song-events", song))
                        .forEach(producer::send);
            } else {
                TEST_SONG_EVENTS2.stream()
                        .map(song -> new ProducerRecord<String, SongEventOuterClass.SongEvent>("classical-song-events", song))
                        .forEach(producer::send);
            }
        }
    }
}
TestConsumer.java
package huxihx.kafkastreams.tests;

import com.google.protobuf.Parser;
import huxihx.kafkastreams.proto.SongEventOuterClass;
import huxihx.kafkastreams.serdes.ProtobufDeserializer;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TestConsumer {

    public static void main(String[] args) {
        if (args.length < 1) {
            throw new IllegalStateException("Must specify an output topic name.");
        }

        // Configure the protobuf deserializer with the SongEvent parser
        Deserializer<SongEventOuterClass.SongEvent> deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<SongEventOuterClass.SongEvent>> config = new HashMap<>();
        config.put("parser", SongEventOuterClass.SongEvent.parser());
        deserializer.configure(config, false);

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (final Consumer<String, SongEventOuterClass.SongEvent> consumer =
                     new KafkaConsumer<>(props, new StringDeserializer(), deserializer)) {
            consumer.subscribe(Arrays.asList(args[0]));
            while (true) {
                ConsumerRecords<String, SongEventOuterClass.SongEvent> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, SongEventOuterClass.SongEvent> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
9. Test
First, build the project with the following command:
$ ./gradlew shadowJar
Then start the Kafka cluster, and run the Kafka Streams application:
$ java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties
Now open two terminals and run TestProducer with each of the two test sets to send test events:
$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer 1
$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer 2
Finally, start TestConsumer to verify that Kafka Streams has merged the two input topic streams into a single output stream:
$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer all-song-events
offset = 0, key = null, value = name: "Metallica"
title: "Fade to Black"

offset = 1, key = null, value = name: "Metallica"
title: "Fade to Black"

offset = 2, key = null, value = name: "Smashing Pumpkins"
title: "Today"

offset = 3, key = null, value = name: "Smashing Pumpkins"
title: "Today"

offset = 4, key = null, value = name: "Pink Floyd"
title: "Another Brick in the Wall"

offset = 5, key = null, value = name: "Pink Floyd"
title: "Another Brick in the Wall"

offset = 6, key = null, value = name: "Van Halen"
title: "Jump"

offset = 7, key = null, value = name: "Van Halen"
title: "Jump"

offset = 8, key = null, value = name: "Led Zeppelin"
title: "Kashmir"

offset = 9, key = null, value = name: "Led Zeppelin"
title: "Kashmir"

offset = 10, key = null, value = name: "Wolfgang Amadeus Mozart"
title: "The Magic Flute"

offset = 11, key = null, value = name: "Johann Pachelbel"
title: "Canon"

offset = 12, key = null, value = name: "Ludwig van Beethoven"
title: "Symphony No. 5"

offset = 13, key = null, value = name: "Edward Elgar"
title: "Pomp and Circumstance"