• Getting Started with Kafka Streams Development (6)


    1. Background

    The previous post introduced the merge operator. This one shows how to filter duplicate events out of a Kafka Streams stream so that only unique events remain.

    2. What We Will Demonstrate

    Suppose the events we want to deduplicate look like this:

    {"ip":"10.0.0.1","url":"https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html","timestamp":"2019-09-16T14:53:43+00:00"}

    As before, each event is serialized with Protocol Buffers and consists of three fields: ip + url + timestamp.

    3. Set Up the Project

    First, create the project directory:

    $ mkdir distinct-events && cd distinct-events

    Then, in the distinct-events directory, create the Gradle build file build.gradle with the following content:

    buildscript {
      
        repositories {
            jcenter()
        }
        dependencies {
            classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
        }
    }
      
    plugins {
        id 'java'
        id "com.google.protobuf" version "0.8.10"
    }
    apply plugin: 'com.github.johnrengelman.shadow'
      
      
    repositories {
        mavenCentral()
        jcenter()
      
        maven {
            url 'http://packages.confluent.io/maven'
        }
    }
      
    group 'huxihx.kafkastreams'
      
    sourceCompatibility = 1.8
    targetCompatibility = '1.8'
    version = '0.0.1'
      
    dependencies {
        implementation 'com.google.protobuf:protobuf-java:3.9.1'
        implementation 'org.slf4j:slf4j-simple:1.7.26'
        implementation 'org.apache.kafka:kafka-streams:2.3.0'
      
        testCompile group: 'junit', name: 'junit', version: '4.12'
    }
      
    protobuf {
        generatedFilesBaseDir = "$projectDir/src/"
        protoc {
            artifact = 'com.google.protobuf:protoc:3.0.0'
        }
    }
      
    jar {
        manifest {
            attributes(
                    'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                    'Main-Class': 'huxihx.kafkastreams.FindDistinctEvents'
            )
        }
    }
      
    shadowJar {
        archiveName = "kstreams-transform-standalone-${version}.${extension}"
    }

    Note that the main class we configure is huxihx.kafkastreams.FindDistinctEvents.
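
    The shadow plugin bundles the application together with all of its runtime dependencies into a single fat jar, which is what later lets us launch everything with plain java -jar / java -cp commands.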

    Save the file, then run the following command to download the Gradle wrapper:

    $ gradle wrapper

    After that, create a subdirectory named configuration under distinct-events to hold our configuration file dev.properties:

    $ mkdir configuration

    Then create dev.properties in that directory with the following content:

    application.id=find-distinct-app
    bootstrap.servers=localhost:9092
    
    input.topic.name=clicks
    input.topic.partitions=1
    input.topic.replication.factor=1
    
    output.topic.name=distinct-clicks
    output.topic.partitions=1
    output.topic.replication.factor=1
    

    Here we configure one input topic and one output topic, holding the incoming event stream and the deduplicated stream, respectively.

    4. Create the Message Schema

    Next we create the schema for the topic messages. Under distinct-events, run this command to create the folder that will hold the schema file:

    $ mkdir -p src/main/proto

    Then, in the proto folder, create a file named click.proto with the following content:

    syntax = "proto3";
      
    package huxihx.kafkastreams.proto;
      
    message Click {
        string ip = 1;
        string url = 2;
        string timestamp = 3;
    }
    

    Save it, then run gradlew in the distinct-events directory:

    $ ./gradlew build  

    You should now see the generated Java class ClickOuterClass under distinct-events/src/main/java/huxihx/kafkastreams/proto. (protoc derives the outer class name from the file name click.proto, appending "OuterClass" because the default name Click would clash with the Click message itself.)
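
    As a quick sanity check, you can exercise the generated class directly before wiring it into Kafka. A minimal sketch (ClickDemo is a throwaway name and the field values are just examples):

    import huxihx.kafkastreams.proto.ClickOuterClass;

    public class ClickDemo {
        public static void main(String[] args) throws Exception {
            // Build a Click with the generated builder
            ClickOuterClass.Click click = ClickOuterClass.Click.newBuilder()
                    .setIp("10.0.0.1")
                    .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                    .setTimestamp("2019-09-16T14:53:43+00:00")
                    .build();

            // Round-trip through the protobuf wire format
            byte[] bytes = click.toByteArray();
            ClickOuterClass.Click parsed = ClickOuterClass.Click.parseFrom(bytes);
            System.out.println(parsed);
        }
    }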

    5. Create the Serdes

    In this step we create Serdes for the topic messages. First, run the following command in the distinct-events directory to create the folder structure:

    $ mkdir -p src/main/java/huxihx/kafkastreams/serdes

    Then, in the newly created serdes folder, create ProtobufSerializer.java:

    package huxihx.kafkastreams.serdes;
      
    import com.google.protobuf.MessageLite;
    import org.apache.kafka.common.serialization.Serializer;
      
    public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
        @Override
        public byte[] serialize(String topic, T data) {
            return data == null ? new byte[0] : data.toByteArray();
        }
    }

    Next, ProtobufDeserializer.java:

    package huxihx.kafkastreams.serdes;
      
    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.protobuf.MessageLite;
    import com.google.protobuf.Parser;
    import org.apache.kafka.common.errors.SerializationException;
    import org.apache.kafka.common.serialization.Deserializer;
      
    import java.util.Map;
      
    public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {
      
        private Parser<T> parser;
      
        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            parser = (Parser<T>) configs.get("parser");
        }
      
        @Override
        public T deserialize(String topic, byte[] data) {
            try {
                return parser.parseFrom(data);
            } catch (InvalidProtocolBufferException e) {
                throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
            }
        }
    }
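
    Note that the deserializer obtains its protobuf Parser from the configs map under the "parser" key; the ProtobufSerdes class below is what supplies it when it calls deserializer.configure().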

    Finally, ProtobufSerdes.java:

    package huxihx.kafkastreams.serdes;
      
    import com.google.protobuf.MessageLite;
    import com.google.protobuf.Parser;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serializer;
      
    import java.util.HashMap;
    import java.util.Map;
      
    public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {
      
        private final Serializer<T> serializer;
        private final Deserializer<T> deserializer;
      
        public ProtobufSerdes(Parser<T> parser) {
            serializer = new ProtobufSerializer<>();
            deserializer = new ProtobufDeserializer<>();
            Map<String, Parser<T>> config = new HashMap<>();
            config.put("parser", parser);
            deserializer.configure(config, false);
        }
      
        @Override
        public Serializer<T> serializer() {
            return serializer;
        }
      
        @Override
        public Deserializer<T> deserializer() {
            return deserializer;
        }
    }
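
    With these three classes in place, a Serde for any generated protobuf type can be built from its parser. For Click the construction looks like this (it mirrors the clickProtobufSerdes() helper in the main class below):

    ProtobufSerdes<ClickOuterClass.Click> clickSerdes =
            new ProtobufSerdes<>(ClickOuterClass.Click.parser());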

    6. Develop the Main Logic

    First, create DeduplicationTransformer.java under src/main/java/huxihx/kafkastreams. This class implements the deduplication logic:

    package huxihx.kafkastreams;
    
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.KeyValueMapper;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.WindowStore;
    import org.apache.kafka.streams.state.WindowStoreIterator;
    
    /**
     * Performs deduplication based on an ID extracted from each event (the client IP in this tutorial).
     * @param <K> the type of the record key
     * @param <V> the type of the record value
     * @param <E> the type of the event ID used for deduplication
     */
    public class DeduplicationTransformer<K, V, E> implements Transformer<K, V, KeyValue<K, V>> {
    
        private static final String storeName = "eventId-store";
    
        private ProcessorContext context;
        private WindowStore<E, Long> eventIdStore;
    
        private final long leftDurationMs;
        private final long rightDurationMs;
    
        private final KeyValueMapper<K, V, E> idExtractor;
    
        DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
            if (maintainDurationPerEventInMs < 1) {
                throw new IllegalArgumentException("maintain duration per event must be >= 1");
            }
    
            leftDurationMs = maintainDurationPerEventInMs / 2;
            rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
            this.idExtractor = idExtractor;
        }
    
        @Override
        public void init(ProcessorContext context) {
            this.context = context;
            eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
        }
    
        @Override
        public KeyValue<K, V> transform(K key, V value) {
            final E eventId = idExtractor.apply(key, value);
            if (eventId == null) {
                return KeyValue.pair(key, value);
            } else {
                final KeyValue<K, V> output;
                if (isDuplicate(eventId)) {
                    output = null;
                    updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
                } else {
                    output = KeyValue.pair(key, value);
                    rememberNewEvent(eventId, context.timestamp());
                }
                return output;
            }
        }
    
        private boolean isDuplicate(final E eventId) {
            final long eventTime = context.timestamp();
            final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
                    eventId, eventTime - leftDurationMs, eventTime + rightDurationMs);
            final boolean isDuplicate = timeIterator.hasNext();
            timeIterator.close();
            return isDuplicate;
        }
    
        private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
            eventIdStore.put(eventId, newTimestamp, newTimestamp);
        }
    
        private void rememberNewEvent(final E eventId, final long timestamp) {
            eventIdStore.put(eventId, timestamp, timestamp);
        }
    
        @Override
        public void close() {
    
        }
    }
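
    A note on the time window: the constructor splits maintainDurationPerEventInMs into a left half and a right half around the current record's timestamp. With the 2-minute window configured below, both halves are 60,000 ms, so isDuplicate() asks the window store whether the same event ID has already been seen within one minute on either side of the record's timestamp; duplicates refresh that stored timestamp to keep the entry from expiring.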

    Then create FindDistinctEvents.java under src/main/java/huxihx/kafkastreams:

    package huxihx.kafkastreams;
    
    import huxihx.kafkastreams.proto.ClickOuterClass;
    import huxihx.kafkastreams.serdes.ProtobufSerdes;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.admin.TopicListing;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.state.StoreBuilder;
    import org.apache.kafka.streams.state.Stores;
    import org.apache.kafka.streams.state.WindowStore;
    
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.CountDownLatch;
    import java.util.stream.Collectors;
    
    public class FindDistinctEvents {
    
        private static final String storeName = "eventId-store";
    
        public static void main(String[] args) throws Exception {
            if (args.length < 1) {
                throw new IllegalArgumentException("Config file path must be specified.");
            }
    
            FindDistinctEvents app = new FindDistinctEvents();
            Properties envProps = app.loadEnvProperties(args[0]);
            Properties streamProps = app.createStreamsProperties(envProps);
            Topology topology = app.buildTopology(envProps);
    
            app.preCreateTopics(envProps);
    
            final KafkaStreams streams = new KafkaStreams(topology, streamProps);
            final CountDownLatch latch = new CountDownLatch(1);
    
            Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
                @Override
                public void run() {
                    streams.close();
                    latch.countDown();
                }
            });
    
            try {
                streams.start();
                latch.await();
            } catch (Exception e) {
                System.exit(1);
            }
            System.exit(0);
        }
    
        private Topology buildTopology(Properties envProps) {
            final StreamsBuilder builder = new StreamsBuilder();
            final ProtobufSerdes<ClickOuterClass.Click> clickSerdes = clickProtobufSerdes();
    
            final String inputTopic = envProps.getProperty("input.topic.name");
            final String outputTopic = envProps.getProperty("output.topic.name");
            final Duration windowSize = Duration.ofMinutes(2);
    
            final StoreBuilder<WindowStore<String, Long>> dedupStoreBuilder = Stores.windowStoreBuilder(
                    Stores.persistentWindowStore(storeName,
                            windowSize,
                            windowSize,
                            false
                    ),
                    Serdes.String(),
                    Serdes.Long());
    
            builder.addStateStore(dedupStoreBuilder);
            builder.stream(inputTopic, Consumed.with(Serdes.String(), clickSerdes))
                    .transform(() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value.getIp()), storeName)
                    .to(outputTopic, Produced.with(Serdes.String(), clickSerdes));
            return builder.build();
        }
    
        private Properties createStreamsProperties(Properties envProps) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            return props;
        }
    
        private void preCreateTopics(Properties envProps) throws Exception {
            Map<String, Object> config = new HashMap<>();
            config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
            String inputTopic = envProps.getProperty("input.topic.name");
            String outputTopic = envProps.getProperty("output.topic.name");
            try (AdminClient client = AdminClient.create(config)) {
                Collection<TopicListing> existingTopics = client.listTopics().listings().get();
    
                List<NewTopic> topics = new ArrayList<>();
                List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
                if (!topicNames.contains(inputTopic))
                    topics.add(new NewTopic(
                            inputTopic,
                            Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                            Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));
    
                if (!topicNames.contains(outputTopic))
                    topics.add(new NewTopic(
                            outputTopic,
                            Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                            Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));
    
                if (!topics.isEmpty())
                    client.createTopics(topics).all().get();
            }
        }
    
        private Properties loadEnvProperties(String fileName) throws IOException {
            Properties envProps = new Properties();
            try (FileInputStream input = new FileInputStream(fileName)) {
                envProps.load(input);
            }
            return envProps;
        }
    
        private static ProtobufSerdes<ClickOuterClass.Click> clickProtobufSerdes() {
            return new ProtobufSerdes<>(ClickOuterClass.Click.parser());
        }
    }

    The main logic lives in the buildTopology method, where we plug the custom DeduplicationTransformer into the topology to deduplicate events within a 2-minute window, keyed by the client IP.
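
    Because the event ID extractor is just a KeyValueMapper, changing what counts as a "duplicate" is a one-line change. For example, to deduplicate on the ip + url combination rather than the IP alone, only the lambda changes (a hypothetical variant; "|" is an arbitrary separator):

    builder.stream(inputTopic, Consumed.with(Serdes.String(), clickSerdes))
            .transform(() -> new DeduplicationTransformer<>(
                            windowSize.toMillis(),
                            (key, value) -> value.getIp() + "|" + value.getUrl()),  // composite event ID
                    storeName)
            .to(outputTopic, Produced.with(Serdes.String(), clickSerdes));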

    7. Write a Test Producer and Consumer

    As in the earlier posts in this series, we write TestProducer and TestConsumer classes, both under src/main/java/huxihx/kafkastreams/tests. Their contents are as follows:

    TestProducer.java:

    package huxihx.kafkastreams.tests;
    
    import huxihx.kafkastreams.proto.ClickOuterClass;
    import huxihx.kafkastreams.serdes.ProtobufSerializer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    
    public class TestProducer {
        private static final List<ClickOuterClass.Click> TEST_CLICK_EVENTS = Arrays.asList(
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                        .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                        .setTimestamp("2019-09-16T14:53:43+00:00").build(),
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                        .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                        .setTimestamp("2019-09-16T14:53:43+00:01").build(),
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                        .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                        .setTimestamp("2019-09-16T14:53:43+00:03").build(),
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                        .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                        .setTimestamp("2019-09-16T14:53:43+00:00").build(),
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                        .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                        .setTimestamp("2019-09-16T14:53:43+00:01").build(),
                ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                        .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                        .setTimestamp("2019-09-16T14:53:43+00:03").build()
        );
    
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ProtobufSerializer.class);
    
            try (final Producer<String, ClickOuterClass.Click> producer = new KafkaProducer<>(props)) {
                TEST_CLICK_EVENTS.stream().map(click -> new ProducerRecord<String, ClickOuterClass.Click>("clicks", click)).forEach(producer::send);
            }
        }
    }
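
    Note that the six test events deliberately repeat each of the three IPs (10.0.0.1, 10.0.0.2, 10.0.0.3) twice, so the deduplication topology should forward only the first occurrence of each.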
    

    TestConsumer.java:

    package huxihx.kafkastreams.tests;
    
    import com.google.protobuf.Parser;
    import huxihx.kafkastreams.proto.ClickOuterClass;
    import huxihx.kafkastreams.serdes.ProtobufDeserializer;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    
    import java.time.Duration;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    
    public class TestConsumer {
    
        public static void main(String[] args) {
            Deserializer<ClickOuterClass.Click> deserializer = new ProtobufDeserializer<>();
            Map<String, Parser<ClickOuterClass.Click>> config = new HashMap<>();
            config.put("parser", ClickOuterClass.Click.parser());
            deserializer.configure(config, false);
    
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group01");
            props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    
            try (final Consumer<String, ClickOuterClass.Click> consumer = new KafkaConsumer<>(props, new StringDeserializer(), deserializer)) {
                consumer.subscribe(Arrays.asList("distinct-clicks"));
                while (true) {
                    ConsumerRecords<String, ClickOuterClass.Click> records = consumer.poll(Duration.ofMillis(1000));
                    for (ConsumerRecord<String, ClickOuterClass.Click> record : records) {
                        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }
    

    8. Test It

    First, build the project:

    $ ./gradlew shadowJar  

    Then start your Kafka cluster and run the Kafka Streams application:

    $ java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties

    Now open two terminals and run the test producer and consumer:

    $ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer

    $ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer

    If everything works, TestConsumer should print three messages:

    offset = 0, key = null, value = ip: "10.0.0.1"
    url: "https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html"
    timestamp: "2019-09-16T14:53:43+00:00"

    offset = 1, key = null, value = ip: "10.0.0.2"
    url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
    timestamp: "2019-09-16T14:53:43+00:01"

    offset = 2, key = null, value = ip: "10.0.0.3"
    url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
    timestamp: "2019-09-16T14:53:43+00:03"

• Original post: https://www.cnblogs.com/huxi2b/p/12154771.html