Kafka Streams Development: Getting Started (1)


Background

I recently noticed that Confluent has published a Kafka Streams tutorial series on its website: ten lessons, each introducing one Kafka Streams feature. The series is quite helpful for getting to know Kafka Streams. Why bother with Kafka Streams at all? Frankly, I have always felt the domestic infatuation with Flink is a bit overdone. Flink makes sense at big companies, where data volumes are huge and a full suite of cluster management, scheduling, and monitoring is required. But why should a typical small or mid-sized company with simple, low-volume workloads spend the time and effort to stand up an entire Flink cluster? For many simple stream-processing scenarios Kafka Streams is more than enough, and in its design, especially its exactly-once processing semantics, it is in no way inferior to Flink; the two are simply positioned differently. If Kafka Streams interests you, this series is worth following. One annoyance, though: the tutorial uses Confluent Kafka as the demo environment and leans heavily on Avro for message serialization/deserialization. Confluent Kafka ships with Schema Registry and the avro-console-producer and avro-console-consumer tools, which make it easy to produce and test Avro-formatted messages; with vanilla Kafka (i.e., community Apache Kafka) you would have to build all of that yourself, which is quite inconvenient.

Given that, I plan to work through the examples in this series and reimplement them on Apache Kafka. Much of it is introductory, but I think it makes for a good learning exercise. Note that instead of Avro I will be using Google's Protocol Buffers (protobuf below) for the demos.

What We Will Demonstrate

The first article demonstrates something very simple: the map operation. The map function, or operator, transforms every event in a stream into another format or into a new event altogether. Today's input messages represent movies, in this format:

    {"id": 294, "title": "Die Hard::1988", "genre": "action"}

We will use Kafka Streams to split each message's title field in real time, pulling the release year out into its own field, so the output looks like this:

    {"id":294,"title":"Die Hard","release_year":1988,"genre":"action"}

Initializing the Project

The first step is to create the project directory. Before you do, install and configure a Java environment and a Gradle environment. Gradle builds the Java project; you can download it from https://gradle.org/. Then run the following commands to create the project:

    mkdir movie-streams/
    cd movie-streams/

Configuring the Project

Under the movie-streams directory, create a build.gradle file. This is Gradle's project configuration file, analogous to Maven's pom.xml. Its contents are as follows:

    buildscript {
        repositories {
            jcenter()
        }
        dependencies {
            classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
        }
    }

    plugins {
        id 'java'
        id "com.google.protobuf" version "0.8.10"
    }
    apply plugin: 'com.github.johnrengelman.shadow'

    repositories {
        mavenCentral()
        jcenter()

        maven {
            url 'http://packages.confluent.io/maven'
        }
    }

    group 'huxihx.kafkastreams'

    sourceCompatibility = 1.8
    targetCompatibility = '1.8'
    version = '0.0.1'

    dependencies {
        implementation 'org.slf4j:slf4j-simple:1.7.26'
        implementation 'org.apache.kafka:kafka-streams:2.3.0'
        implementation 'com.google.protobuf:protobuf-java:3.9.1'

        testCompile group: 'junit', name: 'junit', version: '4.12'
    }

    protobuf {
        generatedFilesBaseDir = "$projectDir/src/"
        protoc {
            artifact = 'com.google.protobuf:protoc:3.9.1'
        }
    }

    jar {
        manifest {
            attributes(
                    'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                    'Main-Class': 'huxihx.kafkastreams.MovieStreamApp'
            )
        }
    }

    shadowJar {
        archiveName = "kstreams-transform-standalone-${version}.${extension}"
    }

The key points here: first, we pull in Gradle's shadow plugin for fat-jar packaging; second, we configure Gradle's protobuf plugin, which automatically compiles *.proto files into Java classes for us. Save the file, then run the following command to download the Gradle wrapper:

    gradle wrapper

Next, create a folder named configuration under movie-streams to hold our configuration files:

    mkdir configuration

Then create a configuration file named dev.properties with the following contents:

    application.id=movie-transformer

    bootstrap.servers=localhost:9092

    input.topic.name=raw-movies
    input.topic.partitions=1
    input.topic.replication.factor=1

    output.topic.name=movies
    output.topic.partitions=1
    output.topic.replication.factor=1

This file specifies the Kafka cluster we connect to, along with the details of the input and output topics.

Creating the Message Schemas

The next step is to create the schemas for the input and output messages. Under movie-streams, run this command to create the folder that holds them:

    mkdir -p src/main/proto 

Then create two files, raw-movie.proto and parsed-movie.proto. First, raw-movie.proto:

    syntax = "proto3";

    package huxihx.kafkastreams.proto;

    message RawMovie {
    uint64 id = 1;
    string title = 2;
    string genre = 3;
    }
    syntax = "proto3";

    package huxihx.kafkastreams.proto;

    message Movie {
    uint64 id = 1;
    string title = 2;
    uint32 release_year = 3;
    string genre = 4;
    }

Both files use standard protobuf syntax, defining a movie event's id, title, release_year, and genre. Save the two files, then run the gradlew command under movie-streams to compile them into Java classes:

    ./gradlew build

You should now see two generated Java classes under src/main/java/huxihx/kafkastreams/proto: ParsedMovie and RawMovieOuterClass.
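If you want a quick sanity check of the generated classes, a serialize/parse round trip like the one below should work (a minimal sketch; run it from any scratch main method, and note that parseFrom declares InvalidProtocolBufferException):

    // Build a RawMovie with the generated builder, then round-trip it through bytes.
    RawMovieOuterClass.RawMovie movie = RawMovieOuterClass.RawMovie.newBuilder()
            .setId(294).setTitle("Die Hard::1988").setGenre("action").build();
    byte[] bytes = movie.toByteArray();                 // protobuf wire format
    RawMovieOuterClass.RawMovie copy = RawMovieOuterClass.RawMovie.parseFrom(bytes);
    System.out.println(copy.getTitle());                // prints: Die Hard::1988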

Creating the Serdes

In this step we create a Serde for each of the RawMovie and Movie messages. Serdes is shorthand for serializer and deserializer: when a Kafka Streams application reads a message, it uses the Serde's deserializer to turn the message's byte array into a Java object, and when it produces a message, it uses the Serde's serializer to turn a Java object back into a byte array. Since we serialize with protobuf, we need Serdes that understand protobuf.

Run the following under movie-streams:

    mkdir -p src/main/java/huxihx/kafkastreams/serdes

Then, in the newly created serdes folder, create ProtobufSerializer.java:

    package huxihx.kafkastreams.serdes;
    
    import com.google.protobuf.MessageLite;
    import org.apache.kafka.common.serialization.Serializer;
    
    /**
     * A generic Serializer for protobuf-generated types: any MessageLite
     * can serialize itself via toByteArray().
     */
    public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
        @Override
        public byte[] serialize(String topic, T data) {
            return data == null ? new byte[0] : data.toByteArray();
        }
    }

Next, create ProtobufDeserializer.java:

    package huxihx.kafkastreams.serdes;
    
    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.protobuf.MessageLite;
    import com.google.protobuf.Parser;
    import org.apache.kafka.common.errors.SerializationException;
    import org.apache.kafka.common.serialization.Deserializer;
    
    import java.util.Map;
    
    /**
     * A generic Deserializer for protobuf-generated types. The Parser of the
     * concrete message type is injected through configure() under the "parser" key.
     */
    public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {
    
        private Parser<T> parser;
    
        @SuppressWarnings("unchecked")
        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            parser = (Parser<T>) configs.get("parser");
        }
    
        @Override
        public T deserialize(String topic, byte[] data) {
            if (data == null) {
                return null; // e.g. a tombstone record
            }
            try {
                return parser.parseFrom(data);
            } catch (InvalidProtocolBufferException e) {
                throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
            }
        }
    }

Finally, create ProtobufSerdes.java:

    package huxihx.kafkastreams.serdes;
    
    import com.google.protobuf.MessageLite;
    import com.google.protobuf.Parser;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serializer;
    
    import java.util.HashMap;
    import java.util.Map;
    
    /**
     * Packages the protobuf serializer and deserializer into a Kafka Serde.
     * The concrete message type's Parser is handed to the deserializer via
     * its configure() hook.
     */
    public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {
    
        private final Serializer<T> serializer;
        private final Deserializer<T> deserializer;
    
        public ProtobufSerdes(Parser<T> parser) {
            serializer = new ProtobufSerializer<>();
            deserializer = new ProtobufDeserializer<>();
            Map<String, Parser<T>> config = new HashMap<>();
            config.put("parser", parser);
            deserializer.configure(config, false);
        }
    
        @Override
        public Serializer<T> serializer() {
            return serializer;
        }
    
        @Override
        public Deserializer<T> deserializer() {
            return deserializer;
        }
    }
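With these three classes in place, building a Serde for any protobuf-generated type takes one line; the main program below wraps this in two small factory methods:

    // Construct a Serde from the generated message type's Parser.
    ProtobufSerdes<RawMovieOuterClass.RawMovie> rawMovieSerdes =
            new ProtobufSerdes<>(RawMovieOuterClass.RawMovie.parser());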

Developing the Main Program

Create MovieStreamApp.java under src/main/java/huxihx/kafkastreams:

    package huxihx.kafkastreams;
    
    import huxihx.kafkastreams.proto.ParsedMovie;
    import huxihx.kafkastreams.proto.RawMovieOuterClass;
    import huxihx.kafkastreams.serdes.ProtobufSerdes;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.admin.TopicListing;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.CountDownLatch;
    import java.util.stream.Collectors;
    
    public class MovieStreamApp {
    
        public static void main(String[] args) throws Exception {
            if (args.length < 1) {
                throw new IllegalArgumentException("Config file path must be specified.");
            }
    
            MovieStreamApp app = new MovieStreamApp();
            Properties envProps = app.loadEnvProperties(args[0]);
            Properties streamProps = app.createStreamsProperties(envProps);
            Topology topology = app.buildTopology(envProps);
    
            app.preCreateTopics(envProps);
    
            final KafkaStreams streams = new KafkaStreams(topology, streamProps);
            final CountDownLatch latch = new CountDownLatch(1);
    
            Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
                @Override
                public void run() {
                    streams.close();
                    latch.countDown();
                }
            });
    
            try {
                streams.start();
                latch.await();
            } catch (Exception e) {
                System.exit(1);
            }
            System.exit(0);
        }
    
    
        /**
         * Builds the Streams topology: read raw movies, map each one, write parsed movies.
         *
         * @param envProps
         * @return
         */
        private Topology buildTopology(Properties envProps) {
            final StreamsBuilder builder = new StreamsBuilder();
            final String inputTopic = envProps.getProperty("input.topic.name");
            final String outputTopic = envProps.getProperty("output.topic.name");
    
            builder.stream(inputTopic, Consumed.with(Serdes.String(), rawMovieProtobufSerdes()))
                    .map((key, rawMovie) -> new KeyValue<>(rawMovie.getId(), parseRawMovie(rawMovie)))
                    .to(outputTopic, Produced.with(Serdes.Long(), movieProtobufSerdes()));
    
            return builder.build();
        }
    
        /**
         * Builds the Properties instance required by the Kafka Streams application.
         *
         * @param envProps
         * @return
         */
        private Properties createStreamsProperties(Properties envProps) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            return props;
        }
    
        /**
         * Pre-creates the input/output topics; topics that already exist are skipped.
         *
         * @param envProps
         * @throws Exception
         */
        private void preCreateTopics(Properties envProps) throws Exception {
            Map<String, Object> config = new HashMap<>();
            config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
            String inputTopic = envProps.getProperty("input.topic.name");
            String outputTopic = envProps.getProperty("output.topic.name");
            try (AdminClient client = AdminClient.create(config)) {
                Collection<TopicListing> existingTopics = client.listTopics().listings().get();
    
                List<NewTopic> topics = new ArrayList<>();
                List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
                if (!topicNames.contains(inputTopic))
                    topics.add(new NewTopic(
                            envProps.getProperty("input.topic.name"),
                            Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                            Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));
    
                if (!topicNames.contains(outputTopic))
                    topics.add(new NewTopic(
                            envProps.getProperty("output.topic.name"),
                            Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                            Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));
    
                if (!topics.isEmpty())
                    client.createTopics(topics).all().get();
            }
        }
    
        /**
         * Loads the properties file under the configuration directory.
         *
         * @param fileName
         * @return
         * @throws IOException
         */
        private Properties loadEnvProperties(String fileName) throws IOException {
            Properties envProps = new Properties();
            try (FileInputStream input = new FileInputStream(fileName)) {
                envProps.load(input);
            }
            return envProps;
        }
    
        /**
         * Builds the Serdes for the output topic.
         *
         * @return
         */
        private static ProtobufSerdes<ParsedMovie.Movie> movieProtobufSerdes() {
            return new ProtobufSerdes<>(ParsedMovie.Movie.parser());
        }
    
        /**
         * Builds the Serdes for the input topic.
         *
         * @return
         */
        private static ProtobufSerdes<RawMovieOuterClass.RawMovie> rawMovieProtobufSerdes() {
            return new ProtobufSerdes<>(RawMovieOuterClass.RawMovie.parser());
        }
    
        /**
         * The map logic: splits the title and extracts the release_year field.
         *
         * @param rawMovie
         * @return
         */
        private static ParsedMovie.Movie parseRawMovie(RawMovieOuterClass.RawMovie rawMovie) {
            String[] titleParts = rawMovie.getTitle().split("::");
            String title = titleParts[0];
            int releaseYear = Integer.parseInt(titleParts[1]);
            return ParsedMovie.Movie.newBuilder()
                    .setId(rawMovie.getId())
                    .setTitle(title)
                    .setReleaseYear(releaseYear)
                    .setGenre(rawMovie.getGenre())
                    .build();
        }
    }

Writing a Test Producer and Consumer

Create TestProducer.java and TestConsumer.java under src/main/java/huxihx/kafkastreams/tests. First, TestProducer.java:

    package huxihx.kafkastreams.tests;
    
    import huxihx.kafkastreams.proto.RawMovieOuterClass;
    import huxihx.kafkastreams.serdes.ProtobufSerializer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    
    public class TestProducer {
    
        // Test input events
        private static final List<RawMovieOuterClass.RawMovie> TEST_RAW_MOVIES = Arrays.asList(
                RawMovieOuterClass.RawMovie.newBuilder()
                        .setId(294).setTitle("Die Hard::1988").setGenre("action").build(),
    
                RawMovieOuterClass.RawMovie.newBuilder()
                        .setId(354).setTitle("Tree of Life::2011").setGenre("drama").build(),
    
                RawMovieOuterClass.RawMovie.newBuilder()
                        .setId(782).setTitle("A Walk in the Clouds::1995").setGenre("romance").build(),
    
                RawMovieOuterClass.RawMovie.newBuilder()
                        .setId(128).setTitle("The Big Lebowski::1998").setGenre("comedy").build());
    
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", new ProtobufSerializer<RawMovieOuterClass.RawMovie>().getClass());
    
            try (final Producer<String, RawMovieOuterClass.RawMovie> producer = new KafkaProducer<>(props)) {
                TEST_RAW_MOVIES.stream()
                        .map(rawMovie -> new ProducerRecord<String, RawMovieOuterClass.RawMovie>("raw-movies", rawMovie))
                        .forEach(producer::send);
            }
        }
    }
    
And TestConsumer.java:

    package huxihx.kafkastreams.tests;
    
    import com.google.protobuf.Parser;
    import huxihx.kafkastreams.proto.ParsedMovie;
    import huxihx.kafkastreams.serdes.ProtobufDeserializer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.LongDeserializer;
    
    import java.time.Duration;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    
    public class TestConsumer {
    
        public static void main(String[] args) {
            // Build a protobuf deserializer for the output events
            Deserializer<ParsedMovie.Movie> deserializer = new ProtobufDeserializer<>();
            Map<String, Parser<ParsedMovie.Movie>> config = new HashMap<>();
            config.put("parser", ParsedMovie.Movie.parser());
            deserializer.configure(config, false);
    
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "test-group");
            props.put("enable.auto.commit", "true");
            props.put("auto.commit.interval.ms", "1000");
            props.put("auto.offset.reset", "earliest");
            KafkaConsumer<Long, ParsedMovie.Movie> consumer = new KafkaConsumer<>(props, new LongDeserializer(), deserializer);
            consumer.subscribe(Arrays.asList("movies"));
            while (true) {
                ConsumerRecords<Long, ParsedMovie.Movie> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<Long, ParsedMovie.Movie> record : records)
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
        }
    }

Testing

First, build the project:

    ./gradlew shadowJar

Then start your Kafka cluster and run the Kafka Streams application:

    java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties

Next, run TestProducer to send the test events:

    java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer

Finally, run TestConsumer to verify that Kafka Streams extracted the release_year field from every input event:

    java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer

    .......

    offset = 0, key = 294, value = id: 294
    title: "Die Hard"
    release_year: 1988
    genre: "action"

    offset = 1, key = 354, value = id: 354
    title: "Tree of Life"
    release_year: 2011
    genre: "drama"

    offset = 2, key = 782, value = id: 782
    title: "A Walk in the Clouds"
    release_year: 1995
    genre: "romance"

    offset = 3, key = 128, value = id: 128
    title: "The Big Lebowski"
    release_year: 1998
    genre: "comedy"

Summary

Okay, that concludes the first demo. Overall, the map transformation operator in Kafka Streams is very handy: it applies the logic you specify to every inbound message in real time. In the next post I will demonstrate another classic transformation: filter.
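As a small teaser, filter() is just as compact as map(). A hypothetical one-liner such as

    // filter() keeps only the events that satisfy a predicate,
    // e.g. movies released in 1990 or later.
    movies.filter((key, movie) -> movie.getReleaseYear() >= 1990);

is all it takes, but the details will wait for the next post.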
