• storm中KafkaSpout的选择



    Storm最常用的消息源就是Kafka,在对接的时候大多需要使用KafkaSpout;

    在网上大概有两种KafkaSpout,一种是只有几十行,一种却有一大啪啦类文件。


    在kafka中,同一个partition中的消息只能被同一个组的一个consumer消费,不能并发,所以kafka的并发说的是多partition的并发;

    kafka的consumer API分为high level consumer和low level consumer,官方建议使用前者,以为不用关心partition、offset那些,但是后者也有其存在的意义:1.多次读取的时候;2.选择性读取部分消息;3.控制消费过程。


    写法比较简单的KafkaSpout:

     1 public class KafkaSpouttest implements IRichSpout {
     2 
     3     private static final long serialVersionUID = 1L;
     4     private SpoutOutputCollector collector;
     5     private ConsumerConnector consumer;
     6     private String topic;
     7 
     8     public KafkaSpouttest() {}
     9 
    10     public KafkaSpouttest(String topic) {
    11         this.topic = topic;
    12     }
    13 
    14     public void ack(Object arg0) {
    15 
    16 }
    17 
    18     private static ConsumerConfig createConsumerConfig() {
    19         Properties props = new Properties();
    20         // 设置zookeeper的链接地址
    21         props.put("zookeeper.connect", "localhost:2181");
    22         // 设置group id
    23         props.put("group.id", "1");
    24         // kafka的group 消费记录是保存在zookeeper上的, 但这个信息在zookeeper上不是实时更新的, 需要有个间隔时间更新
    25         props.put("auto.commit.interval.ms", "1000");
    26         props.put("zookeeper.session.timeout.ms", "10000");
    27         return new ConsumerConfig(props);
    28     }
    29 
    30     public void activate() {
    31         consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
    32         Map < String,
    33         Integer > topickMap = new HashMap < String,
    34         Integer > ();
    35         topickMap.put(topic, 1);
    36 
    37         System.out.println("*********Results********topic:" + topic);
    38 
    39         Map < String,
    40         List < KafkaStream < byte[],
    41         byte[] >>> streamMap = consumer.createMessageStreams(topickMap);
    42         KafkaStream < byte[],
    43         byte[] > stream = streamMap.get(topic).get(0);
    44         ConsumerIterator < byte[],
    45         byte[] > it = stream.iterator();
    46         while (it.hasNext()) {
    47             String value = new String(it.next().message());
    48             SimpleDateFormat formatter = new SimpleDateFormat("yyyy年MM月dd日 HH:mm:ss SSS");
    49             Date curDate = new Date(System.currentTimeMillis()); //获取当前时间      
    50             String str = formatter.format(curDate);
    51 
    52             System.out.println("storm接收到来自kafka的消息------->" + value);
    53 
    54             collector.emit(new Values(value, 1, str), value);
    55         }
    56     }
    57 
    58     public void close() {
    59         // TODO Auto-generated method stub
    60     }
    61 
    62     public void deactivate() {
    63         // TODO Auto-generated method stub
    64     }
    65 
    66     public void fail(Object arg0) {
    67         // TODO Auto-generated method stub
    68     }
    69 
    70     public void nextTuple() {
    71         // TODO Auto-generated method stub
    72     }
    73 
    74     public void open(Map arg0, TopologyContext arg1, SpoutOutputCollector collector) {
    75         this.collector = collector;
    76     }
    77 
    78     public void declareOutputFields(OutputFieldsDeclarer declarer) {
    79         declarer.declare(new Fields("word", "id", "time"));
    80     }
    81 
    82     public Map < String,
    83     Object > getComponentConfiguration() {
    84         System.out.println("getComponentConfiguration被调用");
    85         topic = "admln";
    86         return null;
    87     }
    88 
    89 }

    方法相关的不解释,和本主题相关的一句话是:

    byte[] >>> streamMap = consumer.createMessageStreams(topickMap);

    想说的是它用的是High Level API


    复杂的代码就多了,在github上有好几个,最官方的还是apache storm自带的:

    里面和本主题相关的一句话是DynamicPartitionConnections.java中的60行:

    _connections.put(host, new ConnectionInfo(new SimpleConsumer(host.host, host.port, _config.socketTimeoutMs, _config.bufferSizeBytes, _config.clientId)));

    它用的是low level API


    apache KafkaSpout 在 topology 中的配置

    String zkConnString = "node1:2181,node2:2181,node3:2181";
            String topicName = "testtopic";
            BrokerHosts hosts = new ZkHosts(zkConnString);
            SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, "/" + topicName, UUID.randomUUID().toString());
            spoutConfig.forceFromStart = false;
            spoutConfig.zkPort = 2181;
            spoutConfig.zkServers = Arrays.asList(new String[]{"node1","node2","node3"});
            
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
            
            KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
            
            TopologyBuilder builder = new TopologyBuilder();
            // 构造NC数据流向图
            builder.setSpout("mrspout", kafkaSpout, 30);
            builder.setBolt("mrverifybolt", new MRVerifyBolt(), 30)
                    .shuffleGrouping("mrspout");
            builder.setBolt("mr2storagebolt", new MR2StorageBolt(), 30)
                    .shuffleGrouping("mrverifybolt");
            // 以类名作为STORM任务名
            String name = MRTopology.class.getSimpleName();
            // 传主机名则为集群运行模式,不传则为本地运行模式
            if (args != null && args.length > 0) {
                Config conf = new Config();
                // 通过指定nimbus主机
                conf.put(Config.NIMBUS_HOST, args[0]);
                conf.setNumWorkers(6);
                conf.setNumAckers(0);
                conf.setMaxSpoutPending(100000);
                StormSubmitter.submitTopologyWithProgressBar(name, conf,
                        builder.createTopology());
            } else {
                Map conf = new HashMap();
                conf.put(Config.TOPOLOGY_WORKERS, 1);
                conf.put(Config.TOPOLOGY_DEBUG, true);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology(name, conf, builder.createTopology());
            }
        }

    关于 spoutConfig.servers 和 spoutConfig.port 在实际应用中其实不设置也可以,因为在集群中如果不设置 storm 默认会把 storm 配置中的 zookeeper 地址和端口,设置的用处是在 eclipse 中测试运行的时候因为是模拟 storm cluster, 所以主动设置。


    两者各有优劣,相同点性能,简单测试过,low level的要好点,但是相差不大(都在合适的配置下,小集群);

    不同点是high level 的代码简单,而low level的代码很多,配置也多,用着麻烦(也不是很麻烦);

    low level的优点是支持重读,就是配置中的 spoutConfig.forceFromStart = false; ,支持重读的另一个好处是和storm的acker结合,可以重发,防止丢数据,这一点比low level的要安全一点,另一个好处是配置多,使用就很难灵活,比如设置KafkaSpout的fetchSizeBytes,和kafka的bufferSizeBytes对应,是优化的一个手段。

    至于选择哪种,支持后者,反正storm中已经自带了,不需要自己写,配置就好,而且0.9.4中优化了很多KafkaSpout的问题。


  • 相关阅读:
    20210312
    20210311
    20210310
    例5-1
    例5-2
    例4-12-2
    例4-12
    例4-11
    例4-10
    例4-9
  • 原文地址:https://www.cnblogs.com/admln/p/storm-KafkaSpout-choose.html
Copyright © 2020-2023  润新知