• 存储Kafka消息[Python版]


    前段时间,由于业务需要对Kafka消息从两方面进行存储,一方面离线分批存储,另一方面在线分布式存储。在存储Kafka消息前,需要了解Kafka工作原理

    1 离线分批存储

    将消费到的Kafka数据打包分批存储到本地

    from pykafka import KafkaClient
    import json
    import time
    
    client = KafkaClient(hosts="xx.xx.xx.xx:9092,xx.xx.xx.xx:9092,xx.xx.xx.xx:9092")
    topic = client.topics['test']
    consumer = topic.get_simple_consumer(
        consumer_group="logGroup",
        reset_offset_on_start=True
    )
    
    review_list = []
    
    while True:
        for message in consumer:
            if message is not None:
                #print(message.offset, message.value)
                #print(message.value)
                #print(type(message.value)) #<class 'bytes'>
                message.value = str(message.value).strip('b').strip("'")
                print(type(message.value)) #<class 'str'>
                message.value = message.value.replace("\n", "") #去除
    
    
                message.value = eval(message.value)
                print(type(message.value)) #<class 'dict'>
    
                print(message.value)            
                review_list.append(message.value)
                #print(review_list)
                if len(review_list) == 20000:
                    now = time.strftime("%Y_%m_%d-%H_%M_%S",time.localtime(time.time()))
                    print(now)
                    fname="/lujing"+now+r".json"
                    with open(fname, 'w') as f:
                        
                        json.dump(review_list, f)
                    
                    review_list = []
                    print(len(review_list)==20000)
                    now = time.strftime("%Y_%m_%d-%H_%M_%S",time.localtime(time.time()))
                    print(now)
    
                    continue
    

    2 在线分布式存储

    直接将消费到的Kafka消息写入HBase

    # -*- coding: utf-8 -*-
    """
    Created on Tue Nov 5 14:12:09 2019
    
    @author: yoyoyo
    """
    
    
    from kafka import KafkaConsumer
    import happybase
    import json
    import time
    
    hbase_ip='localhost'
    hbase_port=9090
    ip = hbase_ip
    port = hbase_port
    pool = happybase.ConnectionPool(size=1, host=ip) # create connection
    
    # table insert data
    def hbase_load(tableName, lists):
        with pool.connection() as connection:
            connection.open()    #start transport
        if tableName not in str(connection.tables()):
            create_table(connection, tableName)
        table = connection.table(tableName)
        b = table.batch(batch_size=1024) # batchs inserts data
        for li in lists:
            try:
                rowkey = li['xxx']
                data_dicts = {}
                for d, x in li.items():
                    key = "timeStamp:" + d
                    value = str(x)
                    data_dicts[key] = value
                    b.put(row=rowkey, data=data_dicts)
                    b.send()
                    print("rowkey:" + rowkey + " data append success")
            except Exception as ex:
                print(str(ex) + " data append failure.")
        connection.close()
    
    # create table
    def create_table(conn, table):
        try:
            conn.create_table(
                table,
                {
                    "timeStamp": dict(max_versions=10)
                }
            )
        except Exception as ex:
            print(str(ex) + " table exists.")
    # print log
    def log(str):
        t = time.strftime(r"%Y-%m-%d %H:%M:%S", time.localtime())
        print("[%s]%s" % (t, str))
    lst = []
    log('start consumer')
    consumer = KafkaConsumer('test', group_id='logGroup', bootstrap_servers=['xx.xx.xx.xx:9092']) # HBase集群master ip
    for msg in consumer:
        recv = "%s:%d:%d: key=%s value=%s" % (msg.topic, msg.partition, msg.offset, msg.key, msg.value)
        log(recv)
        dict_data = json.loads(msg.value, strict = False)
        dict_data['xxx'] = str(dict_data['xxx'])+'-'+dict_data['xxx']
        lst.append(dict_data)
        hbase_load('kafka2Hbase_test', lst)
    

    服务器调试在线分布式存储时报错处理

    错误1 连接Hbase出现的错误

    thriftpy2.transport.TTransportException: TTransportException(type=1, message=“Could not connect to (‘localhost’, 9090)”
    

    原因分析
    Hbase是用Java写的,原生地提供了Java接口,如果用其他语言,需要开启连接原生提供的thrift接口服务器。也就是python使用HappyBase操作HBase需要先开启thrift接口服务器。

    解决办法并查看进程

    (base) root@master:/opt/hbase/hbase-1.2.10/bin# hbase-daemon.sh start thrift2
    
    (base) root@master:~# jps
    3266 QuorumPeerMain
    4772 HMaster
    996 Kafka
    4102 ResourceManager
    4231 NodeManager
    11850 ThriftServer
    12048 Jps
    3890 SecondaryNameNode
    3703 DataNode
    3578 NameNode
    

    错误2 loads时报错
    JSONDecodeError: invalid control character
    解决办法:json.loads(json, strict=False) #关掉strict即可

  • 相关阅读:
    nutch2.x在eclipse+windows环境下运行遇到的一些问题的解决方案
    QT 设置应用程序图标和可执行程序图标(另有setWindowTitle和setWindowIcon)
    QT延时方法整理(QTimer::singleShot,QWaitCondition,QDateTime.secsTo三种新方法)
    自定义QT窗口部件外观之QStyle
    qt中文编码(好多方法)
    QT字符编码转换,可用于中文内码传输
    Visual Studio GitHub For Windows部署
    Linux部署ASP.NET 5 (vNext)
    微软的实现3
    JAVA8,SPRING,ANGULARJS对项目
  • 原文地址:https://www.cnblogs.com/eugene0/p/12548121.html
Copyright © 2020-2023  润新知