First-Order Analysis (Part 2)


    1. Data Collection, Implemented with Maxwell

    1.1 Introduction to Maxwell

    Maxwell is an open-source, real-time MySQL change-capture tool written in Java by Zendesk. Like Canal, it works by reading the MySQL binlog.

    1.2 Maxwell vs. Canal

    • Maxwell has no server+client split like Canal; it is a single server process that sends data to a message queue or Redis. To run multiple instances, start multiple processes with different configuration files.
    • Maxwell's standout feature is bootstrapping: Canal can only capture new changes and cannot handle pre-existing historical data, whereas Maxwell's bootstrap function can export the complete historical data for initialization, which is very convenient.
    • Maxwell has no built-in HA, but it supports checkpoint recovery: after a failure is resolved, a restart resumes reading from the last recorded position.
    • Maxwell only outputs JSON, while Canal in server+client mode lets you customize the output format.
    • Maxwell is more lightweight than Canal.
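    The bootstrap feature mentioned above is driven by the bundled maxwell-bootstrap utility. A typical invocation looks like the sketch below; the database name is this project's, while the table name and install path are assumptions based on the setup steps later in this post:

    ```shell
    # Replays the full current contents of one table through the normal Maxwell
    # pipeline, emitting bootstrap events to the configured Kafka topic.
    bin/maxwell-bootstrap --database gmall0423 --table user_info \
        --config /opt/module/maxwell1.25/config.properties
    ```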

    1.3 Installing Maxwell

    [hui@hadoop201 software]$ tar -zxvf maxwell-1.25.0.tar.gz -C /opt/module/
    [hui@hadoop201 module]$ mv maxwell-1.25.0 maxwell1.25

    1.4 Preparation Before Using Maxwell

    # Create a maxwell database to store Maxwell's metadata
    [hui@hadoop201 module]$ mysql -uroot -p123456
    CREATE DATABASE maxwell;
    # Create an account that can operate on that database
    GRANT ALL ON maxwell.* TO 'maxwell'@'%' IDENTIFIED BY '123456';
    # Grant that account the privileges needed to monitor the other databases
    GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO maxwell@'%';

    You also need to enable the MySQL binlog:

    [hui@hadoop201 ~]$ locate my.cnf
    /etc/my.cnf
    /etc/my.cnf.d
    [hui@hadoop201 ~]$ less /etc/my.cnf
    # For advice on how to change settings please see
    # http://dev.mysql.com/doc/refman/5.7/en/server-configuration-defaults.html
    
    [mysqld]
    #
    # Remove leading # and set to the amount of RAM for the most important data
    # cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
    # innodb_buffer_pool_size = 128M
    #
    # Remove leading # to turn on a very important data integrity option: logging
    # changes to the binary log between backups.
    # log_bin
    #
    # Remove leading # to set options mainly useful for reporting servers.
    # The server defaults are faster for transactions and fast SELECTs.
    # Adjust sizes as needed, experiment to find the optimal values.
    # join_buffer_size = 128M
    # sort_buffer_size = 2M
    # read_rnd_buffer_size = 2M
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    
    # Disabling symbolic-links is recommended to prevent assorted security risks
    symbolic-links=0
    
    log-error=/var/log/mysqld.log
    pid-file=/var/run/mysqld/mysqld.pid
    
    max_allowed_packet=1024M
    
    # server id (must be unique among the MySQL instances in a replication setup)
    server-id = 1
    ## enable binlog; this value is used as the binlog file-name prefix
    log-bin=mysql-bin
    ## binlog format; Maxwell requires row
    binlog_format=row
    ## database(s) whose changes are logged; adjust to your setup, and add one line per database to watch
    binlog-do-db=gmall0423
    # uncomment the next line to watch an additional database
    #binlog-do-db=gmall

    After editing my.cnf, restart MySQL so the binlog settings take effect.
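    After restarting MySQL, you can confirm the binlog settings with two standard queries:

    ```shell
    # log_bin should report ON and binlog_format should report ROW
    mysql -uroot -p -e "SHOW VARIABLES LIKE 'log_bin'; SHOW VARIABLES LIKE 'binlog_format';"
    ```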

    1.5 Editing the Maxwell Configuration File

    [hui@hadoop201 maxwell1.25]$ vim config.properties
    producer=kafka
    kafka.bootstrap.servers=hadoop201:9092,hadoop202:9092,hadoop203:9092
    # mysql login info
    host=hadoop201
    user=maxwell
    password=123456
    # must be added: target Kafka topic
    kafka_topic=gmall0423_db_maxwell
    # must be added: used later by the bootstrap initialization
    client_id=maxwell_1
    producer_partition_by=primary_key

    Note: by default Maxwell writes everything to a single partition of the target Kafka topic, because multiple partitions consumed in parallel could reorder the binlog events. To increase parallelism, first set the topic's partition count to more than 1, then set the producer_partition_by property.

    Valid values: producer_partition_by=database|table|primary_key|random|column
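    Why primary_key preserves per-row ordering even with several partitions: the producer hashes the partition key, so every event for a given row lands in the same partition. A minimal sketch of the idea in Java (the real producer uses Kafka's murmur2 hash; the modulo scheme below is only an illustration):

    ```java
    public class PartitionByKey {
        // Toy partitioner: the same key always maps to the same partition,
        // so all binlog events for one row stay in order within that partition.
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }

        public static void main(String[] args) {
            // Two updates to the row with primary key 30 go to the same partition,
            // while other rows can spread across partitions for parallelism.
            System.out.println(partitionFor("30", 4) == partitionFor("30", 4)); // prints true
            System.out.println(partitionFor("31", 4));
        }
    }
    ```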

    Maxwell startup script:

    [hui@hadoop201 bin]$ less maxwell.sh 
    
    /opt/module/maxwell1.25/bin/maxwell --config /opt/module/maxwell1.25/config.properties >/dev/null 2>&1 &
    [hui@hadoop201 ~]$ jps
    2675 Maxwell

    Start the mock-data generator and observe the result:

    [hui@hadoop201 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop201:9092 --topic gmall0423_db_maxwell

    Message format:

    {
        "database":"gmall0423",
        "table":"comment_info",
        "type":"insert",
        "ts":1651873950,
        "xid":2229,
        "xoffset":1392,
        "data":{
            "id":1522695828560359428,
            "user_id":23,
            "sku_id":12,
            "spu_id":10,
            "order_id":3515,
            "appraise":"1204",
            "comment_txt":"评论内容:23985229577423662931538863626755831166131361537121",
            "create_time":"2022-04-23 05:52:30",
            "operate_time":null
        }
    }
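    One detail worth noting: the ts field above is an epoch timestamp in seconds (the Canal samples later in this post carry millisecond timestamps such as 1589385314116). Converting it with the JDK, shown in Java for illustration:

    ```java
    import java.time.Instant;

    public class MaxwellTs {
        public static void main(String[] args) {
            long ts = 1651873950L;                         // Maxwell "ts": epoch seconds
            System.out.println(Instant.ofEpochSecond(ts)); // prints 2022-05-06T21:52:30Z
        }
    }
    ```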

    1.6 ODS-Layer Processing for Maxwell

    1.6.1 Canal vs. Maxwell Message Formats for Each Operation

    Insert

    INSERT INTO z_user_info VALUES(30,'zhang3','13810001010'),(31,'li4','1389999999');

    Canal:

    {"data":[{"id":"30","user_name":"zhang3","tel":"13810001010"},{"id":"31","user_name":"li4","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385314000,"id":2,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385314116,"type":"INSERT"}

    Maxwell:

    {"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"xoffset":0,"data":{"id":30,"user_name":"zhang3","tel":"13810001010"}}

    {"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"commit":true,"data":{"id":31,"user_name":"li4","tel":"1389999999"}}

    Update

    UPDATE z_user_info SET user_name='wang55' WHERE id IN(30,31);

    Canal:

    {"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385508000,"id":3,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":[{"user_name":"zhang3"},{"user_name":"li4"}],"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385508676,"type":"UPDATE"}

    Maxwell:

    {"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"},"old":{"user_name":"zhang3"}}

    {"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"},"old":{"user_name":"li4"}}

    Delete

    DELETE FROM z_user_info WHERE id IN(30,31);

    Canal:

    {"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385644000,"id":4,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385644829,"type":"DELETE"}

    Maxwell:

    {"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"}}

    {"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"}}

    1.6.2 Summary of Format Differences

    • Log structure: Canal produces one log entry per SQL statement; if the statement affects multiple rows, they are collected into an array inside that entry (even a single row is wrapped in an array). Maxwell emits one log entry per affected row. To tell whether several entries came from the same statement, compare their xid: entries with the same xid belong to the same transaction (for autocommit workloads, the same statement).
    • Numeric types: when the original value is numeric, Maxwell respects the original type and does not quote it into a string; Canal converts every value to a string.
    • Schema metadata: Canal messages carry the table's column definitions (mysqlType/sqlType); Maxwell messages are leaner.
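    Recovering the per-statement grouping from Maxwell's per-row entries is just a group-by on xid. A minimal Java sketch (the payload strings stand in for the JSON messages shown above; the project itself does this kind of work in Scala):

    ```java
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class GroupByXid {
        // Group per-row Maxwell events by transaction id, recovering the
        // one-statement-many-rows grouping that Canal emits as a single array.
        static Map<Long, List<String>> groupByXid(long[] xids, String[] rows) {
            Map<Long, List<String>> byXid = new LinkedHashMap<>();
            for (int i = 0; i < xids.length; i++) {
                byXid.computeIfAbsent(xids[i], k -> new ArrayList<>()).add(rows[i]);
            }
            return byXid;
        }

        public static void main(String[] args) {
            // The first two events share xid 83206: they came from one UPDATE statement.
            long[] xids = {83206, 83206, 83367};
            String[] rows = {"update id=30", "update id=31", "delete id=30"};
            System.out.println(groupByXid(xids, rows).get(83206L).size()); // prints 2
        }
    }
    ```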

    1.6.3 Modifying the SparkStreaming Splitting Logic for Maxwell

    import com.alibaba.fastjson.{JSON, JSONObject}
    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.dstream.{DStream, InputDStream}
    import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.wdh01.gmall.realtime.util.{MyKafkaSink, MyKafkaUtil, OffsetManagerUtil}
    
    /**
     * Split the database change stream collected by Maxwell
     */
    object BaseDBMaxwellApp {
      def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf().setAppName("BaseDBMaxwellApp").setMaster("local[4]")
        val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    
        val topic: String = "gmall0423_db_maxwell"
        val groupid: String = "base_db_db_maxwell_group"
    
        var recordDStream: InputDStream[ConsumerRecord[String, String]] = null
        // fetch the saved offsets from Redis
        val offsetMap: Map[TopicPartition, Long] = OffsetManagerUtil.getOffset(topic, groupid)
        if (offsetMap != null && offsetMap.size > 0) {
          // resume reading from the saved offsets
          recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, offsetMap, groupid)
        } else {
          // read from the default position
          recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, groupid)
        }
        // capture the offset ranges of the current batch
        var ranges: Array[OffsetRange] = Array.empty[OffsetRange]
        val offsetDstream: DStream[ConsumerRecord[String, String]] = recordDStream.transform { rdd =>
          ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd
        }
        // restructure the records: ConsumerRecord[K, V] ==> V (jsonStr) ==> V (jsonObj)
        val jsonObjDStream: DStream[JSONObject] = offsetDstream.map {
          record => {
            val jsonStr: String = record.value()
            // parse the JSON string into a JSONObject
            val jsonObj: JSONObject = JSON.parseObject(jsonStr)
            jsonObj
          }
        }
        jsonObjDStream.foreachRDD {
          rdd => {
            rdd.foreach {
              jsonObj => {
                // the changed row
                val dataJsonObj: JSONObject = jsonObj.getJSONObject("data")
                // the operation type
                val opType: String = jsonObj.getString("type")

                if (dataJsonObj != null && !dataJsonObj.isEmpty && "insert".equals(opType)) {
                  // route the record to an ODS topic named after its table
                  val tabName: String = jsonObj.getString("table")
                  val sendTopic = "ods_" + tabName
                  MyKafkaSink.send(sendTopic, dataJsonObj.toString())
                }
              }
            }
            // commit the offsets once the batch has been processed
            OffsetManagerUtil.saveOffset(topic, groupid, ranges)
          }
        }

        // start the streaming job and block until it is terminated
        ssc.start()
        ssc.awaitTermination()
      }
    }
Source: https://www.cnblogs.com/wdh01/p/16228255.html