• Flink: joining HBase on a non-rowkey column


    HBase, a key storage component in the Hadoop ecosystem, is built for random reads over massive datasets and is very widely used.

    Our real-time data warehouse project uses Kafka for the warehouse's base tables, and we also write a copy of the Kafka data to HBase for other scenarios, for example as dimension tables.

    By default, a Flink HBase table is joined via a one-shot TableScan that loads the entire dimension table, so the dimension data can never be updated; this fits few scenarios (and once the TableScan finishes, the operator state becomes FINISHED, so the job can no longer complete checkpoints).

    The Flink HBase temporal table Lookup feature (limited to joining on the HBase rowkey; non-rowkey columns are not supported) offers caching and pass-through queries to the external system, neatly solving both the stale-dimension problem and the checkpoint problem of the default table.

    The join SQL looks like this:

    CREATE TEMPORARY TABLE hbase_behavior_conf (
       rowkey STRING
      ,cf ROW(item_id STRING
      ,category_id STRING
      ,behavior STRING
      ,ts TIMESTAMP(3))
    ) WITH (
       'connector' = 'hbase-2.2'
       ,'zookeeper.quorum' = 'thinkpad:12181'
       ,'table-name' = 'user_log'
       ,'lookup.cache.max-rows' = '10000'
       ,'lookup.cache.ttl' = '1 minute' -- cached rows expire after this long
       ,'lookup.async' = 'true'
    );
    
    INSERT INTO kafka_join_mysql_demo(user_id, item_id, category_id, behavior, behavior_map, ts)
    SELECT a.user_id, a.item_id, a.category_id, a.behavior, c.cf.item_id, a.ts
    FROM user_log a
        left join hbase_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS c ON a.user_id = rowkey -- rowkey join: works
        -- left join hbase_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS c ON a.user_id = cf.item_id -- non-rowkey join: fails
    where a.behavior is not null;

    Joining on a non-rowkey column fails with:

    Caused by: java.lang.IllegalArgumentException: Currently, HBase table can only be lookup by single rowkey.
        at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
        at org.apache.flink.connector.hbase2.source.HBaseDynamicTableSource.getLookupRuntimeProvider(HBaseDynamicTableSource.java:52)
        at org.apache.flink.table.planner.plan.utils.LookupJoinUtil.getLookupFunction(LookupJoinUtil.java:172)
        at org.apache.flink.table.planner.plan.nodes.physical.common.CommonPhysicalLookupJoin.explainTerms(CommonPhysicalLookupJoin.scala:168)
        at org.apache.calcite.rel.AbstractRelNode.getDigestItems(AbstractRelNode.java:409)
        at org.apache.calcite.rel.AbstractRelNode.deepHashCode(AbstractRelNode.java:391)
        at org.apache.calcite.rel.AbstractRelNode$InnerRelDigest.hashCode(AbstractRelNode.java:443)
        at java.util.HashMap.hash(HashMap.java:339)
        at java.util.HashMap.get(HashMap.java:557)
        at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1150)
        at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:589)
        at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:604)
        at org.apache.calcite.plan.volcano.VolcanoRuleCall.transformTo(VolcanoRuleCall.java:148)
        ... 45 more


    There are also scenarios where a Flink-to-HBase join has to use a non-rowkey column. At first, for convenience, I simply implemented a UDF that loads the full HBase table into memory at startup (the dimension table is not large) and periodically reloads the latest data from HBase, as sketched below.
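
    A minimal sketch of that first approach, assuming illustrative names and the test table from this post (the original code is not shown here): the whole table is loaded into a map in open(), and reloaded once the snapshot is older than a refresh interval.

    import org.apache.flink.table.functions.{FunctionContext, ScalarFunction}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    import scala.collection.JavaConverters._
    import scala.collection.mutable

    // illustrative full-load dimension UDF: maps the cf:c1 value back to the rowkey
    class HbaseFullLoadJoinFunction extends ScalarFunction {

      private val reloadIntervalMs = 10 * 60 * 1000L // reload every 10 minutes
      private var lastLoadTime = 0L
      private val dimCache = mutable.Map[String, String]()

      override def open(context: FunctionContext): Unit = loadAll()

      def eval(key: String): String = {
        // reload the full table when the snapshot is older than the interval
        if (System.currentTimeMillis() - lastLoadTime > reloadIntervalMs) loadAll()
        dimCache.getOrElse(key, null)
      }

      // scan the whole table into memory; fine while the dimension table stays small
      private def loadAll(): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "thinkpad")
        conf.set("hbase.zookeeper.property.clientPort", "12181")
        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("user_info"))
        val scanner = table.getScanner(new Scan().addColumn("cf".getBytes("UTF-8"), "c1".getBytes("UTF-8")))
        dimCache.clear()
        for (result <- scanner.asScala) {
          val value = result.getValue("cf".getBytes("UTF-8"), "c1".getBytes("UTF-8"))
          if (value != null) dimCache.put(new String(value, "UTF-8"), new String(result.getRow, "UTF-8"))
        }
        scanner.close()
        table.close()
        connection.close()
        lastLoadTime = System.currentTimeMillis()
      }
    }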

    Inspired by the Lookup Source, I then wanted a UDF that joins HBase on a non-rowkey column, with caching and pass-through queries to HBase on cache misses.

    The requirements:
    1. The UDF joins on a non-rowkey HBase column
    2. Both the cache TTL and the cache size are configurable
    3. If the join key is in the cache, the result comes from the cache
    4. On a cache miss, HBase is queried and the result is put into the cache

    Based on these requirements, I implemented the following UDTF (since a non-rowkey column may contain duplicate values, a Table Function is used: when several rows match, all of them are returned to the SQL layer). A sketch of the enclosing class and its cache setup follows the two methods.

    ## UDF code

    @FunctionHint(output = new DataTypeHint("ROW<arr ARRAY<STRING>>"))
      def eval(key: String): Unit = {
        // if key is empty
        if (key == null || key.length == 0) {
          return
        }
        // insert
        val rowKind = RowKind.fromByteValue(0.toByte)
        val row = new Row(rowKind, 1)
        // get result from cache
        var list: ListBuffer[Array[String]] = cache.getIfPresent(key)
        if (list != null) {
          list.foreach(arr => {
            row.setField(0, arr)
            collect(row)
          })
          return
        }
        // cache get nothing, query hbase
        list = queryHbase(key)
        if (list.length == 0) {
          // if get nothing
          return
        }
        // get result, add to cache
        cache.put(key, list)
        list.foreach(arr => {
          row.setField(0, arr)
          collect(row)
        })
        //    LOG.info("finish join key : " + key)
      }
    
      /**
       * Query HBase for rows whose join column equals the given key.
       *
       * @param key join key
       * @return matched rows, each as an array of [rowkey, qualifier values...]
       */
      private def queryHbase(key: String): ListBuffer[Array[String]] = {
        val scan: Scan = new Scan()
        qualifier.foreach(item => scan.addColumn(family, item))

        // match rows whose first qualifier equals the join key
        val filter = new SingleColumnValueFilter(family, qualifier.head, CompareOperator.EQUAL, key.getBytes("UTF-8"))
        // without this, rows that lack the join column would also pass the filter
        filter.setFilterIfMissing(true)
        scan.setFilter(filter)

        val resultScanner = table.getScanner(scan)
        val it = resultScanner.iterator()

        val list = new ListBuffer[Array[String]]

        // loop over the scan results
        while (it.hasNext) {

          val result = it.next()
          val arr = new Array[String](qualifier.length + 1)
          // add the rowkey to the result array
          var index = 0
          val rowkey = new String(result.getRow)
          arr(index) = rowkey
          // add the requested qualifiers to the result array; the index advances
          // for every qualifier so columns stay aligned even when one is missing
          qualifier.foreach(item => {
            index += 1
            val value = result.getValue(family, item)
            if (value != null) {
              arr(index) = new String(value, "UTF-8")
            }
          })

          // add the array to the result list
          list.+=(arr)
        }
        resultScanner.close()
        list
      }
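
    For context, here is a minimal sketch of the enclosing class the two methods above belong to. The constructor parameters, the Guava cache setup, and the HBase connection handling are assumptions reconstructed from how eval and queryHbase use cache, table, family and qualifier; the exact code is in the GitHub repository linked at the end.

    import java.util.concurrent.TimeUnit

    import com.google.common.cache.{Cache, CacheBuilder}
    import org.apache.flink.table.annotation.{DataTypeHint, FunctionHint}
    import org.apache.flink.table.functions.{FunctionContext, TableFunction}
    import org.apache.flink.types.{Row, RowKind}
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Scan, Table}
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
    import org.apache.hadoop.hbase.{CompareOperator, HBaseConfiguration, TableName}

    import scala.collection.mutable.ListBuffer

    class HbaseNonRowkeyJoinFunction(zkHost: String, zkPort: String, tableName: String,
                                     familyName: String, qualifierNames: Array[String],
                                     cacheMaxRows: Long, cacheTtlMinutes: Long)
      extends TableFunction[Row] {

      private var connection: Connection = _
      private var table: Table = _
      private var family: Array[Byte] = _
      private var qualifier: Array[Array[Byte]] = _
      private var cache: Cache[String, ListBuffer[Array[String]]] = _

      override def open(context: FunctionContext): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", zkHost)
        conf.set("hbase.zookeeper.property.clientPort", zkPort)
        connection = ConnectionFactory.createConnection(conf)
        table = connection.getTable(TableName.valueOf(tableName))
        family = familyName.getBytes("UTF-8")
        qualifier = qualifierNames.map(_.getBytes("UTF-8"))
        // bound the cache by row count and by TTL (requirement 2)
        cache = CacheBuilder.newBuilder()
          .maximumSize(cacheMaxRows)
          .expireAfterWrite(cacheTtlMinutes, TimeUnit.MINUTES)
          .build[String, ListBuffer[Array[String]]]()
      }

      override def close(): Unit = {
        if (table != null) table.close()
        if (connection != null) connection.close()
      }

      // eval(...) and queryHbase(...) as shown above
    }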

    ## Test SQL

    -- kafka source
    CREATE TABLE user_log (
      user_id STRING
      ,item_id STRING
      ,category_id STRING
      ,behavior STRING
      ,ts TIMESTAMP(3)
      ,process_time as proctime()
    --   , WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_behavior'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'group-offsets'
      ,'format' = 'json'
    );
    
    -- sink table
    CREATE TABLE join_hbase_sink (
      user_id STRING
      ,item_id STRING
      ,category_id STRING
      ,behavior STRING
      ,ts TIMESTAMP(3)
      ,rowkey STRING
      ,c1 STRING
      ,c2 STRING
      ,c3 STRING
      ,c4 STRING
      ,primary key (user_id) not enforced
    ) WITH (
       'connector' = 'upsert-kafka'
      ,'topic' = 'user_behavior_1'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'key.format' = 'csv'
      ,'value.format' = 'csv'
    );
    
    INSERT INTO join_hbase_sink
    SELECT a.user_id, a.item_id, a.category_id, a.behavior, a.ts, t2.col[1], t2.col[2], t2.col[3], t2.col[4], t2.col[5]
    FROM user_log a
    -- left join lateral table(udf_join_hbase_non_rowkey_no_cache(item_id)) as t2(col) on true
    left join lateral table(udf_join_hbase_non_rowkey_cache(item_id)) as t2(col) on true
    where a.item_id is not null
    -- and t2.col[1] = a.item_id -- when multiple rows come back, they can be filtered here in the WHERE clause
    ;
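
    Before the SQL above can run, the UDTF has to be registered under the name used in the lateral join. A minimal sketch, assuming the class from the sketch above and illustrative parameter values (the sqlSubmit repository wires this up through its own submission framework):

    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
    import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = StreamTableEnvironment.create(env)

    // register the cached non-rowkey join UDTF under the name the SQL uses
    tEnv.createTemporarySystemFunction(
      "udf_join_hbase_non_rowkey_cache",
      new HbaseNonRowkeyJoinFunction("thinkpad", "12181", "user_info", "cf",
        Array("c1", "c2", "c3", "c4"), 10000L, 10L))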

    ## Test HBase table

    hbase(main):014:0> count 'user_info'
    Current count: 1000, row: 0999
    Current count: 2000, row: 2
    Current count: 3000, row: 3
    Current count: 4000, row: 4
    Current count: 5000, row: 5
    Current count: 6000, row: 6
    Current count: 7000, row: 7
    Current count: 8000, row: 8
    Current count: 9000, row: 9
    9999 row(s)
    Took 0.5151 seconds
    => 9999
    hbase(main):015:0> get 'user_info','0001'
    COLUMN   CELL
     cf:c1   timestamp=1632311933302, value=0001  # join key
     cf:c10  timestamp=1632311933302, value=\xED\x9E\x99\xDE\x88\xE6\xB4\x92\xE6\x90\x9D\xEC\x84\x88\xEC\xA0\x89\xE3\x97\xA6\xEB\xB3\xAF\xE6\x9B\x91\xE6\xBC\x9B\xE8\xB0\x81\xE7\x89\xB2\xE6\xBE\x94\xEC\x8C\xB8\xEB\x86\xA6
     cf:c2   timestamp=1632311933302, value=\xE7\x8F\xAD\xE7\x99\x83\xED\x80\x98\xE0\xAB\x86\xEB\xB0\x9A\xE7\xB8\x8B\xE7\xA2\xA8\xE8\x9F\xBF\xE8\xBF\x82\xEB\x9F\x86\xEB\x8B\xB8\xE4\x96\x82\xE7\xBC\xA0\xED\x99\xB8\xE9\xAE\x89
     cf:c3   timestamp=1632311933302, value=\xE3\x8A\xA6\xE8\xAB\xA5\xE7\x80\xA2\xE3\x83\xAF\xE3\x9D\xA5\xEC\xBA\x86\xE4\xA0\xB7\xEA\xB4\x83\xE5\xAC\x8D\xE3\x9F\xB0\xE6\xB1\xAB\xD6\x8B\xD7\xB8\xE2\xBB\xB4\xE1\xB5\xA6
     cf:c4   timestamp=1632311933302, value=\xE3\xA5\x95\xE9\x80\xA1\xE8\x9B\xB5\xE4\xBB\xB1\xD8\x90\xE2\x82\xA7\xEC\x98\xA6\xE6\xA5\xA0\xEB\xB1\xBE\xEB\xAD\x8A\xE7\xAB\xB3\xE6\xB7\xB6\xEA\x89\xA3\xED\x92\xBE\xE4\xBC\x9B
     cf:c5   timestamp=1632311933302, value=\xE9\xA0\x91\xE2\x93\x86\xE6\x86\xA0\xE1\x85\xA7\xE5\xB9\x84\xE7\x9F\xB7\xE3\x99\x88\xE2\xAF\x8F\xE3\x80\x84\xEC\xB9\xAD\xE6\x92\xB7\xE3\x91\xB7\xE9\x94\xB8\xEA\x85\x9C\xE2\x90\xAB
     cf:c6   timestamp=1632311933302, value=\xE9\xBF\xBA\xE2\x9A\x80\xEA\x81\x99\xEC\x99\xA1\xE3\xB1\x83\xE0\xBF\xA4\xE7\x8C\x8E\xE2\x9B\xB5\xE6\xAB\xBD\xE9\xBB\x9E\xE8\x96\xA2\xEA\x8E\xAF\xE5\x98\xBA\xE3\x83\x8C\xE1\x91\x8F
     cf:c7   timestamp=1632311933302, value=\xEA\x8D\x89\xE9\x87\xAC\xEA\xB9\xBB\xE3\x93\xBC\xE4\xAB\xAA\xEB\xA2\x9B\xE3\x85\x90\xEC\x93\xB9\xE8\x9A\x84\xE2\xBF\x82\xE7\xA4\xA7\xEA\xB5\xB6\xE2\x8F\xBC\xE0\xAC\x80\xE6\xB1\xA9
     cf:c8   timestamp=1632311933302, value=\xE1\x82\x8F\xE3\xAB\xBE\xE7\x92\x93\xE5\x97\xA0\xE1\xA0\x93\xC6\xBE\xE0\xB6\xA5\xE3\xB4\x93\xEC\x9D\xB3\xC5\x89\xE3\x98\xB2\xE8\x91\x8E\xC2\xA3\xE4\xB3\x90\xE5\x87\xA8
     cf:c9   timestamp=1632311933302, value=\xEA\x80\xB8\xEC\xAD\x8F\xEB\x95\x91\xD8\xB6\xE8\x90\xAE\x5C\xE5\x85\x95\xEC\x89\x83\xCC\xAF\xE5\xA4\x8B\xE4\x88\x99\xEB\xBA\x88\xE1\xA4\xB8\xEA\x9E\x99\xE6\x9B\x8F
    1 row(s)
    Took 0.0446 seconds
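
    For reference, a minimal sketch of how a test table like this can be populated (the table, family, and column names match the output above; the original generator is not shown, and alphanumeric strings stand in for its random multi-byte values):

    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    import scala.util.Random

    object GenUserInfo {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "thinkpad")
        conf.set("hbase.zookeeper.property.clientPort", "12181")
        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("user_info"))
        for (i <- 1 to 9999) {
          val rowkey = f"$i%04d"
          val put = new Put(rowkey.getBytes("UTF-8"))
          // cf:c1 holds the join key; here it simply repeats the zero-padded rowkey
          put.addColumn("cf".getBytes("UTF-8"), "c1".getBytes("UTF-8"), rowkey.getBytes("UTF-8"))
          // cf:c2 .. cf:c10 hold random payload strings
          for (c <- 2 to 10) {
            val value = Random.alphanumeric.take(10).mkString
            put.addColumn("cf".getBytes("UTF-8"), s"c$c".getBytes("UTF-8"), value.getBytes("UTF-8"))
          }
          table.put(put)
        }
        table.close()
        connection.close()
      }
    }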

    Flink job: (screenshots of the running job omitted)

    HBase table: 10,000 rows; non-rowkey join test with 10,000 distinct join-key values and a cache TTL of 10 minutes.
    On my laptop, on YARN, the job reached 850+ TPS, which is basically usable; a server environment should do considerably better, and the parallelism can be raised if that is not enough.


    Possible improvements:
    1. Drive the UDF from configuration
    2. Support joining on multiple fields
    3. Patch the HBase connector source to support lookup joins on non-rowkey columns

    For the complete example, see GitHub: https://github.com/springMoon/sqlSubmit

    Welcome to follow the Flink菜鸟 WeChat official account for occasional posts on Flink development.

  • Original post: https://www.cnblogs.com/Springmoon-venn/p/15323951.html