• impala+kudu


    【impala建表】kudu的表必须有主键,作为分区的字段需排在其他字段前面

    【range分区】(不推荐)
    CREATE TABLE KUDU_WATER_HISTORY ( 
    id STRING, 
    year INT,
    device STRING,
    reading INT,
    time STRING,
    PRIMARY KEY (id,year) 
    ) PARTITION BY RANGE (year)

    PARTITION VALUES < 2017, 
    PARTITION 2017 <= VALUES < 2018, 
    PARTITION 2018 <= VALUES
    )
    STORED AS KUDU 
    TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

    【hash分区】(不推荐)
    CREATE TABLE KUDU_WATER_HISTORY_PARTITION_BY_ID ( 
    id STRING, 
    year INT,
    device STRING,
    reading INT,
    time STRING,
    PRIMARY KEY (id) 
    ) PARTITION BY HASH (id) PARTITIONS 4
    STORED AS KUDU 
    TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

    【hash range混合分区】推荐是用混合分区方式
    CREATE TABLE KUDU_WATER_HISTORY (
    id STRING,
    year INT,
    device STRING,
    reading INT,
    time STRING,
    PRIMARY KEY (id,device,year)
    ) PARTITION BY HASH (device) PARTITIONS 3,
    RANGE (year)
    (
    PARTITION VALUE = 2016,
    PARTITION VALUE = 2017,
    PARTITION VALUE = 2018,
    PARTITION VALUE = 2019
    )
    STORED AS KUDU
    TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

    CREATE TABLE DEVICE_KUDU (
    id STRING,
    device STRING,
    name STRING,
    orgId INT,
    PRIMARY KEY (id)
    ) PARTITION BY HASH (id) PARTITIONS 4
    STORED AS KUDU
    TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');

    【增加分区】
    ALTER TABLE KUDU_WATER_HISTORY ADD RANGE PARTITION VALUE = 2020;

    【查询语句】

    select
    T_3C75F1.`device`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`),
    sum(T_3C75F1.`reading`),
    count(1)
    from (select DEVICE_KUDU.device,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,DEVICE_KUDU where KUDU_WATER_HISTORY.device=DEVICE_KUDU.device) as `T_3C75F1`
    group by
    T_3C75F1.`device`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`);

    耗时:DEVICE_KUDU表50条记录,KUDU_WATER_HISTORY表1亿条记录,执行上面的查询语句耗时12秒。同样的数据量和查询方式,比impala+hdfs+parquet快10倍。

    【跨数据源的查询】

    device表的数据存储在hdfs中,抄表读数的数据存储在kudu中。分别在impala中创建对应的表

    explain select
    T_3C75F1.`deviceid`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`),
    sum(T_3C75F1.`reading`),
    count(1)
    from (select device_parquet.deviceid,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,device_parquet where KUDU_WATER_HISTORY.device=device_parquet.deviceid) as `T_3C75F1`
    group by
    T_3C75F1.`deviceid`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`);

    执行上面的语句,查看执行计划

    +-----------------------------------------------------------------------------------------------------------------------------------------+
    | Explain String                                                                                                                          |
    +-----------------------------------------------------------------------------------------------------------------------------------------+
    | Per-Host Resource Reservation: Memory=400.00MB                                                                                          |
    | Per-Host Resource Estimates: Memory=2.25GB                                                                                              |
    | WARNING: The following tables are missing relevant table and/or column statistics.                                                      |
    | default.device_parquet, default.kudu_water_history                                                                                      |
    |                                                                                                                                         |
    | PLAN-ROOT SINK                                                                                                                          |
    | |                                                                                                                                       |
    | 07:EXCHANGE [UNPARTITIONED]                                                                                                             |
    | |                                                                                                                                       |
    | 06:AGGREGATE [FINALIZE]                                                                                                                 |
    | |  output: sum:merge(T_3C75F1.reading), count:merge(1)                                                                                  |
    | |  group by: T_3C75F1.deviceid, year(T_3C75F1.time), month(T_3C75F1.time)                                                               |
    | |                                                                                                                                       |
    | 05:EXCHANGE [HASH(T_3C75F1.deviceid,year(T_3C75F1.time),month(T_3C75F1.time))]                                                          |
    | |                                                                                                                                       |
    | 03:AGGREGATE [STREAMING]                                                                                                                |
    | |  output: sum(reading), count(1)                                                                                                       |
    | |  group by: device_parquet.deviceid, year(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')), month(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')) |
    | |                                                                                                                                       |
    | 02:HASH JOIN [INNER JOIN, BROADCAST]                                                                                                    |
    | |  hash predicates: KUDU_WATER_HISTORY.device = device_parquet.deviceid                                                                 |
    | |                                                                                                                                       |
    | |--04:EXCHANGE [BROADCAST]                                                                                                              |
    | |  |                                                                                                                                    |
    | |  01:SCAN HDFS [default.device_parquet]                                                                                                |
    | |     partitions=1/1 files=1 size=1.00KB                                                                                                |
    | |                                                                                                                                       |
    | 00:SCAN KUDU [default.kudu_water_history]                                                                                               |
    +-----------------------------------------------------------------------------------------------------------------------------------------+

    能看到00:SCAN KUDU和01:SCAN HDFS,分别从kudu和hdfs中查询数据。

  • 相关阅读:
    Java之正則表達式【使用语法】
    2015年创业中遇到的技术问题:71-80
    2015年创业中遇到的技术问题:71-80
    Kinect小小玩偶游戏----小小潜水员
    微信开发学习日记(三):6点经验
    微信开发学习日记(二):3个案例
    2015年创业中遇到的技术问题:61-70
    2015年创业中遇到的技术问题:61-70
    2次创业经验谈(想创业想做事的人不要错过)
    Kinect舒适区范围--UE4 的Blueprint测试范例
  • 原文地址:https://www.cnblogs.com/puroc/p/9234304.html
Copyright © 2020-2023  润新知