• Verifying the merge of incremental (delta) data into a base table


    1. Create the test data for the base table and the delta

    [root@node1 delta_merge]# pwd
    /root/delta_merge
    [root@node1 delta_merge]# cat base.txt 
    1001,gongshaocheng
    1002,LIDACHAO
    [root@node1 delta_merge]# cat delta.txt 
    1002,lidachao
    1003,chenjianzhong

    [root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta
    [root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/base
    [root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/delta
    [root@node1 delta_merge]# hdfs dfs -put base.txt /user/merge_delta/base
    [root@node1 delta_merge]# hdfs dfs -put delta.txt /user/merge_delta/delta

    2. Create the test tables

    hive> create external table base(id string,name string)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
        > location "/user/merge_delta/base/";
    OK
    Time taken: 0.304 seconds
    hive> select * from base;
    OK
    1001    gongshaocheng
    1002    LIDACHAO
    Time taken: 0.875 seconds, Fetched: 2 row(s)
    hive> create external table delta(id string,name string)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
        > location "/user/merge_delta/delta/";
    OK
    Time taken: 0.134 seconds
    hive> select * from delta;
    OK
    1002    lidachao
    1003    chenjianzhong
    Time taken: 0.321 seconds, Fetched: 2 row(s)

    3. Tests:

    a. full outer join syntax:

    hive> select base.*,delta.* from base full outer join delta on base.id = delta.id;

    The result is:

    1001    gongshaocheng    NULL    NULL
    1002    LIDACHAO    1002    lidachao
    NULL    NULL    1003    chenjianzhong
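    The shape of this result can be reproduced outside Hive. The following is a minimal sketch in Python, not Hive's implementation: `full_outer_join` is a hypothetical helper, `None` stands for NULL, and ids are assumed to be unique and non-NULL in each table. Rows are sorted by id only to make the output deterministic; Hive does not guarantee an order.

```python
# Rows copied from base.txt and delta.txt.
base = [("1001", "gongshaocheng"), ("1002", "LIDACHAO")]
delta = [("1002", "lidachao"), ("1003", "chenjianzhong")]


def full_outer_join(base_rows, delta_rows):
    """Emulate: base FULL OUTER JOIN delta ON base.id = delta.id.

    Hypothetical helper; assumes ids are unique and non-NULL in each table.
    """
    base_by_id = dict(base_rows)
    delta_by_id = dict(delta_rows)
    joined = []
    for key in sorted(base_by_id.keys() | delta_by_id.keys()):
        # A side with no matching id contributes NULLs (None here).
        left = (key, base_by_id[key]) if key in base_by_id else (None, None)
        right = (key, delta_by_id[key]) if key in delta_by_id else (None, None)
        joined.append(left + right)
    return joined


for row in full_outer_join(base, delta):
    print(row)
```

    Each output tuple has four fields (base.id, base.name, delta.id, delta.name), matching the four columns of the Hive result.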

    The answer we ultimately want is:

    1001 gongshaocheng -- record left unchanged

    1002 lidachao  -- latest version of a modified record

    1003 chenjianzhong -- newly added record
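    As a cross-check on this target, the same three rows can be derived with a plain "upsert" over the two input files. This is only a reference sketch in Python, assuming ids are unique within each file; `expected_merge` is a hypothetical helper name and is not part of the Hive solution.

```python
# Rows copied from base.txt and delta.txt.
base = [("1001", "gongshaocheng"), ("1002", "LIDACHAO")]
delta = [("1002", "lidachao"), ("1003", "chenjianzhong")]


def expected_merge(base_rows, delta_rows):
    """Upsert delta rows over base rows, keyed by id (hypothetical helper)."""
    merged = dict(base_rows)    # start from the base table
    merged.update(delta_rows)   # delta wins on matching ids; new ids are added
    return sorted(merged.items())


for row_id, name in expected_merge(base, delta):
    print(row_id, name)
```

    Any HQL that claims to merge the delta into the base table should produce the same set of rows.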

    b. The coalesce function:

    select 
    coalesce(base.id, delta.id)
    from base full outer join delta on base.id = delta.id
    where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

    Result:

    1001
    1002
    1003

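    The above confirms that for the primary-key column we can use the coalesce function, so that the key column in the result set always has a value.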
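    The behavior relied on here is that coalesce returns its first non-NULL argument. A small Python sketch of that rule, applied to the joined rows from the full outer join result (`None` stands for NULL; the `coalesce` function below is an illustrative stand-in, not Hive's code):

```python
# Joined rows copied from the full outer join result (None stands for NULL).
joined = [
    ("1001", "gongshaocheng", None, None),
    ("1002", "LIDACHAO", "1002", "lidachao"),
    (None, None, "1003", "chenjianzhong"),
]


def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


ids = [coalesce(base_id, delta_id) for base_id, _, delta_id, _ in joined]
print(ids)
```

    Because every joined row has an id on at least one side, the merged key is never NULL.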

    c. The if function:

    select 
    if(delta.id is NULL, base.name,delta.name)
    from base full outer join delta on base.id = delta.id
    where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

    Result:

    gongshaocheng
    lidachao
    chenjianzhong

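    The above confirms that for an ordinary column, if the row was not modified (delta.id is NULL) we take the value from the base table; otherwise we take the value from the delta table.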
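    The same selection rule can be sketched in Python. `hive_if` is a hypothetical stand-in for Hive's if(condition, a, b), and `None` stands for NULL. The last lines use a made-up row (id 1004) that is not in the test data, to show one consequence of testing delta.id rather than delta.name.

```python
# Joined rows copied from the full outer join result (None stands for NULL).
joined = [
    ("1001", "gongshaocheng", None, None),
    ("1002", "LIDACHAO", "1002", "lidachao"),
    (None, None, "1003", "chenjianzhong"),
]


def hive_if(condition, value_if_true, value_if_false):
    """Stand-in for if(condition, a, b): pick a when condition holds, else b."""
    return value_if_true if condition else value_if_false


names = [
    hive_if(delta_id is None, base_name, delta_name)
    for _, base_name, delta_id, delta_name in joined
]
print(names)

# Hypothetical row: the delta matches id 1004 but carries a NULL name.
# The delta row still wins, so the merged name is NULL, not "old".
print(hive_if("1004" is None, "old", None))
```

    Keying the choice on delta.id means a delta row replaces the base row as a whole, including any NULL columns it carries. Whether that is desired depends on how the delta is produced.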

    Finally, combining the two gives the HQL statement we want:

    select 
    coalesce(base.id, delta.id),
    if(delta.id is NULL, base.name,delta.name)
    from base full outer join delta on base.id = delta.id
    where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

    The result is:

    hive> select 
        > coalesce(base.id, delta.id),
        > if(delta.id is NULL, base.name,delta.name)
        > from base full outer join delta on base.id = delta.id
        > where (delta.id is NULL AND base.id is NOT NUll) OR (delta.id is NOT NULL AND base.id is NOT NUll) OR (delta.id is NOT NULL AND base.id is NUll);
    Query ID = root_20151230235050_befa6322-f78f-4166-8bbd-4fde04a1a9b1
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1451024710809_0005, Tracking URL = http://node1.clouderachina.com:8088/proxy/application_1451024710809_0005/
    Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1451024710809_0005
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2015-12-30 23:51:04,904 Stage-1 map = 0%,  reduce = 0%
    2015-12-30 23:51:13,245 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
    2015-12-30 23:51:22,685 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.16 sec
    MapReduce Total cumulative CPU time: 5 seconds 160 msec
    Ended Job = job_1451024710809_0005
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.16 sec   HDFS Read: 12293 HDFS Write: 52 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 160 msec
    OK
    1001    gongshaocheng
    1002    lidachao
    1003    chenjianzhong
    Time taken: 31.376 seconds, Fetched: 3 row(s)

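    Note: every HQL statement above needs only a single MapReduce job (the log above reports "Total jobs = 1" for the final statement). That is the essence of this solution!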

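    One last optimization of the HQL: the WHERE clause was added earlier only to keep the logic clear, by discussing the three cases of the FULL OUTER JOIN separately. As long as id is never NULL in either table, the three conditions joined by the two ORs cover every row the join can produce, so the WHERE clause is redundant. The final HQL is: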

    select 
    coalesce(base.id, delta.id),
    if(delta.id is NULL, base.name,delta.name)
    from base full outer join delta on base.id = delta.id;
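
    That dropping the WHERE clause is safe can be checked by enumerating the NULL / NOT NULL combinations of the two id columns. A small Python sketch (`where_clause` is a hypothetical transcription of the predicate; `None` stands for NULL):

```python
from itertools import product


def where_clause(base_id, delta_id):
    """Transcription of the three OR-ed conditions from the WHERE clause."""
    base_present = base_id is not None
    delta_present = delta_id is not None
    return (
        (not delta_present and base_present)    # only in base: unchanged
        or (delta_present and base_present)     # in both: modified
        or (delta_present and not base_present) # only in delta: new
    )


for base_id, delta_id in product([None, "1001"], repeat=2):
    print(base_id, delta_id, where_clause(base_id, delta_id))
```

    The predicate is false only when both ids are NULL. A joined row can look like that only if one of the tables itself contains a row whose id is NULL, so for tables keyed on a non-NULL id the filter removes nothing.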
  • Original article: https://www.cnblogs.com/littlesuccess/p/5090427.html