• hive 之 Cube, Rollup介绍


    1. GROUPING SETS

    GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统维度,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来。

    为方便理解,以testdb.test_1为例:

    hive> use testdb;
    hive> desc test_1;
    
    user_id        string      id                
    device_id      string      设备类型:手机、平板             
    os_id          string      操作系统类型:ios、android            
    app_id         string      手机app_id             
    client_v       string      客户端版本             
    channel        string      渠道
    
    grouping sets语句 等价hive语句
    select device_id,os_id,app_id,count(user_id) from  test_1 group by device_id,os_id,app_id grouping sets((device_id))  SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
    select device_id,os_id,app_id,count(user_id) from  test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id
    select device_id,os_id,app_id,count(user_id) from  test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id UNION ALL SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
    select device_id,os_id,app_id,count(user_id) from  test_1 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id UNION ALL SELECT null,os_id,null,count(user_id) FROM test_1 group by os_id UNION ALL SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id  UNION ALL SELECT null,null,null,count(user_id) FROM test_1

    2. CUBE函数

    cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,cube会统计所选列中值的所有组合的聚合

    select device_id,os_id,app_id,client_v,channel,count(user_id) 
    from test_1 
    group by device_id,os_id,app_id,client_v,channel with cube;
    

    等价于:

    SELECT device_id,null,null,null,null ,count(user_id) FROM test_1 group by device_id
    UNION ALL
    SELECT null,os_id,null,null,null ,count(user_id) FROM test_1 group by os_id
    UNION ALL
    SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_1 group by device_id,os_id
    UNION ALL
    SELECT null,null,app_id,null,null ,count(user_id) FROM test_1 group by app_id
    UNION ALL
    SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_1 group by device_id,app_id
    UNION ALL
    SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_1 group by os_id,app_id
    UNION ALL
    SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id
    UNION ALL
    SELECT null,null,null,client_v,null ,count(user_id) FROM test_1 group by client_v
    UNION ALL
    SELECT device_id,null,null,client_v,null ,count(user_id) FROM test_1 group by device_id,client_v
    UNION ALL
    SELECT null,os_id,null,client_v,null ,count(user_id) FROM test_1 group by os_id,client_v
    UNION ALL
    SELECT device_id,os_id,null,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,client_v
    UNION ALL
    SELECT null,null,app_id,client_v,null ,count(user_id) FROM test_1 group by app_id,client_v
    UNION ALL
    SELECT device_id,null,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,app_id,client_v
    UNION ALL
    SELECT null,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by os_id,app_id,client_v
    UNION ALL
    SELECT device_id,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v
    UNION ALL
    SELECT null,null,null,null,channel ,count(user_id) FROM test_1 group by channel
    UNION ALL
    SELECT device_id,null,null,null,channel ,count(user_id) FROM test_1 group by device_id,channel
    UNION ALL
    SELECT null,os_id,null,null,channel ,count(user_id) FROM test_1 group by os_id,channel
    UNION ALL
    SELECT device_id,os_id,null,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,channel
    UNION ALL
    SELECT null,null,app_id,null,channel ,count(user_id) FROM test_1 group by app_id,channel
    UNION ALL
    SELECT device_id,null,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,app_id,channel
    UNION ALL
    SELECT null,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by os_id,app_id,channel
    UNION ALL
    SELECT device_id,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,channel
    UNION ALL
    SELECT null,null,null,client_v,channel ,count(user_id) FROM test_1 group by client_v,channel
    UNION ALL
    SELECT device_id,null,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,client_v,channel
    UNION ALL
    SELECT null,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by os_id,client_v,channel
    UNION ALL
    SELECT device_id,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,client_v,channel
    UNION ALL
    SELECT null,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by app_id,client_v,channel
    UNION ALL
    SELECT device_id,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,app_id,client_v,channel
    UNION ALL
    SELECT null,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by os_id,app_id,client_v,channel
    UNION ALL
    SELECT device_id,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v,channel
    UNION ALL
    SELECT null,null,null,null,null ,count(user_id) FROM test_1
    

    3. ROLL UP函数

    rollup可以实现从右到左递减多级的统计,显示统计某一层次结构的聚合

    select device_id,os_id,app_id,client_v,channel,count(user_id) 
    from test_1 
    group by device_id,os_id,app_id,client_v,channel with rollup;
    

    等价于:

    select device_id,os_id,app_id,client_v,channel,count(user_id) 
    from test_1 
    group by device_id,os_id,app_id,client_v,channel 
    grouping sets ((device_id,os_id,app_id,client_v,channel),(device_id,os_id,app_id,client_v),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
    

    4.Grouping_ID函数

    当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

    例子如下:

    Column1 (key) Column2 (value)
    1 NULL
    1 1
    2 2
    3 3
    3 NULL
    4 5

    hql统计:

      SELECT key, value, GROUPING_ID, count(*) from T1 GROUP BY key, value WITH ROLLUP
    

    结果如下:

     key value GROUPING_ID  count(*) 
    NULL NULL 0     00 6
    1 NULL 1     10 2
    1 NULL 3     11 1
    1 1 3     11 1
    2 NULL 1     10 1
    2 2 3     11 1
    3 NULL 1     10 2
    3 NULL 3     11 1
    3 3 3     11 1
    4 NULL 1     10 1
    4 5 3     11 1

    GROUPING_ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果),

    5. 窗口函数

    hive窗口函数,感觉大部分都是在模仿oracle,有对oracle熟悉的,应该看下就知道怎么用。

    具体参见:http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

    参考文章

    1. https://blog.csdn.net/gua___gua/article/details/52523698
    Souviens Toi Que Tu Vas Mourir !
  • 相关阅读:
    9_输出表.md
    8_输入表详解.md
    3_imageoptionalheader32.md
    5_节表续.md
    1.md
    10_基址重定向.md
    4_节表.md
    如何优雅的写单词_lduoj_kmp
    杀人游戏_lduoj_Tarjan_强联通
    All in the Family_upc_模拟 or lca + 并查集
  • 原文地址:https://www.cnblogs.com/remainsu/p/11060580.html
Copyright © 2020-2023  润新知