• Hive新功能 Cube, Rollup介绍


    说明:Hive之cube、rollup,还有窗口函数,在传统关系型数据(Oracle、sqlserver)中都是有的,用法都很相似。

    GROUPING SETS

    GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,

    以acorn_3g.test_xinyan_reg为例:

    [dp@YZSJHL19-87 xjob]$ hive -e "use acorn_3g;desc test_xinyan_reg;"
    user_id                 bigint                  None                 
    device_id               int                     None   手机,平板             
    os_id                   int                     None   操作系统类型             
    app_id                  int                     None   手机app_id             
    client_version          string                  None   客户端版本             
    from_id                 int                     None   四级渠道
    
    
    

    几个demo帮助大家了解:

    grouping sets语句等价hive语句
    select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id))  SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
    select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
    select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id 
    UNION ALL 
    SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
    select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id 
    UNION ALL 
    SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id 
    UNION ALL 
    SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id  
    UNION ALL 
    SELECT null,null,null,count(user_id) FROM test_xinyan_reg

    CUBE函数

    cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合

    select device_id,os_id,app_id,client_version,from_id,count(user_id) 
    from test_xinyan_reg 
    group by device_id,os_id,app_id,client_version,from_id with cube;
    

    手工实现需要写的hql语句(写个程序自己生成的,手写累死):

    SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
    UNION ALL
    SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
    UNION ALL
    SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
    UNION ALL
    SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
    UNION ALL
    SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
    UNION ALL
    SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
    UNION ALL
    SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
    UNION ALL
    SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
    UNION ALL
    SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
    UNION ALL
    SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
    UNION ALL
    SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
    UNION ALL
    SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
    UNION ALL
    SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
    UNION ALL
    SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
    UNION ALL
    SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
    UNION ALL
    SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
    UNION ALL
    SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
    UNION ALL
    SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
    UNION ALL
    SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
    UNION ALL
    SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
    UNION ALL
    SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
    UNION ALL
    SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
    UNION ALL
    SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
    UNION ALL
    SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
    UNION ALL
    SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
    UNION ALL
    SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
    UNION ALL
    SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
    UNION ALL
    SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
    UNION ALL
    SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
    UNION ALL
    SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
    UNION ALL
    SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
    UNION ALL
    SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg
    

    看着很蛋疼是不是,体会到cube的强大了吗!(低版本hive可以通过union all方式解决,算是没有办法的办法)

    ROLL UP函数

    rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。

     select device_id,os_id,app_id,client_version,from_id,count(user_id) 
    from test_xinyan_reg 
    group by device_id,os_id,app_id,client_version,from_id with rollup;
    

    等价以下sql语句:

     select device_id,os_id,app_id,client_version,from_id,count(user_id) 
    from test_xinyan_reg 
    group by device_id,os_id,app_id,client_version,from_id 
    grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
    

    Grouping_ID函数

    当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

    直接拿官方文档一个例子,O(∩_∩)O哈哈~

    Column1 (key)Column2 (value)
    1 NULL
    1 1
    2 2
    3 3
    3 NULL
    4 5

    hql统计:

      SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP
    

    统计结果如下:

        
    NULL NULL 0     00 6
    1 NULL 1     10 2
    1 NULL 3     11 1
    1 1 3     11 1
    2 NULL 1     10 1
    2 2 3     11 1
    3 NULL 1     10 2
    3 NULL 3     11 1
    3 3 3     11 1
    4 NULL 1     10 1
    4 5 3     11 1

    GROUPING__ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果),

    窗口函数

    hive窗口函数,感觉大部分都是在模仿oracle,有对oracle熟悉的,应该看下就知道怎么用。

    具体参见:http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

    主要围绕..over( partitoin by ..) ..

    3g业务求新增激活时候,有的一部手机,可能注册多个渠道,这时候就要按时间顺序求第一个:

    select f.udid,f.from_id,f.ins_date 
    from (select /* +MAPJOIN(u) */ u.device_id as udid ,g.device_id as gdid,u.from_id,u.ins_date,row_number() over (partition by u.device_id order by u.ins_date asc) as row_number 
            from user_device_info u  
            left outer join  (select device_id from 3g_device_id where log_date<'2013-07-25') g  on ( u.device_id = g.device_id ) 
            where u.log_date='2013-07-25' and u.device_id is not null and u.device_id <> '') f  
    where f.gdid is null and row_number=1
    

    参考资料

    apache hive窗口函数官方介绍:http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

    apache hive官方:cube、rollup函数介绍:https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

    oracle窗口函数介绍:http://www.blogjava.net/pengpenglin/archive/2012/04/12/211334.html

     
     
    参考地址:http://blog.csdn.net/moon_yang_bj/article/details/17200367
     
  • 相关阅读:
    openresty开发系列35--openresty执行流程之5内容content阶段
    openresty开发系列34--openresty执行流程之4访问阶段
    openresty开发系列33--openresty执行流程之3重写rewrite和重定向
    [转] Dangers of using dlsym() with RTLD_NEXT
    fork failed because of Out Of Memory
    gdb 拾穗
    原子变量的性能问题
    blktrace + blkparse + btt 分析IO
    [转] 利用BLKTRACE分析IO性能
    使用perf + FlameGraph生成进程火焰图
  • 原文地址:https://www.cnblogs.com/xiohao/p/6404000.html
Copyright © 2020-2023  润新知