• Hive之窗口函数


    普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
    因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。
    简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

    OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
    CURRENT ROW:当前行
    n PRECEDING:往前n行数据
    n FOLLOWING:往后n行数据
    UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
    LAG(col,n):往前第n行数据
    LEAD(col,n):往后第n行数据
    NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。

    测试数据:

    name,orderdate,cost
    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94

    需求:
    (1)查询在2017年4月份购买过的顾客及总人数
    (2)查询顾客的购买明细及月购买总额
    (3)上述的场景,要将cost按照日期进行累加
    (4)查询顾客上次的购买时间
    (5)查询前20%时间的订单信息

    建表、导数:

    create table business(name string, orderdate string, cost int)
    row format delimited fields terminated by ',';
    
    load data local inpath "/opt/data/business.txt" into table business;

    (1)查询在2017年4月份购买过的顾客及总人数

    select
    name, count(*) as cnt
    from business where date_format(orderdate,'yyyy-MM')='2017-04'
    group by name;

    (2)查询顾客的购买明细及月购买总额

    select
    name,
    orderdate,
    cost,
    sum(cost) over (partition by month(orderdate)) as month_sum
    from business order by name,orderdate;

    (3)上述的场景,要将cost按照日期进行累加

    select
    name,orderdate,cost,
    sum(cost) over (partition by name order by orderdate rows between unbounded preceding and current row) as accumulate_cost,
    --按name分组,按orderdate排序,求从组内首行到当前行之和
    
    sum(cost) over(partition by name order by orderdate) as accumulate_cost2,
    --效果和上个一样
    
    sum(cost) over () as total_cost,
    --所有cost求和
    
    sum(cost) over (partition by name) as total_cost_by_person,
    --每个name组内之和
    
    sum(cost) over (partition by name order by orderdate rows between 1 preceding and current row) as double_cost,
    --组内按orderdate排序,前一行和当前行之和
    
    sum(cost) over (partition by name order by orderdate rows between 1 preceding and
    1 following) as triple_cost,
    --组内按orderdate排序,前一行、当前行、下一行,三行之和
    
    sum(cost) over (partition by name order by orderdate rows between current row and
    unbounded following) as total_next_cost
    --组内按orderdate排序,当前行至组内尾行之和
    from business;

    (4)查询顾客上次的购买时间

    select
    name,
    orderdate,
    lag(orderdate,1) over(partition by name order by orderdate) as time2
    from business;

    (5)查询前20%时间的订单信息,

    ntile(n)用于将分组数据按照顺序切分成n片,返回当前记录所在的切片值,经常用来取前30% 带有百分之多少比例的记录什么的。

    select name,orderdate,cost,
    ntile(5) over (order by orderdate) sorted
    from business;

     查询20%即sorted=1

    select t.* from (
    select
      name,
      orderdate,
      cost,
      ntile(5) over (order by orderdate) sorted
    from business
    ) t
    where t.sorted=1;

     扩展:
    求前40%时间和后60%时间内的平均cost

    --<1>按照orderdate升序拆分成5分

    select
    name,
    orderdate,
    cost,
    ntile(5) over (order by orderdate) as prt
    from business;

     --<2>前2份prt标识为1,是前40%的时间;后3份prt标识为2,是后60%的时间

    select 
    t.name,
    t.orderdate,
    t.cost,
    case when prt in (1,2) then 1 else 2 end as new_prt
    from(
        select 
        name,
        orderdate,
        cost,
        ntile(5) over (order by orderdate) as prt
        from business
        ) t;

    --<3>--取每个new_prt的平均值

    select 
    new_prt,
    max(case when new_prt=1 then "first_40%_avg_cost" when new_prt=2 then "last_60%_avg_cost" end) as prt_cost_name,
    cast(avg(cost)as decimal(10,2)) as avg_cost
    from(
        select 
        t.name,
        t.orderdate,
        t.cost,
        case when prt in (1,2) then 1 else 2 end as new_prt
        from(
            select 
            name,
            orderdate,
            cost,
            ntile(5) over (order by orderdate) as prt
            from business
             ) t
        ) q
    group by q.new_prt;

  • 相关阅读:
    常见排序算法(上)
    二.CSS的伪类
    ASP.NET_编码模型
    一.CSS工作原理
    如何拦截来电,并检测到某些特定号码时自动挂断电话
    如何获取短信内容
    Android导入项目时出现红色感叹号
    如何接收广播;如何发送广播
    设置Activity显示和关闭时的动画效果
    将Activity设置成半透明的对话框
  • 原文地址:https://www.cnblogs.com/lucas-zhao/p/12024161.html
Copyright © 2020-2023  润新知