• Hive Window Functions



    Official documentation

    See the Hive official site.
    Oracle and SQL Server both provide window functions, but MySQL 5.5 and 5.6 do not!

    Window function = window + function

    • Window: the range of the data set that the function operates on when it runs
    • Function: the function being applied!
      Only the following functions are supported:

    Windowing functions

    • LEAD (scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows after the current row; if that row does not exist, the default value is used
    • LAG (scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows before the current row; if that row does not exist, the default value is used
    • FIRST_VALUE(column, [false (default)]): returns the first value of the specified column in the current window; if the second parameter is true and the first value is NULL, NULL values are skipped and the search continues
    • LAST_VALUE(column, [false (default)]): returns the last value of the specified column in the current window; if the second parameter is true and the last value is NULL, NULL values are skipped and the search continues (see the sketch right after this list)
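
    To illustrate the four functions above, here is a minimal sketch assuming the same business(name, orderdate, cost) table used in the exercises later in this post; the query and its aliases are only an example, not part of the original exercises:

     -- look one row back and one row forward within each customer's orders;
     -- 0 is the default used when no such row exists
     select name, orderdate, cost,
            lag(cost, 1, 0)  over (partition by name order by orderdate) as prev_cost,
            lead(cost, 1, 0) over (partition by name order by orderdate) as next_cost,
            -- second argument true = skip NULL values when picking the first/last value
            first_value(cost, true) over (partition by name order by orderdate) as first_cost,
            last_value(cost, true)  over (partition by name order by orderdate
                                          rows between current row and unbounded following) as last_cost
     from business;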

    Aggregate functions (generally used together with OVER): min, max, avg, sum, count
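
    A small sketch (again assuming the business table from the exercises) that attaches several per-customer aggregates to every detail row; the alias names are made up for the example:

     -- each row keeps its detail columns and also carries the per-customer aggregates
     select name, orderdate, cost,
            min(cost)   over (partition by name) as min_cost,
            max(cost)   over (partition by name) as max_cost,
            avg(cost)   over (partition by name) as avg_cost,
            sum(cost)   over (partition by name) as total_cost,
            count(cost) over (partition by name) as order_cnt
     from business;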

    Ranking functions:

    • RANK
    • ROW_NUMBER
    • DENSE_RANK
    • CUME_DIST
    • PERCENT_RANK
    • NTILE

    Note: not every function lets you control the data set it computes over by changing the window size! All of the ranking functions, plus LAG and LEAD, support OVER(), but a window_clause cannot be specified inside their OVER().
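
    To see how the ranking functions differ, a sketch (assuming the business table again, ranking each customer's purchases by cost); note that the OVER() contains only PARTITION BY and ORDER BY, no window_clause:

     -- rank() leaves gaps after ties, dense_rank() does not, row_number() never ties
     select name, orderdate, cost,
            rank()       over (partition by name order by cost desc) as rk,
            dense_rank() over (partition by name order by cost desc) as drk,
            row_number() over (partition by name order by cost desc) as rn
     from business;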

    Syntax: function OVER( PARTITION BY column, ORDER BY column window_clause )

    The window size can be specified with the window_clause:

    (rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
    (rows | range) between current row and (current row | (unbounded | [num]) following)
    (rows | range) between [num] following and (unbounded | [num]) following
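
    As an illustration of the third form above (rows between [num] following and unbounded following), a sketch on the business table that sums everything strictly after the current purchase; the alias is an assumption for the example:

     -- total cost of all of this customer's later purchases, excluding the current row
     select name, orderdate, cost,
            sum(cost) over (partition by name order by orderdate
                            rows between 1 following and unbounded following) as later_cost
     from business;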
    

    Special cases:

    • ① If the OVER() contains neither a window_clause nor an ORDER BY, the window defaults to rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
    • ② If the OVER() specifies an ORDER BY but no window_clause, the window defaults to range between UNBOUNDED PRECEDING and CURRENT ROW (both defaults are written out explicitly in the sketch below)
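
    The two defaults can be spelled out explicitly. In the following sketch (business table again), total_cost should match sum(cost) over(partition by name) and running_cost should match sum(cost) over(partition by name order by orderdate):

     select name, orderdate, cost,
            -- case ①: no order by, no window_clause -> the whole partition
            sum(cost) over (partition by name
                            rows between unbounded preceding and unbounded following) as total_cost,
            -- case ②: order by without window_clause -> running total up to the current row
            sum(cost) over (partition by name order by orderdate
                            range between unbounded preceding and current row) as running_cost
     from business;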

    How do window functions differ from GROUP BY?

    • ① With GROUP BY, the SELECT list can only contain the grouped columns (and aggregates)
    • ② A window function is evaluated once for every row, within the specified window
    • ③ GROUP BY collapses rows (it has a deduplicating effect), while PARTITION BY does not! (see the comparison below)
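
    A sketch of the contrast, assuming the business table: the GROUP BY query returns one row per customer, while the window version keeps every purchase row and repeats the per-customer total on each of them:

     -- group by: one row per name; only grouped columns and aggregates may appear in select
     select name, sum(cost) as total_cost
     from business
     group by name;

     -- window function: every detail row is kept; the sum is computed for each row over its partition
     select name, orderdate, cost,
            sum(cost) over (partition by name) as total_cost
     from business;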

    Exercises

    (9) Query the order details for the earliest 20% of orders by order date
    Exact approach:

     select *
     from
     (select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
     from  business) tmp
     where cdnum<=0.2
    

    Approximate approach:

     select *
     from
     (select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum
     from  business) tmp
     where cdnum=1
    

    (8) Query each customer's purchase details together with the total cost of the customer's most recent three purchases

    "Most recent three" can be read two ways: the current purchase plus the previous two, or the current purchase plus the previous one and the next one.

    Current purchase plus the previous two:

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT  row) 
     from business 
    

    Current purchase plus the previous one and the next one:

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1  FOLLOWING) 
     from business
    

    Or, equivalently, using LAG and LEAD:

    select name,orderdate,cost,cost+
     lag(cost,1,0) over(partition by name order by orderdate )+
     lead(cost,1,0) over(partition by name order by orderdate )
     from business
    

    (7) Query each customer's purchase details together with the date of the customer's last purchase in that month

    select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT  row and UNBOUNDED  FOLLOWING) 
     from business 
    

    (6) Query each customer's purchase details together with the date of the customer's first purchase in that month

    select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate ) 
     from business
    

    (5) Query each customer's purchase details together with the date of the customer's next purchase

     select name,orderdate,cost,lead(orderdate,1,'无数据') over(partition by name order by orderdate ) 
     from business
    

    (4) Query each customer's purchase details together with the date of the customer's previous purchase

      select name,orderdate,cost,lag(orderdate,1,'无数据') over(partition by name order by orderdate ) 
     from business
    

    (3) Query each customer's purchase details, with cost accumulated as a running total by date

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate ) 
     from business
    

    (2) Query each customer's purchase details together with their monthly purchase total

     select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) ) 
     from business
    

    (1) Query the customers who made purchases in April 2017, together with the total number of such customers

    select name,count(*) over(rows between UNBOUNDED  PRECEDING and UNBOUNDED  FOLLOWING)
    from business
    where substring(orderdate,1,7)='2017-04'
    group by name
    

    This is equivalent to:

    select name,count(*) over()
    from business
    where substring(orderdate,1,7)='2017-04'
    group by name
    
  • Original post: https://www.cnblogs.com/sunbr/p/13778737.html