• Hive Window Functions



    Official documentation

    See the Hive official site.
    Oracle and SQL Server both provide window functions, but MySQL 5.5 and 5.6 do not!

    Window function = window + function

    • Window: the range of the data set that the function operates on when it runs
    • Function: the function being applied!
      Only the following functions are supported:

    Windowing functions

    • LEAD (scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows after the current row; if that row does not exist, the default value is used
    • LAG (scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows before the current row; if that row does not exist, the default value is used
    • FIRST_VALUE(column, [false (default)]): returns the first value of the specified column in the current window; if the second parameter is true and the first value is NULL, NULL values are skipped and the search continues
    • LAST_VALUE(column, [false (default)]): returns the last value of the specified column in the current window; if the second parameter is true and the last value is NULL, NULL values are skipped and the search continues (see the sketch right after this list)
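
    To illustrate the four functions above, here is a minimal sketch assuming the same business(name, orderdate, cost) table used in the exercises later in this post; the query and its aliases are only an example, not part of the original exercises:

     -- look one row back and one row forward within each customer's orders;
     -- 0 is the default used when no such row exists
     select name, orderdate, cost,
            lag(cost, 1, 0)  over (partition by name order by orderdate) as prev_cost,
            lead(cost, 1, 0) over (partition by name order by orderdate) as next_cost,
            -- second argument true = skip NULL values when picking the first/last value
            first_value(cost, true) over (partition by name order by orderdate) as first_cost,
            last_value(cost, true)  over (partition by name order by orderdate
                                          rows between current row and unbounded following) as last_cost
     from business;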

    Aggregate functions (generally used together with OVER): min, max, avg, sum, count
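
    A small sketch (again assuming the business table from the exercises) that attaches several per-customer aggregates to every detail row; the alias names are made up for the example:

     -- each row keeps its detail columns and also carries the per-customer aggregates
     select name, orderdate, cost,
            min(cost)   over (partition by name) as min_cost,
            max(cost)   over (partition by name) as max_cost,
            avg(cost)   over (partition by name) as avg_cost,
            sum(cost)   over (partition by name) as total_cost,
            count(cost) over (partition by name) as order_cnt
     from business;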

    Ranking functions:

    • RANK
    • ROW_NUMBER
    • DENSE_RANK
    • CUME_DIST
    • PERCENT_RANK
    • NTILE

    Note: not every function lets you control the data set it computes over by changing the window size! All of the ranking functions, plus LAG and LEAD, support OVER(), but a window_clause cannot be specified inside their OVER().
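
    To see how the ranking functions differ, a sketch (assuming the business table again, ranking each customer's purchases by cost); note that the OVER() contains only PARTITION BY and ORDER BY, no window_clause:

     -- rank() leaves gaps after ties, dense_rank() does not, row_number() never ties
     select name, orderdate, cost,
            rank()       over (partition by name order by cost desc) as rk,
            dense_rank() over (partition by name order by cost desc) as drk,
            row_number() over (partition by name order by cost desc) as rn
     from business;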

    Syntax: function OVER( PARTITION BY column, ORDER BY column window_clause )

    The window size can be specified with the window_clause:

    (rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
    (rows | range) between current row and (current row | (unbounded | [num]) following)
    (rows | range) between [num] following and (unbounded | [num]) following
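
    As an illustration of the third form above (rows between [num] following and unbounded following), a sketch on the business table that sums everything strictly after the current purchase; the alias is an assumption for the example:

     -- total cost of all of this customer's later purchases, excluding the current row
     select name, orderdate, cost,
            sum(cost) over (partition by name order by orderdate
                            rows between 1 following and unbounded following) as later_cost
     from business;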
    

    Special cases:

    • ① If the OVER() contains neither a window_clause nor an ORDER BY, the window defaults to rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
    • ② If the OVER() specifies an ORDER BY but no window_clause, the window defaults to range between UNBOUNDED PRECEDING and CURRENT ROW (both defaults are written out explicitly in the sketch below)
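
    The two defaults can be spelled out explicitly. In the following sketch (business table again), total_cost should match sum(cost) over(partition by name) and running_cost should match sum(cost) over(partition by name order by orderdate):

     select name, orderdate, cost,
            -- case ①: no order by, no window_clause -> the whole partition
            sum(cost) over (partition by name
                            rows between unbounded preceding and unbounded following) as total_cost,
            -- case ②: order by without window_clause -> running total up to the current row
            sum(cost) over (partition by name order by orderdate
                            range between unbounded preceding and current row) as running_cost
     from business;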

    How do window functions differ from GROUP BY?

    • ① With GROUP BY, the SELECT list can only contain the grouped columns (and aggregates)
    • ② A window function is evaluated once for every row, within the specified window
    • ③ GROUP BY collapses rows (it has a deduplicating effect), while PARTITION BY does not! (see the comparison below)
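
    A sketch of the contrast, assuming the business table: the GROUP BY query returns one row per customer, while the window version keeps every purchase row and repeats the per-customer total on each of them:

     -- group by: one row per name; only grouped columns and aggregates may appear in select
     select name, sum(cost) as total_cost
     from business
     group by name;

     -- window function: every detail row is kept; the sum is computed for each row over its partition
     select name, orderdate, cost,
            sum(cost) over (partition by name) as total_cost
     from business;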

    Exercises

    (9) Query the order details for the earliest 20% of orders by order date
    Exact approach:

     select *
     from
     (select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
     from  business) tmp
     where cdnum<=0.2
    

    Approximate approach:

     select *
     from
     (select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum
     from  business) tmp
     where cdnum=1
    

    (8) Query each customer's purchase details together with the total cost of the customer's most recent three purchases

    "Most recent three" can be read two ways: the current purchase plus the previous two, or the current purchase plus the previous one and the next one.

    Current purchase plus the previous two:

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT  row) 
     from business 
    

    Current purchase plus the previous one and the next one:

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1  FOLLOWING) 
     from business
    

    Or, equivalently, using LAG and LEAD:

    select name,orderdate,cost,cost+
     lag(cost,1,0) over(partition by name order by orderdate )+
     lead(cost,1,0) over(partition by name order by orderdate )
     from business
    

    (7) Query each customer's purchase details together with the date of the customer's last purchase in that month

    select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT  row and UNBOUNDED  FOLLOWING) 
     from business 
    

    (6) Query each customer's purchase details together with the date of the customer's first purchase in that month

    select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate ) 
     from business
    

    (5) Query each customer's purchase details together with the date of the customer's next purchase

     select name,orderdate,cost,lead(orderdate,1,'无数据') over(partition by name order by orderdate ) 
     from business
    

    (4) Query each customer's purchase details together with the date of the customer's previous purchase

      select name,orderdate,cost,lag(orderdate,1,'无数据') over(partition by name order by orderdate ) 
     from business
    

    (3) Query each customer's purchase details, with cost accumulated as a running total by date

    select name,orderdate,cost,sum(cost) over(partition by name order by orderdate ) 
     from business
    

    (2) Query each customer's purchase details together with their monthly purchase total

     select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) ) 
     from business
    

    (1) Query the customers who made purchases in April 2017, together with the total number of such customers

    select name,count(*) over(rows between UNBOUNDED  PRECEDING and UNBOUNDED  FOLLOWING)
    from business
    where substring(orderdate,1,7)='2017-04'
    group by name
    

    This is equivalent to:

    select name,count(*) over()
    from business
    where substring(orderdate,1,7)='2017-04'
    group by name
    
  • Original post: https://www.cnblogs.com/sunbr/p/13778737.html