前言
在SQL
开发中,有时我们可以使用聚合函数将多行数据按照规则聚集在一行,但是我们又想同时得到聚合前的数据,单纯的聚合函数是做不到的,怎么办呢?这时我们的窗口函数就闪亮登场了。窗口函数兼具分组和排序功能,又叫分析函数!
1. 窗口函数
语法如下:
[分析函数] over ([partition by col,...] [order by col,... desc|asc] [rows between 开始行 and 结束行])
最重要的关键字就是over
,里面的 partition by
和 order by
都是可选项,具体可以根据自身需求灵活运用。over
和 分析函数(比如 sum,max,min
) 组合起来才能构成窗口函数。
partition by
:表示按照后跟的那些列进行分组,分析函数是按照每一组的数据进行分析计算的order by
:表示在窗口内进行排序,可以指定排序规则,desc|asc
表示 降序或升序。rows between 开始行 and 结束行
:表示窗口指定的数据范围,默认的范围是rows between unbounded preceding and current row
数据范围如下:
current row
: 当前行n preceding
:往前n
行数据n following
:往后n
行数据unbounded preceding
:表示该窗口的首行(起点)unbounded following
:表示该窗口的尾行(终点)
例如:
rows between unbounded preceding and current row --(表示从窗口起点到当前行) rows between unbounded preceding and unbounded following--(表示从窗口起点到终点) rows between 20 preceding and 10 following --(表示往前20行到往后10行) rows between 20 preceding and current row --(表示往前20行到当前行) rows between current row and unbounded following --(表示当前行到终点)
在讲解窗口函数前,先创建一个表,以实际例子讲解大家更容易理解。
创建一个交易流水表 trade
,表结构如下:
create table trade( user_id string, -- 用户ID trade_date string, -- 交易日期 trade_num int -- 交易数量 );
添加数据如下:
'S001','2023-04-10',1 'S001','2023-04-11',2 'S001','2023-04-13',3 'S002','2023-04-12',4 'S002','2023-04-14',5 'S003','2023-04-16',6
1.1 聚合窗口函数
常见的聚合函数有:sum(),count(),max(),min(),avg()
以sum() 函数为例
- 执行
代码1
select user_id, trade_date, trade_num, sum(trade_num) over(partition by user_id order by trade_date) as s_num from trade;
执行结果如下:
- 执行
代码2
select user_id, trade_date, trade_num, sum(trade_num) over(partition by user_id) as s_num from trade;
执行结果如下:
- 执行
代码3
select user_id, trade_date, trade_num, sum(trade_num) over(order by trade_date) as s_num from trade;
执行结果如下:
- 执行
代码4
select user_id, trade_date, trade_num, sum(trade_num) over(rows between unbounded preceding and current row) as s_num from trade;
执行结果如下:
- 执行
代码5
select user_id, trade_date, trade_num, sum(trade_num) over() as s_num from trade;
执行结果如下:
从上面几个SQL的执行结果来看:
代码1
和代码2
中都有partition by
,根据user_id
进行分组,都是在组内对trade_num
进行sum
聚合;不同的是,加上order by
:指定的默认数据范围为 从起点到当前行(即rows between unbounded preceding and current row
),不加order by
:指定的数据范围为 从起点到终点组内所有数据(即rows between unbounded preceding and unbounded following
)。代码3
的over
中单独使用order by
,此时就只有一个分组(包含全部数据),根据trade_date
进行排序,排序之后才在窗口内进行sum
聚合,指定的默认数据范围为 从起点到终点组内所有数据(即rows between unbounded preceding and unbounded following
),当然,也可以自己指定数据范围。代码3
的over
中单独使用rows between
,此时也是只有一个分组(包含全部数据),自己指定数据范围。代码5
的over
中既没有partition by
也没有order by
,此时也是只有一个分组(包含全部数据),指定的默认数据范围为从起点到终点组内所有数据(即rows between unbounded preceding and unbounded following
),当然,也可以自己指定数据范围。
count(),max(),min(),avg() 和 sum() 用法一样,这里就不展开讲了。
1.2 分析窗口函数
分析函数包含有排序函数和占比函数:row_number() ,dense_rank() ,rank() ,percent_rank() ,cume_dist()
- row_number()
从1开始,按照顺序,生成分组内记录的序列。
执行代码如下:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_num desc) as rn from trade;
执行结果如下:
- dense_rank()
生成数据项在分组中的排名,排名相等时名次是连续的。
执行如下代码:
select user_id, trade_date, trade_num, dense_rank() over(partition by user_id order by trade_num desc) as rn from trade;
执行结果如下:
- rank()
生成数据项在分组中的排名,排名相等时名次可能是不连续的。
执行如下代码:
select user_id, trade_date, trade_num, rank() over(partition by user_id order by trade_num desc) as rn from trade;
执行结果如下:
为了对比上述三个排名函数的不同,再插入一条数据:
insert into trade values('S001','2023-04-12',2);
执行如下代码:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_num desc) as rn, dense_rank() over(partition by user_id order by trade_num desc) as rn1, rank() over(partition by user_id order by trade_num desc) as rn2 from trade;
执行结果如下:
- percent_rank() ,cume_dist()
percent_rank() :累计百分比
cume_dist():累计分布值
执行如下代码:
select user_id, trade_date, trade_num, percent_rank() over(partition by user_id order by trade_num desc) as rp, cume_dist() over(partition by user_id order by trade_num desc) as cd from trade;
执行结果如下:
1.3 取值窗口函数
取值函数有:lag(),lead(),first_value(),last_value()
- lag()
语法:
LAG(col,n,DEFAULT)
用于统计窗口内往上第n行值,第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
执行如下代码:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_date desc) as rn, lag(trade_date) over(partition by user_id order by trade_date desc) as td, lag(trade_date, 1 , '2020-01-01') over(partition by user_id order by trade_date desc) as td1, lag(trade_date, 2) over(partition by user_id order by trade_date desc) as td2 from trade;
执行结果如下:
从上述结果可以看出:
td: n的默认值为1,不指定DEFAULT的默认值为NULL td1: 指定了往上1行的值,指定DEFAULT为'2020-01-01' S001 第一行,往上1行是NULL,取指定默认值为'2020-01-01' S001 第二行,往上1行是第一行,取值为'2023-04-13' ... td2: 指定了往上2行的值,没指定DEFAULT为NULL S001 第一行,往上2行是NULL S001 第二行,往上2行是NULL S001 第三行,往上2行是第一行,取值为'2023-04-13' S001 第四行,往上2行是第二行,取值为'2023-04-12'
- lead()
语法:
LEAD(col,n,DEFAULT)
用于统计窗口内往下第n行值,第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
执行如下代码:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_date desc) as rn, lead(trade_date) over(partition by user_id order by trade_date desc) as td, lead(trade_date, 1 , '2020-01-01') over(partition by user_id order by trade_date desc) as td1, lead(trade_date, 2) over(partition by user_id order by trade_date desc) as td2 from trade;
执行结果如下:
- first_value()
取分组内排序后,截止到当前行,第一个值
执行如下代码:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_date desc) as rn, first_value(trade_num) over(partition by user_id order by trade_date desc) as fd from trade;
- last_value()
取分组内排序后,截止到当前行,最后一个值
执行如下代码:
select user_id, trade_date, trade_num, row_number() over(partition by user_id order by trade_date desc) as rn, last_value(trade_num) over(partition by user_id order by trade_date desc) as ld from trade;
执行结果如下:
2. 综合案例分析
2.1 案例1:连续出现的数字
需求:编写一个 SQL 查询,查找所有连续出现至少三次的数字。
假如有一个test表,数据如下:
+----------+-----------+ | test.id | test.num | +----------+-----------+ | 1 | 6 | | 2 | 6 | | 3 | 6 | | 4 | 7 | | 5 | 6 | | 6 | 8 | | 7 | 8 | +----------+-----------+
从上述数据中可以看出,6 是连续出现至少三次的数字。
实现步骤:
- 把下面两行的数字放到一行记录中
select num, lead(num,1,null) over(order by id) as ld1, lead(num,2,null) over(order by id) as ld2 from test;
- 判断数字是否相等,并去重
select distinct num from (select num, lead(num,1,null) over(order by id) as ld1, lead(num,2,null) over(order by id) as ld2 from test) t1 where t1.num = t1.ld1 and t1.num = t1.ld2;
2.2 案例2:连续3天交易的用户
需求:编写一个 SQL 查询,查找交易流水表 trade
中所有连续3天交易的用户。
实现步骤:
- 去重:由于每个人可能一天可能不止交易一次,需要去重
select distinct user_id, trade_date from trade;
- 排序:对每个用户ID的交易日期排序
select user_id, trade_date, row_number() over(partition by user_id order by trade_date) as rn from (select distinct user_id, trade_date from trade) t1;
- 差值:计算交易日期与排序之间的差值,找到连续交易的记录
select user_id, trade_date, date_sub(trade_date, row_number() over(partition by user_id order by trade_date)) as diff from (select distinct user_id, trade_date from trade) t1;
- 连续交易天数计算:select user_id, count(*) group by id, 差值(伪代码)
select user_id, diff, count(*) as days from (select user_id, trade_date, date_sub(trade_date, row_number() over(partition by user_id order by trade_date)) as diff from (select distinct user_id, trade_date from trade) t1) t2 group by user_id,diff;
- 取出连续交易3天以及以上的记录
select user_id from (select user_id,diff,count(*) as days from (select user_id, trade_date, date_sub(trade_date, row_number() over(partition by user_id order by trade_date)) as diff from (select distinct user_id, trade_date from trade) t1) t2 group by user_id,diff) t3 where days >= 3;
步骤4和5进行合并的写法:
select user_id, diff from (select user_id, trade_date, date_sub(trade_date, row_number() over(partition by user_id order by trade_date)) as diff from (select distinct user_id, trade_date from trade) t1) t2 group by user_id,diff having count(*) >= 3;
总结
主要对Hive窗口函数进行详解,剖析各种窗口函数的使用方法,并且根据窗口函数综合案例进行分析,都是工作常用、面试必问的非常经典的例子。