1.概述
SQL中的窗口函数(Hive/Maxcomputer)在我的见解下主要分为5类:聚合函数、排名函数、数学函数、分布函数、偏移函数
2.函数介绍
消费表tb |
||
user | date | amt |
u01 | 2022-07-14 | 100 |
u01 | 2022-10-03 | 320 |
u01 | 2022-01-23 | 150 |
u01 | 2022-02-06 | 200 |
u01 | 2022-02-06 | 230 |
u01 | 2022-08-21 |
200 |
u01 | 2022-09-14 |
180 |
u01 | 2022-07-02 |
350 |
u01 | 2022-10-26 |
430 |
u01 | 2022-05-23 |
290 |
u02 | 2022-02-01 | 150 |
u02 | 2022-06-04 | 100 |
u02 | 2022-01-23 | 300 |
u02 | 2022-03-21 | 500 |
u02 | 2022-04-25 |
200 |
u02 | 2022-04-25 |
500 |
u02 | 2022-06-10 |
160 |
u02 | 2022-08-20 |
320 |
u02 | 2022-09-25 |
400 |
u02 | 2022-10-24 |
560 |
创建临时表tb
with tb as(select*fromvalues('u01','2022-07-14','100'),('u01','2022-10-03','320'),('u01','2022-01-23','150'),('u01','2022-02-06','200'),('u01','2022-02-06','230'),('u01','2022-08-21','200'),('u01','2022-09-14','180'),('u01','2022-07-02','350'),('u01','2022-10-26','430'),('u01','2022-05-23','290'),('u02','2022-02-01','150'),('u02','2022-06-04','100'),('u02','2022-01-23','300'),('u02','2022-03-21','500'),('u02','2022-04-25','200'),('u02','2022-04-25','500'),('u02','2022-06-10','160'),('u02','2022-08-20','320'),('u02','2022-09-25','400'),('u02','2022-10-24','560') t(user,date,amt))
2.1聚合函数
常用的聚合窗口函数如:sum、count、max、min,avg
sum:组内表达式累积值或聚合值,区别在于是否使用了排序order by
--无组内排序order by,则组内sum值一样select user,date, amt, sum(amt) over(partition by user) amts from tb;user date amt amts u01 2022-10-033202450.0u01 2022-01-231502450.0u01 2022-02-062002450.0u01 2022-02-062302450.0u01 2022-08-212002450.0u01 2022-09-141802450.0u01 2022-07-023502450.0u01 2022-10-264302450.0u01 2022-05-232902450.0u01 2022-07-141002450.0u02 2022-10-245603190.0u02 2022-02-011503190.0u02 2022-06-041003190.0u02 2022-01-233003190.0u02 2022-03-215003190.0u02 2022-04-252003190.0u02 2022-04-255003190.0u02 2022-06-101603190.0u02 2022-08-203203190.0u02 2022-09-254003190.0--有组内排序order by,则组内sum值为按排序累加值select user,date, amt, sum(amt) over(partition by user orderbydate) amts from tb;user date amt amts u01 2022-01-23150150.0u01 2022-02-06200350.0u01 2022-02-06230580.0u01 2022-05-23290870.0u01 2022-07-023501220.0u01 2022-07-141001320.0u01 2022-08-212001520.0u01 2022-09-141801700.0u01 2022-10-033202020.0u01 2022-10-264302450.0u02 2022-01-23300300.0u02 2022-02-01150450.0u02 2022-03-21500950.0u02 2022-04-255001450.0u02 2022-04-252001650.0u02 2022-06-041001750.0u02 2022-06-101601910.0u02 2022-08-203202230.0u02 2022-09-254002630.0u02 2022-10-245603190.0
count:组内表达式累积计数或聚合计数,区别在于是否使用了排序order by,案例如上所示,这里暂不展示order by
select user,date, amt,count(*) over(partition by user) cnt from tb;user date amt cnt u01 2022-10-0332010u01 2022-01-2315010u01 2022-02-0620010u01 2022-02-0623010u01 2022-08-2120010u01 2022-09-1418010u01 2022-07-0235010u01 2022-10-2643010u01 2022-05-2329010u01 2022-07-1410010u02 2022-10-2456010u02 2022-02-0115010u02 2022-06-0410010u02 2022-01-2330010u02 2022-03-2150010u02 2022-04-2520010u02 2022-04-2550010u02 2022-06-1016010u02 2022-08-2032010u02 2022-09-2540010
max:组内表达式最大值
select user,date, amt, max(amt) over(partition by user) max_amt from tb;user date amt max_amt u01 2022-10-03320430u01 2022-01-23150430u01 2022-02-06200430u01 2022-02-06230430u01 2022-08-21200430u01 2022-09-14180430u01 2022-07-02350430u01 2022-10-26430430u01 2022-05-23290430u01 2022-07-14100430u02 2022-10-24560560u02 2022-02-01150560u02 2022-06-04100560u02 2022-01-23300560u02 2022-03-21500560u02 2022-04-25200560u02 2022-04-25500560u02 2022-06-10160560u02 2022-08-20320560u02 2022-09-25400560
min:组内表达式最小值
select user,date, amt, min(amt) over(partition by user) min_amt from tb;user date amt min_amt u01 2022-10-03320100u01 2022-01-23150100u01 2022-02-06200100u01 2022-02-06230100u01 2022-08-21200100u01 2022-09-14180100u01 2022-07-02350100u01 2022-10-26430100u01 2022-05-23290100u01 2022-07-14100100u02 2022-10-24560100u02 2022-02-01150100u02 2022-06-04100100u02 2022-01-23300100u02 2022-03-21500100u02 2022-04-25200100u02 2022-04-25500100u02 2022-06-10160100u02 2022-08-20320100u02 2022-09-25400100
avg:组内表达式平均值
select user,date, amt, avg(amt) over(partition by user) avg_amt from tb;user date amt avg_amt u01 2022-10-03320245.0u01 2022-01-23150245.0u01 2022-02-06200245.0u01 2022-02-06230245.0u01 2022-08-21200245.0u01 2022-09-14180245.0u01 2022-07-02350245.0u01 2022-10-26430245.0u01 2022-05-23290245.0u01 2022-07-14100245.0u02 2022-10-24560319.0u02 2022-02-01150319.0u02 2022-06-04100319.0u02 2022-01-23300319.0u02 2022-03-21500319.0u02 2022-04-25200319.0u02 2022-04-25500319.0u02 2022-06-10160319.0u02 2022-08-20320319.0u02 2022-09-25400319.0
2.2排名函数
row_number:计算当前行在分区中的行号,从1开始递增
select user,date, amt, row_number() over(partition by user orderbydate) rn from tb;user date amt rn u01 2022-01-231501u01 2022-02-062002u01 2022-02-062303u01 2022-05-232904u01 2022-07-023505u01 2022-07-141006u01 2022-08-212007u01 2022-09-141808u01 2022-10-033209u01 2022-10-2643010u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-255004u02 2022-04-252005u02 2022-06-041006u02 2022-06-101607u02 2022-08-203208u02 2022-09-254009u02 2022-10-2456010
rank:计算排名,排名不连续(1,2,2,4)
select user,date, amt, rank() over(partition by user orderbydate) rk from tb;user date amt rk u01 2022-01-231501u01 2022-02-062002u01 2022-02-062302u01 2022-05-232904u01 2022-07-023505u01 2022-07-141006u01 2022-08-212007u01 2022-09-141808u01 2022-10-033209u01 2022-10-2643010u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-252004u02 2022-04-255004u02 2022-06-041006u02 2022-06-101607u02 2022-08-203208u02 2022-09-254009u02 2022-10-2456010
dense_rank:计算排名,排名连续(1,2,2,3)
select user,date, amt, dense_rank() over(partition by user orderbydate) dk from tb;user date amt dk u01 2022-01-231501u01 2022-02-062002u01 2022-02-062302u01 2022-05-232903u01 2022-07-023504u01 2022-07-141005u01 2022-08-212006u01 2022-09-141807u01 2022-10-033208u01 2022-10-264309u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-252004u02 2022-04-255004u02 2022-06-041005u02 2022-06-101606u02 2022-08-203207u02 2022-09-254008u02 2022-10-245609
2.3数学函数
stddev:计算总体标准差,STDDEV_POP函数的别名
select user,date, amt, stddev(amt) over(partition by user) std from tb;user date amt std u01 2022-10-0332095.62949335848225u01 2022-01-2315095.62949335848225u01 2022-02-0620095.62949335848225u01 2022-02-0623095.62949335848225u01 2022-08-2120095.62949335848225u01 2022-09-1418095.62949335848225u01 2022-07-0235095.62949335848225u01 2022-10-2643095.62949335848225u01 2022-05-2329095.62949335848225u01 2022-07-1410095.62949335848225u02 2022-10-24560157.0u02 2022-02-01150157.0u02 2022-06-04100157.0u02 2022-01-23300157.0u02 2022-03-21500157.0u02 2022-04-25200157.0u02 2022-04-25500157.0u02 2022-06-10160157.0u02 2022-08-20320157.0u02 2022-09-25400157.0
stddev_samp:计算样本标准差。
select user,date, amt, stddev_samp(amt) over(partition by user) std from tb;user date amt std u01 2022-10-03320100.80233683358294u01 2022-01-23150100.80233683358294u01 2022-02-06200100.80233683358294u01 2022-02-06230100.80233683358294u01 2022-08-21200100.80233683358294u01 2022-09-14180100.80233683358294u01 2022-07-02350100.80233683358294u01 2022-10-26430100.80233683358294u01 2022-05-23290100.80233683358294u01 2022-07-14100100.80233683358294u02 2022-10-24560165.4925308821452u02 2022-02-01150165.4925308821452u02 2022-06-04100165.4925308821452u02 2022-01-23300165.4925308821452u02 2022-03-21500165.4925308821452u02 2022-04-25200165.4925308821452u02 2022-04-25500165.4925308821452u02 2022-06-10160165.4925308821452u02 2022-08-20320165.4925308821452u02 2022-09-25400165.4925308821452
cluster_sample
用户随机抽样。返回True表示该行数据被抽中。
- 命令说明
cluster_sample(bigint )
:表示随机抽取N条数据。cluster_sample(bigint , bigint )
:表示按比例(M/N)随机抽取。即抽取partition_row_count×M / N
条数据。partition_row_count
指分区中的数据行数。
- 参数说明
- N:必填。BIGINT类型常量。N为NULL时,返回值为NULL。
- M:必填。BIGINT类型常量。M为NULL时,返回值为NULL。
- partition_clause:可选。
select user,date, amt, cluster_sample(3) over(partition by user) cs from tb;--cs为true的为抽中的数据user date amt cs u01 2022-10-03320trueu01 2022-01-23150falseu01 2022-02-06200falseu01 2022-02-06230falseu01 2022-08-21200falseu01 2022-09-14180trueu01 2022-07-02350falseu01 2022-10-26430falseu01 2022-05-23290falseu01 2022-07-14100trueu02 2022-10-24560falseu02 2022-02-01150trueu02 2022-06-04100falseu02 2022-01-23300falseu02 2022-03-21500falseu02 2022-04-25200trueu02 2022-04-25500trueu02 2022-06-10160falseu02 2022-08-20320falseu02 2022-09-25400false
2.4分布函数
cume_dist:求累计分布,相当于求分区中大于等于当前行的数据在分区中的占比。大小关系由order by判定。
select user,date, amt, concat(round(cume_dist() over (partition by user orderby amt desc)*100,2),'%')as cume_dist from tb;user date amt cume_dist u01 2022-10-2643010.0%u01 2022-07-0235020.0%u01 2022-10-0332030.0%u01 2022-05-2329040.0%u01 2022-02-0623050.0%u01 2022-02-0620070.0%u01 2022-08-2120070.0%u01 2022-09-1418080.0%u01 2022-01-2315090.0%u01 2022-07-14100100.0%u02 2022-10-2456010.0%u02 2022-03-2150030.0%u02 2022-04-2550030.0%u02 2022-09-2540040.0%u02 2022-08-2032050.0%u02 2022-01-2330060.0%u02 2022-04-2520070.0%u02 2022-06-1016080.0%u02 2022-02-0115090.0%u02 2022-06-04100100.0%
ntile:用于将分区中的数据按照顺序切分成N等份,并返回数据所在等份的编号。如果分区中的数据不能被均匀地切分成N等份时,最前面的等份(编号较小的)会优先多分配1条数据。
select user,date, amt, ntile(4) over(partition by user orderby amt) nt from tb;user date amt nt u01 2022-07-141001u01 2022-01-231501u01 2022-09-141801u01 2022-02-062002u01 2022-08-212002u01 2022-02-062302u01 2022-05-232903u01 2022-10-033203u01 2022-07-023504u01 2022-10-264304u02 2022-06-041001u02 2022-02-011501u02 2022-06-101601u02 2022-04-252002u02 2022-01-233002u02 2022-08-203202u02 2022-09-254003u02 2022-03-215003u02 2022-04-255004u02 2022-10-245604
percent_rank:计算当前行在分区中按照order by排序后的百分比排名。
select user,date, amt, percent_rank() over(partition by user orderby amt desc) pr from tb;user date amt pr u01 2022-10-264300.0u01 2022-07-023500.1111111111111111u01 2022-10-033200.2222222222222222u01 2022-05-232900.3333333333333333u01 2022-02-062300.4444444444444444u01 2022-02-062000.5555555555555556u01 2022-08-212000.5555555555555556u01 2022-09-141800.7777777777777778u01 2022-01-231500.8888888888888888u01 2022-07-141001.0u02 2022-10-245600.0u02 2022-03-215000.1111111111111111u02 2022-04-255000.1111111111111111u02 2022-09-254000.3333333333333333u02 2022-08-203200.4444444444444444u02 2022-01-233000.5555555555555556u02 2022-04-252000.6666666666666666u02 2022-06-101600.7777777777777778u02 2022-02-011500.8888888888888888u02 2022-06-041001.0
median:计算窗口中的中位数。
select user,date, amt, median(amt) over(partition by user) md from tb;user date amt md u01 2022-10-03320215.0u01 2022-01-23150215.0u01 2022-02-06200215.0u01 2022-02-06230215.0u01 2022-08-21200215.0u01 2022-09-14180215.0u01 2022-07-02350215.0u01 2022-10-26430215.0u01 2022-05-23290215.0u01 2022-07-14100215.0u02 2022-10-24560310.0u02 2022-02-01150310.0u02 2022-06-04100310.0u02 2022-01-23300310.0u02 2022-03-21500310.0u02 2022-04-25200310.0u02 2022-04-25500310.0u02 2022-06-10160310.0u02 2022-08-20320310.0u02 2022-09-25400310.0
2.5偏移函数
lag:返回往前第N行的数据
select user,date, amt, lag(amt,1,amt) over(partition by user orderbydate) lag_amt from tb;user date amt lag_amt u01 2022-01-23150150u01 2022-02-06200150u01 2022-02-06230200u01 2022-05-23290230u01 2022-07-02350290u01 2022-07-14100350u01 2022-08-21200100u01 2022-09-14180200u01 2022-10-03320180u01 2022-10-26430320u02 2022-01-23300300u02 2022-02-01150300u02 2022-03-21500150u02 2022-04-25200500u02 2022-04-25500200u02 2022-06-04100500u02 2022-06-10160100u02 2022-08-20320160u02 2022-09-25400320u02 2022-10-24560400
lead:返回往前第N行的数据
select user,date, amt, lead(amt,1,amt) over(partition by user orderbydate) lead_amt from tb;user date amt lead_amt u01 2022-01-23150200u01 2022-02-06200230u01 2022-02-06230290u01 2022-05-23290350u01 2022-07-02350100u01 2022-07-14100200u01 2022-08-21200180u01 2022-09-14180320u01 2022-10-03320430u01 2022-10-26430430u02 2022-01-23300150u02 2022-02-01150500u02 2022-03-21500200u02 2022-04-25200500u02 2022-04-25500100u02 2022-06-04100160u02 2022-06-10160320u02 2022-08-20320400u02 2022-09-25400560u02 2022-10-24560560
first_value:返回表达式在窗口的第一条数据上进行运算的结果。
select user,date, amt, first_value(amt) over(partition by user orderbydate) fir from tb;user date amt fir u01 2022-01-23150150u01 2022-02-06200150u01 2022-02-06230150u01 2022-05-23290150u01 2022-07-02350150u01 2022-07-14100150u01 2022-08-21200150u01 2022-09-14180150u01 2022-10-03320150u01 2022-10-26430150u02 2022-01-23300300u02 2022-02-01150300u02 2022-03-21500300u02 2022-04-25200300u02 2022-04-25500300u02 2022-06-04100300u02 2022-06-10160300u02 2022-08-20320300u02 2022-09-25400300u02 2022-10-24560300
last_value:返回表达式在窗口的最后一条数据上进行运算的结果。
select user,date, amt, last_value(amt) over(partition by user orderbydate) las from tb;user date amt fir u01 2022-01-23150150u01 2022-02-06200230u01 2022-02-06230230u01 2022-05-23290290u01 2022-07-02350350u01 2022-07-14100100u01 2022-08-21200200u01 2022-09-14180180u01 2022-10-03320320u01 2022-10-26430430u02 2022-01-23300300u02 2022-02-01150150u02 2022-03-21500500u02 2022-04-25200500u02 2022-04-25500500u02 2022-06-04100100u02 2022-06-10160160u02 2022-08-20320320u02 2022-09-25400400u02 2022-10-24560560
查看结果发现好像并不是组内最后一条数据,因为窗口指定了order by,最后一行数据是截止到当前行的最后一条数据,怎么才能拿到最后一条呢,我们可以加窗口子句:rows between unbounded preceding and unbounded following
select user,date, amt, last_value(amt) over(partition by user orderbydate rows between unbounded preceding and unbounded following) las from tb;user date amt las u01 2022-01-23150430u01 2022-02-06200430u01 2022-02-06230430u01 2022-05-23290430u01 2022-07-02350430u01 2022-07-14100430u01 2022-08-21200430u01 2022-09-14180430u01 2022-10-03320430u01 2022-10-26430430u02 2022-01-23300560u02 2022-02-01150560u02 2022-03-21500560u02 2022-04-25200560u02 2022-04-25500560u02 2022-06-04100560u02 2022-06-10160560u02 2022-08-20320560u02 2022-09-25400560u02 2022-10-24560560
nth_value:返回表达式在窗口的第N条数据进行运算的结果。
select user,date, amt, nth_value(amt,6) over(partition by user) nv from tb;user date amt nv u01 2022-10-03320180u01 2022-01-23150180u01 2022-02-06200180u01 2022-02-06230180u01 2022-08-21200180u01 2022-09-14180180u01 2022-07-02350180u01 2022-10-26430180u01 2022-05-23290180u01 2022-07-14100180u02 2022-10-24560200u02 2022-02-01150200u02 2022-06-04100200u02 2022-01-23300200u02 2022-03-21500200u02 2022-04-25200200u02 2022-04-25500200u02 2022-06-10160200u02 2022-08-20320200u02 2022-09-25400200
3.总结
分析函数还有很多,常用的基本就这些函数,灵活使用窗口分析函数,能有效提升我们开发效率。小伙伴们快用起来吧。
拜了个拜