数据分析之SQL窗口函数

简介: 会使用SQL语言的小伙伴在开发或者分析过程都少不了使用函数,企业面试中,更是钟情于分析函数的问题,笔试、面试都基本跑不了。

1.概述

SQL中的窗口函数(Hive/Maxcomputer)在我的见解下主要分为5类:聚合函数、排名函数、数学函数、分布函数、偏移函数

2.函数介绍

消费表tb

user date amt
u01 2022-07-14 100
u01 2022-10-03 320
u01 2022-01-23 150
u01 2022-02-06 200
u01 2022-02-06 230
u01

2022-08-21

200

u01

2022-09-14

180

u01

2022-07-02

350

u01

2022-10-26

430

u01

2022-05-23

290

u02 2022-02-01 150
u02 2022-06-04 100
u02 2022-01-23 300
u02 2022-03-21 500
u02

2022-04-25

200

u02

2022-04-25

500

u02

2022-06-10

160

u02

2022-08-20

320

u02

2022-09-25

400

u02

2022-10-24

560

创建临时表tb

with tb as(select*fromvalues('u01','2022-07-14','100'),('u01','2022-10-03','320'),('u01','2022-01-23','150'),('u01','2022-02-06','200'),('u01','2022-02-06','230'),('u01','2022-08-21','200'),('u01','2022-09-14','180'),('u01','2022-07-02','350'),('u01','2022-10-26','430'),('u01','2022-05-23','290'),('u02','2022-02-01','150'),('u02','2022-06-04','100'),('u02','2022-01-23','300'),('u02','2022-03-21','500'),('u02','2022-04-25','200'),('u02','2022-04-25','500'),('u02','2022-06-10','160'),('u02','2022-08-20','320'),('u02','2022-09-25','400'),('u02','2022-10-24','560')             t(user,date,amt))

2.1聚合函数

常用的聚合窗口函数如:sum、count、max、min,avg

sum:组内表达式累积值或聚合值,区别在于是否使用了排序order by

--无组内排序order by,则组内sum值一样select  user,date,  amt,  sum(amt) over(partition by user) amts
from tb;user  date      amt amts
u01 2022-10-033202450.0u01 2022-01-231502450.0u01 2022-02-062002450.0u01 2022-02-062302450.0u01 2022-08-212002450.0u01 2022-09-141802450.0u01 2022-07-023502450.0u01 2022-10-264302450.0u01 2022-05-232902450.0u01 2022-07-141002450.0u02 2022-10-245603190.0u02 2022-02-011503190.0u02 2022-06-041003190.0u02 2022-01-233003190.0u02 2022-03-215003190.0u02 2022-04-252003190.0u02 2022-04-255003190.0u02 2022-06-101603190.0u02 2022-08-203203190.0u02 2022-09-254003190.0--有组内排序order by,则组内sum值为按排序累加值select  user,date,  amt,  sum(amt) over(partition by user orderbydate) amts
from tb;user  date      amt amts
u01 2022-01-23150150.0u01 2022-02-06200350.0u01 2022-02-06230580.0u01 2022-05-23290870.0u01 2022-07-023501220.0u01 2022-07-141001320.0u01 2022-08-212001520.0u01 2022-09-141801700.0u01 2022-10-033202020.0u01 2022-10-264302450.0u02 2022-01-23300300.0u02 2022-02-01150450.0u02 2022-03-21500950.0u02 2022-04-255001450.0u02 2022-04-252001650.0u02 2022-06-041001750.0u02 2022-06-101601910.0u02 2022-08-203202230.0u02 2022-09-254002630.0u02 2022-10-245603190.0

count:组内表达式累积计数或聚合计数,区别在于是否使用了排序order by,案例如上所示,这里暂不展示order by

select  user,date,  amt,count(*) over(partition by user) cnt
from tb;user  date      amt cnt
u01 2022-10-0332010u01 2022-01-2315010u01 2022-02-0620010u01 2022-02-0623010u01 2022-08-2120010u01 2022-09-1418010u01 2022-07-0235010u01 2022-10-2643010u01 2022-05-2329010u01 2022-07-1410010u02 2022-10-2456010u02 2022-02-0115010u02 2022-06-0410010u02 2022-01-2330010u02 2022-03-2150010u02 2022-04-2520010u02 2022-04-2550010u02 2022-06-1016010u02 2022-08-2032010u02 2022-09-2540010

max:组内表达式最大值

select  user,date,  amt,  max(amt) over(partition by user) max_amt
from tb;user  date      amt max_amt
u01 2022-10-03320430u01 2022-01-23150430u01 2022-02-06200430u01 2022-02-06230430u01 2022-08-21200430u01 2022-09-14180430u01 2022-07-02350430u01 2022-10-26430430u01 2022-05-23290430u01 2022-07-14100430u02 2022-10-24560560u02 2022-02-01150560u02 2022-06-04100560u02 2022-01-23300560u02 2022-03-21500560u02 2022-04-25200560u02 2022-04-25500560u02 2022-06-10160560u02 2022-08-20320560u02 2022-09-25400560

min:组内表达式最小值

select  user,date,  amt,  min(amt) over(partition by user) min_amt
from tb;user  date      amt min_amt
u01 2022-10-03320100u01 2022-01-23150100u01 2022-02-06200100u01 2022-02-06230100u01 2022-08-21200100u01 2022-09-14180100u01 2022-07-02350100u01 2022-10-26430100u01 2022-05-23290100u01 2022-07-14100100u02 2022-10-24560100u02 2022-02-01150100u02 2022-06-04100100u02 2022-01-23300100u02 2022-03-21500100u02 2022-04-25200100u02 2022-04-25500100u02 2022-06-10160100u02 2022-08-20320100u02 2022-09-25400100

avg:组内表达式平均值

select  user,date,  amt,  avg(amt) over(partition by user) avg_amt
from tb;user  date      amt avg_amt
u01 2022-10-03320245.0u01 2022-01-23150245.0u01 2022-02-06200245.0u01 2022-02-06230245.0u01 2022-08-21200245.0u01 2022-09-14180245.0u01 2022-07-02350245.0u01 2022-10-26430245.0u01 2022-05-23290245.0u01 2022-07-14100245.0u02 2022-10-24560319.0u02 2022-02-01150319.0u02 2022-06-04100319.0u02 2022-01-23300319.0u02 2022-03-21500319.0u02 2022-04-25200319.0u02 2022-04-25500319.0u02 2022-06-10160319.0u02 2022-08-20320319.0u02 2022-09-25400319.0

2.2排名函数

row_number计算当前行在分区中的行号,从1开始递增

select  user,date,  amt,  row_number() over(partition by user orderbydate) rn
from tb;user  date      amt rn
u01 2022-01-231501u01 2022-02-062002u01 2022-02-062303u01 2022-05-232904u01 2022-07-023505u01 2022-07-141006u01 2022-08-212007u01 2022-09-141808u01 2022-10-033209u01 2022-10-2643010u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-255004u02 2022-04-252005u02 2022-06-041006u02 2022-06-101607u02 2022-08-203208u02 2022-09-254009u02 2022-10-2456010

rank:计算排名,排名不连续(1,2,2,4)

select  user,date,  amt,  rank() over(partition by user orderbydate) rk
from tb;user  date      amt rk
u01 2022-01-231501u01 2022-02-062002u01 2022-02-062302u01 2022-05-232904u01 2022-07-023505u01 2022-07-141006u01 2022-08-212007u01 2022-09-141808u01 2022-10-033209u01 2022-10-2643010u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-252004u02 2022-04-255004u02 2022-06-041006u02 2022-06-101607u02 2022-08-203208u02 2022-09-254009u02 2022-10-2456010

dense_rank:计算排名,排名连续(1,2,2,3)

select  user,date,  amt,  dense_rank() over(partition by user orderbydate) dk
from tb;user  date      amt dk
u01 2022-01-231501u01 2022-02-062002u01 2022-02-062302u01 2022-05-232903u01 2022-07-023504u01 2022-07-141005u01 2022-08-212006u01 2022-09-141807u01 2022-10-033208u01 2022-10-264309u02 2022-01-233001u02 2022-02-011502u02 2022-03-215003u02 2022-04-252004u02 2022-04-255004u02 2022-06-041005u02 2022-06-101606u02 2022-08-203207u02 2022-09-254008u02 2022-10-245609

2.3数学函数

stddev计算总体标准差,STDDEV_POP函数的别名

select  user,date,  amt,  stddev(amt) over(partition by user) std
from tb;user  date      amt std
u01 2022-10-0332095.62949335848225u01 2022-01-2315095.62949335848225u01 2022-02-0620095.62949335848225u01 2022-02-0623095.62949335848225u01 2022-08-2120095.62949335848225u01 2022-09-1418095.62949335848225u01 2022-07-0235095.62949335848225u01 2022-10-2643095.62949335848225u01 2022-05-2329095.62949335848225u01 2022-07-1410095.62949335848225u02 2022-10-24560157.0u02 2022-02-01150157.0u02 2022-06-04100157.0u02 2022-01-23300157.0u02 2022-03-21500157.0u02 2022-04-25200157.0u02 2022-04-25500157.0u02 2022-06-10160157.0u02 2022-08-20320157.0u02 2022-09-25400157.0

stddev_samp计算样本标准差。

select  user,date,  amt,  stddev_samp(amt) over(partition by user) std
from tb;user  date      amt std
u01 2022-10-03320100.80233683358294u01 2022-01-23150100.80233683358294u01 2022-02-06200100.80233683358294u01 2022-02-06230100.80233683358294u01 2022-08-21200100.80233683358294u01 2022-09-14180100.80233683358294u01 2022-07-02350100.80233683358294u01 2022-10-26430100.80233683358294u01 2022-05-23290100.80233683358294u01 2022-07-14100100.80233683358294u02 2022-10-24560165.4925308821452u02 2022-02-01150165.4925308821452u02 2022-06-04100165.4925308821452u02 2022-01-23300165.4925308821452u02 2022-03-21500165.4925308821452u02 2022-04-25200165.4925308821452u02 2022-04-25500165.4925308821452u02 2022-06-10160165.4925308821452u02 2022-08-20320165.4925308821452u02 2022-09-25400165.4925308821452

cluster_sample

用户随机抽样。返回True表示该行数据被抽中。

  • 命令说明
  • cluster_sample(bigint ):表示随机抽取N条数据。
  • cluster_sample(bigint , bigint ):表示按比例(M/N)随机抽取。即抽取partition_row_count×M / N条数据。partition_row_count指分区中的数据行数。
  • 参数说明
  • N:必填。BIGINT类型常量。N为NULL时,返回值为NULL。
  • M:必填。BIGINT类型常量。M为NULL时,返回值为NULL。
  • partition_clause:可选。
select  user,date,  amt,  cluster_sample(3) over(partition by user) cs
from tb;--cs为true的为抽中的数据user  date      amt cs
u01 2022-10-03320trueu01 2022-01-23150falseu01 2022-02-06200falseu01 2022-02-06230falseu01 2022-08-21200falseu01 2022-09-14180trueu01 2022-07-02350falseu01 2022-10-26430falseu01 2022-05-23290falseu01 2022-07-14100trueu02 2022-10-24560falseu02 2022-02-01150trueu02 2022-06-04100falseu02 2022-01-23300falseu02 2022-03-21500falseu02 2022-04-25200trueu02 2022-04-25500trueu02 2022-06-10160falseu02 2022-08-20320falseu02 2022-09-25400false

2.4分布函数

cume_dist求累计分布,相当于求分区中大于等于当前行的数据在分区中的占比。大小关系由order by判定。

select  user,date,  amt,  concat(round(cume_dist() over (partition by user orderby amt desc)*100,2),'%')as cume_dist
from tb;user  date      amt cume_dist
u01 2022-10-2643010.0%u01 2022-07-0235020.0%u01 2022-10-0332030.0%u01 2022-05-2329040.0%u01 2022-02-0623050.0%u01 2022-02-0620070.0%u01 2022-08-2120070.0%u01 2022-09-1418080.0%u01 2022-01-2315090.0%u01 2022-07-14100100.0%u02 2022-10-2456010.0%u02 2022-03-2150030.0%u02 2022-04-2550030.0%u02 2022-09-2540040.0%u02 2022-08-2032050.0%u02 2022-01-2330060.0%u02 2022-04-2520070.0%u02 2022-06-1016080.0%u02 2022-02-0115090.0%u02 2022-06-04100100.0%

ntile用于将分区中的数据按照顺序切分成N等份,并返回数据所在等份的编号。如果分区中的数据不能被均匀地切分成N等份时,最前面的等份(编号较小的)会优先多分配1条数据。

select  user,date,  amt,  ntile(4) over(partition by user orderby amt) nt 
from tb;user  date      amt nt
u01 2022-07-141001u01 2022-01-231501u01 2022-09-141801u01 2022-02-062002u01 2022-08-212002u01 2022-02-062302u01 2022-05-232903u01 2022-10-033203u01 2022-07-023504u01 2022-10-264304u02 2022-06-041001u02 2022-02-011501u02 2022-06-101601u02 2022-04-252002u02 2022-01-233002u02 2022-08-203202u02 2022-09-254003u02 2022-03-215003u02 2022-04-255004u02 2022-10-245604

percent_rank计算当前行在分区中按照order by排序后的百分比排名。

select  user,date,  amt,  percent_rank() over(partition by user orderby amt desc) pr
from tb;user  date      amt pr
u01 2022-10-264300.0u01 2022-07-023500.1111111111111111u01 2022-10-033200.2222222222222222u01 2022-05-232900.3333333333333333u01 2022-02-062300.4444444444444444u01 2022-02-062000.5555555555555556u01 2022-08-212000.5555555555555556u01 2022-09-141800.7777777777777778u01 2022-01-231500.8888888888888888u01 2022-07-141001.0u02 2022-10-245600.0u02 2022-03-215000.1111111111111111u02 2022-04-255000.1111111111111111u02 2022-09-254000.3333333333333333u02 2022-08-203200.4444444444444444u02 2022-01-233000.5555555555555556u02 2022-04-252000.6666666666666666u02 2022-06-101600.7777777777777778u02 2022-02-011500.8888888888888888u02 2022-06-041001.0

median计算窗口中的中位数。

select  user,date,  amt,  median(amt) over(partition by user) md
from tb;user  date      amt md
u01 2022-10-03320215.0u01 2022-01-23150215.0u01 2022-02-06200215.0u01 2022-02-06230215.0u01 2022-08-21200215.0u01 2022-09-14180215.0u01 2022-07-02350215.0u01 2022-10-26430215.0u01 2022-05-23290215.0u01 2022-07-14100215.0u02 2022-10-24560310.0u02 2022-02-01150310.0u02 2022-06-04100310.0u02 2022-01-23300310.0u02 2022-03-21500310.0u02 2022-04-25200310.0u02 2022-04-25500310.0u02 2022-06-10160310.0u02 2022-08-20320310.0u02 2022-09-25400310.0

2.5偏移函数

lag:返回往前第N行的数据

select  user,date,  amt,  lag(amt,1,amt) over(partition by user orderbydate) lag_amt
from tb;user  date      amt lag_amt
u01 2022-01-23150150u01 2022-02-06200150u01 2022-02-06230200u01 2022-05-23290230u01 2022-07-02350290u01 2022-07-14100350u01 2022-08-21200100u01 2022-09-14180200u01 2022-10-03320180u01 2022-10-26430320u02 2022-01-23300300u02 2022-02-01150300u02 2022-03-21500150u02 2022-04-25200500u02 2022-04-25500200u02 2022-06-04100500u02 2022-06-10160100u02 2022-08-20320160u02 2022-09-25400320u02 2022-10-24560400

lead:返回往前第N行的数据

select  user,date,  amt,  lead(amt,1,amt) over(partition by user orderbydate) lead_amt
from tb;user  date      amt lead_amt
u01 2022-01-23150200u01 2022-02-06200230u01 2022-02-06230290u01 2022-05-23290350u01 2022-07-02350100u01 2022-07-14100200u01 2022-08-21200180u01 2022-09-14180320u01 2022-10-03320430u01 2022-10-26430430u02 2022-01-23300150u02 2022-02-01150500u02 2022-03-21500200u02 2022-04-25200500u02 2022-04-25500100u02 2022-06-04100160u02 2022-06-10160320u02 2022-08-20320400u02 2022-09-25400560u02 2022-10-24560560

first_value返回表达式在窗口的第一条数据上进行运算的结果。

select  user,date,  amt,  first_value(amt) over(partition by user orderbydate) fir
from tb;user  date      amt fir
u01 2022-01-23150150u01 2022-02-06200150u01 2022-02-06230150u01 2022-05-23290150u01 2022-07-02350150u01 2022-07-14100150u01 2022-08-21200150u01 2022-09-14180150u01 2022-10-03320150u01 2022-10-26430150u02 2022-01-23300300u02 2022-02-01150300u02 2022-03-21500300u02 2022-04-25200300u02 2022-04-25500300u02 2022-06-04100300u02 2022-06-10160300u02 2022-08-20320300u02 2022-09-25400300u02 2022-10-24560300

last_value返回表达式在窗口的最后一条数据上进行运算的结果。

select  user,date,  amt,  last_value(amt) over(partition by user orderbydate) las
from tb;user  date      amt fir
u01 2022-01-23150150u01 2022-02-06200230u01 2022-02-06230230u01 2022-05-23290290u01 2022-07-02350350u01 2022-07-14100100u01 2022-08-21200200u01 2022-09-14180180u01 2022-10-03320320u01 2022-10-26430430u02 2022-01-23300300u02 2022-02-01150150u02 2022-03-21500500u02 2022-04-25200500u02 2022-04-25500500u02 2022-06-04100100u02 2022-06-10160160u02 2022-08-20320320u02 2022-09-25400400u02 2022-10-24560560

查看结果发现好像并不是组内最后一条数据,因为窗口指定了order by,最后一行数据是截止到当前行的最后一条数据,怎么才能拿到最后一条呢,我们可以加窗口子句:rows between unbounded preceding and unbounded following

select  user,date,  amt,  last_value(amt) over(partition by user orderbydate rows between unbounded preceding and unbounded following) las
from tb;user  date      amt las
u01 2022-01-23150430u01 2022-02-06200430u01 2022-02-06230430u01 2022-05-23290430u01 2022-07-02350430u01 2022-07-14100430u01 2022-08-21200430u01 2022-09-14180430u01 2022-10-03320430u01 2022-10-26430430u02 2022-01-23300560u02 2022-02-01150560u02 2022-03-21500560u02 2022-04-25200560u02 2022-04-25500560u02 2022-06-04100560u02 2022-06-10160560u02 2022-08-20320560u02 2022-09-25400560u02 2022-10-24560560

nth_value返回表达式在窗口的第N条数据进行运算的结果。

select  user,date,  amt,  nth_value(amt,6) over(partition by user) nv
from tb;user  date      amt nv
u01 2022-10-03320180u01 2022-01-23150180u01 2022-02-06200180u01 2022-02-06230180u01 2022-08-21200180u01 2022-09-14180180u01 2022-07-02350180u01 2022-10-26430180u01 2022-05-23290180u01 2022-07-14100180u02 2022-10-24560200u02 2022-02-01150200u02 2022-06-04100200u02 2022-01-23300200u02 2022-03-21500200u02 2022-04-25200200u02 2022-04-25500200u02 2022-06-10160200u02 2022-08-20320200u02 2022-09-25400200

3.总结

分析函数还有很多,常用的基本就这些函数,灵活使用窗口分析函数,能有效提升我们开发效率。小伙伴们快用起来吧。

拜了个拜

目录
相关文章
|
1月前
|
SQL 自然语言处理 数据挖掘
大模型与数据分析:探索Text-to-SQL(上)
大模型与数据分析:探索Text-to-SQL(上)
105 0
|
1月前
|
SQL 自然语言处理 数据挖掘
大模型与数据分析:探索Text-to-SQL(中)
大模型与数据分析:探索Text-to-SQL(中)
88 0
|
1月前
|
SQL 存储 数据挖掘
大模型与数据分析:探索Text-to-SQL(下)
大模型与数据分析:探索Text-to-SQL(下)
61 3
|
5月前
|
SQL 分布式计算 数据可视化
Spark SQL案例【电商购买数据分析】
Spark SQL案例【电商购买数据分析】
|
5月前
|
SQL 分布式计算 数据挖掘
Spark_Day07:Spark SQL(DataFrame是什么和数据分析(案例讲解))
Spark_Day07:Spark SQL(DataFrame是什么和数据分析(案例讲解))
82 0
|
3月前
|
SQL Java API
Flink SQL 问题之窗口函数报错如何解决
Flink SQL报错通常指在使用Apache Flink的SQL接口执行数据处理任务时遇到的问题;本合集将收集常见的Flink SQL报错情况及其解决方法,帮助用户迅速恢复数据处理流程。
23 1
|
3月前
|
SQL 关系型数据库 MySQL
玩转SQL窗口函数
玩转SQL窗口函数
31 0
|
4月前
|
SQL 数据挖掘 数据库
SQL数据分析实战:从导入到高级查询的完整指南
SQL数据分析实战:从导入到高级查询的完整指南
64 0
|
4月前
|
SQL 数据采集 分布式计算
Spark SQL中的聚合与窗口函数
Spark SQL中的聚合与窗口函数
|
5月前
|
SQL 分布式计算 数据挖掘
面试官嫌我Sql写的太low?要求我重写还加了三个需求?——二战Spark电影评分数据分析
面试官嫌我Sql写的太low?要求我重写还加了三个需求?——二战Spark电影评分数据分析
49 0
面试官嫌我Sql写的太low?要求我重写还加了三个需求?——二战Spark电影评分数据分析