- LEAD的使用:
与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值。
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time, LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time FROM user_url;
结果如下:
- FIRST_VALUE的使用:
取分组内排序后,截止到当前行,第一个值。
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 FROM user_url;
结果如下:
- LAST_VALUE的使用:
取分组内排序后,截止到当前行,最后一个值。
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 FROM user_url;
结果如下:
如果想要取分组内排序后最后一个值,则需要变通一下:
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1, FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 FROM user_url ORDER BY cookieid,createtime;
注意上述SQL,使用的是 FIRST_VALUE 的倒序取出分组内排序最后一个值!
结果如下:
此处要特别注意order by
如果不指定ORDER BY,则进行排序混乱,会出现错误的结果
SELECT cookieid, createtime, url, FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2 FROM user_url;
结果如下:
上述 url2 和 url55 的createtime即不属于最靠前的时间也不属于最靠后的时间,所以结果是混乱的。
4. CUME_DIST
先创建一张员工薪水表:staff_salary
CREATE EXTERNAL TABLE staff_salary ( dept string, userid string, sal int );
表中加入如下数据:
d1,user1,1000 d1,user2,2000 d1,user3,3000 d2,user4,4000 d2,user5,5000
- CUME_DIST的使用:
此函数的结果和order by的排序顺序有关系。
CUME_DIST:小于等于当前值的行数/分组内总行数。 order默认顺序
:正序
比如,统计小于等于当前薪水的人数,所占总人数的比例。
SELECT dept, userid, sal, CUME_DIST() OVER(ORDER BY sal) AS rn1, CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM staff_salary;
结果如下:
解释:
rn1: 没有partition,所有数据均为1组,总行数为5, 第一行:小于等于1000的行数为1,因此,1/5=0.2 第三行:小于等于3000的行数为3,因此,3/5=0.6 rn2: 按照部门分组,dpet=d1的行数为3, 第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666
5. GROUPING SETS、GROUPING__ID、CUBE、ROLLUP
这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。
还是先创建一个用户访问表:user_date
CREATE TABLE user_date ( month STRING, day STRING, cookieid STRING );
表中加入如下数据:
2021-03,2021-03-10,cookie1 2021-03,2021-03-10,cookie5 2021-03,2021-03-12,cookie7 2021-04,2021-04-12,cookie3 2021-04,2021-04-13,cookie2 2021-04,2021-04-13,cookie4 2021-04,2021-04-16,cookie4 2021-03,2021-03-10,cookie2 2021-03,2021-03-10,cookie3 2021-04,2021-04-12,cookie5 2021-04,2021-04-13,cookie6 2021-04,2021-04-15,cookie3 2021-04,2021-04-15,cookie2 2021-04,2021-04-16,cookie1
- GROUPING SETS的使用:
grouping sets是一种将多个group by 逻辑写在一个sql语句中的便利写法。
等价于将不同维度的GROUP BY结果集进行UNION ALL。
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM user_date GROUP BY month,day GROUPING SETS (month,day) ORDER BY GROUPING__ID;
注:上述SQL中的GROUPING__ID,是个关键字,表示结果属于哪一个分组集合,根据grouping sets中的分组条件month,day,1是代表month,2是代表day。
结果如下:
上述SQL等价于:
SELECT month, NULL as day, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM user_date GROUP BY month UNION ALL SELECT NULL as month, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM user_date GROUP BY day;