- Using CUBE:
CUBE aggregates over every combination of the GROUP BY dimensions.
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM user_date
GROUP BY month, day WITH CUBE
ORDER BY GROUPING__ID;
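The SQL above is equivalent to the following. The GROUPING__ID values shown follow the numbering used before Hive 2.3.0; later Hive versions assign the IDs differently.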
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM user_date
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM user_date GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM user_date GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM user_date GROUP BY month, day;
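To make the expansion into grouping sets concrete, here is a minimal Python sketch that enumerates every combination of (month, day) and counts distinct cookies for each one. The sample rows and the helper name `cube_uv` are made up for illustration, and the grouping ID follows the numbering of the UNION ALL statement above (bit 0 for month, bit 1 for day); it is an emulation of the idea, not of how Hive executes CUBE.

```python
from itertools import combinations

# Hypothetical sample rows: (month, day, cookieid).
ROWS = [
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"),
    ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"),
    ("2015-04", "2015-04-13", "cookie1"),
]
DIMS = ("month", "day")


def cube_uv(rows):
    """Return {(month, day, grouping_id): uv} for every subset of DIMS."""
    result = {}
    for size in range(len(DIMS) + 1):
        for kept in combinations(range(len(DIMS)), size):
            # Bit i is set when dimension i is part of the grouping set.
            grouping_id = sum(1 << i for i in kept)
            groups = {}
            for row in rows:
                # Dimensions outside the grouping set are reported as None (NULL).
                key = tuple(row[i] if i in kept else None for i in range(len(DIMS)))
                groups.setdefault(key, set()).add(row[2])
            for key, cookies in groups.items():
                result[key + (grouping_id,)] = len(cookies)
    return result


uv = cube_uv(ROWS)
for key in sorted(uv, key=lambda k: (k[2], str(k))):
    print(key, uv[key])
```

With two dimensions there are four grouping sets: the grand total, per month, per day, and per (month, day).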
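- Using ROLLUP:
ROLLUP produces a subset of the CUBE result: it aggregates hierarchically, starting from the leftmost dimension.
For example, hierarchical aggregation led by the month dimension: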
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM user_date
GROUP BY month, day WITH ROLLUP
ORDER BY GROUPING__ID;
Swapping month and day makes day the leading dimension of the hierarchy:
SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM user_date
GROUP BY day, month WITH ROLLUP
ORDER BY GROUPING__ID;
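Here, aggregating by day and month gives the same result as aggregating by day alone, because the two dimensions have a parent-child relationship (each day belongs to exactly one month). With an unrelated combination of dimensions, the two results would differ.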
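The hierarchy can be sketched in Python as "one aggregation per prefix of the dimension list". The sample rows and the helper name `rollup_uv` are assumptions for illustration only. The last lines check the parent-child observation: grouping by (day, month) gives the same counts as grouping by day.

```python
# Hypothetical sample rows: (month, day, cookieid).
ROWS = [
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"),
    ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"),
    ("2015-04", "2015-04-13", "cookie1"),
]
MONTH, DAY = 0, 1  # column indexes into each row


def rollup_uv(rows, dims):
    """Distinct-cookie counts for every prefix of dims.

    Returns {prefix_length: {group_key: uv}}; prefix length 0 is the grand total.
    """
    result = {}
    for depth in range(len(dims), -1, -1):
        kept = dims[:depth]
        groups = {}
        for row in rows:
            key = tuple(row[i] for i in kept)
            groups.setdefault(key, set()).add(row[2])
        result[depth] = {key: len(cookies) for key, cookies in groups.items()}
    return result


by_month = rollup_uv(ROWS, (MONTH, DAY))  # (month, day) -> (month) -> ()
by_day = rollup_uv(ROWS, (DAY, MONTH))    # (day, month) -> (day) -> ()

print(by_month[1])  # per-month subtotals
print(by_day[1])    # per-day subtotals

# Each day belongs to one month, so adding month to the key changes nothing.
per_day_and_month = {key[0]: n for key, n in by_day[2].items()}
per_day = {key[0]: n for key, n in by_day[1].items()}
print(per_day_and_month == per_day)
```

Unlike CUBE, the (day)-only level never appears when month leads, and the (month)-only level never appears when day leads.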
Window functions in practice
1. Second-highest salary
Difficulty: easy.
Write a SQL query that returns the second-highest salary (Salary) from the Employee table.
+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
For the Employee table above, the query should return 200 as the second-highest salary. If there is no second-highest salary, the query should return null.
+---------------------+
| SecondHighestSalary |
+---------------------+
| 200                 |
+---------------------+
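This problem can be solved with a ranking window function. row_number alone is not enough: when the top salary is tied it reports the same value again at position 2, and the query returns no row at all (instead of null) when there is no second-highest salary. Using dense_rank and wrapping the lookup in MAX handles both cases, because MAX over an empty set yields null.
Reference code:
SELECT MAX(Salary) AS SecondHighestSalary
FROM (
    SELECT Salary, dense_rank() over(order by Salary desc) rk
    FROM Employee
) t
WHERE t.rk = 2;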
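A simpler version:
SELECT (
    SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT 1 OFFSET 1
) AS SecondHighestSalary;
OFFSET is the number of rows to skip before rows are returned: 0 starts from the first row, so OFFSET 1 starts from the second. The outer SELECT turns an empty result into null.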
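A quick way to check the edge cases (a tie for the top salary, and no second-highest salary at all) is to run a LIMIT/OFFSET query wrapped in a scalar subquery against an in-memory SQLite database from Python. SQLite only stands in for the target database here, and the helper name `second_highest` is made up for this sketch.

```python
import sqlite3

QUERY = """
SELECT (
    SELECT DISTINCT Salary
    FROM Employee
    ORDER BY Salary DESC
    LIMIT 1 OFFSET 1
) AS SecondHighestSalary
"""


def second_highest(salaries):
    """Load the salaries into an in-memory Employee table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Salary INTEGER)")
    conn.executemany(
        "INSERT INTO Employee (Salary) VALUES (?)", [(s,) for s in salaries]
    )
    value = conn.execute(QUERY).fetchone()[0]
    conn.close()
    return value


print(second_highest([100, 200, 300]))  # 200
print(second_highest([300, 300, 100]))  # 100: DISTINCT counts the tie once
print(second_highest([100]))            # None: the scalar subquery is empty
```

The scalar subquery always produces exactly one row, which is what turns "no second-highest salary" into null rather than an empty result.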
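2. Rank scores
Difficulty: easy.
Write a SQL query that ranks scores.
If two scores are equal, they share the same rank (Rank). Note that the rank after a tie should be the next consecutive integer; in other words, there should be no "gaps" between ranks.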
+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+
For example, given the Scores table above, your query should return the following (ordered by score, highest first):
+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+
Reference code:
SELECT Score, dense_rank() over(order by Score desc) as `Rank`
FROM Scores
ORDER BY Score DESC;
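To see why dense_rank is the function that matches the "no gaps" requirement, the sketch below runs the three ranking functions side by side on the sample Scores data. It uses Python's sqlite3 module as a stand-in database and assumes an SQLite build with window function support (3.25 or later); the column aliases are chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scores (Id INTEGER PRIMARY KEY, Score REAL)")
conn.executemany(
    "INSERT INTO Scores (Id, Score) VALUES (?, ?)",
    [(1, 3.50), (2, 3.65), (3, 4.00), (4, 3.85), (5, 4.00), (6, 3.65)],
)

rows = conn.execute(
    """
    SELECT Score,
           DENSE_RANK() OVER (ORDER BY Score DESC) AS dense_rk,
           RANK()       OVER (ORDER BY Score DESC) AS rk,
           ROW_NUMBER() OVER (ORDER BY Score DESC) AS rn
    FROM Scores
    ORDER BY Score DESC, Id
    """
).fetchall()
conn.close()

for row in rows:
    print(row)  # (Score, dense_rk, rk, rn)
```

dense_rank gives 1, 1, 2, 3, 3, 4, while rank leaves gaps after ties (1, 1, 3, 4, 4, 6) and row_number never repeats a value, so tied scores get different numbers.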
3. Consecutive numbers
Difficulty: medium.
Write a SQL query that finds all numbers that appear at least three times in a row.
+----+-----+
| Id | Num |
+----+-----+
| 1  | 1   |
| 2  | 1   |
| 3  | 1   |
| 4  | 2   |
| 5  | 1   |
| 6  | 2   |
| 7  | 2   |
+----+-----+
For example, given the Logs table above, 1 is the only number that appears at least three times in a row.
+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
Reference code:
SELECT DISTINCT `Num` AS ConsecutiveNums
FROM (
    SELECT Num,
           lead(Num, 1, null) over(order by Id) n2,
           lead(Num, 2, null) over(order by Id) n3
    FROM Logs
) t1
WHERE Num = n2 AND Num = n3;
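The intermediate columns make the lead-based approach easier to follow. The sketch below, using Python's sqlite3 module as a stand-in database (window functions need SQLite 3.25 or later), first prints each row next to the values one and two rows ahead, then applies the filter. The name `SHIFTED` is introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Logs (Id INTEGER PRIMARY KEY, Num INTEGER)")
conn.executemany(
    "INSERT INTO Logs (Id, Num) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)],
)

# Inner query: each row next to the values one and two rows ahead of it.
SHIFTED = """
    SELECT Id, Num,
           LEAD(Num, 1) OVER (ORDER BY Id) AS n2,
           LEAD(Num, 2) OVER (ORDER BY Id) AS n3
    FROM Logs
"""

shifted = conn.execute(SHIFTED + " ORDER BY Id").fetchall()
for row in shifted:
    print(row)  # n2/n3 are None where the window runs past the last row

# Outer query: keep the numbers that equal both following values.
result = conn.execute(
    f"SELECT DISTINCT Num AS ConsecutiveNums FROM ({SHIFTED}) t1 "
    "WHERE Num = n2 AND Num = n3"
).fetchall()
conn.close()
print(result)
```

Only the row with Id 1 has Num, n2 and n3 all equal, so the result contains the single value 1. Note that ordering by Id treats rows as adjacent even if some Id values are missing.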
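4. N consecutive days of logins
Difficulty: hard.
Write a SQL query that returns the id and name of active users. An active user is one who logged in to their account on at least 5 consecutive days. The result should be ordered by id.
Table Accounts: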
+----+-----------+
| id | name      |
+----+-----------+
| 1  | Winston   |
| 7  | Jonathan  |
+----+-----------+
Table Logins:
+----+-------------+
| id | login_date  |
+----+-------------+
| 7  | 2020-05-30  |
| 1  | 2020-05-30  |
| 7  | 2020-05-31  |
| 7  | 2020-06-01  |
| 7  | 2020-06-02  |
| 7  | 2020-06-02  |
| 7  | 2020-06-03  |
| 1  | 2020-06-07  |
| 7  | 2020-06-10  |
+----+-------------+
For example, given the Accounts and Logins tables above, the only user who logged in on at least 5 consecutive days is the one with id = 7:
+----+-----------+
| id | name      |
+----+-----------+
| 7  | Jonathan  |
+----+-----------+
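Approach:
- Deduplicate: a user may log in more than once on the same day, so remove duplicate (id, login_date) rows first
- Sort: number each id's login dates in ascending order
- Difference: subtract the row number from the login date; rows in the same run of consecutive days share the same difference
- Count consecutive days: select id, count(*) group by id, difference (pseudocode)
- Keep the groups with 5 or more days
- Join with Accounts to get the name for each id
Reference code: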
SELECT DISTINCT b.id, ac.name
FROM (
    SELECT id, login_date,
           DATE_SUB(login_date, ROW_NUMBER() OVER(PARTITION BY id ORDER BY login_date)) AS diff
    FROM (SELECT DISTINCT id, login_date FROM Logins) a
) b
INNER JOIN Accounts ac ON b.id = ac.id
GROUP BY b.id, ac.name, b.diff
HAVING COUNT(b.id) >= 5
ORDER BY b.id;
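Notes:
- DATE_SUB: in the two-argument form DATE_SUB(date, X) used here (Hive and Spark SQL), a positive X returns the date X days before the given date. MySQL instead requires DATE_SUB(date, INTERVAL X DAY);
- Finding consecutive dates: take the difference between the login date and its row number. The row numbers are consecutive, so if the login dates are also consecutive the difference stays the same;
- GROUP BY and HAVING: group by id and the difference, then use COUNT to find ids with a run of at least 5 days. Note that COUNT does not have to appear in the SELECT list; it can be used directly in HAVING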
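The "date minus row number" trick is easiest to trust after seeing the intermediate values. The sketch below replays the steps on the sample data with Python's sqlite3 module (SQLite 3.25 or later for window functions). SQLite has no DATE_SUB, so this example substitutes its `date(login_date, '-N days')` modifier; the name `DIFFS` is introduced only here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Logins (id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO Accounts VALUES (?, ?)", [(1, "Winston"), (7, "Jonathan")]
)
conn.executemany(
    "INSERT INTO Logins VALUES (?, ?)",
    [
        (7, "2020-05-30"), (1, "2020-05-30"), (7, "2020-05-31"),
        (7, "2020-06-01"), (7, "2020-06-02"), (7, "2020-06-02"),
        (7, "2020-06-03"), (1, "2020-06-07"), (7, "2020-06-10"),
    ],
)

# Deduplicate, number each user's dates, then shift each date back by its row number.
DIFFS = """
    SELECT id, login_date, date(login_date, '-' || rn || ' days') AS diff
    FROM (
        SELECT id, login_date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY login_date) AS rn
        FROM (SELECT DISTINCT id, login_date FROM Logins)
    )
"""

diffs = conn.execute(DIFFS + " ORDER BY id, login_date").fetchall()
for row in diffs:
    print(row)  # rows of one consecutive run share the same diff

# Group by (id, diff): each group is one run of consecutive days.
active = conn.execute(
    f"""
    SELECT DISTINCT b.id, ac.name
    FROM ({DIFFS}) b
    INNER JOIN Accounts ac ON b.id = ac.id
    GROUP BY b.id, ac.name, b.diff
    HAVING COUNT(*) >= 5
    ORDER BY b.id
    """
).fetchall()
conn.close()
print(active)
```

For user 7 the five dates from 2020-05-30 to 2020-06-03 all map to 2020-05-29, while 2020-06-10 maps to 2020-06-04 and starts a new run, so only that first run reaches the threshold of 5.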
5. Median given the frequency of numbers
Difficulty: hard.
The Numbers table stores each number together with its frequency.
+----------+-------------+
| Number   | Frequency   |
+----------+-------------+
| 0        | 7           |
| 1        | 1           |
| 2        | 3           |
| 3        | 1           |
+----------+-------------+
In this table the numbers are 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3, so the median is (0 + 0) / 2 = 0.
+--------+
| median |
+--------+
| 0.0000 |
+--------+
Write a query that finds the median of all the numbers and names the result median.
Reference code:
select avg(cast(number as float)) as median
from (
    select Number, Frequency,
           sum(Frequency) over(order by Number) - Frequency as prev_sum,
           sum(Frequency) over(order by Number) as curr_sum
    from Numbers
) t1, (
    select sum(Frequency) as total_sum from Numbers
) t2
where t1.prev_sum <= (cast(t2.total_sum as float) / 2)
  and t1.curr_sum >= (cast(t2.total_sum as float) / 2)
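The running-sum condition can be checked on a few small inputs. The sketch below runs an equivalent query through Python's sqlite3 module (SQLite 3.25 or later for window functions); it multiplies by 1.0 and divides by 2.0 in place of the float casts, and the helper name `median_by_frequency` is made up for this example. A number is kept when the midpoint total_sum / 2 falls inside its slice [prev_sum, curr_sum] of the expanded sequence, so an odd total keeps one number and an even total may keep two.

```python
import sqlite3

MEDIAN_QUERY = """
SELECT AVG(Number * 1.0) AS median
FROM (
    SELECT Number,
           SUM(Frequency) OVER (ORDER BY Number) - Frequency AS prev_sum,
           SUM(Frequency) OVER (ORDER BY Number) AS curr_sum
    FROM Numbers
) t1
CROSS JOIN (SELECT SUM(Frequency) AS total_sum FROM Numbers) t2
WHERE t1.prev_sum <= t2.total_sum / 2.0
  AND t1.curr_sum >= t2.total_sum / 2.0
"""


def median_by_frequency(pairs):
    """pairs is a list of (number, frequency); returns the median as a float."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Numbers (Number INTEGER, Frequency INTEGER)")
    conn.executemany("INSERT INTO Numbers VALUES (?, ?)", pairs)
    value = conn.execute(MEDIAN_QUERY).fetchone()[0]
    conn.close()
    return value


print(median_by_frequency([(0, 7), (1, 1), (2, 3), (3, 1)]))  # 0.0
print(median_by_frequency([(1, 1), (2, 1)]))                  # 1.5
print(median_by_frequency([(1, 1), (2, 1), (3, 1)]))          # 2.0
```

In the sample table the total is 12, so the midpoint is 6; only the number 0 (covering positions 0 to 7) contains it, which gives the median 0.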