(16)查询获奖员工信息
数据源:16_table1.csv、16_table2.csv
id,name 1,王小凤 2,刘诗迪 3,李思雨 4,张文华 5,张青云 6,徐文杰 7,李智瑞 8,徐雨秋 9,孙皓然
id,name 1,王小凤 2,刘诗迪 3,李思雨 7,李智瑞 8,徐雨秋 9,孙皓然
需求:现在我们想通过table1表获取获奖员工的更多信息。
select * from test.`16_table1` inner join `16_table2` `16t2` on `16_table1`.id = `16t2`.id;
运行结果:
(17)计算用户留存情况
数据源:user_login.csv
uid,login_time 1,2021/4/21 6:00 1,2021/4/24 10:00 1,2021/4/25 19:00 2,2021/4/22 10:00 2,2021/4/28 9:00 2,2021/4/29 14:00 3,2021/4/27 8:00 3,2021/4/28 10:00
需求:我们想看用户的次日留存数、三日留存数、七日留存数(只要用户首次登录以后再登录就算留存下来了),该怎么实现呢?
解题思路:本题有两种解题思路,先看思路一:按照用户时间求出七日留存,首先按uid分组,求出每个uid的第一次登陆时间和最后一次登陆时间,算出中间间隔的时间,如果间隔为1就是次日留存,间隔为3就是3日留存,间隔为7就是7日留存,以此类推分别求出他们的数量。思路二:按照当日时间求出七日留存,如果用户登陆的时间正好等于当前日期前一天的日期,则去重统计uid数量即为次日留存数,以此类推分别求出三日留存、七日留存。
-- 写法一,按照用户时间求出七日留存 select count(case when user_day_value.day_value = 1 then uid end ) as '次日留存', count(case when user_day_value.day_value = 3 then uid end ) as '三日留存', count(case when user_day_value.day_value = 7 then uid end ) as '七日留存' from (select uid, max(login_time), min(login_time), datediff(date (max(login_time)) , date (min(login_time))) as day_value from test.user_login group by uid) as user_day_value; -- 写法二,按照当日时间求出七日留存 select count(distinct case when date(login_time) = date_sub(date (current_date),interval 1 day) then uid end ) as '次日留存', count(distinct case when date(login_time) > date_sub(date (current_date),interval 3 day) then uid end ) as '三日留存', count(distinct case when date(login_time) > date_sub(date (current_date),interval 7 day) then uid end ) as '七日留存' from test.user_login;
运行结果:
写法一:
写法二:
(18)筛选最受欢迎的课程
数据源:course_table.csv
id,name,grade,course 1,王小凤,一年级,心理学 2,刘诗迪,二年级,心理学 3,李思雨,三年级,社会学 4,张文华,一年级,心理学 5,张青云,二年级,心理学 6,徐文杰,三年级,计算机 7,李智瑞,一年级,心理学 8,徐雨秋,二年级,计算机 9,孙皓然,三年级,社会学 10,李春山,一年级,社会学
需求:现在我们想知道最受欢迎的课程是哪一门。
解题思路:解法一通过按照course分组,按照course_count降序排序,取第一个值即为最受欢迎的课程,但是这种写法在一定程度上不严谨,比如说course_count最大值一样多,也就所说的并列第一,那么使用第一种方法就不能把所有最喜欢的课程都取出来。来看看第二种写法,通过使用rank窗口函数得出并列第一的值,通过where = 1,即可以全部取出最受欢迎的课程。
-- 18.筛选最受欢迎的课程 -- 解法一:不取重复值 select * from test.course_table; select course, count(course) as course_count from test.course_table group by course order by course_count desc limit 1; -- 解法二:取重复值 select course, course_count from (select course, count(course) as course_count, rank() over (order by count(course) desc ) as rank_num from test.course_table group by course) as t where rank_num = 1;
运行结果:
(19)筛选出每个年级最受欢迎的三门课程
数据源:course_table.csv
需求:现在我们想知道每个年级最受欢迎的三门课程
解题思路:通过运用窗口函数,分组求Top N
-- 19.筛选出每个年级最受欢迎的三门课程 -- 解法一:利用两个窗口函数 select * from (select a.grade, a.course, a.count_num, row_number() over (partition by grade order by count_num desc) as count_rank from (select grade, course, count(course) over (partition by grade,course) as count_num from test.course_table) as a group by a.grade,a.course,a.count_num) as b where b.count_rank < 4 ; -- 解法二 select * from test.course_table; select * from (select a.grade, a.course, a.count_num, row_number() over (partition by grade order by count_num desc) as count_rank from (select grade, course , count(id) as count_num from test.course_table group by grade, course) as a) as b where b.count_rank < 4;
运行结果:
根据此题也可以求出每个年级最受欢迎的课程,这里我不用窗口函数,使用连接子查询:
select * from (select grade, course , count(id) as count_num from test.course_table group by grade, course) as t1 where t1.count_num = (select t2.count_num from (select grade, course , count(id) as count_num from test.course_table group by grade, course) as t2 where t1.grade = t2.grade order by t2.count_num desc limit 1);
运行结果:
(20)求累积和
数据源:consum_order_table.csv
order_id,uid,amount 201901,1,10 201902,2,20 201903,3,15 201904,3,15 201905,4,20 201906,4,20 201907,5,25 201908,5,25 201909,6,30 201910,6,30 201911,7,35 201912,7,35
需求:现在我们想看下80%的订单金额最少是由多少用户贡献的
解题思路:第一步按uid分组,求出每个uid的amount和,第二步在生成累积和的时候需要按照订单金额进行降序排列,这样就可以得到最少的人数,第三步利用子查询获取到全部的订单金额,求出小于总额80%的uid的数量
select count(uid) from (select uid, sum_amount, sum(sum_amount) over (order by sum_amount desc ) as consume_amount from (select uid, sum(amount) as sum_amount from test.consum_order_table group by uid) as uid_table) as t where t.consume_amount < (select (sum(amount) * 0.8) as sum_80 from test.consum_order_table);
运行结果:
(21)获取新增用户数
数据源:user_reg_table.csv
uid,reg_time 1,2019/12/25 10:00 2,2019/12/26 10:00 3,2019/12/27 10:00 4,2019/12/28 10:00 5,2019/12/29 10:00 6,2019/12/30 10:00 7,2019/12/31 10:00 8,2020/1/1 10:00 9,2020/1/2 10:00 10,2020/1/3 10:00 11,2020/1/4 10:00
需求:我们想获取某一天的新增用户数,以及该天对应的过去7天内每天的平均新增用户数
解题思路:我们以2020-1-1那一天为例,首先通过case when 让reg_time = 设定的那一天,求出uid的数量即为某一天的新增用户数。然后求出2020-1-1向前7天的reg_time,求出uid的数量除以7即可得出过去7天内每天的平均新增用户数
-- 假设求2020-1-1那一天新增的用户 set @day_date = '2020-01-01'; select count(case when date (reg_time) = @day_date then uid end ) as new_count, (count(uid) / 7) as 7_avg_count from test.user_reg_table where date (reg_time) between date_sub(@day_date,interval 6 day) and @day_date;
运行结果:
(22)获取用户首次购买时间
数据源:first_order_table.csv
order_id,uid,order_time 201901,1,2020-01-01 10:00:00 201902,2,2020-01-02 10:00:00 201903,3,2020-01-03 10:00:00 201904,1,2020-01-04 10:00:00 201905,2,2020-01-05 10:00:00 201906,3,2020-01-06 10:00:00 201907,1,2020-01-07 10:00:00 201908,2,2020-01-08 10:00:00 201909,3,2020-01-09 10:00:00 201910,1,2020-01-10 10:00:00 201911,2,2020-01-11 10:00:00
需求:我们想获取每个用户的首次购买时间,以及首次购买时间是否在最近7天内
解题思路:首先按uid分组,求出每个uid的首次订单时间,然后通过case when判断首次订单时间是否大于当日七天之前的时间,如果大于则首次购买时间在最近7天内,如果小于则首次购买时间不在最近7天内
select uid, min(order_time) as first_time, (case when date_sub(current_date(),interval 7 day ) < min(date(order_time)) then 'yes' else 'no' end) as is_7_day from test.first_order_table group by uid;
运行结果:
(23)同时获取用户和订单数据
数据源:user_reg_table.csv、first_order_table.csv
需求:现在我们想获取过去7天每天的新增用户数、订单数、下单用户数
解题思路:首先求过去7天每天新增用户数,再求过去7天每天新增订单数、下单用户数,两个表join即可得到需求
-- 假设以2020-01-04为例 set @day = '2020-01-04'; select a.tdate, a.count_uid, b.order_count, b.uesr_count from -- 求过去7天每天新增用户数 (select date (reg_time) as tdate, count(uid) over(partition by date (reg_time)) as count_uid from test.user_reg_table where date (reg_time) between date_sub(@day,interval 7 day) and @day) as a left join -- 求过去7天每天新增订单数、下单用户数 (select date (order_time) as tdate, count(order_id) as order_count , count(uid) as uesr_count from test.first_order_table where date (order_time) between date_sub(@day,interval 7 day) and @day group by date (order_time)) as b on a.tdate = b.tdate;
运行结果:
(24)随机抽样
数据源:user_reg_table.csv、first_order_table.csv
需求:现在我们想从用户表中随机抽取5个用户,并获取这5个用户的历史购买订单数
解题思路:我们要随机抽取5个用户并获取他们的历史购买订单数,首先需要生成每个用户的历史购买订单数,然后从中随机抽取5个。具体的思路为利用rand()函数生成随机数,然后利用order by进行排序,最后利用limit将前5条数据显示出来,运行结果如下表所示。
select user_reg_table.uid, t.order_count from test.user_reg_table left join (select uid, count(order_id) as order_count from test.first_order_table group by uid) as t on user_reg_table.uid = t.uid order by rand() limit 5;
运行结果:
(25)获取沉默用户数
数据源:user_reg_table.csv、first_order_table.csv
需求:现在我们想获取沉默用户的数量,沉默的定义是已注册但最近30天内没有购买记录的用户
解题思路:先求出最近三十天已经有购买记录的用户,再求出所有已经注册的用户。两个表连接,得出第一列为所有已经注册的用户,第二列则为最近三十天购买记录的用户,如果不能连接为null,最后求出null值的个数即为最近30天有购买记录的用户
-- 获取已经注册用户 select count(t1.uid) as slient_count from test.user_reg_table as t1 -- 获取最近30天有购买记录的用户 left join (select uid from test.first_order_table where date (order_time) between current_date and date_sub(current_date,interval 30 day) group by uid) t2 on t1.uid = t2.uid where t2.uid is null;
(26)获取新用户的订单数
数据源:user_reg_table.csv、first_order_table.csv
需求:现在我们想获取最近7天注册的新用户在最近7天内的订单数是多少
解题思路:第一步,求出最近7天注册的新用户,第二步,求出最近7天下订单的用户和订单数,第三步,两表连接,第二列为null值为最近7天新注册没下订单的用户,否则为下订单的用户,求出第三列的个数即为获取最近7天注册的新用户在最近7天内的订单数
select sum(t2.order_count) from -- 获取最近7天注册的新用户 (select uid from test.user_reg_table where date (reg_time) between current_date and date_sub(current_date,interval 7 day )) as t1 left join -- 获取最近7天下订单的用户和订单数 (select uid, count(order_id) as order_count from test.first_order_table where date (order_time) between current_date and date_sub(current_date,interval 7 day ) group by uid) as t2 on t1.uid = t2.uid;
运行结果:
(27)获取借款到期名单
数据源:loan_table.csv
id,loan_time,expire_time,reback_time,amount,status 1,2019/12/1,2019/12/31,,2208,0 2,2019/12/1,2019/12/31,2019/12/31,5283,1 3,2019/12/5,2020/1/4,,5397,0 4,2019/12/5,2020/1/4,,4506,0 5,2019/12/10,2020/1/9,,3244,0 6,2019/12/10,2020/1/9,2020/1/12,4541,1 7,2020/1/1,2020/1/31,2020/1/10,3580,1 8,2020/1/1,2020/1/31,,7045,0 9,2020/1/5,2020/2/4,,2067,0 10,2020/1/5,2020/2/4,,7225,0
需求:这张表包含id、loan_time(借款时间)、expire_time(到期时间)、reback_time(还款时间)、amount(金额)和status(还款状态,1表示已还款、0表示未还款)六个字段,我们想获取每天到期的借款笔数、借款金额和平均借款天数
select count(id) as loan_count , sum(amount) as loan_amount, avg(datediff(reback_time,loan_time)) as avg_day from test.loan_table where date (loan_time) = current_date;
运行结果:
(28)获取即将到期的借款信息
数据源:loan_table.csv
需求:我们想知道有多少笔借款会在未来7天内到期,以及其中有多少笔是已经还款的
解题思路:先获取今日到未来7天所有的还款日期,求出count(id)即为借款会在未来7天内到期的数量,在计算出目前状态为还款的数量。
select count(id) as loan_count, count(case when status = 1 then id end ) as reback_count from test.loan_table where expire_time between current_date and date_add(current_date,interval 7 day );
运行结果:
(29)获取历史逾期借款信息
数据源:loan_table.csv
需求:我们想知道历史逾期的借款笔数和金额,以及至今还逾期的借款笔数和金额
解题思路:这里面的关键信息在于,逾期怎么判断,对到期时间和还款时间进行比较,如果是逾期且现在已经还款的,可以直接比较到期时间和还款时间,如果还款时间大于到期时间,则说明是逾期的;还有一种是逾期且至今还未还款的,这种情况是没有还款时间的,也就是还款时间为空,但是到期时间是在今天之前,说明已经到期但是未还款。
select count(id) as loan_count, sum(amount) as loan_amount, count(case when status = 0 then id end ) as no_reback_count, sum(case when status = 0 then amount end ) as no_reback_amount from test.loan_table where (current_date > expire_time and reback_time is null) or (reback_time > loan_table.expire_time);
运行结果:
(30)获取部门工资最高的员工
数据源:Employee.csv、Department.csv
Id,Name,Salary,DepartmentId 1,Joe,70000,1 2,Jim,90000,1 3,Henry,80000,2 4,Sam,60000,2 5,Max,90000,1
Id,Name 1,IT 2,Sales
需求1:如何求出Employee表薪水最高的员工(含重复值)
-- 问题一:如何求出并列第一第二的值 -- 解法一:运用子查询 select a.Id, Name, Salary from test.Employee a inner join (select Id from test.Employee where Salary = (select max(Salary) from test.Employee)) b on a.Id = b.Id; -- 解法二:运用查rank()窗口函数 select Name, Salary from (select Name, Salary,rank() over (order by Salary desc ) as rank_num from test.Employee) as t where t.rank_num = 1;
运行结果:
需求2:求出部门工资最高的员工(含重复值)
-- 问题二:部门工资最高的员工 select t2.Department, t2.Employee, t2.Salary from (select Department, Employee, Salary, rank() over (partition by Department order by Salary desc ) as rank_num from (select a.Name as Employee, a.Salary as Salary , b.Name as Department from test.Employee as a left join test.Department as b on a.DepartmentId = b.Id) as t) as t2 where rank_num = 1;
运行结果:
(补)求连续登陆三天的用户
表结构:
解题思路:
第一步首先按照id,和date分组,
第二步,使用窗口函数,按照id分区,按日期排序,
第三步,使用date_sub函数让date向前移动cum天
第四步,按照id,result_date分组,求出result_date的个数,并取出result_date>3的个数即为连续登陆3天以上的用户
具体代码:
select id, result_date, count(result_date) cnt_3 from (select id, date, cum, date_sub(date,interval cum day ) as result_date from (select id, date(date) as date, row_number() over (partition by id order by date(date)) as cum from test.demo01 group by id, date(date) ) as date_table) as t1 group by id, result_date having count(result_date) >= 3;
(补)某互联网公司面试题
问题:一张用户日活信息表,用户某一天的用户info为空,需要用该用户的最近一天的info去补,请问该如何实现
表结构:
dt uid info 20210811 001 1 20210811 002 0 20210811 003 1 20210811 004 1 20210812 001 null 20210812 002 0 20210812 003 null 20210813 002 1 20210813 003 0 20210813 004 1 20210814 001 0 20210814 003 1 20210814 004 null
比如说001用户12号的数据为null,就需要拿10号或者14号的数据去补,而不是拿15号
20210810 001 1 20210812 001 null 20210814 001 1 20210815 001 1 20210816 001 0 20210817 001 null 20210818 001 null 20210819 001 null
思路:先做笛卡尔乘积,再取时间差的绝对值的最小值,再分组排序,取第一个
具体测试数据:
测试代码:
select dt1 ,uid ,info as info_null from ( select dt1 , uid , info_null , info , min(dt_diff) as min_dt_diff , row_number() over (partition by dt1,uid order by min(dt_diff)) as rank_num from ( select a.dt as dt1 , b.dt as dt2 , a.uid as uid , a.info as info_null , b.info as info , abs(datediff(b.dt, a.dt)) as dt_diff from -- 计算出为空的数据 ( select dt , uid , info , row_number() over (partition by uid order by dt) as rank_num from user_login_info where info is null ) a left join -- 计算出不为空的数据 ( select dt , uid , info , row_number() over (partition by uid order by dt) as rank_num from user_login_info where info is not null ) as b on a.uid = b.uid group by a.dt , b.dt , a.uid , a.info , b.info ) as t group by dt1 , uid , info_null , info ) as abt where rank_num = 1
运行结果:
但是我这里的写法可能不太好,因为用户日活数据表数据量可能比较大,在做笛卡尔积的时候会性能较低。