大数据开发面试必会的SQL 30题！！！（二）-阿里云开发者社区

（16）查询获奖员工信息

数据源：16_table1.csv、16_table2.csv

id,name
1,王小凤
2,刘诗迪
3,李思雨
4,张文华
5,张青云
6,徐文杰
7,李智瑞
8,徐雨秋
9,孙皓然

id,name
1,王小凤
2,刘诗迪
3,李思雨
7,李智瑞
8,徐雨秋
9,孙皓然

需求：现在我们想通过table1表获取获奖员工的更多信息。

select * from test.`16_table1` inner join `16_table2` `16t2` on `16_table1`.id = `16t2`.id;

运行结果：

（17）计算用户留存情况

数据源：user_login.csv

uid,login_time
1,2021/4/21 6:00
1,2021/4/24 10:00
1,2021/4/25 19:00
2,2021/4/22 10:00
2,2021/4/28 9:00
2,2021/4/29 14:00
3,2021/4/27 8:00
3,2021/4/28 10:00

需求：我们想看用户的次日留存数、三日留存数、七日留存数（只要用户首次登录以后再登录就算留存下来了），该怎么实现呢？

解题思路：本题有两种解题思路，先看思路一：按照用户时间求出七日留存，首先按uid分组，求出每个uid的第一次登陆时间和最后一次登陆时间，算出中间间隔的时间，如果间隔为1就是次日留存，间隔为3就是3日留存，间隔为7就是7日留存，以此类推分别求出他们的数量。思路二：按照当日时间求出七日留存，如果用户登陆的时间正好等于当前日期前一天的日期，则去重统计uid数量即为次日留存数，以此类推分别求出三日留存、七日留存。

-- 写法一，按照用户时间求出七日留存
select
       count(case when user_day_value.day_value = 1 then uid end ) as '次日留存',
       count(case when user_day_value.day_value = 3 then uid end ) as '三日留存',
       count(case when user_day_value.day_value = 7 then uid end ) as '七日留存'
from
     (select uid,
       max(login_time),
       min(login_time),
       datediff(date (max(login_time)) , date (min(login_time))) as day_value
     from
        test.user_login
     group by
         uid) as user_day_value;
-- 写法二，按照当日时间求出七日留存
select
       count(distinct case when date(login_time) = date_sub(date (current_date),interval 1 day) then uid end ) as '次日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 3 day) then uid end ) as '三日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 7 day) then uid end ) as '七日留存'
from test.user_login;

运行结果：

写法一：

写法二：

（18）筛选最受欢迎的课程

数据源：course_table.csv

id,name,grade,course
1,王小凤,一年级,心理学
2,刘诗迪,二年级,心理学
3,李思雨,三年级,社会学
4,张文华,一年级,心理学
5,张青云,二年级,心理学
6,徐文杰,三年级,计算机
7,李智瑞,一年级,心理学
8,徐雨秋,二年级,计算机
9,孙皓然,三年级,社会学
10,李春山,一年级,社会学

需求：现在我们想知道最受欢迎的课程是哪一门。

解题思路：解法一通过按照course分组，按照course_count降序排序，取第一个值即为最受欢迎的课程，但是这种写法在一定程度上不严谨，比如说course_count最大值一样多，也就所说的并列第一，那么使用第一种方法就不能把所有最喜欢的课程都取出来。来看看第二种写法，通过使用rank窗口函数得出并列第一的值，通过where = 1，即可以全部取出最受欢迎的课程。

-- 18.筛选最受欢迎的课程
-- 解法一：不取重复值
select * from test.course_table;
select course,
       count(course) as course_count
from
     test.course_table
group by
         course
order by
         course_count desc limit 1;
-- 解法二：取重复值
select
       course,
       course_count
from
     (select
             course,
             count(course) as course_count,
             rank() over (order by count(course) desc ) as rank_num
     from
          test.course_table
     group by
              course) as t
where rank_num = 1;

运行结果：

（19）筛选出每个年级最受欢迎的三门课程

数据源：course_table.csv

需求：现在我们想知道每个年级最受欢迎的三门课程

解题思路：通过运用窗口函数，分组求Top N

-- 19.筛选出每个年级最受欢迎的三门课程
-- 解法一：利用两个窗口函数
select
       *
from
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank
     from
          (select grade,
                  course,
                  count(course) over (partition by grade,course) as count_num
          from test.course_table) as a
     group by a.grade,a.course,a.count_num) as b
where
      b.count_rank < 4 ;
-- 解法二
select * from test.course_table;
select
       *
from
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank from
        (select
          grade,
          course ,
          count(id) as count_num
        from
             test.course_table
        group by
                 grade,
                 course) as a) as b
where b.count_rank < 4;

运行结果：

根据此题也可以求出每个年级最受欢迎的课程，这里我不用窗口函数，使用连接子查询：

select
       *
from
     (select
             grade,
             course ,
             count(id) as count_num
     from
          test.course_table
     group by
              grade,
              course) as t1
where
      t1.count_num =
      (select
              t2.count_num
      from (select
                   grade,
                   course ,
                   count(id) as count_num
      from
           test.course_table
      group by
               grade,
               course) as t2
      where
            t1.grade = t2.grade
      order by
               t2.count_num desc limit 1);

运行结果：

（20）求累积和

数据源：consum_order_table.csv

order_id,uid,amount
201901,1,10
201902,2,20
201903,3,15
201904,3,15
201905,4,20
201906,4,20
201907,5,25
201908,5,25
201909,6,30
201910,6,30
201911,7,35
201912,7,35

需求：现在我们想看下80%的订单金额最少是由多少用户贡献的

解题思路：第一步按uid分组，求出每个uid的amount和，第二步在生成累积和的时候需要按照订单金额进行降序排列，这样就可以得到最少的人数，第三步利用子查询获取到全部的订单金额，求出小于总额80%的uid的数量

select
       count(uid)
from
     (select
       uid,
       sum_amount,
       sum(sum_amount) over (order by sum_amount desc ) as consume_amount
       from
            (select
                    uid,
                    sum(amount) as sum_amount
            from
                 test.consum_order_table
            group by uid) as uid_table) as t
where
      t.consume_amount < (select (sum(amount) * 0.8) as sum_80 from test.consum_order_table);

运行结果：

（21）获取新增用户数

数据源：user_reg_table.csv

uid,reg_time
1,2019/12/25 10:00
2,2019/12/26 10:00
3,2019/12/27 10:00
4,2019/12/28 10:00
5,2019/12/29 10:00
6,2019/12/30 10:00
7,2019/12/31 10:00
8,2020/1/1 10:00
9,2020/1/2 10:00
10,2020/1/3 10:00
11,2020/1/4 10:00

需求：我们想获取某一天的新增用户数，以及该天对应的过去7天内每天的平均新增用户数

解题思路：我们以2020-1-1那一天为例，首先通过case when 让reg_time = 设定的那一天，求出uid的数量即为某一天的新增用户数。然后求出2020-1-1向前7天的reg_time，求出uid的数量除以7即可得出过去7天内每天的平均新增用户数

-- 假设求2020-1-1那一天新增的用户
set @day_date = '2020-01-01';
select
       count(case when date (reg_time) = @day_date then uid end ) as new_count,
       (count(uid) / 7) as 7_avg_count
from
     test.user_reg_table
where
      date (reg_time) between date_sub(@day_date,interval 6 day) and @day_date;

运行结果：

（22）获取用户首次购买时间

数据源：first_order_table.csv

order_id,uid,order_time
201901,1,2020-01-01 10:00:00
201902,2,2020-01-02 10:00:00
201903,3,2020-01-03 10:00:00
201904,1,2020-01-04 10:00:00
201905,2,2020-01-05 10:00:00
201906,3,2020-01-06 10:00:00
201907,1,2020-01-07 10:00:00
201908,2,2020-01-08 10:00:00
201909,3,2020-01-09 10:00:00
201910,1,2020-01-10 10:00:00
201911,2,2020-01-11 10:00:00

需求：我们想获取每个用户的首次购买时间，以及首次购买时间是否在最近7天内

解题思路：首先按uid分组，求出每个uid的首次订单时间，然后通过case when判断首次订单时间是否大于当日七天之前的时间，如果大于则首次购买时间在最近7天内，如果小于则首次购买时间不在最近7天内

select
       uid,
       min(order_time) as first_time,
       (case when date_sub(current_date(),interval 7 day ) < min(date(order_time)) then 'yes' else 'no' end) as is_7_day
from
     test.first_order_table
group by
         uid;

运行结果：

（23）同时获取用户和订单数据

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取过去7天每天的新增用户数、订单数、下单用户数

解题思路：首先求过去7天每天新增用户数，再求过去7天每天新增订单数、下单用户数，两个表join即可得到需求

-- 假设以2020-01-04为例
set @day = '2020-01-04';
select
       a.tdate,
       a.count_uid,
       b.order_count,
       b.uesr_count
from
-- 求过去7天每天新增用户数
     (select
       date (reg_time) as tdate,
       count(uid) over(partition by date (reg_time)) as count_uid
     from
          test.user_reg_table
     where
           date (reg_time) between date_sub(@day,interval 7 day) and @day) as a
left join
-- 求过去7天每天新增订单数、下单用户数
     (select
             date (order_time) as tdate,
             count(order_id) as order_count ,
             count(uid) as uesr_count
     from
          test.first_order_table
     where
           date (order_time) between date_sub(@day,interval 7 day) and @day
     group by date (order_time)) as b
on a.tdate = b.tdate;

运行结果：

（24）随机抽样

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想从用户表中随机抽取5个用户，并获取这5个用户的历史购买订单数

解题思路：我们要随机抽取5个用户并获取他们的历史购买订单数，首先需要生成每个用户的历史购买订单数，然后从中随机抽取5个。具体的思路为利用rand()函数生成随机数，然后利用order by进行排序，最后利用limit将前5条数据显示出来，运行结果如下表所示。

select
       user_reg_table.uid,
       t.order_count
from
     test.user_reg_table
left join
         (select
                 uid,
                 count(order_id) as order_count
         from
              test.first_order_table
         group by
                  uid) as t
on user_reg_table.uid = t.uid
order by rand() limit 5;

运行结果：

（25）获取沉默用户数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取沉默用户的数量，沉默的定义是已注册但最近30天内没有购买记录的用户

解题思路：先求出最近三十天已经有购买记录的用户，再求出所有已经注册的用户。两个表连接，得出第一列为所有已经注册的用户，第二列则为最近三十天购买记录的用户，如果不能连接为null，最后求出null值的个数即为最近30天有购买记录的用户

-- 获取已经注册用户
select
       count(t1.uid) as slient_count
from
     test.user_reg_table as t1
-- 获取最近30天有购买记录的用户
left join
         (select
                 uid
         from
              test.first_order_table
         where
               date (order_time)
                   between current_date and date_sub(current_date,interval 30 day)
         group by uid) t2
on
    t1.uid = t2.uid
where
      t2.uid is null;

（26）获取新用户的订单数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取最近7天注册的新用户在最近7天内的订单数是多少

解题思路：第一步，求出最近7天注册的新用户，第二步，求出最近7天下订单的用户和订单数，第三步，两表连接，第二列为null值为最近7天新注册没下订单的用户，否则为下订单的用户，求出第三列的个数即为获取最近7天注册的新用户在最近7天内的订单数

select
       sum(t2.order_count)
from
-- 获取最近7天注册的新用户
       (select
               uid
       from
            test.user_reg_table
       where date (reg_time) between current_date and date_sub(current_date,interval 7 day )) as t1
left join
-- 获取最近7天下订单的用户和订单数
       (select
               uid,
               count(order_id) as order_count
       from
            test.first_order_table
       where date (order_time) between current_date and date_sub(current_date,interval 7 day )
       group by
                uid) as t2
on t1.uid = t2.uid;

运行结果：

（27）获取借款到期名单

数据源：loan_table.csv

id,loan_time,expire_time,reback_time,amount,status
1,2019/12/1,2019/12/31,,2208,0
2,2019/12/1,2019/12/31,2019/12/31,5283,1
3,2019/12/5,2020/1/4,,5397,0
4,2019/12/5,2020/1/4,,4506,0
5,2019/12/10,2020/1/9,,3244,0
6,2019/12/10,2020/1/9,2020/1/12,4541,1
7,2020/1/1,2020/1/31,2020/1/10,3580,1
8,2020/1/1,2020/1/31,,7045,0
9,2020/1/5,2020/2/4,,2067,0
10,2020/1/5,2020/2/4,,7225,0

需求：这张表包含id、loan_time（借款时间）、expire_time（到期时间）、reback_time（还款时间）、amount（金额）和status（还款状态，1表示已还款、0表示未还款）六个字段，我们想获取每天到期的借款笔数、借款金额和平均借款天数

select
       count(id) as loan_count ,
       sum(amount) as loan_amount,
       avg(datediff(reback_time,loan_time)) as avg_day
from
     test.loan_table
where
      date (loan_time) = current_date;

运行结果：

（28）获取即将到期的借款信息

数据源：loan_table.csv

需求：我们想知道有多少笔借款会在未来7天内到期，以及其中有多少笔是已经还款的

解题思路：先获取今日到未来7天所有的还款日期，求出count(id)即为借款会在未来7天内到期的数量，在计算出目前状态为还款的数量。

select
       count(id) as loan_count,
       count(case when status = 1 then id end ) as reback_count
from
     test.loan_table
where
      expire_time between current_date and date_add(current_date,interval 7 day );

运行结果：

（29）获取历史逾期借款信息

数据源：loan_table.csv

需求：我们想知道历史逾期的借款笔数和金额，以及至今还逾期的借款笔数和金额

解题思路：这里面的关键信息在于，逾期怎么判断，对到期时间和还款时间进行比较，如果是逾期且现在已经还款的，可以直接比较到期时间和还款时间，如果还款时间大于到期时间，则说明是逾期的；还有一种是逾期且至今还未还款的，这种情况是没有还款时间的，也就是还款时间为空，但是到期时间是在今天之前，说明已经到期但是未还款。

select
       count(id) as loan_count,
       sum(amount) as loan_amount,
       count(case when status = 0 then id end ) as no_reback_count,
       sum(case when status = 0 then amount end ) as  no_reback_amount
from
     test.loan_table
where (current_date > expire_time and reback_time is null) or (reback_time > loan_table.expire_time);

运行结果：

（30）获取部门工资最高的员工

数据源：Employee.csv、Department.csv

Id,Name,Salary,DepartmentId
1,Joe,70000,1
2,Jim,90000,1
3,Henry,80000,2
4,Sam,60000,2
5,Max,90000,1

Id,Name
1,IT
2,Sales

需求1：如何求出Employee表薪水最高的员工（含重复值）

-- 问题一：如何求出并列第一第二的值
-- 解法一：运用子查询
select
       a.Id,
       Name,
       Salary
from
     test.Employee a
         inner join
         (select
                 Id
         from
              test.Employee
         where
               Salary = (select max(Salary) from test.Employee)) b on a.Id = b.Id;
-- 解法二：运用查rank（）窗口函数
select
       Name,
       Salary
from
     (select
             Name,
             Salary,rank() over (order by Salary desc ) as rank_num
     from test.Employee) as t
where t.rank_num = 1;

运行结果：

需求2：求出部门工资最高的员工（含重复值）

-- 问题二：部门工资最高的员工
select
       t2.Department,
       t2.Employee,
       t2.Salary
from
     (select
             Department,
             Employee,
             Salary,
             rank() over (partition by Department order by Salary desc ) as rank_num
     from
          (select
                  a.Name as Employee,
                  a.Salary as Salary ,
                  b.Name as Department
          from
               test.Employee as a left join test.Department as b on a.DepartmentId = b.Id) as t) as t2
where rank_num = 1;

运行结果：

（补）求连续登陆三天的用户

表结构：

解题思路：

第一步首先按照id，和date分组，

第二步，使用窗口函数，按照id分区，按日期排序，

第三步，使用date_sub函数让date向前移动cum天

第四步，按照id,result_date分组，求出result_date的个数，并取出result_date>3的个数即为连续登陆3天以上的用户

具体代码：

select
       id,
       result_date,
       count(result_date) cnt_3
from
     (select
             id,
             date,
             cum,
             date_sub(date,interval cum day ) as result_date
     from
          (select
                  id,
                  date(date) as date,
                  row_number() over (partition by id order by date(date)) as cum
          from
               test.demo01
          group by
                   id,
                   date(date)
          ) as date_table) as t1
group by
         id,
         result_date
having count(result_date) >= 3;

（补）某互联网公司面试题

问题：一张用户日活信息表，用户某一天的用户info为空，需要用该用户的最近一天的info去补，请问该如何实现

表结构：

dt         uid     info
20210811   001    1
20210811   002    0
20210811   003    1
20210811   004    1
20210812   001    null
20210812   002    0
20210812   003    null
20210813   002    1
20210813   003    0
20210813   004    1
20210814   001    0
20210814   003    1
20210814   004    null

比如说001用户12号的数据为null，就需要拿10号或者14号的数据去补，而不是拿15号

20210810   001    1      
20210812   001    null  
20210814   001    1  
20210815   001    1  
20210816   001    0  
20210817   001    null     
20210818   001    null  
20210819   001    null

思路：先做笛卡尔乘积，再取时间差的绝对值的最小值，再分组排序，取第一个

具体测试数据：

测试代码：

select
    dt1
    ,uid
    ,info as info_null
from (
                  select dt1
                       , uid
                       , info_null
                       , info
                       , min(dt_diff)                                                   as min_dt_diff
                       , row_number() over (partition by dt1,uid order by min(dt_diff)) as rank_num
                  from (
                           select a.dt                      as dt1
                                , b.dt                      as dt2
                                , a.uid                     as uid
                                , a.info                    as info_null
                                , b.info                    as info
                                , abs(datediff(b.dt, a.dt)) as dt_diff
                           from
                               -- 计算出为空的数据
                               (
                                   select dt
                                        , uid
                                        , info
                                        , row_number() over (partition by uid order by dt) as rank_num
                                   from user_login_info
                                   where info is null
                               ) a
                                   left join
                               -- 计算出不为空的数据
                                   (
                                       select dt
                                            , uid
                                            , info
                                            , row_number() over (partition by uid order by dt) as rank_num
                                       from user_login_info
                                       where info is not null
                                   ) as b
                               on a.uid = b.uid
                           group by a.dt
                                  , b.dt
                                  , a.uid
                                  , a.info
                                  , b.info
                       ) as t
                  group by dt1
                         , uid
                         , info_null
                         , info
              ) as abt
where rank_num = 1

运行结果：

但是我这里的写法可能不太好，因为用户日活数据表数据量可能比较大，在做笛卡尔积的时候会性能较低。

大数据开发面试必会的SQL 30题！！！（二）

（16）查询获奖员工信息

（17）计算用户留存情况

（18）筛选最受欢迎的课程

（19）筛选出每个年级最受欢迎的三门课程

（20）求累积和

（21）获取新增用户数

（22）获取用户首次购买时间

（23）同时获取用户和订单数据

（24）随机抽样

（25）获取沉默用户数

（26）获取新用户的订单数

（27）获取借款到期名单

（28）获取即将到期的借款信息

（29）获取历史逾期借款信息

（30）获取部门工资最高的员工

（补）求连续登陆三天的用户

（补）某互联网公司面试题

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

大数据开发面试必会的SQL 30题！！！（二）

（16）查询获奖员工信息

（17）计算用户留存情况

（18）筛选最受欢迎的课程

（19）筛选出每个年级最受欢迎的三门课程

（20）求累积和

（21）获取新增用户数

（22）获取用户首次购买时间

（23）同时获取用户和订单数据

（24）随机抽样

（25）获取沉默用户数

（26）获取新用户的订单数

（27）获取借款到期名单

（28）获取即将到期的借款信息

（29）获取历史逾期借款信息

（30）获取部门工资最高的员工

（补）求连续登陆三天的用户

（补）某互联网公司面试题

热门文章

最新文章

相关课程

相关电子书

相关实验场景