Hive SQL初级练习（30题）-阿里云开发者社区

前言

Hive 的重要性不必多说，离线批处理的王者，Hive 用来做数据分析，SQL 基础必须十分牢固。

环境准备

建表语句

这里建4张表，下面的练习题都用这些数据。

-- 创建学生表
create table if not exists student_info(
    stu_id string COMMENT '学生id',
    stu_name string COMMENT '学生姓名',
    birthday string COMMENT '出生日期',
    sex string COMMENT '性别'
)
row format delimited fields terminated by ','
stored as textfile;
-- 创建课程表
create table if not exists course_info(
    course_id string COMMENT '课程id',
    course_name string COMMENT '课程名',
    tea_id string COMMENT '任课老师id'
)
row format delimited fields terminated by ','
stored as textfile;
-- 创建老师表
create table if not exists teacher_info(
    tea_id string COMMENT '老师id',
    tea_name string COMMENT '学生姓名'
)
row format delimited fields terminated by ','
stored as textfile;
-- 创建分数表
create table if not exists score_info(
    stu_id string COMMENT '学生id',
    course_id string COMMENT '课程id',
    score int COMMENT '成绩'
)
row format delimited fields terminated by ','
stored as textfile;

数据

student_info.txt

001,彭于晏,1995-05-16,男
002,胡歌,1994-03-20,男
003,周杰伦,1995-04-30,男
004,刘德华,1998-08-28,男
005,唐国强,1993-09-10,男
006,陈道明,1992-11-12,男
007,陈坤,1999-04-09,男
008,吴京,1994-02-06,男
009,郭德纲,1992-12-05,男
010,于谦,1998-08-23,男
011,潘长江,1995-05-27,男
012,杨紫,1996-12-21,女
013,蒋欣,1997-11-08,女
014,赵丽颖,1990-01-09,女
015,刘亦菲,1993-01-14,女
016,周冬雨,1990-06-18,女
017,范冰冰,1992-07-04,女
018,李冰冰,1993-09-24,女
019,邓紫棋,1994-08-31,女
020,宋丹丹,1991-03-01,女

teacher_info.txt

1001,张高数
1002,李体音
1003,王子文
1004,刘丽英

course_info.txt

01,语文,1003
02,数学,1001
03,英语,1004
04,体育,1002
05,音乐,1002

score_info.txt

加载数据

加载数据到 Hive 的数据源目录

load data local inpath '/opt/module/hive-3.1.2/datas/student_info.txt' into table student_info;
load data local inpath '/opt/module/hive-3.1.2/datas/teacher_info.txt' into table teacher_info;
load data local inpath '/opt/module/hive-3.1.2/datas/course_info.txt' into table course_info;
load data local inpath '/opt/module/hive-3.1.2/datas/score_info.txt' into table score_info;

第一章简单查询

1.1、查找特定条件

重点就是一个 where ，可能涉及到一点多表联结。

1.1.1 查询姓名中带“冰”的学生名单

简单的可以用 like 配合 % 和 _ ，复杂的可以使用 Hive 扩展的 rlike 配合正则表达式。

-- 查询姓名中带“冰”的学生名单
select * from student_info where stu_name like '%冰%';

1.1.2 查询姓“王”老师的个数

-- 查询姓“王”老师的个数
select count(*) from teacher_info where tea_name like '王%';
-- 或者
select count(*) from teacher_info where tea_name rlike '^王';

1.1.3 检索课程编号为“04”且分数小于60的学生的课程信息，结果按分数降序排列

通过 course_id 联结两张表，找到不及格的成绩所对应的课程信息。

select c.* from course_info c
join score_info s on c.course_id = s.course_id
where c.course_id = 4 and s.score < 60
order by s.score desc;

1.1.4 查询数学成绩不及格的学生和其对应的成绩，按照学号升序排序

查询数学课对应的 course_id
通过该 course_id 在 score_info 表中查出不及格的成绩的学生信息
通过学生信息中的 stu_id 字段将 score_info 表和 student_info 表联结起来，输出需要的字段（学生信息，成绩）。

select stu.*,s.score from student_info stu
join (
    select * from score_info
    where course_id =
          (select course_id from course_info where course_name = '数学')
)s
    on s.stu_id = stu.stu_id
where s.score < 60
order by stu.stu_id;

第二章汇总分析

2.1 汇总分析

这里需要注意的是，聚合函数通常和 group by 配合使用！表示分组再做聚合处理。

2.1.1 查询编号为“02”的课程的总成绩

--查询编号为“02”的课程的总成绩
select course_id,sum(score) from score_info where course_id = 02
group by course_id;

2.1.2 查询参加考试的学生个数

select count(distinct stu_id) from score_info;

2.2 分组

重点就是一个 group by。

2.2.1 查询各科成绩最高和最低的分，显示格式：课程号，最高分，最低分

不同的科目对应不同的 course_id ，所以我们用 group by course_id。

-- 同样这里有聚合函数配合 group by 来使用
select course_id,max(score) max,min(score) min from score_info
group by course_id;

2.2.2 查询每门课程有多少学生参加了考试

select course_id, count(stu_id) from score_info
group by course_id;

2.2.3 查询男生、女生人数

-- 查询男生、女生人数
select sex,count(stu_id) from student_info
group by sex;

2.3 分组结果的条件

重点就是 group by 之后的条件判断语句用 having。

2.3.1 查询平均成绩大于60分的学生的学号和平均成绩

这里需要先分组后判断，所以不能用 where，因为 group by 后面的条件语句只能是 having。

-- 这里需要分组后再判断条件
select stu_id,avg(score) avg_score from score_info
group by stu_id
having avg_score>60;

2.3.2 查询至少选修四门课程的学生学号

-- 查询至少选修四门课程的学生学号
select stu_id,count(course_id) cnt from score_info
group by stu_id
having cnt>=4;

2.3.3 查询同姓的学生名单并统计同姓人数大于2的姓

这里用到一个没用过的函数 substr() ，需要记忆一下。

select t1.first_name,count(stu_id) cnt from (
     select *,substr(stu_name,0,1) first_name from student_info
   )t1
group by t1.first_name
having cnt>=2;

2.3.4 查询每门课程的平均成绩，结果按平均成绩升序排序，平均成绩相同时，按课程号降序排列

多重排序判断直接逗号隔开即可。

-- 查询每门课程的平均成绩，结果按平均成绩升序排序，平均成绩相同时，按课程号降序排列
select course_id,avg(score) avg_score from score_info
group by course_id
order by avg_score,course_id desc ;

2.3.5 统计参加考试人数大于等于15的学科

-- 统计参加考试人数大于等于15的学科
select course_id,count(*) cnt from score_info
group by course_id
having cnt>=15;

2.4 查询结果排序&分组指定条件

2.4.1 查询学生的总成绩并按照总成绩降序排序

-- 查询学生的总成绩并按照总成绩降序排序
select stu_id,sum(score) sum_score from score_info
group by stu_id
order by sum_score desc;

2.4.2 按照学生id 语文数学英语有效课程数平均成绩的格式输出成绩，没有成绩的记为 0

这里用到了 Hive 中的 if 语句，它的语法是：

IF(condition, true_value, false_value)

其中，condition是要评估的条件，true_value是当条件为真时要返回的值，false_value是当条件为假时要返回的值。

比如；

SELECT name, age, IF(age >= 18, 'Adult', 'Minor') AS age_group  
FROM users;

此外，还可以使用多重 if 嵌套语句：

SELECT name, age,   
  IF(age >= 18 AND gender = 'Male', 'Adult Male',   
     IF(age >= 18 AND gender = 'Female', 'Adult Female',   
        IF(age < 18 AND gender = 'Male', 'Minor Male', 'Minor Female'))) AS age_group  
FROM users;

本题：这里的反引号是引用的作用，这里代表的是列名。

select s.stu_id,
       sum(if(c.course_name='语文',score,0)) `语文`,
       sum(if(c.course_name='数学',score,0)) `数学`,
       sum(if(c.course_name='英语',score,0)) `英语`,
       count(*) `有效课程数`,
       avg(s.score) `平均成绩`
from score_info s
    join course_info c on s.course_id = c.course_id
group by s.stu_id
order by `平均成绩` desc ;

2.4.3 查询一共参加三门课程且其中一门为语文课程的学生的id和姓名

有点复杂，需要好好理解掌握。

-- 查询一共参加三门课程且其中一门为语文课程的学生的id和姓名
select t2.stu_id,s.stu_name from (
select t1.stu_id from (
    select stu_id,course_id from score_info where stu_id in (
        select stu_id from score_info
        where course_id = '01'
        ) -- 筛选出有语文成绩的学生的id
    ) t1 
        group by t1.stu_id
        having count(t1.course_id)=3
    ) t2
join student_info s on t2.stu_id = s.stu_id;

第三章复杂查询

3.1 子查询

3.1.1 查询所有课程成绩均小于60分的学生的学号、姓名

我们根据 stu_id 把每个学生的成绩信息聚合在一起。然后巧妙的使用了 if 语句来判断是否有不及格的科目，如果>=60分，结果+1，最后用 sum 函数统计出结果，如果 sum 等于0，则说明全部不及格。

-- 查询所有课程成绩均小于60分的学生的学号、姓名
select t1.stu_id,s.stu_name from(
    select stu_id,sum(if(score>=60,1,0)) flag from score_info
        group by stu_id
        having flag=0
    ) t1
join student_info s on s.stu_id = t1.stu_id;

3.1.2 查询没有学全所有课的学生的学号、姓名

这里需要注意：

在Hive SQL中，子查询的结果可能返回多行数据，因此需要使用IN关键字而不是=关键字。IN关键字用于匹配子查询结果中的任何一个值，而=关键字只能匹配单个值。
group by 不一定必须和聚合函数搭配使用，比如下面的查询 course_info 表的行数。

-- 查询没有学全所有课的学生的学号、姓名
select stu_id, stu_name
from student_info
where stu_id not in (
    select stu_id
    from score_info
    group by stu_id
    having count(distinct course_id) = (select count(distinct course_id) from course_info)
);

3.1.3 查询出只选修了三门课程的全部学生的学号和姓名

这里同样需要注意：当子查询的结果是多行值时，用 in 而不是 = ！

-- 查询出只选修了三门课程的全部学生的学号和姓名
select stu_id,stu_name from student_info where stu_id in (
    select stu_id from score_info
    group by stu_id
    having count(course_id)=3
    );

第四章多表查询

4.1 表联结

4.1.1 查询有两门以上的课程不及格的同学的姓名及其平均成绩

-- 查询有两门以上的课程不及格的同学的姓名及其平均成绩
select stu_name,avg_score from student_info st join (
    select stu_id,avg(score) avg_score from score_info
    group by stu_id
    having sum(if(score<60,1,0))>=2) t1
on st.stu_id=t1.stu_id;

4.1.2 查询所有学生的学号、姓名、选课数、总成绩

-- 查询所有学生的学号、姓名、选课数、总成绩
select t1.stu_id,s.stu_name,cnt,sum_score from (
    select stu_id,count(course_id) cnt,sum(score) sum_score from score_info
    group by stu_id)t1 join student_info s
on t1.stu_id=s.stu_id;

4.1.3 查询平均成绩大于85的所有学生的学号、姓名和平均成绩

-- 查询平均成绩大于85的所有学生的学号、姓名和平均成绩
select t1.stu_id,s.stu_name,avg from (
    select stu_id,avg(score) avg from score_info
    group by stu_id
    having avg>85)t1 join student_info s
on t1.stu_id=s.stu_id;

4.1.4 查询学生的选课情况：学号，姓名，课程号，课程名称

-- 查询学生的选课情况：学号，姓名，课程号，课程名称
select t1.stu_id,s.stu_name,t1.course_id,c.course_name from(
    select stu_id,course_id from score_info)t1
join student_info s
on t1.stu_id=s.stu_id
join course_info c
on c.course_id=t1.course_id;

输出结果明显按照科目分开，前几行都是选语文的学生信息。

或者

-- 查询学生的选课情况：学号，姓名，课程号，课程名称
select t1.stu_id,s.stu_name,t1.course_id,c.course_name from(
    select stu_id,course_id from score_info
    group by stu_id, course_id)t1
join student_info s
on t1.stu_id=s.stu_id
join course_info c
on c.course_id=t1.course_id;

这里的输出结果明显按照姓名分开，前几行都是同一个学生的选课信息（这里的 group by要么指定两个字段（即我们要查询的 stu_id 和 course_id），要么就不需要 group by）。

4.1.5 查询出每门课程的及格人数和不及格人数

-- 查询出每门课程的及格人数和不及格人数
select c.course_name,`及格人数`,`不及格人数` from (
    select course_id,sum(if(score>=60,1,0)) `及格人数`,sum(if(score<60,1,0)) `不及格人数` from score_info
    group by course_id)t1
join course_info c
on t1.course_id=c.course_id;

4.1.6 查询课程编号为03且课程成绩在80分以上的学生的学号和姓名及课程信息

这里需要注意如果查询结果中没有用到聚合函数就少用 group by，因为group by会触发生成 mapreduce 程序；能用 where 最好，因为 where 不会触发产生 mapreduce 程序；where 可以秒出结果，而 group by需要好多秒。

-- 查询课程编号为03且课程成绩在80分以上的学生的学号和姓名及课程信息
select t1.stu_id,s.stu_name,c.course_name from (
    select stu_id,course_id from score_info
    where course_id=03 and score>80)t1
join course_info c
on t1.course_id=c.course_id
join student_info s
on t1.stu_id=s.stu_id;

4.2 多表连接

4.2.1 课程编号为"01"且课程分数小于60，按分数降序排列的学生信息

-- 课程编号为"01"且课程分数小于60，按分数降序排列的学生信息
select s.*,t1.score from student_info s join (
    select stu_id,score from score_info
    where course_id=01 and score<60)t1
on s.stu_id=t1.stu_id
order by t1.score desc;

4.2.2 查询所有课程成绩在70分以上的学生的姓名、课程名称和分数，按分数升序排列

-- 查询所有课程成绩在70分以上的学生的姓名、课程名称和分数，按分数升序排列
select s.stu_id,s.stu_name,c.course_name,s2.score from student_info s
join (
    select stu_id,sum(if(score>=70,0,1)) flag from score_info
    group by stu_id
    having flag=0) t1
on s.stu_id=t1.stu_id
left join score_info s2
on s.stu_id=s2.stu_id
left join course_info c
on s2.course_id=c.course_id;

4.2.3 查询该学生不同课程的成绩相同的学生编号、课程编号、学生成绩

-- 查询该学生不同课程的成绩相同的学生编号、课程编号、学生成绩
select sc1.stu_id,sc1.course_id,sc1.score from score_info sc1
join score_info sc2
on sc1.stu_id=sc2.stu_id
and sc1.course_id <> sc2.course_id
and sc1.score=sc2.score;

4.2.4 查询课程编号为“01”的课程比“02”的课程成绩高的所有学生的学号

-- 查询课程编号为“01”的课程比“02”的课程成绩高的所有学生的学号
select s1.stu_id from (
    select stu_id,course_id,score from score_info sc1
    where sc1.course_id=01) s1
join (
    select sc2.stu_id,sc2.course_id,sc2.score from score_info sc2
    where sc2.course_id=02) s2
on s1.stu_id=s2.stu_id
where s1.score>s2.score;

4.2.5 查询学过编号为“01”的课程并且也学过编号为“02”的课程的学生的学号、姓名

-- 查询学过编号为“01”的课程并且也学过编号为“02”的课程的学生的学号、姓名
select s.stu_id,s.stu_name from(
    select stu_id,sum(if(course_id=01,1,0)+if(course_id=02,1,0)) sumOfCourse from score_info
    group by stu_id)t1
join student_info s
on t1.stu_id=s.stu_id
where sumOfCourse=2;

4.2.6 查询学过“李体音”老师所教的所有课的同学的学号、姓名

-- 查询学过“李体音”老师所教的所有课的同学的学号、姓名
select t1.stu_id,si.stu_name
from(
    select stu_id from score_info si
    where course_id in
    (
        select course_id from course_info c
        join teacher_info t
        on c.tea_id = t.tea_id
        where tea_name='李体音'      --李体音教的所有课程
    )
    group by stu_id
    having count(*)=2       --学习所有课程的学生
)t1
left join student_info si
on t1.stu_id=si.stu_id;

4.2.7 查询学过“李体音”老师所讲授的任意一门课程的学生的学号、姓名

select
    t1.stu_id,
    si.stu_name
from
(
    select
        stu_id
    from score_info si
    where course_id in
    (
        select
           course_id
        from course_info c
        join teacher_info t
        on c.tea_id = t.tea_id
        where tea_name='李体音'
    )
    group by stu_id
)t1
left join student_info si
on t1.stu_id=si.stu_id;

4.2.8 查询没学过"李体音"老师讲授的任一门课程的学生姓名

select
    stu_id,
    stu_name
from student_info
where stu_id not in
(
    select
        stu_id
    from score_info si
    where course_id in
    (
        select
           course_id
        from course_info c
        join teacher_info t
        on c.tea_id = t.tea_id
        where tea_name='李体音'
    )
    group by stu_id
);

4.2.9 查询至少有一门课与学号为“001”的学生所学课程相同的学生的学号和姓名

select
    si.stu_id,
    si.stu_name
from score_info sc
join student_info si
on sc.stu_id = si.stu_id
where sc.course_id in
(
    select
        course_id
    from score_info
    where stu_id='001'    --001的课程
) and sc.stu_id <> '001'  --排除001学生
group by si.stu_id,si.stu_name;

4.2.10 按平均成绩从高到低显示所有学生的所有课程的成绩以及平均成绩

-- 按平均成绩从高到低显示所有学生的所有课程的成绩以及平均成绩
select stu_name,course_name,sc.score,t1.avg_score from score_info sc
join student_info si
on sc.stu_id=si.stu_id
join course_info ci
on sc.course_id=ci.course_id
join(
select stu_id,avg(score) avg_score from score_info
group by stu_id,course_id)t1
on sc.stu_id=t1.stu_id
order by t1.avg_score desc;

总结

本想着跳过这些题目的，但是最后还是刷了一遍，期间确实发现了很多基础的不足，总之这次刷题收获满满。

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Hive SQL初级练习（30题）

前言