大数据技术之Hive SQL题库-初级1

2023-06-12 240

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

云原生大数据计算服务 MaxCompute，5000CU*H 100GB 3个月

云原生大数据计算服务MaxCompute，500CU*H 100GB 3个月

简介： 大数据技术之Hive SQL题库-初级

第一章环境准备

1.1 建表语句

hive>

-- 创建学生表
DROP TABLE IF EXISTS student;
create table if not exists student_info(
    stu_id string COMMENT '学生id',
    stu_name string COMMENT '学生姓名',
    birthday string COMMENT '出生日期',
    sex string COMMENT '性别'
) 
row format delimited fields terminated by ',' 
stored as textfile;
-- 创建课程表
DROP TABLE IF EXISTS course;
create table if not exists course_info(
    course_id string COMMENT '课程id',
    course_name string COMMENT '课程名',
    tea_id string COMMENT '任课老师id'
) 
row format delimited fields terminated by ',' 
stored as textfile;
-- 创建老师表
DROP TABLE IF EXISTS teacher;
create table if not exists teacher_info(
    tea_id string COMMENT '老师id',
    tea_name string COMMENT '老师姓名'
) 
row format delimited fields terminated by ',' 
stored as textfile;
-- 创建分数表
DROP TABLE IF EXISTS score;
create table if not exists score_info(
    stu_id string COMMENT '学生id',
    course_id string COMMENT '课程id',
    score int COMMENT '成绩'
) 
row format delimited fields terminated by ',' 
stored as textfile;

1.2 数据准备

（1）创建/opt/module/data目录

[atguigu@hadoop102 module]$ mkdir data

（2）将如4个文件放到/opt/module/data目录下（数据内容如第三点所示）

（3）数据样式说明

[atguigu@hadoop102 data]$ vim student_info.txt

001,彭于晏,1995-05-16,男
002,胡歌,1994-03-20,男
003,周杰伦,1995-04-30,男
004,刘德华,1998-08-28,男
005,唐国强,1993-09-10,男
006,陈道明,1992-11-12,男
007,陈坤,1999-04-09,男
008,吴京,1994-02-06,男
009,郭德纲,1992-12-05,男
010,于谦,1998-08-23,男
011,潘长江,1995-05-27,男
012,杨紫,1996-12-21,女
013,蒋欣,1997-11-08,女
014,赵丽颖,1990-01-09,女
015,刘亦菲,1993-01-14,女
016,周冬雨,1990-06-18,女
017,范冰冰,1992-07-04,女
018,李冰冰,1993-09-24,女
019,邓紫棋,1994-08-31,女
020,宋丹丹,1991-03-01,女

[atguigu@hadoop102 data]$ vim course_info.txt

01,语文,1003
02,数学,1001
03,英语,1004
04,体育,1002
05,音乐,1002

[atguigu@hadoop102 data]$ vim teacher_info.txt

1001,张高数
1002,李体音
1003,王子文
1004,刘丽英

[atguigu@hadoop102 data]$ vim score_info.txt

1.3 插入数据

（1）插入数据

hive>

load data local inpath '/opt/module/data/student_info.txt' into table student_info;
load data local inpath '/opt/module/data/course_info.txt' into table course_info;
load data local inpath '/opt/module/data/teacher_info.txt' into table teacher_info;
load data local inpath '/opt/module/data/score_info.txt' into table score_info;

骚戴理解：（这里我不能通过下面的命令加载数据到hive数据库，我是直接把数据上传到hdfs对应的路径下）

hive>desc formatted teacher_info //查看这个表在hdfs中的存储路径

在把本地的teacher_info.txt上传到hdfs上面

[hive@node181 data]$ hdfs dfs -put teacher_info.txt hdfs://node181.hadoop.com:8020/warehouse/tablespace/managed/hive/teacher_info

（2）验证插入数据情况

hive>

select * from student_info limit 5;
select * from course_info limit 5;
select * from teacher_info limit 5;
select * from score_info limit 5;

第二章简单查询

2.1 查找特定条件

2.1.1 查询姓名中带“冰”的学生名单

hive>

select
    *
from student_info
where stu_name like "%冰%";

结果

stu_id stu_name birthday sex017 范冰冰 1992-07-04 女018 李冰冰 1993-09-24 女

2.1.2 查询姓“王”老师的个数

hive>

select 
    count(*)  wang_count
from teacher_info
where tea_name like '王%';

结果

wang_count1

2.1.3 检索课程编号为“04”且分数小于60的学生的课程信息，结果按分数降序排列

hive>

select
    stu_id,
    course_id,
    score
from score_info
where course_id ='04' and score<60
order by score desc;

结果

stu_id course_id score004 04 59001 04 54020 04 50014 04 40017 04 34010 04 34

2.1.4 查询数学成绩不及格的学生和其对应的成绩，按照学号升序排序

hive>

select
    s.stu_id,
    s.stu_name,
    t1.score
from student_info s
join (
    select
        *
    from score_info
    where course_id=(select course_id from course_info where course_name='数学') and score < 60
    ) t1 on s.stu_id = t1.stu_id
order by s.stu_id;
骚戴解法
select 
    si.stu_id ,si.stu_name ,si2.score 
from student_info si 
    inner join score_info si2 on si.stu_id =si2 .stu_id 
where si2.score <60 and si2.course_id ='02'
order by si.stu_id;

结果

s.stu_id s.stu_name t1.score

005 唐国强 44

007 陈坤 55

008 吴京 34

011 潘长江 49

013 蒋欣 35

014 赵丽颖 39

015 刘亦菲 48

017 范冰冰 34

018 李冰冰 58

019 邓紫棋 39

020 宋丹丹 59

第三章汇总分析

3.1 汇总分析

3.1.1 查询编号为“02”的课程的总成绩

hive>

select
    course_id,
    sum(score) score_sum
from score_info
where course_id='02'
group by course_id;

骚戴理解：这里我忘记分组了，group by course_id要写

结果

course_id score_sum02 1133

3.1.2 查询参加考试的学生个数

思路：对成绩表中的学号做去重并count

hive>

select
    count(distinct stu_id) stu_num
from score_info;

结果

stu_num19

3.2 分组

3.2.1 查询各科成绩最高和最低的分，以如下的形式显示：课程号，最高分，最低分

思路：按照学科分组并使用max和min。

hive>

select
    course_id,
    max(score) max_score,
    min(score) min_score
from score_info
group by course_id;

结果

course_id max_score min_score01 94 3802 93 3403 99 3204 100 3405 87 59

3.2.2 查询每门课程有多少学生参加了考试（有考试成绩）

hive>

select
    course_id,
    count(stu_id) stu_num
from score_info
group by course_id;

结果

course_id stu_num01 1902 1903 1904 1205 5

3.2.3 查询男生、女生人数

hive>

select
    sex,
    count(stu_id) count
from student_info
group by sex;

结果

sex count女 9男 11

3.3 分组结果的条件

3.3.1 查询平均成绩大于60分的学生的学号和平均成绩

1）思路分析

（1）平均成绩：展开来说就是计算每个学生的平均成绩

（2）这里涉及到“每个”就是要分组了

（3）平均成绩大于60分，就是对分组结果指定条件

（4）首先要分组求出每个学生的平均成绩，筛选高于60分的，并反查出这批学生，统计出这些学生总的平均成绩。

2）Hql实操

hive>

select
    stu_id,
    avg(score) score_avg
from score_info
group by stu_id
having score_avg > 60;

骚戴理解：having是对分组聚合后的结果进行判断比较，where是对每条数据进行判断比较

结果

stu_id score_avg

001 72.5

002 86.25

004 81.5

005 75.4

006 73.33333333333333

009 74.2

013 61.0

015 70.25

016 81.25

020 69.75

3.3.2 查询至少选修四门课程的学生学号

1）思路分析

（1）需要先计算出每个学生选修的课程数据，需要按学号分组

（2）至少选修两门课程：也就是每个学生选修课程数目>=4，对分组结果指定条件

2）Hql实操

hive> 
select
    stu_id,
    count(course_id) course_count
from score_info
group by stu_id
having course_count >=4;
骚戴解法：
select 
    stu_id 
from score_info
group by stu_id 
having count(course_id)>=4;

骚戴理解：这里可发现聚合函数可以直接在having后面使用！

结果

stu_id course_num

001 4

002 4

004 4

005 5

007 5

009 5

010 4

013 4

014 4

015 4

016 4

017 4

018 4

020 4

3.3.3 查询同姓（假设每个学生姓名的第一个字为姓）的学生名单并统计同姓人数大于等于2的姓

思路：先提取出每个学生的姓并分组，如果分组的count>=2则为同姓

hive>

select
    t1.first_name,
    count(*) count_first_name
from (
         select
             stu_id,
             stu_name,
             substr(stu_name,0,1) first_name
         from student_info
) t1
group by t1.first_name
having count_first_name >= 2;

骚戴理解：这里是通过子查询来实现的，多看看来理解子查询的妙处

结果

t1.first_name count_first_name

刘 2

周 2

陈 2

3.3.4 查询每门课程的平均成绩，结果按平均成绩升序排序，平均成绩相同时，按课程号降序排列

思路：按照课程号分组并求组内的平均值

hive>

select
    course_id,
    avg(score) score_avg
from score_info
group by course_id
order by score_avg asc, course_id desc;

结果

course_id score_avg

02 59.63157894736842

04 63.416666666666664

01 67.15789473684211

03 69.42105263157895

05 74.6

3.3.5 统计参加考试人数大于等于15的学科

按课程分组并统计组内人数，过滤条件大于等于15

hive>

select
    course_id,
    count(stu_id) stu_count
from score_info
group by course_id
having stu_count >= 15;

结果

course_id stu_count01 1902 1903 19

3.4 查询结果排序&分组指定条件

3.4.1 查询学生的总成绩并按照总成绩降序排序

思路：分组、sum、排序

hive>

select
    stu_id,
    sum(score) sum_score
from score_info
group by stu_id
order by sum_score desc;

结果

stu_id sum_score

005 377

009 371

002 345

004 326

016 325

007 299

001 290

015 281

020 279

013 244

010 233

018 232

006 220

014 192

017 181

012 180

011 180

019 178

008 129

*3.4.2 按照如下格式显示学生的语文、数学、英语三科成绩，没有成绩的输出为0，按照学生的有效平均成绩降序显示

学生id 语文数学英语有效课程数有效平均成绩

hive>

select
  si.stu_id,
  sum(if(ci.course_name='语文',score,0))  `语文`,
  sum(if(ci.course_name='数学',score,0))  `数学`,
  sum(if(ci.course_name='英语',score,0))  `英语`,
  count(*)  `有效课程数`,
  avg(si.score)  `平均成绩`
from
  score_info si
join
  course_info ci
on
  si.course_id=ci.course_id
group by
  si.stu_id
order by
  `平均成绩` desc

结果

学生id 语文数学英语有效课程数平均成绩

002 74 84 87 4 86.25

004 85 93 89 4 81.5

016 71 89 71 4 81.25

005 64 44 99 5 75.4

009 75 78 60 5 74.2

006 71 90 59 3 73.33333333333333

001 94 63 79 4 72.5

015 90 48 84 4 70.25

020 89 59 81 4 69.75

013 47 35 93 4 61.0

012 44 74 62 3 60.0

011 61 49 70 3 60.0

007 48 55 70 5 59.8

019 46 39 93 3 59.333333333333336

010 84 68 47 4 58.25

018 38 58 49 4 58.0

014 81 39 32 4 48.0

017 58 34 55 4 45.25

008 56 34 39 3 43.0

大数据技术之Hive SQL题库-初级1

第一章环境准备

1.1 建表语句

1.2 数据准备

1.3 插入数据

第二章简单查询

2.1 查找特定条件

2.1.1 查询姓名中带“冰”的学生名单

2.1.2 查询姓“王”老师的个数

2.1.3 检索课程编号为“04”且分数小于60的学生的课程信息，结果按分数降序排列

2.1.4 查询数学成绩不及格的学生和其对应的成绩，按照学号升序排序

第三章汇总分析

3.1 汇总分析

3.1.1 查询编号为“02”的课程的总成绩

3.1.2 查询参加考试的学生个数

3.2 分组

3.2.1 查询各科成绩最高和最低的分，以如下的形式显示：课程号，最高分，最低分

3.2.2 查询每门课程有多少学生参加了考试（有考试成绩）

3.2.3 查询男生、女生人数

3.3 分组结果的条件

3.3.1 查询平均成绩大于60分的学生的学号和平均成绩

3.3.2 查询至少选修四门课程的学生学号

3.3.3 查询同姓（假设每个学生姓名的第一个字为姓）的学生名单并统计同姓人数大于等于2的姓

3.3.4 查询每门课程的平均成绩，结果按平均成绩升序排序，平均成绩相同时，按课程号降序排列

3.3.5 统计参加考试人数大于等于15的学科

3.4 查询结果排序&分组指定条件

3.4.1 查询学生的总成绩并按照总成绩降序排序

*3.4.2 按照如下格式显示学生的语文、数学、英语三科成绩，没有成绩的输出为0，按照学生的有效平均成绩降序显示

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

大数据技术之Hive SQL题库-初级1

第一章环境准备

1.1 建表语句

1.2 数据准备

1.3 插入数据

第二章简单查询

2.1 查找特定条件

2.1.1 查询姓名中带“冰”的学生名单

2.1.2 查询姓“王”老师的个数

2.1.3 检索课程编号为“04”且分数小于60的学生的课程信息，结果按分数降序排列

2.1.4 查询数学成绩不及格的学生和其对应的成绩，按照学号升序排序

第三章汇总分析

3.1 汇总分析

3.1.1 查询编号为“02”的课程的总成绩

3.1.2 查询参加考试的学生个数

3.2 分组

3.2.1 查询各科成绩最高和最低的分，以如下的形式显示：课程号，最高分，最低分

3.2.2 查询每门课程有多少学生参加了考试（有考试成绩）

3.2.3 查询男生、女生人数

3.3 分组结果的条件

3.3.1 查询平均成绩大于60分的学生的学号和平均成绩

3.3.2 查询至少选修四门课程的学生学号

3.3.3 查询同姓（假设每个学生姓名的第一个字为姓）的学生名单并统计同姓人数大于等于2的姓

3.3.4 查询每门课程的平均成绩，结果按平均成绩升序排序，平均成绩相同时，按课程号降序排列

3.3.5 统计参加考试人数大于等于15的学科

3.4 查询结果排序&分组指定条件

3.4.1 查询学生的总成绩并按照总成绩降序排序

*3.4.2 按照如下格式显示学生的语文、数学、英语三科成绩，没有成绩的输出为0，按照学生的有效平均成绩降序显示

热门文章

最新文章

相关课程

相关电子书

相关实验场景