1 hive的介绍

Hive是基于Hadoop的一个数据仓库工具，可以将结构化的数据文件映射为一张数据库表，并提供类SQL查询功能。

本质是将SQL转换为MapReduce程序。

主要用途：用来做离线数据分析，比直接用MapReduce开发效率更高。

2 hive的架构

用户接口：包括 CLI JDBC/ODBC WebGUI。其中，CLI(command line interface)为shell命令行；JDBC/ODBC是Hive的JAVA实现，与传统数据库JDBC类似；WebGUI是通过浏览器访问Hive。

元数据存储：通常是存储在关系数据库如 mysql/derby中。Hive 将元数据存储在数据库中。Hive 中的元数据包括表的名字，表的列和分区及其属性，表的属性（是否为外部表等），表的数据所在目录等。

解释器编译器优化器执行器:完成 HQL 查询语句从词法分析语法分析编译优化以及查询计划的生成。生成的查询计划存储在 HDFS 中，并在随后有 MapReduce 调用执行。

3 Hive 数据模型

Hive中所有的数据都存储在HDFS中，没有专门的数据存储格式

在创建表时指定数据中的分隔符，Hive 就可以映射成功，解析数据。

Hive中包含以下数据模型：

db：在hdfs中表现为hive.metastore.warehouse.dir目录下一个文件夹

table：在hdfs中表现所属db目录下一个文件夹

external table：数据存放位置可以在HDFS任意指定路径

partition：在hdfs中表现为table目录下的子目录

bucket：在hdfs中表现为同一个表目录下根据hash散列之后的多个文件

4 常用操作

4.1 数据库相关

Hive配置单元包含一个名为 default 默认的数据库.

—创建数据库

create database [if not exists] <database name>；

–显示所有数据库

show databases;

–删除数据库

drop database if exists <database name> [restrict|cascade];

默认情况下，hive不允许删除含有表的数据库，要先将数据库中的表清空才能drop，否则会报错

–加入cascade关键字，可以强制删除一个数据库

hive> drop database if exists users cascade;

–切换数据库

use <database name>;

4.2 内部表外部表

建内部表
create table 
student(Sno int,Sname string,Sex string,Sage int,Sdept string) 
row format delimited fields terminated by ',';
建外部表
create external table 
student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) 
row format delimited fields terminated by ',' location '/stu';

内外部表加载数据：
load data local inpath '/root/hivedata/students.txt' overwrite into table student;
load data inpath '/stu' into table student_ext;

4.3 创建分区表

分区建表分为2种，一种是单分区，也就是说在表文件夹目录下只有一级文件夹目录。另外一种是多分区，表文件夹下出现多文件夹嵌套模式。

单分区建表语句

create table day_table (id int, content string) partitioned by (dt string);
单分区表，按天分区，在表结构中存在id，content，dt三列。

双分区建表语句

create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
双分区表，按天和小时分区，在表结构中新增加了dt和hour两列。

 导入数据
 load data local inpath '/root/hivedata/dat_table.txt' into table day_table partition(dt='2017-07-07');
 load data local inpath '/root/hivedata/dat_table.txt' into table day_hour_table partition(dt='2017-07-07', hour='08');
 基于分区的查询：
 SELECT day_table.* FROM day_table WHERE day_table.dt = '2017-07-07';
 查看分区
 show partitions day_hour_table;  
总的说来partition就是辅助查询，缩小查询范围，加快数据的检索速度和对数据按照一定的规格和条件进行管理。

指定分隔符

—指定分隔符创建分区表

create table day_table (id int, content string) partitioned by (dt string) row format delimited fields terminated by ',';

—复杂类型的数据表指定分隔符

数据如下

zhangsan  beijing,shanghai,tianjin,hangzhou
wangwu  shanghai,chengdu,wuhan,haerbin

建表语句

create table
complex_array(name string,work_locations array<string>) 
row format delimited fields terminated by '\t' 
collection items terminated by ',';

4.4 增删分区

增加分区

alter table t_partition add partition (dt='2008-08-08') location 'hdfs://node-21:9000/t_parti/';
执行添加分区  /t_parti文件夹下的数据不会被移动。并且没有分区目录dt=2008-08-08

删除分区

alter table t_partition drop partition (dt='2008-08-08');
执行删除分区时/t_parti下的数据会被删除并且连同/t_parti文件夹也会被删除
注意区别于load data时候添加分区:会移动数据 会创建分区目录

4.5 hive中的join

准备数据
1,a
2,b
3,c
4,d
7,y
8,u
2,bb
3,cc
7,yy
9,pp
建表：
create table a(id int,name string)
row format delimited fields terminated by ',';
create table b(id int,name string)
row format delimited fields terminated by ',';
导入数据：
load data local inpath '/root/hivedata/a.txt' into table a;
load data local inpath '/root/hivedata/b.txt' into table b;
实验：
** inner join
select * from a inner join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+--+
**left join   
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+--+
**right join
select * from a right join b on a.id=b.id;
select * from b right join a on b.id=a.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+
**
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+
**hive中的特别join
select * from a left semi join b on a.id = b.id;
select a.* from a inner join b on a.id=b.id;
+-------+---------
| a.id  | a.name  
+-------+---------
| 2     | b       
| 3     | c       
| 7     | y       
+-------+---------
相当于
select a.id,a.name from a where a.id in (select b.id from b); 在hive中效率极低
select a.id,a.name from a join b on (a.id = b.id);
select * from a inner join b on a.id=b.id;
cross join（##慎用）
返回两个表的笛卡尔积结果，不需要指定关联键。
select a.*,b.* from a cross join b;

4.6 json解析

1 先加载rating.json文件到hive的一个原始表 rat_json
样例：{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
create table rat_json(line string) row format delimited;
load data local inpath '/root/hivedata/rating.json' into table rat_json;
2 需要解析json数据成四个字段，插入一张新的表 t_rating
drop table if exists t_rating;
create table t_rating(movieid string,rate int,timestring string,uid string)
row format delimited fields terminated by '\t';
3 json表数据解析到rating表中
insert overwrite table t_rating
select 
get_json_object(line,'$.movie') as moive,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as timestring, get_json_object(line,'$.uid') as uid 
from rat_json limit 10;

5 常用函数

5.1 数值函数

指定精度取整函数 : round

语法: round(double a, int d)

返回值: DOUBLE

说明: 返回指定精度d的double类型

举例：

hive> select round(3.1415926,4) from dual;
3.1416

向下取整函数 : floor

语法: floor(double a)

返回值: BIGINT

说明: 返回等于或者小于该double变量的最大的整数

举例：

hive> select floor(3.1415926) from dual;
3
hive> select floor(25) from dual;
25

向上取整函数 : ceil

语法: ceil(double a)

返回值: BIGINT

说明: 返回等于或者大于该double变量的最小的整数

举例：

hive> select ceil(3.1415926) from dual;
4
hive> select ceil(46) from dual;
46

取随机数函数 : rand

语法: rand(),rand(int seed)

返回值: double

说明: 返回一个0到1范围内的随机数。如果指定种子seed，则会等到一个稳定的随机数序列

举例：

hive> select rand() from dual;
0.5577432776034763

绝对值函数 : abs

语法: abs(double a) abs(int a)

返回值: double int

说明: 返回数值a的绝对值

举例：

hive> select abs(-3.9) from dual;
3.9
hive> select abs(10.9) from dual;
10.9

5.2 日期函数

to_date(string timestamp):返回时间字符串中的日期部分,

如to_date(‘1970-01-01 00:00:00’)=‘1970-01-01’

current_date:返回当前日期

year(date)：返回日期date的年,类型为int

如year(‘2019-01-01’)=2019

month(date)：返回日期date的月,类型为int,

如month(‘2019-01-01’)=1

day(date): 返回日期date的天,类型为int,

如day(‘2019-01-01’)=1

weekofyear(date1)：返回日期date1位于该年第几周。

如weekofyear(‘2019-03-06’)=10

datediff(date1,date2):返回日期date1与date2相差的天数

如datediff(‘2019-03-06’,‘2019-03-05’)=1

date_add(date1,int1):返回日期date1加上int1的日期

如date_add(‘2019-03-06’,1)=‘2019-03-07’

date_sub(date1,int1):返回日期date1减去int1的日期

如date_sub(‘2019-03-06’,1)=‘2019-03-05’

months_between(date1,date2):返回date1与date2相差月份

如months_between(‘2019-03-06’,‘2019-01-01’)=2

add_months(date1,int1):返回date1加上int1个月的日期，int1可为负数

如add_months(‘2019-02-11’,-1)=‘2019-01-11’

last_day(date1):返回date1所在月份最后一天

如last_day(‘2019-02-01’)=‘2019-02-28’

next_day(date1,day1):返回日期date1的下个星期day1的日期。day1为星期X的英文前两字母

如next_day(‘2019-03-06’,‘MO’) 返回’2019-03-11’

**trunc(date1,string1)😗*返回日期最开始年份或月份。string1可为年(YYYY/YY/YEAR)或月(MONTH/MON/MM)。

如trunc(‘2019-03-06’,‘MM’)=‘2019-03-01’，trunc(‘2019-03-06’,‘YYYY’)=‘2019-01-01’

unix_timestamp():返回当前时间的unix时间戳，可指定日期格式。

如unix_timestamp(‘2019-03-06’,‘yyyy-mm-dd’)=1546704180

from_unixtime():返回unix时间戳的日期，可指定格式。

如select from_unixtime(unix_timestamp(‘2019-03-06’,‘yyyy-mm-dd’),‘yyyymmdd’)=‘20190306’

5.3 条件函数

if(boolean,t1,t2):若布尔值成立，则返回t1,反正返回t2。

如if(1>2,100,200)返回200

case when boolean then t1 else t2 end:若布尔值成立，则t1,否则t2,可加多重判断

coalesce(v0,v1,v2):返回参数中的第一个非空值,若所有值均为null,则返回null。

如coalesce(null,1,2)返回1

isnull(a):若a为null则返回true，否则返回false

5.4 字符串函数

length(string1):返回字符串长度

concat(string1,string2):返回拼接string1及string2后的字符串

concat_ws(sep,string1,string2):返回按指定分隔符拼接的字符串

lower(string1):返回小写字符串，同lcase(string1)。upper()/ucase()：返回大写字符串

trim(string1):去字符串左右空格，ltrim(string1):去字符串左空格。rtrim(string1):去字符串右空

repeat(string1，int1)：返回重复string1字符串int1次后的字符串

reverse(string1):返回string1反转后的字符串。

如reverse(‘abc’)返回’cba’

rpad(string1,len1,pad1):以pad1字符右填充string1字符串，至len1长度。

如rpad(‘abc’,5,‘1’)返回’abc11’。lpad()：左填充

split(string1,pat1):以pat1正则分隔字符串string1,返回数组。

如split(‘a,b,c’,’,’)返回[“a”,“b”,“c”]

substr(string1,index1,int1):以index位置起截取int1个字符。

如substr(‘abcde’,1,2)返回’ab’

5.5 类型转换

Hive的原子数据类型是可以进行隐式转换的，类似于Java的类型转换，例如某表达式使用INT类型，TINYINT会自动转换为INT类型，但是Hive不会进行反向转化，例如，某表达式使用TINYINT类型，INT不会自动转换为TINYINT类型，它会返回错误，除非使用CAST操作。

cast(value AS TYPE)

select cast(‘1’ as DOUBLE)；返回1.0

Hadoop Hive面试连环炮 1

1 hive的介绍

2 hive的架构

3 Hive 数据模型

4 常用操作

4.1 数据库相关

4.2 内部表外部表

4.3 创建分区表

4.4 增删分区

4.5 hive中的join

4.6 json解析

5 常用函数

5.1 数值函数

5.2 日期函数

5.3 条件函数

5.4 字符串函数

5.5 类型转换

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Hadoop Hive面试连环炮 1

1 hive的介绍

2 hive的架构

3 Hive 数据模型

4 常用操作

4.1 数据库相关

4.2 内部表外部表

4.3 创建分区表

4.4 增删分区

4.5 hive中的join

4.6 json解析

5 常用函数

5.1 数值函数

5.2 日期函数

5.3 条件函数

5.4 字符串函数

5.5 类型转换

热门文章

最新文章

相关课程

相关电子书

相关实验场景