The Most Complete Hive SQL Development Guide: A 40,000+ Character Walkthrough (Part 1) - Alibaba Cloud Developer Community

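This article has two parts. Part 1 is a quick reference: if you already understand a statement and know how to use it, look it up there. If a statement's usage is unclear, see Part 2, which explains it in detail and also covers more statements.

Part 1: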

Fuzzy-search tables in Hive: show tables like '*name*';
Show a table's structure: desc table_name;
Show partition information: show partitions table_name;
Load a local file: load data local inpath '/xxx/test.txt' overwrite into table dm.table_name;
Insert data into a table from a query: insert overwrite table table_name partition(dt) select * from table_name;
Export data to the local file system: insert overwrite local directory '/tmp/text' select a.* from table_name a order by 1;
Some properties specified when creating a table:

Field delimiter: row format delimited fields terminated by '\t'
Line delimiter: row format delimited lines terminated by '\n'
Store as a plain-text file: stored as textfile
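
The three properties above can be combined in a single statement. The following is a minimal sketch, assuming a hypothetical table demo_text that does not appear in the original article. Note that row format delimited is written once and both terminated by clauses follow it.

```sql
-- demo_text and its columns are hypothetical names used only for illustration
create table if not exists demo_text (
  id   int,
  name string
)
row format delimited
  fields terminated by '\t'   -- field delimiter
  lines terminated by '\n'    -- line delimiter
stored as textfile;           -- plain-text storage
```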

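Command-line usage: hive -e 'select table_column from table' runs a query, shows the MapReduce progress in the terminal, prints the query result to the terminal when it finishes, and then the hive process exits without entering interactive mode.

hive -S -e 'select table_column from table' adds -S (silent mode): the terminal shows no MapReduce progress, and only the query result is printed when the query finishes.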

Rename a table: alter table old_table_name rename to new_table_name;
Copy a table's structure: create table new_table_name like table_name;
Add a column: alter table table_name add columns(columns_values bigint comment 'comm_text');
Change a column: alter table table_name change old_column new_column string comment 'comm_text';
Drop a partition: alter table table_name drop partition(dt='2021-11-30');
Add a partition: alter table table_name add partition (dt='2021-11-30');
Drop an empty database: drop database myhive2;
Force-drop a database: drop database myhive2 cascade;
Drop a table: drop table score5;
Truncate a table: truncate table score6;

Loading data into Hive tables

Insert data directly into a partitioned table: insert into table score partition(month ='202107') values ('001','002','100');

Load data with load: load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Load data from a query: insert overwrite table score2 partition(month = '202106') select s_id,c_id,s_score from score1;

Create a table and load data in one query statement: create table score2 as select * from score1;

Specify the data path with location when creating the table: create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';

Export and import Hive table data with export and import (managed tables):

create table techer2 like techer; -- create the table from an existing table's structure

export table techer to '/export/techer';

import table techer2 from '/export/techer';

Exporting data from Hive tables

Export with insert

Export query results to the local file system: insert overwrite local directory '/export/servers/exporthive' select * from score;

Export query results to the local file system with formatting: insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

Export query results to HDFS (no local keyword): insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;

Export to the local file system with a Hadoop command: dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;

Export with the hive shell

Basic syntax (hive -f/-e statement-or-script > file): hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt

hive -f export.sh > /export/servers/exporthive/score.txt

Export to HDFS with export: export table score to '/export/exporthive/score';

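Hive query statements

GROUP BY grouping: select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85; use having to filter the grouped data
join: inner join; left join (left outer join); right join (right outer join); full join (full outer join).
order by sorting: ASC (ascending, the default), DESC (descending)
sort by local sorting: rows are sorted within each reducer, so the overall result set is not globally sorted.
distribute by partitioning: distributes rows among reducers, similar to partition in MR; use it together with sort by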
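
As a sketch of how distribute by and sort by work together, the queries below use the score table from this guide (columns s_id and s_score). The reducer count of 3 is an arbitrary choice for illustration.

```sql
-- Use several reducers so the difference from a global sort is visible
set mapreduce.job.reduces=3;

-- Rows with the same s_id go to the same reducer;
-- within each reducer, rows are sorted by s_score descending
select s_id, s_score
from score
distribute by s_id
sort by s_score desc;

-- When distribute by and sort by use the same column in ascending order,
-- cluster by is the shorthand
select s_id, s_score
from score
cluster by s_id;
```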

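Hive functions

1. Aggregate functions

Number of values in a column: count()
Sum of a column: sum()
Maximum of a column: max()
Minimum of a column: min()
Average of a column: avg()
Population variance of the non-null values: var_pop(col)
Sample variance of the non-null values: var_samp(col)
Population standard deviation: stddev_pop(col)
Percentile: percentile(BIGINT col, p)
Median: percentile(BIGINT col, 0.5)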
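
A short sketch of several of these aggregates in one query, assuming the score table from this guide with an integer s_score column. The column aliases are illustrative names only.

```sql
select
  count(*)                                  as row_cnt,       -- all rows
  count(s_score)                            as non_null_cnt,  -- rows where s_score is not null
  avg(s_score)                              as avg_score,
  var_pop(s_score)                          as pop_variance,
  stddev_pop(s_score)                       as pop_stddev,
  -- percentile expects an integer column, hence the explicit cast
  percentile(cast(s_score as bigint), 0.5)  as median_score
from score;
```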

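2. Relational operators

A LIKE B: SQL pattern match. TRUE if string A matches the simple pattern B, in which _ matches any single character and % matches any number of characters (B is not a regular expression)
A RLIKE B: Java regular expression match. TRUE if any substring of A matches the Java regular expression B
A REGEXP B: same as RLIKE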
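
The difference between the wildcard pattern of LIKE and the regular expression of RLIKE is easiest to see on literals. A small sketch; the strings are arbitrary examples:

```sql
select 'foobar' like 'foo%';     -- true: % matches any number of characters
select 'foobar' like 'f_obar';   -- true: _ matches exactly one character
select 'foobar' like 'foo';      -- false: LIKE must match the whole string
select 'foobar' rlike 'foo';     -- true: a substring match is enough
select 'foobar' rlike '^f.*r$';  -- true: anchored regular expression
select 'foobar' regexp '^bar';   -- false: the string does not start with bar
```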

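3. Arithmetic operators

Supported for all numeric types: add (+), subtract (-), multiply (*), divide (/), modulo (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise NOT (~)

4. Logical operators

Supported: logical AND (and), logical OR (or), logical NOT (not)

5. Numeric functions

Round: round(double a)
Round to a given precision: round(double a, int d)
Round down: floor(double a)
Round up: ceil(double a)
Random number: rand(), rand(int seed)
Natural exponential: exp(double a)
Base-10 logarithm: log10(double a)
Base-2 logarithm: log2()
Logarithm: log()
Power: pow(double a, double p)
Square root: sqrt(double a)
Binary representation: bin(BIGINT a)
Hexadecimal representation: hex()
Absolute value: abs()
Positive modulo: pmod()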
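
A few of these functions on literal values, as a quick sketch. The comments give the numeric value; the exact printed form (for example 4 versus 4.0) depends on the result type.

```sql
select round(2.5);          -- 3: halves round away from zero
select round(3.14159, 2);   -- 3.14
select floor(-1.5);         -- -2
select ceil(-1.5);          -- -1
select -7 % 3;              -- -1: the sign follows the dividend
select pmod(-7, 3);         -- 2: not negative when the divisor is positive
select pow(2, 10);          -- 1024 (returned as a double)
select sqrt(16);            -- 4 (returned as a double)
select bin(7);              -- 111
select hex(255);            -- FF
select abs(-3);             -- 3
```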

6. Conditional functions

if
case when
coalesce(c1,c2,c3)
nvl(c1, c2)
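
The four conditional constructs can be tried on literals. A minimal sketch with arbitrary example values:

```sql
select if(1 = 2, 'yes', 'no');                            -- no
select case when 90 >= 85 then 'pass' else 'fail' end;    -- pass
select coalesce(null, null, 'c');                         -- c: first non-null argument
select nvl(null, 'fallback');                             -- fallback: used when the first argument is null
```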

7. Date functions

Current UNIX timestamp: unix_timestamp()
Timestamp to date string: from_unixtime()
Date string to timestamp: unix_timestamp(string date)
Datetime to date: to_date(string timestamp)
Year of a date: year(string date)
Month of a date: month(string date)
Day of a date: day(string date)
Hour of a date: hour(string date)
Minute of a date: minute(string date)
Second of a date: second(string date)
Week of the year: weekofyear(string date)
Difference between two dates in days: datediff(string enddate, string startdate)
Add days to a date: date_add(string startdate, int days)
Subtract days from a date: date_sub(string startdate, int days)
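
A sketch of the date functions on literal strings. The dates are arbitrary examples chosen for illustration.

```sql
select to_date('2021-11-30 10:03:01');            -- 2021-11-30
select year('2021-11-30 10:03:01');               -- 2021
select month('2021-11-30 10:03:01');              -- 11
select day('2021-11-30 10:03:01');                -- 30
select hour('2021-11-30 10:03:01');               -- 10
select minute('2021-11-30 10:03:01');             -- 3
select second('2021-11-30 10:03:01');             -- 1
select datediff('2021-11-30', '2021-11-01');      -- 29
select date_add('2021-11-30', 1);                 -- 2021-12-01
select date_sub('2021-11-30', 30);                -- 2021-10-31

-- The numeric result depends on the session time zone, so no output is shown
select unix_timestamp('2021-11-30 10:03:01');

-- Convert a timestamp back to a string with an explicit format pattern
select from_unixtime(unix_timestamp('2021-11-30 10:03:01'), 'yyyy-MM-dd');
```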

8. String functions

String length: length(string A)
Reverse a string: reverse(string A)
Concatenate strings: concat(string A, string B…)
Concatenate with a separator: concat_ws(string SEP, string A, string B…)
Substring: substr(string A, int start, int len)
To upper case: upper(string A)
To lower case: lower(string A)
Trim spaces on both sides: trim(string A)
Trim spaces on the left: ltrim(string A)
Trim spaces on the right: rtrim(string A)
Regular expression replace: regexp_replace(string A, string B, string C)
Regular expression extract: regexp_extract(string subject, string pattern, int index)
URL parsing: parse_url(string urlString, string partToExtract [, string keyToExtract]), returns string
JSON parsing: get_json_object(string json_string, string path)
String of spaces: space(int n)
Repeat a string: repeat(string str, int n)
ASCII code of the first character: ascii(string str)
Left pad: lpad(string str, int len, string pad)
Right pad: rpad(string str, int len, string pad)
Split a string: split(string str, string pat)
Find in a comma-separated list: find_in_set(string str, string strList)
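
A sketch of some of the string functions on literals. The input strings are arbitrary examples.

```sql
select length('hive');                                     -- 4
select reverse('hive');                                    -- evih
select concat('a', 'b', 'c');                              -- abc
select concat_ws('-', 'a', 'b', 'c');                      -- a-b-c
select substr('abcde', 2, 3);                              -- bcd: positions start at 1
select trim('  hi  ');                                     -- hi
select regexp_replace('foobar', 'oo|ar', '');              -- fb
select regexp_extract('foothebar', 'foo(.*?)(bar)', 1);    -- the: first capture group
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST');         -- facebook.com
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');  -- v1
select get_json_object('{"name":"zhangsan","age":20}', '$.name');                     -- zhangsan
select repeat('ab', 3);                                    -- ababab
select ascii('a');                                         -- 97
select lpad('5', 3, '0');                                  -- 005
select rpad('5', 3, '0');                                  -- 500
select split('a,b,c', ',');                                -- ["a","b","c"]
select find_in_set('b', 'a,b,c');                          -- 2: position in the list, starting at 1
```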

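9. Window functions

Sum within a group: sum(pv) over(partition by cookieid order by createtime). Beware: adding or omitting order by changes the result significantly; see Part 2 below for details.
Rank within a group, numbering sequentially from 1: ROW_NUMBER(), e.g. 1234567
Rank within a group, where ties leave gaps in the ranking: RANK(), e.g. 1233567
Rank within a group, where ties leave no gaps in the ranking: DENSE_RANK(), e.g. 1233456
Distribute an ordered data set evenly into the given number (num) of buckets: NTILE()
Value from the n-th row before the current row in the window: LAG(col,n,DEFAULT)
Value from the n-th row after the current row in the window: LEAD(col,n,DEFAULT)
First value in the sorted group, up to the current row: FIRST_VALUE(col)
Last value in the sorted group, up to the current row: LAST_VALUE(col)
Number of rows with a value less than or equal to the current value, divided by the total rows in the group: CUME_DIST()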
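
The order by pitfall and the three ranking functions can be seen on a tiny data set. This is a sketch under assumptions: pv_log is a hypothetical temporary table whose column names follow the expression above, and the four rows are made up for illustration.

```sql
-- pv_log is a hypothetical table; the rows are illustrative only
create temporary table pv_log (cookieid string, createtime string, pv int);
insert into pv_log values
  ('c1', '2021-11-01', 1),
  ('c1', '2021-11-02', 5),
  ('c1', '2021-11-03', 5),
  ('c1', '2021-11-04', 7);

select
  cookieid,
  createtime,
  pv,
  -- with order by: running total up to the current row
  -- (in createtime order: 1, 6, 11, 18)
  sum(pv) over (partition by cookieid order by createtime)   as running_pv,
  -- without order by: total of the whole partition on every row (18 each)
  sum(pv) over (partition by cookieid)                       as total_pv,
  -- the values below are listed in pv-descending order: 7, 5, 5, 1
  row_number() over (partition by cookieid order by pv desc) as rn,   -- 1, 2, 3, 4
  rank()       over (partition by cookieid order by pv desc) as rk,   -- 1, 2, 2, 4
  dense_rank() over (partition by cookieid order by pv desc) as drk   -- 1, 2, 2, 3
from pv_log;
```

With order by, the default window frame ends at the current row, which is why the sum becomes a running total.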

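The following features are only summarized here; read Part 2 for a detailed explanation.

Write the logic of several group by clauses in one SQL statement: GROUPING SETS
Aggregate over every combination of the GROUP BY dimensions: CUBE
A subset of CUBE that aggregates hierarchically, starting from the leftmost dimension: ROLLUP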
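
To make the three clauses concrete, here is a sketch that assumes a hypothetical table visits(month, day, cookieid); neither the table nor the alias uv comes from the original text.

```sql
-- visits is a hypothetical table used only for illustration
create temporary table visits (month string, day string, cookieid string);

-- One statement with grouping sets ...
select month, day, count(distinct cookieid) as uv
from visits
group by month, day
grouping sets (month, day);

-- ... returns the same rows as two group by queries combined with union all
select month, cast(null as string) as day, count(distinct cookieid) as uv
from visits
group by month
union all
select cast(null as string) as month, day, count(distinct cookieid) as uv
from visits
group by day;

-- cube: every combination, i.e. grouping sets ((month, day), month, day, ())
select month, day, count(distinct cookieid) as uv
from visits
group by month, day with cube;

-- rollup: hierarchy from the left, i.e. grouping sets ((month, day), month, ())
select month, day, count(distinct cookieid) as uv
from visits
group by month, day with rollup;
```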

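Part 2

1. Database operations

Create a database:

create database if not exists myhive;
Note: the default storage location for Hive tables is set by the hive.metastore.warehouse.dir property in hive-site.xml
Create a database and specify its HDFS storage location:
create database myhive2 location '/myhive2';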

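Alter a database:

alter  database  myhive2  set  dbproperties('createtime'='20210329');

Note: the alter database command can modify some properties of a database. The database name cannot be changed, and in older Hive versions neither can its location (newer releases add alter database ... set location, which only changes where tables created afterwards are stored and does not move existing data).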

View database details

Show basic database information
hive (myhive)> desc  database  myhive2;
Show more detailed database information
hive (myhive)> desc database extended  myhive2;

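Drop a database

Drop an empty database; if the database still contains tables, this raises an error
drop  database  myhive2;
Force-drop a database together with the tables it contains
drop  database  myhive  cascade;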

2. Table operations

Managed (internal) tables:

Create a managed table:

hive (myhive)> use myhive; -- switch to the myhive database
hive (myhive)> create table stu(id int,name string);
hive (myhive)> insert into stu values (1,"zhangsan");
hive (myhive)> insert into stu values (1,"zhangsan"),(2,"lisi");  -- insert several rows at once
hive (myhive)> select * from stu;

Column types available when creating a Hive table:

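Category	Type	Description	Literal example
Primitive	BOOLEAN	true/false	TRUE
	TINYINT	1-byte signed integer, -128 to 127	1Y
	SMALLINT	2-byte signed integer, -32768 to 32767	1S
	INT	4-byte signed integer	1
	BIGINT	8-byte signed integer	1L
	FLOAT	4-byte single-precision floating point	1.0
	DOUBLE	8-byte double-precision floating point	1.0
	DECIMAL	arbitrary-precision signed decimal	1.0
	STRING	variable-length string	"a",'b'
	VARCHAR	variable-length string with a declared maximum length	"a",'b'
	CHAR	fixed-length string	"a",'b'
	BINARY	byte array	cannot be written as a literal
	TIMESTAMP	timestamp (Hive stores up to nanosecond precision)	122327493795
	DATE	date	'2016-03-29'
	INTERVAL	time interval	
Complex	ARRAY	ordered collection of elements of the same type	array(1,2)
	MAP	key-value pairs; keys must be a primitive type, values can be any type	map('a',1,'b',2)
	STRUCT	collection of fields whose types may differ	struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
	UNION	a value that belongs to one of a fixed set of types	create_union(1,'a',63)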

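A brief note on the decimal type:

Usage: decimal(11,2) means at most 11 digits in total, of which the last 2 are fractional, leaving 9 digits for the integer part. If the integer part exceeds 9 digits, the value becomes null. If the fractional part has fewer than 2 digits, it is padded with zeros to 2 digits; if it has more than 2 digits, the excess is rounded half up.

You can also write just decimal without a precision; the default is decimal(10,0), that is, 10 integer digits and no fractional part.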
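
The rules above can be checked with a few casts. A minimal sketch, assuming string inputs are cast to decimal; the values are arbitrary examples, and the expected results follow the rules as described in this section.

```sql
-- Fits: 9 integer digits; the third fractional digit is rounded half up
select cast('123456789.125' as decimal(11,2));   -- 123456789.13

-- Does not fit: 10 integer digits exceed the 9 allowed, so the result is NULL
select cast('1234567890.1' as decimal(11,2));    -- NULL

-- Bare decimal means decimal(10,0): the fractional part is rounded away
select cast('3.7' as decimal);                   -- 4
```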

Create a table and specify the field delimiter

create  table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

row format delimited fields terminated by '\t' sets the field delimiter; the default delimiter is '\001'

stored as sets the storage format

location sets the storage location

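Create a table from query results

create table stu3 as select * from stu2;

Create a table from an existing table's structure

create table stu4 like stu2;

Show a table's structure

Show only the columns and their attributes
desc stu2;
Show detailed information
desc formatted  stu2;

Show the statement that created a table

show create table stu2;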

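External tables

An external table loads data from a separately specified HDFS path, so Hive does not consider itself the sole owner of that data. Dropping the table therefore removes only the table metadata; the data stays in HDFS.

Create an external table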

create external table student (s_id string,s_name string) row format delimited fields terminated by '\t';

Load data into a table from the local file system

Append
load data local inpath '/export/servers/hivedatas/student.csv' into table student;
Overwrite
load data local inpath '/export/servers/hivedatas/student.csv' overwrite  into table student;

Load data into a table from HDFS

load data inpath '/hivedatas/techer.csv' into table techer;
Load data into a specific partition
load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20201210);

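Notes:

1. load data local loads from the local file system; the file is copied to HDFS
2. load data loads from HDFS; the file is moved directly into the Hive table's directory rather than copied, because Hive assumes the HDFS file already has 3 replicas and does not need to be copied again
3. If the table is partitioned, load without a partition spec raises an error (Hive 3.0 and later instead rewrite such a load as an insert ... select)
4. If a file with the same name is loaded again, it is renamed automatically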

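Partitioned tables

Syntax for creating a partitioned table

create table score(s_id string, c_id string, s_score int) partitioned by (month string);

Create a table with multiple partition columns

create table score2 (s_id string, c_id string, s_score int) partitioned by (year string,month string,day string);

Notes:

When creating a Hive table you can use location to point it at a file or a directory. When it points at a directory, Hive loads every file in that directory. If the table has no partitions, the directory must not contain subdirectories, otherwise an error is raised.

If the table is partitioned, for example partitioned by (day string), each subdirectory of that directory is one partition and must be named in the form day=20201123. Then run msck repair table score; to repair the table's partition metadata. Once it succeeds, all the data is visible in the table.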
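
The directory-based flow described above can be sketched as follows. The table name score_by_day, the path /myscore_part and the directory layout are hypothetical and used only for illustration.

```sql
-- Assumed layout already present in HDFS (hypothetical paths):
--   /myscore_part/day=20201123/<data files>
--   /myscore_part/day=20201124/<data files>
create external table score_by_day (s_id string, c_id string, s_score int)
partitioned by (day string)
row format delimited fields terminated by ','
location '/myscore_part';

-- Register the day=... directories as partitions in the metastore
msck repair table score_by_day;

-- The partitions found on disk should now be listed
show partitions score_by_day;
```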

Load data into a table with one partition column

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Load data into a table with multiple partition columns

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

Show partitions

show  partitions  score;

Add a partition

alter table score add partition(month='201805');

Add several partitions at once

alter table score add partition(month='201804') partition(month = '201803');

Note: after a partition is added, a new directory appears under the table's directory in HDFS

Drop a partition

alter table score drop partition(month = '201806');

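Bucketed tables

Bucketing splits data into multiple buckets by a chosen column: rows are hash-partitioned on the bucketing column into multiple files

Partitioning splits data into directories; bucketing splits data into files

Advantages of bucketing:

1. Faster join queries

2. More efficient sampling

Enable bucketing in Hive (needed on Hive 1.x; Hive 2.0 removed this property and always enforces bucketing)

set hive.enforce.bucketing=true;

Set the number of reducers

set mapreduce.job.reduces=3;

Create a bucketed table

create table course (c_id string,c_name string) clustered by(c_id) into 3 buckets;

Loading data into a bucketed table: neither hdfs dfs -put nor load data works for a bucketed table; the data must be loaded with insert overwrite

So to load a file into a bucketed table, first create an ordinary table and load the file into it, then use insert overwrite with a query to copy the ordinary table's data into the bucketed table

Load data into the bucketed table with insert overwrite

insert overwrite table course select * from course_common cluster by(c_id);  -- specify the bucketing column at the end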
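
The sampling benefit mentioned above can be sketched with tablesample on the course table created in this section; reading 1 of the 3 buckets is an arbitrary choice for illustration.

```sql
-- Read only the first of the 3 buckets; because course is bucketed on c_id,
-- Hive can pick the matching bucket file instead of scanning the whole table
select * from course tablesample(bucket 1 out of 3 on c_id);
```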

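Altering and dropping tables

Rename a table

alter  table  old_table_name  rename  to  new_table_name;

Add or change columns

Show the table structure
desc score5;
Add columns
alter table score5 add columns (mycol string, mysco string);
Change a column
alter table score5 change column mysco mysconew int;

Drop a table

drop table score5;

Truncate a table

truncate table score6;
Note: only managed (internal) tables can be truncated; truncating an external table raises an error

Note on truncate versus drop:

If HDFS trash is enabled, the data of a dropped table can be recovered from the trash, but the table definition cannot and must be recreated by hand. Data removed by truncate does not go to the trash, so a truncated table cannot be recovered (this behavior is version-dependent, so verify it on your cluster before relying on either path)

So use truncate with great care: once a table is truncated, the data is gone for good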

Loading data into Hive tables

Insert data directly into a partitioned table

insert into table score partition(month ='201807') values ('001','002','100');

Load data with load

load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Load data from a query

insert overwrite table score2 partition(month = '201806') select s_id,c_id,s_score from score1;
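
The statement above writes one fixed (static) partition. Part 1 also shows the partition(dt) form, where the partition value comes from the query itself. A sketch of that dynamic variant, assuming a hypothetical source table score1 that carries a month column and the score table partitioned by month:

```sql
-- Allow dynamic partitions without requiring any static partition column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the select list;
-- each distinct month value becomes its own partition
-- (score1.month is an assumed column, not taken from the original text)
insert overwrite table score partition(month)
select s_id, c_id, s_score, month
from score1;
```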

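Create a table and load data in one query statement

create table score2 as select * from score1;

Specify the data path with location when creating the table

create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';

Export and import Hive table data with export and import (managed tables)

create table techer2 like techer; -- create the table from an existing table's structure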
export table techer to  '/export/techer';
import table techer2 from '/export/techer';
