15. Backing Up Data in Hive
1. Exporting data with EXPORT:
EXPORT TABLE db_hive.order TO '/user/caizhengjie/datas/export/order';
2. Importing data with IMPORT:
import table order_imp from '/user/caizhengjie/datas/export/order';
IMPORT TABLE order_imp_1 FROM '/user/caizhengjie/datas/export/order' LOCATION '/user/kfk/datas/imp/order'; -- LOCATION specifies the directory for the imported table's data
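EXPORT and IMPORT also work at partition granularity for partitioned tables. A minimal sketch, assuming a hypothetical table db_hive.order_part partitioned by order_date:
EXPORT TABLE db_hive.order_part PARTITION (order_date='2018-10-09') TO '/user/caizhengjie/datas/export/order_part';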
16. Common Hive Queries
Prepare the data:
/opt/datas/hive/order.txt
0001,cenry,2018-10-09,product-1,350,guangzhou
0002,beny,2018-09-09,product-2,180,beijing
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0005,jone,2018-10-09,product-5,30,nanjing
0006,lili,2018-09-09,product-6,50,hangzhou
0007,chen,2018-10-09,product-7,90,wuhan
0008,wiwi,2018-09-09,product-8,150,chengdu
Create the table:
create table order(userid string,username string,order_date string,product_name string,price int,city string) row format delimited fields terminated by ',' stored as textfile;
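Then load the prepared file into the table; this uses the same load syntax that appears again in section 17:
load data local inpath '/opt/datas/hive/order.txt' into table order;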
1. Selecting specific columns
select userid,username from order;
userid  username
0001    cenry
0002    beny
0003    ben
0004    cherry
0005    jone
0006    lili
0007    chen
0008    wiwi
2. WHERE filtering
select * from order where price > 200;
order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001          cenry           2018-10-09        product-1           350          guangzhou
0003          ben             2018-09-09        product-3           580          beijing
0004          cherry          2018-10-09        product-4           450          shenzheng
3. LIMIT queries
select * from order limit 3;
order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001          cenry           2018-10-09        product-1           350          guangzhou
0002          beny            2018-09-09        product-2           180          beijing
0003          ben             2018-09-09        product-3           580          beijing
4. DISTINCT deduplication
select distinct city from order;
city
beijing
chengdu
guangzhou
hangzhou
nanjing
shenzheng
wuhan
5. max/min/count/sum
select max(price) from order;
_c0
580
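The other aggregates named in the heading work the same way; for the sample data, the following should return 30, 8, and 1880:
select min(price), count(*), sum(price) from order;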
6. GROUP BY / HAVING grouped queries
select sum(price) price,city from order group by city having price > 200;
price  city
760    beijing
350    guangzhou
450    shenzheng
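For contrast, WHERE filters rows before aggregation, while HAVING filters the aggregated groups. A sketch that restricts to a single order date before grouping:
select sum(price) price,city from order where order_date = '2018-10-09' group by city;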
17. Table Joins in Hive
Prepare the data file order1.txt:
1,0001,2018-09-09,product-1,180
2,0001,2018-09-09,product-2,240
3,0001,2018-09-09,product-3,350
4,0002,2018-09-09,product-4,110
5,0002,2018-09-09,product-5,230
6,0003,2018-09-09,product-6,245
7,0004,2018-09-09,product-7,965
8,0004,2018-09-09,product-8,741
9,0008,2018-09-09,product-8,690
10,0009,2018-09-09,product-9,120
11,0009,2018-09-09,product-9,120
And customer.txt:
0001,jack,17425845325,guangzhou
0002,beny,15451225965,shenzhen
0003,jone,13754859326,nanjing
0004,cherry,13785415255,suzhou
0005,alex,15745869325,beijing
0006,aili,13245632581,xiamen
Create the order1 table:
create table if not exists db_hive.order1(orderid string,userid string,order_date string,product_name string,price string) row format delimited fields terminated by ',' stored as textfile;
Create the customer table:
create table if not exists db_hive.customer(userid string,username string,telephone string,city string) row format delimited fields terminated by ',' stored as textfile;
Load the data:
load data local inpath '/opt/datas/hive/order1.txt' into table order1;
load data local inpath '/opt/datas/hive/customer.txt' into table customer;
1. Equi-join
select * from customer t1,order1 t2 where t1.userid = t2.userid;
t1.userid  t1.username  t1.telephone  t1.city    t2.orderid  t2.userid  t2.order_date  t2.product_name  t2.price
0001       jack         17425845325   guangzhou  1           0001       2018-09-09     product-1        180
0001       jack         17425845325   guangzhou  2           0001       2018-09-09     product-2        240
0001       jack         17425845325   guangzhou  3           0001       2018-09-09     product-3        350
0002       beny         15451225965   shenzhen   4           0002       2018-09-09     product-4        110
0002       beny         15451225965   shenzhen   5           0002       2018-09-09     product-5        230
0003       jone         13754859326   nanjing    6           0003       2018-09-09     product-6        245
0004       cherry       13785415255   suzhou     7           0004       2018-09-09     product-7        965
0004       cherry       13785415255   suzhou     8           0004       2018-09-09     product-8        741
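The comma-separated FROM clause above is the implicit join form; the same inner join can be written explicitly with JOIN ... ON:
select * from customer t1 join order1 t2 on t1.userid = t2.userid;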
2. Left join
select t1.username,t2.product_name from customer t1 left join order1 t2 on t1.userid = t2.userid;
t1.username  t2.product_name
jack         product-1
jack         product-2
jack         product-3
beny         product-4
beny         product-5
jone         product-6
cherry       product-7
cherry       product-8
alex         NULL
aili         NULL
select t2.username,t1.product_name from order1 t1 left join customer t2 on t1.userid = t2.userid;
t2.username  t1.product_name
jack         product-1
jack         product-2
jack         product-3
beny         product-4
beny         product-5
jone         product-6
cherry       product-7
cherry       product-8
NULL         product-8
NULL         product-9
NULL         product-9
3. Right join
select t1.username,t2.product_name from customer t1 right join order1 t2 on t1.userid = t2.userid;
t1.username  t2.product_name
jack         product-1
jack         product-2
jack         product-3
beny         product-4
beny         product-5
jone         product-6
cherry       product-7
cherry       product-8
NULL         product-8
NULL         product-9
NULL         product-9
select t2.username,t1.product_name from order1 t1 right join customer t2 on t1.userid = t2.userid;
t2.username  t1.product_name
jack         product-1
jack         product-2
jack         product-3
beny         product-4
beny         product-5
jone         product-6
cherry       product-7
cherry       product-8
alex         NULL
aili         NULL
4. Full join
select t2.username,t1.product_name from order1 t1 full join customer t2 on t1.userid = t2.userid;
t2.username  t1.product_name
jack         product-3
jack         product-2
jack         product-1
beny         product-5
beny         product-4
jone         product-6
cherry       product-8
cherry       product-7
alex         NULL
aili         NULL
NULL         product-8
NULL         product-9
NULL         product-9
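A common follow-up is isolating the unmatched side of a join. A sketch that should return only the customers with no orders (alex and aili in this data):
select t1.username from customer t1 left join order1 t2 on t1.userid = t2.userid where t2.userid is null;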
18. Hive's Several BY Clauses in Detail
order by / sort by / distribute by / cluster by
Prepare the data:
order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001          cenry           2018-10-09        product-1           350          guangzhou
0002          beny            2018-09-09        product-2           180          beijing
0003          ben             2018-09-09        product-3           580          beijing
0004          cherry          2018-10-09        product-4           450          shenzheng
0005          jone            2018-10-09        product-5           30           nanjing
0006          lili            2018-09-09        product-6           50           hangzhou
0007          chen            2018-10-09        product-7           90           wuhan
0008          wiwi            2018-09-09        product-8           150          chengdu
(1)order by
order by (runs with a single reducer; produces a globally sorted result)
select * from order order by price desc;
select * from order order by price asc;
order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0003          ben             2018-09-09        product-3           580          beijing
0004          cherry          2018-10-09        product-4           450          shenzheng
0001          cenry           2018-10-09        product-1           350          guangzhou
0002          beny            2018-09-09        product-2           180          beijing
0008          wiwi            2018-09-09        product-8           150          chengdu
0007          chen            2018-10-09        product-7           90           wuhan
0006          lili            2018-09-09        product-6           50           hangzhou
0005          jone            2018-10-09        product-5           30           nanjing
Set the number of reducers; here we use 3:
set mapreduce.job.reduces=3;
However, this has no effect for order by: the query still runs with a single reducer.
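Because order by funnels all data through that single reducer, a common pattern on large tables is to pair it with limit so the reducer only needs to emit a few rows:
select * from order order by price desc limit 3;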
(2)sort by
Each reducer sorts its own portion of the data, so the output is only locally ordered, and multiple reducers can be used. To see the effect, it helps to write the results to the local filesystem and adjust the reducer count (set to 3 here).
sort by (sorts the data within each reducer; the final global result set is not sorted)
Set the number of reducers; here we use 3:
set mapreduce.job.reduces=3;
The following statement writes the results to a local directory:
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order sort by price desc;
[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0008,wiwi,2018-09-09,product-8,150,chengdu
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0002,beny,2018-09-09,product-2,180,beijing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0001,cenry,2018-10-09,product-1,350,guangzhou
(3)distribute by
distribute by specifies how rows are partitioned across reducers. It is usually combined with sort by, and it must appear before sort by. Read it as: partition by column X, then sort by column Y within each partition.
distribute by (acts like a partitioner; generally used together with sort by)
Set the number of reducers; here we use 3:
set mapreduce.job.reduces=3;
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order distribute by city sort by price desc;
[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0004,cherry,2018-10-09,product-4,450,shenzheng
0001,cenry,2018-10-09,product-1,350,guangzhou
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0002,beny,2018-09-09,product-2,180,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0008,wiwi,2018-09-09,product-8,150,chengdu
(4)cluster by
When distribute by and sort by refer to the same column, cluster by can be used instead.
Note: columns in cluster by are always sorted in ascending order; asc and desc cannot be specified.
Set the number of reducers; here we use 3:
set mapreduce.job.reduces=3;
cluster by (when distribute by and sort by use the same column, cluster by can replace them)
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order distribute by username sort by username;
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou
Equivalent to:
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order cluster by username;
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou
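Since cluster by cannot sort descending (see the note above), fall back to the explicit distribute by / sort by form when descending order is needed. A sketch:
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order distribute by username sort by username desc;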
19. UDF Programming in Hive
1. Write a custom UDF
Add the following to the pom.xml file:
<properties>
    <hive.version>0.13.1</hive.version>
</properties>

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
</dependency>
Write the UDF code:
package com.kfk.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * @author : 蔡政洁
 * @email : caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 8:31 PM
 */
public class KfkUDF extends UDF {

    // Upper-case the input string; pass nulls through unchanged
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toUpperCase());
    }

    public static void main(String[] args) {
        // Quick local check: should print SPARK
        System.out.println(new KfkUDF().evaluate(new Text("spark")));
    }
}
2. Build the jar
3. Add the jar in Hive
hive (db_hive)> add jar /opt/jars/bigdata_study_udf.jar;
hive (db_hive)> list jar;
4. Create the function
create temporary function kfk_udf as 'com.kfk.hive.KfkUDF';
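A temporary function disappears when the session ends. Since Hive 0.13 a permanent function can also be registered from a jar; a sketch with a hypothetical function name and HDFS jar path:
create function kfk_udf_perm as 'com.kfk.hive.KfkUDF' using jar 'hdfs:///user/caizhengjie/jars/bigdata_study_udf.jar';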
5. Use the custom function
select kfk_udf(city) city from order;
city
GUANGZHOU
BEIJING
BEIJING
SHENZHENG
NANJING
HANGZHOU
WUHAN
CHENGDU
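When the function is no longer needed in the current session, it can be dropped:
drop temporary function if exists kfk_udf;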
For more on Hive UDFs, see the official wiki:
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
20. Connecting to Hive via HiveServer2/Beeline
1. Start the metastore:
bin/hive --service metastore
2. Start hiveserver2:
bin/hiveserver2
3. Start the beeline client:
bin/beeline
4. Connect to Hive:
% bin/beeline
beeline> !connect jdbc:hive2://<host>:<port>/<db>;auth=noSasl hiveuser pass
!connect jdbc:hive2://bigdata-pro-m01:10000 caizhengjie 199911
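Alternatively, beeline can take the connection details on the command line (-u for the JDBC URL, -n for the user, -p for the password):
bin/beeline -u jdbc:hive2://bigdata-pro-m01:10000 -n caizhengjie -p 199911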
5. Query the data
1: jdbc:hive2://bigdata-pro-m01:10000> select * from order;
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| order.userid  | order.username  | order.order_date  | order.product_name  | order.price  | order.city  |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| 0001          | cenry           | 2018-10-09        | product-1           | 350          | guangzhou   |
| 0002          | beny            | 2018-09-09        | product-2           | 180          | beijing     |
| 0003          | ben             | 2018-09-09        | product-3           | 580          | beijing     |
| 0004          | cherry          | 2018-10-09        | product-4           | 450          | shenzheng   |
| 0005          | jone            | 2018-10-09        | product-5           | 30           | nanjing     |
| 0006          | lili            | 2018-09-09        | product-6           | 50           | hangzhou    |
| 0007          | chen            | 2018-10-09        | product-7           | 90           | wuhan       |
| 0008          | wiwi            | 2018-09-09        | product-8           | 150          | chengdu     |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
As you can see, the beeline client displays query results in a much more readable tabular format than the Hive CLI itself.
For details on beeline usage, see the official wiki:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903124#HiveServer2Clients-table
21. Integrating Hive with JDBC
1. Add the pom.xml dependency:
<properties>
    <hive.version>0.13.1</hive.version>
</properties>

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>
2. Write the JDBC code:
package com.kfk.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * @author : 蔡政洁
 * @email : caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 9:53 PM
 */
public class HiveJDBC {

    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // Load the Hive JDBC driver
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // Replace the user/password with the user the queries should run as
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://bigdata-pro-m01:10000/db_hive", "caizhengjie", "199911");
        Statement stmt = con.createStatement();

        String tableName = "order";
        String sql = "select * from " + tableName;
        System.out.println("Running: " + sql);

        // Print the first two columns of each row
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // Release JDBC resources when done
        res.close();
        stmt.close();
        con.close();
    }
}
Output:
Running: select * from order
0001	cenry
0002	beny
0003	ben
0004	cherry
0005	jone
0006	lili
0007	chen
0008	wiwi
For more on integrating Hive with JDBC, see the official wiki: