Advanced Hive Applications (Part 3)

Summary: notes

15. Data Backup in Hive


1. Exporting data with EXPORT:

EXPORT TABLE db_hive.order TO '/user/caizhengjie/datas/export/order';

2. Importing data with IMPORT:

import table order_imp from '/user/caizhengjie/datas/export/order';
IMPORT TABLE order_imp_1 FROM '/user/caizhengjie/datas/export/order' LOCATION '/user/kfk/datas/imp/order';    -- LOCATION sets the directory for the imported table's data
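EXPORT and IMPORT also work at partition granularity. A minimal sketch, assuming a hypothetical partitioned table db_hive.order_part with a month partition column (the table name and paths here are illustrative, not from the notes above):

-- order_part is a hypothetical partitioned table
EXPORT TABLE db_hive.order_part PARTITION (month='2018-10') TO '/user/caizhengjie/datas/export/order_part';
IMPORT TABLE order_part_imp PARTITION (month='2018-10') FROM '/user/caizhengjie/datas/export/order_part';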


16. Common Hive Queries


Prepare the data:

/opt/datas/hive/order.txt
0001,cenry,2018-10-09,product-1,350,guangzhou
0002,beny,2018-09-09,product-2,180,beijing
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0005,jone,2018-10-09,product-5,30,nanjing
0006,lili,2018-09-09,product-6,50,hangzhou
0007,chen,2018-10-09,product-7,90,wuhan
0008,wiwi,2018-09-09,product-8,150,chengdu

Create the table:

create table order(userid string,username string,order_date string,product_name string,price int,city string) row format delimited fields terminated by ',' stored as textfile;
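Then load the data file, following the same LOAD DATA pattern used for the join tables later on:

load data local inpath '/opt/datas/hive/order.txt' into table order;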

1. Selecting columns

select userid,username from order;
userid  username
0001    cenry
0002    beny
0003    ben
0004    cherry
0005    jone
0006    lili
0007    chen
0008    wiwi

2. WHERE filtering

select * from order where price > 200;
order.userid    order.username  order.order_date        order.product_name      order.price     order.city
0001    cenry   2018-10-09      product-1       350     guangzhou
0003    ben     2018-09-09      product-3       580     beijing
0004    cherry  2018-10-09      product-4       450     shenzheng
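WHERE also accepts range and set predicates; a quick sketch against the same table:

select * from order where price between 200 and 500;
select * from order where city in ('beijing', 'wuhan');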

3. LIMIT

select * from order limit 3;
order.userid    order.username  order.order_date        order.product_name      order.price     order.city
0001    cenry   2018-10-09      product-1       350     guangzhou
0002    beny    2018-09-09      product-2       180     beijing
0003    ben     2018-09-09      product-3       580     beijing

4. DISTINCT deduplication

select distinct city from order;
city
beijing
chengdu
guangzhou
hangzhou
nanjing
shenzheng
wuhan

5. max/min/count/sum

select max(price) from order;
_c0
580
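min, count and sum follow the same pattern and can be combined in a single query; a sketch:

select max(price) max_price, min(price) min_price, count(*) cnt, sum(price) total_price from order;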

6. GROUP BY / HAVING grouping

select sum(price) price,city from order group by city having price > 200;
price   city
760     beijing
350     guangzhou
450     shenzheng
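Keep in mind the division of labor: WHERE filters individual rows before aggregation, while HAVING filters groups after it. A sketch combining both:

select city, sum(price) price from order where order_date = '2018-10-09' group by city having sum(price) > 100;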


17. Table Joins in Hive


Prepare the data, order1.txt:

1,0001,2018-09-09,product-1,180
2,0001,2018-09-09,product-2,240
3,0001,2018-09-09,product-3,350
4,0002,2018-09-09,product-4,110
5,0002,2018-09-09,product-5,230
6,0003,2018-09-09,product-6,245
7,0004,2018-09-09,product-7,965
8,0004,2018-09-09,product-8,741
9,0008,2018-09-09,product-8,690
10,0009,2018-09-09,product-9,120
11,0009,2018-09-09,product-9,120

customer.txt:

0001,jack,17425845325,guangzhou
0002,beny,15451225965,shenzhen
0003,jone,13754859326,nanjing
0004,cherry,13785415255,suzhou
0005,alex,15745869325,beijing
0006,aili,13245632581,xiamen

Create the order1 table:

create table if not exists db_hive.order1(orderid string,userid string,order_date string,product_name string,price string) row format delimited fields terminated by ',' stored as textfile;

Create the customer table:

create table if not exists db_hive.customer(userid string,username string,telephone string,city string) row format delimited fields terminated by ',' stored as textfile; 

Load the data:

load data local inpath '/opt/datas/hive/order1.txt' into table order1;
load data local inpath '/opt/datas/hive/customer.txt' into table customer;

1. Equi-join

select * from customer t1,order1 t2 where t1.userid = t2.userid;
t1.userid       t1.username     t1.telephone    t1.city t2.orderid      t2.userid       t2.order_date   t2.product_name t2.price
0001    jack    17425845325     guangzhou       1       0001    2018-09-09      product-1       180
0001    jack    17425845325     guangzhou       2       0001    2018-09-09      product-2       240
0001    jack    17425845325     guangzhou       3       0001    2018-09-09      product-3       350
0002    beny    15451225965     shenzhen        4       0002    2018-09-09      product-4       110
0002    beny    15451225965     shenzhen        5       0002    2018-09-09      product-5       230
0003    jone    13754859326     nanjing 6       0003    2018-09-09      product-6       245
0004    cherry  13785415255     suzhou  7       0004    2018-09-09      product-7       965
0004    cherry  13785415255     suzhou  8       0004    2018-09-09      product-8       741
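The same result can be written with the explicit JOIN ... ON syntax, which is generally preferred over the comma form:

select * from customer t1 join order1 t2 on t1.userid = t2.userid;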

2. Left join

select  t1.username,t2.product_name  from customer t1 left join order1 t2 on t1.userid = t2.userid;
t1.username     t2.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
alex    NULL
aili    NULL
select  t2.username,t1.product_name  from order1 t1 left join customer t2 on t1.userid = t2.userid;
t2.username     t1.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
NULL    product-8
NULL    product-9
NULL    product-9

3. Right join

select  t1.username,t2.product_name  from customer t1 right join order1 t2 on t1.userid = t2.userid;
t1.username     t2.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
NULL    product-8
NULL    product-9
NULL    product-9
select  t2.username,t1.product_name  from order1 t1 right join customer t2 on t1.userid = t2.userid;
t2.username     t1.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
alex    NULL
aili    NULL

4. Full join

select  t2.username,t1.product_name  from order1 t1 full join customer t2 on t1.userid = t2.userid;


t2.username     t1.product_name
jack    product-3
jack    product-2
jack    product-1
beny    product-5
beny    product-4
jone    product-6
cherry  product-8
cherry  product-7
alex    NULL
aili    NULL
NULL    product-8
NULL    product-9
NULL    product-9
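Hive additionally supports LEFT SEMI JOIN, an efficient stand-in for IN/EXISTS subqueries: each left-side row appears at most once when a match exists, and only left-side columns can be selected. A sketch:

select t1.username from customer t1 left semi join order1 t2 on t1.userid = t2.userid;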


18. Hive's BY Clauses in Detail


order by / sort by / distribute by / cluster by

Prepare the data:

order.userid    order.username  order.order_date        order.product_name      order.price     order.city
0001    cenry   2018-10-09      product-1       350     guangzhou
0002    beny    2018-09-09      product-2       180     beijing
0003    ben     2018-09-09      product-3       580     beijing
0004    cherry  2018-10-09      product-4       450     shenzheng
0005    jone    2018-10-09      product-5       30      nanjing
0006    lili    2018-09-09      product-6       50      hangzhou
0007    chen    2018-10-09      product-7       90      wuhan
0008    wiwi    2018-09-09      product-8       150     chengdu


(1) order by

order by (only one reducer; globally sorted result)

select * from order order by price desc;
select * from order order by price asc;

The output below is from the descending query:
order.userid    order.username  order.order_date        order.product_name      order.price     order.city
0003    ben     2018-09-09      product-3       580     beijing
0004    cherry  2018-10-09      product-4       450     shenzheng
0001    cenry   2018-10-09      product-1       350     guangzhou
0002    beny    2018-09-09      product-2       180     beijing
0008    wiwi    2018-09-09      product-8       150     chengdu
0007    chen    2018-10-09      product-7       90      wuhan
0006    lili    2018-09-09      product-6       50      hangzhou
0005    jone    2018-10-09      product-5       30      nanjing

Set the number of reducers, here to 3:

set mapreduce.job.reduces=3;

But this has no effect on ORDER BY: the job still runs with a single reducer, because a global sort requires all rows to pass through one reducer.
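This single-reducer bottleneck is also why strict mode rejects an ORDER BY without a LIMIT; a sketch:

set hive.mapred.mode=strict;
-- in strict mode, the LIMIT below is mandatory for ORDER BY
select * from order order by price desc limit 3;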


(2) sort by

Each reducer sorts the data it receives, so the result is only locally (partially) ordered; multiple reducers can be used. To see the effect, write the output to the local filesystem and raise the reducer count (set to 3 here).


sort by (sorts the data within each reducer; the final global result set is not sorted)

Set the number of reducers to 3:

set mapreduce.job.reduces=3;

This form writes the result to the local filesystem:

insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order sort by price desc;
[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0008,wiwi,2018-09-09,product-8,150,chengdu
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0002,beny,2018-09-09,product-2,180,beijing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0001,cenry,2018-10-09,product-1,350,guangzhou
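To write the result to HDFS instead of the local filesystem, drop the LOCAL keyword (the target path here is illustrative):

insert overwrite directory '/user/caizhengjie/datas/hive/sort' row format delimited fields terminated by ',' select * from order sort by price desc;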


(3) distribute by

distribute by controls how rows are partitioned across reducers. It is usually combined with sort by, and must be written before sort by. Read it as: partition by column X, then sort by column Y within each partition. Note in the output below that with 7 distinct cities and only 3 reducers, several cities hash into the same output file.


distribute by (acts like a partitioner; generally used together with sort by)

Set the number of reducers to 3:

set mapreduce.job.reduces=3;
insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order distribute by city sort by price desc;
[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0004,cherry,2018-10-09,product-4,450,shenzheng
0001,cenry,2018-10-09,product-1,350,guangzhou
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0002,beny,2018-09-09,product-2,180,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0008,wiwi,2018-09-09,product-8,150,chengdu


(4) cluster by

When distribute by and sort by reference the same column, cluster by can be used instead.

Note: a cluster by column is always sorted ascending; asc and desc cannot be specified.

Set the number of reducers to 3:

set mapreduce.job.reduces=3;

cluster by (when the distribute by and sort by columns are the same, cluster by replaces them)

insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order distribute by username sort by username;
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou

This is equivalent to:

insert overwrite local directory '/opt/datas/hive/sort' row format delimited fields terminated by ',' select * from order cluster by username;
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou


19. Hive UDF Programming in Detail


1. Writing a custom UDF

Add the dependency to pom.xml:

<properties>
    <hive.version>0.13.1</hive.version>
</properties>
<dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>${hive.version}</version>
</dependency>

Write the UDF code:

package com.kfk.hive;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
/**
 * @author : 蔡政洁
 * @email :caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 8:31 PM
 */
public class KfkUDF extends UDF {
    // convert the input text to upper case; pass null through unchanged
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toUpperCase());
    }
    public static void main(String[] args) {
        System.out.println(new KfkUDF().evaluate(new Text("spark")));
    }
}

2. Build the jar

3. Add the jar in the Hive session

hive (db_hive)> add jar /opt/jars/bigdata_study_udf.jar;
hive (db_hive)> list jar;

4.Create Function

create temporary function kfk_udf  as 'com.kfk.hive.KfkUDF';
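Since Hive 0.13 a function can also be registered permanently, with the jar referenced from HDFS; a sketch assuming the jar has been uploaded there (the HDFS path is illustrative):

create function kfk_udf_perm as 'com.kfk.hive.KfkUDF' using jar 'hdfs:///user/caizhengjie/jars/bigdata_study_udf.jar';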

5. Use the custom function

select kfk_udf(city) city from order;


city
GUANGZHOU
BEIJING
BEIJING
SHENZHENG
NANJING
HANGZHOU
WUHAN
CHENGDU

For more on Hive UDFs, see the official wiki:

https://cwiki.apache.org/confluence/display/Hive/HivePlugins


20. Connecting to Hive via HiveServer2/Beeline


1. Start the metastore:

bin/hive --service metastore

2. Start hiveserver2:

bin/hiveserver2

3. Start the beeline client:

bin/beeline

4. Connect to Hive:

% bin/beeline
beeline> !connect jdbc:hive2://<host>:<port>/<db>;auth=noSasl hiveuser pass
!connect jdbc:hive2://bigdata-pro-m01:10000 caizhengjie 199911

5. Query the data

1: jdbc:hive2://bigdata-pro-m01:10000> select * from order;
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| order.userid  | order.username  | order.order_date  | order.product_name  | order.price  | order.city  |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| 0001          | cenry           | 2018-10-09        | product-1           | 350          | guangzhou   |
| 0002          | beny            | 2018-09-09        | product-2           | 180          | beijing     |
| 0003          | ben             | 2018-09-09        | product-3           | 580          | beijing     |
| 0004          | cherry          | 2018-10-09        | product-4           | 450          | shenzheng   |
| 0005          | jone            | 2018-10-09        | product-5           | 30           | nanjing     |
| 0006          | lili            | 2018-09-09        | product-6           | 50           | hangzhou    |
| 0007          | chen            | 2018-10-09        | product-7           | 90           | wuhan       |
| 0008          | wiwi            | 2018-09-09        | product-8           | 150          | chengdu     |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+

As you can see, beeline renders query results in a much more readable tabular format than the plain hive CLI does.

For details on beeline usage, see the official wiki:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903124#HiveServer2Clients-table


21. Integrating Hive with JDBC


1. Add the pom.xml dependency

<properties>
    <hive.version>0.13.1</hive.version>
</properties>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>

2. Write the JDBC code

package com.kfk.hive;
/**
 * @author : 蔡政洁
 * @email :caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 9:53 PM
 */
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJDBC {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.exit(1);
        }
        // connect as the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://bigdata-pro-m01:10000/db_hive", "caizhengjie", "199911");
        Statement stmt = con.createStatement();
        String tableName = "order";
        ResultSet res = null;
        // select * query
        String sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // release JDBC resources
        res.close();
        stmt.close();
        con.close();
    }
}

Output:

Running: select * from order
0001  cenry
0002  beny
0003  ben
0004  cherry
0005  jone
0006  lili
0007  chen
0008  wiwi

For more on Hive-JDBC integration, see the official wiki:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903124#HiveServer2Clients-UsingJDBC

