Setting Up the Hive Environment
Installing Hive 3.1.2 on CentOS (in-depth guide):
https://blog.csdn.net/m0_54925305/article/details/120554242?spm=1001.2014.3001.5502
Hive Data Warehouse Operations
I. Create a database
hive> show databases;
OK
default
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> create database if not exists DB;
OK
Time taken: 0.064 seconds
hive> show databases;
OK
db
default
Time taken: 0.018 seconds, Fetched: 2 row(s)

Note that Hive treats database and table names as case-insensitive and stores them in lowercase, which is why DB shows up as db.
II. View the DB database's information and path
hive> describe database DB;
OK
db		hdfs://master:9000/user/hive/warehouse/db.db	root	USER
Time taken: 0.065 seconds, Fetched: 1 row(s)
Hive Table Operations
Hive tables come in two kinds: internal (managed) tables and external tables.
When data is loaded into an internal table, Hive moves it into the path the data warehouse points to; an external table merely records where the data lives and never moves it. When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table deletes only the metadata and leaves the data untouched. External tables are therefore somewhat safer, organize data more flexibly, and make it easy to share the source data, so they are the common choice in production.
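To make the difference concrete, a minimal sketch (the table and path names here are hypothetical, not part of the walkthrough below):

-- managed (internal) table: drop removes metadata AND data
create table t_managed(id string);
drop table t_managed;    -- the table's warehouse directory is deleted

-- external table: drop removes only metadata
create external table t_ext(id string) location '/user/root/t_ext_demo';
drop table t_ext;        -- files under /user/root/t_ext_demo are left intact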
The commands for working with tables are described in detail below.
The name of a table you are about to create must not collide with an existing table name, or the statement will fail. Run show tables first to see which tables already exist.
I. Create an internal table named cat with two string columns, cat_id and cat_name
hive> create table cat(cat_id string,cat_name string);
OK
Time taken: 0.72 seconds
hive> show tables;
OK
cat
Time taken: 0.057 seconds, Fetched: 1 row(s)
II. Create an external table named goods with two string columns, group_id and group_name (the heading in the original mismatched the statement; the table actually created below is goods)
hive> create external table if not exists goods(group_id string,group_name string) row format delimited fields terminated by '\t' location '/user/root/goods';
OK
Time taken: 0.155 seconds
hive> show tables;
OK
cat
goods
Time taken: 0.026 seconds, Fetched: 2 row(s)
III. Modify the cat table's structure: add two columns, group_id and cat_code
hive> alter table cat add columns(group_id string,cat_code string);
OK
Time taken: 0.372 seconds
hive> desc cat;
OK
cat_id              	string
cat_name            	string
group_id            	string
cat_code            	string
Time taken: 0.087 seconds, Fetched: 4 row(s)
IV. Rename the table cat to cat2
hive> alter table cat rename to cat2;
OK
Time taken: 0.275 seconds
This command lets you rename a table. The partition names are unchanged; for an external table the data location is untouched as well, while for a managed table stored under the default warehouse path, recent Hive releases also move the underlying HDFS directory to match the new name.
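To check where a table's data lives after a rename (a sketch; the exact Location value depends on your warehouse path):

hive> describe formatted cat2;
-- inspect the Location: field in the output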
V. Create a table with the same structure as an existing table: create cat3 with the same structure as cat2, using the like keyword. Note that like copies only the table structure; the new table starts empty.
hive> create table cat3 like cat2;
OK
Time taken: 1.391 seconds
hive> show tables;
OK
cat2
cat3
goods
Time taken: 0.047 seconds, Fetched: 3 row(s)
hive> desc cat3;
OK
cat_id              	string
cat_name            	string
group_id            	string
cat_code            	string
Time taken: 0.118 seconds, Fetched: 4 row(s)
Importing and Exporting Data in Hive
I. Importing data into a Hive table from the local filesystem
First, create a cat_group table in Hive with two string columns, group_id and group_name, using '\t' as the delimiter, then check the result.
hive> create table cat_group(group_id string,group_name string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.218 seconds
hive> show tables;
OK
cat2
cat3
cat_group
goods
Time taken: 0.048 seconds, Fetched: 4 row(s)
The row format delimited clause specifies the column delimiter the table uses when loading data.
The stored as textfile clause specifies the storage format; TEXTFILE is the default. If the data files are plain text, use stored as textfile: the files can then be copied straight to HDFS and Hive reads them directly.
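As a variant (the table name here is made up for illustration), the same pattern accepts any single-character delimiter, and other storage formats can replace textfile:

create table cat_group_csv(group_id string, group_name string)
    row format delimited fields terminated by ','
    stored as textfile;   -- e.g. stored as orc instead, for a columnar binary format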
II. Load the myhive file in the local Linux directory /input/hive/ into the cat_group table in Hive
hive> load data local inpath '/input/hive/myhive' into table cat_group;
Loading data to table db.cat_group
OK
Time taken: 1.081 seconds
Use a select statement to check whether the data was loaded into cat_group, limiting the output to 5 rows with the limit keyword.
hive> select * from cat_group limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 2.088 seconds, Fetched: 5 row(s)
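For reference, the myhive file is assumed to be plain text with one tab-separated record per line, matching the rows returned above (a sketch of its contents):

[root@master hive]# cat /input/hive/myhive
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯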
III. Importing data into Hive from HDFS
1. First, open a separate terminal window and create the /output/hive directory on HDFS
[root@master hive]# hadoop fs -mkdir /output/hive
2. Upload the myhive file from the local /input/hive/ directory to /output/hive on HDFS, and check that it was created successfully
[root@master hive]# hadoop fs -put /input/hive/myhive /output/hive/
[root@master hive]# hadoop fs -ls /output/hive
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-05 22:19 /output/hive/myhive
3. Create a table named cat_group1 in Hive with the following statement
hive> create table cat_group1(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.243 seconds
4. Load the myhive file under /output/hive on HDFS into the cat_group1 table in Hive, and check the result
hive> load data inpath '/output/hive/myhive' into table cat_group1;
Loading data to table db.cat_group1
OK
Time taken: 0.539 seconds
hive> select * from cat_group1 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.262 seconds, Fetched: 5 row(s)
Note: the data was imported successfully.
The only difference between importing from HDFS and importing from the local filesystem is that the local keyword is dropped after load data.
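Side by side (a recap of the two load commands used above; note that the HDFS variant moves the source file into the table's warehouse directory, while the local variant copies it, so after the HDFS load the file is gone from /output/hive):

-- from the local filesystem: the file is copied
load data local inpath '/input/hive/myhive' into table cat_group;
-- from HDFS: the file is moved
load data inpath '/output/hive/myhive' into table cat_group1;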
IV. Importing data selected from another table
1. First create a table named cat_group2 in Hive.
hive> create table cat_group2(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.111 seconds
2. Import the data in cat_group1 into cat_group2 in the following two ways.
hive> insert into table cat_group2 select * from cat_group1;
Query ID = root_20220306040659_42572420-db7d-4412-bbc3-495abd9ce479
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0003, Tracking URL = http://master:8088/proxy/application_1646528951444_0003/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:07:31,799 Stage-1 map = 0%, reduce = 0%
2022-03-06 04:07:51,642 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
2022-03-06 04:08:00,165 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1646528951444_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-06-59_043_3456913091663343579-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.47 sec   HDFS Read: 13409 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 63.711 seconds
hive> insert overwrite table cat_group2 select * from cat_group1;
Query ID = root_20220306041024_bf920fd1-b42d-4ed7-ad7b-66955905fa19
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0004, Tracking URL = http://master:8088/proxy/application_1646528951444_0004/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:10:47,981 Stage-1 map = 0%, reduce = 0%
2022-03-06 04:11:12,568 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.33 sec
2022-03-06 04:11:22,231 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1646528951444_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-10-24_167_6531779411761470258-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 13494 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 100 msec
OK
Time taken: 60.895 seconds
Note: insert overwrite replaces the table's existing data, whereas insert into appends to it.
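In short (a recap of the two statements above):

insert into table cat_group2 select * from cat_group1;       -- appends rows
insert overwrite table cat_group2 select * from cat_group1;  -- replaces the existing rows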
3. Query the cat_group2 table
hive> select * from cat_group2 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.33 seconds, Fetched: 5 row(s)
4. Select data from another table and insert it at table-creation time (create table ... as select)
Create the table cat_group3 in Hive, taking its data directly from cat_group2.
hive> create table cat_group3 as select * from cat_group2;
Query ID = root_20220306041630_3200b863-b9b3-4c2e-ac0d-c7caff9b6611
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0005, Tracking URL = http://master:8088/proxy/application_1646528951444_0005/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 04:16:54,438 Stage-1 map = 0%, reduce = 0%
2022-03-06 04:17:02,430 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.58 sec
MapReduce Total cumulative CPU time: 1 seconds 580 msec
Ended Job = job_1646528951444_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/.hive-staging_hive_2022-03-06_04-16-30_327_7813330832683742274-1/-ext-10002
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group3
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.58 sec   HDFS Read: 4969 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec
OK
Time taken: 34.65 seconds
5. Query the cat_group3 table
hive> select * from cat_group3 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.229 seconds, Fetched: 5 row(s)
V. Three common ways to export data
1. Export to the local filesystem
Create the local directory /output/hive and export the Hive table cat_group to /output/hive/ on the local filesystem.
[root@master hive]# mkdir -p /output/hive/
hive> insert overwrite local directory '/output/hive/'
    > row format delimited fields terminated by '\t' select * from cat_group;
Query ID = root_20220306062829_b059a3f5-e4ad-4dd7-a000-e294c4ccbee2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0006, Tracking URL = http://master:8088/proxy/application_1646528951444_0006/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:28:51,743 Stage-1 map = 0%, reduce = 0%
2022-03-06 06:29:00,515 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.59 sec
MapReduce Total cumulative CPU time: 1 seconds 590 msec
Ended Job = job_1646528951444_0006
Moving data to local directory /output/hive
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.59 sec   HDFS Read: 4738 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
Time taken: 32.116 seconds
[root@master out]# cd /output/hive/
[root@master hive]# ll
total 4
-rw-r--r--. 1 root root 64 Mar  6 06:29 000000_0
[root@master hive]# cat 000000_0
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Note: this is not symmetric with importing data into Hive; you cannot export data with insert into.
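To recap which forms work (a summary sketch of the statements used in this section; only the overwrite variants are valid for export):

-- export to a local directory (used above)
insert overwrite local directory '/output/hive/' row format delimited fields terminated by '\t' select * from cat_group;
-- export to an HDFS directory (used in the next step)
insert overwrite directory '/output/hive' row format delimited fields terminated by '\t' select * from cat_group;
-- there is no "insert into [local] directory" form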
2. Export Hive data to HDFS
Export the data in the Hive table cat_group to the /output/hive directory on HDFS. Note that insert overwrite directory replaces whatever already exists in the target directory.
hive> insert overwrite directory '/output/hive'
    > row format delimited fields terminated by '\t' select group_id,
    > group_name from cat_group;
Query ID = root_20220306063621_b359d338-77ee-4571-a425-5415f9c6fb03
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0007, Tracking URL = http://master:8088/proxy/application_1646528951444_0007/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:36:41,866 Stage-1 map = 0%, reduce = 0%
2022-03-06 06:36:55,679 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1646528951444_0007
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/output/hive/.hive-staging_hive_2022-03-06_06-36-21_452_7432529204143275493-1/-ext-10000
Moving data to directory /output/hive
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.75 sec   HDFS Read: 4772 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 36.494 seconds
Check the result on HDFS
[root@master hive]# hadoop fs -ls /output/hive
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-06 06:36 /output/hive/000000_0
3. Export to another Hive table
Import the data in the Hive table cat_group into cat_group4 (both tables have the same columns and column types).
First create a table cat_group4 in Hive with two string columns, group_id and group_name, using '\t' as the delimiter.
hive> create table cat_group4(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.195 seconds
Then import the data from cat_group into cat_group4.
hive> insert into table cat_group4 select * from cat_group;
Query ID = root_20220306064421_722364dd-7475-4ae5-ba44-553f3df856e2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0008, Tracking URL = http://master:8088/proxy/application_1646528951444_0008/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646528951444_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 06:44:47,514 Stage-1 map = 0%, reduce = 0%
2022-03-06 06:44:58,359 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2022-03-06 06:45:11,880 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1646528951444_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group4/.hive-staging_hive_2022-03-06_06-44-21_318_6696628966307745769-1/-ext-10000
Loading data to table db.cat_group4
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 13474 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
Time taken: 52.617 seconds
After the import completes, view the data in cat_group4 (the table holds only 5 rows, so limit 10 returns all of them).
hive> select * from cat_group4 limit 10;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.249 seconds, Fetched: 5 row(s)
VI. Working with Hive Partitioned Tables
Create a partitioned table: create a table goods in Hive with two string columns, goods_id and goods_status, partitioned by a string column cat_id, using '\t' as the delimiter. (If the external goods table from the earlier section still exists, drop it first, since table names must be unique.)
hive> create table goods(goods_id string,goods_status string) partitioned by (cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.107 seconds
View the structure of the goods table
hive> desc goods;
OK
goods_id            	string
goods_status        	string
cat_id              	string

# Partition Information
# col_name          	data_type	comment
cat_id              	string
Time taken: 0.108 seconds, Fetched: 7 row(s)
Insert data into the partitioned table: insert the data from the local goods file under /input/hive into the partitioned goods table.
[root@master hive]# cat goods
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
1020405	6	52052
Create a non-partitioned table goods_1 in Hive to stage the data from the local /input/hive/goods file.
hive> create table goods_1(goods_id string,goods_status string,cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.179 seconds
Load the data in the local /input/hive/goods file into the goods_1 table in Hive.
hive> load data local inpath '/input/hive/goods' into table goods_1;
Loading data to table db.goods_1
OK
Time taken: 0.511 seconds
Then insert the data from goods_1 into the partitioned table goods
hive> insert into table db.goods partition(cat_id = '52052') select goods_id, goods_status from db.goods_1 where cat_id = '52052';
Query ID = root_20220307041832_30f47fc3-629d-4eda-821a-5f0c3a9edb0d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0002, Tracking URL = http://master:8088/proxy/application_1646636256603_0002/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646636256603_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 04:19:05,274 Stage-1 map = 0%, reduce = 0%
2022-03-07 04:19:18,487 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.77 sec
2022-03-07 04:19:27,292 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.59 sec
MapReduce Total cumulative CPU time: 4 seconds 590 msec
Ended Job = job_1646636256603_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-07_04-18-32_060_6446641423854979060-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.59 sec   HDFS Read: 14777 HDFS Write: 320 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 590 msec
OK
Time taken: 59.931 seconds
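Each partition is stored as a subdirectory named cat_id=<value> under the table's directory. One way to see this (a sketch; the path follows the warehouse location shown in the job log above):

hive> dfs -ls /user/hive/warehouse/db.db/goods/;
-- expect a directory like /user/hive/warehouse/db.db/goods/cat_id=52052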
View the data in the goods table
hive> select goods_id, goods_status from goods;
OK
1624123	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
Time taken: 0.252 seconds, Fetched: 8 row(s)
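The insert above is a static partition insert: the partition value is hard-coded in the partition clause. Hive also supports dynamic partition inserts, where the value is taken from the last column of the select list; a sketch using the standard Hive settings:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- cat_id is supplied by each row instead of being hard-coded
insert into table goods partition(cat_id)
    select goods_id, goods_status, cat_id from goods_1;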
Modify a table partition: rename the goods table's partition cat_id = 52052 to cat_id = 52051, then view the renamed partition.
hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.678 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.139 seconds, Fetched: 1 row(s)
Deleting a table partition
Before deleting the partition from the partitioned goods table, back the data up into a goods_2 table (populated here from the staging table goods_1)
hive> create table goods_2(goods_id string,goods_status string) partitioned by (cat_id string) row format delimited fields terminated by '\t';
OK
Time taken: 0.178 seconds
hive> insert into table goods_2 partition(cat_id='52052') select goods_id,goods_status from goods_1 where cat_id = '52052';
Query ID = root_20220307054238_db58a379-17f6-4ecb-86e0-402e0d7bbf54
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0003, Tracking URL = http://master:8088/proxy/application_1646636256603_0003/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646636256603_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 05:43:04,534 Stage-1 map = 0%, reduce = 0%
2022-03-07 05:43:17,542 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.76 sec
2022-03-07 05:43:26,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1646636256603_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods_2/cat_id=52052/.hive-staging_hive_2022-03-07_05-42-38_498_2225361888387483704-1/-ext-10000
Loading data to table db.goods_2 partition (cat_id=52052)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 14813 HDFS Write: 322 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 49.84 seconds
Delete the cat_id = 52051 partition from the goods table. Because goods is a managed table, the partition's data is deleted along with its metadata.
hive> alter table goods drop if exists partition(cat_id = '52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.405 seconds
hive> show partitions goods;
OK
Time taken: 0.137 seconds
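A quick check that the partition's data directory was removed along with the metadata (a sketch; path as in the earlier job logs):

hive> dfs -ls /user/hive/warehouse/db.db/goods/;
-- the cat_id=52051 directory should no longer be listed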
VII. Working with Hive Buckets
Before creating bucketed tables, set the hive.enforce.bucketing property to true so that Hive distributes the data into the buckets. (In Hive 2.0 and later, including the 3.1.2 used here, bucketing is always enforced and this property has been removed, so the setting mainly matters on older releases.)
1. Create a bucketed table
Create a table named goods_t with two string columns, goods_id and goods_status, partitioned by a string column cat_id, clustered by the goods_status column and sorted by the goods_id column into 2 buckets.
hive> create table goods_t(goods_id string, goods_status string) partitioned by (cat_id string) clustered by(goods_status) sorted by(goods_id) into 2 buckets;
OK
Time taken: 0.148 seconds
2. Set the property hive.enforce.bucketing=true
hive> set hive.enforce.bucketing=true;
3. Insert the data from goods_2 into goods_t
hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods_2;
Query ID = root_20220307060336_c76fa90c-ea59-4fa4-9dd5-654c843421fd
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0004, Tracking URL = http://master:8088/proxy/application_1646636256603_0004/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646636256603_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-07 06:04:01,531 Stage-1 map = 0%, reduce = 0%
2022-03-07 06:04:12,389 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.73 sec
2022-03-07 06:04:29,170 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 4.23 sec
2022-03-07 06:04:30,371 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.99 sec
MapReduce Total cumulative CPU time: 7 seconds 410 msec
Ended Job = job_1646636256603_0004
Loading data to table db.goods_t partition (cat_id=52063)
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0005, Tracking URL = http://master:8088/proxy/application_1646636256603_0005/
Kill Command = /home/hadoop//bin/mapred job -kill job_1646636256603_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-03-07 06:04:54,726 Stage-3 map = 0%, reduce = 0%
2022-03-07 06:05:07,008 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
2022-03-07 06:05:16,566 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.93 sec
MapReduce Total cumulative CPU time: 3 seconds 930 msec
Ended Job = job_1646636256603_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.41 sec   HDFS Read: 19414 HDFS Write: 469 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.93 sec   HDFS Read: 11591 HDFS Write: 173 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 340 msec
OK
Time taken: 102.151 seconds
4. Sample the bucketed table
hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status);
OK
Time taken: 0.281 seconds
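For reference, tablesample(bucket x out of y on col) hashes col into y groups and returns group x, i.e. rows with hash(col) % y == x - 1. Every row here has the same goods_status value ('6'), so all rows hash into a single bucket; bucket 1 happened to be the empty one. Sampling the other bucket should therefore return every row (a sketch):

hive> select * from goods_t tablesample(bucket 2 out of 2 on goods_status);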