Preface
After the previous articles in this series, you should already have a Hadoop cluster and a Spark cluster up and running, with Hive installed. Since Hive's default execution engine is MapReduce (MR), anyone who has run SQL statements on it has probably wondered why it is so painfully slow. Is there a way to speed it up? The answer is yes, and that is what today's article is about. Let's get started!
A few concepts first
Hive supports three execution engines: MR (the default), Tez, and Spark.

Hive on Spark: Hive stores the metadata and handles SQL parsing and optimization; the syntax is HQL, but the execution engine becomes Spark, which runs the query as RDD operations.

Spark on Hive: Hive only stores the metadata; Spark handles SQL parsing and optimization, the syntax is Spark SQL, and Spark runs the query as RDD operations.
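To make the distinction concrete, here is a minimal sketch of how the two modes look from the client side. The table name is only a placeholder, and spark-sql is assumed to already be pointed at the Hive metastore; this post covers only the Hive on Spark path.

```bash
# Hive on Spark: you keep writing HQL in the Hive CLI; only the execution engine changes
# (later in this post the engine is set permanently in hive-site.xml).
hive -e "set hive.execution.engine=spark; select count(*) from some_table;"

# Spark on Hive: Spark SQL is the entry point and only reads table metadata from the Hive metastore.
spark-sql -e "select count(*) from some_table;"
```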
Compiling Hive 3.1.2 from source
When configuring Hive on Spark with Hive 3.1.2 and Spark 3.0.0, you will find that the official Hive 3.1.2 release and Spark 3.0.0 are not compatible: Hive 3.1.2 is built against Spark 2.3.0, while the prebuilt Spark 3.0.0 packages target Hadoop 2.7 or Hadoop 3.2. So if we want to use the newer versions of Hive and Hadoop, we have to recompile Hive to make it compatible with Spark 3.0.0. Besides the Spark 3.0.0 compatibility changes, we also bump the guava version in Hive 3.1.2 to match Hadoop 3.x, so that it works with Hadoop 3.1.3.
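As a quick way to see the guava mismatch for yourself, compare the guava jars shipped with each component. The paths below assume the install locations used in this series and that HADOOP_HOME is set; adjust them to your environment.

```bash
# Hadoop 3.1.3 ships guava 27.0-jre
ls $HADOOP_HOME/share/hadoop/common/lib/guava-*.jar

# The stock Hive 3.1.2 binary ships guava 19.0, which conflicts with Hadoop 3.x at runtime
ls /opt/module/hive/lib/guava-*.jar
```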
Detailed compilation steps
1. Download the Hive 3.1.2 source package
Download URL:
https://archive.apache.org/dist/hive/hive-3.1.2/hive-3.1.2-src.tar.gz
2. Unpack and modify the source locally
2.1 Unpack locally
After extracting hive-3.1.2-src.tar.gz, open the Hive source tree in IDEA. If you have not installed IDEA yet, I recommend doing so; it really is a great IDE!
2.2 Modify the source
Modify the pom.xml file in the source tree:
```xml
<!-- Bump the guava version to match hadoop3.x -->
<guava.version>27.0-jre</guava.version>
<!-- Change the spark version and the matching scala version -->
<spark.version>3.0.0</spark.version>
<scala.binary.version>2.12</scala.binary.version>
<scala.version>2.12.10</scala.version>
```
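To double-check that the new property values are picked up, one option (purely a sanity check, not part of the original steps; it needs a reasonably recent maven-help-plugin) is to ask Maven for the effective values from the source root:

```bash
# Print the effective values of the overridden properties
mvn help:evaluate -Dexpression=spark.version -q -DforceStdout
mvn help:evaluate -Dexpression=guava.version -q -DforceStdout
```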
Modify 27 classes in the source tree. For reference, the affected files are:
1. druid-handler/src/java/org/apache/hadoop/hive/druid/serde/DruidScanQueryRecordReader.java
2. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/AMReporter.java
3. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapTaskReporter.java
4. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/TaskExecutorService.java
5. ql/src/test/org/apache/hadoop/hive/ql/exec/tez/SampleTezSessionState.java
6. ql/src/java/org/apache/hadoop/hive/ql/exec/tez/WorkloadManager.java
7. llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
8. llap-common/src/java/org/apache/hadoop/hive/llap/AsyncPbRpcProxy.java
9. ql/src/test/org/apache/hadoop/hive/ql/stats/TestStatsUtils.java
10. spark-client/src/main/java/org/apache/hive/spark/client/metrics/ShuffleWriteMetrics.java
11. spark-client/src/main/java/org/apache/hive/spark/counter/SparkCounter.java
12. New class: standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/ColumnsStatsUtils.java
13. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DateColumnStatsAggregator.java
14. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DecimalColumnStatsAggregator.java
15. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DoubleColumnStatsAggregator.java
16. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java
17. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/StringColumnStatsAggregator.java
18. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DateColumnStatsDataInspector.java
19. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DecimalColumnStatsDataInspector.java
20. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DoubleColumnStatsDataInspector.java
21. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/LongColumnStatsDataInspector.java
22. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/StringColumnStatsDataInspector.java
23. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DateColumnStatsMerger.java
24. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DecimalColumnStatsMerger.java
25. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DoubleColumnStatsMerger.java
26. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/LongColumnStatsMerger.java
27. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java
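Broadly speaking, these edits fall into two groups: adapting to the guava 27 API (for example, Futures.addCallback now requires an explicit executor argument) and adapting to Spark 3 / Scala 2.12 API changes. That grouping is my own reading of the changes rather than something documented officially. If you want to locate the guava-related call sites before editing, a grep over the source tree helps:

```bash
# List files that call the changed guava Futures.addCallback API
grep -rln "Futures.addCallback" --include="*.java" ./hive-3.1.2-src
```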
3. Compress and upload
Once the source changes above are done, compress the hive-3.1.2-src directory:
```bash
tar -zcf hive-3.1.2-src.tar ./hive-3.1.2-src
```
Then upload the resulting archive to the /opt/resource directory on one of your virtual machines.
Note: the machine you upload to must already have a JDK and Maven installed. That is straightforward; there are plenty of installation guides online if you need one.
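Before building, a quick sanity check that both tools are available (Hive 3.1.2 is normally built with JDK 8; the exact versions shown here are assumptions):

```bash
# Confirm the JDK and Maven are on the PATH of the build machine
java -version   # expect a 1.8.x JDK
mvn -v          # Maven 3.x
```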
4. Unpack and compile
```bash
cd /opt/resource
# Unpack
tar -zxvf hive-3.1.2-src.tar
cd hive-3.1.2-src
# Build and package with Maven
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true
```
After a successful build, the recompiled apache-hive-3.1.2-bin.tar.gz is generated under /opt/resource/hive-3.1.2-src/packaging/target/; grab it from there if you need it.
Reworking the existing Hive deployment
1. Copy and modify the configuration files
If you never installed Hive from the official binary package, you can install our recompiled Hive directly. I had already installed Hive from the official package, so the existing installation needs to be reworked.
```bash
cd /opt/module
# Rename the old hive directory
mv hive hive-bak
# Extract our recompiled package
tar -zxvf apache-hive-3.1.2-bin.tar.gz
# Rename the new hive directory
mv apache-hive-3.1.2-bin hive
# Carry over the config files from the previous installation
cp ./hive-bak/conf/hive-site.xml ./hive/conf
cp ./hive-bak/conf/spark-defaults.conf ./hive/conf
# Edit hive-site.xml
vim ./hive/conf/hive-site.xml
```
Add the following:
```xml
<!-- Location of the Spark dependencies (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop1:8020/spark-jars/*.jar</value>
</property>
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<!-- Hive-to-Spark connection timeout -->
<property>
    <name>hive.spark.client.connect.timeout</name>
    <value>10000ms</value>
</property>
```
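The spark.yarn.jars value above points at an HDFS directory, so the Spark jars have to be uploaded there beforehand. A minimal sketch, assuming the /spark-jars directory name from the config and the Spark installation used later in this post (exactly which jars you upload, for example from a "without-hadoop" Spark build, depends on your environment):

```bash
# Create the HDFS directory referenced by spark.yarn.jars and upload the Spark jars
hadoop fs -mkdir -p /spark-jars
hadoop fs -put /opt/module/spark-3.0.0-bin-hadoop3.2/jars/*.jar /spark-jars
```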
Note: the default value of hive.spark.client.connect.timeout is 1000ms. If Hive insert statements fail with the exception below, raise it to 10000ms.
```
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session d9e0224c-3d14-4bf4-95bc-ee3ec56df48e
```
2. Copy the Spark jars into hive/lib
```bash
cd /opt/module/spark-3.0.0-bin-hadoop3.2/jars
# Copy the Spark jars that Hive needs at runtime into Hive's lib directory
cp py4j-0.10.9.jar pyrolite-4.30.jar RoaringBitmap-0.7.45.jar scala*.jar \
   snappy-java-1.1.7.5.jar spark-core_2.12-3.0.0.jar spark-kvstore_2.12-3.0.0.jar \
   spark-launcher_2.12-3.0.0.jar spark-network-common_2.12-3.0.0.jar \
   spark-network-shuffle_2.12-3.0.0.jar spark-tags_2.12-3.0.0.jar \
   spark-unsafe_2.12-3.0.0.jar \
   /opt/module/hive/lib/
```
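As an optional sanity check (not part of the original steps), confirm the jars actually landed in Hive's lib directory:

```bash
ls /opt/module/hive/lib | grep -E 'spark|scala|py4j'
```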
Note: you must copy the Spark jars! You must copy the Spark jars! You must copy the Spark jars! Important things are worth saying three times! Without this step, queries will still fail to run.
Testing
1. Start the environment
1. The Spark-on-YARN cluster set up earlier is in place.
2. Start the HDFS and YARN services of the Hadoop cluster.
3. Start MySQL, which stores the Hive metadata.
(Typical startup commands are sketched below.)
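A hedged sketch of typical commands for these prerequisites; script names, hosts, and the MySQL service name depend on your setup:

```bash
# 1/2. Bring up HDFS and YARN (run on the NameNode / ResourceManager hosts)
start-dfs.sh
start-yarn.sh
# 3. Start MySQL, which backs the Hive metastore
sudo systemctl start mysqld
```

With the services up, start the Hive client: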
```bash
cd /opt/module/hive
# Start the Hive CLI
bin/hive
```
2. Test with a data insert
```bash
# Create a table:
hive (default)> create table student(id int, name string);
# Insert a row:
hive (default)> insert into table student values(1,'abc');
```
Execution output:
```
hive (default)> insert into table student values(1,'abc');
Query ID = root_20220517214157_a976e115-4cbe-46d1-a26b-27878214e920
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1652771632964_0004
Kill Command = /opt/module/hadoop/bin/yarn application -kill application_1652771632964_0004
Hive on Spark Session Web UI URL: http://hadoop3:45607

Query Hive on Spark job[4] stages: [16, 17]
Spark job[4] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-16 .......         0      FINISHED      1          1        0        0       0
Stage-17 .......         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 5.12 s
--------------------------------------------------------------------------------------
Spark job[4] finished successfully in 5.12 second(s)
Loading data to table default.student
OK
col1    col2
Time taken: 31.066 seconds
hive (default)>
```
If you see output like this, Hive on Spark has been configured successfully!
With that, we have completed the integration of Hive and Spark, and HQL execution is dramatically faster!