[Big Data Development & Operations Solutions] Hadoop 2.7.6 + Spark 2.4.4 + Scala 2.11.12 + Hudi 0.5.2 Single-Node Pseudo-Distributed Installation


Hadoop 2.7.6 + Spark 2.4.4 + Scala 2.11.12 + Hudi 0.5.2 Single-Node Pseudo-Distributed Installation

Notes

1. The base Hadoop environment used here comes from another article of mine; this document adds the Spark and Hudi installation and deployment on top of that base environment (see the base environment deployment article).
2. The configuration in this article is fairly simple. I hit a few pitfalls along the way that are not written up here, so that beginners like me are not led down the wrong path. The article is written in detail, and if you follow the overall process you should not run into problems. If you need to build the base big data environment first, see my Hadoop environment deployment article mentioned above, which is also quite detailed.
3. Introductions to Spark and Hudi are not repeated here; there is plenty of material online and in the official documentation. Download or jump links are provided for all installation media and official documents needed in this article, so you can download, free of charge, exactly the same versions installed here.
4. The table below lists the versions of the Hadoop-family components in my environment after completing this installation:

Software  Version
Hadoop 2.7.6
Mysql 5.7
Hive 2.3.2
Hbase 1.4.9
Spark 2.4.4
Hudi 0.5.2
JDK 1.8.0_151
Scala 2.11.12
OGG for bigdata 12.3
Kylin 2.4
Kafka 2.11-1.1.1
Zookeeper 3.4.6
Oracle Linux 6.8x64

1. Install Scala, which Spark depends on

Other Spark releases are all built against Scala 2.11; only the 2.4.2 release was built with Scala 2.12. Hudi officially uses Spark 2.4.4, and that Spark build reports "Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)", so here we download Scala 2.11.12.
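Once the Spark tarball is unpacked (section 2), a quick way to confirm which Scala version that build targets is to look at the artifact names under its jars directory, which carry the Scala binary version as a suffix (a small sketch, assuming the /hadoop/spark layout used later in this article):

# spark-core_2.11-2.4.4.jar => this distribution is built against Scala 2.11
ls /hadoop/spark/jars/spark-core_*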

1.1 Download and extract Scala

Download address: the official Scala download page.
Download the Linux version (scala-2.11.12.tgz).
Create a directory named scala under /usr on the Linux server and upload the downloaded tarball to it:

[root@hadoop opt]# cd /usr/
[root@hadoop usr]# mkdir scala
[root@hadoop usr]# cd scala/
[root@hadoop scala]# pwd
/usr/scala
[root@hadoop scala]# ls
scala-2.11.12.tgz
[root@hadoop scala]# tar -zxvf scala-2.11.12.tgz 
[root@hadoop scala]# ls
scala-2.11.12  scala-2.11.12.tgz
[root@hadoop scala]# rm -rf *tgz
[root@hadoop scala]# cd scala-2.11.12/
[root@hadoop scala-2.11.12]# pwd
/usr/scala/scala-2.11.12

1.2 Configure environment variables

Edit the /etc/profile file and add the following configuration:

export SCALA_HOME=/usr/scala/scala-2.11.12
Then add the following to the PATH variable in that file:
${SCALA_HOME}/bin

After these additions, my /etc/profile looks like this:

export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib

Save and exit, then source the file to make the environment variables take effect:

[root@hadoop ~]# source /etc/profile

1.3 Verify Scala

[root@hadoop scala-2.11.12]# scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

2. Download and extract Spark

2.1 Download Spark

Download address: the official Spark download page; get spark-2.4.4-bin-hadoop2.7.tgz.

2.2 Extract Spark

Create a spark directory under /hadoop to hold Spark.

[root@hadoop scala-2.11.12]# cd /hadoop/
[root@hadoop hadoop]# mkdir spark
[root@hadoop hadoop]# cd spark/
# upload the installation package to the spark directory via Xftp
[root@hadoop spark]# tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# ls
spark-2.4.4-bin-hadoop2.7  spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# rm -rf *tgz
[root@hadoop spark]# mv spark-2.4.4-bin-hadoop2.7/* .
[root@hadoop spark]# ls
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  yarn

3. Spark configuration

3.1 Configure environment variables

Edit the /etc/profile file and add

export  SPARK_HOME=/hadoop/spark

After adding the variable above, edit the PATH variable in the same file and append

${SPARK_HOME}/bin

After the changes, my /etc/profile reads:

export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export SPARK_HOME=/hadoop/spark
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib

When done editing, run source /etc/profile to make the environment variables take effect.

3.2 Configure the parameter files

Go to the conf directory:

[root@hadoop conf]# pwd
/hadoop/spark/conf

Copy the configuration template and rename it:

[root@hadoop conf]# cp spark-env.sh.template spark-env.sh
[root@hadoop conf]# ls 
docker.properties.template  fairscheduler.xml.template  log4j.properties.template  metrics.properties.template  slaves.template  spark-defaults.conf.template  spark-env.sh  spark-env.sh.template

Edit the spark-env.sh file and add the following configuration (adjust the paths to your own environment):

export SCALA_HOME=/usr/scala/scala-2.11.12
export JAVA_HOME=/usr/java/jdk1.8.0_151
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/hadoop/spark
export SPARK_MASTER_IP=192.168.1.66
export SPARK_EXECUTOR_MEMORY=1G

Run source /etc/profile for the changes to take effect.
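Note: when the master starts (section 4 below), Spark 2.x logs a warning that SPARK_MASTER_IP is deprecated. If you prefer, the newer variable name can be used in spark-env.sh instead (an equivalent setting, assuming the same host address as above):

# newer name for the same setting; SPARK_MASTER_IP still works but is deprecated
export SPARK_MASTER_HOST=192.168.1.66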

3.3 Create the slaves file

Create a slaves file from the template that Spark provides:

[root@hadoop conf]# pwd
/hadoop/spark/conf
[root@hadoop conf]# cp slaves.template slaves
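For this single-node pseudo-distributed setup the template's default content is already sufficient: the slaves file simply lists one worker hostname per line, and the template defaults to localhost (a minimal sketch of the file's contents):

# /hadoop/spark/conf/slaves -- one worker host per line
localhost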

4. Start Spark

Because Spark relies on the distributed file system provided by Hadoop, make sure Hadoop is running normally before starting Spark.

[root@hadoop hadoop]# jps
23408 RunJar
23249 JobHistoryServer
23297 RunJar
24049 Jps
22404 DataNode
22774 ResourceManager
23670 Kafka
22264 NameNode
22889 NodeManager
23642 QuorumPeerMain
22589 SecondaryNameNode
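If any of these daemons is missing, the base environment can be brought up first with the standard Hadoop scripts (a minimal sketch, assuming the HADOOP_HOME layout used in this article; see the base environment article for the full startup procedure):

# start HDFS (NameNode, DataNode, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh
# start YARN (ResourceManager, NodeManager)
$HADOOP_HOME/sbin/start-yarn.sh
# confirm the daemons are up
jps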

With Hadoop running normally, execute the following on hserver1 (that is, the Hadoop NameNode, which is also the Spark master node):

[root@hadoop hadoop]# cd /hadoop/spark/sbin
[root@hadoop sbin]# ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
[root@hadoop sbin]# cat /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host hadoop --port 7077 --webui-port 8080
========================================
20/03/30 22:42:27 INFO master.Master: Started daemon with process name: 24079@hadoop
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:27 WARN master.MasterArguments: SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST
20/03/30 22:42:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:42:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:27 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.
20/03/30 22:42:27 INFO master.Master: Starting Spark master at spark://hadoop:7077
20/03/30 22:42:27 INFO master.Master: Running Spark version 2.4.4
20/03/30 22:42:28 INFO util.log: Logging initialized @1497ms
20/03/30 22:42:28 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:28 INFO server.Server: Started @1560ms
20/03/30 22:42:28 INFO server.AbstractConnector: Started ServerConnector@6182300a{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
20/03/30 22:42:28 INFO util.Utils: Successfully started service 'MasterUI' on port 8080.
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1f0276{/app,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1af444{/app/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@259b10d3{/,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6fc2f56f{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@37a28407{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e99fa57{/app/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66be5bb8{/driver/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO ui.MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://hadoop:8080
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b2c0980{/metrics/master/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4ac1749f{/metrics/applications/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO master.Master: I have been elected leader! New state: ALIVE
20/03/30 22:42:31 INFO master.Master: Registering worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
[root@hadoop sbin]# cat  /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://hadoop:7077
========================================
20/03/30 22:42:29 INFO worker.Worker: Started daemon with process name: 24173@hadoop
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:42:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:30 INFO util.Utils: Successfully started service 'sparkWorker' on port 39384.
20/03/30 22:42:30 INFO worker.Worker: Starting Spark worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
20/03/30 22:42:30 INFO worker.Worker: Running Spark version 2.4.4
20/03/30 22:42:30 INFO worker.Worker: Spark home: /hadoop/spark
20/03/30 22:42:31 INFO util.log: Logging initialized @1682ms
20/03/30 22:42:31 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:31 INFO server.Server: Started @1758ms
20/03/30 22:42:31 INFO server.AbstractConnector: Started ServerConnector@3d598dff{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
20/03/30 22:42:31 INFO util.Utils: Successfully started service 'WorkerUI' on port 8081.
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5099c1b0{/logPage,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64348087{/logPage/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46dcda1b{/,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1617f7cc{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@56e77d31{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@643123b6{/log,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO ui.WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://hadoop:8081
20/03/30 22:42:31 INFO worker.Worker: Connecting to master hadoop:7077...
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1cf30aaa{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:7077 after 36 ms (0 ms spent in bootstraps)
20/03/30 22:42:31 INFO worker.Worker: Successfully registered with master spark://hadoop:7077

The startup looks fine. Open the web UI: http://192.168.1.66:8080/
(screenshot of the Spark master web UI)

5. Run the Pi-calculation example program shipped with Spark

Here we simply run a Pi-calculation demo in local mode. Follow the steps below.

[root@hadoop sbin]# cd /hadoop/spark/
[root@hadoop spark]# ./bin/spark-submit  --class  org.apache.spark.examples.SparkPi  --master local   examples/jars/spark-examples_2.11-2.4.4.jar 
20/03/30 22:45:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:45:59 INFO spark.SparkContext: Running Spark version 2.4.4
20/03/30 22:45:59 INFO spark.SparkContext: Submitted application: Spark Pi
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:45:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:45:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 39352.
20/03/30 22:45:59 INFO spark.SparkEnv: Registering MapOutputTracker
20/03/30 22:45:59 INFO spark.SparkEnv: Registering BlockManagerMaster
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/30 22:45:59 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-63bf7c92-8908-4784-8e16-4c6ef0c93dc0
20/03/30 22:45:59 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/30 22:45:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
20/03/30 22:46:00 INFO util.log: Logging initialized @2066ms
20/03/30 22:46:00 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:46:00 INFO server.Server: Started @2179ms
20/03/30 22:46:00 INFO server.AbstractConnector: Started ServerConnector@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@36dce7ed{/jobs,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a1ebcff{/jobs/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19868320{/jobs/job,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c20be82{/jobs/job/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@13c612bd{/stages,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ef41c66{/stages/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b739528{/stages/stage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f577419{/stages/stage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28fa700e{/stages/pool,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3d526ad9{/stages/pool/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e041f0c{/storage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a175569{/storage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@11963225{/storage/rdd,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f3c966c{/storage/rdd/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@11ee02f8{/environment,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4102b1b1{/environment/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@61a5b4ae{/executors,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3a71c100{/executors/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b69fd74{/executors/threadDump,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f325091{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@437e951d{/static,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@467f77a5{/,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1bb9aa43{/api,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66b72664{/jobs/job/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a34b7b8{/stages/stage/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop:4040
20/03/30 22:46:00 INFO spark.SparkContext: Added JAR file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar at spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 1585579560287
20/03/30 22:46:00 INFO executor.Executor: Starting executor ID driver on host localhost
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38875.
20/03/30 22:46:00 INFO netty.NettyBlockTransferService: Server created on hadoop:38875
20/03/30 22:46:00 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMasterEndpoint: Registering block manager hadoop:38875 with 366.3 MB RAM, BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f8e0cee{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:46:01 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Missing parents: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop:38875 (size: 1256.0 B, free: 366.3 MB)
20/03/30 22:46:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/30 22:46:01 INFO executor.Executor: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 1585579560287
20/03/30 22:46:01 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:39352 after 45 ms (0 ms spent in bootstraps)
20/03/30 22:46:01 INFO util.Utils: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar to /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles-86767584-1e78-45f2-a9ed-8ac4360ab170/fetchFileTemp2974211155688432975.tmp
20/03/30 22:46:01 INFO executor.Executor: Adding file:/tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles-86767584-1e78-45f2-a9ed-8ac4360ab170/spark-examples_2.11-2.4.4.jar to class loader
20/03/30 22:46:01 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 308 ms on localhost (executor driver) (1/2)
20/03/30 22:46:01 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 31 ms on localhost (executor driver) (2/2)
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
20/03/30 22:46:01 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.606 s
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.703911 s
Pi is roughly 3.1386756933784667
20/03/30 22:46:01 INFO server.AbstractConnector: Stopped Spark@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:01 INFO ui.SparkUI: Stopped Spark web UI at http://hadoop:4040
20/03/30 22:46:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/30 22:46:01 INFO memory.MemoryStore: MemoryStore cleared
20/03/30 22:46:01 INFO storage.BlockManager: BlockManager stopped
20/03/30 22:46:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
20/03/30 22:46:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/30 22:46:01 INFO spark.SparkContext: Successfully stopped SparkContext
20/03/30 22:46:01 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e019897d-3160-4bb1-ab59-f391e32ec47a
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839

You can see the output: Pi is roughly 3.1386756933784667
The value of Pi has been printed.
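For reference, SparkPi also accepts an optional partition-count argument, and the local master can be given more worker threads with local[N] or local[*] (an alternative local invocation, shown as a sketch):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 'local[*]' examples/jars/spark-examples_2.11-2.4.4.jar 100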
The above only ran the demo in single-node local mode; to run it in cluster mode, read on.

6. Run the computation in yarn-cluster mode

Go to the Spark installation directory and run the Pi-calculation demo in yarn-cluster mode:

[root@hadoop spark]# ./bin/spark-submit  --class  org.apache.spark.examples.SparkPi  --master  yarn-cluster   examples/jars/spark-examples_2.11-2.4.4.jar 
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
20/03/30 22:47:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:47:48 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
20/03/30 22:47:48 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
20/03/30 22:47:48 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
20/03/30 22:47:48 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
20/03/30 22:47:48 INFO yarn.Client: Setting up container launch context for our AM
20/03/30 22:47:48 INFO yarn.Client: Setting up the launch environment for our AM container
20/03/30 22:47:48 INFO yarn.Client: Preparing resources for our AM container
20/03/30 22:47:48 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/03/30 22:47:51 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6/__spark_libs__3389017089811757919.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/__spark_libs__3389017089811757919.zip
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/spark-examples_2.11-2.4.4.jar
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6/__spark_conf__559264393694354636.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/__spark_conf__.zip
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:47:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:48:01 INFO yarn.Client: Submitting application application_1585579247054_0001 to ResourceManager
20/03/30 22:48:01 INFO impl.YarnClientImpl: Submitted application application_1585579247054_0001
20/03/30 22:48:02 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:02 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1585579681188
     final status: UNDEFINED
     tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
     user: root
20/03/30 22:48:03 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:04 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:05 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:06 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:07 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:08 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:09 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:11 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:12 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:13 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:14 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:15 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:16 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:17 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:19 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:20 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:21 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:22 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:23 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:24 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:25 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:26 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:27 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:28 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:29 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:29 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: hadoop
     ApplicationMaster RPC port: 37844
     queue: default
     start time: 1585579681188
     final status: UNDEFINED
     tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
     user: root
20/03/30 22:48:30 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:31 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:32 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:33 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:34 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:35 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:36 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:37 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:38 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:39 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:40 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:41 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:42 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:43 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:44 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:45 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:46 INFO yarn.Client: Application report for application_1585579247054_0001 (state: FINISHED)
20/03/30 22:48:46 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: hadoop
     ApplicationMaster RPC port: 37844
     queue: default
     start time: 1585579681188
     final status: SUCCEEDED
     tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
     user: root
20/03/30 22:48:46 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c243c24-9489-4c8a-a1bc-a6a9780615d6
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6

Note that when the computation runs in yarn-cluster mode, the result is not printed to the console; it is written into the Hadoop cluster's logs. So how do you view it? Notice that the output above contains the address:
tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
Open it:

(screenshot of the YARN application page)
Then click into logs:
(screenshot of the application logs page)
View the stdout content:
(screenshot of the stdout log)
The Pi result has been printed there.
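The same stdout can also be pulled from the command line with the YARN CLI instead of the web UI (a minimal sketch; the application ID is the one reported in the submit output above, and YARN log aggregation must be enabled for the command to return the container logs):

# fetch the aggregated container logs of the finished application
yarn logs -applicationId application_1585579247054_0001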
A few commonly used commands:

Start Spark:
./sbin/start-all.sh
Start Hadoop and Spark together:
./starths.sh
To stop, replace start with stop in the commands above.

7. Configure Spark to read Hive tables

Because table operations inside Hive go through MapReduce and are relatively slow, this section describes how to read Hive tables into memory with Spark and do the computation there.

Step 1: copy $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf so that Spark can pick up the Hive configuration; the MySQL JDBC driver is also copied into Spark's jars directory so that Spark can reach the Hive metastore.

[root@hadoop spark]# pwd
/hadoop/spark
[root@hadoop spark]# cp $HIVE_HOME/conf/hive-site.xml conf/
[root@hadoop spark]# chmod 777 conf/hive-site.xml
[root@hadoop spark]# cp /hadoop/hive/lib/mysql-connector-java-5.1.47.jar jars/

Enter the interactive shell via spark-shell:

[root@hadoop spark]# /hadoop/spark/bin/spark-shell
20/03/31 10:31:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://hadoop:4042
Spark context available as 'sc' (master = local[*], app id = local-1585621962060).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala>  import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>  val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@62966c9f

scala> hiveContext.sql("show databases").show()
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.client.capability.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.false.positive.probability does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.broker.address.default does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.orc.time.counters does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve-fraction.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.ppd.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.event.message.factory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.metrics.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.hs2.user.access does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.storage.storageDirectory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.connect.retry.limit does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.xmx.headroom does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.direct does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.stats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.client.consistent.splits does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.start does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.acl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.delegation.token.lifetime does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.guidKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.ats.hook.queue.capacity does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.large.query does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bigtable.minsize.semijoin.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.user does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.alloc.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.wait.queue.comparator.class.name does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.use.soft.references does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.listener.thread-count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.container.max.java.heap.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.column.autogather does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am.liveness.heartbeat.interval.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.decoding.metrics.percentiles.intervals does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.txn.store.impl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.groupby.shuffle does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.object.cache.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.parallel.ops.in.session does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.limit.extrastep does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.ssl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.location does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.delay.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.fileformat does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.file.cleaner.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.compaction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.use.blobstore.as.scratchdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.class does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap.path does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.download.permanent.fns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.historic.queries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.execution.reducesink.new.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.max.num.delta does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.attempted does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.initiator.failed.compacts.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.reporter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.max.pending.writes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.execution.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.enable.grace.join.in.llap does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.threadpool.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.scratchdir.lock does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.spnego does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.hs2.coordinator.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.timeout.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.filter.stats.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.orc.base.delta.ratio does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fastpath does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.heartbeater does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.file.cleanup.delay.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.rpc.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.hybridgrace.bloomfilter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.tree does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.stats.ndv.tuner does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.query.length does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.failed does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.close.session.on.disconnect does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.ppd.windowing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.initial.metadata.count.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.host does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.file.metadata.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.service.refresh.interval.sec does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.output.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.driver.parallel.compilation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.remote.token.requires.signing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.cache.allow.synthetic.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.hash.table.inflation.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.hbase.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.vectorized does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.writeset.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vector.serde.deserialize does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.order.columnalignment does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.send.buffer.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.schema.evolution does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.values.clause does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.llap.concurrent.queries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.allow.uber does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.partition.size.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.auth does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.include.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.communicator.num.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orderby.position.alias does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.sleep.between.retries.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.partitions does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.hadoop2.component does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.shuffle.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.in.clause does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.passiveWaitTimeMs does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.load.dynamic.partitions.thread does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.segments.granularity does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.response.header.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.internal.variable.list does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductionpercentage does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.limit does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.serialize.in.tasks does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.query.timeout.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.hadoop2.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.directory.batch.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.reader.wait does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.max.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.max.open.txns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.reduce.side does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.zookeeper.publish.configs does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.join.hashtable.max.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.init.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.authorization.storage.check.externaltable.drop does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.execution.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.cnf.maxnodes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.adaptor.usage.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.rewriting does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupMembershipKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.catalog.cache.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.show.warnings does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fshandler.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.max.bloom.filter.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.metadata.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.serde does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.wait.queue.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.cache.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.operational.properties does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.memory.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.nonvector.wrapper.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.cache.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vectorized.input.format does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.cte.materialize.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.clean.until does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.semijoin.conversion does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.metrics.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.rootdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.limit.partition.request does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.async.log.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.logger does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.allow.udf.load.on.demand does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cli.tez.session.async does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bloom.filter.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am-reporter.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.file.size.for.mapjoin does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.bucketing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning.compat does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.spnego.principal does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.preemption.metrics.intervals does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.shuffle.dir.watcher.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.arena.count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.use.SSL does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transpose.aggr.join does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.maxTries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning.max.data.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.base does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.invalidator.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.lrfu does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.coordinator.address.default does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.max.fetch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.hidden.list does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.io.sarg.cache.max.weight.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.sleep.time does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.row.serde.deserialize does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.compile.lock.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.variance does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.lrfu.lambda does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.db.type does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.stream.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transactional.events.mem does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.default.fetch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.retain does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.cardinality.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupClassKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.allow.permanent.fns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.ssl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.manager.dump.lock.state.on.acquire.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.succeeded does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.fileid.path does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.row.count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.optimized.hashtable.probe.percent does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.distribute does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.use.fqdn does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.min.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.validate.acls does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.support.special.characters.tablename does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.sleep.interval.between.start.attempts does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.container.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.read.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.optimizations.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.orc.gap.cache does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.dynamic.partition.hashjoin does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.copyfile.maxnumfiles does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.formats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.numConnection does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.enable.preemption does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.executors does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.full does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.connection.class does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.custom.queue.allowed does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.lrr does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.password does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.writer.wait does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.request.header.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductiontuples does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.rollbacktxn does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.num.schedulable.tasks.per.node does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.acl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.type.safety does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.async.exec.async.compile does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.input.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.enable.memory.manager does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.repair.batch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.supported.schemes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.allow.synthetic.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.filter.in.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.op.stats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.input.listing.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime.jitter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.cartesian.product does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.num.handlers does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.vcpus.per.instance does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.count.open.txns.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.min.bloom.filter.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.partition.columns.separate does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.stripe.details.mem.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.heartbeat.threadpool.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.locality.delay does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cmrootdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.disable.backoff.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.sleep.between.retries.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.exec.inplace.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.working.directory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.memory.per.instance.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.path.validation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.nway.joins does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.strict.locking.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.async.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.input.generate.consistent.splits does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.in.place.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.memory.rownum.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.xsrf.filter.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.max does not exist
+------------+
|databaseName|
+------------+
|     default|
|      hadoop|
+------------+
scala>  hiveContext.sql("show tables").show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|                  aa|      false|
| default|                  bb|      false|
| default|                  dd|      false|
| default|       kylin_account|      false|
| default|        kylin_cal_dt|      false|
| default|kylin_category_gr...|      false|
| default|       kylin_country|      false|
| default|kylin_intermediat...|      false|
| default|kylin_intermediat...|      false|
| default|         kylin_sales|      false|
| default|                test|      false|
| default|           test_null|      false|
+--------+--------------------+-----------+

You can see that the query returns results just fine, but why did it print that pile of WARN messages above?
For example:
WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist

Each of these warnings simply means that the HiveConf class used by Spark does not recognize that property in hive-site.xml (the property belongs to a newer Hive release than the Hive classes bundled with Spark). The fix is to delete the offending property from the hive-site.xml configuration file, for example:

<property>
    <name>hive.llap.skip.compile.udf.check</name>
    <value>false</value>
    <description>
      Whether to skip the compile-time check for non-built-in UDFs when deciding whether to
      execute tasks in LLAP. Skipping the check allows executing UDFs from pre-localized
      jars in LLAP; if the jars are not pre-localized, the UDFs will simply fail to load.
    </description>
  </property>

Log in to spark-shell again and run the query; the warning is gone.
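If you would rather not edit hive-site.xml for every such property, a blunter alternative (my own workaround, not something the warnings require) is to raise the log level for the current session only. Note that this hides all WARN output, not just these HiveConf notices:

// Raise the log threshold for this spark-shell session only.
sc.setLogLevel("ERROR")

// Re-run the earlier query; the result prints without the WARN noise.
hiveContext.sql("show databases").show()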

8. Configuring Hudi

8.1 Key points from the official documentation

Let's first look at the Get Started page of the official documentation:
(screenshot of the Hudi quick start page)
My existing Hadoop environment is 2.7. The reason I installed Spark 2.4.4 earlier is that the current official example uses Hadoop 2.7 + Spark 2.4.4. And although Hudi and Spark now support both Scala 2.11.x and 2.12.x, the official site also uses 2.11, so to stay consistent with the Hudi docs and with Spark 2.4.4 ("Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)") I installed Scala 2.11.12.
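If you want to double-check on your own machine that the versions really line up before going further, here is a small sketch (my own addition; it assumes you run it inside spark-shell, where the shell itself provides the spark and sc objects):

// Print the Spark, Scala and Java versions the running shell was built with,
// so they can be compared against the versions this walkthrough assumes
// (Spark 2.4.4 / Scala 2.11.12 / JDK 1.8).
println(s"Spark version : ${spark.version}")
println(s"Scala version : ${util.Properties.versionString}")
println(s"Java version  : ${System.getProperty("java.version")}")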
As of this writing Hudi has already released 0.5.2, but the official site still uses 0.5.1 in its examples, so let's first switch to the Hudi 0.5.1 release notes:
Click to view
(screenshots of the Hudi 0.5.1 release notes)
What the release notes above say, roughly, is the following:

- Version upgrades: Spark is upgraded from 2.1.0 to 2.4.4, Avro from 1.7.7 to 1.8.2, and Parquet from 1.8.1 to 1.10.1. Kafka is upgraded from 0.8.2.1 to 2.0.0 as a side effect of moving the spark-streaming-kafka artifact from 0.8_2.11 to 0.10_2.11/2.12. Important: Hudi 0.5.1 requires Spark 2.4+.
- Hudi now supports both Scala 2.11 and 2.12; see the Scala 2.12 build instructions to build Hudi with Scala 2.12. In addition, the hudi-spark, hudi-utilities, hudi-spark-bundle and hudi-utilities-bundle artifacts have been renamed to hudi-spark_{scala_version}, hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and hudi-utilities-bundle_{scala_version}, where scala_version is 2.11 or 2.12.
- In 0.5.1, timeline metadata operations no longer rely on renames. This feature is enabled by default when creating a new Hudi table and disabled by default for existing tables; before enabling it on an existing table, read this section (https://hudi.apache.org/docs/deployment.html#upgrading). To turn on the new timeline layout (i.e. avoid renames), set the write config hoodie.timeline.layout.version=1 (a minimal sketch of passing this option from spark-shell follows this list); alternatively, use the CLI command repair overwrite-hoodie-props to add hoodie.timeline.layout.version=1 to the hoodie.properties file. Either way, upgrade the Hudi reader (query engine) to 0.5.1 before upgrading the writer.
- The CLI supports repair overwrite-hoodie-props to rewrite a table's hoodie.properties file from a specified file; you can use it to update the table name or to adopt the new timeline layout. Note that while hoodie.properties is being written (a matter of milliseconds), some queries may fail temporarily; simply re-run them.
- The DeltaStreamer parameter used to specify the table type has changed from --storage-type to --table-type; see the wiki for the latest terminology changes.
- The values of the Kafka reset-offset strategy have changed: LARGEST becomes LATEST and SMALLEST becomes EARLIEST, for the DeltaStreamer config item auto.offset.reset.
- When exploring Hudi with spark-shell, you now need to pass the extra --packages org.apache.spark:spark-avro_2.11:2.4.4; see the quickstart for details.
- Key generators have moved to the separate package org.apache.hudi.keygen; if you override the key generator class (config item hoodie.datasource.write.keygenerator.class), make sure its fully qualified class name is updated accordingly.
- The Hive sync tool registers the read-optimized table of a MOR table with an _ro suffix, so query it with the _ro suffix as well; use the --skip-ro-suffix option to keep the old table name, i.e. sync without appending _ro.
- In 0.5.1, the hudi-hadoop-mr-bundle used by the Presto/Hive query engines shades the Avro package in order to support real-time queries. Hudi supports pluggable record-merge logic: users only need to implement their own HoodieRecordPayload. If you use this feature, you must relocate the Avro dependency in your own code so that its behaviour stays consistent with Hudi, e.g. relocate org.apache.avro. to org.apache.hudi.org.apache.avro.
- DeltaStreamer has better delete support; see the blog for details.
- DeltaStreamer supports AWS Database Migration Service (DMS); see the blog for details.
- DynamicBloomFilter is now supported; it is off by default and can be enabled with the index config hoodie.bloom.index.filter.type=DYNAMIC_V0.
- HDFSParquetImporter supports bulk insert: set --command to bulkinsert.
- AWS WASB and WASBS cloud storage are supported.
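As referenced in the timeline-layout bullet above, here is a minimal sketch of how the hoodie.timeline.layout.version write config could be passed from spark-shell. It is only an illustration: on tables created with 0.5.1+ the new layout is already the default, and the df, tableName, basePath and option constants below are the ones defined later in the quick-start steps of this article.

// Hedged sketch: pass the new timeline layout version explicitly as a write option.
df.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.timeline.layout.version", "1").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)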

8.2 A wrong installation attempt

OK, now that we have read the release notes and settled on the version combination to use, let's switch straight to the official documentation for the latest Hudi 0.5.2 release:
Click here to jump to it (screenshot of the Hudi 0.5.2 quick start page). Since I had never used Spark or Hudi before, my first thought on seeing the Hudi site was to download a Hudi 0.5.1 package, deploy it, and only then run the commands given on the official site, as in the failed attempt I made below:

Since the official examples currently all use 0.5.1, I downloaded that version as well:
https://downloads.apache.org/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz
Upload the downloaded package to the /hadoop/spark directory and extract it:
[root@hadoop spark]# ls
bin  conf  data  examples  hudi-0.5.1-incubating.src.tgz  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  work  yarn
[root@hadoop spark]# tar -zxvf hudi-0.5.1-incubating.src.tgz
[root@hadoop spark]# ls
bin  conf  data  examples  hudi-0.5.1-incubating  hudi-0.5.1-incubating.src.tgz  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  work  yarn
[root@hadoop spark]# rm -rf *tgz
[root@hadoop ~]# /hadoop/spark/bin/spark-shell \
>     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/hadoop/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5717aa3e-7bfb-42c4-aadd-2a884f3521d5;1.0
    confs: [default]
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
:: resolution report :: resolve 454ms :: artifacts dl 1ms
    :: modules in use:
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   0   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
    Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.pom

    Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.jar
。。。。。。。。。。

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::          UNRESOLVED DEPENDENCIES         ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found

        :: org.apache.spark#spark-avro_2.11;2.4.4: not found

        ::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found, unresolved dependency: org.apache.spark#spark-avro_2.11;2.4.4: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

8.3 The "right" way to install and deploy

What I downloaded is really a source package, not something directly runnable.
Moreover, spark-shell --packages specifies Maven coordinates: Spark resolves them from the local Maven repository and otherwise downloads them from Maven Central (or the repositories given with --repositories). In other words, those two jars are meant to be downloaded automatically, and since my virtual machine has no outbound network access and no Maven configuration, the "jar not found" error was inevitable.
The code in the official example:
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
is in essence just a pair of dependencies like the ones you would declare in a Maven project's pom file. Going through the official documentation, I found the central-repository coordinates Hudi publishes, and from there the two packages referenced in the official example:
(screenshot of the Maven coordinates given in the Hudi docs)
Taken out directly, they are the following two:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark-bundle_2.11</artifactId>
  <version>0.5.2-incubating</version>
</dependency>

Fine, so I downloaded those two packages right there and kept reading the official documentation:
(screenshot of the build-from-source note in the Hudi docs)
It says I can also get started by building Hudi myself and then passing --jars /packaging/hudi-spark-bundle/target/hudi-spark-bundle-..*-SNAPSHOT.jar to spark-shell, instead of --packages org.apache.hudi:hudi-spark-bundle:0.5.2-incubating.
Seeing that hint, I looked at the spark-shell help on the Linux box:

[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --help
Usage: ./bin/spark-shell [options]

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

So --jars points at jar files that already exist on the machine. Next, upload the two packages downloaded earlier to the server:

[root@hadoop spark]# mkdir external_jars
[root@hadoop spark]# cd external_jars/
[root@hadoop external_jars]# pwd
/hadoop/spark/external_jars
(upload the jars into this directory via Xftp)
[root@hadoop external_jars]# ls
hudi-spark-bundle_2.11-0.5.2-incubating.jar  scala-library-2.11.12.jar  spark-avro_2.11-2.4.4.jar  spark-tags_2.11-2.4.4.jar  unused-1.0.0.jar

Then take the official example command:

spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

and change it to:

[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --jars /hadoop/spark/external_jars/spark-avro_2.11-2.4.4.jar,/hadoop/spark/external_jars/hudi-spark-bundle_2.11-0.5.2-incubating.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
20/03/31 15:19:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop:4040
Spark context available as 'sc' (master = local[*], app id = local-1585639157881).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

OK!!! No errors this time. Next, let's try out the insert/update/query operations.
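Before moving on, an optional sanity check (my own addition, not part of the official quick start) confirms from inside the REPL that the two jars passed with --jars really made it onto the classpath:

// List the jars spark-shell was started with; the paths of the Hudi bundle and
// spark-avro jars passed via --jars should both appear here.
sc.listJars().foreach(println)

// Loading one of the Hudi classes is another quick check: it throws
// ClassNotFoundException if the bundle is missing from the classpath.
Class.forName("org.apache.hudi.QuickstartUtils")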

8.4 Hudi CRUD operations

All of the following builds on the spark-shell session started above.

8.4.1 Set the table name, base path and a data generator for generating records

scala> import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.QuickstartUtils._

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SaveMode._

scala> import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions._

scala> import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions._

scala> import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.config.HoodieWriteConfig._

scala> val tableName = "hudi_cow_table"
tableName: String = hudi_cow_table

scala> val basePath = "file:///tmp/hudi_cow_table"
basePath: String = file:///tmp/hudi_cow_table

scala> val dataGen = new DataGenerator
dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@4bf6bc2d

The data generator can generate sample inserts and updates based on the trips sample schema.

8.4.2 Insert data
Generate some new trip samples, load them into a DataFrame, and write the DataFrame into the Hudi dataset, as shown below.

scala> val inserts = convertToStringList(dataGen.generateInserts(10))
inserts: java.util.List[String] = [{"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.4726905879569653, "begin_lon": 0.46157858450465483, "end_lat": 0.754803407008858, "end_lon": 0.9671159942018241, "fare": 34.158284716382845, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "0d612dd2-5f10-4296-a434-b34e6558e8f1", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.6100070562136587, "begin_lon": 0.8779402295427752, "end_lat": 0.3407870505929602, "end_lon": 0.5030798142293655, "fare": 43.4923811219014, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.5731835407930634, "begin_...

scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |     options(getQuickstartWriteConfigs).
     |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |     option(TABLE_NAME, tableName).
     |     mode(Overwrite).
     |     save(basePath);
20/03/31 15:28:11 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

mode(Overwrite) overwrites and recreates the dataset if it already exists. You can inspect the data generated under /tmp/hudi_cow_table/\<region>/\<country>/\<city>/. We provided a record key (uuid in the schema), a partition field (region/country/city) and a precombine field (ts in the schema) to ensure trip records are unique within each partition. For more information, see "Modeling data stored in Hudi"; for the ways of ingesting data into Hudi, see "Writing Hudi datasets". Here we use the default write operation, upsert. If your workload has no updates, you can also use the faster insert or bulk insert operations; see "Write operations" for more details.
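To see what that write actually produced on disk, here is a small hedged sketch (assuming the local basePath file:///tmp/hudi_cow_table used above) that walks the directory tree from inside the same spark-shell session:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Print every file Hudi created for the table: the parquet data files under the
// region/country/city partitions plus the metadata files under .hoodie.
Files.walk(Paths.get("/tmp/hudi_cow_table"))
  .iterator().asScala
  .filter(p => Files.isRegularFile(p))
  .foreach(println)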

8.4.3 Query data
Load the data files into a DataFrame.


scala> val roViewDF = spark.
     |     read.
     |     format("org.apache.hudi").
     |     load(basePath + "/*/*/*/*")
20/03/31 15:30:03 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
roViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> roViewDF.registerTempTable("hudi_ro_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_ro_table where fare > 20.0").show()

+------------------+-------------------+-------------------+---+
|              fare|          begin_lon|          begin_lat| ts|
+------------------+-------------------+-------------------+---+
| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+------------------+-------------------+-------------------+---+


scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_ro_table").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|     20200331152807|264170aa-dd3f-4a7...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|     20200331152807|0e170de4-7eda-4ab...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|     20200331152807|fb06d140-cd00-413...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|     20200331152807|eb1d495c-57b0-4b3...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|     20200331152807|2b3380b7-2216-4ca...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|     20200331152807|81a9b76c-655b-452...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|     20200331152807|d24e8cb8-69fd-4cc...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|     20200331152807|0d612dd2-5f10-429...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|     20200331152807|a6a7e7ed-3559-4ee...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|     20200331152807|824ee8d5-6f1f-4d5...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+

This query provides a read-optimized view of the ingested data. Because our partition path (region/country/city) is nested three levels below the base path, we used load(basePath + "/*/*/*/*"). For all the supported storage types and views, see "Storage types and views".
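As a small extra check (my own addition), the read-optimized DataFrame loaded above can also be grouped by Hudi's partition metadata column to confirm how the ten sample records are spread across the three-level partitions:

// Count records per Hudi partition path using the roViewDF loaded above.
roViewDF.groupBy("_hoodie_partition_path")
  .count()
  .show(false)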

8.4.4 Update data
This is similar to inserting new data. Use the data generator to generate updates to existing trips, load them into a DataFrame, and write the DataFrame into the Hudi dataset.

scala> val updates = convertToStringList(dataGen.generateUpdates(10))
updates: java.util.List[String] = [{"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.7340133901254792, "begin_lon": 0.5142184937933181, "end_lat": 0.7814655558162802, "end_lon": 0.6592596683641996, "fare": 49.527694252432056, "partitionpath": "americas/united_states/san_francisco"}, {"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.1593867607188556, "begin_lon": 0.010872312870502165, "end_lat": 0.9808530350038475, "end_lon": 0.7963756520507014, "fare": 29.47661370147079, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.71801964677...

scala> val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |     options(getQuickstartWriteConfigs).
     |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |     option(TABLE_NAME, tableName).
     |     mode(Append).
     |     save(basePath);
20/03/31 15:32:27 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

Note that the save mode is now Append. In general, always use Append mode unless you are creating the dataset for the first time. Querying the data again will now show the updated trips. Each write operation generates a new commit identified by a timestamp. Look for changes in the _hoodie_commit_time, rider and driver fields for the same _hoodie_record_key as in the previous commit.
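A hedged way to see that change concretely (my own addition; the reload is the same one the next section performs) is to list commit time, rider and driver per record key after reloading the temp view:

// Reload the table into the temp view so it reflects the latest commit,
// then inspect commit time / rider / driver per record key.
spark.read.format("org.apache.hudi").
  load(basePath + "/*/*/*/*").
  createOrReplaceTempView("hudi_ro_table")

spark.sql(
  """select _hoodie_record_key, _hoodie_commit_time, rider, driver
    |from hudi_ro_table
    |order by _hoodie_record_key""".stripMargin).show(false)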

8.4.5 Incremental query
Hudi also provides a way to obtain a stream of records that have changed since a given commit timestamp. This is done using Hudi's incremental view and supplying a begin time from which changes are needed. If we want all changes after a given commit (the common case), there is no need to specify an end time.

scala> // reload data

scala> spark.
     |     read.
     |     format("org.apache.hudi").
     |     load(basePath + "/*/*/*/*").
     |     createOrReplaceTempView("hudi_ro_table")
20/03/31 15:33:55 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

scala> 

scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(20200331152807, 20200331153224)

scala> val beginTime = commits(commits.length - 2) // commit time we are interested in
beginTime: String = 20200331152807
scala> // incrementally query the data

scala> val incViewDF = spark.
     |     read.
     |     format("org.apache.hudi").
     |     option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |     load(basePath);
20/03/31 15:34:40 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+--------------------+-------------------+---+
|_hoodie_commit_time|              fare|           begin_lon|          begin_lat| ts|
+-------------------+------------------+--------------------+-------------------+---+
|     20200331153224|49.527694252432056|  0.5142184937933181| 0.7340133901254792|0.0|
|     20200331153224|  98.3428192817987|  0.3349917833248327| 0.4777395067707303|0.0|
|     20200331153224|  90.9053809533154| 0.19949323322922063|0.18294079059016366|0.0|
|     20200331153224| 90.25710109008239|  0.4006983139989222|0.08528650347654165|0.0|
|     20200331153224| 29.47661370147079|0.010872312870502165| 0.1593867607188556|0.0|
|     20200331153224| 63.72504913279929|   0.888493603696927| 0.6570857443423376|0.0|
+-------------------+------------------+--------------------+-------------------+---+

This gives all changes that happened after the beginTime commit, with a filter of fare > 20.0. The unique thing about this feature is that it lets you author streaming pipelines on batch data.
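As a quick follow-up check (my own addition), the incremental view loaded above can be summarized to confirm that it only contains rows from commits after beginTime:

// All rows in the incremental view should carry a commit time greater than beginTime.
spark.sql(
  """select _hoodie_commit_time, count(*) as changed_records
    |from hudi_incr_table
    |group by _hoodie_commit_time""".stripMargin).show(false)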

8.4.6 Point-in-time query
Let's look at how to query data as of a specific time. This is done by pointing the end time at a specific commit time and the begin time at "000" (denoting the earliest possible commit time).

scala> val beginTime = "000" // Represents all commits > this time.
beginTime: String = 000

scala> val endTime = commits(commits.length - 2) // commit time we are interested in
endTime: String = 20200331152807

scala> 

scala> // incrementally query the data

scala> val incViewDF = spark.read.format("org.apache.hudi").
     |     option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |     option(END_INSTANTTIME_OPT_KEY, endTime).
     |     load(basePath);
20/03/31 15:36:00 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200331152807| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200331152807| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200331152807| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200331152807| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200331152807|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200331152807| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200331152807|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200331152807| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+