- Install Scala: brew install scala
- Add Scala to the environment variables
- Install Spark: brew install spark
- Add Spark to the environment variables
- The version installed that way could not be used; install it with brew install apache-spark instead
- Update the environment variables, then use spark-shell to verify the installation succeeded
- Go into the sbin directory and start Spark with ./start-all.sh. Note the leading ./ : without it, the start-all.sh that gets run is Hadoop's
- Check the web UI at http://localhost:8080/
- Use the command
spark-shell --master spark://bogon:7077
to submit the Spark job to the master, then run the code:
sc.textFile("/Users/mac/Downloads/01.csv") .flatMap(line=>line.split(",")) .map(w=> (w, 1)) .reduceByKey(_+_) .map { case (word, count) => (count, word) } .sortByKey(true) .map { case (count, word) => (word, count) } .foreach(println)
- Try running it in IDEA; this requires adding the dependency in the pom:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.3.0</version>
</dependency>
- But the code would only run in local mode; every attempt to connect to the master failed with
StandaloneAppClient$ClientEndpoint: Failed to connect to master bogon:7077
- Some people said it was caused by the versions: check the version spark-shell reports, change spark-core_2.13 in the pom to spark-core_2.12, and change the project's Scala version from 2.13 to 2.12.15 (a quick version check is sketched below)
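A minimal check, assuming the spark-shell that is already connected to the cluster is still open (sc is predefined there):
println(sc.version)                           // Spark version of the installed shell, e.g. 3.3.0
println(scala.util.Properties.versionString)  // Scala version the shell runs on, e.g. version 2.12.15
The _2.12 / _2.13 suffix of the spark-core artifact (and the project's Scala version) has to match the Scala version printed here.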
- After running again, that error was fixed, but a new one appeared:
cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD
- Baidu results: when a standalone cluster is called remotely, the application jar has to be shipped to the workers; this can be done by calling setJars on the SparkConf to specify the jar and make sure it gets distributed (see the revised code below)
- But since the code is Scala, the pom also needs the following plugins section for the jar to be packaged successfully:
<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>scala-test-compile</id>
                    <phase>process-test-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <args>
                    <!-- compile against the jars under the libs directory; see mvn scala:help for details -->
                    <arg>-extdirs</arg>
                    <arg>${project.basedir}/libs</arg>
                </args>
                <scalaVersion>2.12.15</scalaVersion>
            </configuration>
        </plugin>
    </plugins>
</build>
- The code needs to be modified as well (run mvn package afterwards so that target/ly04-1.0-SNAPSHOT.jar exists for setJars); the revised code is as follows:
package com.daishu

import org.apache.spark.{SparkConf, SparkContext}

object Ly001 {
  def main(args: Array[String]): Unit = {
    val jars = List("/Users/mac/IdeaProjects/ly04/target/ly04-1.0-SNAPSHOT.jar")
    val conf = new SparkConf()
      .setAppName("ly001")
      .setJars(jars)
      .setMaster("spark://bogon:7077")
    val sc = new SparkContext(conf)
    sc.textFile("/Users/mac/Downloads/01.csv")
      .flatMap(line => line.split(","))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .map { case (word, count) => (count, word) }
      .sortByKey(true)
      .map { case (count, word) => (word, count) }
      .foreach(println)
  }
}
- And sure enough, a new error came up:
java.io.FileNotFoundException: /Users/mac/Downloads/01.csv (Operation not permitted)
- The reason is that the file being read is not under the project directory; put it under the project directory and change the file path, and it runs successfully:
sc.textFile("src/main/resources/01.csv")
Likewise, reading a file from HDFS only requires changing the path:
sc.textFile("hdfs://localhost:8020/user/hive/warehouse/ly001.db/dwd_mn_mkt_order_master_df_ds/ds=20230101/*")
- Try switching the execution engine in Hive from MR to Spark:
set hive.execution.engine=spark;
but watch out for version compatibility; it is better not to install too new a version
- To get Spark on YARN working, first configure the parameters below in spark-env.sh; once that is done you can use
spark-shell --master yarn
to start the shell on YARN; jobs run from it will then show up in YARN's web UI on port 8088 (a quick sanity check follows the config below)
export YARN_CONF_DIR=/usr/local/Cellar/hadoop/3.3.4/libexec/etc/hadoop
export HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/3.3.4/libexec/etc/hadoop
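A minimal sanity check, assuming the spark-shell --master yarn session from above (sc is predefined there): sc.master should report yarn, and the small job below should show up as an application in the YARN UI on port 8088.
println(sc.master)                          // should print: yarn
println(sc.parallelize(1 to 1000).count())  // a tiny job scheduled through YARN; prints 1000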