亢海鹏
2021-01-07
MaxCompute Spark is the open-source-compatible Spark computing service provided by MaxCompute. Built on top of a unified system of compute resources and dataset permissions, it offers the Spark computing framework and lets users submit and run Spark jobs in the development style they are already familiar with, covering a wider range of data processing and analysis scenarios.
```bash
## JAVA_HOME configuration
# JDK 1.8 is recommended
export JAVA_HOME=/path/to/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH

## SPARK_HOME configuration
# Download the MaxCompute Spark client mentioned above and extract it to any local path.
# Do not set SPARK_HOME to the path below as-is; it is shown for illustration only.
# Point it to the actual extracted path.
export SPARK_HOME=/path/to/spark_extracted_package
export PATH=$SPARK_HOME/bin:$PATH

## Python version used by PySpark
export PATH=/path/to/python/bin/:$PATH
```
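Before jobs can be submitted to MaxCompute, the client also needs the project and account settings listed in the parameter reference below, typically placed in `$SPARK_HOME/conf/spark-defaults.conf`. A minimal sketch with placeholder values only (project name, AccessKey, and endpoint must be replaced with your own):

```
# $SPARK_HOME/conf/spark-defaults.conf -- placeholder values only
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_access_key_id
spark.hadoop.odps.access.key = your_access_key_secret
spark.hadoop.odps.end.point = http://service.<region>.maxcompute.aliyun.com/api
```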
```bash
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd MaxCompute-Spark/spark-2.x
mvn clean package
```
```bash
# In a bash environment
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster \
  --class com.aliyun.odps.spark.examples.SparkPi \
  /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

# Command for submitting from a Windows environment
cd $SPARK_HOME/bin
spark-submit.cmd --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \path\to\MaxCompute-Spark\spark-2.x\target\spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
```
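The class submitted above is an ordinary self-contained Spark application. As a point of reference, here is a minimal sketch of what a SparkPi-style entry class looks like; this is illustrative only and not the actual source of com.aliyun.odps.spark.examples.SparkPi (package, object name, and sample count are placeholders):

```scala
package com.example.sketch

import org.apache.spark.sql.SparkSession

object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    // The same SparkSession entry point works in yarn-cluster mode on MaxCompute and in local mode.
    val spark = SparkSession.builder().appName("SparkPiSketch").getOrCreate()
    val sc = spark.sparkContext

    val samples = 100000
    // Monte-Carlo estimate of pi: count random points that fall inside the unit circle.
    val inside = sc.parallelize(1 to samples).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * inside / samples}")
    spark.stop()
  }
}
```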
```bash
## Java/Scala
cd $SPARK_HOME
./bin/spark-submit --master local[4] \
  --class com.aliyun.odps.spark.examples.SparkPi \
  /path/to/odps-spark-examples/spark-examples/target/spark-examples-2.0.0-SNAPSHOT-shaded.jar

## PySpark
cd $SPARK_HOME
./bin/spark-submit --master local[4] \
  /path/to/odps-spark-examples/spark-examples/src/main/python/odps_table_rw.py
```
```
The value of spark.sql.catalogImplementation should be one of hive, in-memory, but was odps
```
| ODPS SPARK node field | spark-submit equivalent |
| --- | --- |
| Main Java/Python resource | app jar or python file |
| Configuration items | --conf PROP=VALUE |
| Main Class | --class CLASS_NAME |
| Arguments | [app arguments] |
| JAR resources | --jars JARS |
| Python resources | --py-files PY_FILES |
| File resources | --files FILES |
| Archives resources | --archives |
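Put together, the node fields above map onto a spark-submit invocation roughly like the following; all paths, class names, and arguments are placeholders (for a PySpark job, the main resource would be a .py file and --py-files would carry the additional Python resources):

```bash
cd $SPARK_HOME
# --class        -> "Main Class"
# --conf         -> "Configuration items"
# --jars         -> "JAR resources"
# --files        -> "File resources"
# --archives     -> "Archives resources"
# app.jar + args -> main Java resource and "Arguments"
bin/spark-submit --master yarn-cluster \
  --class com.example.Main \
  --conf spark.executor.instances=4 \
  --jars /path/to/dep.jar \
  --files /path/to/lookup.txt \
  --archives /path/to/env.tar.gz \
  /path/to/app.jar arg1 arg2
```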
| Parameter | Description |
| --- | --- |
| spark.executor.instances | Number of executors requested for the job |
| spark.executor.cores | Number of CPU cores per executor |
| spark.executor.memory | Heap memory per executor |
| spark.yarn.executor.memoryOverhead | Off-heap memory per executor |
| spark.driver.cores | Number of CPU cores for the driver |
| spark.driver.memory | Heap memory for the driver |
| spark.yarn.driver.memoryOverhead | Off-heap memory for the driver |
| spark.driver.maxResultSize | Upper limit on the total size of serialized results collected back to the driver |
| spark.hadoop.odps.cupid.disk.driver.device_size | |
| spark.hadoop.odps.project.name | MaxCompute project that the job runs in |
| spark.hadoop.odps.access.id | Alibaba Cloud AccessKey ID |
| spark.hadoop.odps.access.key | Alibaba Cloud AccessKey Secret |
| spark.hadoop.odps.end.point | MaxCompute service endpoint |
| spark.hadoop.odps.runtime.end.point | |
| spark.hadoop.odps.task.major.version | |
| spark.sql.catalogImplementation | Spark SQL catalog implementation |
| spark.hadoop.odps.cupid.resources | MaxCompute resources referenced by the job, in the format <projectname>.<resourcename>; separate multiple resources with commas |
| spark.hadoop.odps.cupid.vectorization.enable | |
| spark.hadoop.odps.input.split.size | |
| spark.hadoop.odps.cupid.vpc.domain.list | |
| spark.hadoop.odps.cupid.smartnat.enable | |
| spark.hadoop.odps.cupid.eni.enable | |
| spark.hadoop.odps.cupid.eni.info | |
| spark.hadoop.odps.cupid.engine.running.type | |
| spark.hadoop.odps.cupid.job.capability.duration.hours | |
| spark.hadoop.odps.moye.trackurl.dutation | |
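Of the parameters above, the resource-related settings are the ones most jobs need to adjust first. A minimal sketch with purely illustrative values (tune them to the actual workload):

```
spark.executor.instances = 10
spark.executor.cores = 2
spark.executor.memory = 4g
# Off-heap memory, in MB
spark.yarn.executor.memoryOverhead = 1024
spark.driver.cores = 1
spark.driver.memory = 4g
```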
```
## Public resource: Python 2.7.13
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python

## Public resource: Python 3.7.9
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
```
(1) Upload the file through the MaxCompute client (a single file of up to 500 MB is supported).
(2) Add the spark.hadoop.odps.cupid.resources parameter to the Spark job configuration. Its format is <projectname>.<resourcename>; to reference multiple files, separate them with commas.
(3) To rename a resource, use the format <projectname>.<resourcename>:<new resource name> (see the configuration sketch after this list).
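For example, referencing two uploaded files and renaming one of them might look like the following (project and resource names are hypothetical):

```
## Reference multiple resources, comma separated
spark.hadoop.odps.cupid.resources = my_project.data_a.txt,my_project.data_b.zip

## Reference a resource and rename it in the job's working directory
spark.hadoop.odps.cupid.resources = my_project.data_a.txt:a.txt
```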
```scala
import java.io.File

val dir = new File(".")
// fileName is the (possibly renamed) resource file name in the job's working directory
val targetFile = "file://" + dir.getCanonicalPath + "/" + fileName
```
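The constructed path points at a plain local file in the container's working directory, so it can be read with ordinary file APIs. A minimal sketch, assuming a hypothetical resource named a.txt:

```scala
import java.io.File
import scala.io.Source

// Hypothetical resource name distributed via spark.hadoop.odps.cupid.resources
val fileName = "a.txt"
val content = Source.fromFile(new File(".", fileName)).mkString
```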
(1) The driver side usually prints logs like the following.
(2) In Logview, only the driver is visible and the number of workers is 0.
(3) In the Spark UI, only the driver is visible and the number of workers is 0.
```
spark.hadoop.odps.cupid.resources = public.__spark_libs__2.3.0odps0.32.5.zip
spark.driver.extraClassPath = ./public.__spark_libs__2.3.0odps0.32.5.zip/*
spark.executor.extraClassPath = ./public.__spark_libs__2.3.0odps0.32.5.zip/*
```
(1) Run `kill <instanceId>;` from the odps cmd client.
(2) Click Stop in the DataWorks UI.
```python
# -*- coding: utf-8 -*-
import sys

# Python 2 only: reset the default encoding to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')
```