解决Spark on YARN因虚拟内存超限导致的任务提交失败-开发者社区-阿里云

　无论用YARN cluster和YARN client来跑，均会出现如下问题。
[spark@master spark-1.6.1-bin-hadoop2.6]$ jps
2049 NameNode
2706 Jps
2372 ResourceManager
2660 Master
2203 SecondaryNameNode
[spark@master spark-1.6.1-bin-hadoop2.6]$ $SPARK_HOME/bin/spark-submit \
>  --master yarn\
>  --deploy-mode client \
>  --name javawordcount \
>  --num-executors 1 \
>  --driver-memory 512m \
>  --executor-memory 512m \
>  --executor-cores 1 \
>  --class zhouls.bigdata.MyJavaWordCount \
>  /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
>  hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
>  hdfs://master:9000/testspark/outData/MyJavaWordCount
17/03/30 20:36:57 INFO spark.SparkContext: Running Spark version 1.6.1
17/03/30 20:36:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/30 20:36:59 INFO spark.SecurityManager: Changing view acls to: spark
17/03/30 20:36:59 INFO spark.SecurityManager: Changing modify acls to: spark
17/03/30 20:36:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
17/03/30 20:37:01 INFO util.Utils: Successfully started service 'sparkDriver' on port 54074.
17/03/30 20:37:03 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/03/30 20:37:03 INFO Remoting: Starting remoting
17/03/30 20:37:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.80.10:52224]
17/03/30 20:37:04 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 52224.
17/03/30 20:37:04 INFO spark.SparkEnv: Registering MapOutputTracker
17/03/30 20:37:04 INFO spark.SparkEnv: Registering BlockManagerMaster
17/03/30 20:37:04 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-b6575213-cc8e-4a50-bc83-6ab089a65341
17/03/30 20:37:04 INFO storage.MemoryStore: MemoryStore started with capacity 146.2 MB
17/03/30 20:37:05 INFO spark.SparkEnv: Registering OutputCommitCoordinator

17/03/30 20:37:06 INFO server.Server: jetty-8.y.z-SNAPSHOT

17/03/30 20:37:06 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040

17/03/30 20:37:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.

17/03/30 20:37:06 INFO ui.SparkUI: Started SparkUI at http://192.168.80.10:4040

17/03/30 20:37:06 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fdfdb880-f6cf-47eb-8981-1176e657d466/httpd-f5d25b97-30bd-4f13-b925-d96026063a63

17/03/30 20:37:06 INFO spark.HttpServer: Starting HTTP Server

17/03/30 20:37:06 INFO server.Server: jetty-8.y.z-SNAPSHOT

17/03/30 20:37:06 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:54651

17/03/30 20:37:06 INFO util.Utils: Successfully started service 'HTTP file server' on port 54651.

17/03/30 20:37:07 INFO spark.SparkContext: Added JAR file:/home/spark/testspark/mySpark-1.0-SNAPSHOT.jar at http://192.168.80.10:54651/jars/mySpark-1.0-SNAPSHOT.jar with timestamp 1490877427613

17/03/30 20:37:08 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.80.10:8032

17/03/30 20:37:09 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers

17/03/30 20:37:09 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)

17/03/30 20:37:09 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead

17/03/30 20:37:09 INFO yarn.Client: Setting up container launch context for our AM

17/03/30 20:37:09 INFO yarn.Client: Setting up the launch environment for our AM container

17/03/30 20:37:09 INFO yarn.Client: Preparing resources for our AM container

17/03/30 20:37:14 INFO yarn.Client: Uploading resource file:/usr/local/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar -> hdfs://master:9000/user/spark/.sparkStaging/application_1490877371054_0001/spark-assembly-1.6.1-hadoop2.6.0.jar

17/03/30 20:37:36 INFO yarn.Client: Uploading resource file:/tmp/spark-fdfdb880-f6cf-47eb-8981-1176e657d466/__spark_conf__3748671039525906996.zip -> hdfs://master:9000/user/spark/.sparkStaging/application_1490877371054_0001/__spark_conf__3748671039525906996.zip

17/03/30 20:37:38 INFO spark.SecurityManager: Changing view acls to: spark

17/03/30 20:37:38 INFO spark.SecurityManager: Changing modify acls to: spark

17/03/30 20:37:38 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)

17/03/30 20:37:38 INFO yarn.Client: Submitting application 1 to ResourceManager

17/03/30 20:37:39 INFO impl.YarnClientImpl: Submitted application application_1490877371054_0001

17/03/30 20:37:40 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:40 INFO yarn.Client: 

     client token: N/A     client token: N/A

     diagnostics: N/A

     ApplicationMaster host: N/A

     ApplicationMaster RPC port: -1

     queue: default

     start time: 1490877458881

     final status: UNDEFINED

     tracking URL: http://master:8088/proxy/application_1490877371054_0001/

     user: spark

17/03/30 20:37:41 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:42 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:43 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:44 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:45 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:46 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:47 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:48 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:49 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:50 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:51 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:52 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:53 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:54 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:55 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:56 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:57 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:58 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:37:59 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:00 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:01 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:02 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:03 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:04 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:05 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:06 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:07 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:08 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:09 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:10 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:12 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:13 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:14 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:15 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:16 INFO yarn.Client: Application report for application_1490877371054_0001 (state: ACCEPTED)

17/03/30 20:38:17 INFO yarn.Client: Application report for application_1490877371054_0001 (state: FAILED)

17/03/30 20:38:17 INFO yarn.Client: 

     client token: N/A

     diagnostics: Application application_1490877371054_0001 failed 2 times due to AM Container for appattempt_1490877371054_0001_000002 exited with  exitCode: -103

For more detailed output, check application tracking page:http://master:8088/proxy/application_1490877371054_0001/Then, click on links to logs of each attempt.

Diagnostics: Container [pid=2417,containerID=container_1490877371054_0001_02_000001] is running beyond virtual memory limits. Current usage: 79.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.

Dump of the process-tree for container_1490877371054_0001_02_000001 :

    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

    |- 2421 2417 2417 2417 (java) 283 147 2256482304 19967 /usr/local/jdk/jdk1.8.0_60/bin/java -server -Xmx512m -Djava.io.tmpdir=/usr/local/hadoop/hadoop-2.6.0/tmp/nm-local-dir/usercache/spark/appcache/application_1490877371054_0001/container_1490877371054_0001_02_000001/tmp -Dspark.yarn.app.container.log.dir=/usr/local/hadoop/hadoop-2.6.0/logs/userlogs/application_1490877371054_0001/container_1490877371054_0001_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 192.168.80.10:54074 --executor-memory 512m --executor-cores 1 --properties-file /usr/local/hadoop/hadoop-2.6.0/tmp/nm-local-dir/usercache/spark/appcache/application_1490877371054_0001/container_1490877371054_0001_02_000001/__spark_conf__/__spark_conf__.properties 

    |- 2417 2415 2417 2417 (bash) 0 1 108650496 305 /bin/bash -c /usr/local/jdk/jdk1.8.0_60/bin/java -server -Xmx512m -Djava.io.tmpdir=/usr/local/hadoop/hadoop-2.6.0/tmp/nm-local-dir/usercache/spark/appcache/application_1490877371054_0001/container_1490877371054_0001_02_000001/tmp -Dspark.yarn.app.container.log.dir=/usr/local/hadoop/hadoop-2.6.0/logs/userlogs/application_1490877371054_0001/container_1490877371054_0001_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '192.168.80.10:54074' --executor-memory 512m --executor-cores 1 --properties-file /usr/local/hadoop/hadoop-2.6.0/tmp/nm-local-dir/usercache/spark/appcache/application_1490877371054_0001/container_1490877371054_0001_02_000001/__spark_conf__/__spark_conf__.properties 1> /usr/local/hadoop/hadoop-2.6.0/logs/userlogs/application_1490877371054_0001/container_1490877371054_0001_02_000001/stdout 2> /usr/local/hadoop/hadoop-2.6.0/logs/userlogs/application_1490877371054_0001/container_1490877371054_0001_02_000001/stderr 



Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

Failing this attempt. Failing the application.

     ApplicationMaster host: N/A

     ApplicationMaster RPC port: -1

     queue: default

     start time: 1490877458881

     final status: FAILED

     tracking URL: http://master:8088/cluster/app/application_1490877371054_0001

     user: spark

17/03/30 20:38:17 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1490877371054_0001

17/03/30 20:38:17 ERROR spark.SparkContext: Error initializing SparkContext.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)

    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)

    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)

    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)

    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)

    at zhouls.bigdata.MyJavaWordCount.main(MyJavaWordCount.java:31)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:497)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}

17/03/30 20:38:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}

17/03/30 20:38:17 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.80.10:4040

17/03/30 20:38:17 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors

17/03/30 20:38:17 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down

17/03/30 20:38:17 INFO cluster.YarnClientSchedulerBackend: Stopped

17/03/30 20:38:17 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

17/03/30 20:38:17 INFO storage.MemoryStore: MemoryStore cleared

17/03/30 20:38:17 INFO storage.BlockManager: BlockManager stopped

17/03/30 20:38:17 INFO storage.BlockManagerMaster: BlockManagerMaster stopped

17/03/30 20:38:17 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running

17/03/30 20:38:17 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

17/03/30 20:38:17 INFO spark.SparkContext: Successfully stopped SparkContext

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)

    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)

    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)

    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)

    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)

    at zhouls.bigdata.MyJavaWordCount.main(MyJavaWordCount.java:31)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:497)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/03/30 20:38:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

17/03/30 20:38:18 INFO util.ShutdownHookManager: Shutdown hook called

17/03/30 20:38:18 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

17/03/30 20:38:18 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fdfdb880-f6cf-47eb-8981-1176e657d466

17/03/30 20:38:18 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fdfdb880-f6cf-47eb-8981-1176e657d466/httpd-f5d25b97-30bd-4f13-b925-d96026063a63

[spark@master spark-1.6.1-bin-hadoop2.6]
解决思路
　　第一种解决版本：首先想到是集群中内存资源不足，可以检查下每台机器是否有足够剩余内存( free -g)；也可能是其他已经提交的Spark应用占了大部分资源；
　　第二种解决办法：如果1>正常，我们可以看看YARN集群是否启动成功。注意“坑”可能就在这里：即使Slave上的nodemanager进程存在，要注意检查resource manager日志，看看各个node manager是否启动成功，我的问题就出现在这里：进程在，但是日志显示node manager状态为UNHEALTHY，所以YARN集群能识别到的总内存资源为0。。。
　　检查了UNHEALTHY的原因，是因为/tmp下一个目录被识别为bad, 因为是临时目录，我把每个node manager的对应目录删掉，然后重启YARN集群，最终问题解决。
本文转自大数据躺过的坑博客园博客，原文链接：http://www.cnblogs.com/zlslch/p/6645549.html，如需转载请自行联系原作者
Spark通过YARN提交任务不成功（包含YARN cluster和YARN client)

热门文章

最新文章

相关课程

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

Spark通过YARN提交任务不成功（包含YARN cluster和YARN client)

热门文章

最新文章

相关课程

相关电子书