spark streaming job运行卡住-问答-阿里云开发者社区-阿里云

开发者社区> 问答> 正文

spark streaming job运行卡住

来自:Apache Spark 中国技术社区 2021-01-21 10:52:16 1310 1

QQ20210121-0.png QQ20210121-1.png

提交任务到yarn 显示running 我这个是新的groupId 在kafka上无法查询到group 再spark页面如上图所示,在yarn界面 卡住,没有任何日志打印

取消 提交回答
全部回答(1)
  • 游客2q7uranxketok
    2021-02-24 18:07:03

    启动8个task的日志:

    2016-06-18 10:16,167 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 0.0 in stage 293749.0 (TID 1356848, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,167 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 1.0 in stage 293749.0 (TID 1356849, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,168 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 2.0 in stage 293749.0 (TID 1356850, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 3.0 in stage 293749.0 (TID 1356851, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 4.0 in stage 293749.0 (TID 1356852, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 5.0 in stage 293749.0 (TID 1356853, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,170 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 6.0 in stage 293749.0 (TID 1356854, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,170 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 7.0 in stage 293749.0 (TID 1356855, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) |

    2、7个task任务完成(task 1356848 丢失):

    org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,468 | INFO | [task-result-getter-2] | Finished task 1.0 in stage 293749.0 (TID 1356849) in 301 ms on scdxdsjkafka1 (1/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,472 | INFO | [task-result-getter-3] | Finished task 5.0 in stage 293749.0 (TID 1356853) in 303 ms on scdxdsjkafka2 (2/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,474 | INFO | [task-result-getter-1] | Finished task 3.0 in stage 293749.0 (TID 1356851) in 306 ms on scdxdsjkafka1 (3/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,480 | INFO | [task-result-getter-0] | Finished task 7.0 in stage 293749.0 (TID 1356855) in 310 ms on scdxdsjkafka3 (4/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,485 | INFO | [task-result-getter-2] | Finished task 6.0 in stage 293749.0 (TID 1356854) in 315 ms on scdxdsjkafka2 (5/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,494 | INFO | [task-result-getter-3] | Finished task 4.0 in stage 293749.0 (TID 1356852) in 325 ms on scdxdsjkafka3 (6/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,511 | INFO | [task-result-getter-1] | Finished task 2.0 in stage 293749.0 (TID 1356850) in 343 ms on scdxdsjkafka1 (7/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,027 | INFO | [JobGenerator] | Added jobs for time 1466216220000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,035 | INFO | [JobGenerator] | Added jobs for time 1466216230000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    问题出现时的截图(一个task阻塞,阻塞6个小时):

    20161017113556297001.png

    Task统计信息中缺少一个task:

    http://hi3ms-image.huawei.com/hi/showimage-1430645815-408595-88bbc1a0e3abcd228e1a3bb4431202ce.jpg

    解决方法

    1、 采取推测机制: 在spark-default.conf 中添加:spark.speculation true 配置关于推测机制的三个参数如下:

    l spark.speculation.interval 100:检测周期,单位毫秒;

    l spark.speculation.quantile 0.75:完成task的百分比时启动推测;

    l spark.speculation.multiplier 1.5:比其他的慢多少倍时启动推测。

    2、设置 spark.streaming.concurrentJobs 4

    开启推测机制后,如果集群中,某一台机器的几个task特别慢,推测机制会将任务分配到其他机器执行,最后Spark会选取最快的作为最终结果。

    spark.streaming.concurrentJobs可以控制job并发度,默认是1,某一个Job卡死之后,会影响后续Job的执行。现在将其设置为4,如果某个Job卡死,不会影响后续Job的执行。

    技术角度分析,设置上述两个参数之后,可以规避某个Task挂死的问题。因为此问题,现网也是运行3-4天才出现的随机问题,无法立即复现验证,目前只能采取这种规避方案。

    0 0
相关问答

1

回答

spark streaming job运行卡住

2018-12-11 01:52:01 6365浏览量 回答数 1

1

回答

Spark Streaming作业运行一段时间后无故结束?

2020-03-20 10:02:00 430浏览量 回答数 1

1

回答

Spark Streaming 作业运行一段时间后无故结束

2019-04-26 15:25:02 1834浏览量 回答数 1

1

回答

Spark Streaming 作业运行一段时间后无故结束

2019-04-26 15:30:22 1625浏览量 回答数 1

1

回答

spark中RDD的特点有哪些?

2021-12-10 13:13:54 168浏览量 回答数 1

1

回答

spark中RDD之所以为“弹性”的特点的原因有什么?

2021-12-10 13:15:08 181浏览量 回答数 1

1

回答

Spark中的产生RDD的原因是什么?

2021-12-10 14:03:19 179浏览量 回答数 1

1

回答

Spark和RDD的关系是什么?

2021-12-10 14:05:37 218浏览量 回答数 1

1

回答

Spark RDD是怎么容错的,基本原理是什么?

2021-12-08 21:16:08 105浏览量 回答数 1

1

回答

Spark和RDD的关系是怎样的?

2021-12-06 20:40:40 270浏览量 回答数 1
0
文章
1
问答
问答排行榜
最热
最新
相关电子书
更多
低代码开发师(初级)实战教程
立即下载
阿里巴巴DevOps 最佳实践手册
立即下载
冬季实战营第三期:MySQL数据库进阶实战
立即下载