
Spark Streaming job stuck while running

1991201895094906 2021-01-21 10:52:16

[Attached screenshots: QQ20210121-0.png, QQ20210121-1.png]

I submitted the job to YARN and it shows as RUNNING. I am using a brand-new groupId, and the group cannot be found on Kafka. The Spark UI looks as in the screenshots above, while on the YARN side the job is stuck and nothing is printed to the logs.
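
As a side note, whether a consumer group is visible on the Kafka side can be checked with Kafka's own command-line tool. A minimal sketch, assuming a broker reachable at kafka-broker:9092 and a groupId of my-new-group-id (both placeholders):

    # List all consumer groups known to the brokers
    bin/kafka-consumer-groups.sh --bootstrap-server kafka-broker:9092 --list

    # Show partition offsets and lag for one group
    bin/kafka-consumer-groups.sh --bootstrap-server kafka-broker:9092 \
      --describe --group my-new-group-id

With Spark's direct Kafka stream, offsets are often tracked in Spark's own checkpoints rather than committed to Kafka, so a brand-new groupId may legitimately not show up until offsets are actually committed.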

Tags: message middleware, distributed computing, resource scheduling, Kafka, Spark, stream computing
All Answers (1)
  • 游客2q7uranxketok
    2021-02-24 18:07:03

    1. Log of the 8 tasks being launched:

    2016-06-18 10:16,167 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 0.0 in stage 293749.0 (TID 1356848, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,167 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 1.0 in stage 293749.0 (TID 1356849, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,168 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 2.0 in stage 293749.0 (TID 1356850, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 3.0 in stage 293749.0 (TID 1356851, scdxdsjkafka1, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 4.0 in stage 293749.0 (TID 1356852, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,169 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 5.0 in stage 293749.0 (TID 1356853, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,170 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 6.0 in stage 293749.0 (TID 1356854, scdxdsjkafka2, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,170 | INFO | [sparkDriver-akka.actor.default-dispatcher-16] | Starting task 7.0 in stage 293749.0 (TID 1356855, scdxdsjkafka3, PROCESS_LOCAL, 1165 bytes) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2. Log of 7 tasks finishing (task TID 1356848 went missing):

    2016-06-18 10:16,468 | INFO | [task-result-getter-2] | Finished task 1.0 in stage 293749.0 (TID 1356849) in 301 ms on scdxdsjkafka1 (1/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,472 | INFO | [task-result-getter-3] | Finished task 5.0 in stage 293749.0 (TID 1356853) in 303 ms on scdxdsjkafka2 (2/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,474 | INFO | [task-result-getter-1] | Finished task 3.0 in stage 293749.0 (TID 1356851) in 306 ms on scdxdsjkafka1 (3/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,480 | INFO | [task-result-getter-0] | Finished task 7.0 in stage 293749.0 (TID 1356855) in 310 ms on scdxdsjkafka3 (4/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,485 | INFO | [task-result-getter-2] | Finished task 6.0 in stage 293749.0 (TID 1356854) in 315 ms on scdxdsjkafka2 (5/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,494 | INFO | [task-result-getter-3] | Finished task 4.0 in stage 293749.0 (TID 1356852) in 325 ms on scdxdsjkafka3 (6/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,511 | INFO | [task-result-getter-1] | Finished task 2.0 in stage 293749.0 (TID 1356850) in 343 ms on scdxdsjkafka1 (7/8) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,027 | INFO | [JobGenerator] | Added jobs for time 1466216220000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    2016-06-18 10:16,035 | INFO | [JobGenerator] | Added jobs for time 1466216230000 ms | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    Screenshot taken when the problem occurred (one task blocked, and it stayed blocked for 6 hours):

    [Screenshot: 20161017113556297001.png]

    One task is missing from the task statistics:

    [Screenshot: http://hi3ms-image.huawei.com/hi/showimage-1430645815-408595-88bbc1a0e3abcd228e1a3bb4431202ce.jpg]

    Solution

    1. Enable the speculation mechanism: add spark.speculation true to spark-defaults.conf. The three speculation-related parameters are configured as follows:

    • spark.speculation.interval 100: how often to check for speculatable tasks, in milliseconds;

    • spark.speculation.quantile 0.75: the fraction of tasks that must be complete before speculation is started;

    • spark.speculation.multiplier 1.5: how many times slower than the median a task must be before it is considered for speculation.

    2. Set spark.streaming.concurrentJobs to 4; a combined configuration sketch follows below.
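
    Putting the two recommendations together, a minimal spark-defaults.conf sketch could look like this (values taken from this answer; tune them for your own cluster):

        # spark-defaults.conf (sketch, values from the workaround above)
        spark.speculation               true
        spark.speculation.interval      100
        spark.speculation.quantile      0.75
        spark.speculation.multiplier    1.5
        spark.streaming.concurrentJobs  4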

    With speculation enabled, if a few tasks on one machine in the cluster run particularly slowly, the speculation mechanism re-launches those tasks on other machines, and Spark takes whichever copy finishes first as the final result.

    spark.streaming.concurrentJobs controls the job concurrency; the default is 1, so once one job hangs, all subsequent jobs are blocked behind it. Setting it to 4 means a hung job no longer blocks the execution of later jobs.

    From a technical standpoint, setting these two parameters works around the stuck-task problem. Because the issue only shows up in production after 3-4 days of running and cannot be reproduced on demand for verification, this workaround is currently the only option.
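
    For jobs configured in code rather than in spark-defaults.conf, the same settings can be applied on the SparkConf before the StreamingContext is created. A minimal Scala sketch, with the app name and 10-second batch interval chosen purely for illustration:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object SpeculativeStreamingApp {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("speculative-streaming-demo") // placeholder name
              // Re-launch suspiciously slow tasks on other executors:
              .set("spark.speculation", "true")
              .set("spark.speculation.interval", "100")   // check every 100 ms
              .set("spark.speculation.quantile", "0.75")  // once 75% of tasks finish
              .set("spark.speculation.multiplier", "1.5") // 1.5x slower than median
              // Allow up to 4 concurrent streaming jobs so that one hung job
              // does not block the jobs queued behind it:
              .set("spark.streaming.concurrentJobs", "4")

            val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval
            // ... create the Kafka DStream and processing here; at least one
            // output operation (e.g. foreachRDD) must be registered ...
            ssc.start()
            ssc.awaitTermination()
          }
        }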


