开发者社区 > 大数据与机器学习 > 实时计算 Flink > 正文

Flink在HA模式,重启ZK集群,客户端任务提交异常(疑似脑裂)

基本信息

  • Flink 1.11.6版本,Standalone HA模式, 滚动重启了ZK集群
  • 在Flink集群的一个节点上使用flink run 命令提交多个任务

异常现象

  • 部分任务提交失败,异常信息如下
 [Flink-DispatcherRestEndpoint-thread-2] - [WARN ] - [org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(line:290)] - Could not create remote rpc invocation message. Failing rpc invocation because...
java.io.IOException: The rpc invocation size 12532388 exceeds the maximum akka framesize.
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(AkkaInvocationHandler.java:283) ~[flink-dist_2.12-1.11.6.jar:1.11.6]
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:210) ~[flink-dist_2.12-1.11.6.jar:1.11.6]
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:133) ~[flink-dist_2.12-1.11.6.jar:1.11.6]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaInvocationHandler.invoke(FencedAkkaInvocationHandler.java:88) ~[flink-dist_2.12-1.11.6.jar:1.11.6]
    at com.sun.proxy.$Proxy36.submitJob(Unknown Source) ~[?:?]
    at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$0(JobSubmitHandler.java:123) ~[flink-dist_2.12-1.11.6.jar:1.11.6]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) [?:?]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) [?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

日志信息

  • Flink集群中有两个节点(A和B)接收到了Job提交请求,两个节点的日志中均有如下信息

    [flink-akka.actor.default-dispatcher-33] - [INFO ] - [org.apache.flink.runtime.jobmaster.JobMaster.connectToResourceManager(line:1107)] - Connecting to ResourceManager akka.tcp://flink@X.X.X.X:46746/user/rpc/resourcemanager_0(ad84d46e902e0cf6da92179447af4e00)
    
  • 观察各个JobManager节点日志,有4个节点均出现Start SessionDispatcherLeaderProcess日志,但几乎都跟随了Stopping SessionDispatcherLeaderProcess日志,节点A与B没有出现stop; 其中A节点有明确的获得主点角色相关信息
    ```
    17:19:45,433 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - [org.apache.flink.runtime.resourcemanager.ResourceManager.tryAcceptLeadership(line:1118)] - ResourceManager akka.tcp://flink@XXX.XX:46746/user/rpc/resourcemanager_0 was granted leadership with fencing token ad84d46e902e0cf6da92179447af4e00

    17:19:45,434 - [main-EventThread] - [INFO ] - [org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.grantLeadership(line:931)] - http://XXX:XXXwas granted leadership with leaderSessionID=f60df688-372d-416b-a965-989a59b37feb

    17:19:45,437 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - [org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.start(line:287)] - Starting the SlotManager.

    17:19:45,480 - [main-EventThread] - [INFO ] - [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)] - Start SessionDispatcherLeaderProcess.

    17:19:45,489 - [cluster-io-thread-1] - [INFO ] - [org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(line:232)] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .

    17:19:45,495 - [flink-akka.actor.default-dispatcher-23] - [INFO ] - [org.apache.flink.runtime.resourcemanager.ResourceManager.registerTaskExecutorInternal(line:891)] - Registering TaskManager with ResourceID XXXXXX (akka.tcp://flink@X:XX/user/rpc/taskmanager_0) at ResourceManager

```

疑问

  • 正常情况下flink run提交到集群的任务应该只会提交同一个主JobManager节点
  • 以上描述的失败场景是否为Flink的BUG

展开
收起
Castle 2024-07-11 15:59:44 89 0
0 条回答
写回答
取消 提交回答

实时计算Flink版是阿里云提供的全托管Serverless Flink云服务,基于 Apache Flink 构建的企业级、高性能实时大数据处理系统。提供全托管版 Flink 集群和引擎,提高作业开发运维效率。

相关产品

  • 实时计算 Flink版
  • 相关电子书

    更多
    Flink CDC Meetup PPT - 龚中强 立即下载
    Flink CDC Meetup PPT - 王赫 立即下载
    Flink CDC Meetup PPT - 覃立辉 立即下载