
Flink 1.11.1: problem running an Application Mode job on a K8s cluster

A Flink 1.11.1 job running in Application Mode on a K8s cluster has its JobManager restart once every hour, with the error [Fatal error occurred in ResourceManager. io.fabric8.kubernetes.client.KubernetesClientException: too old resource version].

Pod restarts: http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg

Restart cause:

2020-12-10 07:21:19,290 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
2020-12-10 07:21:19,291 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

From what I found online, the cause lies at line 212 of the class org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient:

    @Override
    public KubernetesWatch watchPodsAndDoCallback(
            Map<String, String> labels,
            PodCallbackHandler podCallbackHandler) {
        return new KubernetesWatch(
            this.internalClient.pods()
                .withLabels(labels)
                .watch(new KubernetesPodsWatcher(podCallbackHandler)));
    }

However, etcd only retains version information for a limited period of time:

"I think it's standard behavior of Kubernetes to give 410 after some time during watch. It's usually client's responsibility to handle it. In the context of a watch, it will return HTTP_GONE when you ask to see changes for a resourceVersion that is too old - i.e. when it can no longer tell you what has changed since that version, since too many things have changed. In that case, you'll need to start again, by not specifying a resourceVersion in which case the watch will send you the current state of the thing you are watching and then send updates from that point."
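The recovery strategy described in the quote (on HTTP 410, start the watch again without a resourceVersion) can be sketched roughly as below. This is only an illustration under stated assumptions, not Flink's or fabric8's actual code: `WatchSource`, `WatchClosedException`, and `runWithRewatch` are hypothetical stand-ins invented for this example, and the fake source in `main` simply simulates an API server that rejects any explicit resourceVersion as too old.

```java
import java.util.ArrayList;
import java.util.List;

class RewatchSketch {
    static final int HTTP_GONE = 410;

    /** Hypothetical stand-in for a watch failure that carries an HTTP status code. */
    static class WatchClosedException extends RuntimeException {
        final int code;
        WatchClosedException(int code) {
            super("watch closed, code=" + code);
            this.code = code;
        }
    }

    /**
     * Hypothetical watch source. With a non-null resourceVersion it streams
     * changes since that version; with null it streams the current state
     * first and then subsequent updates.
     */
    interface WatchSource {
        void watch(String resourceVersion, List<String> sink);
    }

    /**
     * Runs the watch and, on 410 Gone, restarts it without a resourceVersion.
     * Any other failure, or exceeding maxRestarts, is rethrown to the caller.
     * Returns the number of restarts that were needed.
     */
    static int runWithRewatch(WatchSource source, String startVersion,
                              List<String> sink, int maxRestarts) {
        String version = startVersion;
        int restarts = 0;
        while (true) {
            try {
                source.watch(version, sink);
                return restarts;
            } catch (WatchClosedException e) {
                if (e.code != HTTP_GONE || restarts >= maxRestarts) {
                    throw e; // not recoverable by re-watching
                }
                version = null; // start again from the current state
                restarts++;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated server: any explicit resourceVersion is treated as too old.
        WatchSource fake = (rv, sink) -> {
            if (rv != null) {
                throw new WatchClosedException(HTTP_GONE);
            }
            sink.add("current-state");
            sink.add("update-1");
        };
        List<String> events = new ArrayList<>();
        int restarts = runWithRewatch(fake, "247468999", events, 3);
        System.out.println(restarts + " " + events);
    }
}
```

The restart cap is a design choice for the sketch: a client that re-watches forever could hide a persistent API server problem, so failures other than 410 and repeated 410s are still surfaced to the caller.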

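Has anyone run into the same problem, and how did you handle it? I have a few possible approaches and would like to discuss them with everyone. *From the Flink mailing list archive compiled by volunteers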

Asked by 又出bug了-- at 2021-12-02 11:51:04
1 answer
  • I previously created a JIRA ticket to track the "too old resource version" problem [1].

    Flink currently uses a Watcher to monitor Pod status changes. When the Watcher is closed abnormally, it triggers a fatal error, which in turn causes the JobManager to restart.

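    I ran some concrete tests on minikube, a self-built K8s cluster, and an Alibaba Cloud ACK cluster; in each case the job ran stably for more than a week. I could only reproduce the problem by restarting the K8s APIServer. So I suspect that the network between your Pod and the APIServer is unstable, which would explain why the problem shows up so often.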

    [1]. https://issues.apache.org/jira/browse/FLINK-20417 *From the Flink mailing list archive compiled by volunteers

    2021-12-02 14:41:31