
What should I do about an error when running Flink 1.12 in Application Mode on Alibaba Cloud managed Kubernetes?


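However, the TaskManager containers start normally and the job then runs without problems. How should this error be handled? Is Flink incompatible with Alibaba Cloud ACK clusters?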



Launch command:

./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=demo \
    -Dkubernetes.container.image=xx.xx.xx/xx/xxx:2.0.12 \
    local:///opt/flink/usrlib/my-flink-job.jar

Log:

2021-03-01 04:52:06,518 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Job 6eb4027586e7137b20ecc8c3ce624417 is submitted.
2021-03-01 04:52:06,518 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Submitting Job with JobId=6eb4027586e7137b20ecc8c3ce624417.
2021-03-01 04:52:08,303 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Recovered 0 pods from previous attempts, current attempt id is 1.
2021-03-01 04:52:08,303 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
2021-03-01 04:52:08,306 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - ResourceManager akka.tcp://flink@demo.default:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 00000000000000000000000000000000
2021-03-01 04:52:08,310 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-03-01 04:52:08,596 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Unhandled exception
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_275]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_275]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_275]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_275]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_275]
    at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-03-01 04:52:08,596 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Unhandled exception
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_275]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_275]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_275]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_275]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_275]
    at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [flink-dist_2.12-1.12.0.jar:1.12.0]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]



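After that, the same exception keeps scrolling through the log without stopping.

*From the Flink mailing list archive compiled by volunteers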

彗星halation 2021-12-01 16:26:56
Answer from the Alibaba Cloud AI Assistant

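"Connection reset by peer" means that the remote side closed a network connection abruptly. In this log the WARN comes from MiniDispatcherRestEndpoint, the REST endpoint of the JobManager, so the reset connections are ones that some client opened to that endpoint. The error does not by itself mean that the Alibaba Cloud ACK (Container Service for Kubernetes) cluster is incompatible with Flink. Possible causes include network configuration, resource limits, and security policy settings. Some suggestions for troubleshooting:

  1. Check the network configuration: make sure traffic inside the cluster is not blocked by a firewall or a network policy. In the ACK cluster, check whether any NetworkPolicy restricts communication between Pods.

  2. Resource limits: check whether the TaskManager containers are being terminated by Kubernetes for lack of resources such as CPU or memory. A container that is killed drops its connections, which can show up as connection resets.

  3. Image: confirm that the Flink image xx.xx.xx/xx/xxx:2.0.12 is compatible with the ACK environment and that it pulls and runs without problems. Consider using an officially maintained or otherwise verified image version.

  4. Deeper log analysis: read the full TaskManager and JobManager logs, which may say more about why the connection was reset. The logs of each Pod are available in the ACK console.

  5. Flink configuration: review the Flink configuration, especially the network-related options such as heartbeat interval and timeouts, and make sure they suit your cluster.

  6. ACK cluster health: confirm that the ACK cluster itself is healthy, with no node failures or other system-level problems. The ACK console shows the cluster status.

  7. Java version: the log shows Java 8 (1.8.0_275). Flink 1.12 supports Java 8, so a version incompatibility is unlikely, although a specific JVM build can occasionally cause unexpected problems.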
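
Steps 1, 2, and 4 above can be sketched as a few shell helpers. This is a minimal sketch under stated assumptions, not a definitive procedure. The `default` namespace and the name `demo` come from the launch command and the `flink@demo.default` address in the log. The assumption that the JobManager runs as a Deployment named after the cluster ID, the `jm.log` file name, and the helper names `count_resets`, `summarize_warnings`, and `inspect_cluster` are illustrative and not part of the original post.

```shell
#!/bin/sh
# Hypothetical helpers for triaging the repeated WARN entries.

# Count the log lines that contain "Connection reset by peer".
count_resets() {
  # grep -c exits non-zero when the count is 0, so ignore its status.
  grep -c 'Connection reset by peer' "$1" || true
}

# Print "<count> <logger>" for every WARN entry, most frequent first.
# Assumes the log layout shown in the question:
#   <date> <time> <LEVEL> <logger> [] - <message>
summarize_warnings() {
  awk '$3 == "WARN" { print $4 }' "$1" \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Read-only kubectl checks. Nothing runs until the function is called.
# Arguments (both assumptions): namespace, JobManager Deployment name.
inspect_cluster() {
  ns="${1:-default}"
  name="${2:-demo}"
  kubectl get networkpolicy -n "$ns"     # step 1: policies that may block Pod traffic
  kubectl get pods -n "$ns" -o wide      # step 2: restarts or Pods that are not Running
  kubectl get svc -n "$ns"               # how the services, including REST, are exposed
  kubectl logs -n "$ns" "deployment/$name" > jm.log   # step 4: save the JobManager log
}
```

A typical session would call `inspect_cluster default demo` and then `summarize_warnings jm.log` to see whether the warnings all come from one logger. If `kubectl get svc` lists the REST service with type `LoadBalancer`, one hypothesis worth testing is that health probes from the load balancer open and reset TCP connections on the REST port, which would be consistent with a WARN from MiniDispatcherRestEndpoint that does not affect the job. Flink's `kubernetes.rest-service.exposed.type` option (for example `ClusterIP` or `NodePort`) changes how that service is exposed. Whether this removes the warning in this particular cluster is not confirmed by the post.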

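If these steps do not resolve the problem, look for related discussions in the Flink community or ask on the Alibaba Cloud developer forum, and include more detailed cluster configuration and error logs so that others can help.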
