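However, the TaskManager containers start normally and subsequent jobs run fine. How should this error be handled? Is Flink incompatible with Alibaba Cloud ACK clusters?
Start command: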
./bin/flink run-application \
--target kubernetes-application \
-Dkubernetes.cluster-id=demo \
-Dkubernetes.container.image=xx.xx.xx/xx/xxx:2.0.12 \
local:///opt/flink/usrlib/my-flink-job.jar
Logs:
2021-03-01 04:52:06,518 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Job 6eb4027586e7137b20ecc8c3ce624417 is submitted.
2021-03-01 04:52:06,518 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Submitting Job with JobId=6eb4027586e7137b20ecc8c3ce624417.
2021-03-01 04:52:08,303 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Recovered 0 pods from previous attempts, current attempt id is 1.
2021-03-01 04:52:08,303 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
2021-03-01 04:52:08,306 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - ResourceManager akka.tcp://flink@demo.default:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 00000000000000000000000000000000
2021-03-01 04:52:08,310 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-03-01 04:52:08,596 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Unhandled exception
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_275]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_275]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_275]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_275]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_275]
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [flink-dist_2.12-1.12.0.jar:1.12.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-03-01 04:52:08,596 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Unhandled exception
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_275]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_275]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_275]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_275]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_275]
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [flink-dist_2.12-1.12.0.jar:1.12.0]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
After that, the log keeps scrolling with the same exception.
*From the Flink mailing-list archive compiled by volunteers
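This error usually means that a network connection was dropped abnormally while the Flink job was running, which surfaces as "Connection reset by peer". Here the warning is logged by MiniDispatcherRestEndpoint, the JobManager's REST endpoint. It does not by itself mean that Alibaba Cloud ACK (Container Service for Kubernetes) is incompatible with Flink. Possible causes include network configuration, resource limits, and security policy settings. Some suggestions for narrowing it down: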
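Check the network configuration: make sure traffic inside the cluster is not blocked by a firewall or network policy. On ACK, check whether any NetworkPolicy restricts Pod-to-Pod communication.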
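Resource limits: check whether Kubernetes is terminating TaskManager containers for lack of resources, in particular memory (a container that exceeds its memory limit is OOM-killed, while a CPU shortage only throttles it). A killed container resets its open connections.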
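Image: confirm that the Flink image xx.xx.xx/xx/xxx:2.0.12 is compatible with the ACK environment and that it pulls and runs without problems. Consider using an officially maintained or otherwise verified image version.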
Deeper log analysis: read the detailed TaskManager and JobManager logs, which may say more about why the connection was reset. The logs of each Pod are available in the ACK console.
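Flink configuration: review the Flink configuration, especially network-related options such as heartbeat interval and timeouts, and make sure they suit your cluster environment.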
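ACK cluster health: confirm that the ACK cluster itself is healthy, with no node failures or other system-level problems. Cluster status can be monitored in the ACK console.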
Java version compatibility: the log shows Java 8 (1.8.0_275). Flink 1.12 supports Java 8 and 11, so this is normally not a problem, but a specific JVM build can occasionally cause unexpected issues.
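The cluster-side checks above can be sketched as a handful of read-only kubectl commands. This is only a sketch under assumptions: the function name `flink_k8s_checks` and the `KUBECTL` override are made up for illustration, the namespace `default` is taken from the address `demo.default` in the log, and the label `app=<cluster-id>` is assumed to be what Flink's native Kubernetes integration puts on its Pods, so verify it against your own deployment.

```shell
#!/bin/sh
# Read-only checks for a Flink native-Kubernetes deployment (illustrative sketch).
# KUBECTL can be overridden, e.g. KUBECTL=echo for a dry run that only prints the commands.
KUBECTL="${KUBECTL:-kubectl}"

flink_k8s_checks() {
  cluster_id="$1"            # value passed as -Dkubernetes.cluster-id, e.g. demo
  namespace="${2:-default}"  # namespace the cluster was deployed into

  # NetworkPolicies that could restrict Pod-to-Pod traffic.
  $KUBECTL get networkpolicy -n "$namespace"

  # Pod status and restart counts (assumes the label app=<cluster-id>).
  $KUBECTL get pods -n "$namespace" -l "app=$cluster_id" -o wide

  # Container states and events; look for "OOMKilled" or failed probes.
  $KUBECTL describe pods -n "$namespace" -l "app=$cluster_id"

  # Service types, including the service that exposes the REST endpoint.
  $KUBECTL get svc -n "$namespace"

  # Node health.
  $KUBECTL get nodes
}
```

Running `flink_k8s_checks demo` would cover the network policy, resource, and cluster health items in one pass. If the REST service turns out to be of type LoadBalancer, one hypothesis worth testing is that the load balancer's health checks open and then drop connections to the REST port, which would produce this kind of WARN while the job itself keeps running. Flink's `kubernetes.rest-service.exposed.type` option (for example `ClusterIP`) changes how that service is exposed. Nothing in the log above confirms this, so treat it as something to verify.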
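If these steps do not resolve the problem, search the related discussions in the Flink community or ask on the Alibaba Cloud developer forum, and include more detailed cluster configuration and error logs so that others can help.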
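When collecting logs for such a report, it can help to summarise how often the warning occurs instead of pasting every stack trace. The following is a minimal sketch that assumes the log layout shown above (a WARN line ending in "Unhandled exception", followed directly by the exception line). The function name `reset_timestamps` is hypothetical.

```shell
#!/bin/sh
# Print the timestamp of every "Unhandled exception" warning whose next
# line is "Connection reset by peer". Reads files given as arguments, or stdin.
reset_timestamps() {
  awk '
    # A WARN line that announces an unhandled exception: remember its timestamp.
    /WARN .*Unhandled exception/ { ts = $1 " " $2; pending = 1; next }
    # The line directly after it names the exception.
    pending && /Connection reset by peer/ { print ts }
    # Any other line ends the pending state.
    { pending = 0 }
  ' "$@"
}
```

For example, `kubectl logs deployment/demo | reset_timestamps | cut -c1-16 | sort | uniq -c` would count the warnings per minute, assuming the JobManager deployment is named after the cluster id `demo`. Warnings that recur at a steady interval would point to a periodic probe rather than to job traffic, but that remains a hypothesis to check against the cluster.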