Flink 1.12.0: Checkpoint fails every few hours

Hi everyone, I am running a job in Flink on YARN mode, and every few hours it hits the following error:

2021-03-18 08:52:37,019 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).

2021-03-18 08:52:37,637 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.

2021-03-18 08:52:42,956 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).

2021-03-18 08:52:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.

2021-03-18 09:12:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.

2021-03-18 09:12:43,615 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.

org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.

at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]

at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]

2021-03-18 09:12:43,618 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.

2021-03-18 09:12:43,619 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.

2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.

2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.

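After that, the job recovers on its own. I am using unaligned checkpoints with the RocksDB state backend, and there are no other error messages before or after this one. Judging from the checkpoint metrics, it is always the last one remaining that fails to complete, and adjusting the parallelism did not solve the problem.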
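For readers following the log above, the time between the trigger line and the expiry line of checkpoint 661820 can be computed directly from the two timestamps. The sketch below does only that; the class and method names are illustrative, and it assumes every log line starts with a 23-character `yyyy-MM-dd HH:mm:ss,SSS` timestamp, as the quoted lines do. A gap of exactly 20 minutes would be consistent with a 20-minute checkpoint timeout, but the post does not state the configured timeout, so that reading is an inference.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper (not part of Flink): measures the gap between two log lines.
class CheckpointGap {
    // Timestamp layout at the start of the JobManager log lines quoted in the post.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Parses the leading 23-character timestamp of a log line.
    public static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), LOG_TS);
    }

    // Elapsed time from the first log line to the second.
    public static Duration between(String earlierLine, String laterLine) {
        return Duration.between(timestampOf(earlierLine), timestampOf(laterLine));
    }

    public static void main(String[] args) {
        String trigger = "2021-03-18 08:52:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
                + "Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.";
        String expired = "2021-03-18 09:12:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
                + "Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.";
        // Prints "20 minutes" for the two lines quoted in the post.
        System.out.println(between(trigger, expired).toMinutes() + " minutes");
    }
}
```

By comparison, the two preceding checkpoints in the log completed in under 5 seconds each, so checkpoint 661820 appears to have stalled rather than merely run slowly.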

Thanks, everyone!

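Hi, did you manage to track down the cause? I ran into the same problem, and it looks related to the checkpoint interval. I have two identical jobs (checkpoint interval set to 3 minutes), one running on Flink 1.9 and one on Flink 1.12. The 1.9 job runs stably, while the 1.12 job fails to complete a checkpoint after about 5 hours and throws org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. After I increased the checkpoint interval to 10 minutes, the 1.12 job also ran stably, so I suspect the interval is involved.

I came across an issue explaining that the checkpoint mechanism was changed after Flink 1.10: during barrier alignment, the receiver no longer buffers data that arrives after a single barrier, which means the sender has to wait for credit feedback after alignment before it can transmit data. The sender therefore goes through a kind of cold start, which affects latency and network throughput. I am not sure whether this is really the cause, or how to pin down its impact.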
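The exception quoted in this thread is raised when checkpoint failures exceed a configured tolerance. The sketch below is a deliberately simplified model of that idea, not Flink's actual `CheckpointFailureManager` code: the class name, method names, and the exact counting rule (consecutive failures, reset on success, fail when the count exceeds the threshold) are assumptions made for illustration. With a tolerance of 0, a single expired checkpoint is enough to fail the job, which matches the log in the original post, where one expiry is followed immediately by a global failure.

```java
// Simplified, hypothetical model of a "tolerable checkpoint failures" counter.
// It is not Flink code; it only illustrates the thresholding behaviour.
class FailureToleranceModel {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    FailureToleranceModel(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // A completed checkpoint resets the consecutive-failure count.
    void onSuccess() {
        consecutiveFailures = 0;
    }

    // Records a failed (for example, expired) checkpoint.
    // Returns true when the tolerance is exceeded, i.e. the job would be failed.
    boolean onFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    public static void main(String[] args) {
        FailureToleranceModel strict = new FailureToleranceModel(0);
        strict.onSuccess();
        System.out.println(strict.onFailure());  // true: one failure exceeds a tolerance of 0

        FailureToleranceModel lenient = new FailureToleranceModel(2);
        System.out.println(lenient.onFailure()); // false: 1 failure, tolerance 2
        System.out.println(lenient.onFailure()); // false: 2 failures, tolerance 2
        System.out.println(lenient.onFailure()); // true: 3 failures exceed tolerance 2
    }
}
```

In a real job the tolerance is set through `CheckpointConfig.setTolerableCheckpointFailureNumber`. Raising it would only keep the job from restarting on an occasional expired checkpoint; it would not explain why the checkpoint stalls in the first place.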
