
Flink job: after running for a while, checkpoints keep failing

ID   Status  Acknowledged  Trigger Time  Latest Acknowledgement  End to End Duration  State Size  Buffered During Alignment
295  FAILED  30/50         11:55:38      11:55:39                1h 0m 0s             205 KB      0 B
Checkpoint Detail:
Path: -
Discarded: -
Failure Message: Checkpoint expired before completing.
Operators:
Name                                               Acknowledged  Latest Acknowledgment  End to End Duration  State Size  Buffered During Alignment
Source: dw-member                                  6/10 (60%)    11:55:39               1s                   7.08 KB     0 B
Source: wi-order                                   6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
Source: dw-pay                                     6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
RecordTransformOperator                            6/10 (60%)    11:55:39               1s                   98.8 KB     0 B
RecordComputeOperator -> Sink: dw-record-data-sink 6/10 (60%)    11:55:39               1s                   85.1 KB     0 B
SubTasks:
         End to End Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Alignment Buffered  Alignment Duration
Minimum  1s                   14.2 KB     7ms                         841ms                        0 B                 13ms
Average  1s                   14.2 KB     94ms                        1s                           0 B                 13ms
Maximum  1s                   14.2 KB     181ms                       1s                           0 B                 15ms
ID  Acknowledgement Time  E2E Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Align Buffered  Align Duration
1   n/a
2   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             15ms
3   n/a
4   11:55:39              1s            14.2 KB     181ms                       1s                           0 B             13ms
5   n/a
6   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             14ms
7   11:55:39              1s            14.2 KB     181ms                       961ms                        0 B             13ms
8   n/a
9   11:55:39              1s            14.2 KB     181ms                       841ms                        0 B             13ms
10  11:55:39              1s            14.2 KB     7ms                         1s                           0 B             14ms


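How should this kind of problem be troubleshot? Are there any good suggestions or best practices? Thanks!
*From the Flink mailing list archive, compiled by volunteers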

moonlightdisco 2021-12-07 16:48:30
1 answer
  • Hi!

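    A checkpoint timeout can have many causes. The most common one is that the nodes that time out are too busy and block the checkpoint (for example, because of insufficient compute resources or data skew); you can check this from the busy and backpressure information in the Flink web UI. Another common cause is overly frequent GC, which you can observe by setting JVM options that print a GC log. *From the Flink mailing list archive, compiled by volunteers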
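
To decide which nodes to check for busy time and backpressure, it may help to first list the subtasks that never acknowledged the checkpoint. The following is only a minimal Python sketch over the subtask rows posted in the question; the names `ACKS`, `unacknowledged`, and `ack_ratio` are illustrative assumptions and not part of Flink.

```python
# Subtask rows copied from the checkpoint 295 detail table in the question:
# subtask ID -> acknowledgement time, or None where the web UI showed "n/a".
# (Illustrative data structure, not a Flink API.)
ACKS = {
    1: None,
    2: "11:55:39",
    3: None,
    4: "11:55:39",
    5: None,
    6: "11:55:39",
    7: "11:55:39",
    8: None,
    9: "11:55:39",
    10: "11:55:39",
}


def unacknowledged(acks):
    """Return the sorted IDs of subtasks that never acknowledged the checkpoint."""
    return sorted(subtask for subtask, ack_time in acks.items() if ack_time is None)


def ack_ratio(acks):
    """Return (acknowledged, total), as in the 'Acknowledged' column of the UI."""
    done = sum(1 for ack_time in acks.values() if ack_time is not None)
    return done, len(acks)


if __name__ == "__main__":
    done, total = ack_ratio(ACKS)
    print(f"acknowledged: {done}/{total}")
    print("never acknowledged:", unacknowledged(ACKS))
```

For the posted table this reports 6/10 acknowledged and subtasks 1, 3, 5, and 8 as never acknowledged. The six subtasks that did acknowledge finished in about 1s, so those four are the ones whose busy and backpressure indicators (and GC logs on their TaskManagers) are worth looking at first.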

    2021-12-07 17:19:31