ID: 295
Status: FAILED
Acknowledged: 30/50
Trigger Time: 11:55:38
Latest Acknowledgement: 11:55:39
End to End Duration: 1h 0m 0s
State Size: 205 KB
Buffered During Alignment: 0 B
Checkpoint Detail:
Path: -
Discarded: -
Failure Message: Checkpoint expired before completing.
Operators:
Name                                                Acknowledged  Latest Acknowledgment  End to End Duration  State Size  Buffered During Alignment
Source: dw-member                                   6/10 (60%)    11:55:39               1s                   7.08 KB     0 B
Source: wi-order                                    6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
Source: dw-pay                                      6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
RecordTransformOperator                             6/10 (60%)    11:55:39               1s                   98.8 KB     0 B
RecordComputeOperator -> Sink: dw-record-data-sink  6/10 (60%)    11:55:39               1s                   85.1 KB     0 B
SubTasks:
Statistic  End to End Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Alignment Buffered  Alignment Duration
Minimum    1s                   14.2 KB     7ms                         841ms                        0 B                 13ms
Average    1s                   14.2 KB     94ms                        1s                           0 B                 13ms
Maximum    1s                   14.2 KB     181ms                       1s                           0 B                 15ms
ID  Acknowledgement Time  E2E Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Align Buffered  Align Duration
1   n/a
2   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             15ms
3   n/a
4   11:55:39              1s            14.2 KB     181ms                       1s                           0 B             13ms
5   n/a
6   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             14ms
7   11:55:39              1s            14.2 KB     181ms                       961ms                        0 B             13ms
8   n/a
9   11:55:39              1s            14.2 KB     181ms                       841ms                        0 B             13ms
10  11:55:39              1s            14.2 KB     7ms                         1s                           0 B             14ms
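How should this kind of problem be investigated? Are there any good suggestions or best practices? Thanks!
*From the Flink mailing list archive, compiled by volunteers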
Hi!
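A checkpoint timeout can have many causes. The most common one is that the subtask that timed out is too busy and blocks the checkpoint (for example because of insufficient compute resources or data skew); you can check this through the busy and backpressure information in the Flink web UI. Another common cause is overly frequent GC, which you can observe by setting JVM parameters to print a GC log.
*From the Flink mailing list archive, compiled by volunteers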
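As a first step before looking at busy or backpressure metrics, it can help to narrow down which subtasks never acknowledged the checkpoint. The following is only a minimal sketch in Python: the rows are copied by hand from the subtask table in the question, and the names `SUBTASK_ACKS`, `unacknowledged` and `ack_ratio` are made up for this illustration, not part of Flink.

```python
# Subtask rows copied by hand from the checkpoint detail in the question:
# (subtask id, acknowledgement time). None stands for the "n/a" entries,
# i.e. subtasks that never acknowledged checkpoint 295 before it expired.
SUBTASK_ACKS = [
    (1, None), (2, "11:55:39"), (3, None), (4, "11:55:39"), (5, None),
    (6, "11:55:39"), (7, "11:55:39"), (8, None), (9, "11:55:39"), (10, "11:55:39"),
]


def unacknowledged(rows):
    """Return the ids of subtasks that have no acknowledgement time."""
    return [subtask_id for subtask_id, ack_time in rows if ack_time is None]


def ack_ratio(rows):
    """Return (number of acknowledged subtasks, total number of subtasks)."""
    acked = sum(1 for _, ack_time in rows if ack_time is not None)
    return acked, len(rows)


missing = unacknowledged(SUBTASK_ACKS)
acked, total = ack_ratio(SUBTASK_ACKS)
print(f"acknowledged {acked}/{total}; check subtasks {missing} first")
```

For the data above this points at subtasks 1, 3, 5 and 8, which matches the 6/10 (60%) shown per operator. Those are the subtasks whose busy and backpressure values, and whose TaskManager GC behaviour, are worth inspecting first. For the GC log itself, on JDK 8 flags such as `-XX:+PrintGCDetails` together with `-Xloggc:<file>` are a common choice; how they are passed to the TaskManager JVM depends on the Flink version and deployment, so treat the exact setup as an assumption to verify.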