Flink CDC中从 savepoints恢复会报错,怎么办?2023-05-30 16:41:20,445 WARN org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - Exception while restoring keyed state backend for WindowOperator_19feeae3c9d523bf90e117e9861001a5_(3/20) from alternative (1/1), will retry while more alternatives are available.
org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.restoreState(HeapKeyedStateBackendBuilder.java:177) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:111) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.state.hashmap.HashMapStateBackend.createKeyedStateBackend(HashMapStateBackend.java:131) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.state.hashmap.HashMapStateBackend.createKeyedStateBackend(HashMapStateBackend.java:73) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.state.StateBackend.createKeyedStateBackend(StateBackend.java:136) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:585) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:565) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:540) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.12-1.13.6.jar:1.13.6]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 1@D
版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。
遇到从savepoints恢复Flink CDC作业时出现的错误,特别是与状态后端相关的异常,可以尝试以下几个解决步骤:
检查Savepoint兼容性:确保你使用的Flink版本与创建savepoint时的版本兼容。不同版本的Flink可能对state backend有细微的差异,不兼容可能导致恢复失败。
清理并重新尝试:有时候,问题可能是由于临时文件或状态损坏导致的。尝试删除旧的savepoint(如果可以的话),然后让作业重新生成一个新的savepoint,并尝试从这个新的savepoint恢复。
检查Kryo序列化配置:报错中提到“Unable to find class: 1@D”,这通常意味着Kryo在反序列化过程中找不到对应的类定义。确认你的Flink作业和所有依赖都正确打包,并且所有需要序列化的类都在类路径上。你可能需要在Flink配置中显式注册这些类,使用env.getConfig().registerTypeWithKryoSerializer(Class<?> type, Class<? extends Serializer<?>> serializerClass)
方法。
修改State Backend设置:考虑更换或调整状态后端配置。例如,如果你当前使用的是HashMapStateBackend
,可以尝试改为RocksDBStateBackend
,因为RocksDB提供了更稳定的状态存储机制,尤其是在处理大规模状态数据时。记得根据文档正确配置RocksDBStateBackend。
日志和堆栈跟踪深入分析:仔细查看完整的错误日志和堆栈跟踪,寻找更具体的错误原因。有时,错误信息中会包含关于哪个具体操作或类导致问题的线索。
资源限制:检查是否有足够的系统资源(如内存、磁盘空间)来完成恢复过程。资源不足也可能导致恢复失败。
社区和官方文档:如果上述步骤不能解决问题,建议查阅Flink的官方文档,尤其是关于Savepoint和State Backend的部分,或者在Apache Flink的用户邮件列表或Stack Overflow等社区寻求帮助,提供详细的错误信息和你的配置情况,以便获得更专业的帮助。
请按照上述建议逐一排查,希望能帮助你解决问题。您可以复制页面截图提供更多信息,我可以进一步帮您分析问题原因。