Flink CDC中从 savepoints恢复会报错,怎么办?

Flink CDC中从 savepoints恢复会报错,怎么办?2023-05-30 16:41:20,445 WARN org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - Exception while restoring keyed state backend for WindowOperator_19feeae3c9d523bf90e117e9861001a5_(3/20) from alternative (1/1), will retry while more alternatives are available.

org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend

at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.restoreState(HeapKeyedStateBackendBuilder.java:177) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:111) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.state.hashmap.HashMapStateBackend.createKeyedStateBackend(HashMapStateBackend.java:131) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.state.hashmap.HashMapStateBackend.createKeyedStateBackend(HashMapStateBackend.java:73) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.state.StateBackend.createKeyedStateBackend(StateBackend.java:136) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:585) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:565) ~[flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650) [flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:540) [flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) [flink-dist_2.12-1.13.6.jar:1.13.6]

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.12-1.13.6.jar:1.13.6]

at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]

Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 1@D

展开
收起
真的很搞笑 2023-06-04 18:26:46 261 分享 版权
阿里云 AI 助理回答

遇到从savepoints恢复Flink CDC作业时出现的错误,特别是与状态后端相关的异常,可以尝试以下几个解决步骤:

  1. 检查Savepoint兼容性:确保你使用的Flink版本与创建savepoint时的版本兼容。不同版本的Flink可能对state backend有细微的差异,不兼容可能导致恢复失败。

  2. 清理并重新尝试:有时候,问题可能是由于临时文件或状态损坏导致的。尝试删除旧的savepoint(如果可以的话),然后让作业重新生成一个新的savepoint,并尝试从这个新的savepoint恢复。

  3. 检查Kryo序列化配置:报错中提到“Unable to find class: 1@D”,这通常意味着Kryo在反序列化过程中找不到对应的类定义。确认你的Flink作业和所有依赖都正确打包,并且所有需要序列化的类都在类路径上。你可能需要在Flink配置中显式注册这些类,使用env.getConfig().registerTypeWithKryoSerializer(Class<?> type, Class<? extends Serializer<?>> serializerClass)方法。

  4. 修改State Backend设置:考虑更换或调整状态后端配置。例如,如果你当前使用的是HashMapStateBackend,可以尝试改为RocksDBStateBackend,因为RocksDB提供了更稳定的状态存储机制,尤其是在处理大规模状态数据时。记得根据文档正确配置RocksDBStateBackend。

  5. 日志和堆栈跟踪深入分析:仔细查看完整的错误日志和堆栈跟踪,寻找更具体的错误原因。有时,错误信息中会包含关于哪个具体操作或类导致问题的线索。

  6. 资源限制:检查是否有足够的系统资源(如内存、磁盘空间)来完成恢复过程。资源不足也可能导致恢复失败。

  7. 社区和官方文档:如果上述步骤不能解决问题,建议查阅Flink的官方文档,尤其是关于Savepoint和State Backend的部分,或者在Apache Flink的用户邮件列表或Stack Overflow等社区寻求帮助,提供详细的错误信息和你的配置情况,以便获得更专业的帮助。

请按照上述建议逐一排查,希望能帮助你解决问题。您可以复制页面截图提供更多信息,我可以进一步帮您分析问题原因。

有帮助
无帮助
AI 助理回答生成答案可能存在不准确,仅供参考
0 条回答
写回答
取消 提交回答

大数据领域前沿技术分享与交流,这里不止有技术干货、学习心得、企业实践、社区活动,还有未来。

还有其他疑问?
咨询AI助理