开发者社区> 问答> 正文

从checkpoint恢复任务失败怎么办?

用的版本1.9.1,我这里只要遇到异常,譬如空指针异常,然后从checkpoint恢复,总是恢复失败,报找不到sst文件的错误,错误堆栈如下:

2020-01-16 19:29:39

java.lang.Exception: Exception while creating StreamOperatorStateContext.

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)

at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:253)

at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)

at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)

at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)

at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_8ea7af242b2bcc2d11daf69b5d588c4d_(31/32) from any of the 1 provided restore options.

at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)

... 6 more

Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.

at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:326)

at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)

at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)

at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)

at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)

... 8 more

Caused by: java.nio.file.NoSuchFileException: /data/hadoop/tmp/nm-local-dir/usercache/www-data/appcache/application_1579002711906_0001/flink-io-bd910c0d-03c7-48ff-8712-4e7059bac574/job_bbe797c8fcdf7c362bed774435ae5f86_op_KeyedProcessOperator_8ea7af242b2bcc2d11daf69b5d588c4d__31_32__uuid_7259cf96-aa16-423e-a356-dcac0a7859f2/db/000019.sst -> /data/hadoop/tmp/nm-local-dir/usercache/www-data/appcache/application_1579002711906_0001/flink-io-bd910c0d-03c7-48ff-8712-4e7059bac574/job_bbe797c8fcdf7c362bed774435ae5f86_op_KeyedProcessOperator_8ea7af242b2bcc2d11daf69b5d588c4d__31_32__uuid_7259cf96-aa16-423e-a356-dcac0a7859f2/40e6dc65-7fac-41ae-b736-91c4ecd5e296/000019.sst

at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)

at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)

at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)

at java.nio.file.Files.createLink(Files.java:1086)

at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:473)

at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:212)

at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)

at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)

at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)

at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)

... 12 more

从错误日志看,首先从远程dfs下载checkpoint,下载到本地后,再做链接,但是在链接过程报错找不到文件,这个难道是权限问题吗*来自志愿者整理的flink邮件归档

展开
收起
彗星halation 2021-12-02 17:54:21 855 0
1 条回答
写回答
取消 提交回答
  • 确定每次恢复的时候没有其他异常么,之前有用户遇到是因为其他异常,触发cancel task的逻辑,导致清理了本地下载的文件,所以在进行硬链的时候会遇到no such file的异常。*来自志愿者整理的FLINK邮件归档

    2021-12-02 18:14:30
    赞同 展开评论 打赏
问答排行榜
最热
最新

相关电子书

更多
俞航翔|基于Log的通用增量Checkpoint 立即下载
低代码开发师(初级)实战教程 立即下载
阿里巴巴DevOps 最佳实践手册 立即下载

相关实验场景

更多