问题一:Flink 1.12.0 隔几个小时Checkpoint就会失败
Hi 大家好 我用的Flink on yarn模式运行的一个任务,每隔几个小时就会出现一次错误
2021-03-18 08:52:37,019 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).
2021-03-18 08:52:37,637 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 08:52:42,956 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).
2021-03-18 08:52:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 09:12:43,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
2021-03-18 09:12:43,615 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
at java.util.concurrent.ScheduledThreadPoolExecutorScheduledFutureTask.accessScheduledFutureTask.accessScheduledFutureTask.access201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]
2021-03-18 09:12:43,618 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.
2021-03-18 09:12:43,619 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.
然后就自己恢复了。用的是Unaligned Checkpoint,rocksdb存储后端,在这个错误前后也没有什么其他报错信息。从Checkpoint的metrics看,总是剩最后一个无法完成,调整过parallelism也无法解决问题。
谢谢大家!
参考回答:
你好,问题定位到了吗? 我也遇到了相同的问题,感觉和checkpoint interval有关 我有两个相同的作业(checkpoint interval 设置的是3分钟),一个运行在flink1.9,一个运行在flink1.12,1.9的作业稳定运行,1.12的运行5小时就会checkpoint 制作失败,抛异常 org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. 当我把checkpoint interval调大到10分钟后,1.12的作业也可以稳定运行,所以我怀疑和制作间隔有关。 看到过一个issuse,了解到flink1.10后对于checkpoint机制进行调整,接收端在barrier对齐时不会缓存单个barrier到达后的数据,意味着发送方必须在barrier对齐后等待credit feedback来传输数据,因此发送方会产生一定的冷启动,影响到延迟和网络吞吐量。但是不确定是不是一定和这个相关,以及如何定位影响。
关于本问题的更多回答可点击原文查看:https://developer.aliyun.com/ask/359048?spm=a2c6h.13262185.0.0.677f6c07q66JNp
问题二:flink on yarn session模式与yarn通信失败是什么问题? (job模式可以成功)
大佬们请教一下: 之前一直使用job模式来提交任务,可以顺利提交计算任务。最近有需求比较适合session模式来提交,按照论坛里的教程进行提交的时候,一直报错连接不上resource manage。观察启动log发现两种任务连接的resource manage不同,一个是正确的端口,一个一直请求本机端口。
session 模式启动log:
job 模式启动log:
想请教一下: 1.如何配置session 模式下的 resource manage 端口?
2.job 模式下假如我有一个8核taskmanage服务器A配置了16个slot。job 模式提交了一个并行度为5的任务分配到了服务器A,再通过job模式提交任务的话,就不能有任务分配到服务器A了。我有办法将剩下空闲的slot利用起来吗?或者是设计上就是对小规模的任务,减少信息传递来提升计算完成的时间?
来自志愿者整理的flink邮件归档
参考回答:
第一个问题可以尝试在flink.conf 中配上jobmanager.rpc.address 和jobmanager.rpc.port
关于本问题的更多回答可点击原文查看:https://developer.aliyun.com/ask/359050?spm=a2c6h.13262185.0.0.677f6c07q66JNp
问题三:flink 使用yarn部署报错怎么办?
报错:Maximum Memory: 8192 Requested: 10240MB. Please check the 'yarn.scheduler.maximum-allocation-mb' and the 'yarn.nodemanager.resource.memory-mb' configuration values org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:425) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:606) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambdamainmainmain4(FlinkYarnSessionCli.java:860) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) at
来自志愿者整理的flink邮件归档 org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:860) Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The cluster does not have the requested resources for the TaskManagers available! Maximum Memory: 8192 Requested: 10240MB. Please check the 'yarn.scheduler.maximum-allocation-mb' and the 'yarn.nodemanager.resource.memory-mb' configuration values 来自志愿者整理的flink邮件归档
参考回答:
报错信息里已经说明了:你的 Yarn 集群配置允许的最大 container 是 2g,而你的 flink 配置的 TM 大小是 10g。
关于本问题的更多回答可点击原文查看:https://developer.aliyun.com/ask/359052?spm=a2c6h.13262185.0.0.677f6c07q66JNp
问题四:Flink-1.12 OnYarn模式下HA咨询问题?
Hi,各位社区的大佬,想了解下 flink 1.12 在yarn per job模式下HA的架构实现,是否有相关文档或图片描述呢?方便放出来,一起学习下!*来自志愿者整理的flink
参考回答:
ZooKeeper HA的实现是独立于部署模式的,不仅Yarn可以用,Standalone、K8s也可以用
具体的文档可以参考社区[1]
如果想了解设计细节,可以看一下K8s HA的实现,原理上大同小异[2]
[1].
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/overview/
[2].
https://cwiki.apache.org/confluence/display/FLINK/FLIP-144%3A+Native+Kubernetes+HA+for+Flink*来自志愿者整理的flink邮件归档
关于本问题的更多回答可点击原文查看:https://developer.aliyun.com/ask/359149?spm=a2c6h.13262185.0.0.51804c79pMhZZN
问题五:mysql cdc配置问题应该怎么办?
各位好: 我通过mysql cdc链接mysql(mariadb)一直不成功。网上也搜不到有效的解决方法。
用pyflink调试就一直运行也没数据也不报错。用flink sql就报java.net.ConnectException: Connection refused。
用slave可以远程连接mysql,也执行了 grant replication client, replication slave on . to 'slave'@'%' identified by 'slave';
sql和mysql的配置信息在下方,期盼大佬解答
flink sql如下:
CREATE TEMPORARY TABLE xin_test (
> id INT,
> name STRING
> ) WITH (
> 'connector' = 'mysql-cdc',
> 'hostname' = '..*.203',
> 'port' = '3306',
> 'username' = 'slave',
> 'password' = 'slave',
> 'database-name' = 'test',
> 'table-name' = 'test',
> 'server-id' = '10001',
> 'server-time-zone' = 'Asia/Shanghai'
> )
> ;
Flink SQL> select * from xin_test;
[ERROR] Could not execute SQL statement. Reason:
java.net.ConnectException: Connection refused
mysql的配置信息:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
symbolic-links=0
log-bin=mysql-bin
server-id=10001
replicate-do-db=test
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid*来自志愿者整理的flink邮件归档
参考回答:
第一:一般这里需要reload 和lock table权限,这个权限包括你的replication slave 等权限用grant是授予不了的,测试建议先all。 第二:你可以登陆安装目录flink1.x/log,里面有详细的错误日志,不要只在client 里面哭想。我们连接的是mysql-cdc
关于本问题的更多回答可点击原文查看:https://developer.aliyun.com/ask/359303?spm=a2c6h.13262185.0.0.51804c79pMhZZN