
Scaling issue running Flink on Kubernetes

Hi experts, I have a Flink application running on Kubernetes, initially with 1 Job Manager and 2 Task Managers.

We also have a custom operator that watches a CRD. When the CRD's replicas change, it patches the parallelism and max parallelism on the Flink Job Manager deployment according to the replicas from the CRD (parallelism can be configured via environment variables for our application). This causes the Job Manager to restart, and hence a new Flink job to start. The consumer group does not change, though, so the new job continues from the offsets where the old one left off.

In addition, the operator updates the replicas of the Task Manager deployment, which adjusts the number of pods.

When scaling up, the existing Task Manager pods are not killed; new Task Manager pods are created alongside them.
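For readers following along, here is a minimal sketch of what the two patches described above might look like as Kubernetes strategic merge patch bodies. The container name and the environment variable names (`PARALLELISM`, `MAX_PARALLELISM`) are hypothetical, since the post does not give them, and how the operator derives parallelism from the CRD replicas is left to the caller. It also illustrates why the Job Manager restarts while existing Task Manager pods survive a scale-up: the first patch changes the pod template, which makes the Deployment roll out new pods, whereas the second only changes `spec.replicas`.

```python
import json


def build_scale_patches(tm_replicas, parallelism, max_parallelism,
                        container="jobmanager",
                        parallelism_env="PARALLELISM",
                        max_parallelism_env="MAX_PARALLELISM"):
    """Return (jobmanager_patch, taskmanager_patch) as plain dicts.

    The container and env var names are hypothetical placeholders.
    """
    if tm_replicas < 1 or parallelism < 1:
        raise ValueError("tm_replicas and parallelism must be >= 1")
    if parallelism > max_parallelism:
        raise ValueError("parallelism must not exceed max parallelism")

    # Changing env vars edits the pod template, so the Deployment
    # replaces the Job Manager pod (a restart, hence a new job).
    jm_patch = {
        "spec": {"template": {"spec": {"containers": [{
            "name": container,
            "env": [
                {"name": parallelism_env, "value": str(parallelism)},
                {"name": max_parallelism_env, "value": str(max_parallelism)},
            ],
        }]}}}
    }

    # Changing only the replica count leaves the pod template alone,
    # so existing Task Manager pods keep running on scale-up.
    tm_patch = {"spec": {"replicas": tm_replicas}}
    return jm_patch, tm_patch


jm_patch, tm_patch = build_scale_patches(tm_replicas=4, parallelism=8,
                                         max_parallelism=128)
print(json.dumps(jm_patch, indent=2))
print(json.dumps(tm_patch))
```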

We then observed a skew in the consumed partition offsets: some partitions have huge lag while others have only small lag (as reported by Burrow).

The metrics in the Flink UI confirm this: throughput differs between slots.
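To make "huge lag" versus "small lag" concrete, one rough approach is to flag partitions whose lag is far above the median. The sketch below assumes the per-partition lag has already been collected into a plain dict (partition id to lag); it does not call Burrow, and the numbers in `sample` are made up purely for illustration.

```python
from statistics import median


def lagging_partitions(lag_by_partition, factor=3.0):
    """Return partitions whose lag exceeds `factor` times the median lag."""
    if not lag_by_partition:
        return []
    mid = median(lag_by_partition.values())
    # Use a floor of 1 so an all-zero median does not flag every partition.
    threshold = factor * max(mid, 1)
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)


# Illustrative values only, not measurements from the job in question.
sample = {0: 120, 1: 95, 2: 48000, 3: 110, 4: 51000, 5: 130}
print(lagging_partitions(sample))  # prints [2, 4]
```

Mapping the flagged partitions back to the subtasks and Task Manager pods that read them would show whether the slow ones sit on the newly created pods.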

Any clue why this is the case? *From the Flink mailing list archive, compiled by volunteers.

玛丽莲梦嘉 2021-12-02 16:18:56
1 answer
  • I have a few more questions regarding your issue. 

    • Which Flink version are you using? 
    • Is this skew observed only after scaling up? What happens if the parallelism is initially set to the scaled-up value?
    • If you keep the job running for a while after the scale-up, does the skew ease?
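One way to answer the last question with numbers rather than by eye is to take lag snapshots at intervals after the scale-up and compare their spread. This is only a sketch: the snapshot format (a list of dicts mapping partition id to lag) and the `tolerance` value are assumptions, not anything Flink or Burrow defines.

```python
def lag_spread(lag_by_partition):
    """Spread of one snapshot: largest lag minus smallest lag."""
    if not lag_by_partition:
        raise ValueError("snapshot is empty")
    lags = list(lag_by_partition.values())
    return max(lags) - min(lags)


def skew_is_easing(snapshots, tolerance=0.9):
    """True if the last snapshot's spread is below tolerance * the first's."""
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots")
    first = lag_spread(snapshots[0])
    last = lag_spread(snapshots[-1])
    return last < tolerance * first
```

If the spread keeps shrinking, that would be consistent with a warm-up effect; if it stays flat or grows, the cause is more likely structural.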

    I suspect the performance difference might be the result of warm-up effects. For example, the existing TMs might already have some files localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not. *From the Flink mailing list archive, compiled by volunteers.

    2021-12-02 17:18:11