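When deploying core components in a Kubernetes cluster, people often use Critical Pods. But what is actually special about a Critical Pod? A full answer is not simple: it involves scheduling, the Kubelet Eviction Manager, the DaemonSet Controller, Kubelet Preemption, and more. I will cover the topic in a four-part series. This first part describes how Critical Pods behave in the predicate phase of scheduling, and how that differs from what users expect.
The Rescheduler is officially deprecated as of Kubernetes 1.10 and will be removed in version 1.12, so this article does not discuss how the Rescheduler handles Critical Pods.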
How to mark a Pod as a Critical Pod
Rule 1:
- Enable the Feature Gate ExperimentalCriticalPodAnnotation;
- The Pod must belong to the kube-system namespace;
- The Pod must carry the annotation scheduler.alpha.kubernetes.io/critical-pod="";
Rule 2:
- Enable the Feature Gates ExperimentalCriticalPodAnnotation and PodPriority;
- The Pod's Priority is set and is not less than 2 * 10^9;
Note: the built-in system-node-critical class has priority 2 * 10^9 + 1000, and system-cluster-critical has priority 2 * 10^9, so both satisfy Rule 2.
A Pod that satisfies either Rule 1 or Rule 2 is treated as a Critical Pod.
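The two rules can be sketched as a small predicate. This is a minimal illustration only: the featureGates and pod types and the function names below are simplified stand-ins invented for this example, not the Kubernetes API types or the kubelet helpers.

```go
package main

import "fmt"

// Simplified stand-ins for illustration; these are NOT the Kubernetes API types.
const (
	criticalPodAnnotation  = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace        = "kube-system"
	systemCriticalPriority = int32(2000000000) // 2 * 10^9
)

// featureGates models the two gates mentioned in the rules (hypothetical type).
type featureGates struct {
	ExperimentalCriticalPodAnnotation bool
	PodPriority                       bool
}

// pod carries only the fields the rules look at (hypothetical type).
type pod struct {
	Namespace   string
	Annotations map[string]string
	Priority    *int32
}

// isCriticalByAnnotation implements Rule 1: gate enabled, kube-system
// namespace, and the critical-pod annotation present with an empty value.
func isCriticalByAnnotation(g featureGates, p pod) bool {
	if !g.ExperimentalCriticalPodAnnotation || p.Namespace != systemNamespace {
		return false
	}
	val, ok := p.Annotations[criticalPodAnnotation]
	return ok && val == ""
}

// isCriticalByPriority implements Rule 2: both gates enabled and a priority
// that is set and at least 2 * 10^9.
func isCriticalByPriority(g featureGates, p pod) bool {
	if !g.ExperimentalCriticalPodAnnotation || !g.PodPriority {
		return false
	}
	return p.Priority != nil && *p.Priority >= systemCriticalPriority
}

// isCriticalPod reports whether either rule is satisfied.
func isCriticalPod(g featureGates, p pod) bool {
	return isCriticalByAnnotation(g, p) || isCriticalByPriority(g, p)
}

func main() {
	gates := featureGates{ExperimentalCriticalPodAnnotation: true, PodPriority: true}
	annotated := pod{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	userPriority := int32(1000000000) // highest user-definable value, below the threshold
	ordinary := pod{Namespace: "default", Priority: &userPriority}

	fmt.Println(isCriticalPod(gates, annotated)) // true (Rule 1)
	fmt.Println(isCriticalPod(gates, ordinary))  // false
}
```

Note that the annotation alone is not enough: the same annotated Pod in any namespace other than kube-system fails Rule 1.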
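Scheduling a Critical Pod
In the predicate phase of the default scheduler, GeneralPredicates is registered as one of the default predicates. The scheduler does not check whether a Pod is critical, and it does not switch to EssentialPredicates for critical Pods. What does this mean?
The relationship between GeneralPredicates and EssentialPredicates makes it clear. GeneralPredicates first calls noncriticalPredicates and then calls EssentialPredicates. So if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as Critical, the scheduler still runs the full GeneralPredicates flow, including noncriticalPredicates, whereas you would expect it to go straight to EssentialPredicates.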
// GeneralPredicates checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates
// that only non-critical pods need and EssentialPredicates are the predicates that all pods, including critical pods, need
func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
noncriticalPredicates was meant to hold the extra predicate logic that applies only to non-critical Pods. That logic is the PodFitsResources check.
pkg/scheduler/algorithm/predicates/predicates.go:1076
// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
PodFitsResources checks whether the node can satisfy the Pod's requirements for the following resources:
- Allowed Pod Number;
- CPU;
- Memory;
- EphemeralStorage;
- Extended Resources;
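In other words, if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as Critical, the scheduler still checks whether Allowed Pod Number, CPU, Memory, EphemeralStorage, and Extended Resources are sufficient. If they are not, the predicate phase fails for that node. The Preempt phase then only runs the normal preemption logic based on the Pod's PriorityClass, with no special handling for Critical Pods. As a result, the scheduler may fail to find any Node that meets the resource requirements, and the Critical Pod stays in the Pending state.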
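The effect can be reproduced with a reduced model of the predicate flow. The sketch below is an illustration under simplifying assumptions: the resources, node, and pod types and both functions are hypothetical, only three of the five resource dimensions are modeled, and the essential predicates are omitted. What it shows is that the Critical flag plays no part in the resource check.

```go
package main

import "fmt"

// resources is a reduced resource vector (hypothetical type for this sketch).
type resources struct {
	MilliCPU int64
	MemoryMB int64
	Pods     int64
}

// node tracks what the node offers and what is already requested on it.
type node struct {
	Allocatable resources
	Requested   resources
}

// pod carries a Critical flag and a resource request.
type pod struct {
	Name     string
	Critical bool
	Request  resources
}

// fitsResources is a reduced version of the idea behind PodFitsResources:
// it returns one reason per resource dimension that does not fit.
func fitsResources(p pod, n node) []string {
	var reasons []string
	if n.Requested.Pods+1 > n.Allocatable.Pods {
		reasons = append(reasons, "Too many pods")
	}
	if n.Requested.MilliCPU+p.Request.MilliCPU > n.Allocatable.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if n.Requested.MemoryMB+p.Request.MemoryMB > n.Allocatable.MemoryMB {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

// generalPredicates mirrors the stock behaviour described above: the resource
// check runs for every pod, and p.Critical is never consulted.
func generalPredicates(p pod, n node) (bool, []string) {
	reasons := fitsResources(p, n)
	// Essential predicates (host name, host ports, node selector) are
	// omitted in this sketch.
	return len(reasons) == 0, reasons
}

func main() {
	n := node{
		Allocatable: resources{MilliCPU: 4000, MemoryMB: 8192, Pods: 110},
		Requested:   resources{MilliCPU: 3800, MemoryMB: 4096, Pods: 20},
	}
	p := pod{Name: "critical-dns", Critical: true, Request: resources{MilliCPU: 500, MemoryMB: 256}}

	fit, reasons := generalPredicates(p, n)
	fmt.Println(fit, reasons) // false [Insufficient cpu]
}
```

If every node in the cluster answers this way and preemption frees nothing, the Pod has nowhere to go, which is the Pending state described above.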
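Users who mark a Pod as Critical, however, do not want scheduling to fail because of insufficient resources. So what if you do want to deploy a key service as Critical Pods through a Deployment, StatefulSet, or similar workload (DaemonSet excepted)? There are two options:
- Follow Rule 2 above and give the Pod the system-cluster-critical or system-node-critical Priority Class. The Pod can then obtain resources through the scheduler's normal Preempt flow and complete scheduling.
- Follow Rule 1 above and modify the GeneralPredicates code as shown below so that it detects whether the Pod is a Critical Pod and, if so, skips the noncriticalPredicates logic. In other words, the predicate phase no longer checks Allowed Pod Number, CPU, Memory, EphemeralStorage, or Extended Resources for that Pod.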
func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails, reasons []algorithm.PredicateFailureReason
	var fit bool
	var err error

	// **Modify**: check whether the pod is a Critical Pod; skip noncriticalPredicates if it is.
	isCriticalPod := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)
	if !isCriticalPod {
		fit, reasons, err = noncriticalPredicates(pod, meta, nodeInfo)
		if err != nil {
			return false, predicateFails, err
		}
		// Record resource failures only when the check actually ran.
		if !fit {
			predicateFails = append(predicateFails, reasons...)
		}
	}

	fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
As for the first option, Kubernetes already does this for you during the Priority admission check.
// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
	...
	if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
		var priority int32
		if len(pod.Spec.PriorityClassName) == 0 &&
			utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
			kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
			pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
		}
		...
}
At admission time the Pod's Priority is checked. If you have:
- Enabled the PodPriority Feature Gate;
- Enabled the ExperimentalCriticalPodAnnotation Feature Gate;
- Added the critical-pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
- Deployed the Pod in the kube-system namespace;
- Not set a PriorityClassName manually;
then the Priority admission stage automatically assigns the SystemClusterCritical (system-cluster-critical) PriorityClass to the Pod.
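The defaulting decision made at admission can be condensed into a few lines. This is a sketch of the decision only, assuming simplified stand-in types: featureGates, podSpec, and defaultPriorityClass are hypothetical names for this example, and the real plugin also validates the class and resolves the numeric priority, which is not modeled here.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemClusterCritical = "system-cluster-critical"
)

// featureGates models the two relevant gates (hypothetical type).
type featureGates struct {
	PodPriority                       bool
	ExperimentalCriticalPodAnnotation bool
}

// podSpec carries only the fields the defaulting step reads (hypothetical type).
type podSpec struct {
	Namespace         string
	Annotations       map[string]string
	PriorityClassName string
}

// hasCriticalAnnotation reports whether the pod is in kube-system and carries
// the critical-pod annotation with an empty value.
func hasCriticalAnnotation(p podSpec) bool {
	if p.Namespace != "kube-system" {
		return false
	}
	val, ok := p.Annotations[criticalPodAnnotation]
	return ok && val == ""
}

// defaultPriorityClass returns the PriorityClassName the pod ends up with.
// It only fills in system-cluster-critical when no class was set by hand.
func defaultPriorityClass(g featureGates, p podSpec) string {
	if !g.PodPriority {
		return p.PriorityClassName
	}
	if p.PriorityClassName == "" && g.ExperimentalCriticalPodAnnotation && hasCriticalAnnotation(p) {
		return systemClusterCritical
	}
	return p.PriorityClassName
}

func main() {
	gates := featureGates{PodPriority: true, ExperimentalCriticalPodAnnotation: true}
	annotated := podSpec{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	fmt.Println(defaultPriorityClass(gates, annotated)) // system-cluster-critical

	// A manually chosen class is left untouched ("my-class" is a made-up name).
	annotated.PriorityClassName = "my-class"
	fmt.Println(defaultPriorityClass(gates, annotated)) // my-class
}
```

The second call shows why a hand-picked class matters: once PriorityClassName is non-empty, the automatic upgrade to system-cluster-critical no longer applies.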
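Best practice
Based on the analysis above, here is the recommended practice. When you deploy a key service in a Kubernetes cluster through anything other than a DaemonSet (for example a Deployment or ReplicaSet), make sure of the following so that preemptive scheduling can still succeed when cluster resources are short:
- Enable the PodPriority Feature Gate;
- Enable the ExperimentalCriticalPodAnnotation Feature Gate;
- Add the critical-pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
- Deploy the Pod in the kube-system namespace;
- Do not set a custom PriorityClass manually;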
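Summary
This article described the two ways to mark a key service as Critical, explained how Critical Pods (other than those deployed by a DaemonSet) behave in the predicate phase of scheduling, and gave a recommended practice.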