In-Depth Analysis of Kubernetes Critical Pod (Part 3) - Alibaba Cloud Developer Community

2018-07-01

Abstract: This article explains how the kubelet preempts resources for a CriticalPod during the Predicate Admit check, and how the Priority Admission Controller special-cases the PriorityClassName of a CriticalPod.

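The In-Depth Analysis of Kubernetes Critical Pod series:
In-Depth Analysis of Kubernetes Critical Pod (Part 1)
In-Depth Analysis of Kubernetes Critical Pod (Part 2)
In-Depth Analysis of Kubernetes Critical Pod (Part 3)
In-Depth Analysis of Kubernetes Critical Pod (Part 4)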

How the Kubelet Handles Resource Preemption for Critical Pods During Predicate Admit

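During the Predicate Admit flow, the kubelet runs a series of predicate admission checks against each Pod. One of them is GeneralPredicates, which, among other things, checks whether the node has enough cpu, mem, and gpu resources. If GeneralPredicates fails, a nonCriticalPod is rejected outright. For a CriticalPod, the failure instead triggers kubelet preemption: the kubelet kills some Pods according to a set of rules to free resources, and if the preemption succeeds, the Pod is admitted.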

The natural place to start tracing this flow is kubelet initialization.

pkg/kubelet/kubelet.go:315

// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(...) (*Kubelet, error) {
    ...
    criticalPodAdmissionHandler := preemption.NewCriticalPodAdmissionHandler(klet.GetActivePods, killPodNow(klet.podWorkers, kubeDeps.Recorder), kubeDeps.Recorder)
    klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))
    // apply functional Option's
    for _, opt := range kubeDeps.Options {
        opt(klet)
    }

    ...
    return klet, nil
}

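When NewMainKubelet initializes the kubelet, it creates criticalPodAdmissionHandler and passes it to the predicateAdmitHandler that is registered through AddPodAdmitHandler. The special handling of CriticalPod admission lives in criticalPodAdmissionHandler.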

Next, we step into the kubelet's predicateAdmitHandler flow to see what happens after GeneralPredicates fails.

pkg/kubelet/lifecycle/predicate.go:58

func (w *predicateAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
    ...

    fit, reasons, err := predicates.GeneralPredicates(podWithoutMissingExtendedResources, nil, nodeInfo)
    if err != nil {
        message := fmt.Sprintf("GeneralPredicates failed due to %v, which is unexpected.", err)
        glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
        return PodAdmitResult{
            Admit:   fit,
            Reason:  "UnexpectedAdmissionError",
            Message: message,
        }
    }
    if !fit {
        fit, reasons, err = w.admissionFailureHandler.HandleAdmissionFailure(pod, reasons)
        if err != nil {
            message := fmt.Sprintf("Unexpected error while attempting to recover from admission failure: %v", err)
            glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
            return PodAdmitResult{
                Admit:   fit,
                Reason:  "UnexpectedAdmissionError",
                Message: message,
            }
        }
    }
    ...
    return PodAdmitResult{
        Admit: true,
    }
}

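When the kubelet's predicateAdmitHandler runs GeneralPredicates against a Pod and the check fails (for example, because cpu, mem, or gpu resources are insufficient), it calls HandleAdmissionFailure for additional handling. As noted earlier, kubelet initialization supplied criticalPodAdmissionHandler as the admissionFailureHandler, so its HandleAdmissionFailure method is the one invoked.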
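The control flow described here can be illustrated with a minimal, self-contained sketch. It is not the kubelet code: `Pod`, `criticalOnly`, and `admit` are hypothetical names, only a single CPU resource is modeled, and the handler simply pretends that preemption succeeds for critical pods.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for *v1.Pod.
type Pod struct {
	Name     string
	Critical bool
	CPUMilli int64 // requested CPU in millicores
}

// AdmissionFailureHandler plays the role of the kubelet's
// admissionFailureHandler: it may rescue a pod that failed the resource check.
type AdmissionFailureHandler interface {
	HandleAdmissionFailure(pod Pod, reasons []string) (bool, []string, error)
}

// criticalOnly rescues critical pods and leaves all other pods rejected.
// A real handler would evict pods here; this sketch assumes that step succeeds.
type criticalOnly struct{}

func (criticalOnly) HandleAdmissionFailure(pod Pod, reasons []string) (bool, []string, error) {
	if !pod.Critical {
		return false, reasons, nil
	}
	return true, nil, nil
}

// admit runs the resource check first and consults the failure handler
// only when that check fails.
func admit(pod Pod, freeCPUMilli int64, h AdmissionFailureHandler) (bool, []string) {
	if pod.CPUMilli <= freeCPUMilli {
		return true, nil
	}
	fit, reasons, err := h.HandleAdmissionFailure(pod, []string{"Insufficient cpu"})
	if err != nil {
		return false, []string{"UnexpectedAdmissionError: " + err.Error()}
	}
	return fit, reasons
}

func main() {
	pods := []Pod{
		{Name: "web", CPUMilli: 500},
		{Name: "dns", Critical: true, CPUMilli: 500},
	}
	for _, p := range pods {
		ok, reasons := admit(p, 100, criticalOnly{})
		fmt.Println(p.Name, ok, reasons)
	}
}
```

With only 100 millicores free, both pods fail the resource check, but only the critical one is handed a second chance by the handler.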

The CriticalPodAdmissionHandler struct is defined as follows:

pkg/kubelet/preemption/preemption.go:41

type CriticalPodAdmissionHandler struct {
    getPodsFunc eviction.ActivePodsFunc
    killPodFunc eviction.KillPodFunc
    recorder    record.EventRecorder
}

The HandleAdmissionFailure method of CriticalPodAdmissionHandler is where the CriticalPod-specific logic lives.

pkg/kubelet/preemption/preemption.go:66

// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(pod *v1.Pod, failureReasons []algorithm.PredicateFailureReason) (bool, []algorithm.PredicateFailureReason, error) {
    if !kubetypes.IsCriticalPod(pod) || !utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) {
        return false, failureReasons, nil
    }
    // InsufficientResourceError is not a reason to reject a critical pod.
    // Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
    nonResourceReasons := []algorithm.PredicateFailureReason{}
    resourceReasons := []*admissionRequirement{}
    for _, reason := range failureReasons {
        if r, ok := reason.(*predicates.InsufficientResourceError); ok {
            resourceReasons = append(resourceReasons, &admissionRequirement{
                resourceName: r.ResourceName,
                quantity:     r.GetInsufficientAmount(),
            })
        } else {
            nonResourceReasons = append(nonResourceReasons, reason)
        }
    }
    if len(nonResourceReasons) > 0 {
        // Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
        return false, nonResourceReasons, nil
    }
    err := c.evictPodsToFreeRequests(admissionRequirementList(resourceReasons))
    // if no error is returned, preemption succeeded and the pod is safe to admit.
    return err == nil, nil, err
}

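If the Pod is not a CriticalPod, or the ExperimentalCriticalPodAnnotation Feature Gate is disabled, the method returns false immediately, meaning admission fails.
Otherwise, it splits failureReasons into predicates.InsufficientResourceError entries and everything else. If any non-resource reason is present, admission fails with those reasons. If every reason is an InsufficientResourceError, it calls evictPodsToFreeRequests to trigger kubelet preemption. Note that this preemption is distinct from scheduler preemption; do not confuse the two.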
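The reason-splitting step can be sketched in isolation. The types below (`FailureReason`, `InsufficientResource`, `OtherReason`) are hypothetical stand-ins for algorithm.PredicateFailureReason and predicates.InsufficientResourceError, and `shouldPreempt` is an illustrative helper rather than a function in the kubelet.

```go
package main

import "fmt"

// FailureReason is a hypothetical stand-in for algorithm.PredicateFailureReason.
type FailureReason interface {
	GetReason() string
}

// InsufficientResource stands in for predicates.InsufficientResourceError.
type InsufficientResource struct {
	Resource string
	Missing  int64
}

func (r *InsufficientResource) GetReason() string { return "Insufficient " + r.Resource }

// OtherReason represents any failure that is not about resources.
type OtherReason struct{ Msg string }

func (r *OtherReason) GetReason() string { return r.Msg }

// splitReasons separates resource shortages from all other failure reasons
// using a type assertion, the same technique as in the excerpt above.
func splitReasons(reasons []FailureReason) (resource []*InsufficientResource, other []FailureReason) {
	for _, reason := range reasons {
		if r, ok := reason.(*InsufficientResource); ok {
			resource = append(resource, r)
		} else {
			other = append(other, reason)
		}
	}
	return resource, other
}

// shouldPreempt reports whether preemption is worth attempting: only when
// no non-resource reason stands in the way.
func shouldPreempt(reasons []FailureReason) bool {
	_, other := splitReasons(reasons)
	return len(other) == 0
}

func main() {
	reasons := []FailureReason{
		&InsufficientResource{Resource: "cpu", Missing: 200},
		&OtherReason{Msg: "host port conflict"},
	}
	fmt.Println(shouldPreempt(reasons))     // mixed reasons
	fmt.Println(shouldPreempt(reasons[:1])) // resource shortage only
}
```

Freeing resources cannot fix a non-resource failure, which is why a single such reason is enough to skip preemption entirely.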

evictPodsToFreeRequests implements the resource preemption logic of kubelet preemption. At its core, it calls getPodsToPreempt to select suitable Pods to kill (podsToPreempt).

pkg/kubelet/preemption/preemption.go:121

// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
    bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pods)

    // make sure that pods exist to reclaim the requirements
    unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
    if len(unableToMeetRequirements) > 0 {
        return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
    }
    // find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
    guarateedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
    if err != nil {
        return nil, err
    }
    // Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
    burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    // Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
    bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    return append(append(bestEffortToEvict, burstableToEvict...), guarateedToEvict...), nil
}

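During kubelet preemption, the Pods to kill are selected as follows:

If, for some resource, the amount the Pod still needs exceeds the combined requests of all current bestEffortPods, burstablePods, and guaranteedPods, then podsToPreempt is nil, meaning no suitable set of Pods can be freed.
If freeing all bestEffortPods and burstablePods would still not be enough, guaranteedPods are selected as well (guarateedToEvict). The selection rules are:
- Rule 1: the fewer Pods evicted, the better;
- Rule 2: the fewer resources freed, the better;
- Rule 1 takes precedence over Rule 2;
If freeing all bestEffortPods plus guarateedToEvict would still not be enough, burstablePods are selected as well (burstableToEvict), using the same rules.
If freeing burstableToEvict plus guarateedToEvict would still not be enough, bestEffortPods are selected as well (bestEffortToEvict), using the same rules.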

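In other words, Pods with a lower Pod Resource QoS level are preempted first, and within a single QoS level Pods are selected by the following rules:

Rule 1: the fewer Pods evicted, the better;
Rule 2: the fewer resources freed, the better;
Rule 1 takes precedence over Rule 2;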
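The three-pass selection can be sketched for a single resource. Everything below is a simplification under stated assumptions: `Pod`, `pickGreedy`, and `selectVictims` are hypothetical names, only CPU millicores are tracked, and the per-class choice is a greedy heuristic (take the pod that leaves the least unmet need, preferring the smaller request on a tie) that approximates the two rules rather than reproducing the kubelet's exact distance computation.

```go
package main

import "fmt"

type QoS int

const (
	BestEffort QoS = iota
	Burstable
	Guaranteed
)

// Pod is a hypothetical pod with a single CPU request in millicores.
// BestEffort pods have no requests by definition, so their Request is 0.
type Pod struct {
	Name    string
	QoS     QoS
	Request int64
}

func sum(pods []Pod) int64 {
	var total int64
	for _, p := range pods {
		total += p.Request
	}
	return total
}

func unmet(need, request int64) int64 {
	if request >= need {
		return 0
	}
	return need - request
}

// pickGreedy takes pods until need is covered. Each round it prefers the pod
// leaving the least unmet need; on a tie, the pod with the smaller request.
func pickGreedy(candidates []Pod, need int64) []Pod {
	remaining := append([]Pod(nil), candidates...) // copy, do not mutate input
	var chosen []Pod
	for need > 0 && len(remaining) > 0 {
		best := 0
		for i, p := range remaining {
			ua, ub := unmet(need, p.Request), unmet(need, remaining[best].Request)
			if ua < ub || (ua == ub && p.Request < remaining[best].Request) {
				best = i
			}
		}
		chosen = append(chosen, remaining[best])
		need -= remaining[best].Request
		remaining = append(remaining[:best], remaining[best+1:]...)
	}
	return chosen
}

// selectVictims mirrors the pass order of getPodsToPreempt: guaranteed pods
// are considered only for what lower classes cannot cover, then burstable,
// then best-effort.
func selectVictims(pods []Pod, need int64) ([]Pod, error) {
	var be, bu, gu []Pod
	for _, p := range pods {
		switch p.QoS {
		case BestEffort:
			be = append(be, p)
		case Burstable:
			bu = append(bu, p)
		default:
			gu = append(gu, p)
		}
	}
	if sum(pods) < need {
		return nil, fmt.Errorf("cannot reclaim %d millicores from running pods", need)
	}
	guEvict := pickGreedy(gu, need-sum(be)-sum(bu))
	buEvict := pickGreedy(bu, need-sum(be)-sum(guEvict))
	beEvict := pickGreedy(be, need-sum(buEvict)-sum(guEvict))
	return append(append(beEvict, buEvict...), guEvict...), nil
}

func main() {
	pods := []Pod{
		{"be1", BestEffort, 0}, {"bu1", Burstable, 200},
		{"bu2", Burstable, 500}, {"gu1", Guaranteed, 1000},
	}
	victims, err := selectVictims(pods, 400)
	fmt.Println(victims, err)
}
```

In this example the guaranteed pod is untouched as long as the burstable pods alone can cover the shortfall, which is the QoS ordering the text describes.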

Special Handling of CriticalPod in the Priority Admission Controller

First, let us look at the special, system-reserved kinds of CriticalPod:

ClusterCriticalPod: a Pod whose PriorityClass Name is system-cluster-critical.
NodeCriticalPod: a Pod whose PriorityClass Name is system-node-critical.

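If the Priority Admission Controller is enabled among the admission controllers, the priority check performed at Pod creation also special-cases CriticalPod.

The main job of the Priority Admission Controller is to resolve the PriorityClassName specified in a Pod into the corresponding Spec.Priority value.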
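The name-to-value resolution can be sketched as a plain map lookup. This is an illustration only: `PodSpec` and `resolvePriority` are hypothetical, the map replaces the controller's PriorityClass lister, and the class names and values in `main` are sample data.

```go
package main

import (
	"errors"
	"fmt"
)

// PodSpec is a hypothetical, trimmed-down pod spec.
type PodSpec struct {
	PriorityClassName string
	Priority          *int32
}

var errPrioritySet = errors.New("the integer value of priority must not be provided in pod spec")

// resolvePriority fills spec.Priority from spec.PriorityClassName.
// classes maps PriorityClass names to values; defaultPriority applies when
// the pod names no class.
func resolvePriority(spec *PodSpec, classes map[string]int32, defaultPriority int32) error {
	// Clients may only name a class; the numeric value is filled in here.
	if spec.Priority != nil {
		return errPrioritySet
	}
	priority := defaultPriority
	if spec.PriorityClassName != "" {
		value, ok := classes[spec.PriorityClassName]
		if !ok {
			return fmt.Errorf("no PriorityClass with name %v was found", spec.PriorityClassName)
		}
		priority = value
	}
	spec.Priority = &priority
	return nil
}

func main() {
	// Sample data for illustration.
	classes := map[string]int32{"system-cluster-critical": 2000000000, "batch-low": 100}

	spec := &PodSpec{PriorityClassName: "batch-low"}
	if err := resolvePriority(spec, classes, 0); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(spec.PriorityClassName, *spec.Priority)
}
```

Rejecting a client-supplied numeric priority keeps the PriorityClass objects as the single source of truth for priority values.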

plugin/pkg/admission/priority/admission.go:138

// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
    operation := a.GetOperation()
    pod, ok := a.GetObject().(*api.Pod)
    if !ok {
        return errors.NewBadRequest("resource was marked with kind Pod but was unable to be converted")
    }

    // Make sure that the client has not set `priority` at the time of pod creation.
    if operation == admission.Create && pod.Spec.Priority != nil {
        return admission.NewForbidden(a, fmt.Errorf("the integer value of priority must not be provided in pod spec. Priority admission controller populates the value from the given PriorityClass name"))
    }
    if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
        var priority int32
        // TODO: @ravig - This is for backwards compatibility to ensure that critical pods with annotations just work fine.
        // Remove when no longer needed.
        if len(pod.Spec.PriorityClassName) == 0 &&
            utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
            pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
        }
        if len(pod.Spec.PriorityClassName) == 0 {
            var err error
            priority, err = p.getDefaultPriority()
            if err != nil {
                return fmt.Errorf("failed to get default priority class: %v", err)
            }
        } else {
            // Try resolving the priority class name.
            pc, err := p.lister.Get(pod.Spec.PriorityClassName)
            if err != nil {
                if errors.IsNotFound(err) {
                    return admission.NewForbidden(a, fmt.Errorf("no PriorityClass with name %v was found", pod.Spec.PriorityClassName))
                }

                return fmt.Errorf("failed to get PriorityClass with name %s: %v", pod.Spec.PriorityClassName, err)
            }

            priority = pc.Value
        }
        pod.Spec.Priority = &priority
    }
    return nil
}

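When all of the following conditions hold, the Pod's Spec.PriorityClassName is set to system-cluster-critical, that is, the Pod is treated as a ClusterCriticalPod.

The ExperimentalCriticalPodAnnotation and PodPriority Feature Gates are both enabled;
The Pod does not specify a PriorityClassName;
The Pod belongs to the kube-system namespace;
The Pod carries the scheduler.alpha.kubernetes.io/critical-pod="" annotation;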
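The four conditions above can be expressed as a small predicate. This is a sketch under assumptions: `Pod`, `isCritical`, and `defaultPriorityClassName` are hypothetical names, and the two boolean parameters stand in for the Feature Gate lookups.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace       = "kube-system"
	systemClusterCritical = "system-cluster-critical"
)

// Pod is a hypothetical, simplified pod.
type Pod struct {
	Namespace         string
	Annotations       map[string]string
	PriorityClassName string
}

// isCritical reports whether the pod is in kube-system and carries the
// critical-pod annotation with an empty value.
func isCritical(namespace string, annotations map[string]string) bool {
	if namespace != systemNamespace {
		return false
	}
	value, ok := annotations[criticalPodAnnotation]
	return ok && value == ""
}

// defaultPriorityClassName assigns system-cluster-critical only when every
// condition holds. The booleans stand in for the PodPriority and
// ExperimentalCriticalPodAnnotation Feature Gates.
func defaultPriorityClassName(pod *Pod, podPriorityEnabled, criticalAnnotationEnabled bool) {
	if podPriorityEnabled &&
		criticalAnnotationEnabled &&
		pod.PriorityClassName == "" &&
		isCritical(pod.Namespace, pod.Annotations) {
		pod.PriorityClassName = systemClusterCritical
	}
}

func main() {
	pod := &Pod{
		Namespace:   systemNamespace,
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	defaultPriorityClassName(pod, true, true)
	fmt.Println(pod.PriorityClassName)
}
```

A Pod that already names a PriorityClass is left alone, so the annotation only acts as a backwards-compatible fallback.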

Summary

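This article explained how the kubelet preempts resources for a CriticalPod during the Predicate Admit check, and how the Priority Admission Controller special-cases the PriorityClassName of a CriticalPod. The next article covers the last place where Kubernetes gives CriticalPod special treatment: the DaemonSet Controller.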
