Analyzing a Dockerd Memory Leak

About the author: Dong Jiang, container technology evangelist and practitioner, senior systems architect at China Mobile, former core-network technology expert at Huawei Cloud, core member of the CloudNative community, founder of the KubeServiceStack community, Prometheus community PMC member, Knative committer, and Grafana community contributor. Follow: https://kubeservice.cn/

1. Symptom

An alert fired on a production Kubernetes cluster. After logging in to the node, dockerd was found using more than 10 GB of memory (the node has 15 GB in total), so dockerd alone accounted for roughly 70% of node memory.

2. Troubleshooting

2.1 Docker version

First check docker info and docker version for any unusual configuration.

docker info:

Client:
 Debug Mode: false

Server:
 Containers: 54
  Running: 26
  Paused: 0
  Stopped: 28
 Images: 60
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8686ededfc90076914c5238eb96c883ea093a8ba
 runc version: v1.0.2-0-g52b36a2d
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.19.25-200.1.el7.bclinux.x86_64
 Operating System: BigCloud Enterprise Linux For LDK 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.16GiB
 Name: ecs-prod-003
 ID: SBD2:S775:KOMB:TU2S:YSN6:6OJ7:U4KS:DLCT:QVM4:5UCI:YUA4:NEUC
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 173
  Goroutines: 167
  System Time: 2023-05-08T17:01:41.750342975+08:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  0.0.0.0/0
  127.0.0.0/8
 Registry Mirrors:
  http://xx.xx.xx.xx:7999/
 Live Restore Enabled: true

 [root@ecs-xxxx-003 /]#  dockerd -v
Docker version 19.03.15, build 99e3ed8919

docker version:

Client: Docker Engine - Community
 Version:           19.03.15
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        99e3ed8919
 Built:             Sat Jan 30 03:17:57 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       99e3ed8919
  Built:            Sat Jan 30 03:16:33 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.5.7
  GitCommit:        8686ededfc90076914c5238eb96c883ea093a8ba
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2d
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

The storage driver is overlay2, which is current and not a concern. containerd is v1.5.7 and runc is 1.0.2, both reasonably recent. Overall the stack is fairly up to date.

PS: debug mode was only enabled after the problem appeared, before dockerd was restarted, so it has no major impact here. (With debug enabled, docker info also reports the daemon's File Descriptors and Goroutines counts, as seen above, which is handy when watching a leak.)

2.2 daemon.json configuration

[root@ecs-xxx-003 /]#  cat /etc/docker/daemon.json 
{
 "debug": true,
 "live-restore": true,
 "registry-mirrors": ["http://xxx.xxx.xxx.xxx:7999"]
}
[root@ecs-xxx-003 /]#  service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2023-05-04 10:18:14 CST; 5 days ago
     Docs: https://docs.docker.com
 Main PID: 160002 (dockerd)
    Tasks: 41
   Memory: 2.1G
   CGroup: /system.slice/docker.service
           └─160002 /usr/bin/dockerd --insecure-registry=0.0.0.0/0 --data-root=/var/lib/docker --log-opt max-size=50m --log-opt max-file=5 -H fd:// --containerd=/run/containerd/containerd.sock

live-restore mode is enabled, --data-root points the data directory at /var/lib/docker, and --log-opt configures log rotation (50 MB per file, 5 files). The daemon configuration itself looks fine.

2.3 Anything suspicious in the logs?

Next, check the dockerd log files over a fixed time window:

....
May  2 00:10:24 ecs-prod-003 dockerd: time="2023-05-02T00:10:24.148458112+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:10:36 ecs-prod-003 dockerd: time="2023-05-02T00:10:36.020613181+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:10:46 ecs-prod-003 dockerd: time="2023-05-02T00:10:46.108339401+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:10:58 ecs-prod-003 dockerd: time="2023-05-02T00:10:58.034965957+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:08 ecs-prod-003 dockerd: time="2023-05-02T00:11:08.184787392+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:11:20 ecs-prod-003 dockerd: time="2023-05-02T00:11:20.072593490+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:30 ecs-prod-003 dockerd: time="2023-05-02T00:11:30.193889464+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:11:42 ecs-prod-003 dockerd: time="2023-05-02T00:11:42.031475401+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:52 ecs-prod-003 dockerd: time="2023-05-02T00:11:52.118355116+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:04 ecs-prod-003 dockerd: time="2023-05-02T00:12:04.051169727+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:14 ecs-prod-003 dockerd: time="2023-05-02T00:12:14.164452224+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:26 ecs-prod-003 dockerd: time="2023-05-02T00:12:26.043414628+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:36 ecs-prod-003 dockerd: time="2023-05-02T00:12:36.083507423+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:48 ecs-prod-003 dockerd: time="2023-05-02T00:12:48.285459273+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:58 ecs-prod-003 dockerd: time="2023-05-02T00:12:58.429207582+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"

Right around April 30 through May 2 there is a large volume of "failed to exit within 10 seconds of signal 15 - using the force" and "failed to exit within 10 seconds of kill - trying direct SIGKILL" messages, repeating endlessly for the same container.

This is very likely where the problem lies.

3. Process flow: how dockerd and containerd interact

The figure above shows the full lifecycle interaction between dockerd and the underlying containerd-shim.

4. Source code walkthrough

The kubelet's PLEG (Pod Lifecycle Event Generator) drives the Pod lifecycle: it sends docker a stop request and, if the container does not respond, follows up with a kill. The leak comes from the receive on the channel returned by container.Wait during docker kill: the wait starts another goroutine, so every docker kill that never completes leaks both goroutines.

The docker stop path:

// containerStop sends a stop signal, waits, sends a kill signal.
func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
    if !container.IsRunning() {
        return nil
    }

    stopSignal := container.StopSignal()
    // 1. Send a stop signal
    if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we force kill it.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
            if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
                return err
            }
        }
    }

    // 2. Wait for the process to exit on its own
    ctx := context.Background()
    if seconds >= 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
        defer cancel()
    }

    if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
        // 3. If it doesn't, then send SIGKILL
        if err := daemon.Kill(container); err != nil {
            // Wait without a timeout, ignore result.
            <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning) // this receive can block forever
            logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
        }
    }

    daemon.LogContainerEvent(container, "stop")
    return nil
}

The docker kill path:

// Kill forcefully terminates a container.
func (daemon *Daemon) Kill(container *containerpkg.Container) error {
    if !container.IsRunning() {
        return errNotRunning(container.ID)
    }

    // 1. Send SIGKILL
    if err := daemon.killPossiblyDeadProcess(container, int(syscall.SIGKILL)); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we return it to the caller.
        if isErrNoSuchProcess(err) {
            return nil
        }

        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            return err
        }
    }

    // 2. Wait for the process to die, in last resort, try to kill the process directly
    if err := killProcessDirectly(container); err != nil {
        if isErrNoSuchProcess(err) {
            return nil
        }
        return err
    }

    // Wait for exit with no timeout.
    // Ignore returned status.
    <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning) // this receive can block forever

    return nil
}
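
To make the leak mechanism concrete, here is a self-contained toy model of the same pattern (not dockerd code; all names are illustrative): each wait call spawns a goroutine that only finishes when the "task exit" event arrives, and a caller blocked on the returned channel never returns if that event never comes.

package main

import (
    "fmt"
    "runtime"
    "time"
)

// wait mimics container.Wait: it spawns a goroutine that signals the
// returned channel once done is closed (the "task exit" event).
func wait(done <-chan struct{}) <-chan struct{} {
    ch := make(chan struct{}, 1)
    go func() {
        <-done // blocks forever if the exit event never arrives
        ch <- struct{}{}
    }()
    return ch
}

// stopContainer mimics containerStop/Kill: block until "not running".
func stopContainer(done <-chan struct{}) {
    <-wait(done) // hangs forever when done is never closed
}

func main() {
    neverExits := make(chan struct{}) // a task exit event that never fires
    for i := 0; i < 100; i++ {
        go stopContainer(neverExits) // each retry leaks two goroutines
    }
    time.Sleep(time.Second)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // ~201 instead of 1
}

Every retried docker stop against a stuck container adds another pair of goroutines exactly like this, which matches the steadily repeating log lines above.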

For details, see the kill path (daemon/kill.go) and the stop path (daemon/stop.go) in the v19.03.15 source.

Because dockerd never receives the task exit event from containerd, the channel returned by container.Wait never delivers a value, so every docker stop call blocks two goroutines. Hence the goroutine leak.
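
One way to confirm this on a live node is to watch dockerd's goroutine count grow over time. Since the daemon here runs with debug enabled, it should serve Go pprof endpoints on its API socket; the sketch below assumes the default /var/run/docker.sock path and prints the first line of the goroutine profile (e.g. "goroutine profile: total 167"). Alternatively, sending SIGUSR1 to dockerd makes it dump all goroutine stack traces.

package main

import (
    "bufio"
    "context"
    "fmt"
    "net"
    "net/http"
    "os"
)

func main() {
    // Talk to dockerd over its API unix socket; /debug/pprof/ is only
    // served when the daemon runs with "debug": true.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
            },
        },
    }

    resp, err := client.Get("http://docker/debug/pprof/goroutine?debug=1")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer resp.Body.Close()

    // First line of the text profile: "goroutine profile: total N".
    // A total that keeps climbing across runs confirms the leak.
    if sc := bufio.NewScanner(resp.Body); sc.Scan() {
        fmt.Println(sc.Text())
    }
}

If the leak is real, the full dump (?debug=2) should be dominated by goroutines parked on the container.Wait receives shown above.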

So which pod/container is actually causing this?

Verification confirmed that the kubelet keeps trying to delete the container but never succeeds, because a process inside the container is stuck in D (uninterruptible sleep) or Z (zombie) state.
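
Processes in D or Z state can be found directly from /proc. A minimal sketch (a hypothetical helper, not part of any tooling mentioned above) that lists them so they can be matched to the stuck container:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Lists processes in D (uninterruptible sleep) or Z (zombie) state by
// scanning /proc/<pid>/stat; these are the processes that keep runc
// from finishing the container deletion.
func main() {
    stats, _ := filepath.Glob("/proc/[0-9]*/stat")
    for _, path := range stats {
        data, err := os.ReadFile(path)
        if err != nil {
            continue // the process may have exited in the meantime
        }
        s := string(data)
        // Format: "pid (comm) state ..."; comm can contain spaces, so
        // cut at the last ')' before reading the state field.
        end := strings.LastIndex(s, ")")
        if end < 0 || end+2 >= len(s) {
            continue
        }
        fields := strings.Fields(s[end+2:])
        if len(fields) == 0 {
            continue
        }
        if state := fields[0]; state == "D" || state == "Z" {
            pid := strings.Fields(s)[0]
            comm := s[strings.Index(s, "(")+1 : end]
            fmt.Printf("pid=%s state=%s comm=%s\n", pid, state, comm)
        }
    }
}

A D-state process is typically blocked inside the kernel on I/O and cannot be terminated even by SIGKILL, which is exactly why runc cannot finish deleting the container.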

5. Putting the problem together

To guarantee eventual consistency, whenever the kubelet finds a container on the host that should no longer exist, it keeps trying to delete it. Each attempt calls the docker stop API, opening a uds connection to dockerd. When dockerd deletes a container it starts a goroutine that calls containerd over RPC and does not return until the deletion completes; while waiting it starts another goroutine to collect the result. But when containerd asks runc to actually perform the deletion, the D-state or Z-state process inside the container makes deletion impossible, so the task exit event is never emitted and the two related goroutines in dockerd never exit.

This cycle repeats endlessly, so file descriptors, memory, and goroutines leak bit by bit and the node gradually becomes unusable.
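
The file-descriptor side of the leak can be watched in the same way, by counting dockerd's open descriptors over time. A tiny sketch (a hypothetical helper; pass dockerd's PID and run it as root):

package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: fdcount <dockerd-pid>")
        os.Exit(1)
    }
    // Every open descriptor of the process appears as an entry here;
    // reading another process's fd directory requires root.
    entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("dockerd open fds: %d\n", len(entries))
}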

5.1 Trigger condition

A process in D or Z state inside one of the Pod's containers prevents the container from being deleted.

5.2 Root cause

This Docker daemon version puts no timeout on the waiting goroutine, so the goroutine hangs forever, which in turn leaks file descriptors and memory.

The community has already fixed this upstream: https://github.com/moby/moby/pull/42956
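
Whatever the exact shape of the upstream patch, the general idea of bounding such a wait looks like this rough, self-contained sketch (illustrative only, reusing the toy model above rather than the actual dockerd code): tie the wait to a context so that neither the caller nor the waiter goroutine can hang forever.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// waitCtx mimics container.Wait(ctx, ...): the spawned goroutine unblocks
// either on the exit event or when the context is cancelled, so neither
// the waiter nor its caller can hang forever.
func waitCtx(ctx context.Context, done <-chan struct{}) <-chan error {
    ch := make(chan error, 1)
    go func() {
        select {
        case <-done:
            ch <- nil
        case <-ctx.Done():
            ch <- ctx.Err()
        }
    }()
    return ch
}

func main() {
    neverExits := make(chan struct{}) // the task exit event never fires

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    if err := <-waitCtx(ctx, neverExits); errors.Is(err, context.DeadlineExceeded) {
        fmt.Println("gave up waiting:", err) // prints after ~2s instead of hanging
    }
}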

5.3 The container that triggered it

In this environment the trigger was an abnormal fluentd Pod. Tracing container ID a9acd5210705 through the related kubelet logs:

May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.381 [INFO][88861] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9'
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.429 [INFO][88861] ipam_plugin.go 314: Released address using handleID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.429 [INFO][88861] ipam_plugin.go 323: Releasing address using workloadID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.433 [INFO][88846] k8s.go 498: Teardown processing complete. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.865 [INFO][286438] plugin.go 503: Extracted identifiers ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Node="ecs-prod-003" Orchestrator="k8s" WorkloadEndpoint="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 473: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"ecs--prod--003-k8s-fluentd--hjj42-eth0", GenerateName:"fluentd-", Namespace:"kube-system", SelfLink:"", UID:"d6097c23-9b88-4708-93e0-226bb313e7f3", ResourceVersion:"99857676", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63802199288, loc:(*time.Location)(0x29ce720)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-revision-hash":"86b64f7748", "k8s-app":"fluentd-logging", "pod-template-generation":"1", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"fluentd", "version":"v1"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"ecs-prod-003", ContainerID:"", Pod:"fluentd-hjj42", Endpoint:"eth0", IPNetworks:[]string{"172.20.83.196/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.fluentd"}, InterfaceName:"cali843ab5e3ccd", MAC:"", Ports:[]v3.EndpointPort(nil)}}
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 485: Cleaning up netns ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] linux_dataplane.go 457: veth does not exist, no need to clean up. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" ifName="eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 492: Releasing IP address(es) ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] utils.go 168: Calico CNI releasing IP address ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.895 [INFO][286452] ipam_plugin.go 302: Releasing address using handleID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.895 [INFO][286452] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9'
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.903 [WARNING][286452] ipam_plugin.go 312: Asked to release address but it doesn't exist. Ignoring ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.903 [INFO][286452] ipam_plugin.go 323: Releasing address using workloadID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.907 [INFO][286438] k8s.go 498: Teardown processing complete. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
[root@ecs-prod-003 /var/log]#  grep "kubelet" /var/log/messages-20230507  |grep "fef3e1fbee14"
May  1 11:03:56 ecs-prod-003 kubelet: E0501 11:03:37.635343    4568 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"
May  1 11:04:00 ecs-prod-003 kubelet: E0501 11:03:37.635447    4568 kuberuntime_container.go:666] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID="docker://fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469" gracePeriod=5
May  1 11:04:01 ecs-prod-003 kubelet: E0501 11:03:37.635513    4568 kuberuntime_container.go:691] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID={Type:docker ID:fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469}
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.984106    4568 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.985104    4568 kuberuntime_container.go:666] "Container termination failed with gracePeriod" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID="docker://fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469" gracePeriod=5
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.985160    4568 kuberuntime_container.go:691] "Kill container failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID={Type:docker ID:fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469}
May  1 23:37:46 ecs-prod-003 kubelet: I0501 23:37:46.243371  285578 scope.go:111] "RemoveContainer" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"

6. The fix

The upstream fix for this issue: https://github.com/moby/moby/pull/42956, first released in v20.10.10.

7. Best practices

Short term

If system constraints rule out upgrading the Docker version, the daemon can be gracefully restarted in place as a workaround, with a configuration like:

{
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "50m",
        "max-file": "5"
    },
    "oom-score-adjust": -1000,
    "registry-mirrors": ["https://xxxxx"],
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}

With live-restore: true, running containers keep running across a dockerd restart, and oom-score-adjust: -1000 effectively exempts dockerd from the OOM killer. When the problem above occurs, the daemon can then be restarted gracefully (for example with systemctl restart docker) to reclaim the leaked goroutines and memory without disrupting workloads.

Long term

Performance and stability:

Upgrade Docker to v20.10.14 or later.
