Analyzing a Dockerd Memory Leak

About the author: Dong Jiang is a container technology evangelist and practitioner, a senior system architect at China Mobile, formerly a core network technical expert at Huawei Cloud, a core member of the CloudNative community, founder of the KubeServiceStack community, a Prometheus community PMC member, a Knative Committer, and a Grafana community Contributor. More at: https://kubeservice.cn/

1. Symptoms

An alert fired on the production Kubernetes cluster. Logging in to the node showed dockerd using more than 10 GiB of memory (the node has 15 GiB total), roughly 70% of node memory.

2. Troubleshooting approach

2.1 Docker version

Check docker info and docker version for any unusual configuration.

docker info:

Client:
 Debug Mode: false

Server:
 Containers: 54
  Running: 26
  Paused: 0
  Stopped: 28
 Images: 60
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8686ededfc90076914c5238eb96c883ea093a8ba
 runc version: v1.0.2-0-g52b36a2d
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.19.25-200.1.el7.bclinux.x86_64
 Operating System: BigCloud Enterprise Linux For LDK 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.16GiB
 Name: ecs-prod-003
 ID: SBD2:S775:KOMB:TU2S:YSN6:6OJ7:U4KS:DLCT:QVM4:5UCI:YUA4:NEUC
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 173
  Goroutines: 167
  System Time: 2023-05-08T17:01:41.750342975+08:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  0.0.0.0/0
  127.0.0.0/8
 Registry Mirrors:
  http://xx.xx.xx.xx:7999/
 Live Restore Enabled: true

 [root@ecs-xxxx-003 /]#  dockerd -v
Docker version 19.03.15, build 99e3ed8919

docker version

Client: Docker Engine - Community
 Version:           19.03.15
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        99e3ed8919
 Built:             Sat Jan 30 03:17:57 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       99e3ed8919
  Built:            Sat Jan 30 03:16:33 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.5.7
  GitCommit:        8686ededfc90076914c5238eb96c883ea093a8ba
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2d
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

The storage driver is overlay2, which is current and fine; containerd is at v1.5.7 and runc at the fairly recent 1.0.2. Overall the stack is reasonably up to date.

Note: debug mode was only enabled after the problem appeared, before restarting dockerd, so it has no real bearing on the leak.
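
A side benefit of debug mode: dockerd then serves Go's pprof endpoints on its API socket, so the Goroutines figure above can be sampled and its stacks dumped over time. A minimal sketch, assuming the default socket path /var/run/docker.sock (the "docker" host in the URL is a placeholder; it is ignored for unix-socket transports):

package main

import (
    "context"
    "io"
    "net"
    "net/http"
    "os"
)

func main() {
    // Route HTTP through the Docker API's unix socket instead of TCP.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
            },
        },
    }

    // debug=1 returns goroutine stacks grouped by call site; a leak shows
    // up as an ever-growing count parked on the same wait/select.
    resp, err := client.Get("http://docker/debug/pprof/goroutine?debug=1")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    io.Copy(os.Stdout, resp.Body)
}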

2.2 daemon.json configuration

[root@ecs-xxx-003 /]#  cat /etc/docker/daemon.json 
{
  "debug": true,
  "live-restore": true,
  "registry-mirrors": ["http://xxx.xxx.xxx.xxx:7999"]
}
[root@ecs-xxx-003 /]#  service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2023-05-04 10:18:14 CST; 5 days ago
     Docs: https://docs.docker.com
 Main PID: 160002 (dockerd)
    Tasks: 41
   Memory: 2.1G
   CGroup: /system.slice/docker.service
           └─160002 /usr/bin/dockerd --insecure-registry=0.0.0.0/0 --data-root=/var/lib/docker --log-opt max-size=50m --log-opt max-file=5 -H fd:// --containerd=/run/containerd/containerd.sock

live-restore is enabled, --data-root sets the data directory, and --log-opt configures log rotation. The configuration looks normal.

2.3 Anything suspicious in the logs?

First, check the dockerd logs over a fixed time window:

....
May  2 00:10:24 ecs-prod-003 dockerd: time="2023-05-02T00:10:24.148458112+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:10:36 ecs-prod-003 dockerd: time="2023-05-02T00:10:36.020613181+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:10:46 ecs-prod-003 dockerd: time="2023-05-02T00:10:46.108339401+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:10:58 ecs-prod-003 dockerd: time="2023-05-02T00:10:58.034965957+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:08 ecs-prod-003 dockerd: time="2023-05-02T00:11:08.184787392+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:11:20 ecs-prod-003 dockerd: time="2023-05-02T00:11:20.072593490+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:30 ecs-prod-003 dockerd: time="2023-05-02T00:11:30.193889464+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:11:42 ecs-prod-003 dockerd: time="2023-05-02T00:11:42.031475401+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:11:52 ecs-prod-003 dockerd: time="2023-05-02T00:11:52.118355116+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:04 ecs-prod-003 dockerd: time="2023-05-02T00:12:04.051169727+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:14 ecs-prod-003 dockerd: time="2023-05-02T00:12:14.164452224+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:26 ecs-prod-003 dockerd: time="2023-05-02T00:12:26.043414628+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:36 ecs-prod-003 dockerd: time="2023-05-02T00:12:36.083507423+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"
May  2 00:12:48 ecs-prod-003 dockerd: time="2023-05-02T00:12:48.285459273+08:00" level=info msg="Container a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9 failed to exit within 10 seconds of signal 15 - using the force"
May  2 00:12:58 ecs-prod-003 dockerd: time="2023-05-02T00:12:58.429207582+08:00" level=info msg="Container a9acd5210705 failed to exit within 10 seconds of kill - trying direct SIGKILL"

The window runs from April 30 to May 2 and is full of repeated "failed to exit within 10 seconds of signal 15 - using the force" and "failed to exit within 10 seconds of kill - trying direct SIGKILL" messages, always for the same container.

This looks like where the problem lies.

3. Process flow: how dockerd and containerd interact

The full lifecycle interaction between dockerd and the underlying containerd-shim is shown in the figure above (not reproduced here). In short: dockerd talks to containerd over gRPC; containerd starts one containerd-shim per container, which drives runc; and state changes travel back to dockerd as containerd events, most importantly a /tasks/exit event when the container's process dies.
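
That /tasks/exit event is the hinge of this whole analysis, so it is worth knowing how to observe it. A minimal sketch using the containerd Go client, assuming a github.com/containerd/containerd v1.x dependency, Docker's "moby" namespace, and the default containerd socket path:

package main

import (
    "context"
    "fmt"

    "github.com/containerd/containerd"
    "github.com/containerd/containerd/namespaces"
)

func main() {
    client, err := containerd.New("/run/containerd/containerd.sock")
    if err != nil {
        panic(err)
    }
    defer client.Close()

    // dockerd registers its containers under the "moby" namespace.
    ctx := namespaces.WithNamespace(context.Background(), "moby")

    envelopes, errs := client.Subscribe(ctx)
    for {
        select {
        case e := <-envelopes:
            // The very event that never arrives in the failure mode below.
            if e.Topic == "/tasks/exit" {
                fmt.Printf("%v %s %s\n", e.Timestamp, e.Namespace, e.Topic)
            }
        case err := <-errs:
            panic(err)
        }
    }
}

Running this alongside a reproduction would show whether exit events are being emitted for the stuck container at all.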

4. Source code walkthrough

As part of its pod lifecycle management (PLEG), the kubelet sends docker a stop request and, if that gets no response, follows up with a kill. The leak sits in docker kill's wait for a channel close: the wait spawns another goroutine, and when the channel never closes, every docker kill strands both goroutines.

The docker stop path:

// containerStop sends a stop signal, waits, sends a kill signal.
func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
    if !container.IsRunning() {
        return nil
    }

    stopSignal := container.StopSignal()
    // 1. Send a stop signal
    if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we force kill it.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
            if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
                return err
            }
        }
    }

    // 2. Wait for the process to exit on its own
    ctx := context.Background()
    if seconds >= 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
        defer cancel()
    }

    if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
        // 3. If it doesn't, then send SIGKILL
        if err := daemon.Kill(container); err != nil {
            // Wait without a timeout, ignore result.
            <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning) // this wait can hang forever
            logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
        }
    }

    daemon.LogContainerEvent(container, "stop")
    return nil
}

The docker kill path:

// Kill forcefully terminates a container.
func (daemon *Daemon) Kill(container *containerpkg.Container) error {
    if !container.IsRunning() {
        return errNotRunning(container.ID)
    }

    // 1. Send SIGKILL
    if err := daemon.killPossiblyDeadProcess(container, int(syscall.SIGKILL)); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we return it to the caller.
        if isErrNoSuchProcess(err) {
            return nil
        }

        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            return err
        }
    }

    // 2. Wait for the process to die, in last resort, try to kill the process directly
    if err := killProcessDirectly(container); err != nil {
        if isErrNoSuchProcess(err) {
            return nil
        }
        return err
    }

    // Wait for exit with no timeout.
    // Ignore returned status.
    <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning) // this wait can hang forever

    return nil
}

For the full context, see the kill and stop implementations in the v19.03.15 source (daemon/kill.go and daemon/stop.go in moby/moby).

Because the task-exit event from containerd never arrives, nothing can ever be read from the channel returned by container.Wait, so every docker stop call strands two goroutines: a goroutine leak.
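
The stranded pattern is easy to reproduce in isolation. Below is a self-contained sketch (all names are mine, not moby's) of the same structure: a Wait-style function spawns a helper goroutine that finishes only when an exit event arrives, so when the event never comes, both the helper and its caller stay parked:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// wait mimics container.Wait: it returns a channel that a helper goroutine
// closes once the exit event arrives.
func wait(exited <-chan struct{}) <-chan struct{} {
    done := make(chan struct{})
    go func() { // goroutine #1: parked on <-exited
        <-exited
        close(done)
    }()
    return done
}

func main() {
    neverExits := make(chan struct{}) // the task-exit event that never fires

    // Each iteration plays the role of one kubelet-driven docker stop retry.
    for i := 0; i < 100; i++ {
        go func() { // goroutine #2: parked on the channel returned by wait
            <-wait(neverExits)
        }()
    }

    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // ~201: two leaked per stop
}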

So which pod and container were causing it?

We confirmed the kubelet really was deleting the container over and over without success, caused by processes inside the container stuck in D (uninterruptible sleep) or Z (zombie) state.
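
Hunting for the culprit comes down to finding tasks in D or Z state. ps -eo stat,pid,cmd does the job; the equivalent as a small throwaway Go helper (my own, for illustration) that scans /proc:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    paths, _ := filepath.Glob("/proc/[0-9]*/stat")
    for _, p := range paths {
        data, err := os.ReadFile(p)
        if err != nil {
            continue // process exited while scanning
        }
        // /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
        // spaces, so find the state char just past the last ')'.
        s := string(data)
        i := strings.LastIndexByte(s, ')')
        if i < 0 || i+2 >= len(s) {
            continue
        }
        if state := s[i+2]; state == 'D' || state == 'Z' {
            fmt.Printf("%c %s\n", state, s[:i+1]) // state, pid (comm)
        }
    }
}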

5. Putting the problem together

To guarantee eventual consistency, whenever the kubelet finds a container on the host that should not exist, it keeps trying to delete it. Each attempt calls the docker stop API, establishing a Unix domain socket (uds) connection to dockerd. To delete the container, dockerd starts a goroutine that calls containerd over RPC and waits for the deletion to finish before returning, spawning a second goroutine to collect the result. But when containerd invokes runc to actually perform the deletion, processes stuck in D or Z state inside the container prevent it from being removed, so the task-exit event is never emitted and the two dockerd goroutines never exit.

The cycle repeats indefinitely, so file descriptors, memory, and goroutines leak step by step until the system becomes unusable.
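
One quick way to watch the leak in motion: track dockerd's open file descriptors alongside the Goroutines figure from docker info in debug mode; in this failure mode both climb together. A tiny sketch, taking dockerd's PID as its first argument:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Each leaked uds connection from the kubelet holds an fd open in dockerd.
    entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
    if err != nil {
        panic(err)
    }
    fmt.Println("open fds:", len(entries))
}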

5.1 Trigger condition

Processes in D or Z state inside one of the pod's containers made the container impossible to delete.

5.2 Root cause

This Docker server version puts no timeout on the waiting goroutines, so they hang forever, leaking file descriptors and memory along with them.

The community has since merged a fix:
https://github.com/moby/moby/pull/42956
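
The shape of the fix is to put a deadline on the final wait so the stranded goroutines can always unwind. A hedged sketch of the idea only, not the exact upstream patch (the 30-second figure is illustrative):

    // Instead of waiting forever on an exit event that may never come:
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // With a deadline, the channel always yields: either the container reached
    // the not-running state, or status.Err() reports the timeout. Neither
    // goroutine behind Wait stays parked forever.
    if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        logrus.WithError(status.Err()).Warn("container kill did not complete in time")
    }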

5.3 The container that triggered it

In this environment the trigger was a misbehaving fluentd pod.
Tracing container ID a9acd5210705 through the kubelet logs:

May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.381 [INFO][88861] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9'
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.429 [INFO][88861] ipam_plugin.go 314: Released address using handleID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.429 [INFO][88861] ipam_plugin.go 323: Releasing address using workloadID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 10:47:56 ecs-prod-003 kubelet: 2023-05-01 10:47:56.433 [INFO][88846] k8s.go 498: Teardown processing complete. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.865 [INFO][286438] plugin.go 503: Extracted identifiers ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Node="ecs-prod-003" Orchestrator="k8s" WorkloadEndpoint="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 473: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"ecs--prod--003-k8s-fluentd--hjj42-eth0", GenerateName:"fluentd-", Namespace:"kube-system", SelfLink:"", UID:"d6097c23-9b88-4708-93e0-226bb313e7f3", ResourceVersion:"99857676", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63802199288, loc:(*time.Location)(0x29ce720)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-revision-hash":"86b64f7748", "k8s-app":"fluentd-logging", "pod-template-generation":"1", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"fluentd", "version":"v1"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"ecs-prod-003", ContainerID:"", Pod:"fluentd-hjj42", Endpoint:"eth0", IPNetworks:[]string{"172.20.83.196/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.fluentd"}, InterfaceName:"cali843ab5e3ccd", MAC:"", Ports:[]v3.EndpointPort(nil)}}
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 485: Cleaning up netns ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] linux_dataplane.go 457: veth does not exist, no need to clean up. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" ifName="eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] k8s.go 492: Releasing IP address(es) ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.874 [INFO][286438] utils.go 168: Calico CNI releasing IP address ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.895 [INFO][286452] ipam_plugin.go 302: Releasing address using handleID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.895 [INFO][286452] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9'
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.903 [WARNING][286452] ipam_plugin.go 312: Asked to release address but it doesn't exist. Ignoring ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.903 [INFO][286452] ipam_plugin.go 323: Releasing address using workloadID ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" HandleID="k8s-pod-network.a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9" Workload="ecs--prod--003-k8s-fluentd--hjj42-eth0"
May  1 20:29:34 ecs-prod-003 kubelet: 2023-05-01 20:29:34.907 [INFO][286438] k8s.go 498: Teardown processing complete. ContainerID="a9acd5210705c3e2e95ca459d1b244883c0ba2a5ee94650cb2fa23422367e6e9"

The fluentd app container fef3e1fbee14 from the same pod shows the stuck StopContainer calls:

[root@ecs-prod-003 /var/log]#  grep "kubelet" /var/log/messages-20230507  |grep "fef3e1fbee14"
May  1 11:03:56 ecs-prod-003 kubelet: E0501 11:03:37.635343    4568 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"
May  1 11:04:00 ecs-prod-003 kubelet: E0501 11:03:37.635447    4568 kuberuntime_container.go:666] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID="docker://fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469" gracePeriod=5
May  1 11:04:01 ecs-prod-003 kubelet: E0501 11:03:37.635513    4568 kuberuntime_container.go:691] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID={Type:docker ID:fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469}
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.984106    4568 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.985104    4568 kuberuntime_container.go:666] "Container termination failed with gracePeriod" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID="docker://fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469" gracePeriod=5
May  1 11:13:56 ecs-prod-003 kubelet: E0501 11:13:56.985160    4568 kuberuntime_container.go:691] "Kill container failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" pod="kube-system/fluentd-hjj42" podUID=d6097c23-9b88-4708-93e0-226bb313e7f3 containerName="fluentd" containerID={Type:docker ID:fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469}
May  1 23:37:46 ecs-prod-003 kubelet: I0501 23:37:46.243371  285578 scope.go:111] "RemoveContainer" containerID="fef3e1fbee146788d1557ec20204e10922eeb2607969a26d1323588cc8a7f469"

6. The fix

The fix landed upstream in https://github.com/moby/moby/pull/42956
and first shipped in v20.10.10.

7. Best practices

Short-term approach

If system constraints forbid upgrading Docker, the daemon can be restarted gracefully in place; a configuration like this helps:

{
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "50m",
        "max-file": "5"
    },
    "oom-score-adjust": -1000,
    "registry-mirrors": ["https://xxxxx"],
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}

live-restore: true keeps containers running across daemon restarts, and oom-score-adjust: -1000 gives dockerd maximum protection from the OOM killer. If the problem above recurs, dockerd can then be restarted gracefully (for example with systemctl restart docker) without disturbing running containers.

Long-term approach

Upgrade Docker to v20.10.14 or later.
