kubelet Memory Anomaly Analysis


Background:

After kubelet was restarted, its memory usage climbed again. After the game servers started, and without a large number of players joining, kubelet's memory grew to nearly 10 GB within two days and kept growing, with occasional spikes in CPU load. AWS attributed this to an issue in containerd v1.4.0; the internal analysis process is documented below.

Actual situation

# Overall picture

[ec2-user@server]$ free -h

              total        used        free      shared  buff/cache   available

Mem:            31G         15G         13G        102M        1.8G         17G

Swap:            0B          0B          0B

[ec2-user@server]$ top

top - 18:22:11 up 18 days, 44 min,  1 user,  load average: 0.34, 0.37, 0.28

Tasks: 183 total,   1 running, 116 sleeping,   0 stopped,   0 zombie

%Cpu(s):  0.6 us,  0.7 sy,  0.0 ni, 98.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.6 st

KiB Mem : 32577156 total, 14126708 free, 16523260 used,  1927188 buff/cache

KiB Swap:        0 total,        0 free,        0 used. 18165344 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                  

29958 root      20   0   11.2g   9.1g  62448 S   2.9 29.3 185:57.18 kubelet                                                                                                                                                  

 4252 root      20   0 5169628   3.1g  47444 S   1.5 10.0 541:55.51 dockerd                                                                                                                                                  

 5920 root      20   0 2980576   2.8g   5760 S  10.3  9.0   3846:28 containerd-shim                                                                                                                                          

 5981 root      20   0  163972  56460  27972 S   0.0  0.2  12:58.19 aws-k8s-agent                                                                                                                                            

 1513 root      20   0  129228  55956  55512 S   0.0  0.2   2:00.04 systemd-journal                                                                                                                                          

 3614 root      20   0 3487244  51304  27264 S   0.0  0.2  85:15.65 containerd                                                                                                                                               

 6866 101       20   0  744864  42556  20548 S   0.0  0.1  34:39.17 nginx-ingress-c                                                                                                                                          

 3837 root      20   0  487720  41568  40036 S   0.0  0.1   1:51.04 rsyslogd                                                                                                                                                 

 5616 root      20   0  141592  25588  16760 S   0.0  0.1  10:04.11 kube-proxy                                                                                                                                               

23638 101       20   0  124928  19092   3944 S   0.0  0.1   8:51.19 nginx                                                                                                                                                    

23640 101       20   0  124800  18916   3792 S   0.0  0.1   7:50.70 nginx                                                                                                                                                    

23643 101       20   0  124800  18876   3740 S   0.0  0.1   7:51.30 nginx                                                                                                                                                    

23642 101       20   0  124672  18772   3776 S   0.0  0.1   7:50.18 nginx                                                                                                                                                    

23637 101       20   0  124544  18740   3864 S   0.0  0.1   7:48.88 nginx                                                                                                                                                    

23639 101       20   0  124544  18672   3792 S   0.0  0.1   7:50.47 nginx                                                                                                                                                    

23641 101       20   0  124544  18624   3760 S   0.0  0.1   7:49.50 nginx                                                                                                                                                    

23644 101       20   0  124416  18624   3876 S   0.0  0.1   7:51.38 nginx                                                                                                                                                    

26195 root      20   0   55140  17180   6784 S   0.0  0.1   3:06.30 supervisord                                                                                                                                              

17424 nfsnobo+  20   0  621508  16416   9268 S   0.0  0.1   4:47.43 node_exporter                                                                                                                                            

27261 root      20   0  114792  15356   7192 S   0.0  0.0  13:26.84 python                                                                                                                                                   

27230 root      20   0  119092  11944   8936 S   0.0  0.0   5:26.61 aws-efs-csi-dri                                                                                                                                          

 5354 root      20   0 1499128  11908   7176 S   0.0  0.0  13:03.18 falcon-agent                                                                                                                                             

27352 root      20   0  114524  11396   8188 S   0.0  0.0   8:12.86 livenessprobe                                                                                                                                            

 7104 101       20   0  110640   9908   4700 S   0.0  0.0   0:00.41 nginx                                                                                                                                                    

  402 root      20   0  150544   8576   7260 S   0.0  0.0   0:00.00 sshd                                                                                                                                                     

27293 root      20   0  113604   8336   6824 S   0.0  0.0   0:04.58 csi-node-driver                                                                                                                                          

26241 root      20   0   63512   6716   5292 S   0.0  0.0   0:00.01 syslog-ng                                                                                                                                                

 9644 postfix   20   0   88292   6644   5632 S   0.0  0.0   0:00.00 pickup                                                                                                                                                   

 4151 root      20   0  110768   6336   5296 S   0.0  0.0   0:00.07 sshd                                                                                                                                                     

 5554 root      20   0  108828   6076   5308 S   0.0  0.0   0:36.73 containerd-shim                                                                                                                                          

27134 root      20   0  110236   6040   5268 S   0.0  0.0   0:24.21 containerd-shim                                                                                                                                          

 3797 postfix   20   0   88364   6004   4992 S   0.0  0.0   0:00.69 qmgr                                                                                                                                                     

 6766 root      20   0  108828   5944   5180 S   0.0  0.0   0:37.73 containerd-shim                                                                                                                                          

26037 root      20   0  108828   5904   5384 S   0.0  0.0   0:33.20 containerd-shim                                                                                                                                          

17148 root      20   0  108828   5900   5328 S   0.0  0.0   0:20.14 containerd-shim                                                                                                                                          

26164 root      20   0  110236   5900   5204 S   0.0  0.0   0:34.79 containerd-shim                                                                                                                                          

 5053 root      20   0  108828   5808   5096 S   0.0  0.0   0:38.79 containerd-shim                                                                                                                                          

    1 root      20   0  191440   5780   3944 S   0.0  0.0   4:21.84 systemd                                                                                                                                                  

27262 root      20   0  108828   5760   5152 S   0.0  0.0   0:24.57 containerd-shim                                                                                                                                          

17389 root      20   0  110236   5748   5196 S   0.0  0.0   0:19.76 containerd-shim                                                                                                                                          

 5054 root      20   0  108828   5632   4936 S   0.0  0.0   0:40.97 containerd-shim                                                                                                                                          

 6163 root      20   0  110236   5056   4472 S   0.0  0.0   0:36.13 containerd-shim                                                                                                                                          

27321 root      20   0  110236   4744   4356 S   0.0  0.0   0:24.02 containerd-shim                                                                                                                                          

27199 root      20   0  108828   4736   4356 S   0.0  0.0   0:23.77 containerd-shim                                                                                                                                          

23645 101       20   0  108604   4708   1268 S   0.0  0.0   0:04.64 nginx

CPU analysis

1) Overall analysis: runtime.scanobject (part of Go garbage collection) consumes the most CPU and the most time.

Performance profiling

# Function-call analysis, CPU profile, sampled continuously for 1 minute

go tool pprof -seconds=60 http://xxxxx:8001/api/v1/nodes/ip-10-90-154-0.cn-north-1.compute.internal/proxy/debug/pprof/profile

Fetching profile over HTTP from http://xxxxx:8001/api/v1/nodes/ip-10-90-154-0.cn-north-1.compute.internal/proxy/debug/pprof/profile?seconds=60

Please wait... (1m0s)

Saved profile in /Users/westos/pprof/pprof.kubelet.samples.cpu.002.pb.gz

File: kubelet

Build ID: fa0e6afcd119e92cd898064c5f918d9dc46a9676

Type: cpu

Time: Dec 7, 2020 at 4:58pm (CST)

Duration: 1mins, Total samples = 3.13s ( 5.22%)

Entering interactive mode (type"help"forcommands, "o"foroptions)

(pprof) top

Showing nodes accounting for 2250ms, 71.88% of 3130ms total

Dropped 178 nodes (cum <= 15.65ms)

Showing top 10 nodes out of 190

      flat  flat%   sum%        cum   cum%

     500ms 15.97% 15.97%      680ms 21.73%  runtime.findObject

     490ms 15.65% 31.63%     1910ms 61.02%  runtime.scanobject

     320ms 10.22% 41.85%      340ms 10.86%  runtime.pageIndexOf

     200ms  6.39% 48.24%      680ms 21.73%  runtime.greyobject

     160ms  5.11% 53.35%      170ms  5.43%  runtime.spanOf

     160ms  5.11% 58.47%      280ms  8.95%  runtime.sweepone

     160ms  5.11% 63.58%      160ms  5.11%  syscall.Syscall6

     150ms  4.79% 68.37%      150ms  4.79%  runtime.futex

      60ms  1.92% 70.29%       60ms  1.92%  syscall.Syscall

      50ms  1.60% 71.88%       60ms  1.92%  runtime.(*gcBits).bitp

2) Flame graph (the horizontal axis represents the sampled time, showing the relative share of each part of the program; the vertical axis represents the function call stack)

Analysis: kubelet's CPU usage is only occasionally abnormal and is normal most of the time, but the flame graph shows that the garbage-collection workers are the busiest, in particular runtime.gcBgMarkWorker. This is the Go GC function that marks reachable objects so that anything left unmarked can be reclaimed as garbage. Normally it does not consume this much CPU; when it does, the cause is usually a memory/GC problem. Yet memory usage was only a few hundred MB above normal, which is not especially high. The explanation is that GC performance is generally driven not by how much memory is in use but by how many objects are alive. For example, 10 objects of 100 MB each and 100 million 10-byte objects occupy roughly the same amount of memory, but collecting 100 million small objects is far slower than collecting 10 large ones (the sketch below illustrates this).
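
To make the object-count effect concrete, here is a minimal, hedged Go sketch (not taken from kubelet; the sizes and counts are illustrative assumptions). It compares GC time for a few large allocations against roughly a million tiny pointer-bearing objects of comparable total size:

// Minimal sketch: GC mark cost scales with the number of live objects,
// not with total bytes. A handful of large slices and about a million
// tiny linked-list nodes hold a similar amount of memory, but a GC
// cycle takes much longer with the latter alive.
package main

import (
	"fmt"
	"runtime"
	"time"
)

type node struct{ next *node }

func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	// Case 1: a few large allocations (10 x 1 MB byte slices).
	big := make([][]byte, 10)
	for i := range big {
		big[i] = make([]byte, 1<<20)
	}
	fmt.Println("few large objects, GC took:", timeGC())
	runtime.KeepAlive(big)
	big = nil // drop the reference so case 1 does not affect case 2

	// Case 2: about one million tiny pointer-bearing objects.
	var head *node
	for i := 0; i < 1000000; i++ {
		head = &node{next: head}
	}
	fmt.Println("many small objects, GC took:", timeGC())
	runtime.KeepAlive(head)
}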

Flame graph analysis


# go tool pprof -http localhost:8001 ~/Documents/pprof.kubelet.samples.cpu.002.pb.gz

(CPU flame graph image)

Memory analysis

1) Overall analysis: kubelet occupies roughly 9.86 GB of memory in total, most of it allocated by two functions:

encoding/json.(*decodeState).literalStore 7.4G
reflect.unsafe_NewArray   2.2G

heap


➜  ~ go tool pprof   http://10.90.32.42:8001/api/v1/nodes/ip-10-90-154-0.cn-north-1.compute.internal/proxy/debug/pprof/heap
Fetching profile over HTTP from http://10.90.32.42:8001/api/v1/nodes/ip-10-90-154-0.cn-north-1.compute.internal/proxy/debug/pprof/heap
Saved profile in /Users/westos/pprof/pprof.kubelet.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: kubelet
Build ID: fa0e6afcd119e92cd898064c5f918d9dc46a9676
Type: inuse_space
Time: Dec 8, 2020 at 6:40am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 10031.25MB, 99.40% of 10091.96MB total
Dropped 224 nodes (cum <= 50.46MB)
Showing top 10 nodes out of 20
      flat  flat%   sum%        cum   cum%
 7593.96MB 75.25% 75.25%  7593.96MB 75.25%  encoding/json.(*decodeState).literalStore
 2223.15MB 22.03% 97.28%  2223.15MB 22.03%  reflect.unsafe_NewArray
  213.63MB  2.12% 99.39%   214.63MB  2.13%  k8s.io/kubernetes/vendor/github.com/docker/docker/pkg/stdcopy.StdCopy
    0.50MB 0.005% 99.40%  9850.17MB 97.60%  k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).ExecSync
         0     0% 99.40%  9833.62MB 97.44%  encoding/json.(*Decoder).Decode
         0     0% 99.40%  9799.61MB 97.10%  encoding/json.(*decodeState).array
         0     0% 99.40%  9833.62MB 97.44%  encoding/json.(*decodeState).object
         0     0% 99.40%  9833.62MB 97.44%  encoding/json.(*decodeState).unmarshal
         0     0% 99.40%  9833.62MB 97.44%  encoding/json.(*decodeState).value
         0     0% 99.40%  9848.67MB 97.59%  k8s.io/kubernetes/pkg/kubelet/dockershim.(*streamingRuntime).exec
(pprof) list literalStore
Total: 9.86GB
ROUTINE ======================== encoding/json.(*decodeState).literalStore in /usr/local/go/src/encoding/json/decode.go
    7.42GB     7.42GB (flat, cum) 75.25% of Total
         .          .   1005:                           break
         .          .   1006:                   }
         .          .   1007:                   if v.NumMethod() != 0 {
         .          .   1008:                           d.saveError(&UnmarshalTypeError{Value: "number", Type: v.Type(), Offset: int64(d.readIndex())})
         .          .   1009:                           break
    7.42GB     7.42GB   1010:                   }
         .          .   1011:                   v.Set(reflect.ValueOf(n))
         .          .   1012:
         .          .   1013:           case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
         .          .   1014:                   n, err := strconv.ParseInt(s, 10, 64)
         .          .   1015:                   if err != nil || v.OverflowInt(n) {
(pprof)
(pprof) list unsafe_NewArray
Total: 9.86GB
ROUTINE ======================== reflect.unsafe_NewArray in /usr/local/go/src/runtime/malloc.go
    2.17GB     2.17GB (flat, cum) 22.03% of Total
         .          .   1173://go:linkname reflectlite_unsafe_New internal/reflectlite.unsafe_New
         .          .   1174:func reflectlite_unsafe_New(typ *_type) unsafe.Pointer {
         .          .   1175:   return mallocgc(typ.size, typ, true)
         .          .   1176:}
         .          .   1177:
    2.17GB     2.17GB   1178:// newarray allocates an array of n elements of type typ.
         .          .   1179:func newarray(typ *_type, n int) unsafe.Pointer {
         .          .   1180:   if n == 1 {
         .          .   1181:           return mallocgc(typ.size, typ, true)
         .          .   1182:   }
         .          .   1183:   mem, overflow := math.MulUintptr(typ.size, uintptr(n))
(pprof)

2) Flame graph (read top to bottom: the function call stack)

Analysis:

• 1) The call stack is mainly fetching container status; the innermost function located in kubelet's vendored code is k8s.io/kubernetes/vendor/github.com/docker/docker/client.(*Client).ContainerInspect, which retrieves container information.
• 2) The decodeState methods allocate a huge number of small objects, with literalStore contributing the most: 124,620,205 (about 125 million) 64 B objects are in use, i.e. 64 B * 124,620,205 / 1024^3 ≈ 7.4 GB (the sketch after this list shows how JSON decoding produces allocations like this).
• 3) Combined with the CPU flame graph above, where the GC functions are the busiest, this corroborates that an extremely large number of small objects exist.
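
For illustration, here is a minimal, hedged Go sketch (not kubelet's actual code path; the array size and field names are made up) showing how encoding/json decoding turns each string or number literal into its own small heap allocation, which is why object count rather than payload size dominates the heap profile:

// Minimal sketch: decoding a large JSON array allocates one small heap
// object per literal (inside encoding/json's decodeState.literalStore)
// plus the slice backing array (reflect's newarray), so the heap fills
// with millions of tiny objects even though each value is only a few bytes.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"runtime"
)

func main() {
	// Build a JSON array containing one million short string literals.
	var buf bytes.Buffer
	buf.WriteByte('[')
	for i := 0; i < 1000000; i++ {
		if i > 0 {
			buf.WriteByte(',')
		}
		fmt.Fprintf(&buf, `"val-%d"`, i)
	}
	buf.WriteByte(']')

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	var out []string
	if err := json.Unmarshal(buf.Bytes(), &out); err != nil {
		panic(err)
	}

	runtime.ReadMemStats(&after)
	fmt.Printf("decoded %d strings, heap objects grew by roughly %d\n",
		len(out), after.HeapObjects-before.HeapObjects)
}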

(memory flame graph image)

--inuse_objects (in-use object count): decodeState is holding about 125 million 64 B objects, and the count is still growing


--inuse_space (in-use memory): the small objects accumulated by decodeState already total close to 8 GB and keep growing over time


Summary:

1) Overall CPU usage is normal with occasional anomalies; during the anomalies the GC functions are the busiest. The fact that CPU is normal most of the time suggests that even though this many small objects are created, relatively little is actually reclaimed; otherwise CPU usage would be much higher.

2) On the memory side, calls made through dockershim.(*streamingRuntime).exec produce a very large number of small objects, very few of which are reclaimed, so memory keeps growing.

3) Although we can confirm that the memory growth is caused by excessive small-object allocation with very little reclamation, pinpointing exactly why the CRI code (the container-runtime-facing part of kubelet) creates so many small objects while checking container status, for example whether a failing call is being retried endlessly, still requires digging into the Kubernetes CRI source, the interplay between kubelet, containerd-shim, and docker, and the logs.

4) strace was unable to trace kubelet's call stacks or count the failing calls (reason unknown).

Lessons learned

When troubleshooting memory-related problems:

• If GC-related functions show abnormal CPU usage, focus first on the number of objects.
• When chasing a speed problem (CPU usage), look at the object-count views (--inuse_objects / --alloc_objects).
• When chasing a memory-footprint problem, look at the space views (--inuse_space / --alloc_space); see the example below.
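
As a hedged example (assuming a recent Go toolchain, where these flags select among the four sample types recorded in a heap profile), the saved heap profile from earlier can be re-opened with either view:

# object counts: how many objects are live
go tool pprof -inuse_objects /Users/westos/pprof/pprof.kubelet.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz

# bytes: how much memory is live
go tool pprof -inuse_space /Users/westos/pprof/pprof.kubelet.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz

-alloc_objects and -alloc_space work the same way for cumulative allocations since the process started.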

In upcoming posts I will cover more tools and methods for analyzing memory and CPU anomalies. Stay tuned.
