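Background
Monitoring shows the Secondary using about 11 GB more physical memory than the Primary.
For the basic memory analysis, start with the troubleshooting document written by another colleague on the team.
The user has not enabled reads from secondaries, so the Secondary serves almost no traffic other than replication, and it has few connections. That largely rules out application behavior as the cause of the high memory usage on the Secondary, so we suspected the caching behavior of the tcmalloc allocator.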
Investigation
Check the serverStatus.tcmalloc output on the Primary and the Secondary.
Primary:
mgset-25489817:PRIMARY> db.serverStatus().tcmalloc
{
"generic" : {
"current_allocated_bytes" : NumberLong("16296822448"),
"heap_size" : NumberLong("34201272320")
},
"tcmalloc" : {
"pageheap_free_bytes" : 933314560,
"pageheap_unmapped_bytes" : NumberLong("15870619648"),
"max_total_thread_cache_bytes" : NumberLong(1073741824),
"current_total_thread_cache_bytes" : 543050048,
"total_free_bytes" : NumberLong(1100498976),
"central_cache_free_bytes" : 557461008,
"transfer_cache_free_bytes" : 4096,
"thread_cache_free_bytes" : 543031184,
"aggressive_memory_decommit" : 0,
"pageheap_committed_bytes" : NumberLong("18330652672"),
"pageheap_scavenge_count" : 22937964,
"pageheap_commit_count" : 31247638,
"pageheap_total_commit_bytes" : NumberLong("218141866151936"),
"pageheap_decommit_count" : 23394903,
"pageheap_total_decommit_bytes" : NumberLong("218123535499264"),
"pageheap_reserve_count" : 9872,
"pageheap_total_reserve_bytes" : NumberLong("34201272320"),
"spinlock_total_delay_ns" : NumberLong("113428202936"),
Secondary:
mgset-25489817:SECONDARY> db.serverStatus().tcmalloc
{
"generic" : {
"current_allocated_bytes" : NumberLong("16552694552"),
"heap_size" : NumberLong("33373687808")
},
"tcmalloc" : {
"pageheap_free_bytes" : NumberLong("11787452416"),
"pageheap_unmapped_bytes" : NumberLong("4039823360"),
"max_total_thread_cache_bytes" : NumberLong(1073741824),
"current_total_thread_cache_bytes" : 113279256,
"total_free_bytes" : 993717480,
"central_cache_free_bytes" : 879823248,
"transfer_cache_free_bytes" : 614976,
"thread_cache_free_bytes" : 113279256,
"aggressive_memory_decommit" : 0,
"pageheap_committed_bytes" : NumberLong("29333864448"),
"pageheap_scavenge_count" : 2605518,
"pageheap_commit_count" : 4694997,
"pageheap_total_commit_bytes" : NumberLong("672231747584"),
"pageheap_decommit_count" : 3544502,
"pageheap_total_decommit_bytes" : NumberLong("642897883136"),
"pageheap_reserve_count" : 25284,
"pageheap_total_reserve_bytes" : NumberLong("33373687808"),
"spinlock_total_delay_ns" : NumberLong("3132393632"),
We focus on the *_free_bytes fields:
- pageheap_free_bytes: Number of bytes in free, mapped pages in page heap. These bytes can be used to fulfill allocation requests. They always count towards virtual memory usage, and unless the underlying memory is swapped out by the OS (swap is currently not enabled in production), they also count towards physical memory usage.
- total_free_bytes = central_cache_free_bytes + transfer_cache_free_bytes + thread_cache_free_bytes. Note that total_free_bytes does not include pageheap_free_bytes (see the tcmalloc code), so to see how much memory tcmalloc is caching, look at pageheap_free_bytes + total_free_bytes.
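Finally, comparing the serverStatus output of the Secondary and the Primary, total_free_bytes is about the same on both, around 1 GB, while pageheap_free_bytes is about 11 GB higher on the Secondary (11787452416 vs. 933314560 bytes). That matches the RSS difference observed at the OS level.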
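As a rough cross-check of this comparison, the sketch below adds pageheap_free_bytes and total_free_bytes for each node and prints the gap. The helper name cachedBytes and the trimmed-down objects are illustrative only; the numbers are copied from the two outputs above, and NumberLong values are assumed to have already been converted to plain numbers.

```javascript
// Bytes tcmalloc holds in its caches: free pages in the page heap plus
// total_free_bytes (central + transfer + thread caches).
function cachedBytes(tc) {
  return Number(tc.pageheap_free_bytes) + Number(tc.total_free_bytes);
}

// Values copied from the Primary and Secondary outputs above.
const primary = { pageheap_free_bytes: 933314560, total_free_bytes: 1100498976 };
const secondary = { pageheap_free_bytes: 11787452416, total_free_bytes: 993717480 };

const GIB = 1024 ** 3;
const gap = cachedBytes(secondary) - cachedBytes(primary);

console.log("primary cached GiB:  ", (cachedBytes(primary) / GIB).toFixed(2));
console.log("secondary cached GiB:", (cachedBytes(secondary) / GIB).toFixed(2));
console.log("gap GiB:             ", (gap / GIB).toFixed(2));
```

In the shell the same function could be fed db.serverStatus().tcmalloc.tcmalloc, after converting any NumberLong fields with toNumber().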
The meanings of central_cache_free_bytes, transfer_cache_free_bytes, and thread_cache_free_bytes are listed here as well. The code does not explain them; these descriptions were found elsewhere:
- central_cache_free_bytes, Number of free bytes in the central cache that have been assigned to size classes. They always count towards virtual memory usage, and unless the underlying memory is swapped out by the OS, they also count towards physical memory usage. This property is not writable.
- transfer_cache_free_bytes, Number of free bytes that are waiting to be transferred between the central cache and a thread cache. They always count towards virtual memory usage, and unless the underlying memory is swapped out by the OS, they also count towards physical memory usage. This property is not writable.
- thread_cache_free_bytes, Number of free bytes in thread caches. They always count towards virtual memory usage, and unless the underlying memory is swapped out by the OS, they also count towards physical memory usage. This property is not writable.
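The relationship between total_free_bytes and these three counters can be checked against the outputs shown earlier. This is a minimal sketch: sumCacheTiers and drift are hypothetical helper names, and the values are copied by hand from the Primary and Secondary dumps as plain numbers.

```javascript
// Sum of the three cache tiers that make up total_free_bytes.
function sumCacheTiers(tc) {
  return tc.central_cache_free_bytes +
         tc.transfer_cache_free_bytes +
         tc.thread_cache_free_bytes;
}

// Difference between the reported total and the recomputed sum.
function drift(tc) {
  return tc.total_free_bytes - sumCacheTiers(tc);
}

// Values copied from the Primary and Secondary outputs.
const primary = {
  total_free_bytes: 1100498976,
  central_cache_free_bytes: 557461008,
  transfer_cache_free_bytes: 4096,
  thread_cache_free_bytes: 543031184,
};
const secondary = {
  total_free_bytes: 993717480,
  central_cache_free_bytes: 879823248,
  transfer_cache_free_bytes: 614976,
  thread_cache_free_bytes: 113279256,
};

console.log("primary drift (bytes):  ", drift(primary));
console.log("secondary drift (bytes):", drift(secondary));
```

The Secondary numbers add up exactly. The Primary is off by a few KB, presumably because the counters are not sampled at the same instant on a busy node.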
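Optimization
Alibaba Cloud MongoDB implements a tcmallocRelease command (executable only from the backend, not exposed to external users). Under the hood it calls tcmalloc's ReleaseFreeMemory() to reclaim the PageHeap. However, the command locks the entire PageHeap while it runs, which can cause other requests that need to allocate memory to hang, so run it with care in production. Also, if you are not particularly sensitive to this cached memory, running it is not recommended: the memory is not really wasted, and keeping it reduces the number of system calls needed later.
In addition, this method does not affect the Central Cache or the Thread Cache. The policy and timing by which tcmalloc returns cached memory to the operating system are fairly complex; see this article for details.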
We ran db.adminCommand({tcmallocRelease: 1}) on the Hidden node of the instance above and observed pageheap_free_bytes drop by more than 90%.
Before:
mgset-25489817:SECONDARY> db.serverStatus().tcmalloc
{
"generic" : {
"current_allocated_bytes" : NumberLong("16549856240"),
"heap_size" : NumberLong("34105942016")
},
"tcmalloc" : {
"pageheap_free_bytes" : NumberLong("7499571200"),
"pageheap_unmapped_bytes" : NumberLong("9387900928"),
"max_total_thread_cache_bytes" : NumberLong(1073741824),
"current_total_thread_cache_bytes" : 133710112,
"total_free_bytes" : 668613648,
"central_cache_free_bytes" : 534325360,
"transfer_cache_free_bytes" : 578176,
"thread_cache_free_bytes" : 133710112,
After:
mgset-25489817:SECONDARY> db.serverStatus().tcmalloc
{
"generic" : {
"current_allocated_bytes" : NumberLong("16546167280"),
"heap_size" : NumberLong("34105942016")
},
"tcmalloc" : {
"pageheap_free_bytes" : 38395904,
"pageheap_unmapped_bytes" : NumberLong("16852795392"),
"max_total_thread_cache_bytes" : NumberLong(1073741824),
"current_total_thread_cache_bytes" : 134981800,
"total_free_bytes" : 668583440,
"central_cache_free_bytes" : 533437608,
"transfer_cache_free_bytes" : 164032,
"thread_cache_free_bytes" : 134981800,
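To put numbers on the effect of the release, the sketch below compares the before and after snapshots. releaseEffect is a hypothetical helper, and the inputs are the two page heap counters copied from the outputs above as plain numbers.

```javascript
// Compare two tcmalloc snapshots taken around a page heap release.
function releaseEffect(before, after) {
  const freed = before.pageheap_free_bytes - after.pageheap_free_bytes;
  return {
    freedBytes: freed,                                  // bytes that left the free list
    dropRatio: freed / before.pageheap_free_bytes,      // fraction of the free list released
    unmappedGrowth: after.pageheap_unmapped_bytes - before.pageheap_unmapped_bytes,
  };
}

// Values copied from the before/after outputs.
const before = { pageheap_free_bytes: 7499571200, pageheap_unmapped_bytes: 9387900928 };
const after = { pageheap_free_bytes: 38395904, pageheap_unmapped_bytes: 16852795392 };

const effect = releaseEffect(before, after);
console.log("freed bytes:    ", effect.freedBytes);
console.log("drop percentage:", (effect.dropRatio * 100).toFixed(1) + "%");
console.log("unmapped growth:", effect.unmappedGrowth);
```

The growth in pageheap_unmapped_bytes is close to the drop in pageheap_free_bytes, which is consistent with the freed pages having been returned to the OS while heap_size stayed unchanged.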
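Official JIRA Issue
There are several related issues, but we focus on this one: https://jira.mongodb.org/browse/SERVER-37541 . It is effectively a summary of the problem discussed here and covers two causes:
- Fragmentation. Bruce Lucas opened a JIRA ticket for this problem, but the MongoDB team replied that it is not on the high-priority list, so it went to the backlog (PS: reducing memory fragmentation is a universally hard problem that neither tcmalloc nor jemalloc solves perfectly, so it may indeed be hard to optimize).
- The caching behavior of the memory allocator. tcmalloc is rather "reluctant" to return memory to the operating system, and it sometimes reaches a tipping point and returns memory all at once, causing performance jitter. The server parameter tcmallocAggressiveMemoryDecommit can be set to make memory reclamation more aggressive, but the MongoDB team found performance problems in testing, so it is off by default.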
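For reference, a minimal sketch of how the tcmallocAggressiveMemoryDecommit parameter could be toggled from the shell. It assumes the MongoDB version in use accepts this parameter through setParameter at runtime; aggressiveDecommitCommand is a hypothetical helper that only builds the command document, so it can be inspected before anything is sent to a server.

```javascript
// Build the setParameter command document for tcmallocAggressiveMemoryDecommit.
// 1 turns aggressive decommit on, 0 restores the default (off).
function aggressiveDecommitCommand(enabled) {
  return { setParameter: 1, tcmallocAggressiveMemoryDecommit: enabled ? 1 : 0 };
}

console.log(JSON.stringify(aggressiveDecommitCommand(true)));

// In the mongo shell (assumption: the parameter is settable at runtime):
//   db.adminCommand(aggressiveDecommitCommand(true));
// The current value shows up as "aggressive_memory_decommit" in
// db.serverStatus().tcmalloc.tcmalloc.
```

Given the performance concerns mentioned above, it seems wise to try this on a hidden or otherwise low-traffic node first.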