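Recently, on a machine running RHEL 5.3 with 40 GB of RAM, we had an Oracle database whose SGA was set to 20 GB. During a peak in business load the system swapped constantly, CPU hit 100% and the load average was very high: by all appearances it was out of memory. There were about 600 connections at the time. By my calculation each connection uses roughly 5 MB, so 600 connections take only about 3 GB; add 20 GB for the SGA and 1 GB for the operating system and the total is 24 GB, against 40 GB of physical memory. How could it be short of memory?
I was baffled, so I ran a lot of stress tests. First I wrote a Java program that starts multiple threads; each thread opens one connection to the database and then loops over a simple SQL statement that queries a very large (indexed) table by an ID produced by a random function. With 1000 connections started, I checked memory with free -m: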
# free -m
             total       used       free     shared    buffers     cached
Mem:         40210      25842      14368          0          9        177
-/+ buffers/cache:      25655      14554
Swap:        20481        479      20001
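The free value turned out to be small and the used value kept growing. After about 20 minutes, when free had dropped to around 40 MB, CPU suddenly went to 100% and the load average jumped from 15 to 600.
This result still pointed to a memory shortage. I also wrote a script to check the memory of every Oracle process, but none of them was using much memory, so the cause remained unknown.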
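The estimate that made the swapping so puzzling can be written out as a small calculation. This is only a sketch of the arithmetic quoted at the start of the post; the function name `estimate_mb` is made up for illustration, and the inputs are the rough figures from the text, not measurements.

```shell
#!/bin/bash
# Rough memory budget in MB:
#   connections * per-connection MB + SGA MB + OS MB
# All inputs are the approximate figures quoted in the post.
estimate_mb() {
    local conns=$1 per_conn_mb=$2 sga_mb=$3 os_mb=$4
    echo $(( conns * per_conn_mb + sga_mb + os_mb ))
}

# 600 connections at 5 MB each, 20 GB SGA, 1 GB for the operating system
estimate_mb 600 5 20480 1024
```

With these inputs the result is 24504 MB, roughly the 24 GB mentioned above and well under the 40210 MB that free -m reports as total.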
Finally I tried looking at memory with cat /proc/meminfo, and that revealed the cause. With no load applied, cat /proc/meminfo showed:
root@xxxx:/proc/sys/vm>cat /proc/meminfo
MemTotal: 41175744 kB
MemFree: 27603324 kB
Buffers: 36572 kB
Cached: 13006240 kB
SwapCached: 232980 kB
Active: 304448 kB
Inactive: 12990616 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 41175744 kB
LowFree: 27603324 kB
SwapTotal: 20972816 kB
SwapFree: 20070348 kB
Dirty: 1232 kB
Writeback: 0 kB
AnonPages: 240500 kB
Mapped: 354120 kB
Slab: 136980 kB
PageTables: 34004 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 41560688 kB
Committed_AS: 17163928 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 273756 kB
VmallocChunk: 34359464051 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
Under load:
root@bopspri:/proc/sys/vm>cat /proc/meminfo
MemTotal: 41175744 kB
MemFree: 375212 kB
Buffers: 36444 kB
Cached: 13005200 kB
SwapCached: 232984 kB
Active: 16919192 kB
Inactive: 509908 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 41175744 kB
LowFree: 375212 kB
SwapTotal: 20972816 kB
SwapFree: 20070340 kB
Dirty: 184 kB
Writeback: 0 kB
AnonPages: 4375088 kB
Mapped: 12889760 kB
Slab: 168916 kB
PageTables: 23005464 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 41560688 kB
Committed_AS: 40413008 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 273756 kB
VmallocChunk: 34359464051 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
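As the output shows, under load PageTables consumed as much as 23 GB. PageTables is the space taken by the tables Linux uses to map virtual memory to physical memory; it is hard to believe that these mapping tables could take up so much memory.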
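To spot this condition quickly, the PageTables line can be pulled out of /proc/meminfo and compared with MemTotal. The following is a minimal sketch; the helper names `meminfo_kb` and `pagetables_pct` are hypothetical, and both simply read meminfo-formatted text from standard input.

```shell
#!/bin/bash
# Print the value (in kB) of one field from /proc/meminfo-style text on stdin.
#   $1 = field name without the trailing colon, e.g. PageTables
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# Print PageTables as an integer percentage of MemTotal (truncated).
pagetables_pct() {
    awk '$1 == "MemTotal:"   { total = $2 }
         $1 == "PageTables:" { pt = $2 }
         END { if (total > 0) printf "%d\n", pt * 100 / total }'
}
```

On a live system these would be called as `meminfo_kb PageTables < /proc/meminfo` and `pagetables_pct < /proc/meminfo`. Fed the under-load figures shown above (MemTotal 41175744 kB, PageTables 23005464 kB), `pagetables_pct` prints 55, meaning more than half of physical memory was holding page tables.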
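To solve this I turned to Linux huge pages. The normal page size is 4 KB, whereas a huge page is 2 MB, so with huge pages the space taken by the mapping tables drops dramatically.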
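The effect of the page size can be estimated with some simple arithmetic. The sketch below assumes 8 bytes per page table entry (as on x86_64), counts only the lowest-level tables, and assumes that every session's server process keeps its own page tables for the shared SGA and eventually touches all of it; these are assumptions for illustration, not something measured in this post. The function name `pagetable_kb` is made up.

```shell
#!/bin/bash
# Estimate page table size in kB.
#   $1 = memory mapped by each process, in kB
#   $2 = page size in kB (4 for normal pages, 2048 for huge pages)
#   $3 = number of processes mapping that memory
# Assumes 8 bytes per entry and ignores upper-level tables.
pagetable_kb() {
    local mapped_kb=$1 page_kb=$2 procs=$3
    echo $(( mapped_kb / page_kb * 8 * procs / 1024 ))
}

pagetable_kb 20971520 4    1000   # 20 GB SGA, 4 kB pages, 1000 sessions
pagetable_kb 20971520 2048 1000   # same SGA with 2 MB pages
```

Under those assumptions the first call gives 40960000 kB (about 40 GB) as an upper bound, and the second gives 80000 kB (about 78 MB). The 23 GB observed under load sits below that upper bound, which would be consistent with each process having touched only part of the SGA.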
So I shut down the database, enabled huge pages, and allocated 20 GB to them:
echo 10240 > /proc/sys/vm/nr_hugepages
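The value 10240 follows from dividing the 20 GB to be covered by the 2048 kB Hugepagesize reported in /proc/meminfo. A small sketch of that calculation, rounding up so that a size that is not an exact multiple still fits (the function name `hugepages_needed` is hypothetical):

```shell
#!/bin/bash
# Number of huge pages needed to cover a given size, rounded up.
#   $1 = size to cover in MB
#   $2 = huge page size in kB (Hugepagesize in /proc/meminfo)
hugepages_needed() {
    local size_mb=$1 hugepage_kb=$2
    echo $(( (size_mb * 1024 + hugepage_kb - 1) / hugepage_kb ))
}

hugepages_needed 20480 2048   # 20 GB with 2 MB pages
```

This prints 10240, matching the value written to nr_hugepages above. If the shared memory segment is even slightly larger than the configured pool it will not fit, so in practice it may be worth checking the real segment size before settling on the number.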
Add the following memlock limits (lines in limits.conf syntax):
root soft memlock -1
root hard memlock -1
oracle soft memlock -1
oracle hard memlock -1
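A value of -1 means no limit. Whether the new limit is actually in effect can be checked from a fresh login session with `ulimit -l`, which prints either a size in kB or the word "unlimited". The helper below is a minimal sketch of that check; `memlock_sufficient` is a made-up name, and the 20 GB figure is the SGA size from this post.

```shell
#!/bin/bash
# Succeed if a memlock limit, as printed by `ulimit -l` (kB or "unlimited"),
# is large enough to lock the given amount of memory.
#   $1 = limit as printed by `ulimit -l`
#   $2 = amount of memory to lock, in kB
memlock_sufficient() {
    local limit=$1 needed_kb=$2
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed_kb" ]
}
```

A typical use, run as the oracle user, would be `memlock_sufficient "$(ulimit -l)" 20971520 && echo "memlock ok"`.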
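After changing the database's lock_sga parameter to true and repeating the stress test, the system finally ran stably, with free -m showing about 13 GB free throughout: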
root@xxxx:/etc/security>free -m
             total       used       free     shared    buffers     cached
Mem:         40210      26234      13976          0         20        184
-/+ buffers/cache:      26029      14181
Swap:        20481        479      20001
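One further check that can be made after the restart is whether the SGA really landed in the huge page pool, by comparing HugePages_Total with HugePages_Free in /proc/meminfo. The following is a sketch only; the function name `hugepages_usage` is hypothetical, and no such output was captured in this post.

```shell
#!/bin/bash
# Print "<used> <total>" huge page counts from /proc/meminfo-style text
# on stdin, where used = HugePages_Total - HugePages_Free.
hugepages_usage() {
    awk '$1 == "HugePages_Total:" { total = $2 }
         $1 == "HugePages_Free:"  { free = $2 }
         END { print total - free, total + 0 }'
}
```

It would be called as `hugepages_usage < /proc/meminfo`. If the first number stays at 0 while the database is running, the SGA is probably not using huge pages. Pages that are reserved but not yet touched still count as free, so HugePages_Rsvd is worth looking at as well.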