Linux Kernel Lowmem Pressure Issues and Related Kernel Structures (Doc ID 452326.1)

What is LowMem?
Ideally, an architecture would let page frames map or address any location in its address space directly, providing a uniform and virtually unlimited range.

Unfortunately, real architecture implementations have constraints that limit the way page frames can be employed. Specifically, on the 80x86 architecture (for which the Linux kernel was initially designed), the kernel has to deal with two major constraints:

Direct Memory Access (DMA): Very old ISA bus systems can only address the first 16 MB of RAM
32-bit systems with large RAM: The CPU cannot directly address all physical memory because the address space is limited
To cope with these, the Linux kernel defines three "zones". On the 80x86 (Intel 32-bit) architecture they are:

DMA: 0x00000000 - 0x00FFFFFF (0 - 16 MB)
LowMem: 0x01000000 - 0x37FFFFFF (16 - 896 MB) - size: 880 MB
HighMem: 0x38000000 -
The LowMem zone is also called the NORMAL zone internally, so this document uses both terms interchangeably.

On 32-bit Linux the LowMem zone is 880 MB and cannot be resized. To obtain a larger LowMem zone, use a hugemem kernel or a 64-bit architecture (see Note 264236.1).
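
A quick way to check the configured and free LowMem on a running system is to filter /proc/meminfo. A minimal sketch (the Low* lines appear on kernels that split memory into zones this way; on 64-bit kernels LowTotal may simply equal MemTotal):

# Show LowTotal and LowFree for the running kernel.
grep -i '^Low' /proc/meminfo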

Kernel Structure Types
Below are the main categories of structures in the Linux kernel.

IMPORTANT: LowMem pressure or shortage is unlikely to be caused by a single kernel structure. The issues are almost always the aggregate effect of more than one kernel structure.

Lists: Most of the data structures in the Linux kernel are linked lists, almost all of them circular doubly-linked lists, which are flexible and easily traversable. Examples:
Process lists
Resource lists
Run queues
Tables: These are contiguous memory areas in LowMem that, thanks to their hierarchical structure, can grow or shrink. Examples:
Page Tables (implemented via PTE, PMD, PGD etc.)
Exception Tables
Buffers / Caches: These are fixed-size data structures allocated for a specific task, either to buffer a data transfer or to cache data for faster access and manipulation. Examples:
Block buffers
Remote procedure call buffers
Index node cache
Directory entry cache
Of the categories above, Lists and Tables are not known to cause LowMem problems, so the main focus of this document is the Buffers / Caches category.

The Buddy System Algorithm
The internal kernel services and add-on modules need to allocate groups of contiguous page frames as temporary memory areas used to implement kernel functions. These allocations are handled by the Buddy System Algorithm, which treats page frames as "generic" areas and, to avoid external fragmentation, allocates chunks of 2^n contiguous page frames (geometrically distributed sizes). The free chunks can be seen in /proc/buddyinfo, as shown below.
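
As a quick illustration of the geometric chunk sizes, the following sketch prints the chunk size served at each buddy order, assuming 4 kB pages and the usual eleven orders (0 through 10):

# Chunk size at order n is 2^n pages of 4 kB each.
for o in $(seq 0 10); do
    echo "order $o: $((4 * (1 << o))) kB"
done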

The SLAB Allocator
This is a specific organization and algorithm for allocating the different types of kernel objects used for different purposes.

The SLAB allocator implementation in the Linux kernel defines a set of named caches, each holding objects of a specific type. For example:

rpc_buffers
journal_head
ext3_inode_cache
arp_cache
kiocb
bio
inode_cache
dentry_cache
Note that these also cover the Buffers / Caches category above; in principle, any cache or buffer in the kernel is allocated via the SLAB allocator.

The object types available to the SLAB allocator vary between kernels and versions. Moreover, loaded kernel modules (such as the OCFS2 modules) can introduce additional caches, e.g.:

ocfs2_inode_cache
ocfs2_lock
...
For further information about SLABs and SLAB allocator please see Note 434351.1.
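
To see which caches are the biggest LowMem consumers, the SLAB statistics can be inspected live. A minimal sketch (slabtop is part of the procps package; reading /proc/slabinfo may require root, and its column layout varies slightly between kernel versions):

# Top-like live view of SLAB caches; press 'c' to sort by cache size.
slabtop

# One-shot list of the ten largest caches: name and approximate
# total bytes (object count * object size), assuming slabinfo v2.x
# where those are columns 3 and 4.
awk 'NR > 2 { print $1, $3 * $4 }' /proc/slabinfo | sort -k2 -rn | head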

Kernel Structures per Purpose
The following is a representative list of structures allocated in LowMem; see Note 434351.1 for further details, including which ones are most commonly involved in problems.

Scheduler
Run Queue
Process descriptors
Interrupts / Exceptions / Signals
Signal handling queue
Timers
POSIX timer cache
Networking
Generic flow cache
RPC buffers, tasks and i-node caches
TCP/IP hash tables, caches, buckets
BSD Unix domain sockets, tables and caches
Storage
Device Mapper transactions, I/O vectors
SCSI command caches, queuing pools
Block Device caches, controls, vectors, queues
Asynchronous I/O controls and context
I/O Scheduling pools and queues
Filesystems
Buffers
Auditing
I-node caches
File Lock Caches
Directory entry caches
Filesystem specific (ext2,3 etc.) caches and attributes
Journalling structures
Processes
Per process structures
Memory management structure data allocated by each new process
Virtual memory area data allocated by each new process
File and filesystem related data allocated by each new process
Signal cache and handlers allocated by each new process
Task information for every new process forked
Per user structure to keep track of user processes
Memory Management
Internal cache of cache description objects
Anonymous virtual memory addresses for physical to virtual reverse mappings
hugetlbpage-backed filesystem cache
Resizable virtual memory filesystems
Page middle directory
Page global directory
Other
Generic caches of size N
Generic caches of size N for direct memory access use
Common Symptoms of LowMem Pressure
Two groups of issues occur in the LowMem area:

Memory shortage: The total amount of free memory gets so low that almost no large allocation is possible in LowMem
Memory fragmentation: There are no larger contiguous free chunks; all free memory is in small chunks such as 4 kB or 8 kB
Some of the symptoms below stem from one of these situations, others from both. Memory shortage and fragmentation can be reliably diagnosed by checking /proc/meminfo, /proc/buddyinfo and the SysRq-M output described below.

Processes are Getting Killed by the OOM Killer
When there is a memory shortage, the Out of Memory (OOM) killer frees memory by terminating processes that appear idle and have a large working set. This generally implies a LowMem shortage, but it should be verified with /proc/meminfo data.

In the messages logfile, you would see something like:

kernel: Out of Memory: Killed process NNNN
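
To confirm past OOM killer activity, the messages log can be searched directly. A sketch (the exact message text varies between kernel versions, hence the loose pattern):

# List OOM events recorded by syslog, including rotated logs.
grep -i 'out of memory' /var/log/messages*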

OOM Killer Strikes when There is No Swap Usage
A Linux system generally does not start using the swap area (due to lazy swapping) unless total free memory is very low. So if the OOM killer kills processes while the system is not swapping, there are still available pages in the HighMem zone but LowMem is short.

In the messages log file, you would see something like:

kernel: Out of Memory: Killed process NNNN
and

free

             total       used       free     shared    buffers     cached
...
...
Swap:      2048248          0    2048248

Note that the OOM killer will take action if there is a request for the LowMem area while LowMem is short.
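
As a sketch, both indicators can be checked side by side: if swap is unused while LowFree is near zero, the kills were almost certainly driven by LowMem.

# Swap usage and free LowMem side by side.
free | grep -i swap
grep -i 'LowFree' /proc/meminfo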

OOM Killer Strikes when There is Free Memory
If the OOM killer is taking action even when there is free memory:

free

             total       used       free     shared    buffers     cached
Mem:       2067508    1507748     559760          0      38044    1072244
-/+ buffers/cache:     397460    1670048
Swap:      2048248          4    2048244
This generally means the system might be out of free LowMem pages.

Kernel is Unable to Spawn Processes
The symptoms differ according to the running application, but typically the fork() system call fails with either EAGAIN or ENOMEM. From the fork() man page:

EAGAIN fork() cannot allocate sufficient memory to copy the parent process page tables and allocate a task structure for the child.
ENOMEM fork() failed to allocate the necessary kernel structures because memory is tight.

System Calls Failing with ENOBUFS / ENOMEM
While diagnosing application failures or performance issues with strace (see man strace), you may see system calls failing with ENOBUFS, ENOMEM or EAGAIN, most likely repeatedly if the problem is performance-related. This may be due to LowMem fragmentation or shortage.
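
A sketch of such a session, attaching to a running process (replace <PID>, a placeholder here, with the process ID under investigation):

# Trace a running process and keep only calls failing with
# memory-related errors; strace writes to stderr, hence 2>&1.
strace -f -p <PID> 2>&1 | grep -E 'ENOMEM|ENOBUFS|EAGAIN'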

Newer kernel versions include enhancements to avoid LowMem fragmentation.

LowMem Shortage Shown in /proc/meminfo
A shortage in the LowMem area is reflected in the 'LowFree' line of /proc/meminfo. Note that even on a healthy system you may see low LowFree values that do not indicate a LowMem shortage. For example, a system with 2 GB of memory and a hugemem kernel:

MemTotal: 2061160 kB
MemFree: 10228 kB
Buffers: 119840 kB
Cached: 1307784 kB
Active: 587724 kB
Inactive: 1236924 kB
...
LowTotal: 2061160 kB
LowFree: 10228 kB
Here the system appears short of memory, but buffers are high (and can be released if needed), along with about 1.24 GB of cached pages, of which about 1.18 GB are inactive and can also be released if needed. Whether that happens depends on the workload.
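
A rough way to estimate how much of that memory the kernel could reclaim on demand, as a sketch (this overestimates on kernels where Inactive also counts anonymous pages):

# Approximate reclaimable memory: buffers plus inactive pages.
awk '/^Buffers:/  { b = $2 }
     /^Inactive:/ { i = $2 }
     END { print "reclaimable (approx): " (b + i) " kB" }' /proc/meminfo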

Conversely, even with output like the following:

MemTotal: 4026240 kB
MemFree: 758396 kB
...
HighTotal: 3145720 kB
HighFree: 608768 kB
LowTotal: 880520 kB
LowFree: 149628 kB
...
you might still have LowMem problems. Here about 146 MB of LowMem is available, but it might be fragmented. So in many situations the /proc/meminfo output means little on its own. To check fragmentation, see the next section.

Clear Fragmentation Detected in /proc/buddyinfo & SysRq-M Output
This is the most reliable way to diagnose LowMem fragmentation. See Note 228203.1 about the use of magic keys. When you request the memory dump you might get something like:

SysRq : Show Memory
...
DMA: 221*4kB 20*8kB 2*16kB 19*32kB 12*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 3988kB
Normal: 25536*4kB 3518*8kB 959*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 145632kB
HighMem: 61454*4kB 18219*8kB 4135*16kB 1558*32kB 1472*64kB 48*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 608704kB
In the layout above, there is about 142 MB of free space in the LowMem area (NORMAL zone), but it is all in 4, 8 and 16 kB chunks: the memory is highly fragmented. There are no 32 kB or 64 kB (or larger) contiguous chunks, which many applications need. In that case applications would fail with ENOMEM / ENOBUFS.

Similar information can be gathered live from /proc/buddyinfo too.

cat /proc/buddyinfo

Node 0, zone      DMA    221     20      2     19     12      2      1      0      1      0      0
Node 0, zone   Normal  25536   3518    959      0      0      0      0      0      0      0      0
Node 0, zone  HighMem  61454  18219   4135   1558   1472     48      1      1      0      0      0

Each column of numbers represents the number of free chunks (221, 20, ...) of that order (0, 1, 2, ..., meaning 2^0 * PAGE_SIZE, 2^1 * PAGE_SIZE, ...).
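
To turn those counts into the amount of free memory held at each order, the columns can be multiplied out. A sketch, assuming a 4 kB PAGE_SIZE:

# For each zone, print the free kB held at each order
# (chunks at order n are 2^n pages of 4 kB each).
awk '{ printf "%s %s %s %s:", $1, $2, $3, $4;
       for (i = 5; i <= NF; i++) printf " %d kB", $i * 4 * 2^(i-5);
       print "" }' /proc/buddyinfo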

Troubleshooting LowMem Pressure Problems
The following are some initial actions that you may take to troubleshoot LowMem pressure issues.

Determine whether you have LowMem shortage only or fragmentation too.
Using the tools described in Note 434351.1, examine your SLAB cache entries for growing entries (LowMem shortage) or very active ones (which may cause fragmentation). If the applications being run behave as expected, consider changing the system itself as below; if not, the application needs to be fixed.
If running Enterprise Linux 3, consider:
Switching to hugemem kernel
Upgrading to the U8 errata patch (2.4.21-47) and setting vm.vm-defragment appropriately
Upgrading to Enterprise Linux 4 or higher and setting vm.lower_zone_protection to 100 (This applies to x86 32-bit architectures only - i.e. not valid for x86-64)
On Linux x86-64 environments (and recent i686 2.6.x kernels), increasing the value of vm.min_free_kbytes causes the system to start reclaiming memory earlier than it otherwise would, which can help decrease LowMem pressure (see the sketch after this list).
In the case of fragmentation, if you find that a running application demands too large a contiguous space, the application might need to be fixed (see Note 419871.1).
If the running application can employ HugePages (see Note 361323.1) and performs many LowMem operations that HugePages could serve instead, enabling HugePages will help with the LowMem pressure. Note that the Oracle RDBMS is not such an application, as the portions of the SGA that can use HugePages already allocate from HighMem.
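
A sketch of applying the tunables mentioned above (the values are illustrative and must be validated for the workload; vm.lower_zone_protection exists only on 32-bit 2.6 kernels such as Enterprise Linux 4):

# Make the kernel start reclaiming memory earlier (value in kB).
sysctl -w vm.min_free_kbytes=65536

# Protect LowMem from allocations that could be served from HighMem
# (32-bit 2.6 kernels only).
sysctl -w vm.lower_zone_protection=100

# Add the same keys to /etc/sysctl.conf to persist across reboots.
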
REFERENCES
NOTE:264236.1 - Considerations on using "hugemem" Kernel vs SMP kernel
NOTE:275318.1 - The Bigpages Feature on Linux
NOTE:360402.1 - Aggressive swapping and low resources during RMAN backup on Linux X86
NOTE:419871.1 - Failures due to "skgxpvfymmtu: process failed because of a resource problem in the OS" on 32-bit Linux
NOTE:452000.1 - Linux: Out-of-Memory (OOM) Killer

NOTE:396038.1 - Common Misconceptions About Linux Kernel Structures
NOTE:405720.1 - o2net Using High CPU and Cluster Node Evictions
NOTE:434351.1 - Linux Kernel: The SLAB Allocator
