How to Size Kernel Resources for PostgreSQL

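Background

From the operating system's point of view, a database is a fairly large application. It often consumes a lot of system resources, especially for inter-process communication (IPC).

The operating system's default settings may not meet the database's resource needs.

So how should the OS resource parameters be set to match what the database requires?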

Calculating PostgreSQL's system resource requirements

Before getting into resource allocation, the following are worth reading:
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#SYSVIPC

https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

kernel-doc-xxx/Documentation/sysctl/kernel.txt

https://en.wikipedia.org/wiki/UNIX_System_V

This one mainly covers inter-process communication:
https://docs.oracle.com/cd/E19455-01/806-4750/6jdqdflta/index.html

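Shared memory and semaphores

On shared memory management in PostgreSQL: early versions could allocate shared memory only through System V (SysV). At startup the server had to allocate one shared memory segment, whose size depends mainly on shared_buffers.

In 9.3 the community changed this to allocate shared memory with mmap, which greatly reduced the amount of System V shared memory required.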
https://www.postgresql.org/docs/9.3/static/release-9-3.html

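PostgreSQL 9.4 then added dynamically allocated shared memory, mainly as groundwork for multi-core parallel query. Its default implementation is posix (where the platform supports it), which likewise needs no large shared memory segment at startup. The file-backed mmap implementation of dynamic shared memory is discouraged because it adds I/O overhead.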
https://www.postgresql.org/docs/9.4/static/release-9-4.html

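Since 9.4, the dynamic shared memory implementation is selected with the dynamic_shared_memory_type parameter.

Each value creates shared memory segments differently: posix uses shm_open, sysv uses shmget, and mmap uses mmap.

Because the segments are created differently, the kernel parameters that need tuning differ as well.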

dynamic_shared_memory_type (enum) 

Specifies the dynamic shared memory implementation that the server should use. 

Possible values are:
posix (for POSIX shared memory allocated using shm_open), 
sysv (for System V shared memory allocated via shmget), 
windows (for Windows shared memory), 
mmap (to simulate shared memory using memory-mapped files stored in the data directory), 
none (to disable this feature). 

Not all values are supported on all platforms;  

the first supported option is the default for that platform.   
In other words, if no value is set explicitly, the first supported method is used.  

The use of the mmap option, which is not the default on any platform, is generally discouraged because the operating system may write modified pages back to disk repeatedly, increasing system I/O load;
however, it may be useful for debugging, when the pg_dynshmem directory is stored on a RAM disk, or when other shared memory facilities are not available.  
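Avoid mmap unless you want to debug PostgreSQL's shared memory.

Different shared memory allocation methods place different requirements on the kernel parameters.

The resources involved and how to size them are as follows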

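| Name | Description | Reasonable values |
| ---- | ---- | ---- |
| SHMMAX | Maximum size of a single shared memory segment (bytes) | See the separate table below, or simply set it to 80% of physical memory |
| SHMMIN | Minimum size of a single shared memory segment (bytes) | 1 |
| SHMALL | Total shared memory available system-wide, summed over all segments (bytes or pages) | Account for other applications that allocate shared memory and make sure it exceeds their combined needs; it can usually be set to the size of physical memory |
| SHMSEG | Maximum number of shared memory segments per process | Only 1 segment is needed, but the default is much higher, so no change is required |
| SHMMNI | Maximum number of shared memory segments system-wide | Number of processes that allocate shared memory * SHMSEG |
| SEMMNI | Maximum number of semaphore identifiers (i.e., sets) | At least ceil((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16); PostgreSQL allocates one set per 16 processes |
| SEMMNS | Maximum number of semaphores system-wide | ceil((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16) * 17 plus room for other applications; each set needs 17 semaphores. In practice, set it to SEMMNI * SEMMSL |
| SEMMSL | Maximum number of semaphores per set | At least 17 |
| SEMMAP | Number of entries in semaphore map | See text |
| SEMVMX | Maximum value of semaphore | At least 1000 (the default is often 32767; do not change unless necessary) |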

Calculating shmmax (shared memory segment size)

SysV shared memory management (versions < 9.3)
https://www.postgresql.org/docs/9.2/static/kernel-resources.html

For 9.2 and earlier, where shared memory is managed through SysV, the shmmax requirement is the sum of all the items below.

| Usage | Approximate shared memory bytes required (as of 8.3) |
| ---- | ---- |
| Connections | (1800 + 270 * max_locks_per_transaction) * max_connections |
| Autovacuum workers | (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers |
| Prepared transactions | (770 + 270 * max_locks_per_transaction) * max_prepared_transactions |
| Shared disk buffers | (block_size + 208) * shared_buffers |
| WAL buffers | (wal_block_size + 8) * wal_buffers |
| Fixed space requirements | 770 kB |
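The table above can be summed mechanically. The following is a minimal sketch, not PostgreSQL's own accounting: the function name and the default values (64 locks per transaction, 8192-byte blocks) are assumptions chosen for illustration, shared_buffers and wal_buffers are taken as block counts, and "770 kB" is read as 770 * 1024 bytes. The result is only as accurate as the approximate 8.3-era figures it is built from.

```python
def estimate_sysv_shmem_bytes(max_connections,
                              autovacuum_max_workers,
                              max_prepared_transactions,
                              shared_buffers,
                              wal_buffers,
                              max_locks_per_transaction=64,
                              block_size=8192,
                              wal_block_size=8192):
    """Sum the rows of the approximate shared memory table.

    shared_buffers and wal_buffers are numbers of blocks, not bytes.
    The defaults are assumptions for illustration only.
    """
    per_lock = 270 * max_locks_per_transaction
    total = (1800 + per_lock) * max_connections            # Connections
    total += (1800 + per_lock) * autovacuum_max_workers    # Autovacuum workers
    total += (770 + per_lock) * max_prepared_transactions  # Prepared transactions
    total += (block_size + 208) * shared_buffers           # Shared disk buffers
    total += (wal_block_size + 8) * wal_buffers            # WAL buffers
    total += 770 * 1024                                    # Fixed space (770 kB)
    return total


if __name__ == "__main__":
    # 16 GB of shared_buffers expressed as 8 kB blocks, 16 MB of WAL buffers.
    needed = estimate_sysv_shmem_bytes(
        max_connections=100,
        autovacuum_max_workers=3,
        max_prepared_transactions=0,
        shared_buffers=16 * 1024**3 // 8192,
        wal_buffers=16 * 1024**2 // 8192,
    )
    print("approximate shmmax requirement (bytes):", needed)
```

Whatever value comes out, kernel.shmmax should be set somewhat above it, since the table is an approximation.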

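SysV shared memory management (versions >= 9.3)
For PostgreSQL 9.3 and later, far less SysV shared memory is needed even when SysV is used. Measurements follow later.

Usually around 4 KB is enough.

Shared memory managed with posix, mmap, or none
One PostgreSQL cluster needs a shared memory segment of only 56 bytes (measured).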

  PostgreSQL requires a few bytes of System V shared memory (typically 48 bytes, on 64-bit platforms) for each copy of the server.   
  On most modern operating systems, this amount can easily be allocated.   

  However, if you are running many copies of the server, or if other applications are also using System V shared memory :   
  it may be necessary to increase SHMMAX, the maximum size in bytes of a shared memory segment,
  SHMALL, the total amount of System V shared memory system-wide.   
  Note that SHMALL is measured in pages rather than bytes on many systems.

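Summary
For versions below 9.3, set these three kernel parameters.
(Versions 9.3 and above need a much smaller shmmax, so the same settings also work for them.)

kernel.shmall = physical memory size (if the unit is pages, bytes / PAGE_SIZE)   
kernel.shmmax >= shared_buffers plus overhead (bytes); the 9.2 test later in this article needed about 16.4 GiB for 16 GB of shared_buffers   
kernel.shmmni >= number of database clusters * 2 (for >= 9.4 with SysV, each PostgreSQL cluster needs 2 shared memory segments)

If several PostgreSQL clusters run on one server, each cluster needs its own shared memory segments, so size these parameters for the sum of all clusters.

shmmin and shmseg need no tuning, which the shmget man page also confirms

shmmin = 1  # 1 byte, but the effective minimum is 1 page (4 KB)  
shmseg = unlimited

System page size (when huge pages are not in use)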

# getconf PAGE_SIZE
4096
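Since kernel.shmall is counted in pages on Linux, a byte figure has to be divided by the page size first. A small sketch of that conversion follows; the function name is hypothetical, and the 64 GiB figure is just an example, not a value from this article.

```python
import os


def shmall_pages(total_bytes, page_size=None):
    """Convert a byte count into whole pages, rounding up.

    When page_size is omitted, the running system's page size is used
    (the same value that `getconf PAGE_SIZE` prints).
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return -(-total_bytes // page_size)  # ceiling division on integers


if __name__ == "__main__":
    # Example: a hypothetical machine with 64 GiB of RAM and 4 KB pages.
    print("kernel.shmall =", shmall_pages(64 * 1024**3, 4096))
```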

man shmget

       The following limits on shared memory segment resources affect the shmget() call:

       SHMALL System wide maximum of shared memory pages (on Linux, this limit can be read and modified via /proc/sys/kernel/shmall).

       SHMMAX Maximum size in bytes for a shared memory segment: policy dependent (on Linux, this limit can be read and modified via /proc/sys/kernel/shmmax).

       SHMMIN Minimum size in bytes for a shared memory segment: implementation dependent (currently 1 byte, though PAGE_SIZE is the effective minimum size).

       SHMMNI System wide maximum number of shared memory segments: implementation dependent (currently 4096, was 128 before Linux 2.3.99; on Linux, this limit can be read and modified via /proc/sys/kernel/shmmni). 

       The implementation has no specific limits for the per-process maximum number of shared memory segments (SHMSEG).

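Calculating semaphores

The semaphore requirement does not depend on the database version. It is calculated as follows.

Number of sets needed
SEMMNI >= ceil((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16)
Number of semaphores needed
SEMMNS >= ceil((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16) * 17 + what other programs need
Semaphores needed per set
SEMMSL >= 17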
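The three formulas above are easy to get wrong by hand because of the rounding. Here is a minimal sketch of the calculation; the function name and the `other_semaphores` parameter are hypothetical, and the inputs in the usage example (100 connections, 8 worker processes, 3 autovacuum workers) are assumed values for illustration, not settings confirmed by this article.

```python
import math


def semaphore_requirements(max_connections,
                           max_worker_processes,
                           autovacuum_max_workers,
                           other_semaphores=0):
    """Return the minimum SEMMNI, SEMMNS and SEMMSL for one cluster.

    other_semaphores is room reserved for other software on the host.
    """
    processes = (max_connections + max_worker_processes
                 + autovacuum_max_workers + 5)
    sets = math.ceil(processes / 16)  # one semaphore set per 16 processes
    return {
        "SEMMNI": sets,
        "SEMMNS": sets * 17 + other_semaphores,  # 17 semaphores per set
        "SEMMSL": 17,
    }


if __name__ == "__main__":
    print(semaphore_requirements(100, 8, 3))
```

With those assumed inputs the sketch yields 8 sets and 136 semaphores, which happens to be the same pair of numbers that the `ipcs -u` output later in this article reports.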

Example of the corresponding kernel setting

# sysctl -w kernel.sem="1234      150994944       512000  7890"

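The fields mean

max number of arrays = 7890  (semmni, fourth field)  
max semaphores per array = 1234  (semmsl, first field)  
max semaphores system wide = 150994944  (semmns, second field; commonly set to semmni*semmsl, although this example uses a value larger than 7890 * 1234 = 9736260)
max ops per semop call = 512000  (semopm, third field)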
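The field order of kernel.sem (semmsl, semmns, semopm, semmni) does not match the order in which `ipcs -l` prints the limits, so it is easy to mix up. The sketch below parses the value into named fields and checks it against a requirement; `KernelSem`, `parse_kernel_sem` and `satisfies` are hypothetical helper names, not part of any existing tool.

```python
from typing import NamedTuple


class KernelSem(NamedTuple):
    semmsl: int  # max semaphores per array
    semmns: int  # max semaphores system wide
    semopm: int  # max ops per semop call
    semmni: int  # max number of arrays


def parse_kernel_sem(value: str) -> KernelSem:
    """Parse the whitespace-separated value of kernel.sem."""
    fields = value.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return KernelSem(*(int(f) for f in fields))


def satisfies(sem: KernelSem, need_sets: int, need_semaphores: int,
              per_set: int = 17) -> bool:
    """True if the limits cover the given number of sets and semaphores."""
    return (sem.semmni >= need_sets
            and sem.semmns >= need_semaphores
            and sem.semmsl >= per_set)


if __name__ == "__main__":
    sem = parse_kernel_sem("1234      150994944       512000  7890")
    print(sem)
    print("enough for 8 sets / 136 semaphores:", satisfies(sem, 8, 136))
```

On a live Linux system the same string can be read from /proc/sys/kernel/sem.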

How to view the SysV resource limits currently set on the system

# ipcs -l

------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 19531250
max total shared memory (kbytes) = 4096000000
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 7890
max semaphores per array = 1234
max semaphores system wide = 150994944
max ops per semop call = 512000
semaphore max value = 32767

How to view the SysV resources in use

# ipcs -u

------ Messages Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes

------ Shared Memory Status --------
segments allocated 2
pages allocated 2
pages resident  2
pages swapped   0
Swap performance: 0 attempts     0 successes

------ Semaphore Status --------
used arrays = 8
allocated semaphores = 136

Measurements

Measuring shmmax and semaphores
With shared memory managed by posix, mmap, or none, the measured shmmax requirement is as follows

sysctl -w kernel.shmmax=1024 
# 
postgresql.conf
shared_buffers = 16GB

Start the database and check IPC

$ ipcs
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d4fe9 294912     digoal     600        56         5   
# 
With sysctl -w kernel.shmmax=48, startup fails because 56 bytes are needed    
FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=1921001, size=56, 03600).  
HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter, or possibly that it is less than your kernel's SHMMIN parameter.
        The PostgreSQL documentation contains more information about shared memory configuration.

With shared memory managed by sysv, the measured shmmax requirement is as follows

postgresql.conf
dynamic_shared_memory_type = sysv

If the limit is set below what the database needs, startup fails

sysctl -w kernel.shmmax=1024

Error

 pg_ctl start
server starting
 FATAL:  could not get shared memory segment: Invalid argument

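Raise shmmax far above the requirement (the command below sets 2000000000 bytes, about 2 GB).
On 9.5 the memory actually needed at startup is small; running the same test on 9.2 or earlier needs a lot.

sysctl -w kernel.shmmax=2000000000 # unit: bytes
sysctl -w kernel.shmall=2000000000  # unit: pages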
#
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d4fe9 360448     digoal     600        56         5                       
0x6b8b4567 393217     digoal     600        2396       5

PostgreSQL 9.2 with shared_buffers = 16GB has to request a large amount of memory at startup.

ipcs 
#
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d53d1 1114112    digoal     600        17605672960 5
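The 9.2 segment above is larger than shared_buffers itself, and the difference is the bookkeeping overhead that kernel.shmmax has to cover. The following sketch works out that overhead from the measured figure; it assumes that 16GB means 16 * 1024^3 bytes and that the block size is the usual 8192 bytes, neither of which is stated explicitly in the test. The function name is hypothetical.

```python
def shared_memory_overhead(segment_bytes, shared_buffers_bytes, block_size=8192):
    """Return (total overhead in bytes, overhead per buffer in bytes)."""
    overhead = segment_bytes - shared_buffers_bytes
    buffers = shared_buffers_bytes // block_size
    return overhead, overhead / buffers


if __name__ == "__main__":
    total, per_buffer = shared_memory_overhead(17605672960, 16 * 1024**3)
    print(f"overhead: {total} bytes, about {per_buffer:.1f} bytes per buffer")
```

Under those assumptions the overhead is roughly 400 MiB, so setting shmmax exactly equal to shared_buffers would not be enough on 9.2.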

ulimit settings

ulimit settings mainly restrict how much of each system resource a process may use.

       ulimit [-HSTabcdefilmnpqrstuvx [limit]]
              Provides  control over the resources available to the shell and to processes started by it, on systems that allow such control.  The -H and -S options specify that the hard or soft limit is set for the given resource.
              A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit.  If neither -H nor -S is specified, both the soft and  hard  limits  are  set.   The value  of limit can be a number in the unit specified for the resource or one of the special values hard, soft, or unlimited, which stand for the current hard limit, the current soft limit, and no limit, respectively.
              If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given.  When more than one resource is specified, the limit name and unit  are  printed  before  the  value.
              Other options are interpreted as follows:
              -a     All current limits are reported
              -b     The maximum socket buffer size
              -c     The maximum size of core files created
              -d     The maximum size of a process's data segment
              -e     The maximum scheduling priority ("nice")
              -f     The maximum size of files written by the shell and its children
              -i     The maximum number of pending signals
              -l     The maximum size that may be locked into memory
              -m     The maximum resident set size (many systems do not honor this limit)
              -n     The maximum number of open file descriptors (most systems do not allow this value to be set)
              -p     The pipe size in 512-byte blocks (this may not be set)
              -q     The maximum number of bytes in POSIX message queues
              -r     The maximum real-time scheduling priority
              -s     The maximum stack size
              -t     The maximum amount of cpu time in seconds
              -u     The maximum number of processes available to a single user
              -v     The maximum amount of virtual memory available to the shell and, on some systems, to its children
              -x     The maximum number of file locks
              -T     The maximum number of threads

              If  limit  is  given, it is the new value of the specified resource (the -a option is display only).  If no option is given, then -f is assumed.  Values are in 1024-byte increments, except for -t, which is in seconds,
              -p, which is in units of 512-byte blocks, and -T, -b, -n, and -u, which are unscaled values.  The return status is 0 unless an invalid option or argument is supplied, or an error occurs while setting a new limit.

Example configuration file
/etc/security/limits.conf

#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#                 for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file
* soft    nofile  655360
* hard    nofile  655360
* soft    nproc   655360
* hard    nproc   655360
* soft    memlock unlimited
* hard    memlock unlimited
* soft    core    unlimited
* hard    core    unlimited

View the limits of a process
# cat /proc/$PID/limits

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             655360               655360               processes 
Max open files            655360               655360               files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       513997               513997               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

View the current user's limit settings

# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 513997
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655360
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 655360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Recommended settings for PostgreSQL

* soft    nofile  655360    # The maximum number of open file descriptors
* hard    nofile  655360  
* soft    nproc   655360    # The maximum number of processes available to a single user
* hard    nproc   655360
* soft    memlock unlimited  # The maximum size that may be locked into memory
* hard    memlock unlimited
* soft    core    unlimited  # The maximum size of core files created
* hard    core    unlimited
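Whether these limits actually reached a given process can also be checked from inside it. Below is a minimal sketch using Python's standard `resource` module; the `RECOMMENDED` table simply mirrors the values listed above, and the helper names `meets` and `check_limits` are hypothetical.

```python
import resource

# Mirrors the recommended limits.conf entries above.
RECOMMENDED = {
    "nofile": (resource.RLIMIT_NOFILE, 655360),
    "nproc": (resource.RLIMIT_NPROC, 655360),
    "memlock": (resource.RLIMIT_MEMLOCK, resource.RLIM_INFINITY),
    "core": (resource.RLIMIT_CORE, resource.RLIM_INFINITY),
}


def meets(limit, wanted):
    """True if `limit` is at least `wanted`, treating RLIM_INFINITY as unlimited."""
    if limit == resource.RLIM_INFINITY:
        return True
    if wanted == resource.RLIM_INFINITY:
        return False
    return limit >= wanted


def check_limits():
    """Return {item: (soft, hard, soft_limit_ok)} for the current process."""
    report = {}
    for name, (res, wanted) in RECOMMENDED.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (soft, hard, meets(soft, wanted))
    return report


if __name__ == "__main__":
    for name, (soft, hard, ok) in check_limits().items():
        print(f"{name:8} soft={soft} hard={hard} ok={ok}")
```

Run it as the database's operating system user, since limits.conf entries are applied per login session.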

Kernel settings related to core dumps

kernel.core_pattern = /xxx/xxx/core_%e_%u_%t_%s.%p
kernel.core_uses_pid = 1
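In the pattern above, the kernel substitutes %e with the executable name, %u with the user ID, %t with the dump time in seconds since the epoch, %s with the signal number and %p with the process ID. The sketch below imitates that substitution to preview the resulting file name; it is an illustration that handles only these five specifiers plus `%%`, and the example values passed to it are made up.

```python
import re


def expand_core_pattern(pattern, **values):
    """Expand %e, %u, %t, %s, %p and %% in a core_pattern-style template.

    Specifiers without a supplied value are dropped from the result.
    """
    def substitute(match):
        key = match.group(1)
        if key == "%":
            return "%"
        return str(values.get(key, ""))

    return re.sub(r"%(.)", substitute, pattern)


if __name__ == "__main__":
    # Hypothetical values: postgres crashing with signal 11 as uid 1000.
    print(expand_core_pattern("/xxx/xxx/core_%e_%u_%t_%s.%p",
                              e="postgres", u=1000, t=1460000000, s=11, p=2492))
```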

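OOM score adj settings

PostgreSQL's supervisor process (the postmaster) runs as postgres. If it dies, the database is down. If any other process dies, the postmaster handles crash recovery and restarts the database automatically (restart_after_crash = on by default).

So to keep the OOM killer from killing the postgres main process, set oom_score_adj for the launching script's own process (self) as root before starting the database, and then start the database.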

echo -1000 > /proc/self/oom_score_adj

or, on older kernels that only offer the legacy oom_adj interface (where the disable value is -17)

echo -17 > /proc/self/oom_adj

Example

# echo -1000 > /proc/self/oom_score_adj

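These two environment variables must be set in the environment that starts the server
# export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj  # file each child process writes to in order to reset its own score
# export PG_OOM_ADJUST_VALUE=0  # oom_score_adj value for the child processes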

# su - digoal -c "export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj;export PG_OOM_ADJUST_VALUE=0;. ~/env.sh; pg_ctl start"

# ps -efw|grep digoal
digoal    2492     1  9 23:22 ?        00:00:00 /home/digoal/pgsql9.5/bin/postgres
digoal    2493  2492  0 23:22 ?        00:00:00 postgres: logger process   
digoal    2495  2492  0 23:22 ?        00:00:00 postgres: checkpointer process   
digoal    2496  2492  0 23:22 ?        00:00:00 postgres: writer process   
digoal    2497  2492  0 23:22 ?        00:00:00 postgres: wal writer process   
digoal    2498  2492  0 23:22 ?        00:00:00 postgres: autovacuum launcher process  
digoal    2499  2492  0 23:22 ?        00:00:00 postgres: stats collector process 

# cat /proc/2492/oom_score_adj 
-1000
# cat /proc/2493/oom_score_adj 
0
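The manual `cat` check above can be repeated for every child process at once. Here is a minimal sketch that reads oom_score_adj from the proc filesystem and confirms that only the postmaster is exempt from the OOM killer; the function names and the `proc_root` parameter are hypothetical, and the PIDs in the usage example are the ones from the `ps` output above.

```python
from pathlib import Path


def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the oom_score_adj value of a process as an integer."""
    return int(Path(proc_root, str(pid), "oom_score_adj").read_text().strip())


def postmaster_protected(postmaster_pid, child_pids, proc_root="/proc",
                         child_value=0):
    """True if the postmaster is at -1000 and every child is at child_value."""
    if read_oom_score_adj(postmaster_pid, proc_root) != -1000:
        return False
    return all(read_oom_score_adj(pid, proc_root) == child_value
               for pid in child_pids)


if __name__ == "__main__":
    try:
        print(postmaster_protected(2492, [2493, 2495, 2496, 2497, 2498, 2499]))
    except FileNotFoundError:
        print("those PIDs do not exist on this machine")
```

`child_value` should match whatever PG_OOM_ADJUST_VALUE was exported when the server was started.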

References

https://www.postgresql.org/docs/9.5/static/kernel-resources.html#SYSVIPC
https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
man shm_open, shmget, mmap, semctl, sem_overview

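Summary

This article is meant to help readers understand what operating system resources PostgreSQL needs and how to calculate them.

If several database clusters run on one system, the requirements of all clusters must be added together.

PostgreSQL 9.2 and earlier need a large SysV shared memory segment at startup, so the limits have to be set fairly high. Keep this in mind.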

