如何度量Kernel Resources for PostgreSQL

本文涉及的产品
PolarDB Agent Express,2核4GB
云数据库 PolarDB MySQL 版,列存表分析加速 8核16GB
PolarDB Agent Flow,2核4GB
简介: 背景 对于操作系统来说,数据库算是比较大型的应用,往往需要耗费大量的系统资源,特别是在内部进程间通信这块的资源。 操作系统默认的配置可能无法满足数据库对资源使用的需求。 那么应该如何根据数据库的需要,设置操作系统相关资源参数呢? PostgreSQL 对系统资源的需求计算 在讲

背景

对于操作系统来说,数据库算是比较大型的应用,往往需要耗费大量的系统资源,特别是在内部进程间通信这块的资源。

操作系统默认的配置可能无法满足数据库对资源使用的需求。

那么应该如何根据数据库的需要,设置操作系统相关资源参数呢?

PostgreSQL 对系统资源的需求计算

在讲资源分配前,大家可以参考阅读一下
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#SYSVIPC

https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

kernel-doc-xxx/Documentation/sysctl/kernel.txt

https://en.wikipedia.org/wiki/UNIX_System_V

这一篇主要讲的是进程间通信
https://docs.oracle.com/cd/E19455-01/806-4750/6jdqdflta/index.html

共享内存和信号量

关于PostgreSQL的共享内存管理,早期的 PostgreSQL 版本共享内存分配只支持sysv的方式,数据库启动时,需要分配一块共享内存段,这个需求主要与配置多大的shared_buffers 有关。

社区在9.3的版本做了一个变动,使用mmap分配共享内存,大幅降低了系统对System V共享内存的需求量。
https://www.postgresql.org/docs/9.3/static/release-9-3.html

但是mmap并不好,会带来额外的IO开销,所以PostgreSQL 9.4开始,又做了一个变动,支持动态分配共享内存,主要是为多核并行计算做的铺垫,而且默认的共享内存分配的方法变成了posix(如果环境支持),同样不需要启动时分配大的共享内存段 。
https://www.postgresql.org/docs/9.4/static/release-9-4.html

从9.4开始,共享内存分配方法通过参数 dynamic_shared_memory_type 控制。

不同的值,创建共享内存段的方法也不一样,例如posix使用shm_open,sysv使用shmget,mmap使用mmap。

正是由于创建共享内存段的方法不一样,所以需要配置的操作系统内核参数也不一样。

dynamic_shared_memory_type (enum) 

Specifies the dynamic shared memory implementation that the server should use. 

Possible values are :  (支持如下)
posix (for POSIX shared memory allocated using shm_open), 
sysv (for System V shared memory allocated via shmget), 
windows (for Windows shared memory), 
mmap (to simulate shared memory using memory-mapped files stored in the data directory), 
none (to disable this feature). 

Not all values are supported on all platforms;  

the first supported option is the default for that platform.   
如果不强制指定,默认使用第一种支持的方式。  

The use of the mmap option, which is not the default on any platform, is generally discouraged because the operating system may write modified pages back to disk repeatedly, increasing system I/O load;
however, it may be useful for debugging, when the pg_dynshmem directory is stored on a RAM disk, or when other shared memory facilities are not available.  
建议不要使用mmap,除非你想调试PG的共享内存。  

不同的共享内存分配方法,对操作系统的内核参数配置要求也不一样。

涉及的资源以及计算方法如下

| Name | Description | Reasonable values |
| ---- | ---- | ---- |
| SHMMAX | 单个共享内存段最大允许多大 (bytes) | 见另一张表,或者直接设置为内存的80% |
| SHMMIN | 单个共享内存段最小允许多小 (bytes) | 1 |
| SHMALL | 整个系统允许分配多少共享内存,(所有共享内存段相加) (bytes or pages) | 需考虑其他需要分配共享内存的应用,确保大于所有应用的需求量,通常可以设置为实际内存大小 |
| SHMSEG | 每个进程允许分配多少个共享内存段 | only 1 segment is needed, but the default is much higher, 所以不需要设置 |
| SHMMNI | 整个系统允许分配多少个共享内存段 | 需要分配共享内存的进程数 * SHMSEG |
| SEMMNI | 允许分配多少组信号量ID (i.e., sets) | at least ceil((max_connections + autovacuum_max_workers + 5) / 16) ,PostgreSQL每16个进程一组 |
| SEMMNS | 允许分配多少个信号量 | ceil((max_connections + autovacuum_max_workers + 5) / 16) 17 plus room for other applications,每组信号量需要17字节,加上其他软件的需求。实际设置时设置为SEMMNISEMMSL |
| SEMMSL | 每组允许开多少信号量 | at least 17 |
| SEMMAP | Number of entries in semaphore map | see text |
| SEMVMX | Maximum value of semaphore | at least 1000 (The default is often 32767; do not change unless necessary) |

共享内存段 shmmax 计算方法

因此对于9.2以及更低版本的共享内存SysV管理的情况,shmmax的需求计算方法如下,将所有项相加。

Usage Approximate shared memory bytes required (as of 8.3)
Connections (1800 + 270 max_locks_per_transaction) max_connections
Autovacuum workers (1800 + 270 max_locks_per_transaction) autovacuum_max_workers
Prepared transactions (770 + 270 max_locks_per_transaction) max_prepared_transactions
Shared disk buffers (block_size + 208) * shared_buffers
WAL buffers (wal_block_size + 8) * wal_buffers
Fixed space requirements 770 kB
  • 共享内存 SysV 管理 (适用于 >= 9.3 的版本)
    对于9.3以及更高版本的PostgreSQL, 即使使用SysV,也不需要这么多共享内存。后面会有实测。

通常需要4KB左右。

  • 共享内存 posix, mmap, none 管理
    一个PostgreSQL集群只需要56字节(实测)的共享内存段大小
  PostgreSQL requires a few bytes of System V shared memory (typically 48 bytes, on 64-bit platforms) for each copy of the server.   
  On most modern operating systems, this amount can easily be allocated.   

  However, if you are running many copies of the server, or if other applications are also using System V shared memory :   
  it may be necessary to increase SHMMAX, the maximum size in bytes of a shared memory segment,
  SHMALL, the total amount of System V shared memory system-wide.   
  Note that SHMALL is measured in pages rather than bytes on many systems.  

小结
9.3 以下版本,设置这3个内核参数
(9.3 以及以上版本,需要的shmmax没那么大,所以也可以使用以上设置。 )

kernel.shmall = 实际内存大小 (如果单位为page, bytes/PAGE_SIZE)   
kernel.shmmax >= shared_buffer (bytes)   
kernel.shmmni >= 实际数据库集群数*2(>=9.4版本,使用SysV时每个PostgreSQL数据库集群需要2个共享内存段)  

如果一台服务器中要启动多个PostgreSQL集群,则每个集群都需要

shmmin和shmseg不需要设置,从shmget的开发者手册也可以得到证实

shmmin = 1  # 1字节,但实际最小是1 PAGE(4KB)  
shmseg = unlimited    

系统页大小(未使用huge page时)

# getconf PAGE_SIZE
4096

man shmget

       The following limits on shared memory segment resources affect the shmget() call:

       SHMALL System wide maximum of shared memory pages (on Linux, this limit can be read and modified via /proc/sys/kernel/shmall).

       SHMMAX Maximum size in bytes for a shared memory segment: policy dependent (on Linux, this limit can be read and modified via /proc/sys/kernel/shmmax).

       SHMMIN Minimum size in bytes for a shared memory segment: implementation dependent (currently 1 byte, though PAGE_SIZE is the effective minimum size).

       SHMMNI System wide maximum number of shared memory segments: implementation dependent (currently 4096, was 128 before Linux 2.3.99; on Linux, this limit can be read and modified via /proc/sys/kernel/shmmni). 

       The implementation has no specific limits for the per-process maximum number of shared memory segments (SHMSEG).

信号量 计算方法

信号量的需求,和数据库版本无关,计算方法如下。

  • 需要多少组
    SEMMNI >= (max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16
  • 需要多少信号量
    SEMMNS >= ((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16) * 17 + 其他程序的需求
  • 每组需要多少信号量
    SEMMSL >= 17

对应系统内核配置举例

# sysctl -w kernel.sem="1234      150994944       512000  7890"

含义分别为

max number of arrays = 7890  对应 semmni  
max semaphores per array = 1234  对应 semmsl  
max semaphores system wide = 150994944  对应 semmns  = semmni*semmsl
max ops per semop call = 512000  

如何查看当前系统设置的SysV资源限制

# ipcs -l

------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 19531250
max total shared memory (kbytes) = 4096000000
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 7890
max semaphores per array = 1234
max semaphores system wide = 150994944
max ops per semop call = 512000
semaphore max value = 32767

如何查看已使用的SysV资源

# ipcs -u

------ Messages Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes

------ Shared Memory Status --------
segments allocated 2
pages allocated 2
pages resident  2
pages swapped   0
Swap performance: 0 attempts     0 successes

------ Semaphore Status --------
used arrays = 8
allocated semaphores = 136

实测

  • shmmax与信号量实测
  • 共享内存管理方法 posix, mmap, none 实测 shmmax 需求 如下
sysctl -w kernel.shmmax=1024 
# 
postgresql.conf
shared_buffer=16GB 

启动数据库, 查看IPC

$ ipcs
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d4fe9 294912     digoal     600        56         5   
# 
如果sysctl -w kernel.shmmax=48,启动报错,需要56字节    
FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=1921001, size=56, 03600).  
HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter, or possibly that it is less than your kernel's SHMMIN parameter.
        The PostgreSQL documentation contains more information about shared memory configuration.
  • 共享内存管理方法 sysv 实测 shmmax 需求 如下
postgresql.conf
dynamic_shared_memory_type = sysv

如果设置低于数据库的需求,会报错

sysctl -w kernel.shmmax=1024

报错

 pg_ctl start
server starting
 FATAL:  could not get shared memory segment: Invalid argument

把shm加到到20GB,
9.5的版本,启动时实际需要的内存并不多,如果你在9.2或者更低版本测试,那会需要很多

sysctl -w kernel.shmmax=2000000000 # 单位byte
sysctl -w kernel.shmall=2000000000  # 单位page
#
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d4fe9 360448     digoal     600        56         5                       
0x6b8b4567 393217     digoal     600        2396       5                       

PostgreSQL 9.2 shared_buffer=16GB , 启动时需要申请大量的内存.

ipcs 
#
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x001d53d1 1114112    digoal     600        17605672960 5   

ulimit 设置

ulimit 设置主要是限制进程级别对系统资源的使用。

       ulimit [-HSTabcdefilmnpqrstuvx [limit]]
              Provides  control over the resources available to the shell and to processes started by it, on systems that allow such control.  The -H and -S options specify that the hard or soft limit is set for the given resource.
              A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit.  If neither -H nor -S is specified, both the soft and  hard  limits  are  set.   The value  of limit can be a number in the unit specified for the resource or one of the special values hard, soft, or unlimited, which stand for the current hard limit, the current soft limit, and no limit, respectively.
              If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given.  When more than one resource is specified, the limit name and unit  are  printed  before  the  value.
              Other options are interpreted as follows:
              -a     All current limits are reported
              -b     The maximum socket buffer size
              -c     The maximum size of core files created
              -d     The maximum size of a process's data segment
              -e     The maximum scheduling priority ("nice")
              -f     The maximum size of files written by the shell and its children
              -i     The maximum number of pending signals
              -l     The maximum size that may be locked into memory
              -m     The maximum resident set size (many systems do not honor this limit)
              -n     The maximum number of open file descriptors (most systems do not allow this value to be set)
              -p     The pipe size in 512-byte blocks (this may not be set)
              -q     The maximum number of bytes in POSIX message queues
              -r     The maximum real-time scheduling priority
              -s     The maximum stack size
              -t     The maximum amount of cpu time in seconds
              -u     The maximum number of processes available to a single user
              -v     The maximum amount of virtual memory available to the shell and, on some systems, to its children
              -x     The maximum number of file locks
              -T     The maximum number of threads

              If  limit  is  given, it is the new value of the specified resource (the -a option is display only).  If no option is given, then -f is assumed.  Values are in 1024-byte increments, except for -t, which is in seconds,
              -p, which is in units of 512-byte blocks, and -T, -b, -n, and -u, which are unscaled values.  The return status is 0 unless an invalid option or argument is supplied, or an error occurs while setting a new limit.

配置文件举例
/etc/security/limits.conf

#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#                 for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file
* soft    nofile  655360
* hard    nofile  655360
* soft    nproc   655360
* hard    nproc   655360
* soft    memlock unlimited
* hard    memlock unlimited
* soft    core    unlimited
* hard    core    unlimited

查看进程设置
# cat /proc/$PID/limits

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             655360               655360               processes 
Max open files            655360               655360               files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       513997               513997               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

查看当前用户的limit配置

# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 513997
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655360
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 655360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

PostgreSQL 推荐设置

* soft    nofile  655360    # The maximum number of open file descriptors
* hard    nofile  655360  
* soft    nproc   655360    # The maximum number of processes available to a single user
* hard    nproc   655360
* soft    memlock unlimited  # The maximum size that may be locked into memory
* hard    memlock unlimited
* soft    core    unlimited  # The maximum size of core files created
* hard    core    unlimited

core dump 相关内核设置

kernel.core_pattern = /xxx/xxx/core_%e_%u_%t_%s.%p
kernel.core_uses_pid = 1

OOM score adj 设置

PostgreSQL 的守护进程是postgres,如果它挂了,数据库就挂了,其他进程挂了它会负责crash recovery,自动重启数据库(默认设置了 restart_after_crash = on )

所以如果要防止系统OOM时杀掉postgres主进程,需要在启动数据库前,使用root用户设置self脚本进程的oom_score_adj,然后启动数据库。

echo -1000 > /proc/self/oom_score_adj

or 

echo -17 > /proc/self/oom_score_adj

例子

# echo -1000 > /proc/self/oom_score_adj

启动是需要在启动环境中设置这两个环境变量
# export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj  # 设置postgres主进程oom_score_adj
# export PG_OOM_ADJUST_VALUE=0  # 设置子进程oom_score_adj

# su - digoal -c "export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj;export PG_OOM_ADJUST_VALUE=0;. ~/env.sh; pg_ctl start"

# ps -efw|grep digoal
digoal    2492     1  9 23:22 ?        00:00:00 /home/digoal/pgsql9.5/bin/postgres
digoal    2493  2492  0 23:22 ?        00:00:00 postgres: logger process   
digoal    2495  2492  0 23:22 ?        00:00:00 postgres: checkpointer process   
digoal    2496  2492  0 23:22 ?        00:00:00 postgres: writer process   
digoal    2497  2492  0 23:22 ?        00:00:00 postgres: wal writer process   
digoal    2498  2492  0 23:22 ?        00:00:00 postgres: autovacuum launcher process  
digoal    2499  2492  0 23:22 ?        00:00:00 postgres: stats collector process 

# cat /proc/2492/oom_score_adj 
-1000
# cat /proc/2493/oom_score_adj 
0

参考

小结

本文主要帮助大家理解PostgreSQL数据库对操作系统资源的需求,以及计算方法。

如果用户需要在一个系统中运行多个数据库集群,则需要将所有集群的需求加起来。

PostgreSQL 9.2以及以前的版本,在数据库启动时对SysV共享内存段的需求很大,所以要设得比较大,需要用户注意。

祝大家玩得开心,欢迎随时来 阿里云促膝长谈业务需求 ,恭候光临。

阿里云的小伙伴们加油,努力 做好内核与服务,打造最贴地气的云数据库

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍如何基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
Kubernetes 前端开发 Docker
使用GitLab CI/CD部署应用到Kubernetes集群的方案
使用GitLab CI/CD部署应用到Kubernetes集群的方案
4019 0
使用GitLab CI/CD部署应用到Kubernetes集群的方案
|
6月前
|
关系型数据库 Unix Linux
Rocky Linux下用meson编译安装PostgreSQL
本文记录在Rocky Linux 9.7上使用Meson编译PostgreSQL 18.1的完整流程,涵盖环境准备、依赖安装、快速部署命令及详细配置说明,重点介绍Meson构建系统的核心用法与常用参数,助力高效完成源码编译与定制化安装。
|
人工智能 搜索推荐 机器人
Qwen3+MCP快速解决今晚吃什么?
智能体是一种基于云计算和人工智能的自主决策系统,相比通用AI大模型,更注重场景化适配与垂直领域深耕。本文以“今天吃什么”智能体为例,依托Qwen3多模态能力和MCP动态规划框架,打造个性化饮食推荐服务。从开通百炼平台、创建智能体到设计Prompt和部署MCP服务,最终实现精准美食推荐,解决选择困难问题,将点餐变为一场充满惊喜的探险。总结来看,智能体通过云+AI技术,在垂直场景中展现高效精准的决策能力。
|
存储 运维 Linux
Linux磁盘精准缩容:操作详解与技巧
在Linux系统管理中,有效的磁盘空间优化对于维护系统性能至关重要。本文将深入探讨如何在Linux环境下安全地进行磁盘缩容,帮助你合理调整存储资源,确保系统高效运行。跟随本篇的步骤,一起优化你的Linux系统磁盘空间!
Linux磁盘精准缩容:操作详解与技巧
|
Oracle 关系型数据库 Linux
linux8安装oracle 11g遇到的问题记录
Oracle 11g在Linux 8上安装时会遇到link编译环节的问题。官方建议忽略安装中的链接错误,安装完成后应用DBPSU 11.2.0.4.240716补丁及一次性补丁33991024,再重新编译二进制文件,并配置监听器和数据库。但因11g已退出服务期,这些补丁需付费获取。网上信息显示22年1月的PSU补丁也可解决问题,找到该补丁后按常规方式打补丁即可。如有需求或疑问可咨询我。
820 20
|
SQL 监控 关系型数据库
PG数据库释放闲置连接
PG数据库释放闲置连接
1071 0
|
监控 Linux
在Linux中,如何监控磁盘I/O性能?
在Linux中,如何监控磁盘I/O性能?
|
安全 数据库 数据安全/隐私保护
撞库攻击是什么?如何有效阻止撞库攻击?
通过采取这些防护措施,可以有效降低撞库攻击的成功几率,保护用户的账户和数据安全。
1051 0
撞库攻击是什么?如何有效阻止撞库攻击?
|
存储 SQL 缓存
【MySQL】事务日志 redo log 详解
redo 日志降低了刷盘的频率,并且redo日志占用的空间非常小。(redo日志主要存储表空间ID、页号、偏移量以及需要更新的值,所需存储的空间很小,刷盘快)。Redo Log 可以简单的非为两部分组成:重做日志的缓存(redo log buffer),保存在内存中,容易丢失。重做日志文件(rodo log file),保存在硬盘中,保证持久性。Redo Log写入并不是直接写入磁盘的,Innodb引擎会在写Redo Log的时候先写redo log buffer,之后再以一定的频率刷入到真正的redo log file中。这里的一定的频率就是所谓的刷盘策略。
|
消息中间件 监控 JavaScript
用这4招 优雅的实现Spring Boot 异步线程间数据传递
用这4招 优雅的实现Spring Boot 异步线程间数据传递

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版