PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意-阿里云开发者社区

背景

当内存很大时，除了刷脏页的调度可能需要优化，还有一方面是虚拟内存与物理内存映射表相关的部分需要优化。

1 脏页调度优化

1、主要包括，调整后台进程刷脏页的阈值、唤醒间隔、以及老化阈值。（脏页大于多少时开始刷、多久探测一次有多少脏页、刷时多老的脏页刷出。）。

vm.dirty_background_bytes = 4096000000    
vm.dirty_background_ratio = 0    
vm.dirty_expire_centisecs = 6000    
vm.dirty_writeback_centisecs = 100

2、用户进程刷脏页调度，当脏页大于多少时，用户如果要申请内存，需要协助刷脏页。

vm.dirty_bytes = 0    
vm.dirty_ratio = 80

《DBA不可不知的操作系统内核参数》

2 内存表映射优化

这部分主要是因为虚拟内存管理，Linux需要维护虚拟内存地址与物理内存的映射关系，为了提升转换性能，最好这部分能够cache在cpu的cache里面。页越大，映射表就越小。使用huge page可以减少页表大小。

默认页大小可以这样获取，

# getconf PAGESIZE    
4096

https://en.wikipedia.org/wiki/Page_table

另一个使用HUGE PAGE的原因，HUGE PAGE是常驻内存的，不会被交换出去，这也是重度依赖内存的应用（包括数据库）非常喜欢的。

In a virtual memory system, the tables store the mappings between virtual addresses and physical addresses. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), which is the cache of page tables on a CPU that speeds up the translation of virtual addresses to physical addresses. Enabling the HugePages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. The hugetbl entries mean that the TLB entries can cover a larger address space, requiring many fewer entries to map the SGA, and releasing entries that can map other portions of the address space.

With HugePages enabled, the system uses fewer page tables, reducing the overhead for maintaining and accessing them. Huges pages remain pinned in memory and are not replaced, so the kernel swap daemon has no work to do in managing them, and the kernel does not need to perform page table lookups for them. The smaller number of pages reduces the overhead involved in performing memory operations, and also reduces the likelihood of a bottleneck when accessing page tables.

PostgreSQL HugePage使用建议

1、查看Linux huage page页大小

# grep Hugepage /proc/meminfo     
Hugepagesize:       2048 kB

2、准备设置多大的shared buffer参数，假设我们的内存有512GB，想设置128GB的SHARED BUFFER。

vi postgresql.conf    
    
shared_buffers='128GB'

3、计算需要多少huge page

128GB/2MB=65535

4、设置Linux huge page页数

sysctl -w vm.nr_hugepages=67537

5、设置使用huge page。

vi $PGDATA/postgresql.conf    
    
huge_pages = on                 # on, off, or try    
    
# 设置为try的话，会先尝试huge page，如果启动时无法锁定给定数目的大页，则不会使用huge page

6、启动数据库

pg_ctl start

7、查看当前使用了多少huge page

cat /proc/meminfo |grep -i huge    
    
AnonHugePages:      6144 kB    
HugePages_Total:   67537  ## 设置的HUGE PAGE    
HugePages_Free:    66117  ## 这个是当前剩余的，但是实际上真正可用的并没有这么多，因为被PG锁定了65708个大页    
HugePages_Rsvd:    65708  ## 启动PG时申请的HUGE PAGE    
HugePages_Surp:        0    
Hugepagesize:       2048 kB   ## 当前大页2M

8、执行一些查询，可以看到Free会变小。被PG使用掉了。

cat /proc/meminfo |grep -i huge    
AnonHugePages:      6144 kB    
HugePages_Total:   67537    
HugePages_Free:    57482    
HugePages_Rsvd:    57073    
HugePages_Surp:        0    
Hugepagesize:       2048 kB

Oracle HugePage使用建议

Oracle也是重度内存使用应用，当SGA配置较大时，同样建议设置HUGEPAGE。

Oracle 建议当SGA大于或等于8GB时，使用huge page。

10.1 About HugePages

The HugePages feature enables the Linux kernel to manage large pages of memory in addition to the standard 4KB (on x86 and x86_64) or 16KB (on IA64) page size. If you have a system with more than 16GB of memory running Oracle databases with a total System Global Area (SGA) larger than 8GB, you should enable the HugePages feature to improve database performance.

Note
The Automatic Memory Management (AMM) and HugePages features are not compatible in Oracle Database 11g and later. You must disable AMM to be able to use HugePages.

The memory allocated to huge pages is pinned to primary storage, and is never paged nor swapped to secondary storage. You reserve memory for huge pages during system startup, and this memory remains allocated until you change the configuration.

Huge pages are 4MB in size on x86, 2MB on x86_64, and 256MB on IA64.

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html

测试对比是否使用HugePage

设计test case

创建10240个表，使用merge insert写入200亿数据。

1、建表

do language plpgsql $$    
declare    
begin    
  execute 'drop table if exists test';    
  execute 'create table test(id int8 primary key, info text, crt_time timestamp)';    
      
  for i in 0..10239 loop    
    execute format('drop table if exists test%s', i);    
    execute format('create table test%s (like test including all)', i);    
  end loop;    
end;    
$$;

2、创建动态写入函数，第一种不使用绑定变量

create or replace function dyn_pre(int8) returns void as $$    
declare    
  suffix int8 := mod($1,10240);    
begin    
  execute format('insert into test%s values(%s, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, $1);    
end;    
$$ language plpgsql strict;

3、使用绑定变量，性能更好。

create or replace function dyn_pre(int8) returns void as $$    
declare    
  suffix int8 := mod($1,10240);    
begin    
  execute format('execute p%s(%s)', suffix, $1);    
  exception when others then    
    execute format('prepare p%s(int8) as insert into test%s values($1, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix);    
    execute format('execute p%s(%s)', suffix, $1);    
end;    
$$ language plpgsql strict;

4、创建压测脚本

vi test.sql    
    
\set id random(1,20000000000)    
select dyn_pre(:id);

5、写入性能压测

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 1200000

6、多长连接压测，PAGE TABLE观察

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000    
    
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000

1 使用HugePage

1、小量连接写入性能

transaction type: ./test.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 17122345    
latency average = 0.392 ms    
latency stddev = 0.251 ms    
tps = 142657.055512 (including connections establishing)    
tps = 142687.784245 (excluding connections establishing)    
script statistics:    
 - statement latencies in milliseconds:    
         0.002  \set id random(1,20000000000)    
         0.390  select dyn_pre(:id);

2、1900个长连接，PAGE TABLE大小（由于是虚拟、物理内存映射关系表。所以耗费取决于连接数，以及每个连接相关联的SHARED BUFFER以及会话自己的relcache, SYSCACHE）

cat /proc/meminfo |grep -i table    
Unevictable:           0 kB    
PageTables:       578612 kB  ## shared buffer使用了huge page，这块省了很多。    
NFS_Unstable:          0 kB

2 未使用HugePage

sysctl -w vm.nr_hugepages=0

1、小量连接的写入性能

transaction type: ./test.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 18484181    
latency average = 0.364 ms    
latency stddev = 0.212 ms    
tps = 153887.936028 (including connections establishing)    
tps = 153905.968799 (excluding connections establishing)    
script statistics:    
 - statement latencies in milliseconds:    
         0.002  \set id random(1,20000000000)    
         0.362  select dyn_pre(:id);

小量连接未使用HUGE PAGE性能比使用huge page更好，猜测可能是huge page使用了类似两级转换(因为2MB为单个目标的映射，并不能精准定位到默认8K的数据页的物理内存位置。可能和数据库的索引bitmap scan道理类似，bitmap scan告诉你数据在哪个PAGE内，而不是直接告诉你数据在哪个PAGE的第几条记录上。)，导致了一定的损耗。

cat /proc/meminfo |grep -i table    
Unevictable:           0 kB    
PageTables:     10956556 kB  ## 不一会就增长到了10GB，因为每个连接都在TOUCH shared buffer内的数据，可能导致映射表很大。连接越多。TOUCH shared buffer内数据越多越明显    
# PageTables 还在不断增长    
    
NFS_Unstable:          0 kB

CentOS 7u 配置大页例子

1、修改/boot/grub2/grub.cfg

定位到第一个menuentry 'CentOS Linux'，在linux16 /vmlinuz最后面添加如下：

说明（关闭透明大页，使用默认的2MB大页，你也可以选择用1G的大页，但是在此之前应该先到系统中判断支持哪些大页规格. 查看/proc/cpuinfo里面的FLAG Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag). ，设置启动时创建1536个大页（这部分内存会被保留，所以一定要注意设置合适的大小，建议在LINUX启动后通过sysctl来设置）。）

numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536

transparent_hugepage=never表示关闭透明大页，以免不必要的麻烦。透明大页这个特性应该还不太成熟。

hugepagesz 表示页面大小，2M和1G选其一，默认为2M。

hugepages 表示大页面数

总共大页面内存量为hugepagesz * hugepages，这里为3G

例子：

menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-d8179b22-8b44-4552-bf2a-04bae2a5f5dd' {  
        load_video  
        set gfxpayload=keep  
        insmod gzio  
        insmod part_msdos  
        insmod xfs  
        set root='hd0,msdos1'  
        if [ x$feature_platform_search_hint = xy ]; then  
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  34f87a8d-8b73-4f80-b0ff-8d49b17975ca  
        else  
          search --no-floppy --fs-uuid --set=root 34f87a8d-8b73-4f80-b0ff-8d49b17975ca  
        fi  
        linux16 /vmlinuz-3.10.0-693.5.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/root rhgb quiet LANG=en_US.UTF-8 numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536  
        initrd16 /initramfs-3.10.0-693.5.2.el7.x86_64.img  
}

重启系统（如果你不想重启系统而使用HUGE PAGE，使用这种方法即可sysctl -w vm.nr_hugepages=1536 ）

但是修改默认的大页规格(2M or 1G)则一定要重启，例如：

numa=off transparent_hugepage=never default_hugepagesz=1G hugepagesz=2M hugepagesz=1G  
  
重启后就会变这样  
  
cat /proc/meminfo |grep -i huge  
AnonHugePages:         0 kB  
HugePages_Total:       0  
HugePages_Free:        0  
HugePages_Rsvd:        0  
HugePages_Surp:        0  
Hugepagesize:    1048576 kB

申请132GB大页

sysctl -w vm.nr_hugepages=132  
vm.nr_hugepages = 132

重启后可以使用grep Huge /proc/meminfo查看配置情况。看到下面的数据表示已经生效

HugePages_Total:    1536  
HugePages_Free:     1499  
HugePages_Rsvd:     1024  
HugePages_Surp:        0  
Hugepagesize:       2048 kB

数据库配置（如果你想好了非大页不可，就设置huge_pages为on，否则设置为try。on的情况下如果HUGE PAGE不够，则启动会报错。TRY的话，大页不够就尝试申请普通页的方式启动。）

postgresql.conf  
  
huge_pages = on  
shared_buffers = 2GB  # 使用2G内存，这个值需要小于总共大页面内存量

注意

如果postgresql.conf配置huge_pages=on时，且shared_buffers值等于huge_page总内存量（hugepagesz*hugepages）时，数据库无法启动，报如下错误：

This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space.

解决办法shared_buffers值要小于huge_page总内存量

libhugetlbfs

安装libhugetlbfs可以观察大页的统计信息，分配大页文件系统，例如你想把数据放到内存中持久化测试。

https://lwn.net/Articles/376606/

yum install -y libhugetlbfs*

小结

1、查看、修改Linux目前支持的大页大小。

https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize

2、如果连接数较少时，使用HUGE PAGE性能不如不使用（猜测可能是huge page使用了类似两级转换，导致了一定的损耗。）。因此我们可以尽量使用连接池，减少连接数，提升性能。

3、能使用连接池的话，尽量使用连接池，减少连接到数据库的连接数。因为PG与Oracle一样是进程模型，连接越多则进程越多，大内存需要注意一些问题：

3.1 上下文切换，MEM COPY的开销。

3.2 PAGE TABLE增大，内存使用增加。

PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.

3.3 每个会话要缓存syscache, relcache等信息，如果访问的对象很多，会导致内存使用爆增。(这个属于逻辑层面内存使用放大, 如果访问对象不多或者访问过好多对象的长连接不多的话，问题不明显)

《PostgreSQL relcache在长连接应用中的内存霸占"坑"》

这个很容易模拟，使用本例的压测CASE，增加表的数目，增加表的字段数，每个连接的relcache就会增加。

4、如果不能使用连接池，连接数非常多，并且都是长连接(访问较多的对象、shared buffer中的数据时)。那么当shared buffer非常大时，需要考虑使用huge page。这样page tables会比较小。如果无法使用HugePage，那么建议设置较小的shared_buffer。

5、进程自己的内存使用,PAGE TABLE不会有放大效果，因为只是自己使用。所以work_mem, maintenance_work_mem的使用，不大会引起PAGE TABLE过大的问题。

通过观察/proc/meminfo来查看PageTable的占用，判断是否需要启用大页或降低shared_buffer或者连接数。