标签
PostgreSQL , ECS , 云盘 , 快照 , 一致性备份 , 时间点恢复 , zfs , lvm , raid , 并行计算
背景
随着阿里云云盘的发展,云盘的性能已经越来越好了。IOPS可以做到十万以上,读写吞吐也超过1GB/s了。相信随着网络的发展,SSD云盘IOPS突破40万,读写吞吐突破4GB/s也不远了。
不过这里的IOPS和吞吐是指并发的IO能力,单次IO的延迟与本地还是不能比(将来RDMA网络也许能解决这个问题)。
PostgreSQL 如何解决SSD云盘IO延迟高缺陷
某些业务对数据库的(要求响应时间很快的写小事务)对单次IO延迟比较敏感,不过PostgreSQL有方法可以解决这个小缺陷。
1、后台IO,(write syscall)
PostgreSQL 的大多数IO为后台IO(bgwriter, backend writer),所以刷shared buffer对IO延迟不敏感。
2、异步提交
事务提交时,写WAL异步提交。不会造成数据不一致,但是当数据库CRASH,可能丢失在wal buffer中未提交的事务(最多10毫秒)。
这种方法是最有效的。
3、组提交
组提交,解决WAL写瓶颈,将多个同时提交的事务的WAL fsync动作合并为单次,从而减少FSYNC次数,提高高并发时的写小事务的TPS。
PostgreSQL 如何充分利用并发IO的能力
由于我们看到的云盘IOPS和读写吞吐指标是并发指标,数据库如何利用好这么好的指标呢?对高并发小事务不是问题,肯定是能将它用起来的,但是对于低并发,长事务(分析型业务),如何利用云盘的IOPS能力和读写吞吐能力呢?
PostgreSQL的并行计算特性可以充分利用云盘的并发IOPS和读写带宽。
多云盘卷组
单块云盘的IOPS能力有上限,容量有上限,读写带宽也有上限。好在ECS支持多块云盘,目前已支持一台ECS挂载16块云盘。
通过多块云盘,组卷条带后,提高读写带宽。以Linux RHEL/CentOS 7.x为例,组卷和条带方法:
1、逻辑卷
yum install lvm2
2、软RAID
yum install -y mdadm
3、ZFS
以centos 7.3为例
wget http://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm
rpm -ivh zfs-release.el7_3.noarch.rpm
yum install -y zfs
逻辑卷例子
假设环境中有16块SSD云盘
1、创建PV
pvcreate /dev/vd[b-q]
2、创建VG
vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]
-s, --physicalextentsize PhysicalExtentSize[bBsSkKmMgGtTpPeE]
Sets the physical extent size on physical volumes of this volume group. A size suffix
(k for kilobytes up to t for terabytes) is optional, megabytes is the default if no suffix is present.
For LVM2 format, the value
must be a power of 2 of at least 1 sector (where the sector size is the largest sector size of the
PVs currently used in the VG) or, if not a power of 2, at least 128KiB. For the older LVM1 format, it must be a power
of 2 of at least 8KiB. The default is 4 MiB. Once this value has been set, it is difficult to
change it without recreating the volume group which would involve backing up and restoring data on any logical volumes.
However, if no extents need moving for the new value to apply, it can be altered using vgchange -s.
If the volume group metadata uses lvm1 format, extents can vary in size from 8KiB to 16GiB and
there is a limit of 65534 extents in each logical volume. The default of 4 MiB leads to a maximum logical volume size of
around 256GiB.
If the volume group metadata uses lvm2 format those restrictions do not apply, but having a
large number of extents will slow down the tools but have no impact on I/O performance to the logical volume. The smallest
PE is 1KiB
The 2.4 kernel has a limitation of 2TiB per block device.
3、创建LV,设置条带
16块盘,每个条带单位为8KB。
lvcreate -A y -i 16 -I 8 -l 100%VG -n lv01 vgdata01
-i|--stripes Stripes
Gives the number of stripes. This is equal to the number of physical volumes to scatter the logical volume data.
When creating a RAID 4/5/6 logical volume, the extra devices which are necessary for parity are internally accounted for.
Specifying -i 3 would cause 3 devices for striped and RAID 0 logical volumes, 4 devices for
RAID 4/5, 5 devices for RAID 6 and 6 devices for RAID 10. Alternatively, RAID 0 will stripe across 2
devices, RAID 4/5 across 3 PVs, RAID 6 across 5 PVs and RAID 10 across 4 PVs in the volume group
if the -i argument is omitted. In order to stripe across all PVs of the VG if the -i argument is omitted, set
raid_stripe_all_devices=1 in the allocation section of lvm.conf (5) or add
--config allocation/raid_stripe_all_devices=1
to the command.
Note the current maxima for stripes depend on the created RAID type. For raid10, the maximum of stripes is 32,
for raid0, it is 64, for raid4/5, it is 63 and for raid6 it is 62.
See the --nosync option to optionally avoid initial syncrhonization of RaidLVs.
Two implementations of basic striping are available in the kernel. The original device-mapper implementation
is the default and should normally be used. The alternative implementation using MD, available since ver‐
sion 1.7 of the RAID device-mapper kernel target (kernel version 4.2) is provided to facilitate the
development of new RAID features. It may be accessed with --type raid0[_meta], but is best avoided at present
because of assorted restrictions on resizing and converting such devices.
-I|--stripesize StripeSize
Gives the number of kilobytes for the granularity of the stripes.
StripeSize must be 2^n (n = 2 to 9) for metadata in LVM1 format. For metadata in LVM2 format,
the stripe size may be a larger power of 2 but must not exceed the physical extent size.
4、创建文件系统,设置条带
当数据库数据块=32K时,chunk大小(单块盘读写32KB,再写下一块)。 条带大小512KB(32KB*16)。
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=8,stripe_width=128 -b 4096 -T largefile -L lv01
或
当数据库数据块=8K时,chunk大小(单块盘读写8KB,再写下一块)。 条带大小128KB(8KB*16)。
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01
-b block-size
Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block.
If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesys‐
tem (see the -T option). If block-size is preceded by a negative sign ('-'), then mke2fs will use
heuristics to determine the appropriate block size, with the constraint that the block size will be at least block-
size bytes. This is useful for certain hardware devices which require that the blocksize be a multiple of 2k.
stride=stride-size 单位为blocks
Configure the filesystem for a RAID array with stride-size filesystem blocks. This is the number
of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the
chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs
time to avoid placing them on a single disk, which can hurt performance.
It may also be used by the block allocator.
stripe_width=stripe-width 单位为blocks
Configure the filesystem for a RAID array with stripe-width filesystem blocks per stripe.
This is typically stride-size * N, where N is the number of data-bearing disks in the RAID
(e.g. for RAID 5 there
is one parity disk, so N will be the number of disks in the array minus 1).
This allows the block allocator to prevent read-modify-write of the parity in a
RAID stripe if possible when the data is written.
lazy_itable_init[= <0 to disable, 1 to enable>]
If enabled and the uninit_bg feature is enabled, the inode table will not be fully
initialized by mke2fs. This speeds up filesystem initialization noticeably,
but it requires the kernel to finish initial‐
izing the filesystem in the background when the filesystem is first mounted.
If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.
5、MOUNT文件系统
如果不使用云盘镜像备份,使用这种挂载模式。
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback LABEL=lv01 /data01
或
如果使用云盘镜像备份,使用这种挂载模式。
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01
6、FIO测试
yum install -y fio
fio -filename=/data01/testdir -direct=1 -thread -rw=write -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_write.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=read -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_read.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=randwrite -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randwrite.log 2>&1
fio -filename=/data01/testdir -direct=1 -thread -rw=randread -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randread.log 2>&1
顺序
READ: io=72621MB, aggrb=1210.3MB/s, minb=1210.3MB/s, maxb=1210.3MB/s, mint=60003msec, maxt=60003msec
WRITE: io=36845MB, aggrb=628743KB/s, minb=628743KB/s, maxb=628743KB/s, mint=60007msec, maxt=60007msec
随机
READ: io=53390MB, aggrb=911160KB/s, minb=911160KB/s, maxb=911160KB/s, mint=60002msec, maxt=60002msec
WRITE: io=26078MB, aggrb=445032KB/s, minb=445032KB/s, maxb=445032KB/s, mint=60004msec, maxt=60004msec
数据库部署
1、安装软件
2、初始化数据库
initdb -E SQL_ASCII -U postgres --locale=C -D $PGDATA
3、配置postgresql.conf
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '.'
shared_buffers = 64GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 500
bgwriter_lru_multiplier = 5.0
bgwriter_flush_after = 0
effective_io_concurrency = 16
max_worker_processes = 128
max_parallel_workers_per_gather = 64
max_parallel_workers = 128
backend_flush_after = 0
wal_level = minimal
synchronous_commit = off
wal_sync_method = fsync
full_page_writes = on
wal_buffers = 1GB
wal_writer_delay = 10ms
wal_writer_flush_after = 0
checkpoint_timeout = 35min
max_wal_size = 128GB
min_wal_size = 8GB
checkpoint_completion_target = 0.5
checkpoint_flush_after = 0
checkpoint_warning = 30s
max_wal_senders = 0
random_page_cost = 1.2
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_table_scan_size = 0
min_parallel_index_scan_size = 0
effective_cache_size = 200GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 0
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
max_locks_per_transaction = 6400
pg_ctl start
4、配置归档,归档目录不能放在数据盘的云盘里面,你可以放到其他云盘,或者放到OSS。
这里为了演示方便,放在本地云盘。(保证与数据云盘不是同一云盘)
mkdir /archivedir
chown digoal:digoal /archivedir
vi postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /archivedir/%f && cp %p /archivedir/%f'
pg_ctl restart -m fast
5、压测
5.1、写入大吞吐压测
初始化
-- 建表
do language plpgsql $$
declare
begin
for i in 0..65535 loop
execute 'create table IF NOT EXISTS test_'||i||'(id int, info text, crt_time timestamp) with (autovacuum_enabled=off, toast.autovacuum_enabled=off)';
end loop;
end;
$$;
-- 删表
do language plpgsql $$
declare
begin
for i in 0..65535 loop
execute 'drop table IF EXISTS test_'||i;
end loop;
end;
$$;
-- 写表
create or replace function batch_ins(int) returns void as $$
declare
begin
execute 'insert into test_'||$1||' select generate_series(1,10000), md5(random()::text), now()';
end;
$$ language plpgsql strict;
大吞吐写入测试,写入速度直逼200万行/s
vi test.sql
\set id random(0,65535)
select batch_ins(:id);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 20801
latency average = 184.678 ms
latency stddev = 57.522 ms
tps = 173.218345 (including connections establishing)
tps = 173.233151 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set id random(0,65535)
184.684 select batch_ins(:id);
实际为写WAL的瓶颈,采用UNLOGGED TABLE,写入速度直逼1000万行/s。
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 117367
latency average = 32.731 ms
latency stddev = 23.047 ms
tps = 977.228590 (including connections establishing)
tps = 977.331826 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set id random(0,65535)
32.727 select batch_ins(:id);
5.2、10亿数据量,OLTP压测
pgbench -i -s 10000
pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 120
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10000
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 580535
latency average = 13.229 ms
latency stddev = 30.073 ms
tps = 4829.300799 (including connections establishing)
tps = 4829.989470 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.097 BEGIN;
5.650 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.153 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
3.122 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.631 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.418 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
1.148 END;
5.3、并行读压测
postgres=# \dt+ e
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | e | table | postgres | 24 GB |
(1 row)
-- 32个并行读,24G数据处理耗时0.3秒。
postgres=# alter table e set (parallel_workers =32);
postgres=# set max_parallel_workers_per_gather =32;
postgres=# set min_parallel_table_scan_size =0;
postgres=# set min_parallel_index_scan_size =0;
postgres=# set parallel_setup_cost =0;
postgres=# set parallel_tuple_cost =0;
postgres=# set force_parallel_mode =on;
postgres=# select count(*) from e;
count
----------
28766181
(1 row)
Time: 993.876 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from e;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=178479.59..178479.60 rows=1 width=8) (actual time=305.796..305.796 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=27422
-> Gather (cost=178479.50..178479.51 rows=32 width=8) (actual time=304.868..305.784 rows=33 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 32
Workers Launched: 32
Buffers: shared hit=27422
-> Partial Aggregate (cost=178479.50..178479.51 rows=1 width=8) (actual time=291.483..291.483 rows=1 loops=33)
Output: PARTIAL count(*)
Buffers: shared hit=23934
Worker 0: actual time=281.287..281.287 rows=1 loops=1
Buffers: shared hit=718
..........
Worker 30: actual time=299.304..299.304 rows=1 loops=1
Buffers: shared hit=438
Worker 31: actual time=300.227..300.227 rows=1 loops=1
Buffers: shared hit=460
-> Parallel Index Only Scan using uk_e on public.e (cost=0.44..176232.78 rows=898689 width=0) (actual time=0.085..178.852 rows=871702 loops=33)
Heap Fetches: 0
Buffers: shared hit=23934
Worker 0: actual time=0.078..173.121 rows=1014806 loops=1
Buffers: shared hit=718
..........
Worker 30: actual time=0.089..184.251 rows=635007 loops=1
Buffers: shared hit=438
Worker 31: actual time=0.070..184.000 rows=649729 loops=1
Buffers: shared hit=460
Planning time: 0.092 ms
Execution time: 314.527 ms
(144 rows)
基于云盘镜像的备份
备份过程
1、开始备份
select pg_start_backup('xx');
2、打云盘快照
调用API或在阿里云ECS控制台执行。
3、压测
打快照的过程中,产生一些数据库变更。
使用前面的例子,大吞吐写入测试(使用logged table)、OLTP测试、夹杂建表、删表。
4、云盘快照结束
5、停止压测
6、结束备份
select pg_stop_backup();
7、切换日志
checkpoint;
select pg_switch_wal();
checkpoint;
select pg_switch_wal();
8、检查归档已正常
确保打快照过程中产生的WAL全部归档成功。
9、生成每个测试表的CHECKSUM
do language plpgsql $$
declare
n name;
res int8;
begin
for n in select tablename from pg_tables where schemaname='public'
loop
execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res;
raise notice 'checksum %: %', n, res;
end loop;
end;
$$;
NOTICE: checksum pgbench_history: -422046586146
NOTICE: checksum pgbench_tellers: 215006661683
NOTICE: checksum pgbench_branches: 158568443210
......
基于云盘镜像的恢复
实际上PostgreSQL本身已支持增量备份、块级增量备份的功能,但是既然有云盘快照,可以只备归档和云盘镜像,数据少走一次网络开销。
基于PG本身的备份也简单介绍一下,见本文参考部分。
恢复过程
1、创建新ECS(可以根据镜像创建)
2、如果根据数据库ECS镜像创建,则不需要这一步。
部署postgresql软件,注意需要与主库软件、插件等一致。
3、根据云盘快照创建云盘
4、复原逻辑卷
pvscan
vgscan
lvscan
5、检查文件系统
fsck.ext4 -y /dev/mapper/vgdata01-lv01
6、加载文件系统
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01
7、清理WAL文件,(因为是多盘快照,快照不可能在同一时间点,因此WAL文件可能出现partial write。wal文件是恢复数据文件的关键,因此必须完整。所以我们从归档来获取WAL。)。
rm -f $PGDATA/pg_wal/*
注意,如果是单盘,可以只删除最后一个WAL。
8、配置恢复
cp $PGHOME/share/recovery.conf.sample $PGDATA/recovery.conf
vi $PGDATA/recovery.conf # (auto copy archive to $PGDATA/pg_wal)
restore_command = 'cp /archivedir/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p'
standby_mode = on
9、启动数据库
pg_ctl start
10、等待恢复完成,恢复完成即恢复到最后一个已归档的WAL文件。
11、激活数据库
pg_ctl promote
12、检查CHECKSUM
do language plpgsql $$
declare
n name;
res int8;
begin
for n in select tablename from pg_tables where schemaname='public'
loop
execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res;
raise notice 'checksum %: %', n, res;
end loop;
end;
$$;
13、检查所有数据块是否正常。
set vacuum_freeze_min_age =0;
vacuum freeze;
小结
1、PostgreSQL 具备WAL和FULL PAGE WRITE,可以实现数据文件的不一致恢复,也就是说备份时的数据文件partial write可以被检查点后的wal full page write PAGE修复。最终实现一致性。
2、多块云盘并行备份,提高了数据备份的速度。
3、多块云盘,提升了整体的读写IOPS和读写带宽。
4、PostgreSQL 通过组提交、异步提交、数据文件异步写,解决了云盘单次IO延迟较本地延迟更高一点的性能问题。也就是说对IO延迟不敏感。
5、PostgreSQL 利用多核并行,可以在单条SQL中,充分利用多块云盘带来的高IOPS和高读写带宽的能力。
6、通过逻辑卷、软RADI、ZFS等手段,可以充分利用多云盘的能力。
参考
1、《PostgreSQL on Linux 最佳部署手册》
2、zfs
《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)双机HA与块级备份部署》
《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)单个数据库采用多个zfs卷(如表空间)时如何一致性备份》
《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)备份集自动校验》
《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)方案与实战》
3、man lvm
4、man mdadm
5、备份原理
《PostgreSQL 最佳实践 - 块级别增量备份(pg_rman baseon LSN)源码浅析与使用》
6、时间点恢复原理与最佳实践
《PostgreSQL 最佳实践 - 任意时间点恢复源码分析》