一,
wal的基本概念
WAL即Write Ahead Log预写式日志,简称wal日志,相当于oracle中的redo日志。只是oracle中redo是固定几个redo日志文件,然后轮着切换去写入。pg中wal日志是动态切换,单个wal日志写满继续写下一个wal日志,连续不断生成wal日志。(可以简单理解为MySQL里面的binlog日志,虽然运行机制是完全不同的,但两者效果基本是一致的)
WAL的主要用途是用于故障恢复,针对数据库的数据insert/delete/update
操作都会形成一些列的WAL日志记录,多个WAL日志组成WAL的日志序列,这些日志记录记录了哪些page
做了什么修改。如果此时数据库发生故障(主机),哪些未被提交的事务或者需要回滚的事务可以从WAL中进行恢复。
另一个用途是搭建主从数据库以及迁移数据库。
二,
wal的参数
1,
wal_segment_size 参数
单个WAL文件的大小,默认为16MB,参数是wal_segment_size,可以理解为PG把Wal日志存储到N个大小为16M(默认值)的WAL segment file,一般不做更改,Postgresql 11版本之前只能在编译pg时指定,Postgresql 11版本开始支持 initdb(初始化数据库的时候) 和 pg_resetwal(一个postgresql的配置工具) 修改 WAL 文件大小
附注:pg_resetwal工具的简单使用
重设wal文件大小之前,需要停止数据库,并使用pg数据库的管理用户操作,设定的值只能是2的倍数,例如,16,32,64,128以此类推,单位为M,不得超过1G
因此,下面的修改,18是不被接受的,64可以接受
[root@EULEER pg_wal]# su - pg1 -c"pg_resetwal --wal-segsize=18 -D /usr/local/pgsql/data" pg_resetwal: error: argument of --wal-segsize must be a power of 2 between 1 and 1024 [root@EULEER pg_wal]# su - pg1 -c"pg_resetwal --wal-segsize=64 -D /usr/local/pgsql/data" Write-ahead log reset
再次启动数据库的时候,又报这个错了:
意思是min_wal_size的值必须至少是wal_segment_size的两倍,查看postgresql.conf配置文件,发现min_wal_size的默认值是80,因此,修改为128后启动成功。
[root@EULEER pg_wal]# bash ~/start-pgsql.sh waiting for server to start....2023-03-16 00:17:01.023 CST [31444] FATAL: "min_wal_size" must be at least twice "wal_segment_size" 2023-03-16 00:17:01.023 CST [31444] LOG: database system is shut down stopped waiting pg_ctl: could not start server
启动数据库成功后,查看PGDATA目录下的pg_wal目录,可以看到wal日志大小变为了64M,原来默认的是16M
[root@EULEER pg_wal]# pwd /usr/local/pgsql/data/pg_wal [root@EULEER pg_wal]# ls -alh total 65M drwx------ 3 pg1 pg1 4.0K Mar 16 00:15 . drwx------ 19 pg1 pg1 4.0K Mar 16 00:18 .. -rw------- 1 pg1 pg1 64M Mar 16 00:24 00000002000000000000000A -rw------- 1 pg1 pg1 33 Mar 15 20:15 00000002.history drwx------ 2 pg1 pg1 4.0K Mar 16 00:14 archive_status
三,
wal文件的查看
oracle的redo日志是没有内置程序查看的,但MySQL里有,postgresql的查看redo日志程序是pg_waldump
登陆psql客户端,执行一下命令,查询现在postgresql在使用哪个wal文件(此命令是函数嵌套):
如果是开启了备份功能,也就是archive,该文件将会复制到备份目录内,也就是postgresql.conf这个配置文件里的cp命令干的事情(本例是/usr/local/pgsql/backup目录)。
archive_mode = on # enables archiving; off, on, or always # (change requires restart) archive_command = ' cp %p /usr/local/pgsql/backup/%f'
postgres=# select pg_walfile_name(pg_current_wal_lsn()); pg_walfile_name -------------------------- 00000001000000000000000B (1 row)
可以看到,确实是有 00000001000000000000000B
[root@EULEER pg_wal]# ls -al total 65548 drwx------ 3 pg1 pg1 4096 Mar 16 04:38 . drwx------ 19 pg1 pg1 4096 Mar 16 04:23 .. -rw------- 1 pg1 pg1 16777216 Mar 16 04:38 00000001000000000000000B -rw------- 1 pg1 pg1 16777216 Mar 16 04:23 00000001000000000000000C -rw------- 1 pg1 pg1 16777216 Mar 16 04:12 00000001000000000000000D -rw------- 1 pg1 pg1 16777216 Mar 16 04:36 00000001000000000000000E drwx------ 2 pg1 pg1 4096 Mar 16 04:38 archive_status [root@EULEER pg_wal]#
查看wal日志文件有多少个(和上面的命令互相印证,是一致的):
postgres=# select count(*) from pg_ls_waldir(); count ------- 4 (1 row)
[root@EULEER pg_wal]# ls -al total 65548 drwx------ 3 pg1 pg1 4096 Mar 16 04:38 . drwx------ 19 pg1 pg1 4096 Mar 16 04:23 .. -rw------- 1 pg1 pg1 16777216 Mar 16 04:38 00000001000000000000000B -rw------- 1 pg1 pg1 16777216 Mar 16 04:23 00000001000000000000000C -rw------- 1 pg1 pg1 16777216 Mar 16 04:12 00000001000000000000000D -rw------- 1 pg1 pg1 16777216 Mar 16 04:36 00000001000000000000000E drwx------ 2 pg1 pg1 4096 Mar 16 04:38 archive_status [root@EULEER pg_wal]#
随便找一个表,做增删改操作,这里是删除一个表:
postgres=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+----------+-------+----------+------------+------------- public | bonus | table | postgres | 8192 bytes | public | dept | table | postgres | 16 kB | public | emp | table | postgres | 16 kB | public | salgrade | table | postgres | 16 kB | (4 rows) postgres=# drop table bonus; DROP TABLE
持续刷新wal日志文件,类似于tail -f 命令:
pg_waldump -f 00000002000000000000000C [root@EULEER data]# pg_waldump pg_wal/00000001000000000000000C rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/0C000028, prev 0/0B0000D8, desc: CHECKPOINT_SHUTDOWN redo 0/C000028; tli 1; prev tli 1; fpw true; xid 0:518; oid 16420; multi 1; offset 0; oldest xid 479 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0C0000A0, prev 0/0C000028, desc: RUNNING_XACTS nextXid 518 latestCompletedXid 517 oldestRunningXid 518 rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0C0000D8, prev 0/0C0000A0, desc: RUNNING_XACTS nextXid 518 latestCompletedXid 517 oldestRunningXid 518 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/0C000110, prev 0/0C0000D8, desc: CHECKPOINT_ONLINE redo 0/C0000D8; tli 1; prev tli 1; fpw true; xid 0:518; oid 16420; multi 1; offset 0; oldest xid 479 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 518; online rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0C000188, prev 0/0C000110, desc: RUNNING_XACTS nextXid 518 latestCompletedXid 517 oldestRunningXid 518 rmgr: Heap len (rec/tot): 65/ 1141, tx: 518, lsn: 0/0C0001C0, prev 0/0C000188, desc: HOT_UPDATE off 12 xmax 518 flags 0x00 ; new off 15 xmax 0, blkref #0: rel 1663/16386/16395 blk 0 FPW rmgr: Transaction len (rec/tot): 34/ 34, tx: 518, lsn: 0/0C000638, prev 0/0C0001C0, desc: COMMIT 2023-03-16 08:29:11.019821 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0C000660, prev 0/0C000638, desc: RUNNING_XACTS nextXid 519 latestCompletedXid 518 oldestRunningXid 519
刷新出来如下内容:
以上表明删除表的事务ID是572,因此,需要使用前一位的事务ID571来进行恢复,恢复前需要停止数据库:
root@EULEER data]# bash ~/stop-pgsql.sh waiting for server to shut down....2023-03-16 02:01:28.827 CST [8091] LOG: received fast shutdown request 2023-03-16 02:01:28.830 CST [8091] LOG: aborting any active transactions 2023-03-16 02:01:28.831 CST [8395] FATAL: terminating connection due to administrator command 2023-03-16 02:01:28.832 CST [8200] FATAL: terminating connection due to administrator command 2023-03-16 02:01:28.847 CST [8091] LOG: background worker "logical replication launcher" (PID 8099) exited with exit code 1 2023-03-16 02:01:28.847 CST [8093] LOG: shutting down .2023-03-16 02:01:30.231 CST [8091] LOG: database system is shut down done server stopped [root@EULEER data]# su - pg1 -c "pg_resetwal -x 571 -D /usr/local/pgsql/data/" Write-ahead log reset
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/14000138, prev 0/14000100, desc: RUNNING_XACTS nextXid 518 latestCompletedXid 517 oldestRunningXid 518 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/14000170, prev 0/14000138, desc: CHECKPOINT_ONLINE redo 0/14000138; tli 1; prev tli 1; fpw true; xid 0:518; oid 16420; multi 1; offset 0; oldest xid 479 in DB 16386; oldest multi 1 in DB 16386; oldest/newest commit timestamp xid: 0/0; oldest running xid 518; online rmgr: Heap len (rec/tot): 59/ 1511, tx: 518, lsn: 0/140001E8, prev 0/14000170, desc: DELETE off 7 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1262 blk 0 FPW rmgr: Heap len (rec/tot): 59/ 923, tx: 518, lsn: 0/140007D0, prev 0/140001E8, desc: DELETE off 11 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1214 blk 0 FPW rmgr: Heap len (rec/tot): 54/ 54, tx: 518, lsn: 0/14000B70, prev 0/140007D0, desc: DELETE off 12 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1214 blk 0 rmgr: Heap len (rec/tot): 54/ 54, tx: 518, lsn: 0/14000BA8, prev 0/14000B70, desc: DELETE off 13 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1214 blk 0 rmgr: Heap len (rec/tot): 54/ 54, tx: 518, lsn: 0/14000BE0, prev 0/14000BA8, desc: DELETE off 14 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1214 blk 0 rmgr: Heap len (rec/tot): 54/ 54, tx: 518, lsn: 0/14000C18, prev 0/14000BE0, desc: DELETE off 10 flags 0x00 KEYS_UPDATED , blkref #0: rel 1664/0/1214 blk 0 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/14000C50, prev 0/14000C18, desc: RUNNING_XACTS nextXid 519 latestCompletedXid 517 oldestRunningXid 518; 1 xacts: 518 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/14000C88, prev 0/14000C50, desc: RUNNING_XACTS nextXid 519 latestCompletedXid 517 oldestRunningXid 518; 1 xacts: 518 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/14000CC0, prev 0/14000C88, desc: CHECKPOINT_ONLINE redo 0/14000C88; tli 1; prev tli 1; fpw true; xid 0:519; oid 16420; multi 1; offset 0; oldest xid 479 in DB 16386; oldest multi 1 in DB 16386; oldest/newest commit timestamp xid: 0/0; oldest running xid 518; online rmgr: Database len (rec/tot): 34/ 34, tx: 518, lsn: 0/14000D38, prev 0/14000CC0, desc: DROP dir 1663/16386 rmgr: Transaction len (rec/tot): 82/ 82, tx: 518, lsn: 0/14000D60, prev 0/14000D38, desc: COMMIT 2023-03-16 08:46:05.921552 CST; inval msgs: catcache 21 snapshot 1214; sync rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/14000DB8, prev 0/14000D60, desc: RUNNING_XACTS nextXid 519 latestCompletedXid 518 oldestRunningXid 519
查询当前事务ID:
postgres=# SELECT CAST(txid_current() AS text); txid_current -------------- 520 (1 row)
利用arclog目录内的归档文件恢复指定事务:
那么现在恢复到删除表前的话,应该是518这个事务了,编辑主配置文件,然后基础备份文件解压替换原来的data文件,重启就可以了。
restore_command = 'cp /usr/local/pgsql/arclog/%f %p' recovery_target_xid = '518'
以上简单的演示了归档文件的查看和使用,其实可以确定的一点就是,arclog归档文件就是配合pg_basebackup命令恢复用的
四,
arclog文件的管理
[postgres@node1 arclog]$ pwd /usr/local/pgsql/arclog [postgres@node1 arclog]$ ll total 1179656 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 16:31 000000010000000100000003 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 16:31 000000010000000100000004 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 16:36 000000010000000100000005 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 16:37 000000010000000100000006 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 16:37 000000010000000100000007 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 17:21 000000010000000100000008 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 17:21 000000010000000100000009 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 17:21 00000001000000010000000A -rwxr-x--- 1 postgres postgres 67108864 Sep 30 17:21 00000001000000010000000B -rwxr-x--- 1 postgres postgres 67108864 Sep 30 17:31 00000001000000010000000C -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:01 00000001000000010000000D -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:06 00000001000000010000000E -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:09 00000001000000010000000F -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:17 000000010000000100000010 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:17 000000010000000100000011 -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:18 000000010000000100000012 -rwxr-x--- 1 postgres postgres 340 Sep 30 18:18 000000010000000100000012.00000028.backup -rwxr-x--- 1 postgres postgres 67108864 Sep 30 18:18 000000010000000100000013 -rw-r----- 1 postgres postgres 67108864 Sep 30 18:46 000000020000000100000014 -rw-r----- 1 postgres postgres 34 Sep 30 18:31 00000002.history
如何清理这些文件呢?如果你确定最近短期内不会进行任何恢复工作的话:
[root@node1 pg_wal]# pg_controldata |grep "REDO WAL" Latest checkpoint's REDO WAL file: 000000020000000100000015 [root@node1 pg_wal]# pg_archivecleanup -d /usr/local/pgsql/arclog/ 000000020000000100000015
可以先试运行,在正式删除,参数-n是试运行,-d 是详细输出:
[root@node1 arclog]# pg_archivecleanup -n /usr/local/pgsql/arclog/ 000000020000000100000019 /usr/local/pgsql/arclog//000000020000000100000015 /usr/local/pgsql/arclog//000000020000000100000016 /usr/local/pgsql/arclog//000000020000000100000017 /usr/local/pgsql/arclog//000000020000000100000018 [root@node1 arclog]# pg_archivecleanup -d /usr/local/pgsql/arclog/ 000000020000000100000019 pg_archivecleanup: keeping WAL file "/usr/local/pgsql/arclog//000000020000000100000019" and later pg_archivecleanup: removing file "/usr/local/pgsql/arclog//000000020000000100000015" pg_archivecleanup: removing file "/usr/local/pgsql/arclog//000000020000000100000016" pg_archivecleanup: removing file "/usr/local/pgsql/arclog//000000020000000100000017" pg_archivecleanup: removing file "/usr/local/pgsql/arclog//000000020000000100000018"
五,
arclog的拷贝命令
- 将归档文件复制到指定目录(普通的拷贝命令):
archive_command = 'cp %p /path/to/archive/%f'
- 将归档文件压缩后复制到指定目录:
archive_command = 'gzip -c %p > /path/to/archive/%f.gz'
例子:
这些主要是针对磁盘空间比较少的情况,通常使用压缩或者rsync 推送到远程服务器上,比如,在本地压缩:
archive_mode = on # enables archiving; off, on, or always # (change requires restart) archive_command = 'test ! -f /usr/local/pgsql/arclog/%f && gzip -c %p >/usr/local/pgsql/arclog/%f.gz' [root@node1 arclog]# ls -alh total 65M drwxr-x--- 2 postgres postgres 161 Sep 30 21:57 . drwxr-x---. 8 postgres postgres 82 Sep 30 13:28 .. -rwxr-x--- 1 postgres postgres 340 Sep 30 18:18 000000010000000100000012.00000028.backup -rw-r----- 1 postgres postgres 64M Sep 30 21:56 000000020000000100000019 -rw-r----- 1 postgres postgres 64K Sep 30 21:57 00000002000000010000001A.gz -rw-r----- 1 postgres postgres 34 Sep 30 18:31 00000002.history
可以看到,很明显的压缩后,归档文件小了很多很多
- 使用rsync将归档文件复制到远程服务器:
示例:
archive_command = 'rsync -av %p user@remote:/path/to/archive/%f'
- 将归档文件发送到AWS S3存储桶:
archive_command = 'aws s3 cp %p s3://my-bucket/archive/%f'
- 利用脚本执行归档命令
推荐指数:五颗※※※※※
为什么是推荐使用脚本呢?主要是脚本可以非常的灵活,脚本内你可以写本地cp命令来进行归档,也可以写华为云或者aws对象存储命令,也可以写rsync命令将归档文件传送到远端服务器,当然了,脚本内也可以加上一些逻辑,比如,指定保留多少天的归档文件,指定压缩文件等等功能,关键是脚本的实时修改会立刻产生作用,可以立刻改变归档方式,不需要重启数据库什么的操作,这点就非常的nice了
下面就以rsync命令传送归档文件到远程服务器为例来演示一下:
rsync服务器的搭建就不重复说了,是搭建在远端服务器的,主要还是根据上面的博文搭建,数据库服务器的IP是192.168.123.11,远端存放wal归档文件的服务器地址是192.168.123.12
远端服务器12的rsync服务器主配置文件内容如下:
[root@node2 pg_archive]# cat /etc/rsyncd.conf uid = postgres gid = postgres port = 873 use chroot = yes max connections = 4 hosts allow = * pid file = /var/run/rsyncd.pid log file = /var/log/rsyncd/rsyncd.log lock file =/var/run/rsync.lock exclude = lost+found/ transfer logging = yes timeout = 900 ignore nonreadable = yes dont compress = *.gz *.tgz *.zip *.z *.Z *.rpm *.deb *.bz2 [httpd_back] path = /data/pg_archive/ comment = ftp export area read only = no auth users = rsync secrets file = /etc/rsyncd.passwd
数据库服务器11的存放归档命令的脚本:
[root@node1 arclog]# cat /home/postgres/arclog.sh #!/bin/bash #!author zsk rsync -avz /usr/local/pgsql/data/pg_wal/ rsync@192.168.123.12::httpd_back --password-file=/etc/rsync.passwd
数据库服务器11的postgresql主配置文件内的归档命令:
archive_mode = on # enables archiving; off, on, or always # (change requires restart) archive_command = '/home/postgres/arclog.sh %p %f'
注意,上面的rsync配置的用户是postgres,因此,需要给passwd文件赋权此用户,否则命令失败,会报错:
chown postgres. /etc/rsync.passwd
观察postgresql的日志可以看到rsync正常运行:
sent 8,850 bytes received 57,433 bytes 44,188.67 bytes/sec total size is 536,870,946 speedup is 8,099.68 sending incremental file list 000000020000000200000002 archive_status/ archive_status/000000020000000200000001.done archive_status/000000020000000200000002.ready sent 8,874 bytes received 57,433 bytes 44,204.67 bytes/sec total size is 536,870,946 speedup is 8,096.75 sending incremental file list ./ 000000020000000200000003 000000020000000200000007 000000020000000200000008 000000020000000200000009 00000002000000020000000A archive_status/ archive_status/000000020000000200000003.ready sent 270,479 bytes received 57,493 bytes 131,188.80 bytes/sec total size is 536,870,946 speedup is 1,636.94
OK,修改脚本,改成本地存放wal归档文件:
[root@node1 arclog]# cat /home/postgres/arclog.sh #!/bin/bash #rsync -avz /usr/local/pgsql/data/pg_wal/ rsync@192.168.123.12::httpd_back --password-file=/etc/rsync.passwd test ! -f /usr/local/pgsql/arclog/$1 && gzip -c %p >/usr/local/pgsql/arclog/$2.gz
登录postgresql数据库,执行刷新wal日志命令:
SELECT pg_switch_wal()
可以看到立刻就在本地看到了压缩后的刷新出来的归档文件,这里就完全不需要修改postgresql数据库的主配置文件了,非常的简单就可以切换wal日志文件归档方式了:
[root@node1 arclog]# pwd /usr/local/pgsql/arclog [root@node1 arclog]# ls 000000020000000200000006.gz