标签
PostgreSQL , recovery , recovery.conf , restore_command , timeline , 时间线 , next wal , PITR , 时间点恢复
背景
PostgreSQL数据库支持PITR时间点恢复。默认情况下,只需要配置目标是时间点,resotre_command即可,PG会自动调用resotre_command去找需要的WAL文件。
一个典型的recovery.conf配置如下:
#---------------------------------------------------------------------------
# ARCHIVE RECOVERY PARAMETERS
#---------------------------------------------------------------------------
#
# restore_command
#
# specifies the shell command that is executed to copy log files
# back from archival storage. The command string may contain %f,
# which is replaced by the name of the desired log file, and %p,
# which is replaced by the absolute path to copy the log file to.
#
# This parameter is *required* for an archive recovery, but optional
# for streaming replication.
#
# It is important that the command return nonzero exit status on failure.
# The command *will* be asked for log files that are not present in the
# archive; it must return nonzero when so asked.
#
# NOTE that the basename of %p will be different from %f; do not
# expect them to be interchangeable.
#
restore_command = 'cp /data01/digoal/wal/%f %p'
#---------------------------------------------------------------------------
# RECOVERY TARGET PARAMETERS
#---------------------------------------------------------------------------
#
# By default, recovery will rollforward to the end of the WAL log.
# If you want to stop rollforward at a specific point, you
# must set a recovery target.
#
# You may set a recovery target either by transactionId, by name, by
# timestamp or by WAL location (LSN). Recovery may either include or
# exclude the transaction(s) with the recovery target value (i.e.,
# stop either just after or just before the given target,
# respectively).
#
#
#recovery_target_name = '' # e.g. 'daily backup 2011-01-26'
#
recovery_target_time = '2019-03-05 20:52:16.294366+08' # e.g. '2004-07-14 22:39:00 EST'
#
#recovery_target_xid = ''
#
#recovery_target_lsn = '' # e.g. '0/70006B8'
#
#recovery_target_inclusive = true
recovery_target_timeline = 'latest'
# If recovery_target_action = 'pause', recovery will pause when the
# recovery target is reached. The pause state will continue until
# pg_wal_replay_resume() is called. This setting has no effect if
# no recovery target is set. If hot_standby is not enabled then the
# server will shutdown instead, though you may request this in
# any case by specifying 'shutdown'.
#
#recovery_target_action = 'pause'
#---------------------------------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------------------------------
#
# standby_mode
#
# When standby_mode is enabled, the PostgreSQL server will work as a
# standby. It will continuously wait for the additional XLOG records, using
# restore_command and/or primary_conninfo.
#
standby_mode = on
恢复目标支持:
1、时间
2、自定义还原点名字
3、事务ID
4、WAL LSN
《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 2》
《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 1》
《PostgreSQL recovery target introduce》
接下来的问题,如果无法直接通过restore_command获取文件,又当如何呢?
restore_command = 'cp /data01/digoal/wal/%f %p'
方法1,通过restore_command吐出需要的文件名
recovery.conf
restore_command = 'cp /data01/digoal/wal/%f %p || echo "`date +%F%T` %f" >> /tmp/needwalfile;'
当找不到WAL文件时,就会吐到/tmp/needwalfile
cat /tmp/needwalfile
2019-03-0522:11:28 000000010000005D000000B2
2019-03-0522:11:28 00000002.history
2019-03-0522:11:33 000000010000005D000000B2
2019-03-0522:11:33 00000002.history
2019-03-0522:11:38 000000010000005D000000B2
2019-03-0522:11:38 00000002.history
将文件拷贝到restore_command配置的/data01/digoal/wal目录,restore_command命令将继续。
优先拷贝history文件(走到新的时间线),原理参考末尾引用文档。
通过pg_is_wal_replay_paused()函数得到当前实例是否已经到达目标还原点。如果返回T,表示已到达,则不再需要给PG新的文件。
postgres=# select pg_is_wal_replay_paused();
pg_is_wal_replay_paused
-------------------------
f
(1 row)
方法2,通过log文件得到需要的WAL文件名
配置PG的LOG文件,一样能得到上面的内容。
postgresql.conf
# - Where to Log -
log_destination = 'csvlog' # Valid values are combinations of
# stderr, csvlog, syslog, and eventlog,
# depending on platform. csvlog
# requires logging_collector to be on.
# This is used when logging to stderr:
logging_collector = on # Enable capturing of stderr and csvlog
# into log files. Required to be on for
# csvlogs.
# (change requires restart)
# These are only used if logging_collector is on:
log_directory = 'log' # directory where log files are written,
# can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
log_filename='pg.log'
# can include strftime() escapes
#log_file_mode = 0600 # creation mode for log files,
# begin with 0 to use octal notation
log_truncate_on_rotation = on # If on, an existing log file with the
# same name as the new log file will be
# truncated rather than appended to.
# But such truncation only occurs on
# time-driven rotation, not on restarts
# or size-driven rotation. Default is
# off, meaning append to existing files
# in all cases.
log_rotation_age = 1d # Automatic rotation of logfiles will
# happen after that time. 0 disables.
##log_rotation_size = 10MB # Automatic rotation of logfiles will
# happen after that much log output.
# 0 disables.
当找不到WAL文件时,就会吐到$PGDATA/log/pg.log
digoal@pg11-test-> cat pg.log
2019-03-05 22:14:00.167 CST [38155] LOG: 00000: ending log output to stderr
2019-03-05 22:14:00.167 CST [38155] HINT: Future log output will go to log destination "csvlog".
2019-03-05 22:14:00.167 CST [38155] LOCATION: PostmasterMain, postmaster.c:1298
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory
将文件拷贝到restore_command配置的/data01/digoal/wal目录,restore_command命令将继续。
优先拷贝history文件(走到新的时间线),原理参考末尾引用文档。
通过pg_is_wal_replay_paused()函数得到当前实例是否已经到达目标还原点。如果返回T,表示已到达,则不再需要给PG新的文件。
postgres=# select pg_is_wal_replay_paused();
pg_is_wal_replay_paused
-------------------------
f
(1 row)
方法3,修改内核,通过UDF支持
例如,直接从UDF中获取当前startup进程需要的WAL文件名和TL history文件名。
src/backend/access/transam/xlog.c
UDF支持的弊端:当数据库还没有进入一致状态时,并不能连接到数据库执行查询,另外如果没有开启hot_standby模式,也不能连到恢复中的从库进行查询。使用场景受限。
开启数据库的hot_standby模式,确保可以在恢复过程中,连接到数据库进行UDF查询。
# These settings are ignored on a master server.
hot_standby = on # "off" disallows queries during recovery
# (change requires restart)
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
#max_standby_streaming_delay = 30s # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
wal_receiver_status_interval = 1s # send replies at least this often
# 0 disables
#hot_standby_feedback = off # send info from standby to prevent
# query conflicts
#wal_receiver_timeout = 60s # time that receiver waits for
# communication from master
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
其他知识点
时间点恢复,如何取下一个文件。
将数据库恢复配置为hot_standby模式,允许在数据库恢复过程中,连接到数据库。获取需要的信息。
1、当返回database is in startup mode,表示无法连接数据库时,说明还需要日志文件,数据库才能到一致性点允许连接,此时,除了从前面说的LOG文件中获得需要的文件,实际上进程也会突出对应的内容。
例如
digoal 25596 25594 0 19:54 ? 00:00:00 postgres: startup recovering 000000010000005D000000A7
digoal 20 0 16.684g 1860 1276 S 0.0 0.0 0:02.59 postgres: startup waiting for 000000010000005D000000B2
日志的内容如下
cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A7’: No such file or directory
cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory
当我们将需要的归档拷贝到对应目录后,
digoal@pg11-test-> cp wal/000000010000005D000000A7 waltest/
当我们将需要的归档拷贝到对应目录后,需要的WAL文件向前推移,日志的内容如下
cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A8’: No such file or directory
cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory
2、当可以连接恢复中的数据库后,可以通过一些系统函数,查看到数据库的一些信息
2.1、查看当前数据库正在replay 的wal LSN
postgres=# select pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
5D/A7FFFFE0
(1 row)
2.2、查看当前数据库的恢复是否pause,(如果是自动pause的,说明已经到达设置的还原点)
postgres=# select pg_is_wal_replay_paused();
pg_is_wal_replay_paused
-------------------------
f
(1 row)
2.3、查看lsn对应的wal文件,不允许在standby实例中执行,如果能执行的话,可以直接从当前数据库正在replay 的wal LSN得到WAL文件名。
postgres=# select * from pg_walfile_name('5D/A7FFFFE0');
ERROR: 55000: recovery is in progress
HINT: pg_walfile_name() cannot be executed during recovery.
LOCATION: pg_walfile_name, xlogfuncs.c:521
2.4、当前数据库的时间线 (history)
postgres=# select * from pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_lsn | 5D/A7000028
redo_lsn | 5D/A7000028
redo_wal_file | 000000010000005D000000A7
timeline_id | 1
prev_timeline_id | 1
full_page_writes | t
next_xid | 0:1286297007
next_oid | 1912406
next_multixact_id | 1
next_multi_offset | 0
oldest_xid | 101420357
oldest_xid_dbid | 13285
oldest_active_xid | 0
oldest_multi_xid | 1
oldest_multi_dbid | 1910618
oldest_commit_ts_xid | 0
newest_commit_ts_xid | 0
checkpoint_time | 2019-03-05 19:44:51+08
2.5、控制文件内容
postgres=# select * from pg_control_system();
-[ RECORD 1 ]------------+-----------------------
pg_control_version | 1100
catalog_version_no | 201809051
system_identifier | 6636510237226062864
pg_control_last_modified | 2019-03-05 19:54:56+08
2.6、当前实例如果重启,需要的最早的REDO。
postgres=# select * from pg_control_recovery();
-[ RECORD 1 ]-----------------+------------
min_recovery_end_lsn | 5D/A7FFFFE0
min_recovery_end_timeline | 1
backup_start_lsn | 0/0
backup_end_lsn | 0/0
end_of_backup_record_required | f
2.7、从当前wal目录中,获取到最大的WAL文件名,通常会是当前需要的WAL或者上一个已经REPLAY万的WAL文件。
postgres=# select * from pg_ls_waldir() order by 1 desc limit 1;
name | size | modification
--------------------------+----------+------------------------
000000010000005D000000A7 | 16777216 | 2019-03-05 19:48:53+08
(1 row)
时间点恢复,手工拷贝wal文件的流程 - 通常不需要手工拷贝,只要指定restore_command让数据库自己来即可
配置recovery.conf
1、配置恢复目标
2、配置restore_command命令,打印下一个需要的WAL文件以及HISTORY文件,输出到某个文件中。参考方法1。
3、配置pause
4、配置打开hot_standby
5、从restore_command命令输出到某个文件中得到。下一个需要的WAL文件以及HISTORY文件。
优先拷贝history文件,防止走错时间线。
如果history文件确实存在并拷贝成功,下一个拷贝的文件是.partial文件,千万不要搞错。
6、通过pg_is_wal_replay_paused判断是否停止
postgres=# select pg_is_wal_replay_paused();
pg_is_wal_replay_paused
-------------------------
f
(1 row)
如果返回T,表示已经到达还原点,不需要在拷贝文件。
参考
《PostgreSQL 时间点恢复(PITR)在异步流复制主从模式下,如何避免主备切换后PITR恢复(备库、容灾节点、只读节点)走错时间线(timeline , history , partial , restore_command , recovery.conf)》
《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 2》
《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 1》
《PostgreSQL recovery target introduce》