PostgreSQL pending patch : fail-back without fresh backup (have bug?)

本文涉及的产品
云原生数据库 PolarDB PostgreSQL 版,标准版 2核4GB 50GB
云原生数据库 PolarDB MySQL 版,通用型 2核4GB 50GB
简介:
[补充说明]
本文测试fail-back未成功的bug是-m immediate的一个bug, 已经修复 : 
[正文]
PostgreSQL的流复制应用非常普遍, 同时也可以作为HA的关键技术, 当主节点出现故障时, 可以通过激活备节点的方法达到高可用到目的.
当备份节点激活后, 原来的主节点可以作为备节点来使用. 
但是, 如果主节点在发生故障前的WAL没有全部同步给备节点的话, 备节点激活后, 实际上与主节点是存在一定差异的. 这种情况下主节点不能直接作为备节点参与利用流复制做的增量恢复.  遇到这样的情况 怎么样把主节点变成备节点呢?
1. 使用pg_basebackup, 或者COPY 的方法重新做一个standby, 这样的话相当于要把整个数据库的数据文件全部传一遍. 数据量大或者网络交差的话是比较痛苦的事情.
2. 使用rsync, 只传输增量数据, 但是需要注意的是, 如果数据库很大的话, rsync的数据比对过程非常漫长, 并且要消耗大量的io资源.
除了以上两种方法, 还有其他的工具可以利用, 例如 在此前写过一篇关于fail-back的工具pg_rewind的使用文章. 
pg_rewind利用新的primary的history 文件, 取出切换时间线是的xlog位置信息, 用这个位置信息来匹配旧的primary节点的xlog文件中的信息, 如果旧的primary在这个点之后还产生了xlog信息, 那么说明standby在promote后有未同步至standby节点的xlog数据, 所以需要对这部分xlog做差异处理. 把这部分xlog对应的变更的块以及控制文件,clog等从新的primary拷贝到老的primary节点. 
具体可参考 : 
除了pg_rewind, PostgreSQL可能在内核中加入类似pg_rewind的功能, 当然原理完全不一样, 加入内核的功能原理如下 : 
The proposal is based on the concept of master should not do any file
system level change until corresponding WAL record is replicated to the
standby.

There are many places in the code which need to be handled to support the
proposed solution.  Following cases explains the need of fresh backup at
the time of failover, and how can we avoid this need by our approach.

1. We must not write any heap pages to the disk before the WAL records
corresponding to those changes are received by the standby. Otherwise if
standby failed to receive WAL corresponding to those heap pages there will
be inconsistency.

2. When CHECKPOINT happens on the master, control file of master gets
updated and last checkpoint record is written to it. Suppose failover
happens and standby fails to receive the WAL record corresponding to
CHECKPOINT, then master and standby has inconsistent copies of control file
that leads to the mismatch in redo record and recovery will not start
normally. To avoid this situation we must not update the control file of
master before the corresponding checkpoint WAL record is received by the
standby

3. Also when we truncate any of the physical files on the master and
suppose the standby failed to receive corresponding WAL, then that physical
file is truncated on master but still available on standby causing
inconsistency. To avoid this we must not truncate physical files on the
master before the WAL record corresponding to that operation is received by
the standby.

4. Same case applies to CLOG pages. If CLOG page is written to the disk and
corresponding WAL record is not replicated to the standby, leads to the
inconsistency. So we must not write the CLOG pages (and may be other SLRU
pages too) to the disk before the corresponding WAL records are received by
standby.

5. The same problem applies for the commit hint bits. But it is more
complicated than the other problems, because no WAL records are generated
for that, hence we cannot apply the same above method, that is wait for
corresponding WAL record to be replicated on standby.  So we delay the
processes of updating the commit hint bits, similar to what is done by
asynchronous commits.  In other words we need to check if the WAL
corresponding to the transaction commit is received by the failback safe
standby and then only allow hint bit updates.

*Patch explanation:*

The initial work on this patch is done by Pavan Deolasee. I tested it and
will make further enhancements based on the community feedback.

This patch is not complete yet, but I plan to do so with the help of this
community. At this point, the primary purpose is to understand the
complexities and get some initial performance numbers to alleviate some of
the concerns raised by the community.

There are two GUC parameters which supports this failsafe standby

1. failback_safe_standby_name  [ name of the failsafe standby ] It is the
name of failsafe standby. Master will not do any file system level change
before corresponding WAL is replicated on the this failsafe standby

2. failback_safe_standby_mode  [ off/remote_write/remote_flush] This
parameter specifies the behavior of master i.e. whether it should wait for
WAL to be written on standby or WAL to be flushed on standby.  We should
turn it off when we do not want the failsafe standby. This failsafe mode
can be combined with synchronous as well as asynchronous streaming
replication.

Most of the changes are done in the syncrep.c. This is a slight misnomer
because that file deals with synchronous standby and a failback standby
could and most like be a async standby. But keeping the changes this way
has ensured that the patch is easy to read. Once we have acceptance on the
approach, the patch can be modified to reorganize the code in a more
logical way.

The patch adds a new state SYNC_REP_WAITING_FOR_FAILBACK_SAFETY to the sync
standby states. A backend which is waiting for a failback safe standby to
receive WAL records, will wait in this state.  Failback safe mechanism can
work in two different modes, that is wait for WAL to be written or flushed
on failsafe standby. That is represented by two new modes
SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH
respectively.

Also the SyncRepWaitForLSN() is changed for conditional wait. So that we
can delay hint bit updates on master instead of blocking the wait for the
failback safe standby to receiver WAL's.

到第四版的补丁, 新增的参数名更改为 synchronous_transfer, 可配置3个值 :
synchronous_transfer = commit   # data page synchronization level
                                # commit, data_flush or all

synchronous_transfer参数的含义 : 
-'commit'
    'commit' means that master waits for corresponding WAL to flushed
to disk of standby server on commits.
    but master doesn't waits for replicated data pages.

  -'data_flush'
    'data_flush' means that master waits for replicated data page
(e.g, CLOG, pg_control) before flush to disk of master server.
    but if user set to 'data_flush' to this parameter,
'synchronous_commit' values is ignored even if user set
'synchronous_commit'.

  -'all'
    'all' means that master waits for replicated WAL and data page.

commit类似同步流复制,记录master节点的commit状态前,等待该xlog已经在standby flushed to disk. 
data_flush则表示master节点在flush data page前, 必须确保这些变更对应的xlog信息已经在standby 节点flush to disk了. (这个配置将忽略synchronous_commit的配置)

同步流复制增加1种等待模式( SYNC_REP_WAIT_DATA_FLUSH ) : 
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,42 @@
 #define SyncRepRequested() \
        (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
 
+#define SyncTransRequested() \
+       (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+       (max_wal_senders > 0 && synchronous_transfer ==  SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
 /* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT               -1
-#define SYNC_REP_WAIT_WRITE            0
-#define SYNC_REP_WAIT_FLUSH            1
+#define SYNC_REP_NO_WAIT                                       -1
+#define SYNC_REP_WAIT_WRITE                                    0
+#define SYNC_REP_WAIT_FLUSH                                    1
+#define SYNC_REP_WAIT_DATA_FLUSH       2
 
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE                         3
 
 /* syncRepState */
-#define SYNC_REP_NOT_WAITING           0
-#define SYNC_REP_WAITING                       1
-#define SYNC_REP_WAIT_COMPLETE         2
+#define SYNC_REP_NOT_WAITING                                   0
+#define SYNC_REP_WAITING                                               1
+#define SYNC_REP_WAIT_COMPLETE                                 2
+
+typedef enum
+{
+       SYNCHRONOUS_TRANSFER_COMMIT,            /* no wait for flush data page */
+       SYNCHRONOUS_TRANSFER_DATA_FLUSH,        /* wait for data page flush only
+                                                                                * no wait for WAL */
+       SYNCHRONOUS_TRANSFER_ALL                /* wait for data page flush */
+}      SynchronousTransferLevel;

src/backend/replication/syncrep.c
/*
 * ===========================================================
 * Synchronous Replication functions for normal user backends
 * ===========================================================
 */

/*
 * Wait for synchronous replication, if requested by user.
 *
 * Initially backends start in state SYNC_REP_NOT_WAITING and then
 * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
 * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
 * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
 * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
 *
 * ForDataFlush - if TRUE, we wait for the flushing data page.
 * Otherwise wait for the sync standby
 *
 * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
 * the standby has already made progressed upto the given XactCommitLSN
 *
 * Return TRUE if either the sync standby is not
 * configured/turned off OR the standby has made enough progress
 */
bool
SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)

[测试]
下载PostgreSQL src : 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=snapshot;h=e5592c61adb0766eaee53ec07d2f05783d1c6548;sf=tgz
tar -zxvf postgresql-e5592c6.tar.gz
cd postgresql-e5592c6

补丁 :
[root@db-172-16-3-39 postgresql-e5592c6]# wget http://www.postgresql.org/message-id/attachment/29640/failback_safe_standby_v4.patch
[root@db-172-16-3-39 postgresql-e5592c6]# patch -p1 < ./failback_safe_standby_v4.patch 
patching file src/backend/access/transam/clog.c
patching file src/backend/access/transam/slru.c
patching file src/backend/access/transam/twophase.c
patching file src/backend/access/transam/xact.c
patching file src/backend/access/transam/xlog.c
patching file src/backend/catalog/storage.c
patching file src/backend/replication/syncrep.c
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/utils/cache/relmapper.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/misc/postgresql.conf.sample
patching file src/backend/utils/time/tqual.c
patching file src/include/access/xlog.h
patching file src/include/replication/syncrep.h

安装PostgreSQL
[root@db-172-16-3-39 postgresql-e5592c6]# ./configure --prefix=/home/pg94/pgsql9.4devel --with-pgport=2999 --with-perl --with-tcl --with-python --with-openssl --with-pam --without-ldap --with-libxml --with-libxslt --enable-thread-safety --with-wal-blocksize=16 && gmake && gmake install
cd contrib
gmake
gmake install

初始化数据库
initdb -D $PGDATA -E UTF8 --locale=C -W -U postgres

配置master
cd $PGDATA
vi pg_hba.conf 
host replication postgres 172.16.3.0/24 md5
host all all 0.0.0.0/0 md5
vi postgresql.conf 
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;
port = 2999                             # (change requires restart)
max_connections = 100                   # (change requires restart)
superuser_reserved_connections = 13     # (change requires restart)
unix_socket_directories = '.'   # comma-separated list of directories
unix_socket_permissions = 0700          # begin with 0 to use octal notation
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10               # TCP_KEEPCNT;
shared_buffers = 1024MB                 # min 128kB
maintenance_work_mem = 512MB            # min 1MB
shared_preload_libraries = 'pg_stat_statements'         # (change requires restart)
vacuum_cost_delay = 10                  # 0-100 milliseconds
vacuum_cost_limit = 10000               # 1-10000 credits
bgwriter_delay = 10ms                   # 10-10000ms between rounds
wal_level = hot_standby                 # minimal, archive, or hot_standby
synchronous_commit = remote_write               # synchronization level;
wal_sync_method = fdatasync             # the default is the first option
full_page_writes = on                   # recover from partial page writes
wal_buffers = 16384kB                   # min 32kB, -1 sets based on shared_buffers
checkpoint_segments = 32                # in logfile segments, min 1, 16MB each
archive_mode = on               # allows archiving to be done
archive_command = '/bin/date'           # command to use to archive a logfile segment
max_wal_senders = 32            # max number of walsender processes
wal_keep_segments = 128         # in logfile segments, 16MB each; 0 disables
synchronous_transfer = data_flush   # data page synchronization level
hot_standby = on                        # "on" allows queries during recovery
max_standby_archive_delay = 300s        # max delay before canceling queries
max_standby_streaming_delay = 300s      # max delay before canceling queries
wal_receiver_status_interval = 1s       # send replies at least this often
hot_standby_feedback = on               # send info from standby to prevent
random_page_cost = 1.5                  # same scale as above
effective_cache_size = 128000MB
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on          # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
log_file_mode = 0600                    # creation mode for log files,
log_truncate_on_rotation = on           # If on, an existing log file with the
log_rotation_age = 1d                   # Automatic rotation of logfiles will
log_rotation_size = 10MB                # Automatic rotation of logfiles will
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose           # terse, default, or verbose messages
log_timezone = 'PRC'
autovacuum = on                 # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
default_text_search_config = 'pg_catalog.english'
pg_stat_statements.max = 1000
pg_stat_statements.track = all
vi ~/.pgpass
172.16.3.33:2999:replication:postgres:postgres
172.16.3.39:2999:replication:postgres:postgres
chmod 400 ~/.pgpass
vi $PGDATA/recovery.done
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=172.16.3.33 port=2999 user=postgres keepalives_idle=60'                # e.g. 'host=localhost port=5432'


创建standby
安装过程略
复制
pg94@db-172-16-3-33-> pg_basebackup -D $PGDATA -F p -P -v -h 172.16.3.39 -p 2999 -U postgres
WARNING:  skipping special file "./.s.PGSQL.2999"
WARNING:  skipping special file "./.s.PGSQL.2999"94/pg_root/pg_clog/0000)
26943/26943 kB (100%), 1/1 tablespace                                         
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
pg_basebackup: base backup completed

配置
cd $PGDATA
mv recovery.done recovery.conf
vi recovery.conf
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=172.16.3.39 port=2999 user=postgres keepalives_idle=60'                # e.g. 'host=localhost port=5432'

启动standby
pg_ctl start


主节点,  查看复制状态 : 
pg94@db-172-16-3-39-> psql
psql (9.4devel)
Type "help" for help.
digoal=# select * from pg_stat_replication ;
  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |   st
ate   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-----
------+---------------+----------------+----------------+-----------------+---------------+------------
 12221 |       10 | postgres | walreceiver      | 172.16.3.33 |                 |       20441 | 2013-08-06 09:15:35.617838+08 | stre
aming | 0/3000090     | 0/3000090      | 0/3000090      | 0/3000090       |             0 | async
(1 row)


创建测试表, 测试函数 
digoal=# create table test (id int primary key, info text, crt_time timestamp);
CREATE TABLE
digoal=# create or replace function func() returns void as $$
digoal$# declare 
digoal$#   v_id int;
digoal$# begin
digoal$#   v_id := round(5000000*random());
digoal$#   update test set info=md5(random()::text),crt_time=clock_timestamp() where id=v_id;
digoal$#   if found then
digoal$#     return;
digoal$#   else
digoal$#     insert into test values (v_id,md5(random()::text),clock_timestamp());
digoal$#   end if;
digoal$#   return;
digoal$# end;
digoal$# $$ language plpgsql strict;
CREATE FUNCTION

测试

digoal=# select func();
 func 
------
 
(1 row)
digoal=# select * from test;
   id    |               info               |          crt_time          
---------+----------------------------------+----------------------------
 1894289 | e759b24e9cb227c300f85666c4ac2f81 | 2013-08-06 09:21:30.910392
(1 row)

压力测试  : 
pg94@db-172-16-3-39-> vi test.sql
select func();
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 10 digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 10 s
number of transactions actually processed: 211697
tps = 21156.265771 (including connections establishing)
tps = 21187.462509 (excluding connections establishing)
statement latencies in milliseconds:
        0.752017        select func();


测试failback 失败 : 
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal
Client 2 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(521247) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement
Client 14 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(4957439) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement
Client 15 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(3050904) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 100 s
number of transactions actually processed: 2063626
tps = 20633.969217 (including connections establishing)
tps = 20637.002726 (excluding connections establishing)
statement latencies in milliseconds:
        0.712672        select func();

以上测试过程中将standby数据库关闭
pg94@db-172-16-3-33-> pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped

测试完后, 在主节点执行checkpoint; 因为配置了synchronous_transfer = data_flush. 会出现等待状态!
postgres: checkpointer process    waiting for 0/2D0C99D8

此时停主库也是无法停止的, 也有flush的动作, 所以会等待.
12198 pg94      15   0 1181m  47m  46m S  0.0  0.6   0:00.04 /home/pg94/pgsql9.4devel/bin/postgres                                  
12199 pg94      15   0  108m 1032  504 S  0.0  0.0   0:00.00 postgres: logger process                                               
12201 pg94      15   0 1182m 198m 197m S  0.0  2.5   0:00.34 postgres: checkpointer process    waiting for 0/2D0C99D8               
12205 pg94      15   0  110m 1024  460 S  0.0  0.0   0:00.01 postgres: archiver process   last was 00000001000000000000002C         
12206 pg94      15   0  110m 1228  576 S  0.0  0.0   0:00.04 postgres: stats collector process                                      
12687 pg94      15   0 66068 1552 1208 S  0.0  0.0   0:00.01 -bash                                                                  
12786 pg94      15   0  109m 1748 1408 S  0.0  0.0   0:00.00 pg_ctl stop -m fast

使用-m immediate关闭后, 重启主库.
pg94@db-172-16-3-39-> psql
psql (9.4devel)
Type "help" for help.
digoal=# select count(*),sum(hashtext(test.*::text)) from test;
  count  |     sum      
---------+--------------
 1828052 | 747103747132
(1 row)

开启standby, 数据一致
pg94@db-172-16-3-33-> psql
psql (9.4devel)
Type "help" for help.

digoal=# select count(*),sum(hashtext(test.*::text)) from test;
  count  |     sum      
---------+--------------
 1828052 | 747103747132
(1 row)


测试fail-over.
主库测试过程中,  promote备库.
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal
./test.sql: No such file or directory
pg94@db-172-16-3-39-> cd
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal
Client 4 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(2592452) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement
Client 7 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(3677118) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement

promote备库.
pg94@db-172-16-3-33-> pg_ctl promote
server promoting
2013-08-06 09:34:46.772 CST,,,4728,,52005220.1278,6,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"received promote request",,,,,,,,"CheckForStandbyTrigger, xlog.c:11127",""
2013-08-06 09:34:46.772 CST,,,4745,,5200523e.1289,2,,2013-08-06 09:32:46 CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,"ProcessWalRcvInterrupts, walreceiver.c:167",""
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,7,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"record with zero length at 0/45AD5F80",,,,,,,,"ReadRecord, xlog.c:4420",""
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,8,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"redo done at 0/45AD5F50",,,,,,,,"StartupXLOG, xlog.c:6878",""
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,9,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"last completed transaction was at log time 2013-08-06 09:34:46.753686+08",,,,,,,,"StartupXLOG, xlog.c:6883",""
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,10,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"selected new timeline ID: 2",,,,,,,,"StartupXLOG, xlog.c:6989",""
2013-08-06 09:34:47.080 CST,,,4728,,52005220.1278,11,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,"exitArchiveRecovery, xlog.c:5639",""
2013-08-06 09:35:21.939 CST,,,4729,,52005220.1279,2,,2013-08-06 09:32:16 CST,,0,LOG,00000,"restartpoint complete: wrote 24656 buffers (18.8%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.076 s, sync=0.868 s, total=150.366 s; sync files=41, longest=0.527 s, average=0.021 s",,,,,,,,"LogCheckpointEnd, xlog.c:7898",""
2013-08-06 09:35:21.939 CST,,,4729,,52005220.1279,3,,2013-08-06 09:32:16 CST,,0,LOG,00000,"recovery restart point at 0/2D0C99D8","last completed transaction was at log time 2013-08-06 09:34:46.753686+08",,,,,,,"CreateRestartPoint, xlog.c:8715",""
2013-08-06 09:36:03.135 CST,,,4840,"",52005303.12e8,1,"",2013-08-06 09:36:03 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,"BackendInitialize, postmaster.c:3923",""
2013-08-06 09:36:03.136 CST,"postgres","digoal",4840,"[local]",52005303.12e8,2,"authentication",2013-08-06 09:36:03 CST,2/3,0,LOG,00000,"connection authorized: user=postgres database=digoal",,,,,,,,"PerformAuthentication, postinit.c:239",""
2013-08-06 09:36:16.409 CST,"postgres","digoal",4840,"[local]",52005303.12e8,3,"idle",2013-08-06 09:36:03 CST,,0,LOG,00000,"disconnection: session time: 0:00:13.274 user=postgres database=digoal host=[local]",,,,,,,,"log_disconnections, postgres.c:4427","psql"


注意由于备库也配置了synchronous_transfer = data_flush, 所有的data flush操作都会等待, 所以要尽快把主库转成备库.
pg94@db-172-16-3-39-> cd $PGDATA
pg94@db-172-16-3-39-> mv recovery.done recovery.conf
pg94@db-172-16-3-39-> pg_ctl restart -m immediate
waiting for server to shut down.... done
server stopped
server starting
pg94@db-172-16-3-39-> LOG:  00000: loaded library "pg_stat_statements"
LOCATION:  load_libraries, miscinit.c:1296

主库 fail-back失败 : 
2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,1,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system was interrupted; last known up at 2013-08-06 09:32:33 CST",,,,,,,,"StartupXLOG, xlog.c:6080",""
2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,2,,2013-08-06 09:39:38 CST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG, xlog.c:6133",""
2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,3,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6443",""
2013-08-06 09:39:38.210 CST,,,13320,,520053da.3408,4,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"redo starts at 0/36000090",,,,,,,,"StartupXLOG, xlog.c:6687",""
2013-08-06 09:39:53.644 CST,,,13320,,520053da.3408,5,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"record with zero length at 0/52438570",,,,,,,,"ReadRecord, xlog.c:4420",""
2013-08-06 09:39:53.644 CST,,,13320,,520053da.3408,6,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/52438570",,,,,,,,"CheckRecoveryConsistency, xlog.c:7348",""
2013-08-06 09:39:53.645 CST,,,13318,,520053da.3406,1,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,"sigusr1_handler, postmaster.c:4737",""
2013-08-06 09:39:53.649 CST,,,13339,,520053e9.341b,1,,2013-08-06 09:39:53 CST,,0,LOG,00000,"started streaming WAL from primary at 0/52000000 on timeline 1",,,,,,,,"WalReceiverMain, walreceiver.c:366",""
2013-08-06 09:39:53.650 CST,,,13339,,520053e9.341b,2,,2013-08-06 09:39:53 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested starting point 0/52000000 is ahead of the WAL flush position of this server 0/45AD6088
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:513",""
2013-08-06 09:40:08.653 CST,,,13346,,520053f8.3422,1,,2013-08-06 09:40:08 CST,,0,LOG,00000,"started streaming WAL from primary at 0/52000000 on timeline 1",,,,,,,,"WalReceiverMain, walreceiver.c:366",""
2013-08-06 09:40:08.654 CST,,,13346,,520053f8.3422,2,,2013-08-06 09:40:08 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested starting point 0/52000000 is ahead of the WAL flush position of this server 0/45AD6088
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:513",""

将备库的synchronous_transfer改为commit. fail-back失败.
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal
Client 0 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(2242608) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement
Client 7 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (id)=(2615244) already exists.
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"
PL/pgSQL function func() line 10 at SQL statement

pg94@db-172-16-3-39-> psql
psql (9.4devel)
Type "help" for help.
digoal=# checkpoint;
CHECKPOINT
digoal=# checkpoint;
CHECKPOINT

pg94@db-172-16-3-33-> pg_ctl promote
server promoting
pg94@db-172-16-3-33-> top -c -u pg94
top - 09:52:29 up 74 days, 16:10,  1 user,  load average: 0.29, 0.13, 0.07
Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  14369440k total,  9195600k used,  5173840k free,   404756k buffers
Swap:  8385920k total,      184k used,  8385736k free,  8036496k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                
 5409 pg94      15   0 12760 1204  844 R  0.7  0.0   0:00.12 top -c -u pg94                                                         
 4262 pg94      15   0 67272 2900 1328 S  0.0  0.0   0:00.26 -bash                                                                  
 5382 pg94      15   0 1181m  47m  46m S  0.0  0.3   0:00.04 /home/pg94/pgsql9.4devel/bin/postgres                                  
 5383 pg94      16   0  108m 1048  516 S  0.0  0.0   0:00.00 postgres: logger process                                               
 5386 pg94      15   0 1182m 296m 295m S  0.0  2.1   0:00.55 postgres: checkpointer process                                         
 5387 pg94      15   0 1182m 6744 6172 S  0.0  0.0   0:00.00 postgres: writer process                                               
 5388 pg94      15   0  110m 1100  524 S  0.0  0.0   0:00.00 postgres: stats collector process                                      
 5404 pg94      15   0 1182m 1192  620 S  0.0  0.0   0:00.00 postgres: wal writer process                                           
 5405 pg94      17   0 1183m 2248 1164 S  0.0  0.0   0:00.00 postgres: autovacuum launcher process                                  
 5406 pg94      18   0  110m 1044  472 S  0.0  0.0   0:00.00 postgres: archiver process   last was 00000002.history 

pg94@db-172-16-3-39-> pg_ctl stop -m immediate
waiting for server to shut down.... done
server stopped
pg94@db-172-16-3-39-> cd $PGDATA
pg94@db-172-16-3-39-> mv recovery.done recovery.conf
pg94@db-172-16-3-39-> pg_ctl start
failed-back 失败 : 
pg94@db-172-16-3-39-> cat postgresql-2013-08-06_095340.csv
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,1,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system was interrupted; last known up at 2013-08-06 09:51:32 CST",,,,,,,,"StartupXLOG, xlog.c:6080",""
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,2,,2013-08-06 09:53:40 CST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG, xlog.c:6133",""
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,3,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6443",""
2013-08-06 09:53:40.587 CST,,,13969,,52005724.3691,4,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"redo starts at 0/74BAF018",,,,,,,,"StartupXLOG, xlog.c:6687",""
2013-08-06 09:53:45.958 CST,,,13969,,52005724.3691,5,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"record with zero length at 0/89217048",,,,,,,,"ReadRecord, xlog.c:4420",""
2013-08-06 09:53:45.959 CST,,,13969,,52005724.3691,6,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/89217048",,,,,,,,"CheckRecoveryConsistency, xlog.c:7348",""
2013-08-06 09:53:45.960 CST,,,13967,,52005724.368f,1,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,"sigusr1_handler, postmaster.c:4737",""
2013-08-06 09:53:45.965 CST,,,13976,,52005729.3698,1,,2013-08-06 09:53:45 CST,,0,LOG,00000,"fetching timeline history file for timeline 2 from primary server",,,,,,,,"WalRcvFetchTimeLineHistoryFiles, walreceiver.c:666",""
2013-08-06 09:53:45.966 CST,,,13976,,52005729.3698,2,,2013-08-06 09:53:45 CST,,0,FATAL,XX000,"could not start WAL streaming: ERROR:  requested starting point 0/89000000 on timeline 1 is not in this server's history
DETAIL:  This server's history forked from timeline 1 at 0/82603A30
",,,,,,,,"libpqrcv_startstreaming, libpqwalreceiver.c:195",""


[其他]
1. 加入data_flush后 , 需要特别注意等待事件, 如果standby正常的话, 影响不大, 如果standby 性能不行或者standby 挂了, 就如同sync replication一样会带来堵塞.
2. 如果配置了synchronous_transfer = data_flush|all, 在没有已经连上来的其他standby节点的情况下, 使用pg_basebackup创建standby会处于等待状态, 因为pg_basebackup需要在master节点执行checkpoint.
3. 使用异步流复制的情况下, 测试 synchronous_transfer  = data_flush, all都不能完成主节点的failback.
4. 使用同步流复制的情况下, 测试 synchronous_transfer  = data_flush, all可以完成正常的主节点的failback.

[参考]
7. 第4版补丁synchronous_transfer参数介绍
we support following 2 cases.
1. SYNC server and also make same failback safe standby server
2. ASYNC server and also make same failback safe standby server

1.  changed name of parameter
  give up 'failback_safe_standby_names' parameter from the first patch.
  and changed name of parameter from 'failback_safe_mode ' to
'synchronous_transfer'.
  this parameter accepts 'all', 'data_flush' and 'commit'.

  -'commit'
    'commit' means that master waits for corresponding WAL to flushed
to disk of standby server on commits.
    but master doesn't waits for replicated data pages.

  -'data_flush'
    'data_flush' means that master waits for replicated data page
(e.g, CLOG, pg_control) before flush to disk of master server.
    but if user set to 'data_flush' to this parameter,
'synchronous_commit' values is ignored even if user set
'synchronous_commit'.

  -'all'
    'all' means that master waits for replicated WAL and data page.

2. put SyncRepWaitForLSN() function into XLogFlush() function
  we have put SyncRepWaitForLSN() function into XLogFlush() function,
and change argument of XLogFlush().

they are setup case and need to set parameters.

- SYNC server and also make same failback safe standgy server (case 1)
  synchronous_transfer = all
  synchronous_commit = remote_write/on
  synchronous_standby_names = <ServerName>

- ASYNC server and also make same failback safe standgy server (case 2)
  synchronous_transfer = data_flush
  (synchronous_commit values is ignored)

- default SYNC replication
  synchronous_transfer = commit
  synchronous_commit = on
  synchronous_standby_names = <ServerName>

- default ASYNC replication
  synchronous_transfer = commit

ToDo
1. currently this patch supports synchronous transfer. so we can't set
different synchronous transfer mode to each server.
    we need to improve the patch for support following cases.
   - SYNC standby and make separate ASYNC failback safe standby
   - ASYNC standby and make separate ASYNC failback safe standby

2. we have not measure performance yet. we need to measure perfomance.

please give me your feedback.

Regards,

-------
Sawada Masahiko

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
关系型数据库 数据库 PostgreSQL
|
SQL 关系型数据库 PostgreSQL
|
安全 关系型数据库
|
关系型数据库 Shell 数据库