群里的一位兄弟问到的一个问题 :
[广州]土豆饼-pg 10:34:08
流复制的问题
[广州]土豆饼-pg 10:34:20
我今天在主库导入了150多个数据库
[广州]土豆饼-pg 10:34:28
然后对主库做基线备份
[广州]土豆饼-pg 10:34:46
拷贝到备库启动时 流复制一直起不来
[广州]土豆饼-pg 10:34:51
报错
cp: 无法 stat “/opt/PostgreSQL/9.1/archive/0000000100000000000000FD”: 没有那个文件或目录
cp: 无法 stat “/opt/PostgreSQL/9.1/archive/00000002.history”: 没有那个文件或目录
[广州]土豆饼-pg 10:35:08
主备库都照了没有这个文件00000002.history
[广州]土豆饼-pg 10:35:20
请问我的流复制怎么起起来
从现象上来看是时间线历史文件丢失了, 具体是什么原因丢失的先不管.
如果不想重做基线备份重做standby, 首先要保证standby需要的xlog信息在主库还没有被rotate掉.
下面假设xlog信息未rotate掉, 那么解决问题的关键就在如何自建时间线历史文件中的内容.
时间线历史文件的格式如下 :
* Each line in the file represents a timeline switch:
*
* <parentTLI> <switchpoint> <reason>
*
* parentTLI ID of the parent timeline
* switchpoint XLogRecPtr of the WAL position where the switch happened
* reason human-readable explanation of why the timeline was changed
*
* The fields are separated by tabs. Lines beginning with # are comments, and
* are ignored. Empty lines are also ignored.
1. parentTLI指从那个timeline切换而来, 如当前timeline=1如果发生切换的话那么新的timeline=2, 这个时间线文件的parentTLI=1.
2. switchpoint指发生切换时的xlog文件名.
3. 可以理解为comment.
以上三个分段的内容用TAB分隔.
空行和#开头的行代表注释. timeline.c略过处理这些行.
接下来模拟一下丢失时间线文件的恢复.
测试环境 :
PostgreSQL 9.1.3
node A(primary) :
node B(standby) :
node A(primary) :
node B(primary) :
node A(standby) :
【参考】
digoal@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> drop table test;
DROP TABLE
digoal=> create table test(id int,crt_time timestamp default clock_timestamp());
CREATE TABLE
digoal=> insert into test (id) select generate_series(1,1000000);
INSERT 0 1000000
digoal=> delete from test ;
DELETE 1000000
digoal=> insert into test values (1,now());
INSERT 0 1
digoal=> select * from test;
id | crt_time
----+----------------------------
1 | 2013-02-01 12:10:19.299484
(1 row)
node B(standby) :
digoalstd@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> select * from test ;
id | crt_time
----+----------------------------
1 | 2013-02-01 12:10:19.299484
(1 row)
node A(primary) :
-- 关闭节点A
digoal@db-172-16-3-33-> pg_ctl stop -m fast
waiting for server to shut down..... done
server stopped
node B(primary) :
-- promote 节点B
digoalstd@db-172-16-3-33-> pg_ctl promote
server promoting
digoalstd@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> \c digoal postgres
digoal=# checkpoint;
CHECKPOINT
digoal=# select pg_switch_xlog();
pg_switch_xlog
----------------
0/146A6CC8
(1 row)
digoal=# select pg_switch_xlog();
pg_switch_xlog
----------------
0/15000000
(1 row)
-- 查看新生成的history文件, 并删除
digoalstd@db-172-16-3-33-> cat 00000004.history
2 000000020000000000000009 no recovery target specified
3 000000030000000000000012 no recovery target specified
digoalstd@db-172-16-3-33-> rm -f 00000004.history
node A(standby) :
-- 转换成standby节点, 配置recovery.conf略.
digoal@db-172-16-3-33-> pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
digoal@db-172-16-3-33-> cd $PGDATA
digoal@db-172-16-3-33-> mv recovery.done recovery.conf
-- 启动
digoal@db-172-16-3-33-> pg_ctl start
server starting
-- 查看日志
digoal@db-172-16-3-33-> less postgresql-2013-02-01_121939.csv
2013-02-01 12:19:39.812 CST,,,20015,,510b425b.4e2f,1,,2013-02-01 12:19:39 CST,,0,LOG,00000,"database system was shut down in recovery at 2013-02-01 12:17:50 CST",,,,,,,,,""
2013-02-01 12:19:39.812 CST,,,20015,,510b425b.4e2f,2,,2013-02-01 12:19:39 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2013-02-01 12:19:39.840 CST,,,20015,,510b425b.4e2f,4,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"record with zero length at 0/12EAE578",,,,,,,,,""
2013-02-01 12:19:39.840 CST,,,20009,,510b425b.4e29,1,,2013-02-01 12:19:39 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2013-02-01 12:19:39.843 CST,,,20018,,510b425b.4e32,1,,2013-02-01 12:19:39 CST,,0,FATAL,XX000,"timeline 4 of the primary does not match recovery target timeline 3",,,,,,,,,""
创建时间线文件名取自日志中的 :
timeline 4 of the primary does not match recovery target timeline 3
所以文件名应该为 : 00000004.history
时间线文件的内容来自timeline 4 of the primary does not match recovery target timeline 3以及它的前一条输出record with zero length at 0/12EAE578.
1. parentTLI = 3
2. switchpoint 由3部分组成 :
snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, \
(uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
(uint32) ((logSegNo) % XLogSegmentsPerXLogId))
tli = 3
(uint32) ((logSegNo) / XLogSegmentsPerXLogId) = 0
(uint32) ((logSegNo) % XLogSegmentsPerXLogId) = 12
最后得出switchpoint = 000000030000000000000012
具体怎么算来的可参考 :
http://blog.163.com/digoal@126/blog/static/1638770402012914112949546/
3. reason 随便填一串字符
所以时间线文件的内容为 :
3 000000030000000000000012 recovery by digoal
在node A(standby)创建这个文件 :
digoal@db-172-16-3-33-> cd $PGDATA/pg_xlog
digoal@db-172-16-3-33-> vi 00000004.history
再次查看日志 :
2013-02-01 12:29:54.865 CST,,,20015,,510b425b.4e2f,5,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"new target timeline is 4",,,,,,,,,""
2013-02-01 12:29:54.867 CST,,,20435,,510b44c2.4fd3,1,,2013-02-01 12:29:54 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""
2013-02-01 12:29:55.345 CST,,,20015,,510b425b.4e2f,6,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"redo starts at 0/12EAE578",,,,,,,,,""
digoal@db-172-16-3-33->
现在正常的连接到主节点了.
数据正常 :
digoal@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> select count(*) from test ;
count
--------
300001
(1 row)
【参考】
2.
src/include/access/timeline.h
3. src/backend/access/transam/timeline.c