PostgreSQL recovery from timeline history file lost case-阿里云开发者社区

开发者社区> 数据库> 正文

PostgreSQL recovery from timeline history file lost case

简介:
群里的一位兄弟问到的一个问题 :
[广州]土豆饼-pg  10:34:08
流复制的问题
[广州]土豆饼-pg  10:34:20
我今天在主库导入了150多个数据库
[广州]土豆饼-pg  10:34:28
然后对主库做基线备份
[广州]土豆饼-pg  10:34:46
拷贝到备库启动时 流复制一直起不来
[广州]土豆饼-pg  10:34:51
报错
cp: 无法 stat “/opt/PostgreSQL/9.1/archive/0000000100000000000000FD”: 没有那个文件或目录
cp: 无法 stat “/opt/PostgreSQL/9.1/archive/00000002.history”: 没有那个文件或目录
[广州]土豆饼-pg  10:35:08
主备库都照了没有这个文件00000002.history
[广州]土豆饼-pg  10:35:20
请问我的流复制怎么起起来

从现象上来看是时间线历史文件丢失了, 具体是什么原因丢失的先不管. 
如果不想重做基线备份重做standby, 首先要保证standby需要的xlog信息在主库还没有被rotate掉.
下面假设xlog信息未rotate掉, 那么解决问题的关键就在如何自建时间线历史文件中的内容.
时间线历史文件的格式如下 :
 * Each line in the file represents a timeline switch:
 *
 * <parentTLI> <switchpoint> <reason>
 *
 *      parentTLI       ID of the parent timeline
 *      switchpoint     XLogRecPtr of the WAL position where the switch happened
 *      reason          human-readable explanation of why the timeline was changed
 *
 * The fields are separated by tabs. Lines beginning with # are comments, and
 * are ignored. Empty lines are also ignored.

1. parentTLI指从那个timeline切换而来, 如当前timeline=1如果发生切换的话那么新的timeline=2, 这个时间线文件的parentTLI=1.
2. switchpoint指发生切换时的xlog文件名.
3. 可以理解为comment.
以上三个分段的内容用TAB分隔.
空行和#开头的行代表注释. timeline.c略过处理这些行.

接下来模拟一下丢失时间线文件的恢复.
测试环境 : 
PostgreSQL 9.1.3

node A(primary) :
digoal@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> drop table test;
DROP TABLE
digoal=> create table test(id int,crt_time timestamp default clock_timestamp());
CREATE TABLE
digoal=> insert into test (id) select generate_series(1,1000000);
INSERT 0 1000000
digoal=> delete from test ;
DELETE 1000000
digoal=> insert into test values (1,now());
INSERT 0 1
digoal=> select * from test;
 id |          crt_time          
----+----------------------------
  1 | 2013-02-01 12:10:19.299484
(1 row)


node B(standby) : 
digoalstd@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> select * from test ;
 id |          crt_time          
----+----------------------------
  1 | 2013-02-01 12:10:19.299484
(1 row)


node A(primary) :

-- 关闭节点A
digoal@db-172-16-3-33-> pg_ctl stop -m fast
waiting for server to shut down..... done
server stopped


node B(primary) :
-- promote 节点B
digoalstd@db-172-16-3-33-> pg_ctl promote
server promoting
digoalstd@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> insert into test (id) select generate_series(1,100000);
INSERT 0 100000
digoal=> \c digoal postgres
digoal=# checkpoint;
CHECKPOINT
digoal=# select pg_switch_xlog();
 pg_switch_xlog 
----------------
 0/146A6CC8
(1 row)
digoal=# select pg_switch_xlog();
 pg_switch_xlog 
----------------
 0/15000000
(1 row)
-- 查看新生成的history文件, 并删除
digoalstd@db-172-16-3-33-> cat 00000004.history
2       000000020000000000000009        no recovery target specified

3       000000030000000000000012        no recovery target specified
digoalstd@db-172-16-3-33-> rm -f 00000004.history


node A(standby) :
-- 转换成standby节点, 配置recovery.conf略.
digoal@db-172-16-3-33-> pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
digoal@db-172-16-3-33-> cd $PGDATA
digoal@db-172-16-3-33-> mv recovery.done recovery.conf
-- 启动
digoal@db-172-16-3-33-> pg_ctl start
server starting
-- 查看日志
digoal@db-172-16-3-33-> less postgresql-2013-02-01_121939.csv
2013-02-01 12:19:39.812 CST,,,20015,,510b425b.4e2f,1,,2013-02-01 12:19:39 CST,,0,LOG,00000,"database system was shut down in recovery at 2013-02-01 12:17:50 CST",,,,,,,,,""
2013-02-01 12:19:39.812 CST,,,20015,,510b425b.4e2f,2,,2013-02-01 12:19:39 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2013-02-01 12:19:39.840 CST,,,20015,,510b425b.4e2f,4,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"record with zero length at 0/12EAE578",,,,,,,,,""
2013-02-01 12:19:39.840 CST,,,20009,,510b425b.4e29,1,,2013-02-01 12:19:39 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2013-02-01 12:19:39.843 CST,,,20018,,510b425b.4e32,1,,2013-02-01 12:19:39 CST,,0,FATAL,XX000,"timeline 4 of the primary does not match recovery target timeline 3",,,,,,,,,""
创建时间线文件名取自日志中的 : 
timeline 4 of the primary does not match recovery target timeline 3
所以文件名应该为 : 00000004.history
时间线文件的内容来自timeline 4 of the primary does not match recovery target timeline 3以及它的前一条输出record with zero length at 0/12EAE578.
1. parentTLI = 3
2. switchpoint 由3部分组成 : 
snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,       \
             (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
             (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
tli = 3
(uint32) ((logSegNo) / XLogSegmentsPerXLogId) = 0
(uint32) ((logSegNo) % XLogSegmentsPerXLogId) = 12
最后得出switchpoint = 000000030000000000000012
具体怎么算来的可参考 : 
http://blog.163.com/digoal@126/blog/static/1638770402012914112949546/
3. reason 随便填一串字符
所以时间线文件的内容为 : 
3       000000030000000000000012        recovery by digoal
在node A(standby)创建这个文件 : 
digoal@db-172-16-3-33-> cd $PGDATA/pg_xlog
digoal@db-172-16-3-33-> vi 00000004.history
再次查看日志 : 
2013-02-01 12:29:54.865 CST,,,20015,,510b425b.4e2f,5,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"new target timeline is 4",,,,,,,,,""
2013-02-01 12:29:54.867 CST,,,20435,,510b44c2.4fd3,1,,2013-02-01 12:29:54 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""
2013-02-01 12:29:55.345 CST,,,20015,,510b425b.4e2f,6,,2013-02-01 12:19:39 CST,1/0,0,LOG,00000,"redo starts at 0/12EAE578",,,,,,,,,""
digoal@db-172-16-3-33-> 
现在正常的连接到主节点了.
数据正常 : 
digoal@db-172-16-3-33-> psql
psql (9.1.3)
Type "help" for help.

digoal=> select count(*) from test ;
 count  
--------
 300001
(1 row)

【参考】
2. src/include/access/timeline.h
3. src/backend/access/transam/timeline.c

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

分享:
数据库
使用钉钉扫一扫加入圈子
+ 订阅

分享数据库前沿,解构实战干货,推动数据库技术变革

其他文章