PostgreSQL pg_resetwal / pg_resetxlog: forcing specific values (e.g. the system ID) - Alibaba Cloud Developer Community

2019-04-14 1943

Tags: PostgreSQL, pg_resetxlog, pg_resetwal, control file repair, pg_controldata, fixing recovery failures

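Background

When repairing the control file with pg_resetwal or pg_resetxlog, how can you force a specific system ID for the database instance?

What pg_resetxlog and pg_resetwal can do

1. Repair startup or recovery failures caused by corrupt XLOG.

2. Rebuild the pg_control file.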

/*-------------------------------------------------------------------------  
 *  
 * pg_resetxlog.c  
 *        A utility to "zero out" the xlog when it's corrupt beyond recovery.  
 *        Can also rebuild pg_control if needed.  
 *  
 * The theory of operation is fairly simple:  
 *        1. Read the existing pg_control (which will include the last  
 *               checkpoint record).  If it is an old format then update to  
 *               current format.  
 *        2. If pg_control is corrupt, attempt to intuit reasonable values,  
 *               by scanning the old xlog if necessary.  
 *        3. Modify pg_control to reflect a "shutdown" state with a checkpoint  
 *               record at the start of xlog.  
 *        4. Flush the existing xlog files and write a new segment with  
 *               just a checkpoint record in it.  The new segment is positioned  
 *               just past the end of the old xlog, so that existing LSNs in  
 *               data pages will appear to be "in the past".  
 * This is all pretty straightforward except for the intuition part of  
 * step 2 ...  
 *  
 *  
 * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group  
 * Portions Copyright (c) 1994, Regents of the University of California  
 *  
 * src/bin/pg_resetxlog/pg_resetxlog.c  
 *  
 *-------------------------------------------------------------------------  
 */

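How the system ID is generated

When pg_resetxlog or pg_resetwal cannot read a valid pg_control and has to rebuild the control file from guessed values, it generates a new system ID. The algorithm is as follows: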

src/bin/pg_resetxlog/pg_resetxlog.c

        /*  
         * Create a new unique installation identifier, since we can no longer use  
         * any old XLOG records.  See notes in xlog.c about the algorithm.  
         */  
        gettimeofday(&tv, NULL);  
        sysidentifier = ((uint64) tv.tv_sec) << 32;  
        sysidentifier |= (uint32) (tv.tv_sec | tv.tv_usec);  
  
        ControlFile.system_identifier = sysidentifier;
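
To make the bit layout concrete, here is a minimal C sketch of the same computation as a pure function, together with a helper that recovers the creation time from the upper 32 bits. The names `make_sysid` and `sysid_created_sec` are illustrative assumptions, not PostgreSQL APIs.

```c
#include <stdint.h>

/* Hypothetical helper (not a PostgreSQL API): build an identifier the same
 * way as the snippet above, from a seconds/microseconds pair. */
uint64_t make_sysid(uint64_t tv_sec, uint64_t tv_usec)
{
    uint64_t sysidentifier = tv_sec << 32;           /* seconds in the high 32 bits */

    sysidentifier |= (uint32_t) (tv_sec | tv_usec);  /* mixed bits in the low 32 bits */
    return sysidentifier;
}

/* Hypothetical helper: the high 32 bits hold the Unix time of creation. */
uint32_t sysid_created_sec(uint64_t sysidentifier)
{
    return (uint32_t) (sysidentifier >> 32);
}
```

For the identifier 6593269818598452546 that pg_controldata prints in this article, `sysid_created_sec` returns 1535115255, a Unix timestamp that falls in August 2018.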

The regenerated control file therefore has a different system ID from the original one.

pg_controldata |grep system  
  
Database system identifier:           6593269818598452546

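What is the database system ID used for?

1. With a physical standby over streaming replication, the system IDs of the upstream and downstream nodes are compared; if they differ, physical replication stops. (A genuine physical standby always matches, because it is a file-level copy of the primary.)

The streaming replication protocol exposes a command that returns the system ID: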

https://www.postgresql.org/docs/11/static/protocol-replication.html

IDENTIFY_SYSTEM  
Requests the server to identify itself. Server replies with a result set of a single row, containing four fields:  
  
systemid (text)  
The unique system identifier identifying the cluster. This can be used to check that the base backup used to initialize the standby came from the same cluster.  
  
timeline (int4)  
Current timeline ID. Also useful to check that the standby is consistent with the master.  
  
xlogpos (text)  
Current WAL flush location. Useful to get a known location in the write-ahead log where streaming can start.  
  
dbname (text)  
Database connected to or null.
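
The check a standby performs on the `systemid` field can be sketched as below. This is a simplified illustration rather than the walreceiver source: `same_cluster` is a hypothetical name, and the comparison is done on the decimal text form because the field arrives as text.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the systemid text reported by the
 * upstream (e.g. by IDENTIFY_SYSTEM) matches the local identifier, else 0. */
int same_cluster(const char *upstream_sysid_text, uint64_t local_sysid)
{
    char local_text[32];

    /* Render the local 64-bit identifier as decimal text. */
    snprintf(local_text, sizeof(local_text), "%" PRIu64, local_sysid);
    return strcmp(upstream_sysid_text, local_text) == 0;
}
```

A caller would stop replication when `same_cluster` returns 0, which is the behaviour described in point 1.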

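2. During recovery, an XLOG file whose system ID differs from that of the current database is not used either. The reason is simple: a database should only replay XLOG it generated itself, never XLOG produced by another cluster.

This prevents replaying the wrong XLOG.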

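How to force the control file's system ID to a fixed value

In some extreme cases you need a hack: rewind the control file's next XLOG position to an earlier XLOG and recover from there. How can that be done?

An example scenario:

1. A cluster went through a primary/standby switchover, and the old primary had generated more XLOG than the new primary (asynchronous streaming replication, HA).

2. The old primary became a standby (the "new standby"), and its state has stayed frozen at the switchover point. The new standby is on timeline 1 and the new primary is on timeline 2; the LSN at which the new primary left timeline 1 is lower than the new standby's current LSN on timeline 1. In short, the two have diverged.

3. Available backups: only a physical backup of the new standby, plus all archived WAL files from the new primary since the switchover.

4. A PITR job is needed that restores to some point in time after the switchover (and earlier than now).

This job cannot be completed by normal means: the backup is on timeline 1, and the backup's LSN is already past the minimum LSN of the existing archive files (timeline 1).

Using pg_rewind to repair the standby from the new primary only brings the standby to the same state as the current primary. It cannot restore to a point in the past.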

《PostgreSQL primary-standby failback tools : pg_rewind》

《PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby》

《PostgreSQL 9.5 add pg_rewind for Fast align for PostgreSQL unaligned primary & standby》

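So a hack is needed: use pg_resetwal to move the new standby's next WAL file to a position below the minimum LSN of the existing archive files (timeline 1).

As noted earlier, however, the control file produced by pg_resetwal gets a new system ID, and once the system ID changes, the original cluster's REDO files can no longer be used for recovery. pg_resetwal therefore has to be patched to keep the original cluster's system ID.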

        /*  
         * Create a new unique installation identifier, since we can no longer use  
         * any old XLOG records.  See notes in xlog.c about the algorithm.  
         */  
        gettimeofday(&tv, NULL);  
        sysidentifier = ((uint64) tv.tv_sec) << 32;  
        sysidentifier |= (uint32) (tv.tv_sec | tv.tv_usec);  
  
        // ControlFile.system_identifier = sysidentifier;  
        ControlFile.system_identifier = 6319303457022381234;  // use the original cluster's system ID
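
Hard-coding the constant means recompiling for every cluster. One alternative, sketched here on the assumption that you are patching the source anyway, is to parse the identifier from a string such as an environment variable. `parse_sysid` is a hypothetical helper and is not part of pg_resetwal.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal system identifier.
 * Returns 0 on success and stores the value in *out; returns -1 on bad input. */
int parse_sysid(const char *text, uint64_t *out)
{
    char *end;
    unsigned long long value;

    /* Require a leading digit: strtoull alone would accept spaces and signs. */
    if (text == NULL || !isdigit((unsigned char) text[0]))
        return -1;

    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return -1;              /* overflow or trailing garbage */

    *out = (uint64_t) value;
    return 0;
}
```

The value to feed in is the "Database system identifier" line that pg_controldata reports for the original cluster.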

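Steps

1. Generate the control file, setting the next WAL to a position below the minimum LSN of the existing archive files (timeline 1). Landing exactly on the boundary is ideal.

2. Start the standby so that it reads the current hot_standby configuration and writes it to the control file, then shut the standby down.

3. Configure recovery.conf, including the recovery target time.

4. Start the PITR recovery.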
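
Step 1 needs a WAL file name to pass to the `-l` option of pg_resetwal. Below is a minimal sketch of how a timeline and an LSN map to a segment file name, assuming the standard 24-hex-digit naming scheme and a segment size that evenly divides 4 GB (16 MB by default). `wal_file_name` is a hypothetical helper, not a PostgreSQL function.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write into buf the name of the WAL segment file that
 * contains `lsn` on timeline `tli`. seg_size is the segment size in bytes. */
void wal_file_name(char *buf, size_t buflen, uint32_t tli,
                   uint64_t lsn, uint64_t seg_size)
{
    uint64_t segno = lsn / seg_size;                          /* segment number */
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;  /* segments per 4 GB */

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_id),
             (unsigned int) (segno % segs_per_id));
}
```

With 16 MB segments, timeline 1 and LSN A/3C000028 yield the file name 000000010000000A0000003C.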

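References

《PostgreSQL 11 preview - changing the WAL segment size online with pg_resetwal》

《How to repair the PostgreSQL control file with pg_resetxlog》

《Handling control file differences caused by running pg_resetxlog from a different version》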

《Use pg_resetxlog simulate tuple disappear within PostgreSQL》

《Get txid from pg_controldata's output》

src/bin/pg_resetxlog/pg_resetxlog.c

man pg_resetwal

man pg_controldata
