Background
Three posts have already been published in the 穷鬼玩PolarDB RAC一写多读集群系列 series (running a PolarDB RAC one-write-multi-read cluster on the cheap):
This post walks through how to perform a switchover of PolarDB compute nodes. The lab environment depends on 《在Docker容器中用loop设备模拟共享存储》 (simulating shared storage with a loop device inside Docker containers); if you do not have the environment yet, please follow that article to set it up.
You will also need to refer to the following documentation:
DEMO
A Replica node may be promoted to Primary only after the Primary node has stopped writing; otherwise multiple nodes in the cluster would write to the shared storage at the same time. When the database detects multi-node writes, it will malfunction.
1. Stop the p1 primary
# enter the pb1 container
docker exec -ti pb1 bash

# stop the primary
pg_ctl stop -m fast -D ~/primary

The log output is as follows:

waiting for server to shut down.... done
server stopped
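Before moving on, it is worth double-checking that the old primary is really down. This check is my addition, not part of the original steps; pg_ctl status exits non-zero when no server is running in the data directory:

# on pb1: confirm no postmaster is running in this data directory
pg_ctl status -D ~/primary || echo "primary is down, safe to promote"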
2. Promote the p2 replica
Important: before promoting, double-check that the RW node has truly stopped. The promote command has no safeguard (during activation neither PolarDB nor PFS holds any distributed exclusive lock), so it activates the node unconditionally. Multiple RW nodes may result, and the data may be corrupted.
# enter the pb2 container
docker exec -ti pb2 bash

# promote the replica
pg_ctl promote -D ~/replica1
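Given the warning above, a small guard can make the promote slightly safer. This is a sketch of my own, not a real distributed lock (PolarDB provides none here); it assumes the old primary pb1 is reachable at 172.17.0.2, the address seen elsewhere in this post:

# on pb2: refuse to promote while the old primary (pb1, 172.17.0.2) still accepts connections
if pg_isready -h 172.17.0.2 -p 5432 -t 3; then
    echo "old primary still responding, aborting promote" >&2
else
    pg_ctl promote -D ~/replica1
fi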
The following error appears:
[PFSD_SDK INF Dec 18 14:33:38.327335][859]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:38.330154][859]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/859.pid
[PFSD_SDK INF Dec 18 14:33:38.330163][859]pfsd_umount_force 261: umount success for nvme1n1
waiting for server to promote...[PFSD_SDK INF Dec 18 14:33:38.444794][859]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:38.446306][859]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/859.pid
[PFSD_SDK INF Dec 18 14:33:38.446314][859]pfsd_umount_force 261: umount success for nvme1n1
.[PFSD_SDK INF Dec 18 14:33:38.690673][859]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:38.693022][859]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/859.pid
[PFSD_SDK INF Dec 18 14:33:38.693035][859]pfsd_umount_force 261: umount success for nvme1n1
[PFSD_SDK INF Dec 18 14:33:38.917831][859]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:38.919652][859]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/859.pid
[PFSD_SDK INF Dec 18 14:33:38.919664][859]pfsd_umount_force 261: umount success for nvme1n1
[PFSD_SDK INF Dec 18 14:33:39.364163][859]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:39.390100][859]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/859.pid
[PFSD_SDK INF Dec 18 14:33:39.390109][859]pfsd_umount_force 261: umount success for nvme1n1
 stopped waiting
pg_ctl: server did not promote in time
The database log is as follows:
$PGDATA/pg_log/postgresql-2024-12-18_142905_error.log

849 2024-12-18 14:33:38.330 CST LOG:  received promote request
849 2024-12-18 14:33:38.330 CST LOG:  async lock replay: startup@polar_alr_terminate_worker: terminate worker
849 2024-12-18 14:33:38.333 CST LOG:  redo done at 0/484CAF88 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 263.78 s
849 2024-12-18 14:33:38.333 CST LOG:  last completed transaction was at log time 2024-12-18 14:29:14.656403+08
849 2024-12-18 14:33:38.333 CST WARNING:  polar_log_read_record read records from 'currRecPtr: 0/484C5168, read record: (0/484C5168, 0/484C51E0), decoded record: (0/484C5168, 0/484C51E0, 0/0)' to 'currRecPtr: 0/484CAF88, read record: (0/484CAF88, 0/484CB000), decoded record: (0/484CAF88, 0/484CB000, 0/0)', next record is 'currRecPtr: 0/484CAF88, read record: (0/484CAF88, 0/484CB000), decoded record: (0/484CAF88, 0/484CB000, 0/0)'
849 2024-12-18 14:33:38.380 CST WARNING:  replica can't modify file in shared storage by pfs: /nvme1n1/shared_data//base/5
849 2024-12-18 14:33:38.424 CST WARNING:  replica can't modify file in shared storage by pfs: /nvme1n1/shared_data//base/1
849 2024-12-18 14:33:38.465 CST WARNING:  replica can't modify file in shared storage by pfs: /nvme1n1/shared_data//base/4
849 2024-12-18 14:33:38.509 CST WARNING:  replica can't modify file in shared storage by pfs: /nvme1n1/shared_data//base/13572
849 2024-12-18 14:33:38.509 CST LOG:  selected new timeline ID: 2
849 2024-12-18 14:33:38.511 CST LOG:  vfs_unlink /nvme1n1/shared_data//pg_wal/xlogtemp.849
Got coredump signal(11) in process(849)
Process Info:
pid: 849 type: startup sql: <NULL>
Backtrace:
#0  polar_program_error_handler+0x388 [0xaaaad54f72a4]
#1  +0x388 [0xffffabc2d790]
#2  strerrordesc_np+0x1e30 [0xffffaac19950]
#3  dopr+0x164 [0xaaaad55824b0]
#4  pg_vsnprintf+0xa0 [0xaaaad5581c68]
#5  pvsnprintf+0x44 [0xaaaad5574e54]
#6  appendStringInfoVA+0xb4 [0xaaaad5576cb8]
#7  errmsg+0x1c4 [0xaaaad54ef674]
#8  polar_vfs_file_handle_node_type+0x158 [0xffffa77de3cc]
#9  polar_stat_file_before_hook+0x44 [0xffff933b5ef8]
#10 vfs_unlink+0xf0 [0xffffa77daa00]
#11 polar_unlink+0x4c [0xaaaad4e017bc]
#12 XLogFileCopy+0x130 [0xaaaad4e06f38]
#13 XLogInitNewTimeline+0xe4 [0xaaaad4e0acf4]
#14 StartupXLOG+0xcd0 [0xaaaad4e0bf68]
#15 StartupProcessMain+0xcc [0xaaaad51f49b4]
#16 AuxiliaryProcessMain+0x284 [0xaaaad51e2520]
#17 StartChildProcess+0x60 [0xaaaad51f27c0]
#18 PostmasterMain+0x1564 [0xaaaad51eb5bc]
#19 main+0x380 [0xaaaad50dde60]
#20 __libc_init_first+0x7c [0xffffaaba73fc]
#21 __libc_start_main+0x98 [0xffffaaba74cc]
#22 _start+0x30 [0xaaaad4cc1f70]
851 2024-12-18 14:33:38.628 CST WARNING:  Failed to get the instance memory usage
836 2024-12-18 14:33:39.049 CST LOG:  startup process (PID 849) was terminated by signal 11: Segmentation fault
836 2024-12-18 14:33:39.049 CST LOG:  terminating any other active server processes
836 2024-12-18 14:33:39.051 CST LOG:  shutting down due to startup process failure
[PFSD_SDK INF Dec 18 14:33:39.051684][836]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 18 14:33:39.072031][836]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/836.pid
[PFSD_SDK INF Dec 18 14:33:39.072049][836]pfsd_umount_force 261: umount success for nvme1n1
836 2024-12-18 14:33:39.072 CST LOG:  umount pbd nvme1n1 success
836 2024-12-18 14:33:39.103 CST LOG:  database system is shut down
The backtrace from the core file is as follows:
$ gdb --exec=/home/postgres/tmp_polardb_pg_15_base/bin/postgres --core=/home/postgres/replica1/core
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
warning: Can't open file /dev/shm/PostgreSQL.3630676106 during file-backed mapping note processing
warning: Can't open file /SYSV000c289a (deleted) during file-backed mapping note processing
[New LWP 849]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres(5432): startup '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (threadid=281473559998496, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=281473559998496, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000ffffaabff254 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000ffffaabba67c in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x0000aaaad54f74f0 in ?? ()
#4  0x0000000b00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Running sudo dmesg in the pb2 container:
Nothing at the OS level pointed to an OOM kill or any other error. I also tried stopping the other containers (to lower memory usage), but the problem persisted.
The procedure was stuck at this point, so I asked around: was there any way to resolve this?
With help from the PolarDB developers, the problem was found and fixed: logindex was not enabled, so the RO node could not be promoted. The manual does not mention this; hopefully the official documentation will be amended. Enabling logindex is just a matter of setting a few parameters, which have been added to 《在Docker容器中用loop设备模拟共享存储》. If you follow that article, you should not run into my problem.
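For the record, these are the logindex-related GUCs involved, the same ones that appear in the postgresql.conf template in section 3.5 below. The 64MB sizes are simply what this small container setup uses; add them to each node's postgresql.conf and restart:

# RO promote depends on logindex; sizes are this lab's values, tune to your memory budget
polar_logindex_mem_size=64MB
polar_xlog_queue_buffers=64MB
polar_xlog_page_buffers=64MB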
A successful promote looks like this:
$ pg_ctl promote -D ~/replica1
[PFSD_SDK INF Dec 24 09:56:35.922139][747]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 24 09:56:35.923984][747]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/747.pid
[PFSD_SDK INF Dec 24 09:56:35.924002][747]pfsd_umount_force 261: umount success for nvme1n1
waiting for server to promote...[PFSD_SDK INF Dec 24 09:56:36.036961][747]pfsd_umount_force 247: pbdname nvme1n1
[PFSD_SDK INF Dec 24 09:56:36.038059][747]chnl_connection_release_shm 1164: client umount return : deleted /var/run/pfsd//nvme1n1/747.pid
[PFSD_SDK INF Dec 24 09:56:36.038070][747]pfsd_umount_force 261: umount success for nvme1n1
 done
server promoted

postgres@895e6aa6ffd4:~$ psql
psql (PostgreSQL 15.10 (PolarDB 15.10.2.0 build d4f5477d debug) on aarch64-linux-gnu)
Type "help" for help.

The replication slots were also carried over correctly:

postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 replica1  |        | physical  |        |          | f         | f      |            |      |              | 0/40940E40  |                     | reserved   |               | f
 standby1  |        | physical  |        |          | f         | f      |            |      |              | 0/40942F38  |                     | reserved   |               | f
(2 rows)
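As an extra sanity check (my addition, not part of the original run), you can also confirm the promoted node has left recovery and is writable:

# on pb2 after promote: expect 'f', i.e. no longer in recovery
psql -c 'select pg_is_in_recovery();'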
3. "Reconfigure the old p1 primary as a replica" OR "rebuild a replica from scratch"
Reference: 《穷鬼玩PolarDB RAC一写多读集群系列 | 在Docker容器中用loop设备模拟共享存储》
Assuming a brand-new compute node (container), follow the article above and perform the steps below in order (a condensed sketch of steps 4 and 5 follows the list):
- 1. Deploy the PolarDB binaries. See step 9 of that article (copy the compiled binaries into the HOME directories of pb1 and pb2 for easy invocation).
- 2. Deploy the PFS binaries. The polardb_pg_devel:ubuntu22.04 image already ships with PFS, so there is nothing to install.
- 3. Attach the shared block device. Steps 5.1 and 5.2 of that article can be skipped.
- 4. Symlink the shared block device to nvme1n1. See step 5.3 (symlink the loop device to nvme1n1).
- 5. Start pfsd. See step 7 (start pfsd on both pb1 and pb2; -w is the number of worker processes, and more workers means better IO, but it is set to 1 here because of container resource limits).
- 6. Deploy a new replica node. See step 15 (replica node deployment) and the steps that follow it.
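For orientation, here is a heavily condensed sketch of steps 4 and 5 above. The /dev/loop0 device name and the start_pfsd.sh path are assumptions carried over from the referenced article and the polardb_pg_devel image, so check them against your environment:

# step 4: symlink the shared loop device to nvme1n1 (loop device name assumed)
sudo ln -s /dev/loop0 /dev/nvme1n1
# step 5: start pfsd with one IO worker (script path assumed from the devel image)
sudo /usr/local/polarstore/pfsd/bin/start_pfsd.sh -p nvme1n1 -w 1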
This post continues using the pb1 container for the demo: steps 1 through 5 are already done inside pb1, so only step 6 (deploying a new replica node) is needed.
3.1 Clean up pb1's old local data directory
# enter the pb1 container
docker exec -ti pb1 bash

rm -rf $HOME/primary
3.2 Use the polar-initdb.sh script to initialize the Replica node's local directory from the data directory on shared storage
Because I want pb1 to serve as the primary again in later experiments, the directory keeps the name primary.
mkdir -m 0700 $HOME/primary
sudo /home/postgres/tmp_polardb_pg_15_base/bin/polar-initdb.sh $HOME/primary/ /nvme1n1/shared_data/ replica
3.3 Fix the permissions and owner of the replica data directory, its subdirectories, and its files
sudo chmod -R 700 $HOME/primary
sudo chown -R postgres:postgres $HOME/primary
3.4 Generate the initial configuration files
initdb -D /tmp/primary
cp /tmp/primary/*.conf $HOME/primary/
rm -rf /tmp/primary
3.5 Edit the pb1 replica node's configuration files
Configure postgresql.conf:
echo " port=5432 polar_hostid=1 # 注意 polar_enable_shared_storage_mode=on polar_disk_name='nvme1n1' polar_datadir='/nvme1n1/shared_data/' polar_vfs.localfs_mode=off shared_preload_libraries='\$libdir/polar_vfs,\$libdir/polar_worker' polar_storage_cluster_name='disk' logging_collector=on log_line_prefix='%p\t%r\t%u\t%m\t' log_directory='pg_log' listen_addresses='0.0.0.0' max_connections=200 # 下面几个参数解决replica不能promote的问题, 因为RO依赖logindex. polar_logindex_mem_size=64MB polar_xlog_queue_buffers=64MB polar_xlog_page_buffers=64MB # 使用pfs时可以关掉 full page write 和 polar_has_partial_write , 否则请打开这两 full_page_writes = off polar_has_partial_write = off polar_resource_manager.enable_resource_manager=off # replication primary_slot_name='replica1' primary_conninfo='host=172.17.0.3 port=5432 user=postgres dbname=postgres application_name=replica1' # 注意 " >> ~/primary/postgresql.conf
Configure pg_hba.conf:
echo " host all all 0.0.0.0/0 md5 host replication postgres 172.17.0.0/16 trust " >> ~/primary/pg_hba.conf
3.6 Mark the node to start in Replica mode:
touch $HOME/primary/replica.signal
3.7 Start the Replica node:
pg_ctl start -D $HOME/primary
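As a quick sanity check (my addition), confirm the node came up in recovery, i.e. as a read-only replica:

# on pb1: expect 't' while running as a replica
psql -c 'select pg_is_in_recovery();'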
3.8 On pb2, verify that the replica's replication slot is connected
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 844
usesysid         | 10
usename          | postgres
application_name | replica1
client_addr      | 172.17.0.2
client_hostname  |
client_port      | 57928
backend_start    | 2024-12-24 10:06:13.679898+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/40943100
write_lsn        | 0/40943100
flush_lsn        | 0/40943100
replay_lsn       | 0/40943100
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2024-12-24 10:06:43.729921+08
At this point the switchover is complete.
4. Switchover again
Because I want pb1 to be the primary again for later experiments, I perform one more role change below: stop pb2, promote pb1, and rebuild pb2 as a replica.
4.1 Stop the pb2 RW node
pg_ctl stop -m fast -D ~/replica1
4.2 Promote pb1 to the RW node
pg_ctl promote -D ~/primary
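Optionally (my addition), verify the promote took effect by checking the control file's timeline, which should have advanced past the one selected during the first promote; pg_control_checkpoint() is a stock PostgreSQL function:

# on pb1 after promote: the timeline id should have advanced
psql -c 'select timeline_id from pg_control_checkpoint();'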
4.3 Rebuild pb2 as a replica
Omitted here. See step 3 of this post, as well as the "15、replica节点部署" (replica node deployment) section of the article below.
Since pb2 was previously an RO node and everything else is already configured, switching pb2 back to RO is actually simple and takes only the two steps shown below. In production engineering, however, modules tend to be reused, so the uniform add-a-new-replica procedure is usually followed instead.
touch $HOME/replica1/replica.signal
pg_ctl start -D ~/replica1
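Finally (my addition), on pb1, the new primary, confirm that the replica1 slot is streaming again:

# on pb1: expect application_name=replica1, state=streaming
psql -c 'select application_name, state from pg_stat_replication;'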
References
《穷鬼玩PolarDB RAC一写多读集群系列 | 在Docker容器中用loop设备模拟共享存储》
《穷鬼玩PolarDB RAC一写多读集群系列 | 如何搭建PolarDB容灾(standby)节点》
《穷鬼玩PolarDB RAC一写多读集群系列 | 共享存储在线扩容》
https://apsaradb.github.io/PolarDB-for-PostgreSQL/zh/operation/ro-online-promote.html