1. Overview
There are many HA solutions for PostgreSQL, each with its own pros and cons; see the PostgreSQL official manual for details. This article describes how to build an HA cluster based on shared storage.
http://www.postgresql.org/docs/9.3/static/different-replication-solutions.html
2. Overall Design
The test environment consists of 3 VMware virtual machines. One of them serves as the shared storage server and also brings the cluster up to 3 nodes to satisfy the quorum vote count.
external/ssh is used as the fence device for testing; a physical fence device must be used in production.
Shared storage server
OS: CentOS release 6.3 (Final)
Hostname: disknode
NIC 1:
Used for cluster-internal messaging
NIC type: NAT
IP: 192.168.152.120
NIC 2:
Used for iSCSI traffic to the shared disk
NIC type: Host-Only
IP: 192.168.146.120
NIC 3:
Used for external/ssh fence device communication
NIC type: Bridged
IP: 10.167.217.107
HA node 1
OS: CentOS release 6.5 (Final)
Hostname: hanode1
NIC 1:
Used for the cluster virtual IP (192.168.152.200) and cluster-internal messaging
NIC type: NAT
IP: 192.168.152.130
NIC 2:
Used for iSCSI traffic to the shared disk
NIC type: Host-Only
IP: 192.168.146.130
NIC 3:
Used for external/ssh fence device communication
NIC type: Bridged
IP: 10.167.217.169
HA node 2
OS: CentOS release 6.5 (Final)
Hostname: hanode2
NIC 1:
Used for the cluster virtual IP (192.168.152.200) and cluster-internal messaging
NIC type: NAT
IP: 192.168.152.140
NIC 2:
Used for iSCSI traffic to the shared disk
NIC type: Host-Only
IP: 192.168.146.140
NIC 3:
Used for external/ssh fence device communication
NIC type: Bridged
IP: 10.167.217.171
Cluster virtual IP
192.168.152.200
3. Network Configuration
Each virtual machine is prepared with 3 NICs that can be placed on different subnets; here three different types of virtual NICs are used: NAT, Host-Only, and Bridged.
NIC 1 (NAT) carries the cluster virtual IP and the cluster-internal heartbeat messaging.
NIC 2 (Host-Only) is used for iSCSI traffic to the shared disk.
NIC 3 (Bridged) is used for external/ssh fence device communication.
Following [2. Overall Design], set the hostname and IP addresses on each of the 3 hosts. For example, on HA node 1:
[root@hanode1 ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hanode1
GATEWAY=192.168.152.2
[root@hanode1 ~]# source /etc/sysconfig/network
[root@hanode1 ~]# ifconfig eth0 192.168.152.130 netmask 255.255.255.0
[root@hanode1 ~]# ifconfig eth1 192.168.146.130 netmask 255.255.255.0
[root@hanode1 ~]# ifconfig eth2 10.167.217.169 netmask 255.255.255.0
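Note that sourcing /etc/sysconfig/network only loads the variables into the current shell; it does not change the running hostname. To apply the new hostname immediately without a reboot, something like the following can be run as well (an extra step I am assuming, not shown in the original):
[root@hanode1 ~]# hostname hanode1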
Write the IP address of each of the 3 NICs into the corresponding configuration file, e.g. for eth0:
[root@hanode1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
IPADDR=192.168.152.130
NETMASK=255.255.255.0
NETWORK=192.168.152.0
BROADCAST=192.168.152.255
ONBOOT=yes
USERCTL=no
Record the NIC 3 IP address of each host in /etc/hosts on all 3 hosts; external/ssh fencing uses these IPs.
$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.167.217.169 hanode1 hanode1.localdomain
10.167.217.171 hanode2 hanode2.localdomain
10.167.217.107 disknode disknode.localdomain
Set up SSH mutual trust
Run the following commands on hanode1:
[root@hanode1 ~]# cd ~
[root@hanode1 ~]# ssh-keygen -t dsa -f ~/.ssh/id_dsa -N ""
[root@hanode1 ~]# cp .ssh/id_dsa.pub .ssh/authorized_keys
[root@hanode1 ~]# scp -r .ssh hanode2:
[root@hanode1 ~]# scp -r .ssh disknode:
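Since external/ssh fencing depends on passwordless root SSH between the nodes, it is worth verifying the trust from each node before going further; for example (a quick check of my own, not part of the original steps):
[root@hanode1 ~]# ssh hanode2 hostname
[root@hanode1 ~]# ssh disknode hostname
[root@hanode2 ~]# ssh hanode1 hostname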
4. Cluster Software Selection and Installation
After comparison, Pacemaker + corosync 1.x was chosen from the following candidates suitable for a database cluster:
RHCS
heartbeat 1.x
Pacemaker + heartbeat 2.x/3.x
Pacemaker + cman
Pacemaker + corosync 1.x
Pacemaker + corosync 2.x
Reasons:
RHCS is reportedly rather complex, so it was set aside for now.
heartbeat is in maintenance mode with no further development and is no longer supported by the mainstream Linux distributions, so it was ruled out.
According to the Pacemaker project site, Pacemaker + corosync 2.x is the direction going forward, so corosync was the choice. However, no yum package for corosync 2.x could be found, and building it from source seemed like too much trouble, so corosync 1.x is used for now.
http://clusterlabs.org/quickstart.html
----------
Eventually everyone will move to Corosync 2 which removes support for CMAN and custom plugins BUT natively includes the APIs Pacemaker needs for quorum and membership. In this case, users would configure corosync.conf and use the Pacemaker init-script to start up after Corosync.
----------
Installing pacemaker and corosync
The resource-agents and related packages in the stock CentOS repositories are too old, so pacemaker and corosync are installed from the openSUSE repository:
wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/network:ha-clustering:Stable.repo
cp network:ha-clustering:Stable.repo /etc/yum.repos.d/
yum install pacemaker-1.1.9 corosync crmsh
However, this build of Pacemaker was compiled without support for the fence devices from the heartbeat project (including external/ssh), so Pacemaker has to be rebuilt.
[root@hanode1 ~]# pacemakerd -F
Pacemaker 1.1.9-2.7 (Build: 2db99f1)
Supporting v3.0.7: generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin
Download and extract the Pacemaker 1.1.9 source code:
[root@hanode1 ~]# wget https://github.com/ClusterLabs/pacemaker/archive/Pacemaker-1.1.9.tar.gz
[root@hanode1 ~]# tar xfz Pacemaker-1.1.9.tar.gz
Install the following dependency packages with yum, especially the last two (a combined install command is shown after the list):
automake
autoconf
libtool-ltdl-devel
pkgconfig
python
glib2-devel
libxml2-devel
libxslt-devel
python-devel
gcc-c++
bzip2-devel
gnutls-devel
pam-devel
libqb-devel
corosynclib-devel
cluster-glue-libs-devel
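For convenience, the whole list can be installed with a single yum command (equivalent to installing the packages above one by one):
yum install automake autoconf libtool-ltdl-devel pkgconfig python glib2-devel \
    libxml2-devel libxslt-devel python-devel gcc-c++ bzip2-devel gnutls-devel \
    pam-devel libqb-devel corosynclib-devel cluster-glue-libs-devel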
Build and install Pacemaker:
[root@hanode1 ~]# cd pacemaker-Pacemaker-1.1.9/
[root@hanode1 ~]# ./autogen.sh
[root@hanode1 ~]# ./configure
[root@hanode1 ~]# make
[root@hanode1 ~]# sudo make install
If the build reports missing packages, install them and run the build again.
If you hit the errors below during the build, comment out the two offending lines (these two services are not used later):
upstart.c:264: error: implicit declaration of function 'g_variant_lookup_value'
upstart.c:264: error: nested extern declaration of 'g_variant_lookup_value'
upstart.c:264: error: assignment makes pointer from integer without a cast
...
systemd.c:227: error: implicit declaration of function 'g_variant_lookup_value'
systemd.c:227: error: nested extern declaration of 'g_variant_lookup_value'
systemd.c:227: error: assignment makes pointer from integer without a cast
Checking again with pacemakerd -F shows that lha-fencing is now included:
[root@hanode1 ~]# pacemakerd -F
Pacemaker 1.1.9 (Build: 2a917dd)
Supporting: libqb-logging libqb-ipc lha-fencing upstart systemd nagios heartbeat corosync-plugin cman snmp
5. Corosync Configuration
All 3 hosts use the same configuration. bindnetaddr is set to the network address of NIC 1.
[root@hanode1 ~]# cat /etc/corosync/corosync.conf
- # Please read the corosync.conf.5 manual page
- compatibility: whitetank
- totem {
- version: 2
- secauth: off
- threads: 0
- interface {
- ringnumber: 0
- bindnetaddr: 192.168.152.0
- mcastaddr: 226.94.1.1
- mcastport: 5405
- ttl: 1
- }
- }
- logging {
- fileline: off
- to_stderr: no
- to_logfile: yes
- to_syslog: yes
- logfile: /var/log/cluster/corosync.log
- debug: off
- timestamp: on
- logger_subsys {
- subsys: AMF
- debug: off
- }
- }
- amf {
- mode: disabled
- }
- service {
- ver: 0
- name: pacemaker
- clustername: mycluster
- }
- aisexec {
- user: root
- group: root
- }
[root@hanode1 ~]# /etc/init.d/corosync start
[root@hanode1 ~]# chkconfig corosync on
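Once corosync is running on all 3 nodes, the ring status and node membership can be checked with, for example, the following commands (a verification step added here, not in the original):
[root@hanode1 ~]# corosync-cfgtool -s
[root@hanode1 ~]# crm_mon -1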
6. Shared Storage Installation and Configuration
6.1 Configure the iSCSI target server on disknode
Install the iSCSI target:
[root@disknode ~]# yum -y install scsi-target-utils
Append a target to targets.conf, exporting /dev/sdb as the shared disk.
[root@disknode ~]# vi /etc/tgt/targets.conf
- <target iqn.2008-09.com.example:server.target1>
-     backing-store /dev/sdb
-     write-cache off
- </target>
Start tgtd:
[root@disknode ~]# service tgtd start
[root@disknode ~]# chkconfig tgtd on
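To confirm that the target actually exports /dev/sdb, the target configuration can be listed on disknode (an optional check, not in the original):
[root@disknode ~]# tgt-admin --show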
6.2 Configure the iSCSI clients on hanode1 and hanode2
Install the iSCSI initiator (hanode1 shown as an example; hanode2 is the same):
[root@hanode1 ~]# yum install iscsi-initiator-utils-*.rpm
Start iscsid:
[root@hanode1 ~]# service iscsid start
[root@hanode1 ~]# chkconfig iscsid on
Discover the target and log in to the session:
[root@hanode1 ~]# iscsiadm -m discovery -t sendtargets -p 192.168.146.120
[root@hanode1 ~]# iscsiadm -m node -T iqn.2008-09.com.example:server.target1 -p 192.168.146.120 -l
The iSCSI shared disk shows up on the clients as /dev/sdc.
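The original text does not show creating a filesystem on the shared disk. Since the cluster configuration later mounts it as ext4, a blank /dev/sdc presumably needs to be formatted once, from one node only, before the first mount (an assumed step, not in the original):
[root@hanode1 ~]# mkfs.ext4 /dev/sdc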
7. PostgreSQL Installation and Configuration
Install PostgreSQL on hanode1 and hanode2, and make sure the postgres account has the same uid on both machines.
[root@hanode1 ~]# yum install http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm
[root@hanode1 ~]# yum install postgresql93-server postgresql93-contrib
Initialize the database on hanode1:
[root@hanode1 ~]# mkdir -p /mnt/pg
[root@hanode1 ~]# chown postgres:postgres /mnt/pg
[root@hanode1 ~]# mount /dev/sdc /mnt/pg
[root@hanode1 ~]# mv /var/lib/pgsql /var/lib/pgsql_org
[root@hanode1 ~]# ln -sf /mnt/pg/pgsql /var/lib/pgsql
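# Note: the original does not show creating /mnt/pg/pgsql on the shared disk. Without it the
# symlink above points to nothing and the mkdir commands below would fail, so a step like the
# following is presumably needed here (my assumption):
[root@hanode1 ~]# mkdir -p /mnt/pg/pgsql
[root@hanode1 ~]# chown postgres:postgres /mnt/pg/pgsql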
[root@hanode1 ~]# mkdir -p /var/lib/pgsql/9.3/backups
[root@hanode1 ~]# mkdir -p /var/lib/pgsql/9.3/data
[root@hanode1 ~]# chown postgres:postgres -R /var/lib/pgsql/9.3
[root@hanode1 ~]# service postgresql-9.3 initdb
Edit the database configuration:
[root@hanode1 ~]# vi /var/lib/pgsql/9.3/data/postgresql.conf
- listen_addresses = '*'
- port = 5432
[root@hanode1 ~]# vi /var/lib/pgsql/9.3/data/pg_hba.conf
- local all all trust
- host all all all trust
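Before handing control of PostgreSQL to Pacemaker, it is sensible to check that it starts and stops cleanly by hand and is not enabled as a regular system service (Pacemaker will manage it). A possible check, assuming the paths above (my addition, not in the original):
[root@hanode1 ~]# service postgresql-9.3 start
[root@hanode1 ~]# psql -h 127.0.0.1 -U postgres -c 'select 1'
[root@hanode1 ~]# service postgresql-9.3 stop
[root@hanode1 ~]# chkconfig postgresql-9.3 off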
Create the data directory symlink on hanode2:
[root@hanode2 ~]# mkdir -p /mnt/pg
[root@hanode2 ~]# chown postgres:postgres /mnt/pg
[root@hanode2 ~]# mv /var/lib/pgsql /var/lib/pgsql_org
[root@hanode2 ~]# ln -sf /mnt/pg/pgsql /var/lib/pgsql
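As an optional pre-check (not in the original), and only while the shared disk is not mounted on hanode1, you can confirm that hanode2 can mount the disk and see the data directory through the symlink, then unmount it again:
[root@hanode2 ~]# mount /dev/sdc /mnt/pg
[root@hanode2 ~]# ls /var/lib/pgsql/9.3/data
[root@hanode2 ~]# umount /mnt/pg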
8. Pacemaker Configuration
[root@hanode1 ~]# crm configure edit
node disknode
node hanode1
node hanode2
primitive ClusterIP IPaddr2 \
params ip=192.168.152.200 cidr_netmask=32 \
op monitor interval=30s
primitive DataFS Filesystem \
params device="/dev/sdc" directory="/mnt/pg" fstype=ext4 \
op monitor interval=15s
primitive pg93 pgsql \
meta target-role=Started is-managed=true \
op monitor interval=15s
primitive rs_iscsid lsb:iscsid \
op monitor interval=30s \
meta target-role=Started
primitive st-ssh stonith:external/ssh \
params hostlist="hanode1,hanode2"
group PgGroup ClusterIP rs_iscsid DataFS pg93
clone st-sshclone st-ssh
location no_iscsid rs_iscsid -inf: disknode
location votenode ClusterIP -inf: disknode
property cib-bootstrap-options: \
dc-version=1.1.9-2a917dd \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=3 \
stonith-enabled=true \
no-quorum-policy=stop \
last-lrm-refresh=1402882416
#vim:set syntax=pcmk
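After saving the configuration it can be sanity-checked before relying on it (an extra verification step, not in the original):
[root@hanode1 ~]# crm_verify -L -V
[root@hanode1 ~]# crm configure show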
Check the cluster status (hanode2 is currently the active node):
[root@hanode1 ~]# crm status
Last updated: Wed Sep 3 02:28:02 2014
Last change: Wed Sep 3 01:18:27 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: disknode - partition with quorum
Version: 1.1.9-2a917dd
3 Nodes configured, 3 expected votes
7 Resources configured.
Online: [ disknode hanode1 hanode2 ]
Resource Group: PgGroup
ClusterIP (ocf::heartbeat:IPaddr2): Started hanode2
rs_iscsid (lsb:iscsid): Started hanode2
DataFS (ocf::heartbeat:Filesystem): Started hanode2
pg93 (ocf::heartbeat:pgsql): Started hanode2
Clone Set: st-sshclone [st-ssh]
Started: [ disknode hanode1 hanode2 ]
To make sure the state is clean, corosync can be restarted on all 3 nodes:
/etc/init.d/corosync restart
The PostgreSQL service on the active server hanode2 can be accessed through the cluster virtual IP:
[root@disknode ~]# psql -h 192.168.152.200 -U postgres
psql (9.3.4)
Type "help" for help.
9. Validation Tests
1) Reset the failcount
Before each failover test, reset the failcount left by the previous test; otherwise Pacemaker will exclude that node when allocating resources. A lazier alternative is to set the meta attribute "failure-timeout=60s" on pg93 so that the failcount expires automatically after 60 seconds (see the sketch after the commands below).
[root@hanode1 ~]# crm resource failcount pg93 set hanode1 0
[root@hanode1 ~]# crm_failcount -D -N hanode1 -r pg93
[root@hanode1 ~]# crm resource cleanup pg93
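A sketch of the failure-timeout shortcut mentioned above, assuming the crmsh syntax of the version installed here:
[root@hanode1 ~]# crm resource meta pg93 set failure-timeout 60s
[root@hanode1 ~]# crm resource meta pg93 show failure-timeout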
2) Forcibly kill the PostgreSQL process
After the PostgreSQL processes on the active server hanode2 are killed, Pacemaker automatically restarts PostgreSQL on hanode2 a short while later.
[root@hanode2 ~]# ps -ef|grep postgres
root 8910 8795 0 02:38 pts/1 00:00:00 su - postgres
postgres 8911 8910 0 02:38 pts/1 00:00:00 -bash
postgres 14555 1 0 02:56 ? 00:00:00 /usr/pgsql-9.3/bin/postgres -D /var/lib/pgsql/data -c config_file=/var/lib/pgsql/data/postgresql.conf
postgres 14597 14555 0 02:56 ? 00:00:00 postgres: logger process
postgres 14599 14555 0 02:56 ? 00:00:00 postgres: checkpointer process
postgres 14600 14555 0 02:56 ? 00:00:00 postgres: writer process
postgres 14601 14555 0 02:56 ? 00:00:00 postgres: wal writer process
postgres 14602 14555 0 02:56 ? 00:00:00 postgres: autovacuum launcher process
postgres 14603 14555 0 02:56 ? 00:00:00 postgres: stats collector process
root 15013 9121 0 02:57 pts/0 00:00:00 grep postgres
[root@hanode2 ~]# kill -9 14555
[root@hanode2 ~]# ps -ef|grep postgres
root 8910 8795 0 02:38 pts/1 00:00:00 su - postgres
postgres 8911 8910 0 02:38 pts/1 00:00:00 -bash
root 15113 9121 0 02:57 pts/0 00:00:00 grep postgres
...
[root@hanode2 ~]# ps -ef|grep postgres
root 8910 8795 0 02:38 pts/1 00:00:00 su - postgres
postgres 8911 8910 0 02:38 pts/1 00:00:00 -bash
postgres 15294 1 0 02:57 ? 00:00:00 /usr/pgsql-9.3/bin/postgres -D /var/lib/pgsql/data -c config_file=/var/lib/pgsql/data/postgresql.conf
postgres 15336 15294 0 02:57 ? 00:00:00 postgres: logger process
postgres 15338 15294 0 02:57 ? 00:00:00 postgres: checkpointer process
postgres 15339 15294 0 02:57 ? 00:00:00 postgres: writer process
postgres 15340 15294 0 02:57 ? 00:00:00 postgres: wal writer process
postgres 15341 15294 0 02:57 ? 00:00:00 postgres: autovacuum launcher process
postgres 15342 15294 0 02:57 ? 00:00:00 postgres: stats collector process
root 15344 9121 0 02:57 pts/0 00:00:00 grep postgres
3) Cut the active server's heartbeat NIC (which also carries the service IP)
Before the cut, hanode1 is the active server.
[root@hanode1 ~]# crm status
Last updated: Wed Sep 3 03:49:23 2014
Last change: Wed Sep 3 02:39:49 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: disknode - partition with quorum
Version: 1.1.9-2a917dd
3 Nodes configured, 3 expected votes
7 Resources configured.
Online: [ disknode hanode1 hanode2 ]
Resource Group: PgGroup
ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1
rs_iscsid (lsb:iscsid): Started hanode1
DataFS (ocf::heartbeat:Filesystem): Started hanode1
pg93 (ocf::heartbeat:pgsql): Started hanode1
Clone Set: st-sshclone [st-ssh]
Started: [ disknode hanode1 hanode2 ]
After hanode1's NIC is cut, hanode1 gets rebooted and hanode2 takes over the services.
[root@hanode2 ~]# crm status
Last updated: Wed Sep 3 03:57:13 2014
Last change: Wed Sep 3 02:39:49 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: disknode - partition with quorum
Version: 1.1.9-2a917dd
3 Nodes configured, 3 expected votes
7 Resources configured.
Online: [ disknode hanode1 hanode2 ]
Resource Group: PgGroup
ClusterIP (ocf::heartbeat:IPaddr2): Started hanode2
rs_iscsid (lsb:iscsid): Started hanode2
DataFS (ocf::heartbeat:Filesystem): Started hanode2
pg93 (ocf::heartbeat:pgsql): Started hanode2
Clone Set: st-sshclone [st-ssh]
Started: [ disknode hanode1 hanode2 ]
4) Cut the active server's shared-disk NIC
After the shared-disk NIC on the active server hanode2 is cut, Pacemaker detects the failure after about 15 seconds and fails over to the standby server hanode1. (Command output omitted.)
5) Stop the cluster service on the active server
[root@hanode2 ~]# /etc/init.d/corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.. [ OK ]
The standby server takes over the services almost immediately. (Command output omitted.)
6) Cut the active server's heartbeat NIC and fencing NIC
After these NICs on the active server hanode2 are cut, the cluster brings the resources up on hanode1, but unfortunately it gets stuck starting PostgreSQL; restarting corosync resolves it.
This case is also problematic, but if we assume the fencing device never fails, it does not need to be considered.
After 30 seconds Pacemaker mistakenly believes the fencing succeeded and fails over to hanode1.
From /var/log/messages:
Sep 3 10:59:11 hanode1 stonith-ng[14464]: notice: handle_request: Client crmd.14468.e4823c4b wants to fence (reboot) 'hanode2' with device '(any)'
Sep 3 10:59:11 hanode1 stonith-ng[14464]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for hanode2: 7ce22e99-046c-4bfe-a4cd-02033b646a93 (0)
Sep 3 10:59:31 hanode1 stonith-ng[14464]: notice: remote_op_done: Operation reboot of hanode2 by hanode1 for crmd.14468@disknode.7ce22e99: OK
Sep 3 10:59:31 hanode1 crmd[14468]: notice: tengine_stonith_callback: Stonith operation 7/31:82:0:385d8cd2-e2ac-4f62-8e04-f9c7dcfff7ba: OK (0)
Sep 3 10:59:31 hanode1 crmd[14468]: notice: tengine_stonith_notify: Peer hanode2 was terminated (st_notify_fence) by hanode1 for disknode: OK (ref=7ce22e99-046c-4bfe-a4cd-02033b646a93) by client crmd.14468
Sep 3 10:59:31 hanode1 crmd[14468]: notice: te_rsc_command: Initiating action 6: start ClusterIP_start_0 on hanode1
The subsequent log contains a pile of errors, but crm status shows the failover succeeded and psql can still connect. Logging into hanode2 and running df shows that the shared disk is no longer mounted there and the postgres processes are gone; in other words, the switchover was clean. (This is because the Pacemaker instance on hanode2 realized it was isolated and could not reach quorum, so it stopped all of its own resources.)
[root@hanode1 ~]# crm status
Last updated: Wed Sep 3 12:13:55 2014
Last change: Wed Sep 3 10:39:37 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: hanode1 - partition with quorum
Version: 1.1.9-2a917dd
3 Nodes configured, 3 expected votes
7 Resources configured.
Online: [ disknode hanode1 ]
OFFLINE: [ hanode2 ]
Resource Group: PgGroup
ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1
rs_iscsid (lsb:iscsid): Started hanode1
DataFS (ocf::heartbeat:Filesystem): Started hanode1
pg93 (ocf::heartbeat:pgsql): Started hanode1
Clone Set: st-sshclone [st-ssh]
Started: [ disknode hanode1 ]
Stopped: [ st-ssh:2 ]
After hanode2's NICs are brought back up, hanode2 comes back online.
[root@hanode1 ~]# crm status
Last updated: Wed Sep 3 12:22:24 2014
Last change: Wed Sep 3 10:39:37 2014 via cibadmin on hanode1
Stack: classic openais (with plugin)
Current DC: hanode2 - partition with quorum
Version: 1.1.9-2a917dd
3 Nodes configured, 3 expected votes
7 Resources configured.
Online: [ disknode hanode1 hanode2 ]
Resource Group: PgGroup
ClusterIP (ocf::heartbeat:IPaddr2): Started hanode1
rs_iscsid (lsb:iscsid): Started hanode1
DataFS (ocf::heartbeat:Filesystem): Started hanode1
pg93 (ocf::heartbeat:pgsql): Started hanode1
Clone Set: st-sshclone [st-ssh]
Started: [ disknode hanode1 hanode2 ]
7) Cut all NICs on the active server
The result is the same as in 6).
8) Active server goes down
When the active server goes down, the failover works without any problems. (Command output omitted.)
10. Summary
The tests show that the Pacemaker + corosync combination is indeed practical for controlling a database cluster. An HA cluster has two things to fear: first, split-brain corrupting the data; second, the standby failing to take over cleanly (for example because some resources are still held by the failed machine). With 3 voting nodes in the cluster, split-brain can essentially be ruled out, and fencing ensures that the resources on a failed machine are released, so the reliability is quite high. With only two nodes the reliability drops somewhat, and everything then depends entirely on fencing.
11. Appendix 1: Some Common Commands
1) List the available resource classes
[root@hanode2 ~]# crm ra classes
lsb
ocf / heartbeat linbit pacemaker redhat tummy
service
stonith
upstart
2) List all resource scripts in a given class
[root@hanode2 ~]# crm ra list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART
NodeUtilization Stateful SysInfo SystemHealth
controld o2cb ping pingd
remote
3) Show the metadata of a resource script
[root@hanode2 ~]# crm ra meta ocf:heartbeat:Filesystem
...
4) List the installed fence devices
[root@hanode2 ~]# stonith_admin -I
...
5) Show the metadata of a fence device
[root@hanode2 ~]# stonith_admin --metadata -a external/ssh
...
12. Appendix 2: Related Links
http://www.linux-ha.org/wiki/Pacemaker
http://www.linux-ha.org/wiki/Resource_Agents
http://clusterlabs.org/wiki/Main_Page
http://clusterlabs.org/doc/zh-CN/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html
http://clusterlabs.org/doc/crm_fencing.html
https://github.com/ClusterLabs/pacemaker