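I. Overview:
What are AIS and OpenAIS?
AIS (Application Interface Specification) is a collection of open specifications that define application programming interfaces (APIs). These interfaces act as middleware and give application services an open, highly portable programming interface, which is badly needed when building highly available applications. The Service Availability Forum (SA Forum) is an open forum that develops and publishes these specifications free of charge. Using the AIS APIs reduces application complexity and shortens development time; the main goal of the specifications is to improve the portability of middleware components and the availability of applications.
OpenAIS is an open-source implementation of the cluster-framework application interface specifications defined by the SA Forum standard. OpenAIS provides a cluster model that covers the cluster framework, cluster membership management, communication, and cluster monitoring. It can offer AIS-compliant cluster interfaces to cluster software and tools, but it has no cluster resource management functionality and cannot form a cluster on its own.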
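Introduction to corosync
corosync began as an application for demonstrating the OpenAIS cluster framework interface specification, so it can be regarded as part of OpenAIS. Its later development clearly went beyond what was originally intended, and more and more vendors adopted corosync as a cluster solution. Red Hat's RHCS cluster suite, for example, is built on corosync.
corosync provides only the messaging layer and no CRM (cluster resource manager) of its own; Pacemaker is generally used for resource management.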
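Basic CRM concepts
Resource stickiness:
Resource stickiness expresses whether a resource prefers to stay on its current node. A positive value means it prefers to stay, a negative value means it will move away; -inf means negative infinity and inf means positive infinity.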
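Resource types:
primitive (native): a basic, primitive resource
group: a resource group
clone: a cloned resource (can run on several nodes at the same time); it must first be defined as a primitive before it can be cloned. Typical uses are STONITH and cluster filesystems
master/slave: a master/slave resource, such as drbd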
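RA (resource agent) classes:
lsb: Linux Standard Base. Service scripts that support arguments such as start|stop|status, usually located under /etc/rc.d/init.d/, are lsb agents
ocf: Open Cluster Framework
heartbeat: heartbeat V1 agents
stonith: used exclusively for configuring STONITH devices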
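Cluster types and models:
corosync+pacemaker can implement several cluster models, including Active/Active, Active/Passive, N+1, N+M, N-to-1 and N-to-N.
Figure: Active/Passive redundancy
Figure: N-to-N redundancy (multiple nodes, multiple services)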
II. System environment: CentOS 6.4 x86_64
1. Set the hostname on both nodes (node1 is shown; use node2.luojianlong.com on node2)
|
[root@node1 ~]# hostname node1.luojianlong.com
[root@node1 ~]# sed -i 's@\(HOSTNAME=\).*@\1node1.luojianlong.com@g' /etc/sysconfig/network
[root@node1 ~]# bash
|
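The sed expression in this step keeps the literal `HOSTNAME=` prefix and replaces everything after it. The following sketch tries the same substitution on a scratch copy, so that the result can be inspected before touching /etc/sysconfig/network. The scratch file and its initial contents are made up for illustration.

```shell
# Work on a scratch copy so the real /etc/sysconfig/network is untouched.
net_file=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$net_file"

# Same substitution as in the step above: keep "HOSTNAME=", replace the rest of the line.
sed -i 's@\(HOSTNAME=\).*@\1node1.luojianlong.com@g' "$net_file"

# Read the value back to confirm what a reboot would pick up.
new_hostname=$(grep '^HOSTNAME=' "$net_file" | cut -d= -f2)
echo "$new_hostname"
rm -f "$net_file"
```

The `hostname` command changes the name only for the running system, while the file edit makes it persistent, which is why the step does both.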
2. Set up passwordless SSH trust between the two nodes
|
[root@node1 ~]# ssh-keygen -t rsa
[root@node1 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
[root@node2 ~]# ssh-keygen -t rsa
[root@node2 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
|
1. Configure name resolution for both nodes in the hosts file:
|
[root@node1 ~]# cat /etc/hosts
192.168.30.116 node1.luojianlong.com
192.168.30.117 node2.luojianlong.com
|
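Every name used in later commands has to resolve on both nodes. As a rough sketch, the helper below (the name `missing_hosts` is made up for this example) reports which names have no entry in a hosts-format file. It only reads the file and does not query DNS.

```shell
# missing_hosts FILE NAME... -> prints each NAME that has no entry in FILE.
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; look for NAME in any column after the address.
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }' "$file" || echo "$name"
    done
}

# Typical use on a node (prints nothing when all names are present):
#   missing_hosts /etc/hosts node1.luojianlong.com node2.luojianlong.com
```

It is worth passing the short names as well, since several commands in this walkthrough address the peers as `node1` and `node2` while the hosts file shown lists only the fully qualified names.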
2. Install corosync and pacemaker on both node1 and node2
|
[root@node1 ~]# yum -y install corosync pacemaker
|
3. Stop and disable the NetworkManager service on both node1 and node2
|
[root@node1 ~]# service NetworkManager stop
[root@node1 ~]# chkconfig NetworkManager off
|
4. Install crmsh-1.2.6-4 on both node1 and node2
|
[root@node1 ~]# yum -y --nogpgcheck localinstall crmsh*.rpm pssh*.rpm
[root@node2 ~]# yum -y --nogpgcheck localinstall crmsh*.rpm pssh*.rpm
|
5. Edit the corosync configuration file (/etc/corosync/corosync.conf):
|
totem {
    version: 2
    secauth: on
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.30.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
        ttl: 1
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
service {
    ver: 0
    name: pacemaker
    # use_mgmtd: yes
}
aisexec {
    user: root
    group: root
}
|
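version: the version of the configuration format;
secauth: enables encrypted authentication between cluster nodes;
threads: the number of threads to start;
bindnetaddr: the network address of the subnet the cluster nodes are on;
mcastaddr: the multicast address used for cluster messages;
service: runs pacemaker as a corosync plugin;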
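Note that bindnetaddr is the network address of the subnet (192.168.30.0 here), not the address of an individual host. A minimal sketch of how to derive it from a node's IPv4 address and prefix length follows; the function name `network_addr` is made up for this example, and the /24 prefix is an assumption that matches the addresses used in this setup.

```shell
# network_addr IP PREFIX -> prints the IPv4 network address for bindnetaddr.
network_addr() {
    ip=$1 prefix=$2
    # Split the dotted quad into four positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # Build the netmask from the prefix length and keep it to 32 bits.
    mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
    net=$(( addr & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

network_addr 192.168.30.116 24
```

Both nodes must produce the same value, since they share one corosync.conf.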
6. Generate the cluster authentication key and copy it, together with the configuration file, to node2
|
[root@node1 ~]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 176).
Press keys on your keyboard to generate entropy (bits = 240).
Press keys on your keyboard to generate entropy (bits = 304).
Press keys on your keyboard to generate entropy (bits = 368).
Press keys on your keyboard to generate entropy (bits = 432).
Press keys on your keyboard to generate entropy (bits = 496).
Press keys on your keyboard to generate entropy (bits = 560).
Press keys on your keyboard to generate entropy (bits = 624).
Press keys on your keyboard to generate entropy (bits = 688).
Press keys on your keyboard to generate entropy (bits = 752).
Press keys on your keyboard to generate entropy (bits = 816).
Press keys on your keyboard to generate entropy (bits = 880).
Press keys on your keyboard to generate entropy (bits = 944).
Press keys on your keyboard to generate entropy (bits = 1008).
Writing corosync key to /etc/corosync/authkey.
[root@node1 corosync]# scp authkey corosync.conf root@192.168.30.117:/etc/corosync/
|
7. Start the corosync service:
|
[root@node1 ~]# service corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
|
8. Check the corosync log
Check that the corosync engine started correctly:
|
[root@node1 ~]# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Jan 10 10:49:13 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jan 10 10:49:13 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
|
Check that the initial membership notifications were sent:
|
[root@node1 ~]# grep TOTEM /var/log/cluster/corosync.log
Jan 10 10:49:13 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 10 10:49:13 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 10 10:49:13 corosync [TOTEM ] The network interface [192.168.30.116] is now up.
Jan 10 10:49:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 10:51:11 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
|
Check whether any errors occurred during startup. The error messages below mean that pacemaker will soon no longer run as a corosync plugin, so using cman as the cluster infrastructure service is recommended. They can be safely ignored here.
|
[root@node1 ~]# grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
Jan 10 10:49:14 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Jan 10 10:49:14 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
|
Check that pacemaker started correctly:
|
[root@node1 ~]# grep pcmk_startup /var/log/cluster/corosync.log
Jan 10 10:49:14 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Jan 10 10:49:14 corosync [pcmk ] Logging: Initialized pcmk_startup
Jan 10 10:49:14 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Jan 10 10:49:14 corosync [pcmk ] info: pcmk_startup: Service: 9
Jan 10 10:49:14 corosync [pcmk ] info: pcmk_startup: Local hostname: node1.luojianlong.com
|
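The individual grep checks in this step can be bundled into one pass over the log. The sketch below is only a convenience wrapper: the function name `check_corosync_log` is made up, and the patterns are taken from the log lines shown above, so they may need adjusting for other corosync or pacemaker versions.

```shell
# check_corosync_log FILE -> prints one line per check; returns non-zero if any marker is missing.
check_corosync_log() {
    log=$1
    rc=0
    for pattern in 'Corosync Cluster Engine' \
                   'Successfully read main configuration file' \
                   'A processor joined or left the membership' \
                   'pcmk_startup: CRM: Initialized'; do
        if grep -q -- "$pattern" "$log"; then
            echo "ok:      $pattern"
        else
            echo "missing: $pattern"
            rc=1
        fi
    done
    return $rc
}

# Typical use on a node:
#   check_corosync_log /var/log/cluster/corosync.log
```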
If all of the commands above ran without problems, start corosync on node2 with the following command
|
[root@node1 ~]# ssh node2 '/etc/init.d/corosync start'
|
Note: start node2 from node1 with the command above; do not start it directly on node2. The related log entries on node1 are shown below
|
[root@node1 ~]# tail /var/log/cluster/corosync.log
Jan 10 10:51:14 [14875] node1.luojianlong.com crmd: info: te_rsc_command: Action 3 confirmed - no wait
Jan 10 10:51:14 [14875] node1.luojianlong.com crmd: notice: run_graph: Transition 1 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete
Jan 10 10:51:14 [14875] node1.luojianlong.com crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jan 10 10:51:14 [14875] node1.luojianlong.com crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 10 10:51:14 [14869] node1.luojianlong.com cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='node1.luojianlong.com']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/7, version=0.5.6)
Jan 10 10:51:14 [14869] node1.luojianlong.com cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/8, version=0.5.6)
Jan 10 10:51:16 [14869] node1.luojianlong.com cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2.luojianlong.com/attrd/6, version=0.5.7)
Jan 10 10:51:31 [14869] node1.luojianlong.com cib: info: crm_client_new: Connecting 0x913f10 for uid=0 gid=0 pid=9832 id=41a762cd-b87c-4f39-8482-a64c2c61209c
Jan 10 10:51:31 [14869] node1.luojianlong.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_mon/2, version=0.5.7)
Jan 10 10:51:31 [14869] node1.luojianlong.com cib: info: crm_client_destroy: Destroying 0 events
|
9. If crmsh is installed, the following command shows the startup state of the cluster nodes:
|
[root@node1 ~]# crm status
Last updated: Fri Jan 10 11:07:14 2014
Last change: Fri Jan 10 10:51:11 2014 via crmd on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ node1.luojianlong.com node2.luojianlong.com ]
# The output shows that both nodes have started and that the cluster is working normally.
|
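When this check is scripted, the "Online:" line is the part that matters. The sketch below pulls the node names out of `crm status` text read from standard input. The function name `online_nodes` is made up, and the parsing assumes the exact "Online: [ ... ]" layout shown above, which may differ in other crmsh or pacemaker versions.

```shell
# online_nodes: read `crm status` text on stdin, print one online node per line.
online_nodes() {
    sed -n 's/^Online: \[ \(.*\) ]$/\1/p' | tr ' ' '\n'
}

# Typical use on a node:
#   crm status | online_nodes
# Example with the status line from the output above:
printf 'Online: [ node1.luojianlong.com node2.luojianlong.com ]\n' | online_nodes
```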
10. Run ps auxf to see the processes started by corosync.
|
[root@node1 ~]# ps auxf
189 14869 0.4 0.2 93928 10624 ? S 09:20 0:05 \_ /usr/libexec/pacemaker/cib
root 14871 0.0 0.1 94280 4036 ? S 09:20 0:01 \_ /usr/libexec/pacemaker/stonithd
root 14872 0.0 0.0 75996 3216 ? S 09:20 0:00 \_ /usr/libexec/pacemaker/lrmd
189 14873 0.0 0.0 89536 3444 ? S 09:20 0:00 \_ /usr/libexec/pacemaker/attrd
189 14874 0.0 0.4 117180 18920 ? S 09:20 0:00 \_ /usr/libexec/pacemaker/pengine
189 14875 0.0 0.1 147684 6624 ? S 09:20 0:00 \_ /usr/libexec/pacemaker/crmd
|
11. Configure the cluster properties and disable stonith
|
# Pacemaker enables stonith by default, but this cluster has no stonith device, so the default configuration is not usable yet. This can be verified with the following command:
[root@node1 ~]# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# For now we disable stonith with the following command:
[root@node1 ~]# crm configure property stonith-enabled=false
# View the current configuration with the following command:
[root@node1 ~]# crm configure show
node node1.luojianlong.com
node node2.luojianlong.com
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6_5.1-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="2" \
    stonith-enabled="false"
# The output shows that stonith has been disabled.
# The crm and crm_verify commands used above are command-line cluster management tools available with pacemaker 1.0 and later (in this setup crm comes from the crmsh package installed in step 4); they can be run on any node in the cluster.
|
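12. View the supported cluster resources
|
# The cluster supports resource agents of the heartbeat, LSB and OCF classes, among others. LSB and OCF are the most commonly used today; the stonith class is used exclusively for configuring stonith devices.
# List the classes supported by the current cluster with the following command:
[root@node1 ~]# crm ra classes
lsb
ocf / heartbeat pacemaker
service
stonith
# To list all resource agents in a given class, use commands like the following:
[root@node1 ~]# crm ra list lsb
[root@node1 ~]# crm ra list ocf heartbeat
[root@node1 ~]# crm ra list ocf pacemaker
[root@node1 ~]# crm ra list stonith
# To view the details of a single resource agent:
# crm ra info [class:[provider:]]resource_agent
# For example:
[root@node1 ~]# crm ra info ocf:heartbeat:IPaddr
|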
13. Next, create an IP address resource for the web cluster, to be used when the cluster provides the web service. This can be done as follows:
Syntax:
primitive <rsc> [<class>:[<provider>:]]<type>
    [params attr_list]
    [operations id_spec]
    [op op_type [<attribute>=<value>...] ...]
op_type :: start | stop | monitor
Example:
primitive apcfence stonith:apcsmart \
    params ttydev=/dev/ttyS0 hostlist="node1 node2" \
    op start timeout=60s \
    op monitor interval=30m timeout=60s
|
[root@node1 ~]# crm configure primitive WebIP ocf:heartbeat:IPaddr params ip=192.168.30.230
|
14. The output of the following commands shows that this resource has been started on node1.luojianlong.com:
|
[root@node1 ~]# crm status
Last updated: Tue Mar 25 11:10:40 2014
Last change: Tue Mar 25 11:10:00 2014 via cibadmin on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node1.luojianlong.com node2.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node1.luojianlong.com
[root@node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:f3:fc:ba brd ff:ff:ff:ff:ff:ff
    inet 192.168.30.116/24 brd 192.168.30.255 scope global eth0
    inet 192.168.30.230/24 brd 192.168.30.255 scope global secondary eth0
    inet6 fe80::20c:29ff:fef3:fcba/64 scope link
       valid_lft forever preferred_lft forever
|
15. Then, from node2, stop the corosync service on node1 with the following command and check the cluster status
|
[root@node2 ~]# ssh node1 '/etc/init.d/corosync stop'
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.[ OK ]
[root@node2 ~]# crm status
Last updated: Tue Mar 25 11:13:57 2014
Last change: Tue Mar 25 11:13:22 2014 via crmd on node2.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node2.luojianlong.com - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node2.luojianlong.com ]
OFFLINE: [ node1.luojianlong.com ]
|
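16. The output above shows that node1.luojianlong.com is offline, yet the WebIP resource was not started on node2.luojianlong.com. The reason is that the cluster state is now "WITHOUT quorum": quorum has been lost, so the cluster no longer meets the conditions for running resources. For a cluster with only two nodes this behavior is unreasonable, so we tell the cluster to ignore the loss of quorum with the following command:
|
[root@node2 ~]# crm configure property no-quorum-policy=ignore
|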
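Why a two-node cluster loses quorum as soon as one node stops can be seen from the majority rule. The sketch below assumes the simple case of one vote per node and a strict majority requirement; the function name `has_quorum` is made up for this example and is not a pacemaker command.

```shell
# has_quorum TOTAL_VOTES LIVE_VOTES -> exit status 0 if the partition holds a strict majority.
has_quorum() {
    total=$1 live=$2
    needed=$(( total / 2 + 1 ))
    [ "$live" -ge "$needed" ]
}

# Two expected votes, one surviving node: 1 < 2, so quorum is lost.
if has_quorum 2 1; then echo "with quorum"; else echo "WITHOUT quorum"; fi
# Three expected votes, two surviving nodes: 2 >= 2, so quorum is kept.
if has_quorum 3 2; then echo "with quorum"; else echo "WITHOUT quorum"; fi
```

With only two votes, a single surviving node can never reach a majority, which is why no-quorum-policy=ignore is used here; with three or more nodes the default policy can usually stay in place.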
17. A moment later the cluster starts the resource on node2, the node that is still running, as shown below:
|
[root@node2 ~]# crm status
Last updated: Tue Mar 25 11:16:21 2014
Last change: Tue Mar 25 11:15:47 2014 via cibadmin on node2.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node2.luojianlong.com - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node2.luojianlong.com ]
OFFLINE: [ node1.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node2.luojianlong.com
|
18. Once this has been verified, start node1.luojianlong.com normally again (the transcript below then stops corosync on node2 and checks the status from node1)
|
[root@node2 ~]# ssh node1 '/etc/init.d/corosync start'
Starting Corosync Cluster Engine (corosync): [ OK ]
[root@node1 ~]# ssh node2 '/etc/init.d/corosync stop'
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.[ OK ]
[root@node1 ~]# crm status
Last updated: Tue Mar 25 11:24:10 2014
Last change: Tue Mar 25 11:22:42 2014 via crmd on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node1.luojianlong.com ]
OFFLINE: [ node2.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node1.luojianlong.com
|
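After node1.luojianlong.com starts normally again, the WebIP resource will very likely move back from node2.luojianlong.com to node1.luojianlong.com. Every such move between nodes makes the resource unreachable for a short time, so after a resource has failed over to another node we sometimes want to prevent it from moving back even when the original node recovers. This is done by defining resource stickiness, which can be specified when the resource is created or afterwards.
Resource stickiness values and their effect:
0: The default. The resource is placed at the most suitable location in the system, which means it is moved when a node that is "better" or less loaded becomes available. This is almost equivalent to automatic failback, except that the resource may move to a node other than the one it was previously active on;
Greater than 0: The resource prefers to stay where it is, but will move if a more suitable node becomes available. The higher the value, the stronger the preference to stay;
Less than 0: The resource prefers to move away from its current location. The higher the absolute value, the stronger the preference to leave;
INFINITY: The resource always stays where it is unless it is forced off because the node is no longer eligible to run it (node shutdown, node standby, migration-threshold reached, or a configuration change). This is almost equivalent to disabling automatic failback completely;
-INFINITY: The resource always moves away from its current location;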
19. Here we set a default stickiness value for resources as follows:
|
[root@node1 ~]# crm configure rsc_defaults resource-stickiness=100
|
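A simplified way to picture the effect of stickiness: the stickiness value is added to the score of the node the resource is currently running on, and the node with the highest total wins. The sketch below models only that rule for a single resource. The function name `preferred_node` and the example scores are made up, and real placement in pacemaker also takes colocation, failure counts and other factors into account.

```shell
# preferred_node CURRENT STICKINESS NODE=SCORE... -> prints the node with the highest total.
# The current node's total is its own score plus the stickiness value.
preferred_node() {
    current=$1 stickiness=$2
    shift 2
    best_node=
    best_score=
    for pair in "$@"; do
        node=${pair%%=*}
        score=${pair#*=}
        if [ "$node" = "$current" ]; then
            score=$(( $score + $stickiness ))
        fi
        if [ -z "$best_node" ] || [ "$score" -gt "$best_score" ]; then
            best_node=$node
            best_score=$score
        fi
    done
    echo "$best_node"
}

# Resource currently on node2, stickiness 100, no other preferences: it stays on node2.
preferred_node node2 100 node1=0 node2=0
# Same situation, but node1 carries a hypothetical preference of 200: it moves to node1.
preferred_node node2 100 node1=200 node2=0
```

Ties are resolved here in favor of the first node listed, which is an arbitrary choice of this sketch and not a statement about pacemaker.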
20. Building on the IP address resource configured above, we now turn this cluster into an active/passive web (httpd) service cluster. To do so, first install httpd on each node and give each node its own local test page
|
# node1
[root@node1 ~]# yum -y install httpd
[root@node1 ~]# echo "node1.luojianlong.com" > /var/www/html/index.html
# node2
[root@node2 ~]# yum -y install httpd
[root@node2 ~]# echo "node2.luojianlong.com" > /var/www/html/index.html
|
21. Next, start the httpd service manually on each node and confirm that it serves the page correctly. Then stop the httpd service with the commands below and make sure it does not start automatically at boot (run them on both nodes)
|
[root@node1 ~]# /etc/init.d/httpd stop
Stopping httpd: [FAILED]
[root@node1 ~]# chkconfig httpd off
[root@node2 ~]# /etc/init.d/httpd stop
Stopping httpd: [FAILED]
[root@node2 ~]# chkconfig httpd off
|
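22. Next we add the httpd service as a cluster resource. Two resource agent classes can be used for httpd: lsb and ocf:heartbeat. For simplicity we use the lsb class here. First, view the syntax of the lsb httpd resource with the following command
|
[root@node1 ~]# crm ra info lsb:httpd
|
23. Next, create the resource WebServer
|
[root@node1 ~]# crm configure primitive WebServer lsb:httpd
|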
24. View the definition generated in the configuration
|
[root@node1 ~]# crm configure show
node node1.luojianlong.com
node node2.luojianlong.com
primitive WebIP ocf:heartbeat:IPaddr \
    params ip="192.168.30.230"
primitive WebServer lsb:httpd
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6_5.1-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
|
25. Check the state of the resources
|
[root@node1 ~]# crm status
Last updated: Tue Mar 25 11:36:53 2014
Last change: Tue Mar 25 11:35:37 2014 via cibadmin on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.luojianlong.com node2.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node1.luojianlong.com
WebServer (lsb:httpd): Started node2.luojianlong.com
|
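The output above shows that WebIP and WebServer may end up running on two different nodes. That does not work for an application that serves web content through this IP: both resources must run on the same node.
So even when a cluster has all the resources it needs, it may still not handle them correctly. Resource constraints specify which cluster nodes a resource may run on, in what order resources are started, and which other resources a given resource depends on. pacemaker provides three kinds of resource constraints:
1) Resource Location: defines on which nodes a resource may, may not, or should preferably run;
2) Resource Colocation: defines which cluster resources may or may not run together on the same node;
3) Resource Order: defines the order in which cluster resources are started on a node;
When defining constraints you also specify scores. Scores of all kinds are a central part of how the cluster works: everything from migrating a resource to deciding which resources to stop in a degraded cluster is achieved by manipulating scores in some way. Scores are calculated per resource, and any node with a negative score for a resource cannot run that resource. After calculating the scores, the cluster chooses the node with the highest score. INFINITY is currently defined as 1,000,000. Adding and subtracting infinity follows three basic rules:
1) any value + INFINITY = INFINITY
2) any value - INFINITY = -INFINITY
3) INFINITY - INFINITY = -INFINITY
When defining resource constraints you can also give each constraint a score. The score is the value assigned to that constraint. Constraints with higher scores are applied before those with lower scores. By creating additional location constraints with different scores for a given resource, you can specify the order of the nodes it should fail over to.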
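The three infinity rules can be written down as a small function. This is only a sketch of the arithmetic as stated above, with INFINITY taken as 1,000,000; the function name `add_scores` is made up, and the clamping of large finite sums to INFINITY is an assumption of this example.

```shell
INFINITY=1000000

# add_scores A B -> prints A+B under the three rules above.
# Arguments are integers, or the words INFINITY / -INFINITY.
add_scores() {
    result=0
    saw_plus=0
    saw_minus=0
    for s in "$1" "$2"; do
        case $s in
            INFINITY|+INFINITY) saw_plus=1 ;;
            -INFINITY)          saw_minus=1 ;;
            *)                  result=$(( $result + $s )) ;;
        esac
    done
    if [ "$saw_minus" -eq 1 ]; then
        echo "-INFINITY"      # rules 2 and 3: minus infinity always wins
    elif [ "$saw_plus" -eq 1 ] || [ "$result" -ge "$INFINITY" ]; then
        echo "INFINITY"       # rule 1, plus clamping of large finite sums
    elif [ "$result" -le "-$INFINITY" ]; then
        echo "-INFINITY"
    else
        echo "$result"
    fi
}

add_scores 100 INFINITY
add_scores 100 -INFINITY
add_scores INFINITY -INFINITY
```

Rule 3 is the one that matters in practice: a -INFINITY constraint cannot be outweighed by any positive score, which is what makes it usable as a hard ban.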
26. The problem described above, where WebIP and WebServer may run on different nodes, can be solved with the following command
|
[root@node1 ~]# crm configure colocation webserver-with-webip INFINITY: WebServer WebIP
[root@node1 ~]# crm status
Last updated: Tue Mar 25 11:39:58 2014
Last change: Tue Mar 25 11:39:31 2014 via cibadmin on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.luojianlong.com node2.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node1.luojianlong.com
WebServer (lsb:httpd): Started node1.luojianlong.com
|
27. Next, we must make sure that WebIP is started before WebServer on whichever node runs them. This can be done with the following command
|
[root@node1 ~]# crm configure order web-server-after-webip mandatory: WebIP WebServer
|
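The colocation and order constraints from steps 26 and 27 can also be kept together in a file, so that they can be reviewed before being applied. The sketch below only writes and prints such a file. Loading it with `crm configure load update` is left as a comment, and whether that subcommand behaves the same way in the crmsh version used here is an assumption to verify.

```shell
# Collect both constraints in a scratch file, using the same syntax as the commands above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
colocation webserver-with-webip INFINITY: WebServer WebIP
order web-server-after-webip mandatory: WebIP WebServer
EOF

# Review the file, then load it in one step (left commented out on purpose):
#   crm configure load update "$cfg"
cat "$cfg"
```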
28. Next, verify the high-availability behavior:
|
# All resources are currently on node1:
[root@node1 ~]# crm status
Last updated: Tue Mar 25 11:46:41 2014
Last change: Tue Mar 25 11:41:18 2014 via cibadmin on node1.luojianlong.com
Stack: classic openais (with plugin)
Current DC: node1.luojianlong.com - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.luojianlong.com ]
OFFLINE: [ node2.luojianlong.com ]
WebIP (ocf::heartbeat:IPaddr): Started node1.luojianlong.com
WebServer (lsb:httpd): Started node1.luojianlong.com
# From node2, stop corosync on node1
[root@node2 ~]# ssh node1 '/etc/init.d/corosync stop'
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:..[ OK ]
|
Then access http://192.168.30.230/index.html to confirm that the web service is still reachable through the cluster IP.
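29. In addition, because an HA cluster does not require its nodes to have the same or similar performance, we may sometimes want the service to run on a more powerful node whenever things are normal. This can be done with a location constraint:
|
[root@node1 ~]# crm configure location prefer-node1 WebServer 200: node1.luojianlong.com
|
This command gives WebServer a preference for node1 with a score of 200.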
Common problems during deployment:
1. After starting the corosync service, the log kept reporting the errors below. The cause turned out to be that SELinux had not been disabled.
|
[root@node1 ~]# /etc/init.d/corosync start
[root@node1 ~]# tail -f /var/log/cluster/corosync.log
Mar 26 11:38:43 [7875] node1.luojianlong.com lrmd: info: crm_client_new: Connecting 0x1bce860 for uid=189 gid=0 pid=8001 id=630eb39c-1acd-4cad-8930-700c71680f91
Mar 26 11:38:43 [7875] node1.luojianlong.com lrmd: error: qb_ipcs_shm_rb_open: qb_rb_chmod:lrmd-request-7875-8001-153: Operation not permitted (1)
Mar 26 11:38:43 [7875] node1.luojianlong.com lrmd: error: qb_ipcs_shm_connect: shm connection FAILED: Operation not permitted (1)
Mar 26 11:38:43 [7875] node1.luojianlong.com lrmd: error: handle_new_connection: Error in connection setup (7875-8001-153): Operation not permitted (1)
Mar 26 11:38:43 [8001] node1.luojianlong.com crmd: info: crm_ipc_connect: Could not establish lrmd connection: Operation not permitted (1)
Mar 26 11:38:43 [8001] node1.luojianlong.com crmd: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
Mar 26 11:38:43 [7960] node1.luojianlong.com cib: error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)
Mar 26 11:38:43 [7960] node1.luojianlong.com cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Mar 26 11:38:43 [7960] node1.luojianlong.com cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Mar 26 11:38:43 [7960] node1.luojianlong.com cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
|
Solution:
|
[root@node1 ~]# setenforce 0
[root@node1 ~]# vi /etc/selinux/config
# Set SELINUX=disabled
# Check the SELinux status
[root@node1 ~]# getenforce
|
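The manual edit in vi can also be done non-interactively. The sketch below applies the change to a scratch copy so that it can be checked first. The sample file contents are made up, and the same sed expression would have to be pointed at /etc/selinux/config on each node to take effect after a reboot.

```shell
# Scratch copy with typical contents, so the real /etc/selinux/config is untouched.
selinux_cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$selinux_cfg"

# Replace only the SELINUX= line; SELINUXTYPE= does not match this pattern.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$selinux_cfg"

selinux_mode=$(grep '^SELINUX=' "$selinux_cfg" | cut -d= -f2)
selinux_type=$(grep '^SELINUXTYPE=' "$selinux_cfg" | cut -d= -f2)
echo "$selinux_mode"
rm -f "$selinux_cfg"
```

`setenforce 0` switches SELinux to permissive mode only for the running system, while the file change makes the setting persistent across reboots.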