WSREP has not yet prepared node for application use-阿里云开发者社区

开发者社区> leshami> 正文

WSREP has not yet prepared node for application use

简介:
+关注继续查看

最近PXC 5.7出现脑裂,前端Navicate连接到MySQL时,提示WSREP has not yet prepared node for application use错误。脑裂可是个不容易忽视的问题啊,严重的导致数据错乱。你懂的,就好比酒驾,迷糊啊,都乱了。下文是本次故障的具体描述。

一、环境说明

当前的集群采用了2个数据节点+1个仲裁节点,两节点均有写入数据
重启了仲裁节点后,数据节点发生脑裂
操作系统CentOS 7
数据库版本Percona XtraDB Cluster 5.7.20-18-57-log

故障现象
Navicate客户端可以连接,但是查询数据报错:WSREP has not yet prepared node for application use
2节点状态查看都显示为non-Primary

mysql> show status like '%wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 6                                    |
| wsrep_cluster_size      | 3                                    |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status    | non-Primary                              |
+--------------------------+--------------------------------------+

二、故障分析

节点1日志

2018-05-29T11:36:56.595913+08:00 0 [Note] WSREP: (45125952, 'tcp://0.0.0.0:4567') connection to peer 17b576f8 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:56.602911+08:00 0 [Note] WSREP: (45125952, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.253:4567 
2018-05-29T11:37:05.600073+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,029e6b40,17) ##此处已提示为非主节点
memb {
    45125952,0
    }
joined {
    }
left {
    }
partitioned {
    029e6b40,0
    17b576f8,0
    }
)
2018-05-29T11:37:05.631952+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,45125952,18)
memb {
    45125952,0
    }
joined {
    }
left {
    }
partitioned {
    029e6b40,0
    17b576f8,0
    }
)
2018-05-29T11:37:05.634073+08:00 0 [Note] WSREP: declaring 029e6b40 at tcp://192.168.1.249:4567 stable
2018-05-29T11:37:05.634674+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,029e6b40,19)
memb {
    029e6b40,0
    45125952,0
    }
joined {
    }
left {
    }
partitioned {
    17b576f8,0
    }
)
2018-05-29T11:37:05.663943+08:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-05-29T11:37:05.670701+08:00 151042 [Warning] WSREP: Send action {(nil), 563, TORDERED} returned -107 (Transport endpoint is not connected)
2018-05-29T11:37:05.670763+08:00 151042 [Note] WSREP: --------- CONFLICT DETECTED --------  ###检查到冲突
2018-05-29T11:37:05.670778+08:00 151042 [Note] WSREP: cluster conflict due to certification failure for threads:

节点2日志

2018-05-28T23:59:07.757523+08:00 0 [Note] WSREP: Member 2.0 (pxc-cluster-node-1) synced with group.
2018-05-29T09:47:36.679991+08:00 6095593 [Note] Aborted connection 6095593 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
2018-05-29T09:47:36.680062+08:00 6095607 [Note] Aborted connection 6095607 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
            .............................................#Author : Leshami Blog : http://blog.csdn.net/leshami
2018-05-29T11:34:11.981171+08:00 6104858 [Note] Aborted connection 6104858 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
2018-05-29T11:36:02.203516+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 17b576f8 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:02.203708+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.253:4567 
2018-05-29T11:36:03.203906+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') reconnecting to 17b576f8 (tcp://192.168.1.253:4567), attempt 0
2018-05-29T11:36:03.704413+08:00 0 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2018-05-29T11:36:03.704482+08:00 0 [Note] WSREP: evs::proto(029e6b40, OPERATIONAL, view_id(REG,029e6b40,17)) suspecting node: 17b576f8
2018-05-29T11:36:03.704507+08:00 0 [Note] WSREP: evs::proto(029e6b40, OPERATIONAL, view_id(REG,029e6b40,17)) suspected node without join message, declaring inactive
2018-05-29T11:36:04.204630+08:00 0 [Note] WSREP: declaring node with index 1 inactive (evs.inactive_timeout) 
2018-05-29T11:36:06.204943+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:09.206069+08:00 0 [Note] WSREP: declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2018-05-29T11:36:09.206168+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:09.706352+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:10.206318+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:10.206514+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:10.706763+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:11.204659+08:00 0 [Warning] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) install timer expired
evs::proto(evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)), GATHER) {
current_view=Current view of cluster as seen by this node
view (view_id(REG,029e6b40,17)
memb {
    029e6b40,0
    17b576f8,0
    45125952,0
    }
joined {
    }
left {
    }
partitioned {
    }
),
input_map=evs::input_map: {aru_seq=5278704,safe_seq=5278704,node_index=node: {idx=0,range=[5278710,5278709],safe_seq=5278704} node: {idx=1,range=[5278705,5278704],safe_seq=5278704} node: {idx=2,range=[5278710,5278709],safe_seq=5278704} },
fifo_seq=10557624,
last_sent=5278709,
known:

三、故障解决

对于当前的脑裂,由于在测试环境,因此直接停止其中一个节点(1.248),然后将剩余的节点强制提升为主节点
首先停止仲裁节点,然后停止1.249节点
如下在1.249节点上做如下操作

mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Query OK, 0 rows affected (0.00 sec)

--再次查看此时wsrep_cluster_status为Primary
mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 4                                    |
| wsrep_cluster_size      | 1                                    |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status    | Primary                              |
+--------------------------+--------------------------------------+
4 rows in set (0.01 sec)

在第一个节点248上启动
[root@node248 ~]# systemctl start mysql

--集群状态查看,此时也已正常
mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 5                                    |
| wsrep_cluster_size      | 2                                    |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status    | Primary                              |
+--------------------------+--------------------------------------+

在第三个节点启动仲裁节点
[root@ydq4 ~]# /etc/init.d/garb start
正在启动 /usr/bin/garbd:                                  [确定]

mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 6                                    |
| wsrep_cluster_size      | 3                                    |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status    | Primary                              |
+--------------------------+--------------------------------------+

四、后记

1、此次脑裂时应用程序端已出现节点无法连接,并且有日志记录到mysql error log
2、在数据节点已出现异常时,冒然重启仲裁节点不可取
3、建议使用3个数据节点最佳
4、多节点数据写入,最好按模块将写划分到不同节点
5、节点修复参考:http://galeracluster.com/documentation-webpages/twonode.html

上传日志:

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

相关文章
阿里云服务器怎么设置密码?怎么停机?怎么重启服务器?
如果在创建实例时没有设置密码,或者密码丢失,您可以在控制台上重新设置实例的登录密码。本文仅描述如何在 ECS 管理控制台上修改实例登录密码。
9867 0
使用SSH远程登录阿里云ECS服务器
远程连接服务器以及配置环境
2481 0
阿里云服务器如何登录?阿里云服务器的三种登录方法
购买阿里云ECS云服务器后如何登录?场景不同,阿里云优惠总结大概有三种登录方式: 登录到ECS云服务器控制台 在ECS云服务器控制台用户可以更改密码、更换系.
13595 0
windows server 2008阿里云ECS服务器安全设置
最近我们Sinesafe安全公司在为客户使用阿里云ecs服务器做安全的过程中,发现服务器基础安全性都没有做。为了为站长们提供更加有效的安全基础解决方案,我们Sinesafe将对阿里云服务器win2008 系统进行基础安全部署实战过程! 比较重要的几部分 1.
9117 0
阿里云服务器ECS登录用户名是什么?系统不同默认账号也不同
阿里云服务器Windows系统默认用户名administrator,Linux镜像服务器用户名root
4378 0
阿里云服务器如何登录?阿里云服务器的三种登录方法
购买阿里云ECS云服务器后如何登录?场景不同,云吞铺子总结大概有三种登录方式: 登录到ECS云服务器控制台 在ECS云服务器控制台用户可以更改密码、更换系统盘、创建快照、配置安全组等操作如何登录ECS云服务器控制台? 1、先登录到阿里云ECS服务器控制台 2、点击顶部的“控制台” 3、通过左侧栏,切换到“云服务器ECS”即可,如下图所示 通过ECS控制台的远程连接来登录到云服务器 阿里云ECS云服务器自带远程连接功能,使用该功能可以登录到云服务器,简单且方便,如下图:点击“远程连接”,第一次连接会自动生成6位数字密码,输入密码即可登录到云服务器上。
22272 0
阿里云ECS云服务器初始化设置教程方法
阿里云ECS云服务器初始化是指将云服务器系统恢复到最初状态的过程,阿里云的服务器初始化是通过更换系统盘来实现的,是免费的,阿里云百科网分享服务器初始化教程: 服务器初始化教程方法 本文的服务器初始化是指将ECS云服务器系统恢复到最初状态,服务器中的数据也会被清空,所以初始化之前一定要先备份好。
7238 0
+关注
leshami
传播知识,分享快乐!十年以上数据库,系统运维与管理,性能优化经验。全部文章,欢迎扩散,转载请注明出处!
639
文章
0
问答
文章排行榜
最热
最新
相关电子书
更多
《2021云上架构与运维峰会演讲合集》
立即下载
《零基础CSS入门教程》
立即下载
《零基础HTML入门教程》
立即下载