Troubleshooting a third node that fails to join a PXC cluster - Alibaba Cloud Developer Community

Published 2022-10-13 in Beijing

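A PXC 8.0.23 cluster stopped serving requests after an operation carried out for a project. Clients saw one of the following errors:
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
or
2013 - Lost connection to MySQL server during query
Logging in to each node showed wsrep_cluster_size at 0 and wsrep_cluster_status in a non-Primary state on every node (as far as I recall, it was "not connected"). In grastate.dat, node 3 had safe_to_bootstrap set to 1.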
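
The grastate.dat check is easy to script when several nodes have to be compared. The following is a minimal sketch, not the procedure actually used here: the function name safe_to_bootstrap_of is hypothetical, and the default path assumes the data directory /mysql/pxc/data/ that appears as --datadir in the logs in this post.

```shell
#!/usr/bin/env bash
# Print the safe_to_bootstrap value recorded in a grastate.dat file.
# The function name is illustrative; it is not part of PXC.
safe_to_bootstrap_of() {
    awk -F': *' '$1 == "safe_to_bootstrap" { print $2 }' "$1"
}

# Assumed location: the datadir shown in the logs of this post.
GRASTATE="${GRASTATE:-/mysql/pxc/data/grastate.dat}"
if [ -r "$GRASTATE" ]; then
    echo "safe_to_bootstrap on $(hostname): $(safe_to_bootstrap_of "$GRASTATE")"
fi

# On a node whose mysqld is running, the cluster view can be read with
# (credentials omitted):
#   mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%'"
```

The node that reports 1 is the one Galera considers safe to bootstrap from; in this incident that was node 3.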
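I therefore shut down all nodes and bootstrapped the cluster from node 3. Node 2 then joined without trouble, but node 1 failed to join with the following errors: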

2022-01-12T11:12:43.552286Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-12T11:20:32.979860Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (16448) with SIGKILL after stalling for 120 seconds
2022-01-12T11:20:33.010860Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: 行 183: 16450 已杀死               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-12T11:20:33.010931Z 0 [Note] [MY-000000] [WSREP-SST]      16451                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-12T11:20:33.011525Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2022-01-12T11:20:33.011676Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-12T11:20:33.011756Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-12T11:20:33.011874Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2022-01-12T11:20:33.012861Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-12T11:20:33.210760Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.222.50.101' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '15908' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-12T11:20:33.210898Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-12T11:20:33.210973Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-12T11:20:33.211182Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-12T11:20:33.211268Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-12T11:20:33.211352Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
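
A note on reading "exit codes: 137 137": a shell reports 128 plus the signal number for a process terminated by a signal, and SIGKILL is signal 9. The two 137s most likely correspond to the socat and xbstream processes that the SST script itself killed after the 120-second stall, so they point to no data arriving on the joiner rather than to those tools crashing. A small demonstration, using a throwaway sleep process and nothing PXC-specific:

```shell
#!/usr/bin/env bash
# Start a background process, kill it with SIGKILL, and inspect its exit status.
sleep 30 &
pid=$!
kill -KILL "$pid"
wait "$pid"          # wait returns the exit status of the killed process
status=$?
echo "exit status: $status (128 + 9)"
```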

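The articles I found online varied widely; I tried the suggestions from several of them and none helped. Because the error log contained --address '10.222.50.101', I suspected for a while that the wsrep_node_address parameter had to be set explicitly, since it is commented out by default. After setting it explicitly, the join still failed: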

2022-01-13T08:03:32.978322Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST.........
2022-01-13T08:03:33.036563Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-13T08:12:38.715388Z 0 [Note] [MY-000000] [Galera] Created page /mysql/pxc/data/gcache.page.000000 of size 592621440 bytes
2022-01-13T08:12:51.193262Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (27632) with SIGKILL after stalling for 120 seconds
2022-01-13T08:12:51.217686Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 183: 27634 killed               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-13T08:12:51.217754Z 0 [Note] [MY-000000] [WSREP-SST]      27635                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-13T08:12:51.218372Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR ********************** 
2022-01-13T08:12:51.218550Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-13T08:12:51.218628Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-13T08:12:51.218722Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2022-01-13T08:12:51.219631Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-13T08:12:51.431617Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.230.245.214' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '27097' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-13T08:12:51.431820Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-13T08:12:51.431892Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-13T08:12:51.432257Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-13T08:12:51.432372Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-13T08:12:51.432458Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

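I also suspected the firewall configuration, so I removed all firewall settings and turned the firewall off, but the error persisted.
To avoid affecting the business, I ran the service on two nodes for the time being.
I also posted the problem on the official Percona forum and received the following replies: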

【matthewb Percona】
Your log indicates that port 4444 is not open TCP/UDP to all hosts. Make sure all necessary ports (3306, 4444, 4567, 4568) are open between all nodes.

【liking】
Thanks for your reply, but I am sure I have closed firewall between all nodes. Maybe there is some other issues?

【Evgeniy_Patlan Percona】
"while getting data from donor node: exit codes: 137 137"
Such issue appeared once it is not possible to connect to the needed port. So please recheck your firewall options

【matthewb Percona】
"I am sure I have closed firewall between all nodes"
That’s your problem. You need to OPEN the firewall between nodes, not close it. Use socat or nc to test connectivity between nodes on the ports I mentioned.

【liking】
Many thanks to you all, I will do this according to your suggest

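As the thread shows, Percona was confident that the cause was the network port configuration. Changing the network setup was not convenient at the time, so I decided to retry when an opportunity came up.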
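
The connectivity test suggested in the thread can be sketched as follows. This is a minimal example under assumptions, not a Percona tool: check_port is a hypothetical helper built on bash's /dev/tcp and the coreutils timeout command, and the peer addresses are passed as arguments.

```shell
#!/usr/bin/env bash
# Ports listed in the Percona reply.
PXC_PORTS="3306 4444 4567 4568"

# check_port HOST PORT: attempt a TCP connection via bash's /dev/tcp.
# Returns 0 on success, 124 if timeout(1) gave up after 3 seconds,
# and another nonzero value if the connection failed immediately.
check_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage: ./check_ports.sh PEER_ADDRESS [PEER_ADDRESS ...]
for host in "$@"; do
    for port in $PXC_PORTS; do
        check_port "$host" "$port"
        case $? in
            0)   echo "$host:$port open" ;;
            124) echo "$host:$port no answer within 3s (packets possibly dropped)" ;;
            *)   echo "$host:$port connection failed (refused, rejected, or unreachable)" ;;
        esac
    done
done
```

A timeout usually points to packets being dropped on the way, while an immediate failure can also mean that nothing is listening. As the logs above show, the joiner only starts its socat listener on port 4444 while an SST is in progress, so a refusal on 4444 from an idle node does not by itself prove a firewall problem.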

A few days later I retried and then posted this reply on the official forum:
It is ok now.
According to your suggest, I modified the netfilter rules on all nodes like this:

Accept all input
Clear all netfilter rules

Now the cluster works fine.
The exact commands were:

[root@db-1 ~]#  iptables -P INPUT ACCEPT
[root@db-1 ~]#  iptables -F
[root@db-1 ~]#  iptables -X
[root@db-1 ~]#  iptables -Z
[root@db-1 ~]#  iptables -A INPUT -i lo -j ACCEPT
[root@db-1 ~]#  iptables-save
#Generated by iptables-save v1.4.21 on Mon Jan 24 11:33:23 2022
*filter
:INPUT ACCEPT [884:105489]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [685:162312]
-A INPUT -i lo -j ACCEPT
COMMIT
#Completed on Mon Jan 24 11:33:23 2022
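
Flushing every rule and accepting all input fixed the join, but it also removes any other filtering on the hosts. A narrower variant would allow only the four ports named in the Percona reply, and only from the other cluster members. The following is a sketch under assumptions rather than what was run here: pxc_rules is a hypothetical helper, the peer addresses are placeholders, and it only prints the iptables commands so that they can be reviewed before being applied.

```shell
#!/usr/bin/env bash
# pxc_rules PEER [PEER ...]: print (do not apply) one iptables command per
# peer and per PXC port. -I inserts at the top of INPUT, so the rule is
# evaluated before any existing DROP or REJECT rule.
pxc_rules() {
    local peer port
    for peer in "$@"; do
        for port in 3306 4444 4567 4568; do
            echo "iptables -I INPUT -p tcp -s ${peer} --dport ${port} -j ACCEPT"
        done
    done
}

# Example with placeholder addresses:
#   pxc_rules 10.0.0.2 10.0.0.3
```

With two peers, as in the example, the function prints eight commands. Note that iptables-save writes the current rule set to standard output and does not by itself make the rules persist across a reboot.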
