PXC集群第3个节点无法加入故障处理

本文涉及的产品
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
RDS MySQL Serverless 高可用系列,价值2615元额度,1个月
简介: PXC集群第3个节点无法加入故障处理

一个PXC 8.0.23集群,因为项目操作导致无法提供服务了,提示信息为:
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
或者
2013 - Lost connection to MySQL server during query
登录各个节点查看集群wsrep_cluster_size均为0,节点状态wsrep_cluster_status都不是Primary状态(好像是not connected),查看grastate.dat文件,3号节点safe_to_bootstrap为1.
因此关闭各个节点,在3号节点启动集群,之后顺利将2号加入,可是在加入1号是遭遇错误如下:

2022-01-12T11:12:43.552286Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-12T11:20:32.979860Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (16448) with SIGKILL after stalling for 120 seconds
2022-01-12T11:20:33.010860Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: 行 183: 16450 已杀死               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-12T11:20:33.010931Z 0 [Note] [MY-000000] [WSREP-SST]      16451                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-12T11:20:33.011525Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2022-01-12T11:20:33.011676Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-12T11:20:33.011756Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-12T11:20:33.011874Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2022-01-12T11:20:33.012861Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-12T11:20:33.210760Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.222.50.101' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '15908' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-12T11:20:33.210898Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-12T11:20:33.210973Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-12T11:20:33.211182Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-12T11:20:33.211268Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-12T11:20:33.211352Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

网搜的文章五花八门,参考过几个文章,均没用。因为看到错误日志信息--address '10.222.50.101',一度怀疑配置参数wsrep_node_address是否需要显式指定,因为都是默认注释掉的,显式指定后仍然报错如下:

2022-01-13T08:03:32.978322Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST.........
2022-01-13T08:03:33.036563Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2022-01-13T08:12:38.715388Z 0 [Note] [MY-000000] [Galera] Created page /mysql/pxc/data/gcache.page.000000 of size 592621440 bytes
2022-01-13T08:12:51.193262Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (27632) with SIGKILL after stalling for 120 seconds
2022-01-13T08:12:51.217686Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 183: 27634 killed               socat -u openssl-listen:4444,reuseaddr,cert=/mysql/pxc/data//server-cert.pem,key=/mysql/pxc/data//server-key.pem,cafile=/mysql/pxc/data//ca.pem,verify=1,retry=30 stdio
2022-01-13T08:12:51.217754Z 0 [Note] [MY-000000] [WSREP-SST]      27635                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2022-01-13T08:12:51.218372Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR ********************** 
2022-01-13T08:12:51.218550Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2022-01-13T08:12:51.218628Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1268
2022-01-13T08:12:51.218722Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2022-01-13T08:12:51.219631Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2022-01-13T08:12:51.431617Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.230.245.214' --datadir '/mysql/pxc/data/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '27097' --mysqld-version '8.0.23-14.1'   '' : 32 (Broken pipe)
2022-01-13T08:12:51.431820Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-01-13T08:12:51.431892Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2022-01-13T08:12:51.432257Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-01-13T08:12:51.432372Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2022-01-13T08:12:51.432458Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

也怀疑过防火墙配置问题,去掉所有的配置,并关闭防火墙还是报错依旧。
为了不影响业务,只好先用2个节点提供服务,恢复业务。
同时到官网提交了这个问题,得到了官方回复如下:

【matthewb Percona】
Your log indicates that port 4444 is not open TCP/UDP to all hosts. Make sure all necessary ports (3306, 4444, 4567, 4568) are open between all nodes.

【liking】
Thanks for your reply, but I am sure I have closed firewall between all nodes. Maybe there is some other issues?

【Evgeniy_Patlan Percona】
"while getting data from donor node: exit codes: 137 137"
Such issue appeared once it is not possible to connect to the needed port. So please recheck your firewall options

【matthewb Percona】
"I am sure I have closed firewall between all nodes"
That’s your problem. You need to OPEN the firewall between nodes, not close it. Use socat or nc to test connectivity between nodes on the ports I mentioned.

【liking】
Many thanks to you all, I will do this according to your suggest

看到了,官方很肯定是网络端口设置的原因,由于目前网络不太方便,择机再试。

数天后,择机重试,在官方论坛回复如下:
It is ok now.
According to your suggest, I modified the netfilter rules on all nodes like this:

  1. Accept all input
  2. Clear all netfilter rules

Now the cluster works fine.
以下是具体的操作步骤:

[root@db-1 ~]#  iptables -P INPUT ACCEPT
[root@db-1 ~]#  iptables -F
[root@db-1 ~]#  iptables -X
[root@db-1 ~]#  iptables -Z
[root@db-1 ~]#  iptables -A INPUT -i lo -j ACCEPT
[root@db-1 ~]#  iptables-save
#Generated by iptables-save v1.4.21 on Mon Jan 24 11:33:23 2022
*filter
:INPUT ACCEPT [884:105489]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [685:162312]
-A INPUT -i lo -j ACCEPT
COMMIT
#Completed on Mon Jan 24 11:33:23 2022
相关实践学习
如何在云端创建MySQL数据库
开始实验后,系统会自动创建一台自建MySQL的 源数据库 ECS 实例和一台 目标数据库 RDS。
全面了解阿里云能为你做什么
阿里云在全球各地部署高效节能的绿色数据中心,利用清洁计算为万物互联的新世界提供源源不断的能源动力,目前开服的区域包括中国(华北、华东、华南、香港)、新加坡、美国(美东、美西)、欧洲、中东、澳大利亚、日本。目前阿里云的产品涵盖弹性计算、数据库、存储与CDN、分析与搜索、云通信、网络、管理与监控、应用服务、互联网中间件、移动服务、视频服务等。通过本课程,来了解阿里云能够为你的业务带来哪些帮助     相关的阿里云产品:云服务器ECS 云服务器 ECS(Elastic Compute Service)是一种弹性可伸缩的计算服务,助您降低 IT 成本,提升运维效率,使您更专注于核心业务创新。产品详情: https://www.aliyun.com/product/ecs
目录
相关文章
|
4月前
|
存储 Kubernetes 监控
在K8S中,worke节点如何加入K8S高可用集群?
在K8S中,worke节点如何加入K8S高可用集群?
|
4月前
|
存储 缓存 Kubernetes
在K8S中,集群节点宕机,可能由哪些原因造成?
在K8S中,集群节点宕机,可能由哪些原因造成?
|
7月前
|
存储 索引
|
NoSQL Redis 数据中心
Redis 集群偶数节点跨地域部署之高可用测试
你搭建过偶数节点的 Redis 集群吗?有没有想过它是否具备高可用的能力?会不会脑裂呢?实践出真知!现在 docker 太方便了,搭一个集群模拟一下……
168 4
|
负载均衡
本篇关于集群
本篇关于集群
113 1
|
存储 NoSQL
Cassandra集群删除宕机节点
Cassandra集群删除宕机节点
Cassandra集群删除宕机节点
|
NoSQL Redis 开发者
集群-主从下线与主从切换|学习笔记
快速学习集群-主从下线与主从切换
|
SQL 前端开发 关系型数据库
PXC集群脑裂导致节点是无法加入无主的集群
PXC集群脑裂导致节点是无法加入无主的集群
253 0
|
云安全 运维 Kubernetes
如何让你的k8s集群更安全
近期在阿里云容器团队与Palo Alto Networks安全团队的联合调研中发现有大量的自建Kubernetes集群存在不同程度的安全隐患,本文主要介绍了Kubernetes集群使用过程中可能遇到的安全风险,同时如何利用阿里云容器服务,容器镜像服务的安全能力和Palo Alto Networks的容器安全解决方案提升Kubernetes集群的整体安全性。
3358 0
如何让你的k8s集群更安全
|
关系型数据库 MySQL
高可用PXC
1.Percona XtraDB Cluster的搭建 安装环境: 节点1:A: 192.168.91.18 节点2:B:192.168.91.20 节点3:C:192.168.
1250 0