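Summary up front
- Prelude to the summary
This round took a lot of trial and error. I first parsed the binlog, located the failing position, and converted the event by hand into SQL to run on the replica. That is tolerable for a handful of errors but hopeless for many, and the hand-converted SQL is not guaranteed to execute anyway. My master-master (each server is both master and slave of the other) environment had a great many problems, so in the end the only workable approach was to keep skipping the failing GTID transactions, without knowing in advance how many failing GTIDs there were. I did not want to rebuild the master-master setup, and in production you cannot casually rebuild it either.
- The main point
Whether it is production or test, if a master-master environment hits this kind of problem and you cannot or do not want to rebuild it, proceed as follows:
- Stop the upper-layer application so that nothing reads from or writes to the database;
- On both servers (each acts as a slave of the other), keep skipping the failing GTIDs until the SQL thread on both shows Yes;
- Once both SQL threads show Yes, keep watching both slaves for a while with "show replica status\G;". More failing GTIDs may still turn up; skip them the same way;
- When, after a period of observation, no further failing GTIDs appear on either slave, stop replication and then the MySQL service in the normal order, start MySQL again in the normal order, and check that the IO and SQL threads on both MySQL servers show Yes;
- When both directions look healthy, create a test database on each master and verify that it replicates to the corresponding slave;
- Only once the master-master MySQL environment is really, truly healthy, bring the upper-layer application back up. My application connects to just one of the two servers; I did not set up anything fancy like read/write splitting.
Right, on to the full troubleshooting and fix, step by step!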
Master-slave environment
Role | Hostname | IP |
master | db01 | 192.168.11.151 |
slave | db02 | 192.168.11.152 |
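Note: my environment uses GTID-based replication. I plan to write up GTID mode versus traditional (binlog position) mode separately when time allows.
Symptoms
- On the slave, the SQL thread shows No (Replica_SQL_Running: No). Details: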
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000005
Read_Source_Log_Pos: 1986340
Relay_Log_File: zbx-db02-relay-bin.000012
Relay_Log_Pos: 630974
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1452
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
- Master binlog file: mysql-bin.000001 (master log)
- End position in the master binlog: 634234 (end_log_pos)
- Check the slave's error log
... 2022-05-10T08:06:05.008230Z 69 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 634234, Error_code: MY-001452 ...
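- A quick analysis of the error log:
The error log shows that event_recovery has a foreign key constraint named c_event_recovery_1: its eventid column is a foreign key referencing the eventid column of the events table. In other words, events is the parent table and event_recovery is the child table. Replication tried to insert a row into the child table whose parent row does not exist, so the insert failed.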
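The fields that matter in such a log line (GTID, binlog file, end position, table, error code) can be pulled out mechanically. Below is a minimal Python sketch, assuming the line layout matches the sample above; the name parse_applier_error and the regular expression are illustrative and not part of MySQL, and other server versions may word the message differently.

```python
import re

# Pattern derived from the single sample line above (an assumption, not a
# documented format); other MySQL versions may phrase the message differently.
ERROR_RE = re.compile(
    r"failed executing transaction '(?P<gtid>[0-9a-f-]+:\d+)' "
    r"at master log (?P<binlog>\S+), end_log_pos (?P<pos>\d+); "
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+); "
    r".*?Error_code: (?P<code>\d+);"
)


def parse_applier_error(line):
    """Return the key fields of a replication applier error, or None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pos"] = int(info["pos"])    # end_log_pos as an integer
    info["code"] = int(info["code"])  # e.g. 1452 or 1062
    return info
```

The resulting dictionary tells you which binlog file and position to decode on the master and which GTID is stuck.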
Further analysis and investigation
- On the slave, look at the CREATE TABLE statement of event_recovery
mysql> show create table zabbix.event_recovery\G;
*************************** 1. row ***************************
Table: event_recovery
Create Table: CREATE TABLE `event_recovery` (
  `eventid` bigint unsigned NOT NULL,
  `r_eventid` bigint unsigned NOT NULL,
  `c_eventid` bigint unsigned DEFAULT NULL,
  `correlationid` bigint unsigned DEFAULT NULL,
  `userid` bigint unsigned DEFAULT NULL,
  PRIMARY KEY (`eventid`),
  KEY `event_recovery_1` (`r_eventid`),
  KEY `event_recovery_2` (`c_eventid`),
  CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_3` FOREIGN KEY (`c_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin
1 row in set (0.00 sec)
ERROR: No query specified
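Constraint c_event_recovery_1: the child table is event_recovery and the parent table is events; the child's eventid column references the parent's eventid column. Note that r_eventid and c_eventid reference events.eventid as well, through c_event_recovery_2 and c_event_recovery_3.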
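Reading the constraints out of a long CREATE TABLE statement by eye is easy to get wrong. As a minimal sketch in Python, assuming the constraint lines look exactly like the ones printed above (single-column foreign keys with backtick quoting), they can be listed with a regular expression; the function name foreign_keys is hypothetical.

```python
import re

# Matches single-column foreign keys in the layout SHOW CREATE TABLE printed
# above; multi-column keys are deliberately not handled in this sketch.
FK_RE = re.compile(
    r"CONSTRAINT `(?P<name>\w+)` FOREIGN KEY \(`(?P<column>\w+)`\) "
    r"REFERENCES `(?P<parent>\w+)` \(`(?P<parent_column>\w+)`\)"
)


def foreign_keys(create_table_sql):
    """List the foreign keys found in SHOW CREATE TABLE output."""
    return [match.groupdict() for match in FK_RE.finditer(create_table_sql)]
```

For event_recovery this yields three entries, one each for eventid, r_eventid, and c_eventid, all pointing at events.eventid.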
- On the master, decode mysql-bin.000001 and find end position 634234
[root@zbx-db01 ~]# mysqlbinlog -v --base64-output=decode-rows --stop-position=634234 /data/mysql_data/mysql-bin.000001 | tail -20
# Content found at position 634234:
#220507 22:04:25 server id 5 end_log_pos 634234 CRC32 0xf6000142 Write_rows: table id 174 flags: STMT_END_F
### INSERT INTO `zabbix`.`event_recovery`
### SET
###   @1=21751
###   @2=22357
###   @3=NULL
###   @4=NULL
###   @5=NULL
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
- Based on the decoded output, the event at position 634234 on the master, converted by hand into an executable statement, looks like this:
INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
Keep in mind that eventid in the child table event_recovery references eventid in the parent table events.
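The hand conversion from the decoded "### INSERT INTO ... ### SET ### @1=..." lines to an executable statement is mechanical. Here is a minimal Python sketch of that step, assuming a single-row Write_rows event decoded with -v (no type annotations after the values) and assuming the printed values are already valid SQL literals; the function name pseudo_insert_to_sql is hypothetical.

```python
import re


def pseudo_insert_to_sql(decoded_lines):
    """Turn '### INSERT INTO ... / ### SET / ### @n=value' lines into one INSERT.

    Values are copied as mysqlbinlog printed them and are positional, so the
    statement relies on the table's column order being the same on the replica.
    """
    table = None
    values = []
    for raw in decoded_lines:
        line = raw.strip()
        if not line.startswith("###"):
            continue  # skip headers, ROLLBACK, DELIMITER and similar lines
        body = line[3:].strip()
        if body.startswith("INSERT INTO "):
            table = body[len("INSERT INTO "):]
        else:
            match = re.match(r"@\d+=(.*)$", body)
            if match:
                values.append(match.group(1))
    if table is None or not values:
        raise ValueError("no decoded INSERT found")
    return "INSERT INTO {} VALUES ({});".format(table, ", ".join(values))
```

Feeding it the decoded lines for position 634234 reproduces the statement written by hand above, apart from spacing and keyword case.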
- On the master, check whether the parent table events has a row with eventid 21751
mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
eventid: 21751
source: 0
object: 0
objectid: 13560
clock: 1651925304
value: 1
acknowledged: 0
ns: 381417199
name: Zabbix task manager processes more than 75% busy
severity: 3
1 row in set (0.00 sec)
The row exists on the master.
- Then run the same check on the slave
mysql> select * from zabbix.events where eventid=21751;
Empty set (0.01 sec)
On the slave, the parent table events has no row with eventid 21751, which is why the replication SQL thread failed when it applied this insert automatically.
- On the slave, try running the hand-converted statement. The insert into event_recovery fails with ERROR 1452: cannot add or update a child row, a foreign key constraint fails
mysql> INSERT INTO zabbix.event_recovery values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE)
It fails as well. Inserting into the child table without the parent row fails whether the statement is run by hand or applied by replication.
Solution and process
- On the master, query the events row with eventid 21751 and build an executable INSERT statement from it
# First query on the master
mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
eventid: 21751
source: 0
object: 0
objectid: 13560
clock: 1651925304
value: 1
acknowledged: 0
ns: 381417199
name: Zabbix task manager processes more than 75% busy
severity: 3
1 row in set (0.13 sec)
ERROR: No query specified
# Build the executable INSERT statement
insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);
- Run the statement on the slave, so that the slave's parent table events gets the same row as the master's
mysql> insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);
Query OK, 1 row affected (0.00 sec)
- Then rerun the originally failing insert into the child table event_recovery on the slave. It fails again, this time on a different constraint
mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE)
This time the failing constraint is c_event_recovery_2, which is a new problem. According to the CREATE TABLE statement of event_recovery, c_event_recovery_2 is defined as:
CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE
- On the master, look at the structure of the child table event_recovery
mysql> desc zabbix.event_recovery;
+---------------+-----------------+------+-----+---------+-------+
| Field         | Type            | Null | Key | Default | Extra |
+---------------+-----------------+------+-----+---------+-------+
| eventid       | bigint unsigned | NO   | PRI | NULL    |       |
| r_eventid     | bigint unsigned | NO   | MUL | NULL    |       |
| c_eventid     | bigint unsigned | YES  | MUL | NULL    |       |
| correlationid | bigint unsigned | YES  |     | NULL    |       |
| userid        | bigint unsigned | YES  |     | NULL    |       |
+---------------+-----------------+------+-----+---------+-------+
5 rows in set (0.01 sec)
So the second value in values(21751,22357,NULL,NULL,NULL), 22357, is the r_eventid column, and the events row it references is also missing on the slave.
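Checking every foreign key value of the row up front avoids discovering the missing parents one failed insert at a time. A minimal Python sketch, assuming all foreign keys of the row point at zabbix.events.eventid as they do for event_recovery; the function name parent_row_checks and its parameters are hypothetical.

```python
def parent_row_checks(columns, values, fk_columns,
                      parent="zabbix.events", parent_key="eventid"):
    """Build one SELECT per non-NULL foreign key value in the row.

    columns    -- column names in table order (as shown by DESC)
    values     -- the row values in the same order, None for NULL
    fk_columns -- columns that reference parent.parent_key
    """
    row = dict(zip(columns, values))
    checks = []
    for column in fk_columns:
        value = row[column]
        if value is None:
            continue  # a NULL foreign key value needs no parent row
        checks.append(
            "SELECT COUNT(*) FROM {} WHERE {} = {};".format(
                parent, parent_key, int(value)
            )
        )
    return checks
```

For the row (21751, 22357, NULL, NULL, NULL) this produces two lookups, for 21751 and 22357; running both on the slave would have revealed both missing parent rows at once.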
- Check on both the master and the slave whether events has a row with eventid 22357; if the slave lacks it, build the INSERT statement
# On the master: the row exists
mysql> select * from zabbix.events where eventid=22357\G;
*************************** 1. row ***************************
eventid: 22357
source: 0
object: 0
objectid: 13560
clock: 1651932264
value: 0
acknowledged: 0
ns: 578308822
name: Zabbix task manager processes more than 75% busy
severity: 0
1 row in set (0.00 sec)
ERROR: No query specified
# On the slave: sure enough, it is missing
mysql> select * from zabbix.events where eventid=22357;
Empty set (0.00 sec)
# Build the executable INSERT statement
insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);
- Run the statement on the slave, so that the slave's parent table events gets the same row as the master's
mysql> insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);
Query OK, 1 row affected (0.09 sec)
- Then rerun the originally failing insert into the child table event_recovery on the slave. This time it succeeds.
mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
Query OK, 1 row affected (0.00 sec)
- Next, start replication and check its status
mysql> start replica;
Query OK, 0 rows affected (0.01 sec)
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000006
Read_Source_Log_Pos: 236
Relay_Log_File: zbx-db02-relay-bin.000012
Relay_Log_Pos: 630974
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
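The SQL thread is still No, and the position is now end_log_pos 634116. The position has changed, so this is a different problem. To find out what it is, check the MySQL error log and decode the corresponding binlog file again. The routine is the same, so carry on.
Continuing: the problem at position 634116
- On the slave, filter the MySQL error log for Last_Errno 1062 and look at the most recent entry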
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T00:53:03.919190Z 9 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062
Rough analysis: the Write_rows event could not be executed on table zabbix.events because of a duplicate entry '22357' for key 'events.PRIMARY' (error code 1062).
- On the master, decode the binlog file mysql-bin.000001, find end_log_pos 634116, and convert the event by hand into an executable SQL statement
# Decode the binlog
mysqlbinlog -v --base64-output=decode-rows --stop-position=634116 /data/mysql_data/mysql-bin.000001
# Content found at position 634116 after decoding:
#220507 22:04:25 server id 5 end_log_pos 634116 CRC32 0xf54d313b Write_rows: table id 113 flags: STMT_END_F
### INSERT INTO `zabbix`.`events`
### SET
###   @1=22357
###   @2=0
###   @3=0
###   @4=13560
###   @5=1651932264
###   @6=0
###   @7=0
###   @8=578308822
###   @9='Zabbix task manager processes more than 75% busy'
###   @10=0
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
# Hand-converted executable SQL statement
INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
- Run the converted statement on the slave
mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
ERROR 1062 (23000): Duplicate entry '22357' for key 'events.PRIMARY'
It fails with the same error that appeared in the slave's MySQL error log, so the statement fails whether the slave's SQL thread applies it or I run it by hand. The primary key of events is eventid (check the table structure to confirm), so row 22357 already exists and inserting it again is a duplicate. A primary key must be unique, hence the error.
Thinking back, row 22357 was inserted by hand earlier, in the "Solution and process" section.
- Try to fix it on the slave
mysql> delete from zabbix.events where eventid=22357;
Query OK, 1 row affected (0.54 sec)
mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
Query OK, 1 row affected (0.00 sec)
mysql> start replica;
Query OK, 0 rows affected (0.02 sec)
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000007
Read_Source_Log_Pos: 236
Relay_Log_File: zbx-db02-relay-bin.000024
Relay_Log_Pos: 324
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Source_Log_Pos: 633751
Relay_Log_Space: 41924759
- Check the slave's MySQL error log again (the most recent entry is enough)
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062
Row 22357 already exists in events (eventid is the primary key). Strange: it is the same duplicate-key problem, on the very same row 22357. Deleting the row and inserting it again by hand left it in place, so replaying transaction 768 collides with it just as before. So was the work in the earlier "Solution and process" section all for nothing? Argh...
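A new approach
The new approach is to skip the specific GTID transaction, that is, to ignore the primary key conflict on the slave. Note that my master-slave environment runs in GTID mode.
- Replication was stopped earlier; start it again
mysql> start replica;
Query OK, 0 rows affected (0.02 sec)
- Check the replication status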
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000007
Read_Source_Log_Pos: 236
Relay_Log_File: zbx-db02-relay-bin.000024
Relay_Log_Pos: 324
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Source_Log_Pos: 633751
Relay_Log_Space: 41925179
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Source_Server_Id: 5
Source_UUID: 92099aae-4731-11ec-a3da-00505629525b
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State:
Source_Retry_Count: 86400
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 220511 10:43:53
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:768-71845   # GTID transactions received from the source
Executed_Gtid_Set: 9208096f-4731-11ec-a23e-005056210589:1-55,   # GTID transactions already executed
92099aae-4731-11ec-a3da-00505629525b:1-767
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
1 row in set (0.00 sec)
ERROR: No query specified
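- Retrieved_Gtid_Set (GTID transactions received from the source): 92099aae-4731-11ec-a3da-00505629525b:768-71845
- Executed_Gtid_Set (GTID transactions already executed): 9208096f-4731-11ec-a23e-005056210589:1-55,92099aae-4731-11ec-a3da-00505629525b:1-767
- Deeper analysis of the failure
- The normal inference is as follows:
The output above shows that the slave has received transactions '92099aae-4731-11ec-a3da-00505629525b:768-71845' from the master and has executed (Executed_Gtid_Set) up to '92099aae-4731-11ec-a3da-00505629525b:1-767'. Transaction 768 is therefore the next one to be applied.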
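The same comparison can be done programmatically when the GTID sets get long. Below is a minimal Python sketch, assuming the classic uuid:interval[:interval] notation shown in the status output (tagged GTIDs from newer MySQL releases are not handled); the helper names parse_gtid_set and is_executed are hypothetical.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-55,uuid:768-71845' into {uuid: [(start, end), ...]}."""
    result = {}
    # SHOW REPLICA STATUS wraps long sets across lines, so drop newlines first.
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid, [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def is_executed(gtid, executed):
    """Return True if a single GTID 'uuid:n' lies inside the executed set."""
    uuid, _, number = gtid.rpartition(":")
    n = int(number)
    return any(low <= n <= high for low, high in executed.get(uuid, []))
```

With the Executed_Gtid_Set above, transaction 767 from the master counts as executed and 768 does not, which matches the failing transaction named in the error message.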
- From the slave's MySQL error log seen earlier:
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062
Note this part of the error: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768'. So the transaction that failed is '92099aae-4731-11ec-a3da-00505629525b:768'.
A bold guess: would it be enough to skip transaction '92099aae-4731-11ec-a3da-00505629525b:768'? In other words, when master and slave hit a primary key conflict (a duplicate), as they have here, can the transaction be skipped by injecting an empty transaction? I decided to try.
- On the slave, skip the specific GTID transaction
# Stop replication
stop replica;
# Set the GTID of the next transaction to the one to skip, here '92099aae-4731-11ec-a3da-00505629525b:768'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:768';
# Begin and commit immediately, which injects an empty transaction under that GTID
begin;
commit;
# Return to automatic GTID assignment
set gtid_next='AUTOMATIC';
# Start replication again
start replica;
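Since the skip sequence is always the same apart from the GTID, it can be generated rather than retyped. Here is a minimal Python sketch that validates the GTID and returns the statements in order; the function name skip_gtid_statements and the validation pattern are illustrative assumptions. Because gtid_next is a session variable, the statements need to run in one and the same session.

```python
import re

# A single GTID in the classic 'uuid:number' form; ranges are rejected so that
# one call always corresponds to exactly one empty transaction.
GTID_RE = re.compile(
    r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:\d+$", re.IGNORECASE
)


def skip_gtid_statements(gtid):
    """Return the SQL statements that inject an empty transaction for gtid."""
    if not GTID_RE.match(gtid):
        raise ValueError("not a single GTID: {!r}".format(gtid))
    return [
        "STOP REPLICA;",
        "SET gtid_next = '{}';".format(gtid),
        "BEGIN;",
        "COMMIT;",  # empty transaction recorded under the skipped GTID
        "SET gtid_next = 'AUTOMATIC';",
        "START REPLICA;",
    ]
```

Printing the returned list for '92099aae-4731-11ec-a3da-00505629525b:768' gives the same six statements as the manual procedure, ready to paste into a mysql session on the slave.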
- After skipping the GTID, check the replication status on the slave again
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000014
Read_Source_Log_Pos: 236
Relay_Log_File: zbx-db02-relay-bin.000045
Relay_Log_Pos: 324
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1452
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Argh!!! The SQL thread is still No. There are clearly a lot of problems, so fix them one at a time. This is a new problem, though: the transaction is now '92099aae-4731-11ec-a3da-00505629525b:784' and the position has changed to 643505.
- On the slave, check the MySQL error log for error code 1452 (the most recent entry is enough)
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1452
2022-06-06T02:41:00.836059Z 18 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 643505, Error_code: MY-001452
This time the cause is a foreign key constraint failure rather than a primary key conflict. The GTID is: 92099aae-4731-11ec-a3da-00505629525b:784
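Continuing: the foreign key constraint failure (skip the specific GTID transaction)
According to the error above, the binlog coordinates are master log mysql-bin.000001, end_log_pos 643505. I decided to keep using the skip-the-GTID approach.
- On the slave, skip the specific GTID transaction
# Stop replication
stop replica;
# Set the GTID of the next transaction to the one to skip, here '92099aae-4731-11ec-a3da-00505629525b:784'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:784';
# Begin and commit immediately, which injects an empty transaction under that GTID
begin;
commit;
# Return to automatic GTID assignment
set gtid_next='AUTOMATIC';
# Start replication again
start replica;
- Check the replica status on the slave again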
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000014
Read_Source_Log_Pos: 236
Relay_Log_File: zbx-db02-relay-bin.000045
Relay_Log_Pos: 22387
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1452
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:1069' at master log mysql-bin.000001, end_log_pos 772696. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
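Still not fixed: the SQL thread is still No and the transaction has changed again, this time to '92099aae-4731-11ec-a3da-00505629525b:1069'. Skip this GTID in the same way as before.
The last resort: skip every failing GTID transaction
By now it is clear that there are too many problems. Master and slave are badly out of sync, and primary key conflicts, constraint failures, and so on keep stopping the slave's SQL thread. There is no telling how many more problems remain, so decoding the binlog, finding the position, and converting each event into SQL by hand is no longer realistic. The only remaining option, the last resort, is to skip every GTID transaction that fails. Note that my master-slave environment runs in GTID mode; without GTID mode this last resort does not apply.
Keep skipping the failing GTIDs as follows:
stop replica;
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:1069';
begin;
commit;
set gtid_next='AUTOMATIC';
start replica;
show replica status\G;
Every failing GTID is skipped the same way. Each failing transaction has a different GTID, so set gtid_next to the failing GTID and leave the other commands and the order of steps unchanged.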
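Repeating this by hand for dozens of GTIDs is tedious, so the loop itself can be scripted. The Python sketch below only shows the control flow under stated assumptions: fetch_status and execute are hypothetical hooks that you would wire to a single MySQL session (gtid_next is a session variable), the failing GTID is taken from Last_SQL_Error, and a real run would also need to wait after START REPLICA before reading the status again.

```python
import re

GTID_IN_ERROR = re.compile(r"failed executing transaction '([0-9a-f-]+:\d+)'")


def skip_until_running(fetch_status, execute, max_skips=100):
    """Skip failing GTIDs until the SQL thread runs; return the skipped GTIDs.

    fetch_status() -> dict of SHOW REPLICA STATUS fields (hypothetical hook)
    execute(sql)   -> runs one statement on the replica  (hypothetical hook)
    """
    skipped = []
    while len(skipped) < max_skips:
        status = fetch_status()
        if status.get("Replica_SQL_Running") == "Yes":
            return skipped
        match = GTID_IN_ERROR.search(status.get("Last_SQL_Error", ""))
        if match is None:
            raise RuntimeError("SQL thread stopped without a GTID in Last_SQL_Error")
        gtid = match.group(1)
        # Same sequence as the manual procedure: inject an empty transaction.
        for sql in ("STOP REPLICA",
                    "SET gtid_next = '{}'".format(gtid),
                    "BEGIN",
                    "COMMIT",
                    "SET gtid_next = 'AUTOMATIC'",
                    "START REPLICA"):
            execute(sql)
        skipped.append(gtid)
    raise RuntimeError("gave up after {} skips".format(max_skips))
```

The returned list is worth keeping: it records exactly which transactions were skipped, which is the starting point for any later data comparison between the two servers. The max_skips limit is an arbitrary safety stop so that a GTID that keeps failing cannot loop forever.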
After the last resort, the SQL threads in the master-master environment are back to normal
- master (192.168.11.152), slave (192.168.11.151)
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.152
Source_User: syn_b
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000010
Read_Source_Log_Pos: 42583848
Relay_Log_File: zbx-db01-relay-bin.000050
Relay_Log_Pos: 922
Relay_Source_Log_File: mysql-bin.000010
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
- master (192.168.11.151), slave (192.168.11.152)
mysql> show replica status\G;
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.11.151
Source_User: syn_a
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000014
Read_Source_Log_Pos: 2178
Relay_Log_File: zbx-db02-relay-bin.000084
Relay_Log_Pos: 404
Relay_Source_Log_File: mysql-bin.000014
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
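Checking the two threads on both servers by eye is easy to get wrong during a long observation period. A minimal Python sketch, assuming you capture the vertical status output as text (for example from the mysql client); the helper names parse_status and replication_healthy are hypothetical.

```python
def parse_status(text):
    """Parse the 'Field: value' lines of vertical SHOW REPLICA STATUS output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Field names consist of letters and underscores only; this skips the
        # row separator and wrapped GTID-set continuation lines.
        if sep and key.replace("_", "").isalpha():
            fields[key] = value.strip()
    return fields


def replication_healthy(fields):
    """True only when both the IO thread and the SQL thread report Yes."""
    return (fields.get("Replica_IO_Running") == "Yes"
            and fields.get("Replica_SQL_Running") == "Yes")
```

Running replication_healthy(parse_status(output)) against the output of each server gives a simple pass or fail for each replication direction.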
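Closing self-reflection
Why did this happen? After some self-reflection, the core cause is this: my setup runs on virtual machines, and precisely because they are virtual machines I paid no attention to the start and stop order when shutting down. For convenience I even forcibly powered off the VMs running MySQL and the upper-layer application, which eventually caused this whole series of database problems. This test environment taught me a hard lesson. If you maintain even a test environment carelessly and without rigor, the habit sticks, and sooner or later a production environment you maintain will be ruined by your own hands.
The correct start and stop order
Assume the application's backend database is a master-master architecture, as in my test environment, with no other databases or middleware involved.
Start
- Start the master-master databases (master, slave) and check that the replica status is healthy;
- Start the upper-layer application;
Stop
- Stop the upper-layer application;
- On the master-master databases (master, slave), stop the replica first, then stop the mysql service;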
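The two orderings can be written down as a checklist generator so that nobody has to remember them under pressure. This is a minimal Python sketch that only produces the ordered steps as text; the function names, the node names, and the choice to finish each step on all nodes before moving to the next are assumptions, and the actual service commands depend on how MySQL is installed.

```python
def startup_plan(db_nodes, app="application"):
    """Ordered start steps: databases first, application last."""
    steps = ["{}: start MySQL service".format(node) for node in db_nodes]
    steps += ["{}: check SHOW REPLICA STATUS".format(node) for node in db_nodes]
    steps.append("start {}".format(app))
    return steps


def shutdown_plan(db_nodes, app="application"):
    """Ordered stop steps: application first, then replication, then MySQL."""
    steps = ["stop {}".format(app)]
    steps += ["{}: STOP REPLICA".format(node) for node in db_nodes]
    steps += ["{}: stop MySQL service".format(node) for node in db_nodes]
    return steps
```

For example, shutdown_plan(["db01", "db02"]) lists the application stop first and the two MySQL service stops last.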