写在开篇的总结

总结的前戏

通过这次不断的折腾，各种折腾。先是解析Binlog，找到指定的位置，手动转化为SQL去执行，问题不多还好，问题多了这个办法就悲剧了。而且，手动转化为可执行的SQL未必能执行成功。笔者的互为主从环境问题非常多，只能不断的去跳过有问题的GTID事务ID，这是唯一的办法，而且有问题的GITD数量有多少也是未知的。但笔者又不想重建互为主从的环境，如果是生产环境更不能随便重建。

总结的高潮

不管是生产还是测试的互为主从的环境，发生这样的问题而又不能重建或者不想重建，那么请按照以下步骤进行处理：

停掉上层应用，不要再往数据库进行读写数据；
互为主从的环境，在2台互为slave服务器的角色中不断的去跳过有问题的GTID，直到2台slave角色中的SQL线程都为YES；
2台slave角色中的SQL线程都为YES后，还要观察一段时间，2台slave都要观察，通过“show replica status\G;”查看复制状态，说不定还会出现有问题的GTID，按照同样的方法继续跳过处理；
观察了一段时间后，2台slave确实不会再出现有问题的GTID之后，按正常顺序停止复制、停止Mysql服务，然后按正常顺序拉起Mysql，继续观察2台MySQL服务器的IO和SQL线程是否都为YES；
互为主从的环境确实都没问题了，都在master上创建个测试的库，严重是否能正常同步到slave；
互为主从的MySQL环境确实真的真的真的没问题了之后，再拉起上层应用，笔者的上层应用仅需连接其中一台即可，并没有去搞读写分离的骚操作。

好了，下面进入本次排查和解决的全过程，步入主题！！！

主从环境信息

角色	主机名	IP
master	db01	192.168.11.151
slave	db02	192.168.11.152

说明：笔者的环境是启用了GTID模式的主从复制，关于GTID模式和传统的模式，后续会抽时间输出经验进行分享。

问题故障现象

查看slave库的SQL线程为NO，具体的信息如下：

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000005
          Read_Source_Log_Pos: 1986340
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 630974
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

主库的binlog：mysql-bin.000001（master log ）
主库的binlog结束位置：634234（end_log_pos）

查看slave库的error

...
2022-05-10T08:06:05.008230Z 69 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 634234, Error_code: MY-001452
...

通过错误日志的简单分析：

通过上面的errorlog，发现event_recovery表有外键约束，约束名为c_event_recovery_1，eventid作为外键，参考的是events表中的eventid字段，也就是说：父表是events，子表是event_recovery，现在要往子表插入数据，但是父表没有，所以失败了。

进一步深入分析和排查

在slave库上查一下event_recovery表的建表语句

mysql> show create table zabbix.event_recovery\G;
*************************** 1. row ***************************
       Table: event_recovery
Create Table: CREATE TABLE `event_recovery` (
  `eventid` bigint unsigned NOT NULL,
  `r_eventid` bigint unsigned NOT NULL,
  `c_eventid` bigint unsigned DEFAULT NULL,
  `correlationid` bigint unsigned DEFAULT NULL,
  `userid` bigint unsigned DEFAULT NULL,
  PRIMARY KEY (`eventid`),
  KEY `event_recovery_1` (`r_eventid`),
  KEY `event_recovery_2` (`c_eventid`),
  CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_3` FOREIGN KEY (`c_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin
1 row in set (0.00 sec)
ERROR: 
No query specified

约束名：c_event_recovery_1，子表是event_recovery，父表是events。子表的字段eventid参考的是父表的eventid

在master库上解析mysql-bin.000001文件，并找到结束位置634234

[root@zbx-db01 ~]# mysqlbinlog -v --base64-output=decode-rows --stop-position=634234 /data/mysql_data/mysql-bin.000001 | tail -20
# 找到的634234位置的内容如下：
#220507 22:04:25 server id 5  end_log_pos 634234 CRC32 0xf6000142       Write_rows: table id 174 flags: STMT_END_F
### INSERT INTO `zabbix`.`event_recovery`
### SET
###   @1=21751
###   @2=22357
###   @3=NULL
###   @4=NULL
###   @5=NULL
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file

通过解析出来的内容，将master库上位置634234的内容人工转化为可执行的语句后，是这样的：

INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);

要注意：子表event_recovery的字段eventid参考的是父表events的eventid字段

在master库上查下父表events中的eventid是否有21751的记录

mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
     eventid: 21751
      source: 0
      object: 0
    objectid: 13560
       clock: 1651925304
       value: 1
acknowledged: 0
          ns: 381417199
        name: Zabbix task manager processes more than 75% busy
    severity: 3
1 row in set (0.00 sec)

在master主库上是有的，是存在的呢。

那接着在slave库上也查一下父表events中的eventid是否有21751的记录

mysql> select * from zabbix.events where eventid=21751;
Empty set (0.01 sec)
mysql>

在slave库上，父表events中没有21751这个eventid的记录，因为自动同步的原因，所以自动执行也是失败的

在slave库上，尝试执行人工转化后的可执行语句因此，向event_recovery表插入数据时报错，提示无法添加或更新子行，外键约束失败，错误码 ERROR 1452

mysql> INSERT INTO zabbix.event_recovery values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE)

发现也是失败的，往子表插入数据，但是父表没有，不管是手动执行还是自动执行，都是失败的

解决办法和过程

将master库父表events中的eventid为21751的记录查出来，再构造好可执行的插入数据的语句

# 现在master库上查
mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
     eventid: 21751
      source: 0
      object: 0
    objectid: 13560
       clock: 1651925304
       value: 1
acknowledged: 0
          ns: 381417199
        name: Zabbix task manager processes more than 75% busy
    severity: 3
1 row in set (0.13 sec)
ERROR: 
No query specified
# 构造可执行的插入数据的sql语句
insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);

将构造好的语句，在slave库中执行，插入和master中父表events一样的数据到slave库里的父表events

mysql> insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);
Query OK, 1 row affected (0.00 sec)
mysql>

接着在slave库中执行原来报错的语句，就是往子表event_recovery插入数据，居然又报错了，这次是不同的错误

mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE)

这次是约束名为c_event_recovery_2的问题，这是新的问题，通过查看event_recovery表的创建表语句，约束c_event_recovery_2具体信息如下：

CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE

在master库上，查看子表event_recovery的表结构

mysql> desc zabbix.event_recovery;
+---------------+-----------------+------+-----+---------+-------+
| Field         | Type            | Null | Key | Default | Extra |
+---------------+-----------------+------+-----+---------+-------+
| eventid       | bigint unsigned | NO   | PRI | NULL    |       |
| r_eventid     | bigint unsigned | NO   | MUL | NULL    |       |
| c_eventid     | bigint unsigned | YES  | MUL | NULL    |       |
| correlationid | bigint unsigned | YES  |     | NULL    |       |
| userid        | bigint unsigned | YES  |     | NULL    |       |
+---------------+-----------------+------+-----+---------+-------+
5 rows in set (0.01 sec)

也就是说，待插入的值values(21751,22357,NULL,NULL,NULL)中的第2个值（值是22357）就是r_eventid字段，目前在slave库也是缺失的呢。

分别在master库和slave库上查父表events的eventid字段有没有值为22357的记录，如果slave库上没有，那就要构造了

# master库上查，是有的
mysql> select * from zabbix.events where eventid=22357\G;
*************************** 1. row ***************************
     eventid: 22357
      source: 0
      object: 0
    objectid: 13560
       clock: 1651932264
       value: 0
acknowledged: 0
          ns: 578308822
        name: Zabbix task manager processes more than 75% busy
    severity: 0
1 row in set (0.00 sec)
ERROR: 
No query specified
mysql> 
# 在slave库上，果然是没有
mysql> select * from zabbix.events where eventid=22357; 
Empty set (0.00 sec)
# 构造可执行的插入数据的sql语句
insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);

将构造好的语句，在slave库中执行，插入和master中父表events一样的数据到slave库里的父表events

mysql> insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);
Query OK, 1 row affected (0.09 sec)
mysql>

接着在slave库中执行原来报错的语句，就是往子表event_recovery插入数据，成功了。

mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
Query OK, 1 row affected (0.00 sec)
mysql>

接着启动复制，并查看状态

mysql> start replica;
Query OK, 0 rows affected (0.01 sec)
mysql> 
mysql> 
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000006
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 630974
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0

发现SQL线程还是NO，位置信息这时候是 end_log_pos 634116，也就是位置变了，又是另外一个问题造成，具体啥问题，还得去看mysql的error日志和去解析对应的binkog文件，其实处理的套路都是一样的，继续处理它。

续解决位置在634116的问题

在slave库上查看mysql的error，过滤Last_Errno: 1062的error信息，看最新时间的那条就好

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T00:53:03.919190Z 9 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

大概分析的原因是：无法在表 zabbix.events 上执行写入行记录事件，键“events.PRIMARY”的重复条目“22357”，错误代码：1062

在master库上解析binglog文件mysql-bin.000001, 找到位置 end_log_pos 634116，并手动转化为可执行的sql语句

# 开始执行解析
mysqlbinlog -v --base64-output=decode-rows --stop-position=634116 /data/mysql_data/mysql-bin.000001
# 解析后，找到的634116位置内容如下：
#220507 22:04:25 server id 5  end_log_pos 634116 CRC32 0xf54d313b       Write_rows: table id 113 flags: STMT_END_F
### INSERT INTO `zabbix`.`events`
### SET
###   @1=22357
###   @2=0
###   @3=0
###   @4=13560
###   @5=1651932264
###   @6=0
###   @7=0
###   @8=578308822
###   @9='Zabbix task manager processes more than 75% busy'
###   @10=0
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
# 手动转化为可执行的sql语句
INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);

开始在slave库执行转化后的sql语句

mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
ERROR 1062 (23000): Duplicate entry '22357' for key 'events.PRIMARY'
mysql>

执行后报错，刚刚在slave库上查看mysql的error log文件，也是报这个错，所以不管是slave库的SQL线程自动执行还是现在手动执行这条语句，都是报错。events表的主键是eventid（请执行查看表结构便知道），也就是说已经存在这条22357记录了，再插入就是重复了，主键约束的目的就是只能唯一，不能重复，因此报错。

回想了一下，22357这条记录是在 “四、解决办法和过程” 的处理过程中插入进去的。

在slave库上尝试解决

mysql> delete from zabbix.events where eventid=22357;
Query OK, 1 row affected (0.54 sec)
mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
Query OK, 1 row affected (0.00 sec)
mysql> start replica;
Query OK, 0 rows affected (0.02 sec)
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000007
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000024
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 633751
              Relay_Log_Space: 41924759

继续查看slave库的Mysql error日志（看最新的那条就好）

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 106; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

发现是events表中的eventid字段（主键）已经存在了22357的记录，奇怪了，还是这个问题，还是这个主键重复的问题，居然也还是22357这条记录。那刚才第4步骤的解决办法中，岂不是白干了？我....顶！...

新的解决办法

新的解决办法是：跳过指定的GTID事务（忽略slave库上发生的主键冲突），注意，笔者的主从环境是启用了GTID模式的。

之前停止过复制，现在拉起来

mysql> start replica;
Query OK, 0 rows affected (0.02 sec)

查看复制状态

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000007
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000024
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 633751
              Relay_Log_Space: 41925179
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 5
                  Source_UUID: 92099aae-4731-11ec-a3da-00505629525b
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 220511 10:43:53
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:768-71845 # 检索到的Gtid事务列表
            Executed_Gtid_Set: 9208096f-4731-11ec-a23e-005056210589:1-55, # 已执行的Gtid事务列表
92099aae-4731-11ec-a3da-00505629525b:1-767
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 0
            Network_Namespace: 
1 row in set (0.00 sec)
ERROR: 
No query specified

Retrieved_Gtid_Set（检索到的Gtid事务列表）：92099aae-4731-11ec-a3da-00505629525b:768-71845
Executed_Gtid_Set（已执行的Gtid事务列表）：9208096f-4731-11ec-a23e-005056210589:1-55,92099aae-4731-11ec-a3da-00505629525b:1-767

故障深入分析

按照正常推断，如下：

上面的信息可以看出，当前从Master库取到了'92099aae-4731-11ec-a3da-00505629525b:768-71845'的事务列表，并且已执行（Executed_Gtid_Set)到了'92099aae-4731-11ec-a3da-00505629525b:1-767'这个事务GTID的位置。

根据之前在slave库的Mysql error日志：

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 106; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

注意这条error：Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768'。也就是说执行失败的事务是：'92099aae-4731-11ec-a3da-00505629525b:768'

大胆推测：那是不是可以说，只要跳过这个'92099aae-4731-11ec-a3da-00505629525b:768'事务，就可以了？也就是说主从库出现主键冲突（重复）时（比如现在的问题就是这个情况）可以通过注入空事物的方式进行跳过？于是笔者斗胆一试。

在slave库上尝试操作，跳过指定的GTID事务

# 停止复制
stop replica;
# 指定下一个事务执行的版本，即想要跳过的GTID，也就是要跳过'92099aae-4731-11ec-a3da-00505629525b:768'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:768';
begin;
# 提交，开始注入一个空事物
commit;
# 设置自动的寻找GTID事务
set gtid_next='AUTOMATIC';
# 开始同步
start replica;

跳过指定的GTID后，继续在slave库上查看复制状态

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000045
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

顶！！！SQL线程还是为NO，看来问题很多啊，慢慢修吧！不过这次出现的是一个新的问题，事务ID也变了，事务ID是'92099aae-4731-11ec-a3da-00505629525b:784'，位置也变了，这次的位置是643505

在slave库上查看mysql的error日志，error代码是1452（查看最新的那条就好）

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1452
2022-06-06T02:41:00.836059Z 18 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 643505, Error_code: MY-001452

这次的引发错误的原因是外键约束失败，而不是主键冲突，GTID事务ID是：92099aae-4731-11ec-a3da-00505629525b:784

继续解决外键约束失败的问题（跳过指定的GTID事务）

根据刚才的错误，binlog信息是：master log mysql-bin.000001, end_log_pos 643505，解决方案：决定继续采用跳过指定的GTID事务的办法

在slave库上尝试操作，跳过指定的GTID事务

# 停止复制
stop replica;
# 指定下一个事务执行的版本，即想要跳过的GTID，也就是要跳过'92099aae-4731-11ec-a3da-00505629525b:784'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:784';
begin;
# 提交，开始注入一个空事物
commit;
# 设置自动的寻找GTID事务
set gtid_next='AUTOMATIC';
# 开始同步
start replica;

继续查看slave的replica的状态

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000045
                Relay_Log_Pos: 22387
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:1069' at master log mysql-bin.000001, end_log_pos 772696. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

还是未能解决，SQL现存依然为NO，事务ID又变了，这次是'92099aae-4731-11ec-a3da-00505629525b:1069'，按照之前的办法继续跳过指定的GTID事务ID

最终放大招：只要有错误的GTID事务都跳过

目前看来，各种问题太多了，主从已经严重不一致，从库各种主键冲突、约束等等问题引发SQL线程为NO。而且到底还有多少个问题不得而知，如果都像之前一样去解析binlog，然后找到指定的位置，手动转化为sql去执行，已经不现实了。所以，现在唯一的，最终放大招的解决办法是：凡是GIID事务有问题的，都跳过指定的GTID事务。注意，笔者的主从环境是启用了GTID模式的。如果不是GITD的模式，那就不适用该大招。

持续跳过指定的GTID事务ID，操作如下：

stop replica;
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:1069';
begin;
commit;
set gtid_next='AUTOMATIC';
start replica;
show replica status\G;

只要有问题的GTID，都按照上述的办法跳过指定的事务ID，每个有问题的事务ID都不一样，只需将gtid_next=''写成有问题的GTID，其他指令不变、步骤不变。

经过放大招，互为主从的环境SQL线程已经恢复正常

master(192.168.11.152)、slave（192.168.11.152）

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.152
                  Source_User: syn_b
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000010
          Read_Source_Log_Pos: 42583848
               Relay_Log_File: zbx-db01-relay-bin.000050
                Relay_Log_Pos: 922
        Relay_Source_Log_File: mysql-bin.000010
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:

master(192.168.11.152)、slave（192.168.11.151）

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 2178
               Relay_Log_File: zbx-db02-relay-bin.000084
                Relay_Log_Pos: 404
        Relay_Source_Log_File: mysql-bin.000014
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB:

写在最后的自我反省

为什么会出现这样的问题？经过笔者的自我反省，核心原因就是：笔者的是虚拟机环境，正是因为是虚拟机环境，在每次关机的时候都不注重启停顺序。甚至为了方便，直接强制关掉运行了MySQL和上层应用的虚拟机电源，最终引发了数据库的这一些列问题。也正好是因为这次的测试环境，给了笔者一次莫大的教训。笔者认为，如果连测试环境都抱着随便维护的心态、不严谨，一旦养成这种陋习，维护的生产环境总有一天会毁在自身手里。

关于正确的启停顺序

假设应用的后端数据库环境是互为主从架构，笔者的测试环境就是该架构，且不涉及其他数据库或者中间件。

启动

启动互为主从的数据库（master、slave），并检查replica的状态是否正常；
启动上层应用；

停止

停止上层应用；
停止互为主从数据库（master、slave）的replica，再停止mysql服务；

MySQL主主SQL线程异常修复大作战，一失足成千古恨啊！

写在开篇的总结

主从环境信息

问题故障现象

进一步深入分析和排查

解决办法和过程

续解决位置在634116的问题

新的解决办法

继续解决外键约束失败的问题（跳过指定的GTID事务）

最终放大招：只要有错误的GTID事务都跳过

写在最后的自我反省

关于正确的启停顺序

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

MySQL主主SQL线程异常修复大作战，一失足成千古恨啊！

写在开篇的总结

主从环境信息

问题故障现象

进一步深入分析和排查

解决办法和过程

续解决位置在634116的问题

新的解决办法

继续解决外键约束失败的问题（跳过指定的GTID事务）

最终放大招：只要有错误的GTID事务都跳过

写在最后的自我反省

关于正确的启停顺序

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像