一、问题或故障现象

主从节点信息

角色	主机名	ip
master	db01	192.168.11.151
slave	db02	192.168.11.152

错误信息

发现slave的SQL线程为NO，Replica_SQL_Running: No，IO线程是正常的，通过分析主从复制的原理来看，SQL线程的作用是读取slave自身的中继日志（relay log）的更新事件并执行，所以，推测故障本身是slave自身出了问题，接下来进一步深入分析原因。

在slave主机查看复制状态，具体信息如下：

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000003
          Read_Source_Log_Pos: 148550
               Relay_Log_File: zbx-db02-relay-bin.000003
                Relay_Log_Pos: 371
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 156
              Relay_Log_Space: 34184871
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 5
                  Source_UUID: 92099aae-4731-11ec-a3da-00505629525b
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 220509 16:02:22
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:1-55370
            Executed_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:1-2
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 0
            Network_Namespace: 
1 row in set (0.00 sec)
ERROR: 
No query specified

从查出来的复制状态信息中可以看出，关键的错误信息有如下2点：

错误代码：Last_SQL_Errno: 1032
错误描述：Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

二、问题或故障分析

查看slave主机上的Mysql服务本身的error日志

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1032
2022-05-09T08:16:04.039794Z 30 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451; Could not execute Update_rows event on table zabbix.trends; Can't find record in 'trends', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 3451, Error_code: MY-001032
# 有6一条一样的
...

通过error日志继续剖析

关键信息：master log mysql-bin.000001，end_log_pos 3451。就是说master主机上的binlog文件是mysql-bin.000001，结束位置是3451。

通过error日志大概可以看出是，由于工作线程1在执行事务的时候失败，master log是mysql-bin.000001，mysql-bin.000001的end_log_pos（文件结束日志位置）是：3451，事务执行失败的具体原因是，在master中的trends表上执行了Update_rows事件，但在slave主机这边的“trends表”中找不到相应记录，所以SQL线程发生异常。说白了就是主库更新了trends表，但在从库上又不存在，所以引起了从库的SQL线程为NO。

初步结论

结论1：slave主机读取master的binlog文件是对的上的，所以IO线程没问题，是YES
结论2：在master更新trends表记录，但在slave又找不到，因此SQL线程有问题，所以为NO

三、解决办法分析

根据笔者以往的经验，处理这种SQL线程为NO的错误，有两个办法：

是直接跳过错误执行语句，也就是忽略它
根据提示，在master的binlog日志文件中，通过位置找到错误执行的语句，然后转换为可执行的sql语句，然后去slave库手动执行，从而修复从库的SQL线程

需要注意：第1种解决办法会造成主从不一致可能，推荐第2种方法。因此，笔者采用第2种解决办法。

四、开始解决

特别建议：在解决之前，最好先将上层应用停掉，不然可能会发生不可预知错误，大概就是还会继续出现同样问题：SQL线程异常，只是位置（end_log_pos）变了而已（此问题可能会陷入死循环，处理好一个位置（end_log_pos），可能还有下一个位置（end_log_pos）有问题）。所以，处理该问题之前，还是建议先将上层应用停掉，不要对数据库有读写的IO操作，这样会保险一点，等SQL线程都为YES了，再拉起上层应用。

通过在slave主机 的错误日志确定主库的BinLog文件和位置

根据slave主机上的Mysql服务本身的error日志，如下：

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1032
2022-05-09T08:16:04.039794Z 30 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451; Could not execute Update_rows event on table zabbix.trends; Can't find record in 'trends', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 3451, Error_code: MY-001032
# 有6一条一样的
...

经再次确定，主库上的binlog文件是mysql-bin.000001，结束位置是（end_log_pos）：3451

在主库上操作， 使用mysqlbinlog工具解析mysql-bin.000001文件，查找出这条从库上缺失的数据

1）在主库上开始使用mysqlbinlog工具解析对应的二进制日志文件

# 执行解析命令
mysqlbinlog -v --base64-output=decode-rows --stop-position=3451 /data/mysql_data/mysql-bin.000001
# 解析后的结果：
...
...
#220507 22:00:36 server id 5  end_log_pos 3451 CRC32 0x10d1b42d         Update_rows: table id 122 flags: STMT_END_F
### UPDATE `zabbix`.`trends`
### WHERE
###   @1=32661
###   @2=1651921200
###   @3=15
###   @4=0.016909029421711192775
###   @5=3.5216953554840366714
###   @6=19.720232002729442655
### SET
###   @1=32661
###   @2=1651921200
###   @3=30
###   @4=0.016909029421711192775
###   @5=3.5216953554840366714
###   @6=19.720232002729442655
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
...
...

说明，解析后的内容其实很多很多，笔者是通过结束位置进行查找，也就是找到 end_log_pos 3451有关的内容即可，一直到End of log file就是这段3451位置的内容了

注意：mysqlbinlog的几个参数，因为二进制的日志格式默认是row，所以这里为--base64-output=decode-rows。指定位置--stop-position=3451，指定二进制日志文件/data/mysql_data/mysql-bin.000001

2）分析binlog的内容，转化为可执行的sql语句

经过分析，数据库名是zabbix，表名是trends，通过@1-@6说明有6个值

查看trends表的表结构

mysql> desc zabbix.trends;
+-----------+-----------------+------+-----+---------+-------+
| Field     | Type            | Null | Key | Default | Extra |
+-----------+-----------------+------+-----+---------+-------+
| itemid    | bigint unsigned | NO   | PRI | NULL    |       |
| clock     | int             | NO   | PRI | 0       |       |
| num       | int             | NO   |     | 0       |       |
| value_min | double          | NO   |     | 0       |       |
| value_avg | double          | NO   |     | 0       |       |
| value_max | double          | NO   |     | 0       |       |
+-----------+-----------------+------+-----+---------+-------+
6 rows in set (0.00 sec)

通过查看表结构，确实是有6个字段，所以结合binlog内容中的@1-@6，说明有6个值是正确的

3）开始手动转化为可执行的sql语句

insert into zabbix.trends values(32661,1651921200,15,0.016909029421711192775,3.5216953554840366714,19.720232002729442655);

在从库上操作， 执行转化后的sql

1）开始执行插入

mysql> insert into zabbix.trends values(32661,1651921200,15,0.016909029421711192775,3.5216953554840366714,19.720232002729442655);
Query OK, 1 row affected (0.10 sec)

2）插入后，启动复制，继续查看复制状态，发现SQL线程还是NO，但仔细发现end_log_pos变了，这次的位置是3633

mysql> start replica;
Query OK, 0 rows affected (0.03 sec)
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000005
          Read_Source_Log_Pos: 1219239
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3633. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

继续解决3633位置的问题，回到主库操作

1）继续通过mysqlbinlog工具解析Binlog文件，位置是3633

# 开始解析
mysqlbinlog -v --base64-output=decode-rows --stop-position=3633 /data/mysql_data/mysql-bin.000001
# 解析后的内容
#220507 22:00:37 server id 5  end_log_pos 3633 CRC32 0x9835e511         Update_rows: table id 123 flags: STMT_END_F
### UPDATE `zabbix`.`trends_uint`
### WHERE
###   @1=29181
###   @2=1651921200
###   @3=15
###   @4=2147209216
###   @5=2147227238
###   @6=2147479552
### SET
###   @1=29181
###   @2=1651921200
###   @3=30
###   @4=2147209216
###   @5=2147227238
###   @6=2147479552
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file

2）继续分析binlog解析后的内容

经分析，数据库是zabbix，表是trends_uint，值为6个

查看trends_uint表的表结构：

mysql> desc zabbix.trends_uint;
+-----------+-----------------+------+-----+---------+-------+
| Field     | Type            | Null | Key | Default | Extra |
+-----------+-----------------+------+-----+---------+-------+
| itemid    | bigint unsigned | NO   | PRI | NULL    |       |
| clock     | int             | NO   | PRI | 0       |       |
| num       | int             | NO   |     | 0       |       |
| value_min | bigint unsigned | NO   |     | 0       |       |
| value_avg | bigint unsigned | NO   |     | 0       |       |
| value_max | bigint unsigned | NO   |     | 0       |       |
+-----------+-----------------+------+-----+---------+-------+
6 rows in set (0.00 sec)

查看表结构发现，共计6个字段，根据分析binlog内容，对应的6个值是没错的

3）继续手动转化为可执行的sql语句

insert into zabbix.trends_uint values(29181,1651921200,15,2147209216,2147227238,2147479552);

继续回到从库上操作，继续执行转化后的sql

# 执行转化后的sql，插入数据
mysql> insert into zabbix.trends_uint values(29181,1651921200,15,2147209216,2147227238,2147479552);
Query OK, 1 row affected (0.00 sec)
# 停止复制
mysql> stop replica;
Query OK, 0 rows affected (0.01 sec)
# 再启动复制
mysql> start replica;      
Query OK, 0 rows affected (0.02 sec)
# 接着查看复制状态，大喜啊！此时SQL线程为YES了。
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000005
          Read_Source_Log_Pos: 1361161
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 201080
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes

五、最后总结

问题产生的本质原因总结：在本次的问题中，主要引发的核心原因就是主库更新了相关的表，但在从库上又不存在，所以引起了从库的SQL线程为NO。

本次故障解决的总结：在slave主机上的error错误先是发现有1个位置3451有错误，解决后又爆出了3633位置，再按照同样的套路继续解决，之后SQL线程恢复。

关于处理故障思路的总结：就一句话，一定要看相关error日志。

MySQL主从复制“死掉”！引发Slave库SQL线程异常的一次“血案”追踪

一、问题或故障现象

二、问题或故障分析

三、解决办法分析

四、开始解决

五、最后总结

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

MySQL主从复制“死掉”！引发Slave库SQL线程异常的一次“血案”追踪

一、问题或故障现象

二、问题或故障分析

三、解决办法分析

四、开始解决

五、最后总结

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像