一、问题或故障现象
- 主从节点信息
角色 | 主机名 | ip |
master | db01 | 192.168.11.151 |
slave | db02 | 192.168.11.152 |
- 错误信息
发现slave的SQL线程为NO,Replica_SQL_Running: No,IO线程是正常的,通过分析主从复制的原理来看,SQL线程的作用是读取slave自身的中继日志(relay log)的更新事件并执行,所以,推测故障本身是slave自身出了问题,接下来进一步深入分析原因。
在slave主机查看复制状态,具体信息如下:
mysql> show replica status\G; *************************** 1. row *************************** Replica_IO_State: Waiting for source to send event Source_Host: 192.168.11.151 Source_User: syn_a Source_Port: 3306 Connect_Retry: 60 Source_Log_File: mysql-bin.000003 Read_Source_Log_Pos: 148550 Relay_Log_File: zbx-db02-relay-bin.000003 Relay_Log_Pos: 371 Relay_Source_Log_File: mysql-bin.000001 Replica_IO_Running: Yes Replica_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1032 Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. Skip_Counter: 0 Exec_Source_Log_Pos: 156 Relay_Log_Space: 34184871 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Source_SSL_Allowed: No Source_SSL_CA_File: Source_SSL_CA_Path: Source_SSL_Cert: Source_SSL_Cipher: Source_SSL_Key: Seconds_Behind_Source: NULL Source_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1032 Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. Replicate_Ignore_Server_Ids: Source_Server_Id: 5 Source_UUID: 92099aae-4731-11ec-a3da-00505629525b Source_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Replica_SQL_Running_State: Source_Retry_Count: 86400 Source_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 220509 16:02:22 Source_SSL_Crl: Source_SSL_Crlpath: Retrieved_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:1-55370 Executed_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:1-2 Auto_Position: 0 Replicate_Rewrite_DB: Channel_Name: Source_TLS_Version: Source_public_key_path: Get_Source_public_key: 0 Network_Namespace: 1 row in set (0.00 sec) ERROR: No query specified
从查出来的复制状态信息中可以看出,关键的错误信息有如下2点:
- 错误代码:Last_SQL_Errno: 1032
- 错误描述:Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
二、问题或故障分析
- 查看slave主机上的Mysql服务本身的error日志
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1032 2022-05-09T08:16:04.039794Z 30 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451; Could not execute Update_rows event on table zabbix.trends; Can't find record in 'trends', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 3451, Error_code: MY-001032 # 有6一条一样的 ...
通过error日志继续剖析
关键信息:master log mysql-bin.000001,end_log_pos 3451。就是说master主机上的binlog文件是mysql-bin.000001,结束位置是3451。
通过error日志大概可以看出是,由于工作线程1在执行事务的时候失败,master log是mysql-bin.000001,mysql-bin.000001的end_log_pos(文件结束日志位置)是:3451,事务执行失败的具体原因是,在master中的trends表上执行了Update_rows事件,但在slave主机这边的“trends表”中找不到相应记录,所以SQL线程发生异常。说白了就是主库更新了trends表,但在从库上又不存在,所以引起了从库的SQL线程为NO。
- 初步结论
- 结论1:slave主机读取master的binlog文件是对的上的,所以IO线程没问题,是YES
- 结论2:在master更新trends表记录,但在slave又找不到,因此SQL线程有问题,所以为NO
三、解决办法分析
根据笔者以往的经验,处理这种SQL线程为NO的错误,有两个办法:
- 是直接跳过错误执行语句,也就是忽略它
- 根据提示,在master的binlog日志文件中,通过位置找到错误执行的语句,然后转换为可执行的sql语句,然后去slave库手动执行,从而修复从库的SQL线程
需要注意:第1种解决办法会造成主从不一致可能,推荐第2种方法。因此,笔者采用第2种解决办法。
四、开始解决
特别建议:在解决之前,最好先将上层应用停掉,不然可能会发生不可预知错误,大概就是还会继续出现同样问题:SQL线程异常,只是位置(end_log_pos)变了而已(此问题可能会陷入死循环,处理好一个位置(end_log_pos),可能还有下一个位置(end_log_pos)有问题)。所以,处理该问题之前,还是建议先将上层应用停掉,不要对数据库有读写的IO操作,这样会保险一点,等SQL线程都为YES了,再拉起上层应用。
- 通过在slave主机 的错误日志确定主库的BinLog文件和位置
根据slave主机上的Mysql服务本身的error日志,如下:
[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1032 2022-05-09T08:16:04.039794Z 30 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3451; Could not execute Update_rows event on table zabbix.trends; Can't find record in 'trends', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 3451, Error_code: MY-001032 # 有6一条一样的 ...
经再次确定,主库上的binlog文件是mysql-bin.000001,结束位置是(end_log_pos):3451
- 在主库上操作, 使用mysqlbinlog工具解析mysql-bin.000001文件,查找出这条从库上缺失的数据
1)在主库上开始使用mysqlbinlog工具解析对应的二进制日志文件
# 执行解析命令 mysqlbinlog -v --base64-output=decode-rows --stop-position=3451 /data/mysql_data/mysql-bin.000001 # 解析后的结果: ... ... #220507 22:00:36 server id 5 end_log_pos 3451 CRC32 0x10d1b42d Update_rows: table id 122 flags: STMT_END_F ### UPDATE `zabbix`.`trends` ### WHERE ### @1=32661 ### @2=1651921200 ### @3=15 ### @4=0.016909029421711192775 ### @5=3.5216953554840366714 ### @6=19.720232002729442655 ### SET ### @1=32661 ### @2=1651921200 ### @3=30 ### @4=0.016909029421711192775 ### @5=3.5216953554840366714 ### @6=19.720232002729442655 ROLLBACK /* added by mysqlbinlog */ /*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file ... ...
说明,解析后的内容其实很多很多,笔者是通过结束位置进行查找,也就是找到 end_log_pos 3451有关的内容即可,一直到End of log file就是这段3451位置的内容了
注意:mysqlbinlog的几个参数,因为二进制的日志格式默认是row,所以这里为--base64-output=decode-rows。指定位置--stop-position=3451,指定二进制日志文件/data/mysql_data/mysql-bin.000001
2)分析binlog的内容,转化为可执行的sql语句
经过分析,数据库名是zabbix,表名是trends,通过@1-@6说明有6个值
查看trends表的表结构
mysql> desc zabbix.trends; +-----------+-----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+-----------------+------+-----+---------+-------+ | itemid | bigint unsigned | NO | PRI | NULL | | | clock | int | NO | PRI | 0 | | | num | int | NO | | 0 | | | value_min | double | NO | | 0 | | | value_avg | double | NO | | 0 | | | value_max | double | NO | | 0 | | +-----------+-----------------+------+-----+---------+-------+ 6 rows in set (0.00 sec)
通过查看表结构,确实是有6个字段,所以结合binlog内容中的@1-@6,说明有6个值是正确的
3)开始手动转化为可执行的sql语句
insert into zabbix.trends values(32661,1651921200,15,0.016909029421711192775,3.5216953554840366714,19.720232002729442655);
- 在从库上操作, 执行转化后的sql
1)开始执行插入
mysql> insert into zabbix.trends values(32661,1651921200,15,0.016909029421711192775,3.5216953554840366714,19.720232002729442655); Query OK, 1 row affected (0.10 sec)
2)插入后,启动复制,继续查看复制状态,发现SQL线程还是NO,但仔细发现end_log_pos变了,这次的位置是3633
mysql> start replica; Query OK, 0 rows affected (0.03 sec) mysql> show replica status\G; *************************** 1. row *************************** Replica_IO_State: Waiting for source to send event Source_Host: 192.168.11.151 Source_User: syn_a Source_Port: 3306 Connect_Retry: 60 Source_Log_File: mysql-bin.000005 Read_Source_Log_Pos: 1219239 Relay_Log_File: zbx-db02-relay-bin.000012 Relay_Log_Pos: 324 Relay_Source_Log_File: mysql-bin.000001 Replica_IO_Running: Yes Replica_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1032 Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:3' at master log mysql-bin.000001, end_log_pos 3633. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
- 继续解决3633位置的问题,回到主库操作
1)继续通过mysqlbinlog工具解析Binlog文件,位置是3633
# 开始解析 mysqlbinlog -v --base64-output=decode-rows --stop-position=3633 /data/mysql_data/mysql-bin.000001 # 解析后的内容 #220507 22:00:37 server id 5 end_log_pos 3633 CRC32 0x9835e511 Update_rows: table id 123 flags: STMT_END_F ### UPDATE `zabbix`.`trends_uint` ### WHERE ### @1=29181 ### @2=1651921200 ### @3=15 ### @4=2147209216 ### @5=2147227238 ### @6=2147479552 ### SET ### @1=29181 ### @2=1651921200 ### @3=30 ### @4=2147209216 ### @5=2147227238 ### @6=2147479552 ROLLBACK /* added by mysqlbinlog */ /*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file
2)继续分析binlog解析后的内容
经分析,数据库是zabbix,表是trends_uint,值为6个
查看trends_uint表的表结构:
mysql> desc zabbix.trends_uint; +-----------+-----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+-----------------+------+-----+---------+-------+ | itemid | bigint unsigned | NO | PRI | NULL | | | clock | int | NO | PRI | 0 | | | num | int | NO | | 0 | | | value_min | bigint unsigned | NO | | 0 | | | value_avg | bigint unsigned | NO | | 0 | | | value_max | bigint unsigned | NO | | 0 | | +-----------+-----------------+------+-----+---------+-------+ 6 rows in set (0.00 sec)
查看表结构发现,共计6个字段,根据分析binlog内容,对应的6个值是没错的
3)继续手动转化为可执行的sql语句
insert into zabbix.trends_uint values(29181,1651921200,15,2147209216,2147227238,2147479552);
- 继续回到从库上操作,继续执行转化后的sql
# 执行转化后的sql,插入数据 mysql> insert into zabbix.trends_uint values(29181,1651921200,15,2147209216,2147227238,2147479552); Query OK, 1 row affected (0.00 sec) # 停止复制 mysql> stop replica; Query OK, 0 rows affected (0.01 sec) # 再启动复制 mysql> start replica; Query OK, 0 rows affected (0.02 sec) # 接着查看复制状态,大喜啊!此时SQL线程为YES了。 mysql> show replica status\G; *************************** 1. row *************************** Replica_IO_State: Waiting for source to send event Source_Host: 192.168.11.151 Source_User: syn_a Source_Port: 3306 Connect_Retry: 60 Source_Log_File: mysql-bin.000005 Read_Source_Log_Pos: 1361161 Relay_Log_File: zbx-db02-relay-bin.000012 Relay_Log_Pos: 201080 Relay_Source_Log_File: mysql-bin.000001 Replica_IO_Running: Yes Replica_SQL_Running: Yes
五、最后总结
问题产生的本质原因总结:在本次的问题中,主要引发的核心原因就是主库更新了相关的表,但在从库上又不存在,所以引起了从库的SQL线程为NO。
本次故障解决的总结:在slave主机上的error错误先是发现有1个位置3451有错误,解决后又爆出了3633位置,再按照同样的套路继续解决,之后SQL线程恢复。
关于处理故障思路的总结:就一句话,一定要看相关error日志。