一、问题:发现RDS实例主从数据不一致
二、分析:可能的原因
1.主从复制中断
2.主从延迟(DDL语句,大事务,只读实例负载高等)
3.实时性查询高
4.丢数据(binlog row格式可能性很小)
三、排查:
获取只读实例的复制状态,show slave status\
G:
1. row **
Slave_IO_State: Waiting for master to send event
Master_Host: XXXXXXX
Master_User: replicator
Master_Port: XXXX
Connect_Retry: 60
Master_Log_File: mysql-bin.001798
Read_Master_Log_Pos: 15683249
Relay_Log_File: slave-relay.002090
Relay_Log_Pos: 88207647
Relay_Master_Log_File: mysql-bin.001797
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 128759954
Relay_Log_Space: 147040856
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 9366
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: XXXX
Master_UUID: XXXX
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay:
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Last_SQL_Error_Gtid:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 64f5764b-abfd-11e8-b97f-6c92bf4645d4:6903-76788620
Executed_Gtid_Set: 64f5764b-abfd-11e8-b97f-6c92bf4645d4:1-76781839,
73e54b6b-abfd-11e8-9bb5-6c92bf3a3bee:1-131695420
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
解读下:看到IO Thread和SQL Thread线程正常都是YES 复制正常。
Seconds_Behind_Master: 9366, Relay_Master_Log_File: mysql-bin.001797 和 Master_Log_File: mysql-bin.001798 可以判断主从复制相差一个binlog日志有延迟,延迟时间大概9366秒。
继续分析只读实例负载不高
慢日志里有delete操作,但执行时间不是特别长,可以先忽略。
show processlist 也并未发现DDL语句执行
继续看复制信息 Slave_SQL_Running_State: Waiting for dependent transaction to commit 等待依赖的事务提交,这里提醒了我们relay log在回放某个event的时候,在等待某个事务提交。所以查询下infomation_schema.innodb_trx 这个系统表,里面会记录未提交事务执行信息
从结果看确实有2个未提交事务已经跑了很久了,为了尽快解决生产问题,建议客户kill 掉了这2个sql,这里说下trx_mysql_thread_id 对应的是processlist表里的id,所以kill时候,可以kill trx_mysql_thread_id。
结果:kill掉这2个未提交事务之后,延迟恢复。
四.复现客户问题
观察到当时SQL线程回放event时候,在等待table flush
资料:
FLUSH TABLES
Closes all open tables, forces all tables in use to be closed, and flushes the query cache and prepared statement cache. FLUSH TABLES also removes all query results from the query cache, like the RESET QUERY CACHE statement. For information about query caching and prepared statement caching, see Section 8.10.3, “The MySQL Query Cache”. and Section 8.10.4, “Caching of Prepared Statements and Stored Programs”.
mysql官网说明:https://dev.mysql.com/doc/refman/5.6/en/flush.html#flush-tables
说明下,flush tables遇到前面有大事务执行不完,还会产生阻塞,线程状态就是waiting for table flush(阻塞仅限于该表)
基于这些信息我们还原下问题
4.1先在只读实例,开启一个性能差的大查询事务,SQL一直执行不完
4.2在只读实例另外开启一个会话执行flush tables tablename
4.3主实例插入一条测试数据
4.4查询只读实例的sql线程,可以看到SQL线程的状态waiting for table flush,第一个查询一值执行不完不提交,
第二步flush tables tablename就阻塞了第三步的sql
4.5查询只读实例的复制状态,同样可以看到:Waiting for dependent transaction to commit
4.6同样也可以查到未提交的这个大事务
总结:未提交事务和flush tables导致relay log回放event时候,产生了阻塞,复制不能继续,产生主从延迟
客户的问题还可以通过解析binlog定位到当时relay log 回放到binlog的event了,可以佐证该事务在等待这2个事务commit或rollback。解析时候分析这个位置Relay_Master_Log_File: mysql-bin.001797,Exec_Master_Log_Pos: 128759954,row格式使用命令:mysqlbinlog -vv --base64-output=decode-rows --start-position=128759954 mysql-bin.001797 | less