MHA failover NON-GTID 专题
这里以masterha_master_switch为背景详解各种可能遇到的场景
假定环境(经典三节点)
host_1(host_1:3306) (current master)
+--host_2(host_2:3306 slave[candidate master])
+--host_3(host_3:3306 etl)
一、Master : MySQL down
1.1 etl 延迟8小时
配置文件中加上no_check_delay=0 即可忽略报错
1.2 slave(候选master)比etl还要落后更多
- 1.2.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000002 | 1920 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000002
Read_Master_Log_Pos: 150
Exec_Master_Log_Pos: 150
* etl (other slave)
Master_Log_File: host_2.000002
Read_Master_Log_Pos: 1035
Exec_Master_Log_Pos: 1035
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Mon Nov 13 16:24:03 2017 - [info] MHA::MasterFailover version 0.56.
Mon Nov 13 16:24:03 2017 - [info] Starting master failover.
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] * Phase 1: Configuration Check Phase..
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:03 2017 - [info] Binlog server host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Binlog server host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] GTID failover mode = 0
Mon Nov 13 16:24:06 2017 - [info] Dead Servers:
Mon Nov 13 16:24:06 2017 - [info] host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Checking master reachability via MySQL(double check)...
Mon Nov 13 16:24:06 2017 - [info] ok.
Mon Nov 13 16:24:06 2017 - [info] Alive Servers:
Mon Nov 13 16:24:06 2017 - [info] host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [info] host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] Alive Slaves:
Mon Nov 13 16:24:06 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:06 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:06 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Mon Nov 13 16:24:06 2017 - [info] done.
Mon Nov 13 16:24:06 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Mon Nov 13 16:24:06 2017 - [info] done.
Mon Nov 13 16:24:06 2017 - [info] Starting Non-GTID based failover.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Nov 13 16:24:06 2017 - [info] Executing master IP deactivation script:
Mon Nov 13 16:24:06 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-13 16:24:07-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.2M=0s
2017-11-13 16:24:11 (11.2 MB/s) - 已写入标准输出 [38]
Mon Nov 13 16:24:11 2017 - [info] done.
Mon Nov 13 16:24:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Nov 13 16:24:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3: Master Recovery Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000002:1035
Mon Nov 13 16:24:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Nov 13 16:24:11 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000002:150
Mon Nov 13 16:24:11 2017 - [info] Oldest slaves:
Mon Nov 13 16:24:11 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] Fetching dead master's binary logs..
Mon Nov 13 16:24:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000002 --start_pos=1035 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000002 pos 1035 to host_2.000002 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000002 position 1035 to tail(1939).. ok.
Concat succeeded.
Mon Nov 13 16:24:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: Determining New Master Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Nov 13 16:24:13 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
Target relay log FOUND!
Mon Nov 13 16:24:13 2017 - [info] OK. host_3 has all relay logs.
Mon Nov 13 16:24:13 2017 - [info] Searching new master from slaves..
Mon Nov 13 16:24:13 2017 - [info] Candidate masters from the configuration file:
Mon Nov 13 16:24:13 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:13 2017 - [info] Non-candidate masters:
Mon Nov 13 16:24:13 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:13 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Mon Nov 13 16:24:13 2017 - [info] Not found.
Mon Nov 13 16:24:13 2017 - [info] Searching from all candidate_master slaves..
Mon Nov 13 16:24:13 2017 - [info] New master is host_1(host_1:3306)
Mon Nov 13 16:24:13 2017 - [info] Starting master failover..
Mon Nov 13 16:24:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Server host_1 received relay logs up to: host_2.000002:150
Mon Nov 13 16:24:13 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000002:1035 (using the latest slave's relay logs)
Mon Nov 13 16:24:13 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005
Mon Nov 13 16:24:15 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
Concat binary/relay logs from host_3-relay-bin.000005 pos 269 to host_3-relay-bin.000005 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog ..
Dumping binlog format description event, from position 0 to 269.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000005 position 269 to tail(1154).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to root@host_1(22) succeeded.
Mon Nov 13 16:24:15 2017 - [info] Generating diff files succeeded.
Mon Nov 13 16:24:15 2017 - [info] Sending binlog..
Mon Nov 13 16:24:16 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Nov 13 16:24:16 2017 - [info] Starting recovery on host_1(host_1:3306)..
Mon Nov 13 16:24:16 2017 - [info] Generating diffs succeeded.
Mon Nov 13 16:24:16 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:16 2017 - [info] done.
Mon Nov 13 16:24:16 2017 - [info] Getting slave status..
Mon Nov 13 16:24:16 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:150). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:16 2017 - [info] Connecting to the target slave host host_1, running recover script..
Mon Nov 13 16:24:16 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:16 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog position 150 to tail(1054).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:16 2017 - [info] All relay logs were successfully applied.
Mon Nov 13 16:24:16 2017 - [info] Getting new master's binlog name and position..
Mon Nov 13 16:24:16 2017 - [info] host_1.000001:3232
Mon Nov 13 16:24:16 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000001', MASTER_LOG_POS=3232, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Nov 13 16:24:16 2017 - [info] Executing master IP activate script:
Mon Nov 13 16:24:16 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Mon Nov 13 16:24:21 2017 - [info] OK.
Mon Nov 13 16:24:21 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info] ok.
Mon Nov 13 16:24:21 2017 - [info] ** Finished master recovery successfully.
Mon Nov 13 16:24:21 2017 - [info] * Phase 3: Master Recovery Phase completed.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4: Slaves Recovery Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 104955. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Mon Nov 13 16:24:21 2017 - [info] Generating relay diff files from the latest slave succeeded.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 104966. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Sending binlog..
Mon Nov 13 16:24:21 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:21 2017 - [info] Starting recovery on host_3(host_3:3306)..
Mon Nov 13 16:24:21 2017 - [info] Generating diffs succeeded.
Mon Nov 13 16:24:21 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:21 2017 - [info] done.
Mon Nov 13 16:24:21 2017 - [info] Getting slave status..
Mon Nov 13 16:24:21 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:1035). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:21 2017 - [info] Connecting to the target slave host host_3, running recover script..
Mon Nov 13 16:24:21 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:21 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:21 2017 - [info] All relay logs were successfully applied.
Mon Nov 13 16:24:21 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info] Executed CHANGE MASTER.
Mon Nov 13 16:24:21 2017 - [info] Slave started.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Mon Nov 13 16:24:21 2017 - [info] All new slave servers recovered successfully.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 5: New master cleanup phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Resetting slave info on the new master..
Mon Nov 13 16:24:21 2017 - [info] host_1: Resetting slave info succeeded.
Mon Nov 13 16:24:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info] Sending mail..
- 1.2.2 当master的所有日志已经传递到1个etl,这时候master 上的MySQL挂了
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000010 | 1694 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000010
Exec_Master_Log_Pos: 806
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
* etl (other slave) host_3
Master_Log_File: host_2.000010
Exec_Master_Log_Pos: 1694
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)
### 切换日志
Wed Nov 15 10:25:50 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:25:50 2017 - [info] Starting master failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:25:50 2017 - [info] Dead Servers:
Wed Nov 15 10:25:50 2017 - [info] host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:25:50 2017 - [info] ok.
Wed Nov 15 10:25:50 2017 - [info] Alive Servers:
Wed Nov 15 10:25:50 2017 - [info] host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info] host_3(host_3:3306)
Wed Nov 15 10:25:50 2017 - [info] Alive Slaves:
Wed Nov 15 10:25:50 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:50 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:50 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Wed Nov 15 10:25:50 2017 - [info] done.
Wed Nov 15 10:25:50 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:25:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:25:51 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:25:51 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-15 10:25:51-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.1M=0s
2017-11-15 10:25:53 (11.1 MB/s) - 已写入标准输出 [38]
Wed Nov 15 10:25:53 2017 - [info] done.
Wed Nov 15 10:25:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:25:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] The latest binary log file/position on all slaves is host_2.000010:1694
Wed Nov 15 10:25:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:25:53 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:53 2017 - [info] The oldest binary log file/position on all slaves is host_2.000010:806
Wed Nov 15 10:25:53 2017 - [info] Oldest slaves:
Wed Nov 15 10:25:53 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:25:53 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000010 --start_pos=1694 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000010 pos 1694 to host_2.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000010 position 1694 to tail(1713).. ok.
Concat succeeded.
Wed Nov 15 10:25:53 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:25:54 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Wed Nov 15 10:25:54 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
Target relay log FOUND!
Wed Nov 15 10:25:54 2017 - [info] OK. host_3 has all relay logs.
Wed Nov 15 10:25:54 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:25:54 2017 - [info] Candidate masters from the configuration file:
Wed Nov 15 10:25:54 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:54 2017 - [info] Non-candidate masters:
Wed Nov 15 10:25:54 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:54 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:25:54 2017 - [info] Not found.
Wed Nov 15 10:25:54 2017 - [info] Searching from all candidate_master slaves..
Wed Nov 15 10:25:54 2017 - [info] New master is host_1(host_1:3306)
Wed Nov 15 10:25:54 2017 - [info] Starting master failover..
Wed Nov 15 10:25:54 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Server host_1 received relay logs up to: host_2.000010:806
Wed Nov 15 10:25:54 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000010:1694 (using the latest slave's relay logs)
Wed Nov 15 10:25:55 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Wed Nov 15 10:25:55 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Wed Nov 15 10:25:55 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
Concat binary/relay logs from host_3-relay-bin.000004 pos 1017 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog ..
Dumping binlog format description event, from position 0 to 361.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 1017 to tail(1905).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to root@host_1(22) succeeded.
Wed Nov 15 10:25:55 2017 - [info] Generating diff files succeeded.
Wed Nov 15 10:25:55 2017 - [info] Sending binlog..
Wed Nov 15 10:25:56 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:25:56 2017 - [info] Starting recovery on host_1(host_1:3306)..
Wed Nov 15 10:25:56 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:25:56 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:56 2017 - [info] done.
Wed Nov 15 10:25:56 2017 - [info] Getting slave status..
Wed Nov 15 10:25:56 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:806). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:56 2017 - [info] Connecting to the target slave host host_1, running recover script..
Wed Nov 15 10:25:56 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:56 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog position 150 to tail(169).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:56 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:25:56 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:25:56 2017 - [info] host_1.000010:2060
Wed Nov 15 10:25:56 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000010', MASTER_LOG_POS=2060, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:25:56 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:25:56 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Wed Nov 15 10:25:59 2017 - [info] OK.
Wed Nov 15 10:25:59 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info] ok.
Wed Nov 15 10:25:59 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:25:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 125962. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Wed Nov 15 10:25:59 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 125967. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Sending binlog..
Wed Nov 15 10:25:59 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:59 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:25:59 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:25:59 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:59 2017 - [info] done.
Wed Nov 15 10:25:59 2017 - [info] Getting slave status..
Wed Nov 15 10:25:59 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:1694). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:59 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:25:59 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:59 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:59 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:25:59 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info] Executed CHANGE MASTER.
Wed Nov 15 10:25:59 2017 - [info] Slave started.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:25:59 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:25:59 2017 - [info] host_1: Resetting slave info succeeded.
Wed Nov 15 10:25:59 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info] Sending mail..
### 切换后的结果
* new master 和 new etl 数据一致
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)
1.3 slave(候选master)的日志是最新的,比etl要多
- 1.3.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000004 | 4577 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
| 125 | 13 |
| 126 | 14 |
| 127 | 15 |
+-----+------+
14 rows in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000004
Exec_Master_Log_Pos: 3683
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
+-----+------+
11 rows in set (0.00 sec)
* etl (other slave) host_3
Master_Log_File: host_2.000004
Exec_Master_Log_Pos: 2789
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
+-----+------+
8 rows in set (0.00 sec)
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Tue Nov 14 17:13:08 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov 14 17:13:08 2017 - [info] Starting master failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] GTID failover mode = 0
Tue Nov 14 17:13:08 2017 - [info] Dead Servers:
Tue Nov 14 17:13:08 2017 - [info] host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov 14 17:13:08 2017 - [info] ok.
Tue Nov 14 17:13:08 2017 - [info] Alive Servers:
Tue Nov 14 17:13:08 2017 - [info] host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [info] host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] Alive Slaves:
Tue Nov 14 17:13:08 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:08 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:08 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Tue Nov 14 17:13:08 2017 - [info] done.
Tue Nov 14 17:13:08 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Tue Nov 14 17:13:08 2017 - [info] done.
Tue Nov 14 17:13:08 2017 - [info] Starting Non-GTID based failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 14 17:13:09 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov 14 17:13:09 2017 - [info] Executing master IP deactivation script:
Tue Nov 14 17:13:09 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-14 17:13:09-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 7.19M=0s
2017-11-14 17:13:11 (7.19 MB/s) - 已写入标准输出 [38]
Tue Nov 14 17:13:11 2017 - [info] done.
Tue Nov 14 17:13:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov 14 17:13:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000004:3683
Tue Nov 14 17:13:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov 14 17:13:11 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000004:2789
Tue Nov 14 17:13:11 2017 - [info] Oldest slaves:
Tue Nov 14 17:13:11 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] Fetching dead master's binary logs..
Tue Nov 14 17:13:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000004 --start_pos=3683 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000004 pos 3683 to host_2.000004 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000004 position 3683 to tail(4596).. ok.
Concat succeeded.
Tue Nov 14 17:13:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Nov 14 17:13:12 2017 - [info] Checking whether host_1 has relay logs from the oldest position..
Tue Nov 14 17:13:12 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003 :
Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
Target relay log FOUND!
Tue Nov 14 17:13:13 2017 - [info] OK. host_1 has all relay logs.
Tue Nov 14 17:13:13 2017 - [info] Searching new master from slaves..
Tue Nov 14 17:13:13 2017 - [info] Candidate masters from the configuration file:
Tue Nov 14 17:13:13 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:13 2017 - [info] Non-candidate masters:
Tue Nov 14 17:13:13 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:13 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov 14 17:13:13 2017 - [info] New master is host_1(host_1:3306)
Tue Nov 14 17:13:13 2017 - [info] Starting master failover..
Tue Nov 14 17:13:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Tue Nov 14 17:13:13 2017 - [info] Sending binlog..
Tue Nov 14 17:13:13 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Nov 14 17:13:13 2017 - [info] Starting recovery on host_1(host_1:3306)..
Tue Nov 14 17:13:13 2017 - [info] Generating diffs succeeded.
Tue Nov 14 17:13:13 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:13 2017 - [info] done.
Tue Nov 14 17:13:13 2017 - [info] Getting slave status..
Tue Nov 14 17:13:13 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:3683). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:13 2017 - [info] Connecting to the target slave host host_1, running recover script..
Tue Nov 14 17:13:13 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:13 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:13 2017 - [info] All relay logs were successfully applied.
Tue Nov 14 17:13:13 2017 - [info] Getting new master's binlog name and position..
Tue Nov 14 17:13:13 2017 - [info] host_1.000003:2347
Tue Nov 14 17:13:13 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000003', MASTER_LOG_POS=2347, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov 14 17:13:13 2017 - [info] Executing master IP activate script:
Tue Nov 14 17:13:13 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Tue Nov 14 17:13:16 2017 - [info] OK.
Tue Nov 14 17:13:16 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Tue Nov 14 17:13:16 2017 - [info] ok.
Tue Nov 14 17:13:16 2017 - [info] ** Finished master recovery successfully.
Tue Nov 14 17:13:16 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 29885. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] Server host_3 received relay logs up to: host_2.000004:2789
Tue Nov 14 17:13:16 2017 - [info] Need to get diffs from the latest slave(host_1) up to: host_2.000004:3683 (using the latest slave's relay logs)
Tue Nov 14 17:13:16 2017 - [info] Connecting to the latest slave host host_1, generating diff relay log files..
Tue Nov 14 17:13:16 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003
Tue Nov 14 17:13:17 2017 - [info]
Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
Concat binary/relay logs from host_1-relay-bin.000003 pos 269 to host_1-relay-bin.000003 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog ..
Dumping binlog format description event, from position 0 to 269.. ok.
Dumping effective binlog data from /data/mysql_data/host_1-relay-bin.000003 position 269 to tail(1163).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog .
scp host_1.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to root@host_3(22) succeeded.
Tue Nov 14 17:13:17 2017 - [info] Generating diff files succeeded.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] Generating relay diff files from the latest slave succeeded.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 31393. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Sending binlog..
Tue Nov 14 17:13:17 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:17 2017 - [info] Starting recovery on host_3(host_3:3306)..
Tue Nov 14 17:13:17 2017 - [info] Generating diffs succeeded.
Tue Nov 14 17:13:17 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:17 2017 - [info] done.
Tue Nov 14 17:13:17 2017 - [info] Getting slave status..
Tue Nov 14 17:13:17 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:2789). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:17 2017 - [info] Connecting to the target slave host host_3, running recover script..
Tue Nov 14 17:13:17 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:17 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog position 150 to tail(1063).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:17 2017 - [info] All relay logs were successfully applied.
Tue Nov 14 17:13:17 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Tue Nov 14 17:13:17 2017 - [info] Executed CHANGE MASTER.
Tue Nov 14 17:13:17 2017 - [info] Slave started.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] All new slave servers recovered successfully.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Resetting slave info on the new master..
Tue Nov 14 17:13:17 2017 - [info] host_1: Resetting slave info succeeded.
Tue Nov 14 17:13:17 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_1(host_1:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_1(host_1:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info] Sending mail..
## 切换结果
new master 和 new etl
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
| 125 | 13 |
| 126 | 14 |
| 127 | 15 |
+-----+------+
14 rows in set (0.00 sec)
- 1.3.2 当master的所有日志已经传递slave,这时候master 上的MySQL挂了
### 模拟现场,现场的3台DB binlog状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000010 | 3341 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* slave host_2
Master_Log_File: host_1.000010
Exec_Master_Log_Pos: 3341
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* etl host_3
Master_Log_File: host_1.000010
Exec_Master_Log_Pos: 2381
### 切换日志
Wed Nov 15 10:39:36 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:39:36 2017 - [info] Starting master failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:39:36 2017 - [info] Dead Servers:
Wed Nov 15 10:39:36 2017 - [info] host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:39:36 2017 - [info] ok.
Wed Nov 15 10:39:36 2017 - [info] Alive Servers:
Wed Nov 15 10:39:36 2017 - [info] host_2(host_2:3306)
Wed Nov 15 10:39:36 2017 - [info] host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] Alive Slaves:
Wed Nov 15 10:39:36 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:36 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:36 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Wed Nov 15 10:39:36 2017 - [info] done.
Wed Nov 15 10:39:36 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:39:37 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:39:37 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:39:37 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_1 is deleted ==============================
--2017-11-15 10:39:37-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.4M=0s
2017-11-15 10:39:39 (11.4 MB/s) - 已写入标准输出 [38]
Wed Nov 15 10:39:39 2017 - [info] done.
Wed Nov 15 10:39:39 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:39:39 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] The latest binary log file/position on all slaves is host_1.000010:3341
Wed Nov 15 10:39:39 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:39:39 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:39 2017 - [info] The oldest binary log file/position on all slaves is host_1.000010:2381
Wed Nov 15 10:39:39 2017 - [info] Oldest slaves:
Wed Nov 15 10:39:39 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:39:39 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000010 --start_pos=3341 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_1.000010 pos 3341 to host_1.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql.bin/host_1.000010 position 3341 to tail(3364).. ok.
Binlog Checksum enabled
Concat succeeded.
Wed Nov 15 10:39:39 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:39:40 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Wed Nov 15 10:39:40 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
Target relay log FOUND!
Wed Nov 15 10:39:40 2017 - [info] OK. host_2 has all relay logs.
Wed Nov 15 10:39:40 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:39:40 2017 - [info] Candidate masters from the configuration file:
Wed Nov 15 10:39:40 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:40 2017 - [info] Non-candidate masters:
Wed Nov 15 10:39:40 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:40 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:39:40 2017 - [info] New master is host_2(host_2:3306)
Wed Nov 15 10:39:40 2017 - [info] Starting master failover..
Wed Nov 15 10:39:40 2017 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:39:40 2017 - [info] Sending binlog..
Wed Nov 15 10:39:41 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:39:41 2017 - [info] Starting recovery on host_2(host_2:3306)..
Wed Nov 15 10:39:41 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:39:41 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:41 2017 - [info] done.
Wed Nov 15 10:39:41 2017 - [info] Getting slave status..
Wed Nov 15 10:39:41 2017 - [info] This slave(host_2)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:3341). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:41 2017 - [info] Connecting to the target slave host host_2, running recover script..
Wed Nov 15 10:39:41 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:39:41 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_2:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:39:41 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:39:41 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:39:41 2017 - [info] host_2.000011:1307
Wed Nov 15 10:39:41 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000011', MASTER_LOG_POS=1307, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:39:41 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:39:41 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_2 is added ==============================
Wed Nov 15 10:39:44 2017 - [info] OK.
Wed Nov 15 10:39:44 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
Wed Nov 15 10:39:44 2017 - [info] ok.
Wed Nov 15 10:39:44 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:39:44 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 11760. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] Server host_3 received relay logs up to: host_1.000010:2381
Wed Nov 15 10:39:44 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000010:3341 (using the latest slave's relay logs)
Wed Nov 15 10:39:44 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Wed Nov 15 10:39:44 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Wed Nov 15 10:39:45 2017 - [info]
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
Concat binary/relay logs from host_2-relay-bin.000002 pos 636 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 315.. ok.
Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 636 to tail(1596).. ok.
Binlog Checksum enabled
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog .
scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to root@host_3(22) succeeded.
Wed Nov 15 10:39:45 2017 - [info] Generating diff files succeeded.
Wed Nov 15 10:39:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:39:45 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:39:45 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 12881. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Sending binlog..
Wed Nov 15 10:39:45 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:45 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:39:45 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:39:45 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:45 2017 - [info] done.
Wed Nov 15 10:39:45 2017 - [info] Getting slave status..
Wed Nov 15 10:39:45 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:2381). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:45 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:39:45 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:40:45 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog.. Binlog Checksum enabled
dumped up to pos 154. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog has effective binlog events from pos 154.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog position 154 to tail(177).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:40:45 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:40:45 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Wed Nov 15 10:40:45 2017 - [info] Executed CHANGE MASTER.
Wed Nov 15 10:40:45 2017 - [info] Slave started.
Wed Nov 15 10:40:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:40:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:40:45 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:40:45 2017 - [info] host_2: Resetting slave info succeeded.
Wed Nov 15 10:40:45 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info] Sending mail..
### 切换结果
new master & new etl 数据一致
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |
1.4 slave(候选master)上面有大事务在跑
- 1000s的大查询
无影响,正常切换
1.5 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
- 90%的场景都是可以重新执行的
- 5%的场景不能执行
* NO。1
问题: dead master上面的vip已经delete,再次failover 这一步会报错
方案: a)将dead master上的vip再重新add上去 或者 b)直接过滤掉 delete dead master vip这一步
- 5%的场景不能再次执行,执行会报错
一般这种场景就是:已经failover到最后的change master阶段,这样主从结构已经变更( candidate slave <==rep== etl ),MHA无法重新走一遍。
好在,change master的语句已经生成,其余slave再次执行一遍日志里面的change master即可
Thu Nov 9 16:49:39 2017 - [info] GTID failover mode = 0
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1(host_1:3306) does not match with specified dead master host_2(host_2:3306)!
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
1.7 Master:MySQL down小结
1. failover最终命令
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=$dead_master_ip --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
二、Master : Server down
2.1 etl 延迟8小时
同1.1 结论
2.2 slave(候选master)比etl还要落后更多
- 2.2.1 当master的部分日志还没传递两个slave,这时候master server挂了
### 3台DB的binlog状态
* master
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000012 | 1651 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc>
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
| 174 | 3 |
| 175 | 4 |
+-----+------+
4 rows in set (0.00 sec)
* slave
Master_Log_File: host_2.000012
Exec_Master_Log_Pos: 467
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
* etl
Master_Log_File: host_2.000012
Exec_Master_Log_Pos: 1059
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 模拟故障场景
* 隔离master的网络,让其等同于down机
master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 切换日志
Thu Nov 16 16:54:40 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 16:54:40 2017 - [info] Starting master failover.
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 16:54:40 2017 - [info] GTID failover mode = 0
Thu Nov 16 16:54:40 2017 - [info] Dead Servers:
Thu Nov 16 16:54:40 2017 - [info] host_2(host_2:3306)
Thu Nov 16 16:54:40 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 16:54:41 2017 - [info] ok.
Thu Nov 16 16:54:41 2017 - [info] Alive Servers:
Thu Nov 16 16:54:41 2017 - [info] host_1(host_1:3306)
Thu Nov 16 16:54:41 2017 - [info] host_3(host_3:3306)
Thu Nov 16 16:54:41 2017 - [info] Alive Slaves:
Thu Nov 16 16:54:41 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:54:41 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:54:41 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Thu Nov 16 16:54:41 2017 - [info] done.
Thu Nov 16 16:54:41 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 16:54:41 2017 - [info] done.
Thu Nov 16 16:54:41 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:55:31 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 16:55:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 16:55:31 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 16:55:31 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stop
ssh: connect to host host_2 port 22: Connection timed out
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-16 16:55:38-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 12.1M=0s
2017-11-16 16:57:36 (12.1 MB/s) - 已写入标准输出 [38]
Thu Nov 16 16:57:36 2017 - [info] done.
Thu Nov 16 16:57:36 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 16:57:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] The latest binary log file/position on all slaves is host_2.000012:1059
Thu Nov 16 16:57:36 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 16:57:36 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:36 2017 - [info] The oldest binary log file/position on all slaves is host_2.000012:467
Thu Nov 16 16:57:36 2017 - [info] Oldest slaves:
Thu Nov 16 16:57:36 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Thu Nov 16 16:57:37 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
Target relay log FOUND!
Thu Nov 16 16:57:37 2017 - [info] OK. host_3 has all relay logs.
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Searching new master from slaves..
Thu Nov 16 16:57:37 2017 - [info] Candidate masters from the configuration file:
Thu Nov 16 16:57:37 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:37 2017 - [info] Non-candidate masters:
Thu Nov 16 16:57:37 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:37 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 16:57:37 2017 - [info] Not found.
Thu Nov 16 16:57:37 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 16 16:57:37 2017 - [info] New master is host_1(host_1:3306)
Thu Nov 16 16:57:37 2017 - [info] Starting master failover..
Thu Nov 16 16:57:37 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] Server host_1 received relay logs up to: host_2.000012:467
Thu Nov 16 16:57:37 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000012:1059 (using the latest slave's relay logs)
Thu Nov 16 16:57:38 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Thu Nov 16 16:57:38 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
Concat binary/relay logs from host_3-relay-bin.000004 pos 678 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog ..
Dumping binlog format description event, from position 0 to 361.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 678 to tail(1270).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog to root@host_1(22) succeeded.
Thu Nov 16 16:57:38 2017 - [info] Generating diff files succeeded.
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 16:57:38 2017 - [info] Starting recovery on host_1(host_1:3306)..
Thu Nov 16 16:57:38 2017 - [info] Generating diffs succeeded.
Thu Nov 16 16:57:38 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 16:57:38 2017 - [info] done.
Thu Nov 16 16:57:38 2017 - [info] Getting slave status..
Thu Nov 16 16:57:38 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000012:467). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 16:57:38 2017 - [info] Connecting to the target slave host host_1, running recover script..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 16:57:39 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 16:57:39 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 16:57:39 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 16:57:39 2017 - [info] host_1.000012:1310
Thu Nov 16 16:57:39 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000012', MASTER_LOG_POS=1310, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 16:57:39 2017 - [info] Executing master IP activate script:
Thu Nov 16 16:57:39 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Thu Nov 16 16:57:41 2017 - [info] OK.
Thu Nov 16 16:57:41 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info] ok.
Thu Nov 16 16:57:41 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 16:57:41 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 123011. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Thu Nov 16 16:57:41 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 123044. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 16:57:41 2017 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 16:57:41 2017 - [info] done.
Thu Nov 16 16:57:41 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 16:57:41 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info] Executed CHANGE MASTER.
Thu Nov 16 16:57:41 2017 - [info] Slave started.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 16:57:41 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 16:57:41 2017 - [info] host_1: Resetting slave info succeeded.
Thu Nov 16 16:57:41 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Thu Nov 16 16:57:41 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
### 切换后的结果
new_master & new_etl 结果一致,由于master上面未传过来的binlog彻底丢失,所以相应的新集群也缺失这些数据。
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 最后一步很重要
如果dead master之后又活过来了,那么这一步要做
dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==》 tgw章节详细描述
- 2.2.2 当master的所有日志已经传递到1个etl,这时候master server挂了
测试省略,和2.2.1基本一样
结论:由于master上的所有日志全部传递到etl,所以最后是不会丢失master上任何数据的。
2.3 slave(候选master)的日志是最新的,比etl要多
- 2.3.1 当master的部分日志还没传递两个slave,这时候master server挂了
### 3台DB的binlog状态
* master
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000012 | 3860 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
| 178 | 10 |
| 179 | 20 |
+-----+------+
4 rows in set (0.00 sec)
* slave
Master_Log_File: host_1.000012
Exec_Master_Log_Pos: 3216
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
+-----+------+
2 rows in set (0.00 sec)
* etl
Master_Log_File: host_1.000012
Exec_Master_Log_Pos: 2576
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
### 模拟故障场景
* 隔离master的网络,让其等同于down机
master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 切换日志
Thu Nov 16 17:17:59 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 17:17:59 2017 - [info] Starting master failover.
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 17:17:59 2017 - [info] GTID failover mode = 0
Thu Nov 16 17:17:59 2017 - [info] Dead Servers:
Thu Nov 16 17:17:59 2017 - [info] host_1(host_1:3306)
Thu Nov 16 17:17:59 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 17:18:00 2017 - [info] ok.
Thu Nov 16 17:18:00 2017 - [info] Alive Servers:
Thu Nov 16 17:18:00 2017 - [info] host_2(host_2:3306)
Thu Nov 16 17:18:00 2017 - [info] host_3(host_3:3306)
Thu Nov 16 17:18:00 2017 - [info] Alive Slaves:
Thu Nov 16 17:18:00 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:18:00 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:18:00 2017 - [info] Starting SQL thread on host_2(host_2:3306) ..
Thu Nov 16 17:18:00 2017 - [info] done.
Thu Nov 16 17:18:00 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 17:18:00 2017 - [info] done.
Thu Nov 16 17:18:00 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:50 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 17:18:50 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 17:18:50 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 17:18:50 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
ssh: connect to host host_1 port 22: Connection timed out
=================== swift vip : tgw_vip from host_1 is deleted ==============================
--2017-11-16 17:18:57-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 8.61M=0s
2017-11-16 17:20:55 (8.61 MB/s) - 已写入标准输出 [38]
Thu Nov 16 17:20:55 2017 - [info] done.
Thu Nov 16 17:20:55 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 17:20:55 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] The latest binary log file/position on all slaves is host_1.000012:3216
Thu Nov 16 17:20:55 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 17:20:55 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:55 2017 - [info] The oldest binary log file/position on all slaves is host_1.000012:2576
Thu Nov 16 17:20:55 2017 - [info] Oldest slaves:
Thu Nov 16 17:20:55 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 17:20:55 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 16 17:20:55 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Thu Nov 16 17:20:55 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
Target relay log FOUND!
Thu Nov 16 17:20:56 2017 - [info] OK. host_2 has all relay logs.
Thu Nov 16 17:20:56 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 17:20:56 2017 - [info] Searching new master from slaves..
Thu Nov 16 17:20:56 2017 - [info] Candidate masters from the configuration file:
Thu Nov 16 17:20:56 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:56 2017 - [info] Non-candidate masters:
Thu Nov 16 17:20:56 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:56 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 17:20:56 2017 - [info] New master is host_2(host_2:3306)
Thu Nov 16 17:20:56 2017 - [info] Starting master failover..
Thu Nov 16 17:20:56 2017 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 17:20:56 2017 - [info] Starting recovery on host_2(host_2:3306)..
Thu Nov 16 17:20:56 2017 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 17:20:56 2017 - [info] done.
Thu Nov 16 17:20:56 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 17:20:56 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 17:20:56 2017 - [info] host_2.000012:3959
Thu Nov 16 17:20:56 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000012', MASTER_LOG_POS=3959, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 17:20:56 2017 - [info] Executing master IP activate script:
Thu Nov 16 17:20:56 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_2 is added ==============================
Thu Nov 16 17:20:59 2017 - [info] OK.
Thu Nov 16 17:20:59 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 17:20:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 77007. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] Server host_3 received relay logs up to: host_1.000012:2576
Thu Nov 16 17:20:59 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000012:3216 (using the latest slave's relay logs)
Thu Nov 16 17:20:59 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Thu Nov 16 17:20:59 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Thu Nov 16 17:21:00 2017 - [info]
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
Concat binary/relay logs from host_2-relay-bin.000002 pos 1581 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 315.. ok.
Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 1581 to tail(2221).. ok.
Binlog Checksum enabled
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog .
scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog to root@host_3(22) succeeded.
Thu Nov 16 17:21:00 2017 - [info] Generating diff files succeeded.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 78627. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 17:21:00 2017 - [info] Generating diffs succeeded.
Thu Nov 16 17:21:00 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 17:21:00 2017 - [info] done.
Thu Nov 16 17:21:00 2017 - [info] Getting slave status..
Thu Nov 16 17:21:00 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000012:2576). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 17:21:00 2017 - [info] Connecting to the target slave host host_3, running recover script..
Thu Nov 16 17:21:00 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 17:21:00 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 17:21:00 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 17:21:00 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Thu Nov 16 17:21:00 2017 - [info] Executed CHANGE MASTER.
Thu Nov 16 17:21:00 2017 - [info] Slave started.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 17:21:00 2017 - [info] host_2: Resetting slave info succeeded.
Thu Nov 16 17:21:00 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Thu Nov 16 17:21:00 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.
### 切换后的结果
new_master & new_etl 结果一致,由于master上面未传过来的binlog彻底丢失,所以相应的新集群也缺失这些数据。
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 最后一步很重要
如果dead master之后又活过来了,那么这一步要做
dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==》 tgw章节详细描述
结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志
- 2.3.2 当master的所有日志已经传递slave,这时候master server挂了
测试省略,和2.3.1基本一样
结论:由于master上的所有日志全部传递到slave,所以最后是不会丢失master上任何数据的。
2.4 slave(候选master)上面有大事务在跑
- 1000s的大查询
同1.4结论
- flush tables with readlock
同1.4结论
2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
同1.6结论
三、遇到的坑
3.1 交互模式下,如果没有及时敲'YES',则终止切换
3.2 如果在non-gtid模式的机器上,配置了binlog server,会有什么影响呢?
无影响
3.3 不要在relay-log的地方伪造日志
xx-relay-bin.000001
xx-relay-bin.000002
xx-relay-bin.000002.lc --这个是伪造的,当时这个日志是自己解析的 mysqlbinlog -vv xx-relay-bin.000002 > xx-relay-bin.000002.lc
MHA在切换的时候报错如下:
Reading xx-relay-bin.000002.lc
Event too large: pos: 4, event_length: 1163083840, event_type: 32
at /usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm line 103
Tue Nov 14 10:39:08 2017 - [warning] xx doesn't have all relay logs. Maybe some logs were purged.
Tue Nov 14 10:39:08 2017 - [warning] None of latest servers have enough relay logs from oldest position. We can't recover oldest slaves.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln947] None of the latest slaves has enough relay logs for recovery.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
3.4 flush tables with read lock 会阻塞
dba:(none)> show processlist;
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 63 | dba | localhost | NULL | Query | 0 | starting | show processlist |
| 65 | dba | xx:11164 | NULL | Sleep | 121 | | NULL |
| 83 | dba | new master:49022 | NULL | Query | 176 | Waiting for global read lock | BINLOG '
GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAEC
GpAKWh5YUy1LJwAAAI0HAAAAAG |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
会卡在这一步
Tue Nov 14 15:16:23 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
--------------Waiting for global read lock-----------------
那么后面的步骤都会卡住
* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave , Executed CHANGE MASTER to new_master
* Phase 5: New master cleanup phase..
Resetting slave info on the new master.
3.5 binlog伪造
Tue Nov 14 17:23:10 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000003 --start_pos=4042 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_1.000003 pos 4042 to host_1.000050 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql.bin/host_1.000003 position 4042 to tail(4065).. ok.
Failed to save binary log: Target file /data/mysql.bin/host_1.000004 not found!
at /usr/bin/save_binary_logs line 176
Tue Nov 14 17:23:10 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln760] Failed to save binary log events from the orig master. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
测试:
有flush + 无binlog.xx + 1伪造 = save binlog --成功
有flush + 有binlog.xx + 1伪造 = save binlog --失败
无flush + 无binlog.xx + 1伪造 = save binlog --成功
无flush + 有binlog.xx + 1伪造 = save binlog --失败
说明,如果有伪造的binlog,且后缀比之前的大,那么就会告警
解析:
/usr/bin/save_binary_logs ==》
generate_diff_binary_log =>
/usr/share/perl5/vendor_perl/MHA/BinlogManager.pm ==》
concat_all_binlogs_from():: =>
start_num , end_num(如果出现伪造的binlog,则会报错)
四、总结
MHA + NON-GTID 模式,重点配置和用法如下:
1. command
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=$dead_master_ip --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
2. tgw 清理
dead master 如果还可以起来,那么必须在上面执行: /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
原因可参看:http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节
五、 流程简介
* Phase 1: Configuration Check Phase..
HealthCheck: SSH N台DB是否reachable
Binlog server: N台DB 是否reachable
GTID failover mode = ?
Dead Servers is ?
Primary candidate for the new Master (candidate_master is set) ?
* Phase 2: Dead Master Shutdown Phase..
Executing master IP deactivation script:
TGW-vip delete操作
shutdown_script: ?
* Phase 3: Master Recovery Phase..
* Phase 3.1: Getting Latest Slaves Phase..
Latest slaves ,file position ?
Oldest slaves , file position ?
* Phase 3.2: Saving Dead Master's Binlog Phase..
Executing command on the dead master : save_binary_logs --command=save
Concat binary/relay logs from latest_slave file_pos to master's binlog EOF
scp from root@master_ip: binlog to local: xx.binlog --将master最新的,latest缺失的binlog传递到manager机器
* Phase 3.3: Determining New Master Phase..
选择哪个slave为new master
* Phase 3.3: New Master Diff Log Generation Phase..
Need to get diffs from the latest slave up to: xx (using the latest slave's relay logs) --生成new master和latest slave之间的diff日志
scp latest_ip:xx to new_master:xx --将new master和latest slave之间的diff日志 传送给 new master
scp from local: xx to new_master: xx --将latest slave 和 dead master之间的diff日志 传送给 new master
* Phase 3.4: Master Log Apply Phase..
new master 将之前的两段diff日志合并,然后开始apply
All other slaves should start replication from here --开始生成change master语句
Executing master IP activate script: TGW-vip 激活操作,并且设置readonly=0
* Phase 4: Slaves Recovery Phase.. (并行操作)
* Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. --开始生成其余slave 和 latest 之间的diff日志
* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave , Executed CHANGE MASTER to new_master
* Phase 5: New master cleanup phase..
Resetting slave info on the new master.