2017年11月聚合文章--MHA failover NON-GTID 专题

MHA failover NON-GTID 专题

这里以masterha_master_switch为背景详解各种可能遇到的场景

假定环境(经典三节点)

host_1(host_1:3306)(current master)
 +--host_2(host_2:3306slave[candidate master])
 +--host_3(host_3:3306etl)

一、Master : MySQL down

1.1 etl 延迟8小时

配置文件中加上no_check_delay=0 即可忽略报错

1.2 slave(候选master)比etl还要落后更多

1.2.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB binlog状态

* master host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000002 |     1920 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)


* slave (candidate master) host_1

              Master_Log_File: host_2.000002
          Read_Master_Log_Pos: 150
          Exec_Master_Log_Pos: 150


* etl (other slave)

              Master_Log_File: host_2.000002
          Read_Master_Log_Pos: 1035
          Exec_Master_Log_Pos: 1035


### 切换日志

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Mon Nov 13 16:24:03 2017 - [info] MHA::MasterFailover version 0.56.
Mon Nov 13 16:24:03 2017 - [info] Starting master failover.
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] * Phase 1: Configuration Check Phase..
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:03 2017 - [info] Binlog server host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Binlog server host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] GTID failover mode = 0
Mon Nov 13 16:24:06 2017 - [info] Dead Servers:
Mon Nov 13 16:24:06 2017 - [info]   host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Checking master reachability via MySQL(double check)...
Mon Nov 13 16:24:06 2017 - [info]  ok.
Mon Nov 13 16:24:06 2017 - [info] Alive Servers:
Mon Nov 13 16:24:06 2017 - [info]   host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [info]   host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] Alive Slaves:
Mon Nov 13 16:24:06 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:06 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info]     Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:06 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
Mon Nov 13 16:24:06 2017 - [info]   done.
Mon Nov 13 16:24:06 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
Mon Nov 13 16:24:06 2017 - [info]   done.
Mon Nov 13 16:24:06 2017 - [info] Starting Non-GTID based failover.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Nov 13 16:24:06 2017 - [info] Executing master IP deactivation script:
Mon Nov 13 16:24:06 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
===================    swift vip :  tgw_vip from host_2 is deleted  ==============================


--2017-11-13 16:24:07-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        11.2M=0s

2017-11-13 16:24:11 (11.2 MB/s) - 已写入标准输出 [38]

Mon Nov 13 16:24:11 2017 - [info]  done.
Mon Nov 13 16:24:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Nov 13 16:24:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3: Master Recovery Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000002:1035
Mon Nov 13 16:24:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Nov 13 16:24:11 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info]     Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000002:150
Mon Nov 13 16:24:11 2017 - [info] Oldest slaves:
Mon Nov 13 16:24:11 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] Fetching dead master's binary logs..
Mon Nov 13 16:24:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000002 --start_pos=1035 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /var/log/masterha/mha_test if not exists..    ok.
 Concat binary/relay logs from host_2.000002 pos 1035 to host_2.000002 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog ..
  Dumping binlog format description event, from position 0 to 150.. ok.
  Dumping effective binlog data from /data/mysql.bin/host_2.000002 position 1035 to tail(1939).. ok.
 Concat succeeded.
Mon Nov 13 16:24:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: Determining New Master Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Nov 13 16:24:13 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005 :
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
Target relay log FOUND!
Mon Nov 13 16:24:13 2017 - [info] OK. host_3 has all relay logs.
Mon Nov 13 16:24:13 2017 - [info] Searching new master from slaves..
Mon Nov 13 16:24:13 2017 - [info]  Candidate masters from the configuration file:
Mon Nov 13 16:24:13 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:13 2017 - [info]  Non-candidate masters:
Mon Nov 13 16:24:13 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info]     Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info]     Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:13 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Mon Nov 13 16:24:13 2017 - [info]   Not found.
Mon Nov 13 16:24:13 2017 - [info]  Searching from all candidate_master slaves..
Mon Nov 13 16:24:13 2017 - [info] New master is host_1(host_1:3306)
Mon Nov 13 16:24:13 2017 - [info] Starting master failover..
Mon Nov 13 16:24:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
 +--host_1(host_1:3306)
 +--host_3(host_3:3306)

To:
host_1(host_1:3306) (new master)
 +--host_3(host_3:3306)
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Server host_1 received relay logs up to: host_2.000002:150
Mon Nov 13 16:24:13 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000002:1035 (using the latest slave's relay logs)
Mon Nov 13 16:24:13 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005
Mon Nov 13 16:24:15 2017 - [info]
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
 Concat binary/relay logs from host_3-relay-bin.000005 pos 269 to host_3-relay-bin.000005 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog ..
  Dumping binlog format description event, from position 0 to 269.. ok.
  Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000005 position 269 to tail(1154).. ok.
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog .
 scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to root@host_1(22) succeeded.
Mon Nov 13 16:24:15 2017 - [info]  Generating diff files succeeded.
Mon Nov 13 16:24:15 2017 - [info] Sending binlog..
Mon Nov 13 16:24:16 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Nov 13 16:24:16 2017 - [info] Starting recovery on host_1(host_1:3306)..
Mon Nov 13 16:24:16 2017 - [info]  Generating diffs succeeded.
Mon Nov 13 16:24:16 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:16 2017 - [info]  done.
Mon Nov 13 16:24:16 2017 - [info] Getting slave status..
Mon Nov 13 16:24:16 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:150). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:16 2017 - [info] Connecting to the target slave host host_1, running recover script..
Mon Nov 13 16:24:16 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:16 2017 - [info]
 Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog ..
 Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog.. ok.
  Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog.. dumped up to pos 150. ok.
 /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog has effective binlog events from pos 150.
  Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog position 150 to tail(1054).. ok.
 Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:16 2017 - [info]  All relay logs were successfully applied.
Mon Nov 13 16:24:16 2017 - [info] Getting new master's binlog name and position..
Mon Nov 13 16:24:16 2017 - [info]  host_1.000001:3232
Mon Nov 13 16:24:16 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000001', MASTER_LOG_POS=3232, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Nov 13 16:24:16 2017 - [info] Executing master IP activate script:
Mon Nov 13 16:24:16 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_1  is added  ==============================


Mon Nov 13 16:24:21 2017 - [info]  OK.
Mon Nov 13 16:24:21 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info]  ok.
Mon Nov 13 16:24:21 2017 - [info] ** Finished master recovery successfully.
Mon Nov 13 16:24:21 2017 - [info] * Phase 3: Master Recovery Phase completed.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4: Slaves Recovery Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 104955. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Mon Nov 13 16:24:21 2017 - [info] Generating relay diff files from the latest slave succeeded.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 104966. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Sending binlog..
Mon Nov 13 16:24:21 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:21 2017 - [info] Starting recovery on host_3(host_3:3306)..
Mon Nov 13 16:24:21 2017 - [info]  Generating diffs succeeded.
Mon Nov 13 16:24:21 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:21 2017 - [info]  done.
Mon Nov 13 16:24:21 2017 - [info] Getting slave status..
Mon Nov 13 16:24:21 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:1035). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:21 2017 - [info] Connecting to the target slave host host_3, running recover script..
Mon Nov 13 16:24:21 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:21 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:21 2017 - [info]  All relay logs were successfully applied.
Mon Nov 13 16:24:21 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info]  Executed CHANGE MASTER.
Mon Nov 13 16:24:21 2017 - [info]  Slave started.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Mon Nov 13 16:24:21 2017 - [info] All new slave servers recovered successfully.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 5: New master cleanup phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Resetting slave info on the new master..
Mon Nov 13 16:24:21 2017 - [info]  host_1: Resetting slave info succeeded.
Mon Nov 13 16:24:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info] Sending mail..

1.2.2 当master的所有日志已经传递到1个etl，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB binlog状态

* master host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000010 |     1694 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 164 | 1    |
| 165 | 2    |
| 166 | 3    |
+-----+------+
3 rows in set (0.00 sec)

* slave (candidate master) host_1

              Master_Log_File: host_2.000010
          Exec_Master_Log_Pos: 806

dba:lc> select * from t_char_2;
Empty set (0.00 sec)


* etl (other slave) host_3

              Master_Log_File: host_2.000010
          Exec_Master_Log_Pos: 1694

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 164 | 1    |
| 165 | 2    |
| 166 | 3    |
+-----+------+
3 rows in set (0.00 sec)

### 切换日志

Wed Nov 15 10:25:50 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:25:50 2017 - [info] Starting master failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:25:50 2017 - [info] Dead Servers:
Wed Nov 15 10:25:50 2017 - [info]   host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:25:50 2017 - [info]  ok.
Wed Nov 15 10:25:50 2017 - [info] Alive Servers:
Wed Nov 15 10:25:50 2017 - [info]   host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info]   host_3(host_3:3306)
Wed Nov 15 10:25:50 2017 - [info] Alive Slaves:
Wed Nov 15 10:25:50 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:50 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:50 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
Wed Nov 15 10:25:50 2017 - [info]   done.
Wed Nov 15 10:25:50 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:25:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:25:51 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:25:51 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
===================    swift vip :  tgw_vip from host_2 is deleted  ==============================


--2017-11-15 10:25:51-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        11.1M=0s

2017-11-15 10:25:53 (11.1 MB/s) - 已写入标准输出 [38]

Wed Nov 15 10:25:53 2017 - [info]  done.
Wed Nov 15 10:25:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:25:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] The latest binary log file/position on all slaves is host_2.000010:1694
Wed Nov 15 10:25:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:25:53 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:53 2017 - [info] The oldest binary log file/position on all slaves is host_2.000010:806
Wed Nov 15 10:25:53 2017 - [info] Oldest slaves:
Wed Nov 15 10:25:53 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:25:53 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000010 --start_pos=1694 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /var/log/masterha/mha_test if not exists..    ok.
 Concat binary/relay logs from host_2.000010 pos 1694 to host_2.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog ..
  Dumping binlog format description event, from position 0 to 150.. ok.
  Dumping effective binlog data from /data/mysql.bin/host_2.000010 position 1694 to tail(1713).. ok.
 Concat succeeded.
Wed Nov 15 10:25:53 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:25:54 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Wed Nov 15 10:25:54 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
Target relay log FOUND!
Wed Nov 15 10:25:54 2017 - [info] OK. host_3 has all relay logs.
Wed Nov 15 10:25:54 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:25:54 2017 - [info]  Candidate masters from the configuration file:
Wed Nov 15 10:25:54 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:54 2017 - [info]  Non-candidate masters:
Wed Nov 15 10:25:54 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info]     Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:54 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:25:54 2017 - [info]   Not found.
Wed Nov 15 10:25:54 2017 - [info]  Searching from all candidate_master slaves..
Wed Nov 15 10:25:54 2017 - [info] New master is host_1(host_1:3306)
Wed Nov 15 10:25:54 2017 - [info] Starting master failover..
Wed Nov 15 10:25:54 2017 - [info]
From:
host_2(host_2:3306) (current master)
 +--host_1(host_1:3306)
 +--host_3(host_3:3306)

To:
host_1(host_1:3306) (new master)
 +--host_3(host_3:3306)
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Server host_1 received relay logs up to: host_2.000010:806
Wed Nov 15 10:25:54 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000010:1694 (using the latest slave's relay logs)
Wed Nov 15 10:25:55 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Wed Nov 15 10:25:55 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Wed Nov 15 10:25:55 2017 - [info]
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
 Concat binary/relay logs from host_3-relay-bin.000004 pos 1017 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog ..
  Dumping binlog format description event, from position 0 to 361.. ok.
  Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 1017 to tail(1905).. ok.
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog .
 scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to root@host_1(22) succeeded.
Wed Nov 15 10:25:55 2017 - [info]  Generating diff files succeeded.
Wed Nov 15 10:25:55 2017 - [info] Sending binlog..
Wed Nov 15 10:25:56 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:25:56 2017 - [info] Starting recovery on host_1(host_1:3306)..
Wed Nov 15 10:25:56 2017 - [info]  Generating diffs succeeded.
Wed Nov 15 10:25:56 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:56 2017 - [info]  done.
Wed Nov 15 10:25:56 2017 - [info] Getting slave status..
Wed Nov 15 10:25:56 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:806). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:56 2017 - [info] Connecting to the target slave host host_1, running recover script..
Wed Nov 15 10:25:56 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:56 2017 - [info]
 Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog ..
 Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog.. ok.
  Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog.. dumped up to pos 150. ok.
 /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog has effective binlog events from pos 150.
  Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog position 150 to tail(169).. ok.
 Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:56 2017 - [info]  All relay logs were successfully applied.
Wed Nov 15 10:25:56 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:25:56 2017 - [info]  host_1.000010:2060
Wed Nov 15 10:25:56 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000010', MASTER_LOG_POS=2060, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:25:56 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:25:56 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_1  is added  ==============================


Wed Nov 15 10:25:59 2017 - [info]  OK.
Wed Nov 15 10:25:59 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info]  ok.
Wed Nov 15 10:25:59 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:25:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 125962. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Wed Nov 15 10:25:59 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 125967. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Sending binlog..
Wed Nov 15 10:25:59 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:59 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:25:59 2017 - [info]  Generating diffs succeeded.
Wed Nov 15 10:25:59 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:59 2017 - [info]  done.
Wed Nov 15 10:25:59 2017 - [info] Getting slave status..
Wed Nov 15 10:25:59 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:1694). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:59 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:25:59 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:59 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:59 2017 - [info]  All relay logs were successfully applied.
Wed Nov 15 10:25:59 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info]  Executed CHANGE MASTER.
Wed Nov 15 10:25:59 2017 - [info]  Slave started.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:25:59 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:25:59 2017 - [info]  host_1: Resetting slave info succeeded.
Wed Nov 15 10:25:59 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info] Sending mail..

### 切换后的结果

* new master 和 new etl  数据一致

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 164 | 1    |
| 165 | 2    |
| 166 | 3    |
+-----+------+
3 rows in set (0.00 sec)

1.3 slave(候选master)的日志是最新的，比etl要多

1.3.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB binlog状态

* master host_2

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000004 |     4577 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 114 | 1    |
| 115 | 2    |
| 116 | 3    |
| 117 | 4    |
| 118 | 5    |
| 119 | 6    |
| 120 | 7    |
| 121 | 8    |
| 122 | 10   |
| 123 | 11   |
| 124 | 12   |
| 125 | 13   |
| 126 | 14   |
| 127 | 15   |
+-----+------+
14 rows in set (0.00 sec)


* slave (candidate master) host_1


              Master_Log_File: host_2.000004
          Exec_Master_Log_Pos: 3683

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 114 | 1    |
| 115 | 2    |
| 116 | 3    |
| 117 | 4    |
| 118 | 5    |
| 119 | 6    |
| 120 | 7    |
| 121 | 8    |
| 122 | 10   |
| 123 | 11   |
| 124 | 12   |
+-----+------+
11 rows in set (0.00 sec)




* etl (other slave) host_3

              Master_Log_File: host_2.000004
          Exec_Master_Log_Pos: 2789

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 114 | 1    |
| 115 | 2    |
| 116 | 3    |
| 117 | 4    |
| 118 | 5    |
| 119 | 6    |
| 120 | 7    |
| 121 | 8    |
+-----+------+
8 rows in set (0.00 sec)



### 切换日志


masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error


Tue Nov 14 17:13:08 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov 14 17:13:08 2017 - [info] Starting master failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] GTID failover mode = 0
Tue Nov 14 17:13:08 2017 - [info] Dead Servers:
Tue Nov 14 17:13:08 2017 - [info]   host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov 14 17:13:08 2017 - [info]  ok.
Tue Nov 14 17:13:08 2017 - [info] Alive Servers:
Tue Nov 14 17:13:08 2017 - [info]   host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [info]   host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] Alive Slaves:
Tue Nov 14 17:13:08 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:08 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:08 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
Tue Nov 14 17:13:08 2017 - [info]   done.
Tue Nov 14 17:13:08 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
Tue Nov 14 17:13:08 2017 - [info]   done.
Tue Nov 14 17:13:08 2017 - [info] Starting Non-GTID based failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 14 17:13:09 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov 14 17:13:09 2017 - [info] Executing master IP deactivation script:
Tue Nov 14 17:13:09 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
===================    swift vip :  tgw_vip from host_2 is deleted  ==============================


--2017-11-14 17:13:09-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        7.19M=0s

2017-11-14 17:13:11 (7.19 MB/s) - 已写入标准输出 [38]

Tue Nov 14 17:13:11 2017 - [info]  done.
Tue Nov 14 17:13:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov 14 17:13:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000004:3683
Tue Nov 14 17:13:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov 14 17:13:11 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000004:2789
Tue Nov 14 17:13:11 2017 - [info] Oldest slaves:
Tue Nov 14 17:13:11 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] Fetching dead master's binary logs..
Tue Nov 14 17:13:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000004 --start_pos=3683 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /var/log/masterha/mha_test if not exists..    ok.
 Concat binary/relay logs from host_2.000004 pos 3683 to host_2.000004 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog ..
  Dumping binlog format description event, from position 0 to 150.. ok.
  Dumping effective binlog data from /data/mysql.bin/host_2.000004 position 3683 to tail(4596).. ok.
 Concat succeeded.
Tue Nov 14 17:13:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Nov 14 17:13:12 2017 - [info] Checking whether host_1 has relay logs from the oldest position..
Tue Nov 14 17:13:12 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003 :
    Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
Target relay log FOUND!
Tue Nov 14 17:13:13 2017 - [info] OK. host_1 has all relay logs.
Tue Nov 14 17:13:13 2017 - [info] Searching new master from slaves..
Tue Nov 14 17:13:13 2017 - [info]  Candidate masters from the configuration file:
Tue Nov 14 17:13:13 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:13 2017 - [info]  Non-candidate masters:
Tue Nov 14 17:13:13 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info]     Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info]     Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:13 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov 14 17:13:13 2017 - [info] New master is host_1(host_1:3306)
Tue Nov 14 17:13:13 2017 - [info] Starting master failover..
Tue Nov 14 17:13:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
 +--host_1(host_1:3306)
 +--host_3(host_3:3306)

To:
host_1(host_1:3306) (new master)
 +--host_3(host_3:3306)
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Tue Nov 14 17:13:13 2017 - [info] Sending binlog..
Tue Nov 14 17:13:13 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Nov 14 17:13:13 2017 - [info] Starting recovery on host_1(host_1:3306)..
Tue Nov 14 17:13:13 2017 - [info]  Generating diffs succeeded.
Tue Nov 14 17:13:13 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:13 2017 - [info]  done.
Tue Nov 14 17:13:13 2017 - [info] Getting slave status..
Tue Nov 14 17:13:13 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:3683). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:13 2017 - [info] Connecting to the target slave host host_1, running recover script..
Tue Nov 14 17:13:13 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:13 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:13 2017 - [info]  All relay logs were successfully applied.
Tue Nov 14 17:13:13 2017 - [info] Getting new master's binlog name and position..
Tue Nov 14 17:13:13 2017 - [info]  host_1.000003:2347
Tue Nov 14 17:13:13 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000003', MASTER_LOG_POS=2347, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov 14 17:13:13 2017 - [info] Executing master IP activate script:
Tue Nov 14 17:13:13 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_1  is added  ==============================


Tue Nov 14 17:13:16 2017 - [info]  OK.
Tue Nov 14 17:13:16 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Tue Nov 14 17:13:16 2017 - [info]  ok.
Tue Nov 14 17:13:16 2017 - [info] ** Finished master recovery successfully.
Tue Nov 14 17:13:16 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 29885. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] Server host_3 received relay logs up to: host_2.000004:2789
Tue Nov 14 17:13:16 2017 - [info] Need to get diffs from the latest slave(host_1) up to: host_2.000004:3683 (using the latest slave's relay logs)
Tue Nov 14 17:13:16 2017 - [info] Connecting to the latest slave host host_1, generating diff relay log files..
Tue Nov 14 17:13:16 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003
Tue Nov 14 17:13:17 2017 - [info]
    Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
 Concat binary/relay logs from host_1-relay-bin.000003 pos 269 to host_1-relay-bin.000003 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog ..
  Dumping binlog format description event, from position 0 to 269.. ok.
  Dumping effective binlog data from /data/mysql_data/host_1-relay-bin.000003 position 269 to tail(1163).. ok.
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog .
 scp host_1.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to root@host_3(22) succeeded.
Tue Nov 14 17:13:17 2017 - [info]  Generating diff files succeeded.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] Generating relay diff files from the latest slave succeeded.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 31393. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Sending binlog..
Tue Nov 14 17:13:17 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:17 2017 - [info] Starting recovery on host_3(host_3:3306)..
Tue Nov 14 17:13:17 2017 - [info]  Generating diffs succeeded.
Tue Nov 14 17:13:17 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:17 2017 - [info]  done.
Tue Nov 14 17:13:17 2017 - [info] Getting slave status..
Tue Nov 14 17:13:17 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:2789). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:17 2017 - [info] Connecting to the target slave host host_3, running recover script..
Tue Nov 14 17:13:17 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:17 2017 - [info]
 Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog ..
 Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog.. ok.
  Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog.. dumped up to pos 150. ok.
 /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog has effective binlog events from pos 150.
  Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog position 150 to tail(1063).. ok.
 Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:17 2017 - [info]  All relay logs were successfully applied.
Tue Nov 14 17:13:17 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Tue Nov 14 17:13:17 2017 - [info]  Executed CHANGE MASTER.
Tue Nov 14 17:13:17 2017 - [info]  Slave started.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] All new slave servers recovered successfully.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Resetting slave info on the new master..
Tue Nov 14 17:13:17 2017 - [info]  host_1: Resetting slave info succeeded.
Tue Nov 14 17:13:17 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_1(host_1:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_1(host_1:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info] Sending mail..


## 切换结果

new master 和 new etl
dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 114 | 1    |
| 115 | 2    |
| 116 | 3    |
| 117 | 4    |
| 118 | 5    |
| 119 | 6    |
| 120 | 7    |
| 121 | 8    |
| 122 | 10   |
| 123 | 11   |
| 124 | 12   |
| 125 | 13   |
| 126 | 14   |
| 127 | 15   |
+-----+------+
14 rows in set (0.00 sec)

1.3.2 当master的所有日志已经传递slave，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB binlog状态

* master host_1

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000010 |     3341 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 167 | 1    |
| 168 | 2    |
| 169 | 3    |
+-----+------+
3 rows in set (0.00 sec)

* slave host_2

              Master_Log_File: host_1.000010
          Exec_Master_Log_Pos: 3341

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 167 | 1    |
| 168 | 2    |
| 169 | 3    |
+-----+------+
3 rows in set (0.00 sec)


* etl host_3

              Master_Log_File: host_1.000010
          Exec_Master_Log_Pos: 2381


### 切换日志

Wed Nov 15 10:39:36 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:39:36 2017 - [info] Starting master failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:39:36 2017 - [info] Dead Servers:
Wed Nov 15 10:39:36 2017 - [info]   host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:39:36 2017 - [info]  ok.
Wed Nov 15 10:39:36 2017 - [info] Alive Servers:
Wed Nov 15 10:39:36 2017 - [info]   host_2(host_2:3306)
Wed Nov 15 10:39:36 2017 - [info]   host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] Alive Slaves:
Wed Nov 15 10:39:36 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:36 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:36 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
Wed Nov 15 10:39:36 2017 - [info]   done.
Wed Nov 15 10:39:36 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:39:37 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:39:37 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:39:37 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
===================    swift vip :  tgw_vip from host_1 is deleted  ==============================


--2017-11-15 10:39:37-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        11.4M=0s

2017-11-15 10:39:39 (11.4 MB/s) - 已写入标准输出 [38]

Wed Nov 15 10:39:39 2017 - [info]  done.
Wed Nov 15 10:39:39 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:39:39 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] The latest binary log file/position on all slaves is host_1.000010:3341
Wed Nov 15 10:39:39 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:39:39 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:39 2017 - [info] The oldest binary log file/position on all slaves is host_1.000010:2381
Wed Nov 15 10:39:39 2017 - [info] Oldest slaves:
Wed Nov 15 10:39:39 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:39:39 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000010 --start_pos=3341 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /var/log/masterha/mha_test if not exists..    ok.
 Concat binary/relay logs from host_1.000010 pos 3341 to host_1.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 154.. ok.
  Dumping effective binlog data from /data/mysql.bin/host_1.000010 position 3341 to tail(3364).. ok.
 Binlog Checksum enabled
 Concat succeeded.
Wed Nov 15 10:39:39 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:39:40 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Wed Nov 15 10:39:40 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
    Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
Target relay log FOUND!
Wed Nov 15 10:39:40 2017 - [info] OK. host_2 has all relay logs.
Wed Nov 15 10:39:40 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:39:40 2017 - [info]  Candidate masters from the configuration file:
Wed Nov 15 10:39:40 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:40 2017 - [info]  Non-candidate masters:
Wed Nov 15 10:39:40 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info]     Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info]     Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:40 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:39:40 2017 - [info] New master is host_2(host_2:3306)
Wed Nov 15 10:39:40 2017 - [info] Starting master failover..
Wed Nov 15 10:39:40 2017 - [info]
From:
host_1(host_1:3306) (current master)
 +--host_2(host_2:3306)
 +--host_3(host_3:3306)

To:
host_2(host_2:3306) (new master)
 +--host_3(host_3:3306)
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:39:40 2017 - [info] Sending binlog..
Wed Nov 15 10:39:41 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:39:41 2017 - [info] Starting recovery on host_2(host_2:3306)..
Wed Nov 15 10:39:41 2017 - [info]  Generating diffs succeeded.
Wed Nov 15 10:39:41 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:41 2017 - [info]  done.
Wed Nov 15 10:39:41 2017 - [info] Getting slave status..
Wed Nov 15 10:39:41 2017 - [info] This slave(host_2)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:3341). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:41 2017 - [info] Connecting to the target slave host host_2, running recover script..
Wed Nov 15 10:39:41 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:39:41 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_2:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:39:41 2017 - [info]  All relay logs were successfully applied.
Wed Nov 15 10:39:41 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:39:41 2017 - [info]  host_2.000011:1307
Wed Nov 15 10:39:41 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000011', MASTER_LOG_POS=1307, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:39:41 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:39:41 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_2  is added  ==============================


Wed Nov 15 10:39:44 2017 - [info]  OK.
Wed Nov 15 10:39:44 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
Wed Nov 15 10:39:44 2017 - [info]  ok.
Wed Nov 15 10:39:44 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:39:44 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 11760. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] Server host_3 received relay logs up to: host_1.000010:2381
Wed Nov 15 10:39:44 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000010:3341 (using the latest slave's relay logs)
Wed Nov 15 10:39:44 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Wed Nov 15 10:39:44 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Wed Nov 15 10:39:45 2017 - [info]
    Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
 Concat binary/relay logs from host_2-relay-bin.000002 pos 636 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 315.. ok.
  Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 636 to tail(1596).. ok.
 Binlog Checksum enabled
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog .
 scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to root@host_3(22) succeeded.
Wed Nov 15 10:39:45 2017 - [info]  Generating diff files succeeded.
Wed Nov 15 10:39:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:39:45 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:39:45 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 12881. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Sending binlog..
Wed Nov 15 10:39:45 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:45 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:39:45 2017 - [info]  Generating diffs succeeded.
Wed Nov 15 10:39:45 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:45 2017 - [info]  done.
Wed Nov 15 10:39:45 2017 - [info] Getting slave status..
Wed Nov 15 10:39:45 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:2381). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:45 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:39:45 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:40:45 2017 - [info]
 Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog ..
 Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog.. ok.
  Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog..  Binlog Checksum enabled
dumped up to pos 154. ok.
 /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog has effective binlog events from pos 154.
  Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog position 154 to tail(177).. ok.
 Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:40:45 2017 - [info]  All relay logs were successfully applied.
Wed Nov 15 10:40:45 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Wed Nov 15 10:40:45 2017 - [info]  Executed CHANGE MASTER.
Wed Nov 15 10:40:45 2017 - [info]  Slave started.
Wed Nov 15 10:40:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:40:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:40:45 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:40:45 2017 - [info]  host_2: Resetting slave info succeeded.
Wed Nov 15 10:40:45 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded

Master host_1(host_1:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info] Sending mail..



### 切换结果

new master & new etl 数据一致

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 167 | 1    |
| 168 | 2    |
| 169 | 3    |

1.4 slave(候选master）上面有大事务在跑

1000s的大查询

无影响，正常切换

1.5 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

90%的场景都是可以重新执行的
5%的场景不能执行

* NO。1
	问题： dead master上面的vip已经delete，再次failover 这一步会报错
	方案： a）将dead master上的vip再重新add上去  或者 b）直接过滤掉 delete dead master vip这一步

5%的场景不能再次执行，执行会报错

一般这种场景就是：已经failover到最后的change master阶段，这样主从结构已经变更( candidate slave <==rep== etl )，MHA无法重新走一遍。

好在，change master的语句已经生成，其余slave再次执行一遍日志里面的change master即可

Thu Nov  9 16:49:39 2017 - [info] GTID failover mode = 0
Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1(host_1:3306) does not match with specified dead master host_2(host_2:3306)!
Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53

1.7 Master：MySQL down小结

1. failover最终命令

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=$dead_master_ip  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

二、Master : Server down

2.1 etl 延迟8小时

同1.1 结论

2.2 slave(候选master)比etl还要落后更多

2.2.1 当master的部分日志还没传递两个slave，这时候master server挂了

### 3台DB的binlog状态

* master

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000012 |     1651 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

dba:lc>
dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 172 | 1    |
| 173 | 2    |
| 174 | 3    |
| 175 | 4    |
+-----+------+
4 rows in set (0.00 sec)


* slave

              Master_Log_File: host_2.000012
          Exec_Master_Log_Pos: 467

dba:lc> select * from t_char_2;
Empty set (0.00 sec)



* etl

              Master_Log_File: host_2.000012
          Exec_Master_Log_Pos: 1059

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 172 | 1    |
| 173 | 2    |
+-----+------+
2 rows in set (0.00 sec)


### 模拟故障场景

* 隔离master的网络，让其等同于down机

master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### 切换日志

Thu Nov 16 16:54:40 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 16:54:40 2017 - [info] Starting master failover.
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 16:54:40 2017 - [info] GTID failover mode = 0
Thu Nov 16 16:54:40 2017 - [info] Dead Servers:
Thu Nov 16 16:54:40 2017 - [info]   host_2(host_2:3306)
Thu Nov 16 16:54:40 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 16:54:41 2017 - [info]  ok.
Thu Nov 16 16:54:41 2017 - [info] Alive Servers:
Thu Nov 16 16:54:41 2017 - [info]   host_1(host_1:3306)
Thu Nov 16 16:54:41 2017 - [info]   host_3(host_3:3306)
Thu Nov 16 16:54:41 2017 - [info] Alive Slaves:
Thu Nov 16 16:54:41 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:54:41 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 16:54:41 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..
Thu Nov 16 16:54:41 2017 - [info]   done.
Thu Nov 16 16:54:41 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 16:54:41 2017 - [info]   done.
Thu Nov 16 16:54:41 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:55:31 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 16:55:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 16:55:31 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 16:55:31 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stop
ssh: connect to host host_2 port 22: Connection timed out
===================    swift vip :  tgw_vip from host_2 is deleted  ==============================


--2017-11-16 16:55:38-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        12.1M=0s

2017-11-16 16:57:36 (12.1 MB/s) - 已写入标准输出 [38]

Thu Nov 16 16:57:36 2017 - [info]  done.
Thu Nov 16 16:57:36 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 16:57:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] The latest binary log file/position on all slaves is host_2.000012:1059
Thu Nov 16 16:57:36 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 16:57:36 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:36 2017 - [info] The oldest binary log file/position on all slaves is host_2.000012:467
Thu Nov 16 16:57:36 2017 - [info] Oldest slaves:
Thu Nov 16 16:57:36 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Thu Nov 16 16:57:37 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
Target relay log FOUND!
Thu Nov 16 16:57:37 2017 - [info] OK. host_3 has all relay logs.
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Searching new master from slaves..
Thu Nov 16 16:57:37 2017 - [info]  Candidate masters from the configuration file:
Thu Nov 16 16:57:37 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:37 2017 - [info]  Non-candidate masters:
Thu Nov 16 16:57:37 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info]     Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:37 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 16:57:37 2017 - [info]   Not found.
Thu Nov 16 16:57:37 2017 - [info]  Searching from all candidate_master slaves..
Thu Nov 16 16:57:37 2017 - [info] New master is host_1(host_1:3306)
Thu Nov 16 16:57:37 2017 - [info] Starting master failover..
Thu Nov 16 16:57:37 2017 - [info]
From:
host_2(host_2:3306) (current master)
 +--host_1(host_1:3306)
 +--host_3(host_3:3306)

To:
host_1(host_1:3306) (new master)
 +--host_3(host_3:3306)
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] Server host_1 received relay logs up to: host_2.000012:467
Thu Nov 16 16:57:37 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000012:1059 (using the latest slave's relay logs)
Thu Nov 16 16:57:38 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Thu Nov 16 16:57:38 2017 - [info]
    Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
 Concat binary/relay logs from host_3-relay-bin.000004 pos 678 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog ..
  Dumping binlog format description event, from position 0 to 361.. ok.
  Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 678 to tail(1270).. ok.
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog .
 scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog to root@host_1(22) succeeded.
Thu Nov 16 16:57:38 2017 - [info]  Generating diff files succeeded.
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 16:57:38 2017 - [info] Starting recovery on host_1(host_1:3306)..
Thu Nov 16 16:57:38 2017 - [info]  Generating diffs succeeded.
Thu Nov 16 16:57:38 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 16:57:38 2017 - [info]  done.
Thu Nov 16 16:57:38 2017 - [info] Getting slave status..
Thu Nov 16 16:57:38 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000012:467). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 16:57:38 2017 - [info] Connecting to the target slave host host_1, running recover script..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 16:57:39 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 16:57:39 2017 - [info]  All relay logs were successfully applied.
Thu Nov 16 16:57:39 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 16:57:39 2017 - [info]  host_1.000012:1310
Thu Nov 16 16:57:39 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000012', MASTER_LOG_POS=1310, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 16:57:39 2017 - [info] Executing master IP activate script:
Thu Nov 16 16:57:39 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_1  is added  ==============================


Thu Nov 16 16:57:41 2017 - [info]  OK.
Thu Nov 16 16:57:41 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info]  ok.
Thu Nov 16 16:57:41 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 16:57:41 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 123011. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Thu Nov 16 16:57:41 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 123044. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 16:57:41 2017 - [info]  This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 16:57:41 2017 - [info]   done.
Thu Nov 16 16:57:41 2017 - [info]  All relay logs were successfully applied.
Thu Nov 16 16:57:41 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info]  Executed CHANGE MASTER.
Thu Nov 16 16:57:41 2017 - [info]  Slave started.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 16:57:41 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 16:57:41 2017 - [info]  host_1: Resetting slave info succeeded.
Thu Nov 16 16:57:41 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Thu Nov 16 16:57:41 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.

### 切换后的结果

new_master & new_etl 结果一致，由于master上面未传过来的binlog彻底丢失，所以相应的新集群也缺失这些数据。

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 172 | 1    |
| 173 | 2    |
+-----+------+
2 rows in set (0.00 sec)


### 最后一步很重要

如果dead master之后又活过来了，那么这一步要做

dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==》 tgw章节详细描述

2.2.2 当master的所有日志已经传递到1个etl，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master上的所有日志全部传递到etl，所以最后是不会丢失master上任何数据的。

2.3 slave(候选master)的日志是最新的，比etl要多

2.3.1 当master的部分日志还没传递两个slave，这时候master server挂了

### 3台DB的binlog状态

* master

dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000012 |     3860 |              |                  |                   |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 176 | 1    |
| 177 | 2    |
| 178 | 10   |
| 179 | 20   |
+-----+------+
4 rows in set (0.00 sec)


* slave

              Master_Log_File: host_1.000012
          Exec_Master_Log_Pos: 3216

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 176 | 1    |
| 177 | 2    |
+-----+------+
2 rows in set (0.00 sec)


* etl

              Master_Log_File: host_1.000012
          Exec_Master_Log_Pos: 2576

dba:lc> select * from t_char_2;
Empty set (0.00 sec)


### 模拟故障场景

* 隔离master的网络，让其等同于down机

master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### 切换日志

Thu Nov 16 17:17:59 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 17:17:59 2017 - [info] Starting master failover.
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 17:17:59 2017 - [info] GTID failover mode = 0
Thu Nov 16 17:17:59 2017 - [info] Dead Servers:
Thu Nov 16 17:17:59 2017 - [info]   host_1(host_1:3306)
Thu Nov 16 17:17:59 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 17:18:00 2017 - [info]  ok.
Thu Nov 16 17:18:00 2017 - [info] Alive Servers:
Thu Nov 16 17:18:00 2017 - [info]   host_2(host_2:3306)
Thu Nov 16 17:18:00 2017 - [info]   host_3(host_3:3306)
Thu Nov 16 17:18:00 2017 - [info] Alive Slaves:
Thu Nov 16 17:18:00 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:18:00 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 17:18:00 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..
Thu Nov 16 17:18:00 2017 - [info]   done.
Thu Nov 16 17:18:00 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 17:18:00 2017 - [info]   done.
Thu Nov 16 17:18:00 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:50 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 17:18:50 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 17:18:50 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 17:18:50 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
ssh: connect to host host_1 port 22: Connection timed out
===================    swift vip :  tgw_vip from host_1 is deleted  ==============================


--2017-11-16 17:18:57-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：未指定 [text/html]
正在保存至: “STDOUT”

     0K                                                        8.61M=0s

2017-11-16 17:20:55 (8.61 MB/s) - 已写入标准输出 [38]

Thu Nov 16 17:20:55 2017 - [info]  done.
Thu Nov 16 17:20:55 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 17:20:55 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] The latest binary log file/position on all slaves is host_1.000012:3216
Thu Nov 16 17:20:55 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 17:20:55 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:55 2017 - [info] The oldest binary log file/position on all slaves is host_1.000012:2576
Thu Nov 16 17:20:55 2017 - [info] Oldest slaves:
Thu Nov 16 17:20:55 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 17:20:55 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 16 17:20:55 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Thu Nov 16 17:20:55 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
    Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
Target relay log FOUND!
Thu Nov 16 17:20:56 2017 - [info] OK. host_2 has all relay logs.
Thu Nov 16 17:20:56 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 17:20:56 2017 - [info] Searching new master from slaves..
Thu Nov 16 17:20:56 2017 - [info]  Candidate masters from the configuration file:
Thu Nov 16 17:20:56 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:56 2017 - [info]  Non-candidate masters:
Thu Nov 16 17:20:56 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info]     Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info]     Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:56 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 17:20:56 2017 - [info] New master is host_2(host_2:3306)
Thu Nov 16 17:20:56 2017 - [info] Starting master failover..
Thu Nov 16 17:20:56 2017 - [info]
From:
host_1(host_1:3306) (current master)
 +--host_2(host_2:3306)
 +--host_3(host_3:3306)

To:
host_2(host_2:3306) (new master)
 +--host_3(host_3:3306)
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 17:20:56 2017 - [info] Starting recovery on host_2(host_2:3306)..
Thu Nov 16 17:20:56 2017 - [info]  This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 17:20:56 2017 - [info]   done.
Thu Nov 16 17:20:56 2017 - [info]  All relay logs were successfully applied.
Thu Nov 16 17:20:56 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 17:20:56 2017 - [info]  host_2.000012:3959
Thu Nov 16 17:20:56 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000012', MASTER_LOG_POS=3959, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 17:20:56 2017 - [info] Executing master IP activate script:
Thu Nov 16 17:20:56 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
===================    swift vip :  tgw_vip to host_2  is added  ==============================


Thu Nov 16 17:20:59 2017 - [info]  OK.
Thu Nov 16 17:20:59 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 17:20:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 77007. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] Server host_3 received relay logs up to: host_1.000012:2576
Thu Nov 16 17:20:59 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000012:3216 (using the latest slave's relay logs)
Thu Nov 16 17:20:59 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Thu Nov 16 17:20:59 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Thu Nov 16 17:21:00 2017 - [info]
    Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
 Fast relay log position search succeeded.
 Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
 Concat binary/relay logs from host_2-relay-bin.000002 pos 1581 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 315.. ok.
  Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 1581 to tail(2221).. ok.
 Binlog Checksum enabled
 Concat succeeded.
 Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog .
 scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog to root@host_3(22) succeeded.
Thu Nov 16 17:21:00 2017 - [info]  Generating diff files succeeded.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 78627. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 17:21:00 2017 - [info]  Generating diffs succeeded.
Thu Nov 16 17:21:00 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 17:21:00 2017 - [info]  done.
Thu Nov 16 17:21:00 2017 - [info] Getting slave status..
Thu Nov 16 17:21:00 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000012:2576). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 17:21:00 2017 - [info] Connecting to the target slave host host_3, running recover script..
Thu Nov 16 17:21:00 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 17:21:00 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 17:21:00 2017 - [info]  All relay logs were successfully applied.
Thu Nov 16 17:21:00 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Thu Nov 16 17:21:00 2017 - [info]  Executed CHANGE MASTER.
Thu Nov 16 17:21:00 2017 - [info]  Slave started.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 17:21:00 2017 - [info]  host_2: Resetting slave info succeeded.
Thu Nov 16 17:21:00 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Thu Nov 16 17:21:00 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded

Master host_1(host_1:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.



### 切换后的结果

new_master & new_etl 结果一致，由于master上面未传过来的binlog彻底丢失，所以相应的新集群也缺失这些数据。

dba:lc> select * from t_char_2;
+-----+------+
| id  | name |
+-----+------+
| 176 | 1    |
| 177 | 2    |
+-----+------+
2 rows in set (0.00 sec)



### 最后一步很重要

如果dead master之后又活过来了，那么这一步要做

dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==》 tgw章节详细描述

结论：由于master 已挂，然而最后的日志没有传递到其他服务器，所以会丢失master没有传递过来的事务日志

2.3.2 当master的所有日志已经传递slave，这时候master server挂了

测试省略，和2.3.1基本一样

结论：由于master上的所有日志全部传递到slave，所以最后是不会丢失master上任何数据的。

2.4 slave(候选master）上面有大事务在跑

1000s的大查询

同1.4结论

flush tables with readlock

同1.4结论

2.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

同1.6结论

三、遇到的坑

3.1 交互模式下，如果没有及时敲’YES’，则终止切换

3.2 如果在non-gtid模式的机器上，配置了binlog server，会有什么影响呢？

无影响

3.3 不要在relay-log的地方伪造日志

xx-relay-bin.000001
xx-relay-bin.000002
xx-relay-bin.000002.lc   --这个是伪造的，当时这个日志是自己解析的 mysqlbinlog -vv xx-relay-bin.000002 > xx-relay-bin.000002.lc

MHA在切换的时候报错如下：

Reading xx-relay-bin.000002.lc
Event too large: pos: 4, event_length: 1163083840, event_type: 32
 at /usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm line 103
Tue Nov 14 10:39:08 2017 - [warning] xx doesn't have all relay logs. Maybe some logs were purged.
Tue Nov 14 10:39:08 2017 - [warning] None of latest servers have enough relay logs from oldest position. We can't recover oldest slaves.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln947] None of the latest slaves has enough relay logs for recovery.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53

3.4 flush tables with read lock 会阻塞

dba:(none)> show processlist;
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 63 | dba  | localhost            | NULL | Query   |    0 | starting                     | show processlist                                                                                     |
| 65 | dba  | xx:11164   | NULL | Sleep   |  121 |                              | NULL                                                                                                 |
| 83 | dba  | new master:49022 | NULL | Query   |  176 | Waiting for global read lock | BINLOG '
GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAEC
GpAKWh5YUy1LJwAAAI0HAAAAAG |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)


会卡在这一步
Tue Nov 14 15:16:23 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
--------------Waiting for global read lock-----------------


那么后面的步骤都会卡住

* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave ， Executed CHANGE MASTER to new_master

*Phase 5: New master cleanup phase..
Resetting slave info on the new master.

3.5 binlog伪造

Tue Nov 14 17:23:10 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000003 --start_pos=4042 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /var/log/masterha/mha_test if not exists..    ok.
 Concat binary/relay logs from host_1.000003 pos 4042 to host_1.000050 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 154.. ok.
  Dumping effective binlog data from /data/mysql.bin/host_1.000003 position 4042 to tail(4065).. ok.
Failed to save binary log: Target file /data/mysql.bin/host_1.000004 not found!
 at /usr/bin/save_binary_logs line 176
Tue Nov 14 17:23:10 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln760] Failed to save binary log events from the orig master. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

测试：

有flush + 无binlog.xx + 1伪造  =  save binlog --成功
有flush + 有binlog.xx + 1伪造  =  save binlog --失败
无flush + 无binlog.xx + 1伪造  =  save binlog --成功
无flush + 有binlog.xx + 1伪造  =  save binlog --失败

说明，如果有伪造的binlog，且后缀比之前的大，那么就会告警

解析：

/usr/bin/save_binary_logs ==》
        generate_diff_binary_log =>
            /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm ==》
                concat_all_binlogs_from():: =>
                    start_num , end_num(如果出现伪造的binlog，则会报错)

四、总结

MHA + NON-GTID 模式，重点配置和用法如下：

1. command
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=$dead_master_ip  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error




2. tgw 清理

dead master 如果还可以起来，那么必须在上面执行： /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

原因可参看：http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节

五、流程简介

* Phase 1: Configuration Check Phase..
HealthCheck: SSH N台DB是否reachable
Binlog server： N台DB 是否reachable
GTID failover mode = ？
Dead Servers is ？
Primary candidate for the new Master (candidate_master is set) ？

* Phase 2: Dead Master Shutdown Phase..
Executing master IP deactivation script：
TGW-vip delete操作
shutdown_script： ？

* Phase 3: Master Recovery Phase..
* Phase 3.1: Getting Latest Slaves Phase..
Latest slaves ，file position ？
Oldest slaves ， file position ？
* Phase 3.2: Saving Dead Master's Binlog Phase..
Executing command on the dead master : save_binary_logs --command=save
Concat binary/relay logs from latest_slave file_pos to master's binlog EOF
scp from root@master_ip: binlog to local: xx.binlog --将master最新的，latest缺失的binlog传递到manager机器

* Phase 3.3: Determining New Master Phase..
选择哪个slave为new master

* Phase 3.3: New Master Diff Log Generation Phase..
Need to get diffs from the latest slave up to: xx (using the latest slave's relay logs) --生成new master和latest slave之间的diff日志
scp latest_ip:xx to new_master:xx --将new master和latest slave之间的diff日志 传送给 new master
scp from local: xx to new_master: xx --将latest slave 和 dead master之间的diff日志 传送给 new master

* Phase 3.4: Master Log Apply Phase..
new master 将之前的两段diff日志合并，然后开始apply
All other slaves should start replication from here --开始生成change master语句
Executing master IP activate script: TGW-vip 激活操作，并且设置readonly=0

* Phase 4: Slaves Recovery Phase.. (并行操作)
* Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. --开始生成其余slave 和 latest 之间的diff日志
* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave ， Executed CHANGE MASTER to new_master

* Phase 5: New master cleanup phase..
Resetting slave info on the new master.

注：本文内容来自互联网，旨在为开发者提供分享、交流的平台。如有涉及文章版权等事宜，请你联系站长进行处理。

[聚合文章] MHA failover NON-GTID 专题