MHA online master switching is a mode that MHA provides in addition to automatic monitoring and failover. It is mostly used for planned work such as hardware upgrades or MySQL database migration. It offers a fast switch with graceful blocking of writes and does not require shutting down the original server. The whole switch takes roughly 0.5-2 seconds, which greatly reduces downtime. An online master switch starts only when all of the following conditions are met:
1. IO threads on all slaves are running.
2. SQL threads on all slaves are running.
3. Seconds_Behind_Master on all slaves is less than or equal to --running_updates_limit seconds.
4. On the master, no update query in the SHOW PROCESSLIST output takes more than --running_updates_limit seconds.
These restrictions exist for safety and so that the switch to the new master completes as quickly as possible.
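Condition 4 can be approximated by hand before starting a switch. The following is a minimal sketch, assuming the tab-separated batch output of `mysql -e 'SHOW PROCESSLIST'` (columns Id, User, Host, db, Command, Time, State, Info) is piped in. The function name `long_running_updates` is hypothetical, and the filter (Command is Query, statement does not start with SELECT or SHOW) is a simplification, not necessarily the exact rule MHA applies.

```shell
# Print processlist rows that look like long-running updates.
# Usage: mysql -e 'SHOW PROCESSLIST' | long_running_updates 10
# Assumes tab-separated batch output with a header row.
long_running_updates() {
    awk -F '\t' -v limit="$1" '
        NR == 1 { next }                          # skip the header row
        $5 != "Query" { next }                    # ignore Sleep, Binlog Dump, ...
        toupper($8) ~ /^(SELECT|SHOW)/ { next }   # skip read-only statements
        $6 + 0 > limit + 0 { print $1 "\t" $6 "\t" $8 }
    '
}
```

An empty result suggests that no update statement has been running longer than the limit; each printed row shows the thread Id, its running time in seconds, and the statement.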
1. Check whether masterha_manager is currently running (stopping it is recommended)
[root@DBproxy app2]# masterha_check_status --conf=/data/masterha/app1/app1.cnf
app1 (pid:6769) is running(0:PING_OK), master:192.168.0.50
[root@DBproxy app2]#
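The status line above can be turned into a yes/no answer for scripting. This is a minimal sketch that relies only on the `is running(0:PING_OK)` wording shown in that output; the function names `manager_running` and `reported_master` are hypothetical helpers, not part of MHA.

```shell
# Succeed (exit status 0) when the status text on stdin reports a
# running manager, using the "is running(0:PING_OK)" wording shown above.
manager_running() {
    grep -F -q 'is running(0:PING_OK)'
}

# Print the master address reported after "master:" in the status line.
reported_master() {
    sed -n 's/.*is running(0:PING_OK), master:\([0-9.]*\).*/\1/p'
}
```

For example, `masterha_check_status --conf=/data/masterha/app1/app1.cnf | manager_running && echo "stop the manager first"` would print the reminder only while the manager is up.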
2. Check the IO thread, SQL thread, and Seconds_Behind_Master on the slave
[mysql@MyDB02 masterha]$ mysql -uroot -p123456 -h192.168.0.60 -e 'show slave status\G'|grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"
Warning: Using a password on the command line interface can be insecure.
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
[mysql@MyDB02 masterha]$
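The three slave-side conditions can also be checked mechanically rather than by eye. The sketch below assumes the `show slave status\G` output is piped in; `slave_ready` is a hypothetical name, and its argument plays the role of --running_updates_limit.

```shell
# Read `SHOW SLAVE STATUS\G` output on stdin and succeed only if both
# replication threads are running and the lag is within the limit.
# Usage: mysql ... -e 'show slave status\G' | slave_ready 10
slave_ready() {
    awk -v limit="$1" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            # A NULL or missing lag is treated as not ready.
            ok = (io == "Yes" && sql == "Yes" && lag ~ /^[0-9]+$/ && lag + 0 <= limit + 0)
            exit(ok ? 0 : 1)
        }
    '
}
```

The field names are compared exactly, so the extra `Slave_SQL_Running_State` line that the grep above also matches does not interfere.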
3. Perform the online switch
[root@DBproxy masterha]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 09:11:00 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 09:11:00 2016 - [info] Starting online master switch..
Sat Jul 16 09:11:00 2016 - [info]
Sat Jul 16 09:11:00 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 09:11:00 2016 - [info]
Sat Jul 16 09:11:00 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 09:11:00 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:11:00 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:11:00 2016 - [info] GTID failover mode = 0
Sat Jul 16 09:11:00 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:11:00 2016 - [info] Alive Slaves:
Sat Jul 16 09:11:00 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 09:11:00 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:11:00 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 09:11:00 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 09:11:00 2016 - [info] ok.
Sat Jul 16 09:11:00 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 09:11:00 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln142] Getting advisory lock failed on the current master. MHA Monitor runs on the current master. Stop MHA Manager/Monitor and try again.
Sat Jul 16 09:11:00 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
[root@DBproxy masterha]#
Stop MHA and then test again:
[root@DBproxy masterha]# masterha_stop --conf=/data/masterha/app1/app1.cnf
Stopped app1 successfully.
[2]- Exit 1 nohup masterha_manager --conf=/data/masterha/app1/app1.cnf 2>&1 (wd: /data/masterha/app2)
(wd now: /data/masterha)
[root@DBproxy masterha]#
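The advisory-lock failure above can be avoided by stopping the manager before the switch is attempted. A minimal sketch of that ordering follows; `safe_online_switch` is a hypothetical wrapper, it uses only the commands and flags that appear in this walkthrough, and it detects a running manager by the `is running(0:PING_OK)` text from step 1.

```shell
# Stop the MHA manager if it is running, then start the online switch.
# $1: MHA config file, $2: new master host.
# The switch flags mirror the command used in this walkthrough.
safe_online_switch() {
    if masterha_check_status --conf="$1" | grep -F -q 'is running(0:PING_OK)'; then
        # The manager holds the advisory lock, so stop it first.
        masterha_stop --conf="$1" || return 1
    fi
    masterha_master_switch --conf="$1" --master_state=alive \
        --new_master_host="$2" --orig_master_is_new_slave \
        --running_updates_limit=10000 --interactive=0
}
```

Usage would look like `safe_online_switch /data/masterha/app1/app1.cnf 192.168.0.60`. The manager is not restarted afterwards; that step is left to the operator.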
4. Run the online switch again
[root@DBproxy masterha]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 09:15:03 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 09:15:03 2016 - [info] Starting online master switch..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 09:15:03 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:15:03 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 09:15:03 2016 - [info] GTID failover mode = 0
Sat Jul 16 09:15:03 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Alive Slaves:
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 09:15:03 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 09:15:03 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 09:15:03 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.60 can be new master.
Sat Jul 16 09:15:03 2016 - [info]
From:
192.168.0.50(192.168.0.50:3306) (current master)
+--192.168.0.60(192.168.0.60:3306)
To:
192.168.0.60(192.168.0.60:3306) (new master)
+--192.168.0.50(192.168.0.50:3306)
Sat Jul 16 09:15:03 2016 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.50(192.168.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Jul 16 09:15:03 2016 - [info] 192.168.0.50(192.168.0.50:3306): Resetting slave pointing to the dummy host.
Sat Jul 16 09:15:03 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Phase 2: Rejecting updates Phase..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master.
Sat Jul 16 09:15:03 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Jul 16 09:15:03 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Orig master binlog:pos is mysql-bin.000009:40355591.
Sat Jul 16 09:15:03 2016 - [info] Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 09:15:03 2016 - [info] master_pos_wait(mysql-bin.000009:40355591) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events.
Sat Jul 16 09:15:03 2016 - [info] done.
Sat Jul 16 09:15:03 2016 - [info] Getting new master's binlog name and position..
Sat Jul 16 09:15:03 2016 - [info] mysql-bin.000006:120
Sat Jul 16 09:15:03 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000006', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] * Switching slaves in parallel..
Sat Jul 16 09:15:03 2016 - [info]
Sat Jul 16 09:15:03 2016 - [info] Unlocking all tables on the orig master:
Sat Jul 16 09:15:03 2016 - [info] Executing UNLOCK TABLES..
Sat Jul 16 09:15:03 2016 - [info] ok.
Sat Jul 16 09:15:03 2016 - [info] Starting orig master as a new slave..
Sat Jul 16 09:15:03 2016 - [info] Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 09:15:03 2016 - [info] Executed CHANGE MASTER.
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln784] Slave could not be started on 192.168.0.50(192.168.0.50:3306)! Check slave status.
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln862] Starting slave IO/SQL thread on 192.168.0.50(192.168.0.50:3306) failed!
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573] Failed!
Sat Jul 16 09:15:14 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching master to 192.168.0.60(192.168.0.60:3306) done, but switching slaves partially failed.
[root@DBproxy masterha]#
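When the switch only partially succeeds, the CHANGE MASTER statement that MHA logged is useful for inspecting or repairing the old master by hand. The sketch below pulls it out of a saved log; `suggested_change_master` is a hypothetical helper, and the password in the logged statement is masked as 'xxx', so it would have to be replaced before use.

```shell
# Print the CHANGE MASTER statement that MHA logged for the new master.
# Reads a saved masterha_master_switch log on stdin.
suggested_change_master() {
    sed -n 's/.*Statement should be: \(CHANGE MASTER TO .*;\).*/\1/p'
}
```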
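Judging from the logs on the master and the slave themselves, the failure was probably caused by missing IP-to-hostname mappings on the two machines. Fix /etc/hosts.
/etc/hosts on the master (before the change):
127.0.0.1 MyDB01
/etc/hosts on the slave (before the change):
127.0.0.1 MyDB02
/etc/hosts on both machines after the change: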
[root@MyDB02 ~]# more /etc/hosts
192.168.0.60 MyDB02
192.168.0.50 MyDB01
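A quick way to spot the loopback mapping described above is to scan the hosts file for it. This is a minimal sketch; `maps_to_loopback` is a hypothetical helper, and it only inspects a hosts-format file, not DNS or other name services.

```shell
# Succeed if the given hostname is mapped to a loopback address
# (127.x.x.x or ::1) in a hosts-format file.
# $1: hostname, $2: hosts file (defaults to /etc/hosts)
maps_to_loopback() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }                   # skip comment lines
        $1 ~ /^127\./ || $1 == "::1" {
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}
```

Running `maps_to_loopback MyDB01 && echo "MyDB01 resolves to loopback"` on each server before the switch would flag the original configuration.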
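Because the previous operation did not fully succeed, the two machines were left in a dual-master setup. After switching back manually, the topology was restored to the original one master and one slave. Run a check before the online switch: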
[root@DBproxy app1]# masterha_check_repl --conf=/data/masterha/app1/app1.cnf
Sat Jul 16 10:24:49 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 10:24:49 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:24:49 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:24:49 2016 - [info] MHA::MasterMonitor version 0.56.
Sat Jul 16 10:24:49 2016 - [info] GTID failover mode = 0
Sat Jul 16 10:24:49 2016 - [info] Dead Servers:
Sat Jul 16 10:24:49 2016 - [info] Alive Servers:
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.60(192.168.0.60:3306)
Sat Jul 16 10:24:49 2016 - [info] Alive Slaves:
Sat Jul 16 10:24:49 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 10:24:49 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 10:24:49 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:24:49 2016 - [info] Checking slave configurations..
Sat Jul 16 10:24:49 2016 - [info] read_only=1 is not set on slave 192.168.0.60(192.168.0.60:3306).
Sat Jul 16 10:24:49 2016 - [info] Checking replication filtering settings..
Sat Jul 16 10:24:49 2016 - [info] binlog_do_db= , binlog_ignore_db=
Sat Jul 16 10:24:49 2016 - [info] Replication filtering check ok.
Sat Jul 16 10:24:49 2016 - [info] GTID (with auto-pos) is not supported
Sat Jul 16 10:24:49 2016 - [info] Starting SSH connection tests..
Sat Jul 16 10:24:50 2016 - [info] All SSH connection tests passed successfully.
Sat Jul 16 10:24:50 2016 - [info] Checking MHA Node version..
Sat Jul 16 10:24:51 2016 - [info] Version check ok.
Sat Jul 16 10:24:51 2016 - [info] Checking SSH publickey authentication settings on the current master..
Sat Jul 16 10:24:51 2016 - [info] HealthCheck: SSH to 192.168.0.50 is reachable.
Sat Jul 16 10:24:51 2016 - [info] Master MHA Node version is 0.56.
Sat Jul 16 10:24:51 2016 - [info] Checking recovery script configurations on 192.168.0.50(192.168.0.50:3306)..
Sat Jul 16 10:24:51 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/3306/binlog --output_file=/data/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000010
Sat Jul 16 10:24:51 2016 - [info] Connecting to root@192.168.0.50(192.168.0.50:22)..
Creating /data/masterha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql/3306/binlog, up to mysql-bin.000010
Sat Jul 16 10:24:52 2016 - [info] Binlog setting check done.
Sat Jul 16 10:24:52 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sat Jul 16 10:24:52 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.0.60 --slave_ip=192.168.0.60 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.6.29-log --manager_version=0.56 --relay_log_info=/data/mysql/3306/data/relay-log.info --relay_dir=/data/mysql/3306/data/ --slave_pass=xxx
Sat Jul 16 10:24:52 2016 - [info] Connecting to root@192.168.0.60(192.168.0.60:22)..
Checking slave recovery environment settings..
Opening /data/mysql/3306/data/relay-log.info ... ok.
Relay log found at /data/mysql/3306/binlog, up to relay-bin.000002
Temporary relay log file is /data/mysql/3306/binlog/relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sat Jul 16 10:24:53 2016 - [info] Slaves settings check done.
Sat Jul 16 10:24:53 2016 - [info]
192.168.0.50(192.168.0.50:3306) (current master)
+--192.168.0.60(192.168.0.60:3306)
Sat Jul 16 10:24:53 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 10:24:53 2016 - [info] ok.
Sat Jul 16 10:24:53 2016 - [warning] master_ip_failover_script is not defined.
Sat Jul 16 10:24:53 2016 - [warning] shutdown_script is not defined.
Sat Jul 16 10:24:53 2016 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
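The verdict line and the warnings in this output are the parts worth checking before going on. A minimal sketch, assuming the masterha_check_repl output is piped in or saved to a file; `repl_health_ok` and `repl_warnings` are hypothetical helpers that match only the wording shown above.

```shell
# Succeed only if the masterha_check_repl output on stdin contains
# the "MySQL Replication Health is OK." verdict line shown above.
repl_health_ok() {
    grep -F -x -q 'MySQL Replication Health is OK.'
}

# List [warning] lines so that undefined scripts or settings are not missed.
repl_warnings() {
    grep -F '[warning]'
}
```

In the run above, `repl_warnings` would surface the undefined master_ip_failover_script and shutdown_script along with the missing global configuration file.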
5. Perform the switch
[root@DBproxy app1]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0
Sat Jul 16 10:26:59 2016 - [info] MHA::MasterRotate version 0.56.
Sat Jul 16 10:26:59 2016 - [info] Starting online master switch..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [info] * Phase 1: Configuration Check Phase..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jul 16 10:26:59 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:26:59 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf..
Sat Jul 16 10:26:59 2016 - [info] GTID failover mode = 0
Sat Jul 16 10:26:59 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Alive Slaves:
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled
Sat Jul 16 10:26:59 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] Checking MHA is not monitoring or doing failover..
Sat Jul 16 10:26:59 2016 - [info] Checking replication health on 192.168.0.60..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60 can be new master.
Sat Jul 16 10:26:59 2016 - [info]
From:
192.168.0.50(192.168.0.50:3306) (current master)
+--192.168.0.60(192.168.0.60:3306)
To:
192.168.0.60(192.168.0.60:3306) (new master)
+--192.168.0.50(192.168.0.50:3306)
Sat Jul 16 10:26:59 2016 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): Resetting slave pointing to the dummy host.
Sat Jul 16 10:26:59 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [info] * Phase 2: Rejecting updates Phase..
Sat Jul 16 10:26:59 2016 - [info]
Sat Jul 16 10:26:59 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master.
Sat Jul 16 10:26:59 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Jul 16 10:26:59 2016 - [info] ok.
Sat Jul 16 10:26:59 2016 - [info] Orig master binlog:pos is mysql-bin.000010:120.
Sat Jul 16 10:26:59 2016 - [info] Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 10:27:00 2016 - [info] master_pos_wait(mysql-bin.000010:120) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events.
Sat Jul 16 10:27:00 2016 - [info] done.
Sat Jul 16 10:27:00 2016 - [info] Getting new master's binlog name and position..
Sat Jul 16 10:27:00 2016 - [info] mysql-bin.000008:239
Sat Jul 16 10:27:00 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000008', MASTER_LOG_POS=239, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] * Switching slaves in parallel..
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] Unlocking all tables on the orig master:
Sat Jul 16 10:27:00 2016 - [info] Executing UNLOCK TABLES..
Sat Jul 16 10:27:00 2016 - [info] ok.
Sat Jul 16 10:27:00 2016 - [info] Starting orig master as a new slave..
Sat Jul 16 10:27:00 2016 - [info] Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306)..
Sat Jul 16 10:27:00 2016 - [info] Executed CHANGE MASTER.
Sat Jul 16 10:27:00 2016 - [info] Slave started.
Sat Jul 16 10:27:00 2016 - [info] All new slave servers switched successfully.
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] * Phase 5: New master cleanup phase..
Sat Jul 16 10:27:00 2016 - [info]
Sat Jul 16 10:27:00 2016 - [info] 192.168.0.60: Resetting slave info succeeded.
Sat Jul 16 10:27:00 2016 - [info] Switching master to 192.168.0.60(192.168.0.60:3306) completed successfully.
[root@DBproxy app1]#
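The three runs in this walkthrough end in three different ways, and the final log lines are enough to tell them apart. The sketch below classifies a saved switch log; `switch_outcome` is a hypothetical helper, and it recognizes only the two closing messages that appear in the logs above, treating anything else as a failure.

```shell
# Classify a masterha_master_switch log read from stdin.
# Prints one of: success, partial, failed.
switch_outcome() {
    awk '
        /Switching master to .* completed successfully\./ { result = "success" }
        /done, but switching slaves partially failed\./   { result = "partial" }
        END { print (result ? result : "failed") }
    '
}
```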