一、SSh 检查日志分析
执行过程及对应的日志:
1、读取MHA manger 节点上的配置文件
2、根据配置文件,得到各个主机的信息,逐一进行SSH检查
3、每个主机都通过SSH连接除了自己以外的其他所有主机
4、当所有主机相互之间都能通过SSH免密登录,SSH检查就通过。
[root@A2 app1]# masterha_check_ssh --conf=/etc/masterha/app1.conf Sun Jun 23 18:36:45 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Sun Jun 23 18:36:45 2019 - [info] Reading application default configuration from /etc/masterha/app1.conf.. Sun Jun 23 18:36:45 2019 - [info] Reading server configuration from /etc/masterha/app1.conf.. Sun Jun 23 18:36:45 2019 - [info] Starting SSH connection tests.. Sun Jun 23 18:36:45 2019 - [debug] Sun Jun 23 18:36:45 2019 - [debug] Connecting via SSH from root@172.16.13.15(172.16.13.15:22) to root@172.16.15.3(172.16.15.3:22).. Sun Jun 23 18:36:45 2019 - [debug] ok. Sun Jun 23 18:36:45 2019 - [debug] Connecting via SSH from root@172.16.13.15(172.16.13.15:22) to root@172.16.15.2(172.16.15.2:22).. Sun Jun 23 18:36:45 2019 - [debug] ok. Sun Jun 23 18:36:46 2019 - [debug] Sun Jun 23 18:36:45 2019 - [debug] Connecting via SSH from root@172.16.15.3(172.16.15.3:22) to root@172.16.13.15(172.16.13.15:22).. Sun Jun 23 18:36:45 2019 - [debug] ok. Sun Jun 23 18:36:45 2019 - [debug] Connecting via SSH from root@172.16.15.3(172.16.15.3:22) to root@172.16.15.2(172.16.15.2:22).. Sun Jun 23 18:36:46 2019 - [debug] ok. Sun Jun 23 18:36:46 2019 - [debug] Sun Jun 23 18:36:46 2019 - [debug] Connecting via SSH from root@172.16.15.2(172.16.15.2:22) to root@172.16.13.15(172.16.13.15:22).. Sun Jun 23 18:36:46 2019 - [debug] ok. Sun Jun 23 18:36:46 2019 - [debug] Connecting via SSH from root@172.16.15.2(172.16.15.2:22) to root@172.16.15.3(172.16.15.3:22).. Sun Jun 23 18:36:46 2019 - [debug] ok. Sun Jun 23 18:36:46 2019 - [info] All SSH connection tests passed successfully.
二、主从复制检查日志分析
1、读取配置文件,根据配置文件,检查当前的所有主机状态,MHA Node版本,是否支持GTID主从复制,得到当前的主从复制架构
[root@A2 app1]# masterha_check_repl --conf=/etc/masterha/app1.conf Sun Jun 23 18:36:28 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Sun Jun 23 18:36:28 2019 - [info] Reading application default configuration from /etc/masterha/app1.conf.. Sun Jun 23 18:36:28 2019 - [info] Reading server configuration from /etc/masterha/app1.conf.. Sun Jun 23 18:36:28 2019 - [info] MHA::MasterMonitor version 0.56. Sun Jun 23 18:36:29 2019 - [info] GTID failover mode = 0 Sun Jun 23 18:36:29 2019 - [info] Dead Servers: Sun Jun 23 18:36:29 2019 - [info] Alive Servers: Sun Jun 23 18:36:29 2019 - [info] 172.16.13.15(172.16.13.15:3306) Sun Jun 23 18:36:29 2019 - [info] 172.16.15.3(172.16.15.3:3306) Sun Jun 23 18:36:29 2019 - [info] 172.16.15.2(172.16.15.2:3306) Sun Jun 23 18:36:29 2019 - [info] Alive Slaves: Sun Jun 23 18:36:29 2019 - [info] 172.16.13.15(172.16.13.15:3306) Version=5.7.16-log (oldest major version between slaves) log-bin:enabled Sun Jun 23 18:36:29 2019 - [info] Replicating from 172.16.15.3(172.16.15.3:3306) Sun Jun 23 18:36:29 2019 - [info] Primary candidate for the new Master (candidate_master is set) Sun Jun 23 18:36:29 2019 - [info] 172.16.15.2(172.16.15.2:3306) Version=5.7.16-log (oldest major version between slaves) log-bin:enabled Sun Jun 23 18:36:29 2019 - [info] Replicating from 172.16.15.3(172.16.15.3:3306) Sun Jun 23 18:36:29 2019 - [info] Current Alive Master: 172.16.15.3(172.16.15.3:3306)
2、检查从库配置,检查主从复制是否过滤,是否支持GTID复制
Sun Jun 23 18:36:29 2019 - [info] Checking slave configurations.. Sun Jun 23 18:36:29 2019 - [info] Checking replication filtering settings.. Sun Jun 23 18:36:29 2019 - [info] binlog_do_db= , binlog_ignore_db= Sun Jun 23 18:36:29 2019 - [info] Replication filtering check ok. Sun Jun 23 18:36:29 2019 - [info] GTID (with auto-pos) is not supported
3、进行SSH连接测试,MHA版本检查
Sun Jun 23 18:36:29 2019 - [info] Starting SSH connection tests.. Sun Jun 23 18:36:31 2019 - [info] All SSH connection tests passed successfully. Sun Jun 23 18:36:31 2019 - [info] Checking MHA Node version.. Sun Jun 23 18:36:31 2019 - [info] Version check ok.
4、检查主库上SSH 配置,测试恢复脚本(save_binary_logs)的可用性,对binlog设置进行检查
Sun Jun 23 18:36:31 2019 - [info] Checking SSH publickey authentication settings on the current master.. Sun Jun 23 18:36:31 2019 - [info] HealthCheck: SSH to 172.16.15.3 is reachable. Sun Jun 23 18:36:32 2019 - [info] Master MHA Node version is 0.56. Sun Jun 23 18:36:32 2019 - [info] Checking recovery script configurations on 172.16.15.3(172.16.15.3:3306).. Sun Jun 23 18:36:32 2019 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql_bin.000025 Sun Jun 23 18:36:32 2019 - [info] Connecting to root@172.16.15.3(172.16.15.3:22).. Creating /tmp if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /usr/local/mysql/data, up to mysql_bin.000025 Sun Jun 23 18:36:32 2019 - [info] Binlog setting check done.
5、检查从库SSH配置,测试应用差异日志脚本(apply_diff_relay_logs)的可用性,检查从库恢复环境和 relay log 的情况,检查MySQL的连接和权限,清理测试文件
Sun Jun 23 18:36:32 2019 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Sun Jun 23 18:36:32 2019 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.13.15 --slave_ip=172.16.13.15 --slave_port=3306 --workdir=/tmp --target_version=5.7.16-log --manager_version=0.56 --relay_log_info=/usr/local/mysql/data/relay-log.info --relay_dir=/usr/local/mysql/data/ --slave_pass=xxx Sun Jun 23 18:36:32 2019 - [info] Connecting to root@172.16.13.15(172.16.13.15:22).. Checking slave recovery environment settings.. Opening /usr/local/mysql/data/relay-log.info ... ok. Relay log found at /usr/local/mysql/data, up to mysqlserver-relay-bin.000002 Temporary relay log file is /usr/local/mysql/data/mysqlserver-relay-bin.000002 Testing mysql connection and privileges.. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Sun Jun 23 18:36:32 2019 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.15.2 --slave_ip=172.16.15.2 --slave_port=3306 --workdir=/tmp --target_version=5.7.16-log --manager_version=0.56 --relay_log_info=/usr/local/mysql/data/relay-log.info --relay_dir=/usr/local/mysql/data/ --slave_pass=xxx Sun Jun 23 18:36:32 2019 - [info] Connecting to root@172.16.15.2(172.16.15.2:22).. Checking slave recovery environment settings.. Opening /usr/local/mysql/data/relay-log.info ... ok. Relay log found at /usr/local/mysql/data, up to A2-relay-bin.000002 Temporary relay log file is /usr/local/mysql/data/A2-relay-bin.000002 Testing mysql connection and privileges.. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Sun Jun 23 18:36:32 2019 - [info] Slaves settings check done.
6、得到当前的主从结构,检查每个从库的复制状态。检查故障切换等脚本的状态,完成主从复制检查
Sun Jun 23 18:36:32 2019 - [info] 172.16.15.3(172.16.15.3:3306) (current master) +--172.16.13.15(172.16.13.15:3306) +--172.16.15.2(172.16.15.2:3306) Sun Jun 23 18:36:32 2019 - [info] Checking replication health on 172.16.13.15.. Sun Jun 23 18:36:32 2019 - [info] ok. Sun Jun 23 18:36:32 2019 - [info] Checking replication health on 172.16.15.2.. Sun Jun 23 18:36:32 2019 - [info] ok. Sun Jun 23 18:36:32 2019 - [info] Checking master_ip_failover_script status: Sun Jun 23 18:36:32 2019 - [info] /var/log/masterha/scripts/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.15.3 --orig_master_ip=172.16.15.3 --orig_master_port=3306 Checking the Status of the script.. OK Sun Jun 23 18:36:32 2019 - [info] OK. Sun Jun 23 18:36:32 2019 - [warning] shutdown_script is not defined. Sun Jun 23 18:36:32 2019 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK.
三、在线日志切换分析
1、读取配置文件,检查是否支持GTID复制,得到当前的主从结构
[root@A2 app1]# masterha_master_switch --conf=/etc/masterha/app1.conf --master_state=alive --new_master_host=172.16.13.15 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=1000 Sun Jun 23 19:29:06 2019 - [info] MHA::MasterRotate version 0.56. Sun Jun 23 19:29:06 2019 - [info] Starting online master switch.. Sun Jun 23 19:29:06 2019 - [info] Sun Jun 23 19:29:06 2019 - [info] * Phase 1: Configuration Check Phase.. Sun Jun 23 19:29:06 2019 - [info] Sun Jun 23 19:29:06 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Sun Jun 23 19:29:06 2019 - [info] Reading application default configuration from /etc/masterha/app1.conf.. Sun Jun 23 19:29:06 2019 - [info] Reading server configuration from /etc/masterha/app1.conf.. Sun Jun 23 19:29:08 2019 - [info] GTID failover mode = 0 Sun Jun 23 19:29:08 2019 - [info] Current Alive Master: 172.16.15.3(172.16.15.3:3306) Sun Jun 23 19:29:08 2019 - [info] Alive Slaves: Sun Jun 23 19:29:08 2019 - [info] 172.16.13.15(172.16.13.15:3306) Version=5.7.16-log (oldest major version between slaves) log-bin:enabled Sun Jun 23 19:29:08 2019 - [info] Replicating from 172.16.15.3(172.16.15.3:3306) Sun Jun 23 19:29:08 2019 - [info] Primary candidate for the new Master (candidate_master is set) Sun Jun 23 19:29:08 2019 - [info] 172.16.15.2(172.16.15.2:3306) Version=5.7.16-log (oldest major version between slaves) log-bin:enabled Sun Jun 23 19:29:08 2019 - [info] Replicating from 172.16.15.3(172.16.15.3:3306)
2、在主库上确认执行 FLUSH NO_WRITE_TO_BINLOG TABLES,关闭已经打开的表,不再记录binlog,进行主从复制检查,得到新主库的信息
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 172.16.15.3(172.16.15.3:3306)? (YES/no): yes Sun Jun 23 19:29:11 2019 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Sun Jun 23 19:29:11 2019 - [info] ok. Sun Jun 23 19:29:11 2019 - [info] Checking MHA is not monitoring or doing failover.. Sun Jun 23 19:29:11 2019 - [info] Checking replication health on 172.16.13.15.. Sun Jun 23 19:29:11 2019 - [info] ok. Sun Jun 23 19:29:11 2019 - [info] Checking replication health on 172.16.15.2.. Sun Jun 23 19:29:11 2019 - [info] ok. Sun Jun 23 19:29:11 2019 - [info] 172.16.13.15 can be new master. Sun Jun 23 19:29:11 2019 - [info] From: 172.16.15.3(172.16.15.3:3306) (current master) +--172.16.13.15(172.16.13.15:3306) +--172.16.15.2(172.16.15.2:3306) To: 172.16.13.15(172.16.13.15:3306) (new master) +--172.16.15.2(172.16.15.2:3306) +--172.16.15.3(172.16.15.3:3306)
3、开始从旧主切换到新主,检查新主能否成为主库。检查复制过滤,临时将旧主change master to到一个dummy地址
Starting master switch from 172.16.15.3(172.16.15.3:3306) to 172.16.13.15(172.16.13.15:3306)? (yes/NO): yes Sun Jun 23 19:29:13 2019 - [info] Checking whether 172.16.13.15(172.16.13.15:3306) is ok for the new master.. Sun Jun 23 19:29:13 2019 - [info] ok. Sun Jun 23 19:29:13 2019 - [info] 172.16.15.3(172.16.15.3:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Sun Jun 23 19:29:13 2019 - [info] 172.16.15.3(172.16.15.3:3306): Resetting slave pointing to the dummy host. Sun Jun 23 19:29:14 2019 - [info] ** Phase 1: Configuration Check Phase completed. Sun Jun 23 19:29:14 2019 - [info]
4、旧主库上执行master_ip_online_change,停止虚拟IP。在旧主上执行 FLUSH TABLES WITH READ LOCK..,实现全局读锁。
Sun Jun 23 19:29:14 2019 - [info] * Phase 2: Rejecting updates Phase.. Sun Jun 23 19:29:14 2019 - [info] Sun Jun 23 19:29:14 2019 - [info] Executing master ip online change script to disable write on the current master: Sun Jun 23 19:29:14 2019 - [info] /var/log/masterha/scripts/master_ip_online_change --command=stop --orig_master_host=172.16.15.3 --orig_master_ip=172.16.15.3 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='115433' --new_master_host=172.16.13.15 --new_master_ip=172.16.13.15 --new_master_port=3306 --new_master_user='root' --new_master_password='115433' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave *************************************************************** Disabling the VIP - 172.16.13.141/16 on old master: 172.16.15.3 Disabled the VIP successfully *************************************************************** Sun Jun 23 19:29:14 2019 - [info] ok. Sun Jun 23 19:29:14 2019 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Sun Jun 23 19:29:14 2019 - [info] Executing FLUSH TABLES WITH READ LOCK.. Sun Jun 23 19:29:15 2019 - [info] ok.
5、获取旧主的binlog位置,在新主上面应用中继日志。得到新主的binlog 位置,用于后期在其他从库上执行change master to,在新主库上面开启虚拟IP,set read_only =0
Sun Jun 23 19:29:15 2019 - [info] Orig master binlog:pos is mysql_bin.000025:150734601. Sun Jun 23 19:29:15 2019 - [info] Waiting to execute all relay logs on 172.16.13.15(172.16.13.15:3306).. Sun Jun 23 19:30:48 2019 - [info] master_pos_wait(mysql_bin.000025:150734601) completed on 172.16.13.15(172.16.13.15:3306). Executed 137 events. Sun Jun 23 19:30:48 2019 - [info] done. Sun Jun 23 19:30:48 2019 - [info] Getting new master's binlog name and position.. Sun Jun 23 19:30:48 2019 - [info] mysql_bin.000058:91914599 Sun Jun 23 19:30:48 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.13.15', MASTER_PORT=3306, MASTER_LOG_FILE='mysql_bin.000058', MASTER_LOG_POS=91914599, MASTER_USER='root', MASTER_PASSWORD='xxx'; Sun Jun 23 19:30:48 2019 - [info] Executing master ip online change script to allow write on the new master: Sun Jun 23 19:30:48 2019 - [info] /var/log/masterha/scripts/master_ip_online_change --command=start --orig_master_host=172.16.15.3 --orig_master_ip=172.16.15.3 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='115433' --new_master_host=172.16.13.15 --new_master_ip=172.16.13.15 --new_master_port=3306 --new_master_user='root' --new_master_password='115433' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave *************************************************************** Enabling the VIP - 172.16.13.141/16 on new master: 172.16.13.15 Enabled the VIP successfully *************************************************************** Sun Jun 23 19:30:48 2019 - [info] ok. Sun Jun 23 19:30:48 2019 - [info] Setting read_only=0 on 172.16.13.15(172.16.13.15:3306).. Sun Jun 23 19:30:48 2019 - [info] ok.
6、并行切换从库,应用中继日志到 旧主binlog位置,执行change master to;旧主也同时执行,而且执行 UNLOCK TABLES ,解锁。至此,从库切换完成
Sun Jun 23 19:30:48 2019 - [info] Sun Jun 23 19:30:48 2019 - [info] * Switching slaves in parallel.. Sun Jun 23 19:30:48 2019 - [info] Sun Jun 23 19:30:48 2019 - [info] -- Slave switch on host 172.16.15.2(172.16.15.2:3306) started, pid: 6048 Sun Jun 23 19:30:48 2019 - [info] Sun Jun 23 19:30:49 2019 - [info] Log messages from 172.16.15.2 ... Sun Jun 23 19:30:49 2019 - [info] Sun Jun 23 19:30:48 2019 - [info] Waiting to execute all relay logs on 172.16.15.2(172.16.15.2:3306).. Sun Jun 23 19:30:48 2019 - [info] master_pos_wait(mysql_bin.000025:150734601) completed on 172.16.15.2(172.16.15.2:3306). Executed 0 events. Sun Jun 23 19:30:48 2019 - [info] done. Sun Jun 23 19:30:48 2019 - [info] Resetting slave 172.16.15.2(172.16.15.2:3306) and starting replication from the new master 172.16.13.15(172.16.13.15:3306).. Sun Jun 23 19:30:48 2019 - [info] Executed CHANGE MASTER. Sun Jun 23 19:30:48 2019 - [info] Slave started. Sun Jun 23 19:30:49 2019 - [info] End of log messages from 172.16.15.2 ... Sun Jun 23 19:30:49 2019 - [info] Sun Jun 23 19:30:49 2019 - [info] -- Slave switch on host 172.16.15.2(172.16.15.2:3306) succeeded. Sun Jun 23 19:30:49 2019 - [info] Unlocking all tables on the orig master: Sun Jun 23 19:30:49 2019 - [info] Executing UNLOCK TABLES.. Sun Jun 23 19:30:49 2019 - [info] ok. Sun Jun 23 19:30:49 2019 - [info] Starting orig master as a new slave.. Sun Jun 23 19:30:49 2019 - [info] Resetting slave 172.16.15.3(172.16.15.3:3306) and starting replication from the new master 172.16.13.15(172.16.13.15:3306).. Sun Jun 23 19:30:49 2019 - [info] Executed CHANGE MASTER. Sun Jun 23 19:30:49 2019 - [info] Slave started. Sun Jun 23 19:30:49 2019 - [info] All new slave servers switched successfully. Sun Jun 23 19:30:49 2019 - [info]
7、对新主清理,更改从库信息
Sun Jun 23 19:30:49 2019 - [info] * Phase 5: New master cleanup phase.. Sun Jun 23 19:30:49 2019 - [info] Sun Jun 23 19:30:49 2019 - [info] 172.16.13.15: Resetting slave info succeeded. Sun Jun 23 19:30:49 2019 - [info] Switching master to 172.16.13.15(172.16.13.15:3306) completed successfully.