在线主库切换(Online master switch)条件
1、所有节点正常运行,无论时原主还是新主或者其他从库
if ( $#dead_servers >= 0 ) { $log->error( "Switching master should not be started if one or more servers is down." ); $log->info("Dead Servers:"); $_server_manager->print_dead_servers(); croak; }
2、主库正常,能获取到相关主库信息如Server-ID和BINLOG位点信息。
$orig_master = $_server_manager->get_current_alive_master(); if ( !$orig_master ) { $log->error( "Failed to get current master. Maybe replication is misconfigured or configuration file is broken?" ); croak; }
3、MHA Manager/Monitor处于关闭状态
$log->info("Checking MHA is not monitoring or doing failover.."); if ( $orig_master->get_monitor_advisory_lock(0) ) { $log->error( "Getting advisory lock failed on the current master. MHA Monitor runs on the current master. Stop MHA Manager/Monitor and try again." ); croak; }
4、主库和从库上没有超大事务(默认参数running_updates_limit=1)
my @threads = $orig_master->get_running_update_threads( $g_running_updates_limit + 1 ); if ( $#threads >= 0 ) { $log->error( sprintf( "We should not start online master switch when one of connections are running long updates on the current master(%s). Currently %d update thread(s) are running.", $orig_master->get_hostinfo(), $#threads + 1 ) ); MHA::DBHelper::print_threads_util( @threads, 10 ); croak; } my @threads = $new_master->get_running_threads($g_running_seconds_limit); if ( $#threads >= 0 ) { $log->error( sprintf( "We should not start online master switch when one of connections are running long queries on the new master(%s). Currently %d thread(s) are running.", $new_master->get_hostinfo(), $#threads + 1 ) ); MHA::DBHelper::print_threads_util( @threads, 10 ); croak; }
在线主库切换(Online master switch)步骤
1、配置检测(Configuration Check Phase)
1、检查主从关系和存活服务器 2、主库执行FLUSH NO_WRITE_TO_BINLOG TABLES关闭打开的表 3、检查主从复制是否正常 4、挑选新主库,并检查新主是否满足条件 5、检查当前主库的复制过滤规则,将当前主库设置为dummy slave。
2、禁写当前主库(Rejecting updates Phase)
1、尝试当前主库上调用master_ip_online_change_script来进行操作,建议在该脚本中对主库禁写和停用VIP 2、使用FLUSH TABLES WITH READ LOCK来禁止当前主库上所有写操作,获取当前主库上最新位点信息
3、启用新主库(switch_master)
1、等待新主库复制同步,获取新主库上最新位点信息 2、尝试在新主库上调用master_ip_online_change_script来进行操作,建议在该脚本中对从库开启写权限和启用VIP 3、新主库上关闭READ_ONLY选项
4、并行切换所有从库(Switching slaves in parallel)
1、根据步骤2获取到的原主库最后位点,等待从库应用完所有BINLOG 2、根据步骤3获取到的新主库最初位点,重置所有从库。
5、重置原主库(Starting orig master as a new slave)
1、对原主库执行(UNLOCK TABLES)释放锁 2、按照步骤3获取到的新主库最初位点,重置原主库为新从库。
6、新主库复制信息清理(New master cleanup phase)
1、调用STOP SLAVE命令停止复制 2、对于5.5版本调用CHANGE MASTER TO MASTER_HOST=''去除原复制信息 3、调用RESET SLAVE /*!50516 ALL */命令重置复制
在线主库切换的输出日志:
[root@DBproxy app1]# masterha_master_switch --conf=/data/masterha/app1/app1.cnf --master_state=alive --new_master_host=192.168.0.60 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0 Sat Jul 16 10:26:59 2016 - [info] MHA::MasterRotate version 0.56. Sat Jul 16 10:26:59 2016 - [info] Starting online master switch.. Sat Jul 16 10:26:59 2016 - [info] Sat Jul 16 10:26:59 2016 - [info] * Phase 1: Configuration Check Phase.. Sat Jul 16 10:26:59 2016 - [info] Sat Jul 16 10:26:59 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Sat Jul 16 10:26:59 2016 - [info] Reading application default configuration from /data/masterha/app1/app1.cnf.. Sat Jul 16 10:26:59 2016 - [info] Reading server configuration from /data/masterha/app1/app1.cnf.. Sat Jul 16 10:26:59 2016 - [info] GTID failover mode = 0 Sat Jul 16 10:26:59 2016 - [info] Current Alive Master: 192.168.0.50(192.168.0.50:3306) Sat Jul 16 10:26:59 2016 - [info] Alive Slaves: Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60(192.168.0.60:3306) Version=5.6.29-log (oldest major version between slaves) log-bin:enabled Sat Jul 16 10:26:59 2016 - [info] Replicating from 192.168.0.50(192.168.0.50:3306) Sat Jul 16 10:26:59 2016 - [info] Primary candidate for the new Master (candidate_master is set) Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Sat Jul 16 10:26:59 2016 - [info] ok. Sat Jul 16 10:26:59 2016 - [info] Checking MHA is not monitoring or doing failover.. Sat Jul 16 10:26:59 2016 - [info] Checking replication health on 192.168.0.60.. Sat Jul 16 10:26:59 2016 - [info] ok. Sat Jul 16 10:26:59 2016 - [info] 192.168.0.60 can be new master. Sat Jul 16 10:26:59 2016 - [info] From: 192.168.0.50(192.168.0.50:3306) (current master) +--192.168.0.60(192.168.0.60:3306) To: 192.168.0.60(192.168.0.60:3306) (new master) +--192.168.0.50(192.168.0.50:3306) Sat Jul 16 10:26:59 2016 - [info] Checking whether 192.168.0.60(192.168.0.60:3306) is ok for the new master.. Sat Jul 16 10:26:59 2016 - [info] ok. Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Sat Jul 16 10:26:59 2016 - [info] 192.168.0.50(192.168.0.50:3306): Resetting slave pointing to the dummy host. Sat Jul 16 10:26:59 2016 - [info] ** Phase 1: Configuration Check Phase completed. Sat Jul 16 10:26:59 2016 - [info] Sat Jul 16 10:26:59 2016 - [info] * Phase 2: Rejecting updates Phase.. Sat Jul 16 10:26:59 2016 - [info] Sat Jul 16 10:26:59 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master. Sat Jul 16 10:26:59 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Sat Jul 16 10:26:59 2016 - [info] Executing FLUSH TABLES WITH READ LOCK.. Sat Jul 16 10:26:59 2016 - [info] ok. Sat Jul 16 10:26:59 2016 - [info] Orig master binlog:pos is mysql-bin.000010:120. Sat Jul 16 10:26:59 2016 - [info] Waiting to execute all relay logs on 192.168.0.60(192.168.0.60:3306).. Sat Jul 16 10:27:00 2016 - [info] master_pos_wait(mysql-bin.000010:120) completed on 192.168.0.60(192.168.0.60:3306). Executed 0 events. Sat Jul 16 10:27:00 2016 - [info] done. Sat Jul 16 10:27:00 2016 - [info] Getting new master's binlog name and position.. Sat Jul 16 10:27:00 2016 - [info] mysql-bin.000008:239 Sat Jul 16 10:27:00 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000008', MASTER_LOG_POS=239, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Sat Jul 16 10:27:00 2016 - [info] Sat Jul 16 10:27:00 2016 - [info] * Switching slaves in parallel.. Sat Jul 16 10:27:00 2016 - [info] Sat Jul 16 10:27:00 2016 - [info] Unlocking all tables on the orig master: Sat Jul 16 10:27:00 2016 - [info] Executing UNLOCK TABLES.. Sat Jul 16 10:27:00 2016 - [info] ok. Sat Jul 16 10:27:00 2016 - [info] Starting orig master as a new slave.. Sat Jul 16 10:27:00 2016 - [info] Resetting slave 192.168.0.50(192.168.0.50:3306) and starting replication from the new master 192.168.0.60(192.168.0.60:3306).. Sat Jul 16 10:27:00 2016 - [info] Executed CHANGE MASTER. Sat Jul 16 10:27:00 2016 - [info] Slave started. Sat Jul 16 10:27:00 2016 - [info] All new slave servers switched successfully. Sat Jul 16 10:27:00 2016 - [info] Sat Jul 16 10:27:00 2016 - [info] * Phase 5: New master cleanup phase.. Sat Jul 16 10:27:00 2016 - [info] Sat Jul 16 10:27:00 2016 - [info] 192.168.0.60: Resetting slave info succeeded. Sat Jul 16 10:27:00 2016 - [info] Switching master to 192.168.0.60(192.168.0.60:3306) completed successfully.
上面日志摘抄自:https://www.cnblogs.com/polestar/p/5737121.html
GTID模式和非GTID模式切换
“原主库切换为新从库”和“原从库切换为新从库”都调用/MHA/ServerManager.pm中的change_master_and_start_slave方法:
if ( $self->is_gtid_auto_pos_enabled() && !$target->{is_mariadb} ) { $dbhelper->change_master_gtid( $addr, $master->{port}, $master->{repl_user}, $master->{repl_password} ); } else { $dbhelper->change_master( $addr, $master->{port}, $master_log_file, $master_log_pos, $master->{repl_user}, $master->{repl_password} ); }
会根据每个从库的原模式来进行切换,如果原模式使用GTID复制,则切换后也使用GTID复制。
在判断复制同步时,使用/MHA/DBHelper.pm中的master_pos_wait方法:
use constant Master_Pos_Wait_NoTimeout_SQL => "SELECT MASTER_POS_WAIT(?,?,0) AS Result";
sub master_pos_wait($$$) { my $self = shift; my $binlog_file = shift; my $binlog_pos = shift; my $sth = $self->{dbh}->prepare(Master_Pos_Wait_NoTimeout_SQL); $sth->execute( $binlog_file, $binlog_pos ); my $href = $sth->fetchrow_hashref; return $href->{Result}; }
通过MySQL中MASTER_POS_WAIT函数来确保所有原主库上的日志被应用完成,在该过程中,没有使用Executed_Gtid_Set来对比差异。
函数master_pos_wait
语法 select master_pos_wait(file, pos[, timeout]). 参数file和pos对应要执行到主库BINLOG位点信息,函数逻辑是等待当前从库达到这个位置后返回, 返回期间执行的事务个数。 参数timeout可选,若缺省则无限等待,timeout<=0时与缺省的逻辑相同。若为正数,则等待这么多秒,超时函数返回-1. 其他返回值:若当前slave为启动或在等待期间被终止,返回NULL; 若指定的值已经在之前达到,返回0
参考资料: