• mysql高可用研究(一) 主从+MHA架构


    最近在研究mysql的高可用架构,自己想总结下常用的高可用方案都有哪些、有哪些优缺点以及应用的场景?搞得是头昏脑涨,天昏地暗,看了诸多资料,每次都觉得公说公有理婆说婆有理。其实嘛,大家说的都有一定的道理,只不过适合自己的才是最正确的。今天就从比较常用的主从+MHA说起。

    学习一种新的架构还是软件,最好还是先从了解它的原理开始,这样才能在应用时扬长避短。

    一、【MHA原理】

    相较于其它HA软件,MHA的目的在于维持MySQL Replication中Master库的高可用性,其最大特点是可以修复多个Slave之间的差异日志,最终使所有Slave保持数据一致,然后从中选择一个充当新的Master,并将其它Slave指向它。

    MHA软件由两部分组成,Manager工具包和Node工具包,具体的说明如下。

    Manager工具包主要包括以下几个工具:

    masterha_check_ssh              检查MHA的SSH状况
    masterha_check_repl             检查MySQL复制状况
    masterha_manger                 启动MHA
    masterha_check_status           检测当前MHA运行状态
    masterha_master_monitor         检测master是否宕机
    masterha_master_switch          控制故障转移(自动或者手动)
    masterha_conf_host              添加或删除配置的server信息

    Node工具包(这些工具通常由MHA Manager的脚本触发,无需人为操作)主要包括以下几个工具:

    save_binary_logs                保存和复制master的二进制日志
    apply_diff_relay_logs           识别差异的中继日志事件并将其差异的事件应用于其他的slave
    filter_mysqlbinlog              去除不必要的ROLLBACK事件(MHA已不再使用这个工具)
    purge_relay_logs                清除中继日志(不会阻塞SQL线程)

    基本工作流程大致如下:

    (1) Manager定期监控Master,监控时间间隔由参数ping_interval决定,缺省为3秒钟一次;可利用其自身的监控功能,也可调用第三方软件来监控;MHA自身提供了两种监控方式:SELECT(执行SELECT 1)和CONNECT(创建连接/断开连接),

            主要由ping_type参数决定,默认是select方式。

    (2) 当监测到Master故障时,调用SSH脚本对所有Node执行一次检查,包括如下几个方面:

            ――MySQL实例是否可以连接;

            ――Master服务器是否可以SSH连通;

        ――检查SQL Thread的状态;

        ――检查哪些Server死掉了,哪些Server是活动的,以及活动的Slave实例;

        ――检查Slave实例的配置及复制过滤规则;

        ――最后退出监控脚本并返回代表特殊意义代码。

    (3) 开始Master故障切换,包括如下几个子阶段:

        ――Phase 1: Configuration Check Phase

            在这个阶段,若某个Slave实例的SQL Thread停止了,则会自动启动它;并再次确认活动的Servers及Slaves。

        ――Phase 2: Dead Master Shutdown Phase

            在这个阶段,首先调用master_ip_failover_script,若HA是基于VIP实现的,则关闭VIP,若是基于目录数据库实现的,则修改映射记录。然后调用shutdown_script脚本强制关闭主机,以避免服务重启时,发生脑裂。

        ――Phase 3: Master Recovery Phase

       又包括如下3个子阶段:

         Phase 3.1: Getting Latest Slaves Phase

        检查各个Slave,获取最近的和最旧的binary log file和position,并检查各个Slave成为Master的优先级,依赖于candidate_master、no_master、 [server_xxx]顺序、binary log差异量等因素。

         Phase 3.2: Saving Dead Master's Binlog Phase

        若dead master所在服务器依然可以通过SSH连通,则提取dead master的binary log,提取日志的起点就是上一步获取的最新的binary log file和position,直到最后一条事件日志,并在dead master本地的工作目录(由参数remote_workdir决定)中

               创建文件保存这些提取到的日志,然后将该文件拷贝到Manager服务器的工作 目录下(由参数manager_workdir决定)。若dead master系统就无法连接,也就不存在差异的binary log了。MHA还要对各个Slave节点进行健康检查,主要是SSH连通性。

        Phase 3.3: Determining New Master Phase

        接下来调用apply_diff_relay_logs命令恢复Slave的差异日志,这个差异日志指的是各个Slave之间的relay log。恢复完成后,所有的Slave数据是一致的,此时就可以根据优先级选择New Master了。

        Phase 3.4: New Master Diff Log Generation Phase

        这里是生成dead master和new master之间的差异日志,即将Phase 3.2保存的binary log拷贝到New Master的工作目录中(remote_workdir)。

        Phase 3.5: Master Log Apply Phase

        将上一步拷贝的差异日志恢复到New Master上,若发生错误,也可手动恢复。然后获取New Master的binlog name和position,以便其它Slave从这个新的binlog name和position开始复制。最后会开启New Master的写权限,即将read_only参数设置为0。

        ――Phase 4: Slaves Recovery Phase

        Phase 4.1: Starting Parallel Slave Diff Log Generation Phase

        生成Slave与New Slave之间的差异日志,并将该日志拷贝到各Slave的工作目录下,这部分日志dead master和new master之间差异的那部分日志,因为各个Slave在Phase 3.3阶段已经同步了。

        Phase 4.2: Starting Parallel Slave Log Apply Phase

        在各个Slave上应用这部分差异日志,然后通过CHANGE MASTER TO命令将这些Slave指向新的New Master,最后开始复制(start slave)。

        ――Phase 5: New master cleanup phase

        清理New Master其实就是重置slave info,即取消原来的Slave信息。至此整个Master故障切换过程完成。

    二、【实验部分】

    1、【环境说明】:默认三台机器上都已安装mysql5.6,且主从复制已经配置完成。
    角色 主机名 ip地址 功能

    主库 node1 192.168.245.129 (w/r) candidate_master node2 192.168.245.131 (r) 从库 node3 192.168.245.132 (r) vip: 192.168.245.100

    129为主库,对外提供读写服务,而131和132机器对外提供读服务,需要设置为只读状态,不建议将它写入配置文件,因为从库随时会切换为主库。如下:

    set global read_only=1

    2、配置三台机器之间的信任机制(省)

         目的:机器之间能够无需输入密码进行访问。

    3、安装mha软件

    •  rpm安装:每台机器都要安装node,manager节点建议安装到单独的一台机器上或者一个不用切换为主的从库上
    #安装可能需要的依赖包
    [root@node1 software]# yum install perl-DBD-MySQL  
    [root@node1 software]# yum install perl-Config-Tiny
    [root@node1 software]# yum install perl-Parallel-ForkManager*.rpm                
    [root@node1 software]# yum install perl-Mail-Sender*.rpm 
    [root@node1 software]# yum install perl-Mail-Sendmail*.rpm 
    [root@node1 software]# yum install perl-Log-Dispatch*.rpm 
    #安装mha,这里用rpm包安装,默认在/usr/bin
    [root@node1 software]# yum install mha4mysql-node-0.56-0.el6.noarch.rpm
    [root@node1 software]# yum install mha4mysql-manager-0.56-0.el6.noarch.rpm
    • 二进制包安装
    #node安装
    wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.53.tar.gz
    tar xf mha4mysql-node-0.53.tar.gz
    cd mha4mysql-node-0.53
    perl Makefile.PL
    make && make install
    
    #manager安装
    wget http://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.53.tar.gz
    tar xf mha4mysql-manager-0.53.tar.gz 
    cd mha4mysql-manager-0.53
    perl Makefile.PL
    make && make install

    4、手工配置主库服务器的vip并测试

    这里通过脚本手动创建vip,如下:

    [root@node1 scripts]# cat init_vip.sh 
    vip="192.168.245.100/32"
    /sbin/ip addr add $vip dev eth0

    【测试】

    绑定完成后,可以用以下命令查看绑定情况:

    [root@node2 etc]# ip addr show
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 00:0c:29:d7:0f:4a brd ff:ff:ff:ff:ff:ff
        inet 192.168.245.131/24 brd 192.168.245.255 scope global eth0
        inet 192.168.245.100/32 scope global eth0
        inet6 fe80::20c:29ff:fed7:f4a/64 scope link 
           valid_lft forever preferred_lft forever 

    到任意从库ssh 192.168.245.100 --看是否连上vip或者mysql -h 192.168.245.100 -udarren -pdarren    --是否连上vip数据库.       如果都能够连接上,表示vip设置成功了。

    5、配置mha及启动

    (1)创建mha监控用户(在主库执行,这样每个服务器都有这个用户了)

    mysql> grant all privileges on *.* to 'root'@'%' identified  by '123456';
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> flush  privileges;
    Query OK, 0 rows affected (0.01 sec)

    (2)关闭purge_relay_logs:

    • 在修改配置文件前,注意一个知识点,须将从库上的relay log自动清除设置为OFF,因为MySQL数据库主从复制在缺省情况下从库的relay logs会在SQL线程执行完毕后被自动删除,但是对于MHA场景下,对于某些滞后从库的恢复依赖于其他从库的relay log,因此采取禁用自动删除功能以及定期清理的办法。对于清理过多过大的relay log需要注意引起的复制延迟资源开销等。MHA可通过purge_relay_logs脚本及配合cronjob来完成此项任务。

            purge_relay_logs的主要功能:

                a、为relay日志创建硬链接(最小化批量删除大文件导致的性能问题)

                b、SET GLOBAL relay_log_purge=1; FLUSH LOGS; SET GLOBAL relay_log_purge=0;
                c、删除relay log(rm –f  /path/to/archive_dir/*)

    purge_relay_logs的用法及相关参数
    1 purge_relay_logs --help
       Usage:
           purge_relay_logs --user=root --password=rootpass --host=127.0.0.1
    
    2 参数描述
    --user             用户名,缺省为root
    --password         密码
    --port             端口号
    --host             主机名,缺省为127.0.0.1
    --workdir          指定创建relay log的硬链接的位置,默认是/var/tmp,成功执行脚本后,硬链接的中继日志文件被删除,由于系统不同分区创建硬链接文件会失败,故需要执行硬链接具体位置,建议指定为relay log相同的分区
    --disable_relay_log_purge 默认情况下,如果参数relay_log_purge=1,脚本不做任何处理,自动退出.设定该参数,脚本会将relay_log_purge设置为0,当清理relay log之后,最后将参数设置为OFF(0)
    3 定制清理relay log cronjob pureg_relay_logs脚本在不阻塞SQL线程的情况下自动清理relay log。对于不断产生的relay log直接将该脚本部署到crontab以实现按天或按小时定期清理。 $ crontab -l # purge relay logs at 5am 0 5 * * * app /usr/bin/purge_relay_logs --user=root --password=PASSWORD --disable_relay_log_purge >> /var/log/masterha/purge_relay_logs.log 2>&1
    • 在我们的两个从库上设置relaylog为OFF:
    (product)root@127.0.0.1 [(none)]> set global relay_log_purge=0;
    Query OK, 0 rows affected (0.00 sec)
    • 然后通过定时任务,每隔一天定时清除relaylog:
    #清除脚本
    #!/bin/bash
    user=root
    passwd=root
    port=3306
    log_dir='/data/masterha/log'
    work_dir='/data'
    purge='/usr/bin/purge_relay_logs'
    
    if [ ! -d $log_dir ]
    then
       mkdir $log_dir -p
    fi
    
    $purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
    
    #定时任务
    crontab -e
    #每天早上5点10分执行
    10 5 * * * sh /data/scripts/purge_relay_log.sh

    (3)修改配置文件:

    到manager节点的/etc下面新建masterha目录,并将mha需要的配置初始化文件拷贝到该目录下:

    [root@node3 ~]# cd /etc
    [root@node3 etc]# mkdir masterha
    #创建以下mha日志目录,没有则报错
    [root@node3 etc]#
    mkdir -p /var/log/masterha/app1
    [root@node3 mastermha]# ll
    total 32
    -rw-r--r--. 1 root root   503 Nov  9 01:26 app1.conf
    -rwxr-xr-x. 1 root root    55 Nov  9 01:26 drop_vip.sh
    -rwxr-xr-x. 1 root root    55 Nov  9 01:26 init_vip.sh
    -rw-r--r--. 1 root root   357 Nov  9 01:26 masterha_default.conf
    -rwxr-xr-x. 1 root root  3888 Nov  9 01:26 master_ip_failover
    -rwxr-xr-x. 1 root root 10298 Nov  9 01:26 master_ip_online_change

    然后修改vip的值:在masterha目录下执行grep "vip" *,将会列出所有文件中vip变量,然后一一修改为192.168.245.100。

    修改app1.conf文件:

    [server default]
    manager_log=/var/log/masterha/app1/app1.log
    manager_workdir=/var/log/masterha/app1
    master_ip_failover_script="/etc/masterha/master_ip_failover"
    master_ip_online_change_script="/etc/masterha/master_ip_online_change"
    password=root
    ping_interval=1
    remote_workdir=/var/log/masterha/app1
    repl_password=repl4slave
    repl_user=repl
    report_script="/etc/masterha/send_mail"
    shutdown_script=""
    ssh_user=root
    user=root

    [server1]
    candidate_master=1
    check_repl_delay=0
    hostname=192.168.245.129
    master_binlog_dir=/data/mysql/mysql_3306/logs

    [server3]
    hostname=192.168.245.132
    port=3306

    (4)检查mha环境并启动

    #检查MHA Manger到所有MHA Node的SSH连接状态:
    [root@node3 masterha]# /usr/bin/masterha_check_ssh --conf=/etc/masterha/app1.conf
    Mon Nov 16 01:24:21 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Mon Nov 16 01:24:21 2015 - [info] Reading application default configuration from /etc/masterha/app1.conf..
    Mon Nov 16 01:24:21 2015 - [info] Reading server configuration from /etc/masterha/app1.conf..
    Mon Nov 16 01:24:21 2015 - [info] Starting SSH connection tests..
    Mon Nov 16 01:24:24 2015 - [debug] 
    Mon Nov 16 01:24:21 2015 - [debug]  Connecting via SSH from root@192.168.245.129(192.168.245.129:22) to root@192.168.245.131(192.168.245.131:22)..
    Mon Nov 16 01:24:23 2015 - [debug]   ok.
    Mon Nov 16 01:24:23 2015 - [debug]  Connecting via SSH from root@192.168.245.129(192.168.245.129:22) to root@192.168.245.132(192.168.245.132:22)..
    Mon Nov 16 01:24:24 2015 - [debug]   ok.
    Mon Nov 16 01:24:25 2015 - [debug] 
    Mon Nov 16 01:24:22 2015 - [debug]  Connecting via SSH from root@192.168.245.131(192.168.245.131:22) to root@192.168.245.129(192.168.245.129:22)..
    Mon Nov 16 01:24:23 2015 - [debug]   ok.
    Mon Nov 16 01:24:23 2015 - [debug]  Connecting via SSH from root@192.168.245.131(192.168.245.131:22) to root@192.168.245.132(192.168.245.132:22)..
    Mon Nov 16 01:24:25 2015 - [debug]   ok.
    Mon Nov 16 01:24:25 2015 - [debug] 
    Mon Nov 16 01:24:22 2015 - [debug]  Connecting via SSH from root@192.168.245.132(192.168.245.132:22) to root@192.168.245.129(192.168.245.129:22)..
    Mon Nov 16 01:24:24 2015 - [debug]   ok.
    Mon Nov 16 01:24:24 2015 - [debug]  Connecting via SSH from root@192.168.245.132(192.168.245.132:22) to root@192.168.245.131(192.168.245.131:22)..
    Mon Nov 16 01:24:25 2015 - [debug]   ok.
    Mon Nov 16 01:24:25 2015 - [info] All SSH connection tests passed successfully.
    #检查主从复制环境
    [root@node3 masterha]# /usr/bin/masterha_check_repl --conf=/etc/masterha/app1.conf
    Mon Nov 16 01:37:08 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Mon Nov 16 01:37:08 2015 - [info] Reading application default configuration from /etc/masterha/app1.conf..
    Mon Nov 16 01:37:08 2015 - [info] Reading server configuration from /etc/masterha/app1.conf..
    Mon Nov 16 01:37:08 2015 - [info] MHA::MasterMonitor version 0.56.
    Mon Nov 16 01:37:09 2015 - [info] GTID failover mode = 0
    Mon Nov 16 01:37:09 2015 - [info] Dead Servers:
    Mon Nov 16 01:37:09 2015 - [info] Alive Servers:
    Mon Nov 16 01:37:09 2015 - [info]   192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 01:37:09 2015 - [info]   192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 01:37:09 2015 - [info]   192.168.245.132(192.168.245.132:3306)
    Mon Nov 16 01:37:09 2015 - [info] Alive Slaves:
    Mon Nov 16 01:37:09 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 01:37:09 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 01:37:09 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 01:37:09 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 01:37:09 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 01:37:09 2015 - [info] Current Alive Master: 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 01:37:09 2015 - [info] Checking slave configurations..
    Mon Nov 16 01:37:09 2015 - [info]  read_only=1 is not set on slave 192.168.245.131(192.168.245.131:3306).
    Mon Nov 16 01:37:09 2015 - [info]  read_only=1 is not set on slave 192.168.245.132(192.168.245.132:3306).
    Mon Nov 16 01:37:09 2015 - [info] Checking replication filtering settings..
    Mon Nov 16 01:37:09 2015 - [info]  binlog_do_db= , binlog_ignore_db= 
    Mon Nov 16 01:37:09 2015 - [info]  Replication filtering check ok.
    Mon Nov 16 01:37:09 2015 - [info] GTID (with auto-pos) is not supported
    Mon Nov 16 01:37:09 2015 - [info] Starting SSH connection tests..
    Mon Nov 16 01:37:12 2015 - [info] All SSH connection tests passed successfully.
    Mon Nov 16 01:37:12 2015 - [info] Checking MHA Node version..
    Mon Nov 16 01:37:13 2015 - [info]  Version check ok.
    Mon Nov 16 01:37:13 2015 - [info] Checking SSH publickey authentication settings on the current master..
    Mon Nov 16 01:37:13 2015 - [info] HealthCheck: SSH to 192.168.245.129 is reachable.
    Mon Nov 16 01:37:14 2015 - [info] Master MHA Node version is 0.56.
    Mon Nov 16 01:37:14 2015 - [info] Checking recovery script configurations on 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 01:37:14 2015 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/mysql_3306/logs --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000023 
    Mon Nov 16 01:37:14 2015 - [info]   Connecting to root@192.168.245.129(192.168.245.129:22).. 
      Creating /var/log/masterha/app1 if not exists..    ok.
      Checking output directory is accessible or not..
       ok.
      Binlog found at /data/mysql/mysql_3306/logs, up to mysql-bin.000023
    Mon Nov 16 01:37:15 2015 - [info] Binlog setting check done.
    Mon Nov 16 01:37:15 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
    Mon Nov 16 01:37:15 2015 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.245.131 --slave_ip=192.168.245.131 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.6.23-log --manager_version=0.56 --relay_log_info=/data/mysql/mysql_3306/data/relay-log.info  --relay_dir=/data/mysql/mysql_3306/data/  --slave_pass=xxx
    Mon Nov 16 01:37:15 2015 - [info]   Connecting to root@192.168.245.131(192.168.245.131:22).. 
      Checking slave recovery environment settings..
        Opening /data/mysql/mysql_3306/data/relay-log.info ... ok.
        Relay log found at /data/mysql/mysql_3306/data, up to relay-bin.000009
        Temporary relay log file is /data/mysql/mysql_3306/data/relay-bin.000009
        Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
     done.
        Testing mysqlbinlog output.. done.
        Cleaning up test file(s).. done.
    Mon Nov 16 01:37:15 2015 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.245.132 --slave_ip=192.168.245.132 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.6.21-log --manager_version=0.56 --relay_log_info=/data/mysql/data/relay-log.info  --relay_dir=/data/mysql/data/  --slave_pass=xxx
    Mon Nov 16 01:37:15 2015 - [info]   Connecting to root@192.168.245.132(192.168.245.132:22).. 
      Checking slave recovery environment settings..
        Opening /data/mysql/data/relay-log.info ... ok.
        Relay log found at /data/mysql/data, up to node3-relay-bin.000007
        Temporary relay log file is /data/mysql/data/node3-relay-bin.000007
        Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
     done.
        Testing mysqlbinlog output.. done.
        Cleaning up test file(s).. done.
    Mon Nov 16 01:37:16 2015 - [info] Slaves settings check done.
    Mon Nov 16 01:37:16 2015 - [info] 
    192.168.245.129(192.168.245.129:3306) (current master)
     +--192.168.245.131(192.168.245.131:3306)
     +--192.168.245.132(192.168.245.132:3306)
    
    Mon Nov 16 01:37:16 2015 - [info] Checking replication health on 192.168.245.131..
    Mon Nov 16 01:37:16 2015 - [info]  ok.
    Mon Nov 16 01:37:16 2015 - [info] Checking replication health on 192.168.245.132..
    Mon Nov 16 01:37:16 2015 - [info]  ok.
    Mon Nov 16 01:37:16 2015 - [warning] master_ip_failover_script is not defined.
    Mon Nov 16 01:37:16 2015 - [warning] shutdown_script is not defined.
    Mon Nov 16 01:37:16 2015 - [info] Got exit code 0 (Not master dead).
    
    MySQL Replication Health is OK.
    View Code

    如果遇到这个报错:

    Can't exec "mysqlbinlog": No such file or directory at /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm line 106.
    mysqlbinlog version command failed with rc 1:0, please verify PATH, LD_LIBRARY_PATH, and client options
     at /usr/bin/apply_diff_relay_logs line 493
    Mon Nov 16 01:32:36 2015 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln205] Slaves settings check failed!
    Mon Nov 16 01:32:36 2015 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln413] Slave configuration failed.
    Mon Nov 16 01:32:36 2015 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/bin/masterha_check_repl line 48
    Mon Nov 16 01:32:36 2015 - [error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
    Mon Nov 16 01:32:36 2015 - [info] Got exit code 1 (Not master dead).
    
    MySQL Replication Health is NOT OK!

    解决方法如下,添加软连接(所有节点)

    ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
    ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

    到此为止,都没问题了,开始启动manager节点进行监控,一般启动我们都使用nohup方式,但是当出现故障成功切换后,manager监控也会关闭,所以建议使用daemontools方式在后台运行。

    • nohup方式启动
    nohup /usr/bin/masterha_manager --conf=/etc/masterha/app1.conf --remove_dead_master_conf --ignore_last_failover &
    #关闭mha监控
    /usr/bin/masterha_stop --conf=/etc/masterha/app1.conf
    • daemontools方式启动:
    #下载daemontools工具
    wget http://cr.yp.to/daemontools/daemontools-0.76.tar.gz
    #安装
    cd /home/software/daemontools/admin/daemontools-0.76
    vim src/conf-cc #在第一行最后空一格后添加 -include /usr/include/errno.h 避免编译时报错
    package/install  #安装
    #在/etc/init/创建svscan.conf文件并添加如下内容:
    vim /etc/init/svscan.conf
    start on runlevel [345]
            respawn
            exec /command/svscanboot
    #在/etc/init目录下重新加载并启动
    [root@node3 init]# initctl reload-configuration
    [root@node3 init]# initctl start svscan
    svscan start/running, process 17002

    使用daemon启动mha监控:

    [root@node3 init]# mkdir -p /service/masterha_app
    [root@node3 ~]# cd /service/masterha_app/
    [root@node3 masterha_app]# pwd
    /service/masterha_app
    #在/service/masterha_app/创建run脚本并增加执行权限,内容如下
    [root@node3 masterha_app]# cat /service/masterha_app/run
    #!/bin/bash
     exec /usr/bin/masterha_manager --conf=/etc/masterha/app1.conf --remove_dead_master_conf --ignore_last_failover
    [root@node3 masterha_app]# chmod +x /service/masterha_app/run
    #停止监控
    svc
    -d /service/masterha_app
    #开启监控
    svc
    -u /service/masterha_app

    另起一个session,查看启动日志:

    
    
    [root@node3 app1]# tail -f app1.log 
    192.168.245.129(192.168.245.129:3306) (current master)
     +--192.168.245.131(192.168.245.131:3306)
     +--192.168.245.132(192.168.245.132:3306)
    
    Mon Nov 16 01:55:10 2015 - [warning] master_ip_failover_script is not defined.
    Mon Nov 16 01:55:10 2015 - [warning] shutdown_script is not defined.
    Mon Nov 16 01:55:10 2015 - [info] Set master ping interval 1 seconds.
    Mon Nov 16 01:55:10 2015 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
    Mon Nov 16 01:55:10 2015 - [info] Starting ping health check on 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 01:55:10 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

    当看到最后一句“Ping(SELECT) succeeded, waiting until MySQL doesn't respond..”表示mha已经启动起来了。当然,你或许在一台机子上希望监控多套master-salve复制,这非常容易,只要为第二套集群创建一个新的配置文件并启动manager

    # masterha_manager --conf=/etc/conf/masterha/app1.cnf  
    # masterha_manager --conf=/etc/conf/masterha/app2.cnf
    如果你在app1和app2上有一些共有的参数,可在全局配置文件中配置。

    启动参数介绍:

    --remove_dead_master_conf  该参数表示当发生主从切换后,老的主库的ip将会从配置文件中移除。如果故障机器修复好了,需要手工添加ip信息到配置文件中

    --manger_log                            日志存放位置

    --ignore_last_failover                 在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件, 默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如 果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为--ignore_last_failover。

    查看MHA Manager监控是否正常:

    [root@node3 app1]#  /usr/bin/masterha_check_status --conf=/etc/masterha/app1.conf
    app1 (pid:6795) is running(0:PING_OK), master:192.168.245.129

    6、模拟测试

    (1)自动Failover测试

    • 首先用sysbench工具模拟主库业务的写入。
    #下载sysbench
    http://dev.mysql.com/downloads/benchmarks.html
    #安装sysbench
    yum install libtool -y  #依赖包
    tar zxvf sysbench-0.4.12.7.tar.gz
    cd sysbench-0.4.12.7
    ./configure && make && make install

    在主库执行,创建sysbench测试表:

    (product)root@127.0.0.1 [(none)]> create database sbtest;
    Query OK, 1 row affected (0.00 sec)
    
    [root@node1 sysbench-0.4.12.7]# /usr/local/bin/sysbench --test=oltp --oltp-table-size=100000 --oltp-read-only=off --init-rng=on --num-threads=4 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/tmp/mysql_3306.sock --mysql-password=root --mysql-host=192.168.245.129 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex prepare
    sysbench 0.4.12.6:  multi-threaded system evaluation benchmark
    
    Creating table 'sbtest'...
    Creating 100000 records in table 'sbtest'...
    • 停止接管主库那个从库的io_thread线程:
    (product)root@127.0.0.1 [(none)]> stop slave io_thread;
    Query OK, 0 rows affected (0.01 sec)
    • 向主库插入数据:
    [root@node1 sysbench-0.4.12.7]# /usr/local/bin/sysbench --test=oltp --oltp-table-size=100000 --oltp-read-only=off --init-rng=on --num-threads=4 --max-requests=0 --oltp-dist-type=uniform --max-time=180 --mysql-user=root --mysql-socket=/tmp/mysql_3306.sock --mysql-password=root --mysql-host=192.168.245.129 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex run
    sysbench 0.4.12.6:  multi-threaded system evaluation benchmark
    
    Running the test with following options:
    Number of threads: 4
    Initializing random number generator from timer.
    
    Random number generator seed is 0 and will be ignored
    
    
    Doing OLTP test.
    Running mixed OLTP test
    Using Uniform distribution
    Using "BEGIN" for starting transactions
    Using auto_inc on the id column
    Using 1 test tables
    Threads started!
    Time limit exceeded, exiting...
    (last message repeated 3 times)
    Done.
    
    OLTP test statistics:
        queries performed:
            read:                            489090
            write:                           174675
            other:                           69870
            total:                           733635
        transactions:                        34935  (194.05 per sec.)
        deadlocks:                           0      (0.00 per sec.)
        read/write requests:                 663765 (3686.93 per sec.)
        other operations:                    69870  (388.10 per sec.)
    
    Test execution summary:
        total time:                          180.0317s
        total number of events:              34935
        total time taken by event execution: 719.7722
        per-request statistics:
             min:                                  3.47ms
             avg:                                 20.60ms
             max:                                444.43ms
             approx.  95 percentile:              28.17ms
    
    Threads fairness:
        events (avg/stddev):           8733.7500/260.92
        execution time (avg/stddev):   179.9430/0.01

    然后启动从库上io_thread:

    (product)root@127.0.0.1 [(none)]> start slave io_thread;
    Query OK, 0 rows affected, 1 warning (0.00 sec)
    • kill主库的mysqld:
    pkill -9 mysqld

    最后,查看mha日志,这时已经接管过来了!

    Mon Nov 16 02:07:11 2015 - [warning] master_ip_failover_script is not defined.
    Mon Nov 16 02:07:11 2015 - [warning] shutdown_script is not defined.
    Mon Nov 16 02:07:11 2015 - [info] Set master ping interval 1 seconds.
    Mon Nov 16 02:07:11 2015 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
    Mon Nov 16 02:07:11 2015 - [info] Starting ping health check on 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 02:07:11 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
    Mon Nov 16 03:02:12 2015 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
    Mon Nov 16 03:02:12 2015 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/mysql_3306/logs --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=mysql-bin
    Mon Nov 16 03:02:13 2015 - [info] HealthCheck: SSH to 192.168.245.129 is reachable.
    Mon Nov 16 03:02:13 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
    Mon Nov 16 03:02:13 2015 - [warning] Connection failed 2 time(s)..
    Mon Nov 16 03:02:14 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
    Mon Nov 16 03:02:14 2015 - [warning] Connection failed 3 time(s)..
    Mon Nov 16 03:02:15 2015 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
    Mon Nov 16 03:02:15 2015 - [warning] Connection failed 4 time(s)..
    Mon Nov 16 03:02:15 2015 - [warning] Master is not reachable from health checker!
    Mon Nov 16 03:02:15 2015 - [warning] Master 192.168.245.129(192.168.245.129:3306) is not reachable!
    Mon Nov 16 03:02:15 2015 - [warning] SSH is reachable.
    Mon Nov 16 03:02:15 2015 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.conf again, and trying to connect to all servers to check server status..
    Mon Nov 16 03:02:15 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Mon Nov 16 03:02:15 2015 - [info] Reading application default configuration from /etc/masterha/app1.conf..
    Mon Nov 16 03:02:15 2015 - [info] Reading server configuration from /etc/masterha/app1.conf..
    Mon Nov 16 03:02:15 2015 - [info] GTID failover mode = 0
    Mon Nov 16 03:02:15 2015 - [info] Dead Servers:
    Mon Nov 16 03:02:15 2015 - [info]   192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:15 2015 - [info] Alive Servers:
    Mon Nov 16 03:02:15 2015 - [info]   192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 03:02:15 2015 - [info]   192.168.245.132(192.168.245.132:3306)
    Mon Nov 16 03:02:15 2015 - [info] Alive Slaves:
    Mon Nov 16 03:02:15 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:15 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:15 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 03:02:15 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:15 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:15 2015 - [info] Checking slave configurations..
    Mon Nov 16 03:02:15 2015 - [info]  read_only=1 is not set on slave 192.168.245.131(192.168.245.131:3306).
    Mon Nov 16 03:02:15 2015 - [info]  read_only=1 is not set on slave 192.168.245.132(192.168.245.132:3306).
    Mon Nov 16 03:02:15 2015 - [info] Checking replication filtering settings..
    Mon Nov 16 03:02:15 2015 - [info]  Replication filtering check ok.
    Mon Nov 16 03:02:15 2015 - [info] Master is down!
    Mon Nov 16 03:02:15 2015 - [info] Terminating monitoring script.
    Mon Nov 16 03:02:15 2015 - [info] Got exit code 20 (Master dead).
    Mon Nov 16 03:02:15 2015 - [info] MHA::MasterFailover version 0.56.
    Mon Nov 16 03:02:15 2015 - [info] Starting master failover.
    Mon Nov 16 03:02:15 2015 - [info] 
    Mon Nov 16 03:02:15 2015 - [info] * Phase 1: Configuration Check Phase..
    Mon Nov 16 03:02:15 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] GTID failover mode = 0
    Mon Nov 16 03:02:16 2015 - [info] Dead Servers:
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info] Checking master reachability via MySQL(double check)...
    Mon Nov 16 03:02:16 2015 - [info]  ok.
    Mon Nov 16 03:02:16 2015 - [info] Alive Servers:
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.132(192.168.245.132:3306)
    Mon Nov 16 03:02:16 2015 - [info] Alive Slaves:
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info] Starting Non-GTID based failover.
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] ** Phase 1: Configuration Check Phase completed.
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] * Phase 2: Dead Master Shutdown Phase..
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] Forcing shutdown so that applications never connect to the current master..
    Mon Nov 16 03:02:16 2015 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address.
    Mon Nov 16 03:02:16 2015 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
    Mon Nov 16 03:02:16 2015 - [info] * Phase 2: Dead Master Shutdown Phase completed.
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] * Phase 3: Master Recovery Phase..
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] * Phase 3.1: Getting Latest Slaves Phase..
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] The latest binary log file/position on all slaves is mysql-bin.000023:35416929
    Mon Nov 16 03:02:16 2015 - [info] Latest slaves (Slaves that received relay log files to the latest):
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info] The oldest binary log file/position on all slaves is mysql-bin.000023:35416929
    Mon Nov 16 03:02:16 2015 - [info] Oldest slaves:
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 03:02:16 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:16 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:16 2015 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
    Mon Nov 16 03:02:16 2015 - [info] 
    Mon Nov 16 03:02:17 2015 - [info] Fetching dead master's binary logs..
    Mon Nov 16 03:02:17 2015 - [info] Executing command on the dead master 192.168.245.129(192.168.245.129:3306): save_binary_logs --command=save --start_file=mysql-bin.000023  --start_pos=35416929 --binlog_dir=/data/mysql/mysql_3306/logs --output_file=/var/log/masterha/app
    1/saved_master_binlog_from_192.168.245.129_3306_20151116030215.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
      Creating /var/log/masterha/app1 if not exists..    ok.
     Concat binary/relay logs from mysql-bin.000023 pos 35416929 to mysql-bin.000023 EOF into /var/log/masterha/app1/saved_master_binlog_from_192.168.245.129_3306_20151116030215.binlog ..
     Binlog Checksum enabled
      Dumping binlog format description event, from position 0 to 120.. ok.
      No need to dump effective binlog data from /data/mysql/mysql_3306/logs/mysql-bin.000023 (pos starts 35416929, filesize 35416929). Skipping.
     Binlog Checksum enabled
     /var/log/masterha/app1/saved_master_binlog_from_192.168.245.129_3306_20151116030215.binlog has no effective data events.
    Event not exists.
    Mon Nov 16 03:02:18 2015 - [info] Additional events were not found from the orig master. No need to save.
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info] * Phase 3.3: Determining New Master Phase..
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
    Mon Nov 16 03:02:18 2015 - [info] All slaves received relay logs to the same position. No need to resync each other.
    Mon Nov 16 03:02:18 2015 - [info] Searching new master from slaves..
    Mon Nov 16 03:02:18 2015 - [info]  Candidate masters from the configuration file:
    Mon Nov 16 03:02:18 2015 - [info]   192.168.245.131(192.168.245.131:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 03:02:18 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 03:02:18 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 03:02:18 2015 - [info]  Non-candidate masters:
    Mon Nov 16 03:02:18 2015 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
    Mon Nov 16 03:02:18 2015 - [info] New master is 192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 03:02:18 2015 - [info] Starting master failover..
    Mon Nov 16 03:02:18 2015 - [info] 
    From:
    192.168.245.129(192.168.245.129:3306) (current master)
     +--192.168.245.131(192.168.245.131:3306)
     +--192.168.245.132(192.168.245.132:3306)
    
    To:
    192.168.245.131(192.168.245.131:3306) (new master)
     +--192.168.245.132(192.168.245.132:3306)
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info] * Phase 3.4: Master Log Apply Phase..
    Mon Nov 16 03:02:18 2015 - [info] 
    Mon Nov 16 03:02:18 2015 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
    Mon Nov 16 03:02:18 2015 - [info] Starting recovery on 192.168.245.131(192.168.245.131:3306)..
    Mon Nov 16 03:02:18 2015 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
    Mon Nov 16 03:02:20 2015 - [info]   done.
    Mon Nov 16 03:02:20 2015 - [info]  All relay logs were successfully applied.
    Mon Nov 16 03:02:20 2015 - [info] Getting new master's binlog name and position..
    Mon Nov 16 03:02:20 2015 - [info]  mysql-bin.000013:35416929
    Mon Nov 16 03:02:20 2015 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.245.131', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000013', MASTER_LOG_POS=35416929, MASTER_USER='repl', MASTER_PASSWOR
    D='xxx';
    Mon Nov 16 03:02:20 2015 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address.
    Mon Nov 16 03:02:20 2015 - [info] ** Finished master recovery successfully.
    Mon Nov 16 03:02:20 2015 - [info] * Phase 3: Master Recovery Phase completed.
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] * Phase 4: Slaves Recovery Phase..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] -- Slave diff file generation on host 192.168.245.132(192.168.245.132:3306) started, pid: 10508. Check tmp log /var/log/masterha/app1/192.168.245.132_3306_20151116030215.log if it takes time..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] Log messages from 192.168.245.132 ...
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
    Mon Nov 16 03:02:20 2015 - [info] End of log messages from 192.168.245.132.
    Mon Nov 16 03:02:20 2015 - [info] -- 192.168.245.132(192.168.245.132:3306) has the latest relay log events.
    Mon Nov 16 03:02:20 2015 - [info] Generating relay diff files from the latest slave succeeded.
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] -- Slave recovery on host 192.168.245.132(192.168.245.132:3306) started, pid: 10510. Check tmp log /var/log/masterha/app1/192.168.245.132_3306_20151116030215.log if it takes time..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] Log messages from 192.168.245.132 ...
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] Starting recovery on 192.168.245.132(192.168.245.132:3306)..
    Mon Nov 16 03:02:20 2015 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
    Mon Nov 16 03:02:20 2015 - [info]   done.
    Mon Nov 16 03:02:20 2015 - [info]  All relay logs were successfully applied.
    Mon Nov 16 03:02:20 2015 - [info]  Resetting slave 192.168.245.132(192.168.245.132:3306) and starting replication from the new master 192.168.245.131(192.168.245.131:3306)..
    Mon Nov 16 03:02:20 2015 - [info]  Executed CHANGE MASTER.
    Mon Nov 16 03:02:20 2015 - [info]  Slave started.
    Mon Nov 16 03:02:20 2015 - [info] End of log messages from 192.168.245.132.
    Mon Nov 16 03:02:20 2015 - [info] -- Slave recovery on host 192.168.245.132(192.168.245.132:3306) succeeded.
    Mon Nov 16 03:02:20 2015 - [info] All new slave servers recovered successfully.
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] * Phase 5: New master cleanup phase..
    Mon Nov 16 03:02:20 2015 - [info] 
    Mon Nov 16 03:02:20 2015 - [info] Resetting slave info on the new master..
    Mon Nov 16 03:02:20 2015 - [info]  192.168.245.131: Resetting slave info succeeded.
    Mon Nov 16 03:02:20 2015 - [info] Master failover to 192.168.245.131(192.168.245.131:3306) completed successfully.
    Mon Nov 16 03:02:20 2015 - [info] Deleted server1 entry from /etc/masterha/app1.conf .
    Mon Nov 16 03:02:20 2015 - [info] 
    
    ----- Failover Report -----
    
    app1: MySQL Master failover 192.168.245.129(192.168.245.129:3306) to 192.168.245.131(192.168.245.131:3306) succeeded
    
    Master 192.168.245.129(192.168.245.129:3306) is down!
    
    Check MHA Manager logs at node3:/var/log/masterha/app1/app1.log for details.
    
    Started automated(non-interactive) failover.
    The latest slave 192.168.245.131(192.168.245.131:3306) has all relay logs for recovery.
    Selected 192.168.245.131(192.168.245.131:3306) as a new master.
    192.168.245.131(192.168.245.131:3306): OK: Applying all logs succeeded.
    192.168.245.132(192.168.245.132:3306): This host has the latest relay log events.
    Generating relay diff files from the latest slave succeeded.
    192.168.245.132(192.168.245.132:3306): OK: Applying all logs succeeded. Slave started, replicating from 192.168.245.131(192.168.245.131:3306)
    192.168.245.131(192.168.245.131:3306): Resetting slave info succeeded.
    Master failover to 192.168.245.131(192.168.245.131:3306) completed successfully.
    View Code)

     (2)修复故障master

     通常情况下自动切换以后,待原master主机修复后,如果数据完整的情况下,可能想把原来master重新作为新主库的slave,这时我们可以借助当时自动切换时刻的MHA日志来完成对原master的修复。下面是提取相关日志的命令:

    [root@node3 masterha]#  grep -i "All other slaves should start" /var/log/masterha/app1/app1.log
    Mon Nov 16 03:02:20 2015 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.245.131', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000013', MASTER_LOG_POS=35416929, MASTER_USER='repl', MASTER_PASSWORD='xxx';

     将change master to拿到原master上执行,那么就作为新master的slave了。

    (3)在线切换测试

     在许多情况下, 需要将现有的主服务器迁移到另外一台服务器上。 比如主服务器硬件故障,RAID 控制卡需要重建,将主服务器移到性能更好的服务器上等等。 MHA 提供快速切换和优雅的阻塞写入,这个切换过程只需要 0.5-2s 的时间,这段时间内数据是无法写入的。在很多情况下,0.5-2s 的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口。

    MHA在线切换的大概过程:
    1.检测复制设置和确定当前主服务器
    2.确定新的主服务器
    3.阻塞写入到当前主服务器
    4.等待所有从服务器赶上复制
    5.授予写入到新的主服务器
    6.重新设置从服务器 

    注意,在线切换的时候应用架构需要考虑以下两个问题:

    1.自动识别master和slave的问题(master的机器可能会切换),如果采用了vip的方式,基本可以解决这个问题。

    2.负载均衡的问题(可以定义大概的读写比例,每台机器可承担的负载比例,当有机器离开集群时,需要考虑这个问题)

    为了保证数据完全一致性,在最快的时间内完成切换,MHA的在线切换必须满足以下条件才会切换成功,否则会切换失败。

    1.所有slave的IO线程都在运行

    2.所有slave的SQL线程都在运行

    3.所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒,如果在切换过程中不指 定running_updates_limit,那么默认情况下running_updates_limit为1秒。

    4.在master端,通过show processlist输出,没有一个更新花费的时间大于running_updates_limit秒。

    在线切换步骤如下:

    首先,停掉MHA监控:

    [root@node3 app1]# svc -d /service/masterha_app
    [root@node3 app1]# checkstatus 
    app1 is stopped(2:NOT_RUNNING).

    其次,进行在线切换操作(模拟在线切换主库操作,原主库192.168.245.129变为slave,192.168.245.131提升为新的主库)

    [root@node3 masterha]# /usr/bin/masterha_master_switch --conf=/etc/masterha/app1.conf --master_state=alive --new_master_host=192.168.245.131 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
    Mon Nov 16 21:47:24 2015 - [info] MHA::MasterRotate version 0.56.
    Mon Nov 16 21:47:24 2015 - [info] Starting online master switch..
    Mon Nov 16 21:47:24 2015 - [info] 
    Mon Nov 16 21:47:24 2015 - [info] * Phase 1: Configuration Check Phase..
    Mon Nov 16 21:47:24 2015 - [info] 
    Mon Nov 16 21:47:24 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Mon Nov 16 21:47:24 2015 - [info] Reading application default configuration from /etc/masterha/app1.conf..
    Mon Nov 16 21:47:24 2015 - [info] Reading server configuration from /etc/masterha/app1.conf..
    Mon Nov 16 21:47:24 2015 - [info] GTID failover mode = 0
    Mon Nov 16 21:47:24 2015 - [info] Current Alive Master: 192.168.245.129(192.168.245.129:3306)
    Mon Nov 16 21:47:24 2015 - [info] Alive Slaves:
    Mon Nov 16 21:47:24 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 21:47:24 2015 - [info]     Replicating from 192.168.245.129(192.168.245.129:3306)
    
    It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.245.129(192.168.245.129:3306)? (YES/no): yes
    Mon Nov 16 21:47:35 2015 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
    Mon Nov 16 21:47:35 2015 - [info]  ok.
    Mon Nov 16 21:47:35 2015 - [info] Checking MHA is not monitoring or doing failover..
    Mon Nov 16 21:47:35 2015 - [info] Checking replication health on 192.168.245.132..
    Mon Nov 16 21:47:35 2015 - [info]  ok.
    Mon Nov 16 21:47:35 2015 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln1218] 192.168.245.131 is not alive!
    Mon Nov 16 21:47:35 2015 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln232] Failed to get new master!
    Mon Nov 16 21:47:35 2015 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53

    这里怎么会报错呢?意思是131这个机器not alive,可是我去检查下没有问题啊,主从也ok的,后来发现了问题所在,原因是:之前启动的manager脚本中加上了--remove_dead_master_conf参数,导致appl.conf中没有131机器的配置。

    然后到app1.conf中加上131机器的配置信息,再次执行:

    [root@node3 masterha]# /usr/bin/masterha_master_switch --conf=/etc/masterha/app1.conf --master_state=alive --new_master_host=192.168.245.129 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
    Mon Nov 16 22:34:36 2015 - [info] MHA::MasterRotate version 0.56.
    Mon Nov 16 22:34:36 2015 - [info] Starting online master switch..
    Mon Nov 16 22:34:36 2015 - [info] 
    Mon Nov 16 22:34:36 2015 - [info] * Phase 1: Configuration Check Phase..
    Mon Nov 16 22:34:36 2015 - [info] 
    Mon Nov 16 22:34:36 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
    Mon Nov 16 22:34:36 2015 - [info] Reading application default configuration from /etc/masterha/app1.conf..
    Mon Nov 16 22:34:36 2015 - [info] Reading server configuration from /etc/masterha/app1.conf..
    Mon Nov 16 22:34:37 2015 - [info] GTID failover mode = 0
    Mon Nov 16 22:34:37 2015 - [info] Current Alive Master: 192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 22:34:37 2015 - [info] Alive Slaves:
    Mon Nov 16 22:34:37 2015 - [info]   192.168.245.129(192.168.245.129:3306)  Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 22:34:37 2015 - [info]     Replicating from 192.168.245.131(192.168.245.131:3306)
    Mon Nov 16 22:34:37 2015 - [info]     Primary candidate for the new Master (candidate_master is set)
    Mon Nov 16 22:34:37 2015 - [info]   192.168.245.132(192.168.245.132:3306)  Version=5.6.21-log (oldest major version between slaves) log-bin:enabled
    Mon Nov 16 22:34:37 2015 - [info]     Replicating from 192.168.245.131(192.168.245.131:3306)
    
    It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.245.131(192.168.245.131:3306)? (YES/no): yes
    Mon Nov 16 22:34:38 2015 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
    Mon Nov 16 22:34:38 2015 - [info]  ok.
    Mon Nov 16 22:34:38 2015 - [info] Checking MHA is not monitoring or doing failover..
    Mon Nov 16 22:34:38 2015 - [info] Checking replication health on 192.168.245.129..
    Mon Nov 16 22:34:38 2015 - [info]  ok.
    Mon Nov 16 22:34:38 2015 - [info] Checking replication health on 192.168.245.132..
    Mon Nov 16 22:34:38 2015 - [info]  ok.
    Mon Nov 16 22:34:38 2015 - [info] 192.168.245.129 can be new master.
    Mon Nov 16 22:34:38 2015 - [info] 
    From:
    192.168.245.131(192.168.245.131:3306) (current master)
     +--192.168.245.129(192.168.245.129:3306)
     +--192.168.245.132(192.168.245.132:3306)
    
    To:
    192.168.245.129(192.168.245.129:3306) (new master)
     +--192.168.245.132(192.168.245.132:3306)
     +--192.168.245.131(192.168.245.131:3306)
    
    Starting master switch from 192.168.245.131(192.168.245.131:3306) to 192.168.245.129(192.168.245.129:3306)? (yes/NO): yes
    Mon Nov 16 22:34:39 2015 - [info] Checking whether 192.168.245.129(192.168.245.129:3306) is ok for the new master..
    Mon Nov 16 22:34:39 2015 - [info]  ok.
    Mon Nov 16 22:34:39 2015 - [info] 192.168.245.131(192.168.245.131:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
    Mon Nov 16 22:34:39 2015 - [info] 192.168.245.131(192.168.245.131:3306): Resetting slave pointing to the dummy host.
    Mon Nov 16 22:34:39 2015 - [info] ** Phase 1: Configuration Check Phase completed.
    Mon Nov 16 22:34:39 2015 - [info] 
    Mon Nov 16 22:34:39 2015 - [info] * Phase 2: Rejecting updates Phase..
    Mon Nov 16 22:34:39 2015 - [info] 
    Mon Nov 16 22:34:39 2015 - [info] Executing master ip online change script to disable write on the current master:
    Mon Nov 16 22:34:39 2015 - [info]   /etc/masterha/master_ip_online_change --command=stop --orig_master_host=192.168.245.131 --orig_master_ip=192.168.245.131 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.245.129 --new_master_ip=192.168.245.129 --new_master_port=3306 --new_master_user='root' --new_master_password='root' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave
    Unknown option: orig_master_password
    Unknown option: new_master_password
    Unknown option: orig_master_ssh_user
    Unknown option: new_master_ssh_user
    Unknown option: orig_master_is_new_slave
    Mon Nov 16 22:34:39 2015 501641 Set read_only on the new master.. ok.
    Mon Nov 16 22:34:39 2015 508208 Set read_only=1 on the orig master.. ok.
    Disabling the VIP on old master: 192.168.245.131 
    Mon Nov 16 22:34:39 2015 876127 Killing all application threads..
    Mon Nov 16 22:34:39 2015 876272 done.
    Mon Nov 16 22:34:39 2015 - [info]  ok.
    Mon Nov 16 22:34:39 2015 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
    Mon Nov 16 22:34:39 2015 - [info] Executing FLUSH TABLES WITH READ LOCK..
    Mon Nov 16 22:34:39 2015 - [info]  ok.
    Mon Nov 16 22:34:39 2015 - [info] Orig master binlog:pos is mysql-bin.000014:120.
    Mon Nov 16 22:34:39 2015 - [info]  Waiting to execute all relay logs on 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 22:34:39 2015 - [info]  master_pos_wait(mysql-bin.000014:120) completed on 192.168.245.129(192.168.245.129:3306). Executed 0 events.
    Mon Nov 16 22:34:39 2015 - [info]   done.
    Mon Nov 16 22:34:39 2015 - [info] Getting new master's binlog name and position..
    Mon Nov 16 22:34:39 2015 - [info]  mysql-bin.000024:120
    Mon Nov 16 22:34:39 2015 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.245.129', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000024', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
    Mon Nov 16 22:34:39 2015 - [info] Executing master ip online change script to allow write on the new master:
    Mon Nov 16 22:34:39 2015 - [info]   /etc/masterha/master_ip_online_change --command=start --orig_master_host=192.168.245.131 --orig_master_ip=192.168.245.131 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.245.129 --new_master_ip=192.168.245.129 --new_master_port=3306 --new_master_user='root' --new_master_password='root' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave
    Unknown option: orig_master_password
    Unknown option: new_master_password
    Unknown option: orig_master_ssh_user
    Unknown option: new_master_ssh_user
    Unknown option: orig_master_is_new_slave
    Mon Nov 16 22:34:40 2015 165977 Set read_only=0 on the new master.
    Enabling the VIP - 192.168.245.100/32 on the new master - 192.168.245.129 
    Mon Nov 16 22:34:40 2015 - [info]  ok.
    Mon Nov 16 22:34:40 2015 - [info] 
    Mon Nov 16 22:34:40 2015 - [info] * Switching slaves in parallel..
    Mon Nov 16 22:34:40 2015 - [info] 
    Mon Nov 16 22:34:40 2015 - [info] -- Slave switch on host 192.168.245.132(192.168.245.132:3306) started, pid: 29022
    Mon Nov 16 22:34:40 2015 - [info] 
    Mon Nov 16 22:34:40 2015 - [info] Log messages from 192.168.245.132 ...
    Mon Nov 16 22:34:40 2015 - [info] 
    Mon Nov 16 22:34:40 2015 - [info]  Waiting to execute all relay logs on 192.168.245.132(192.168.245.132:3306)..
    Mon Nov 16 22:34:40 2015 - [info]  master_pos_wait(mysql-bin.000014:120) completed on 192.168.245.132(192.168.245.132:3306). Executed 0 events.
    Mon Nov 16 22:34:40 2015 - [info]   done.
    Mon Nov 16 22:34:40 2015 - [info]  Resetting slave 192.168.245.132(192.168.245.132:3306) and starting replication from the new master 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 22:34:40 2015 - [info]  Executed CHANGE MASTER.
    Mon Nov 16 22:34:40 2015 - [info]  Slave started.
    Mon Nov 16 22:34:40 2015 - [info] End of log messages from 192.168.245.132 ...
    Mon Nov 16 22:34:40 2015 - [info] 
    Mon Nov 16 22:34:40 2015 - [info] -- Slave switch on host 192.168.245.132(192.168.245.132:3306) succeeded.
    Mon Nov 16 22:34:40 2015 - [info] Unlocking all tables on the orig master:
    Mon Nov 16 22:34:40 2015 - [info] Executing UNLOCK TABLES..
    Mon Nov 16 22:34:40 2015 - [info]  ok.
    Mon Nov 16 22:34:40 2015 - [info] Starting orig master as a new slave..
    Mon Nov 16 22:34:40 2015 - [info]  Resetting slave 192.168.245.131(192.168.245.131:3306) and starting replication from the new master 192.168.245.129(192.168.245.129:3306)..
    Mon Nov 16 22:34:40 2015 - [info]  Executed CHANGE MASTER.
    Mon Nov 16 22:34:41 2015 - [info]  Slave started.
    Mon Nov 16 22:34:41 2015 - [info] All new slave servers switched successfully.
    Mon Nov 16 22:34:41 2015 - [info] 
    Mon Nov 16 22:34:41 2015 - [info] * Phase 5: New master cleanup phase..
    Mon Nov 16 22:34:41 2015 - [info] 
    Mon Nov 16 22:34:41 2015 - [info]  192.168.245.129: Resetting slave info succeeded.
    Mon Nov 16 22:34:41 2015 - [info] Switching master to 192.168.245.129(192.168.245.129:3306) completed successfully.
    View Code

    其中参数的意思:

    --orig_master_is_new_slave 切换时加上此参数是将原 master 变为 slave 节点,如果不加此参数,原来的 master 将不启动

    --running_updates_limit=10000,故障切换时,候选master 如果有延迟的话, mha 切换不能成功,加上此参数表示延迟在此时间范围内都可切换(单位为s),但是切换的时间长短是由recover 时relay 日志的大小决定

    (4)切换时发送邮件:

    需要创建一个send_mail脚本,然后将脚本路径写入app1.conf中即可。

    #!/usr/bin/perl
    
    #  Copyright (C) 2011 DeNA Co.,Ltd.
    #
    #  This program is free software; you can redistribute it and/or modify
    #  it under the terms of the GNU General Public License as published by
    #  the Free Software Foundation; either version 2 of the License, or
    #  (at your option) any later version.
    #
    #  This program is distributed in the hope that it will be useful,
    #  but WITHOUT ANY WARRANTY; without even the implied warranty of
    #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #  GNU General Public License for more details.
    #
    #  You should have received a copy of the GNU General Public License
    #   along with this program; if not, write to the Free Software
    #  Foundation, Inc.,
    #  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
    
    ## Note: This is a sample script and is not complete. Modify the script based on your environment.
    
    use strict;
    use warnings FATAL => 'all';
    use Mail::Sender;
    use Getopt::Long;
    
    #new_master_host and new_slave_hosts are set only when recovering master succeeded
    my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
    my $smtp='smtp.163.com';
    my $mail_from='xxxx';
    my $mail_user='xxxxx';
    my $mail_pass='xxxxx';
    my $mail_to=['xxxx','xxxx'];
    GetOptions(
      'orig_master_host=s' => $dead_master_host,
      'new_master_host=s'  => $new_master_host,
      'new_slave_hosts=s'  => $new_slave_hosts,
      'subject=s'          => $subject,
      'body=s'             => $body,
    );
    
    mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);
    
    sub mailToContacts {
        my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
        open my $DEBUG, "> /tmp/monitormail.log"
            or die "Can't open the debug      file:$!
    ";
        my $sender = new Mail::Sender {
            ctype       => 'text/plain; charset=utf-8',
            encoding    => 'utf-8',
            smtp        => $smtp,
            from        => $mail_from,
            auth        => 'LOGIN',
            TLS_allowed => '0',
            authid      => $user,
            authpwd     => $passwd,
            to          => $mail_to,
            subject     => $subject,
            debug       => $DEBUG
        };
    
        $sender->MailMsg(
            {   msg   => $msg,
                debug => $DEBUG
            }
        ) or print $Mail::Sender::Error;
        return 1;
    }
    
    
    
    # Do whatever you want here
    
    exit 0;
    View Code

    至此,mysql mha部分就搞定了,下面将会结合mha这个架构加入代理层,从而实现读写分离功能,主要采用360公司的Atlas,敬请期待。。。

  • 相关阅读:
    pku 1061 青蛙的约会 扩展欧几里得
    莫比乌斯反演
    51Nod 1240 莫比乌斯函数
    51Nod 1284 2 3 5 7的倍数 容斥原理
    51Nod 1110 距离之和最小 V3 中位数 思维
    51Nod 1108 距离之和最小 V2 1096 距离之和最小 中位数性质
    HDU 2686 Matrix 多线程dp
    51Nod 1084 矩阵取数问题 V2 双线程DP 滚动数组优化
    HDU 1317XYZZY spfa+判断正环+链式前向星(感觉不对,但能A)
    设计模式(4)---单例模式
  • 原文地址:https://www.cnblogs.com/mysql-dba/p/4936708.html
Copyright © 2020-2023  润新知