• MySQL 5.5: an outage caused by physically deleting binlog files


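    Symptoms:

    Shortly after 12:00 noon, the master of a master/slave cluster ran out of memory during a release because huge pages had not been configured. The MySQL instance restarted and MHA performed a failover, which completed normally. Afterwards the old master had to be reconfigured as a slave of the new master. I took the CHANGE MASTER TO ... command from manager.log and ran it, but the replication state stayed at Connecting. Naming for the rest of this post: M1 is the instance that hit the OOM, and S1 is the standby that took over after M1 went down.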

    mysql> show slave status\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting to reconnect after a failed master event read
                      Master_Host: 10.3.171.40
                      Master_User: rep_user
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: centos-bin.000002
              Read_Master_Log_Pos: 107
                   Relay_Log_File: relay-bin.000001
                    Relay_Log_Pos: 4
            Relay_Master_Log_File: centos-bin.000002
                 Slave_IO_Running: Connecting
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: 
              Replicate_Ignore_DB: 
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 0
                       Last_Error: 
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 107
                  Relay_Log_Space: 107
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 0
                   Last_SQL_Error: 
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 2017140

    The error log, shown below, reports that the master's binlog file cannot be found on S1:

    160408 12:25:40 [Note] Slave I/O thread: connected to master 'rep_user@10.3.171.40:3306',replication started in log 'centos-bin.000002' at position 107
    160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
    160408 12:25:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
    160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
    160408 12:26:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
    160408 12:26:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
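
    The error lines above can be picked out of an error log automatically. The following is only a small sketch: the regular expression is written against the exact lines quoted in this post, and the function name find_missing_binlogs is made up for illustration. Errcode 2 is the operating system's "No such file or directory".

```python
import re

# Pattern modeled on the error lines quoted in the post; other MySQL
# versions may word the message differently (assumption).
MISSING_BINLOG = re.compile(
    r"File '(?P<path>[^']+)' not found "
    r"\(Errcode: (?P<errcode>\d+)\)\s*\(\s*server_errno=(?P<server_errno>\d+)\)"
)


def find_missing_binlogs(lines):
    """Return the distinct binlog paths reported as missing, in first-seen order."""
    seen = []
    for line in lines:
        match = MISSING_BINLOG.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


if __name__ == "__main__":
    sample = [
        "160408 12:25:40 [ERROR] Error reading packet from server: "
        "File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)",
        "160408 12:25:40 [Note] Slave I/O thread: Failed reading log event, "
        "reconnecting to retry, log 'centos-bin.000002' at postion 107",
    ]
    print(find_missing_binlogs(sample))
```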

    On S1, show master status and show master logs show that application data is still being written and the position keeps advancing. The odd part is that file 000001 has a size of 0.

    mysql> show master logs;
    +-------------------+-----------+
    | Log_name          | File_size |
    +-------------------+-----------+
    | centos-bin.000001 |         0 |
    | centos-bin.000002 | 568661746 |
    +-------------------+-----------+
    2 rows in set (0.00 sec)
    
    mysql> show master logs;
    +-------------------+-----------+
    | Log_name          | File_size |
    +-------------------+-----------+
    | centos-bin.000001 |         0 |
    | centos-bin.000002 | 568941034 |
    +-------------------+-----------+
    2 rows in set (0.00 sec)
    
    mysql> show master logs;
    +-------------------+-----------+
    | Log_name          | File_size |
    +-------------------+-----------+
    | centos-bin.000001 |         0 |
    | centos-bin.000002 | 569017617 |
    +-------------------+-----------+
    2 rows in set (0.00 sec)

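    In the data directory, however, neither of these two files could be found, which matches the replication error about the missing file.

    So the strange situation is this: the application writes to the database normally and show master status reports a changing position, yet there is no file on disk and replication cannot be established.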

    [root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]# find / -name centos-bin.000002
    [root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]# 
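
    A likely explanation, which the post itself does not spell out, is ordinary POSIX behaviour: rm only removes the directory entry, and a process that already has the file open can keep writing to it through its file descriptor. The sketch below reproduces that effect with a plain temporary file rather than a real mysqld; the file name is borrowed from the post purely for illustration, and it assumes a POSIX system such as Linux.

```python
import os
import tempfile


def write_after_unlink(chunks):
    """Write chunks to a file that has already been unlinked.

    Returns (sizes_seen_through_fd, path_still_exists).
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "centos-bin.000002")  # illustrative name only
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    sizes = []
    try:
        os.unlink(path)  # the "rm": removes the name, not the open file
        for chunk in chunks:
            os.write(fd, chunk)
            sizes.append(os.fstat(fd).st_size)  # size as seen through the fd
        exists = os.path.exists(path)  # a lookup by name finds nothing
    finally:
        os.close(fd)  # space is released only when the last fd is closed
        os.rmdir(tmpdir)
    return sizes, exists


if __name__ == "__main__":
    print(write_after_unlink([b"a" * 100, b"b" * 50]))
```

    The size reported through the descriptor keeps growing while a lookup by name fails, which is the same combination seen on S1: an advancing position and no file for find or the replication dump to open.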

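    #Reproducing the failure

    1) Start the instance normally with binlog enabled and set up replication

    2) Use rm to delete binlog.index and binlog.0000X on the master

    3) Keep writing data; the position keeps changing

    4) The slave reports an error: binlog file not found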

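    #Why did this happen

    Looking back, this incident most likely followed the same path as the reproduction above. The cluster was built three or four months ago. After replication was running normally, the binlog files on the standby were deleted; in fact the whole directory was deleted, a directory dedicated to binlogs and relay logs. When replication was set up again afterwards, CHANGE MASTER TO recreated the relay logs, but not the binlogs. Today's MHA failover promoted the standby to master and it started accepting writes. The file name and position that MHA records come from running show master status on the standby, but those files had already been deleted, so replication failed.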
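
    Since the files had been missing for months before anyone noticed, a periodic check that the binlog index and every file it lists still exist on disk could have caught this earlier. This is a minimal sketch, not something from the original setup: the function name missing_binlog_files is made up, and resolving relative entries against the index file's own directory is an assumption that should be adjusted to the actual datadir or log-bin path.

```python
import os
import tempfile


def missing_binlog_files(index_path, base_dir=None):
    """List binlog-related paths that should exist on disk but do not.

    index_path: path to the binlog index file (one binlog name per line).
    base_dir:   directory used to resolve relative entries; assumed to be
                the index file's own directory when omitted.
    """
    if not os.path.isfile(index_path):
        return [index_path]  # the index itself is gone

    if base_dir is None:
        base_dir = os.path.dirname(os.path.abspath(index_path))

    missing = []
    with open(index_path, encoding="utf-8") as index_file:
        for line in index_file:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            full = entry if os.path.isabs(entry) else os.path.join(base_dir, entry)
            full = os.path.normpath(full)
            if not os.path.isfile(full):
                missing.append(full)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        index = os.path.join(d, "centos-bin.index")
        with open(index, "w", encoding="utf-8") as f:
            f.write("./centos-bin.000001\n./centos-bin.000002\n")
        # Only 000002 exists, so 000001 should be reported as missing.
        open(os.path.join(d, "centos-bin.000002"), "wb").close()
        print(missing_binlog_files(index))
```

    A non-empty result on a standby is worth alerting on, because that standby cannot serve as a usable replication source after a failover.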

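    #Further experiments

    1) After binlog.00001 has been created, rm both binlog.index and binlog.00001 and keep writing so that the position grows. What happens when the file passes 1G and a rotation is due?

    Answer: when file 1 fills up and the rotation happens, binlog.index is gone, so the largest file ID cannot be determined and numbering starts from 1 again. Conclusion: it keeps writing to file 00001.

    2) Keep the index file, delete 00001, and keep writing. What happens once the size passes 1G?

    Answer: file 00002 is created, and it is a normal binlog file that actually lands on disk.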

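    #How can the events be extracted after today's failure?

    In my tests, with statement format the statements in the binlog can be retrieved with show binlog events in xxxx. With row format the actual SQL statements cannot be retrieved this way.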
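
    One Linux-specific avenue that this post did not test: while mysqld is still running, a deleted binlog that it holds open remains reachable under /proc/<pid>/fd, where the symlink target ends in " (deleted)". The sketch below only lists such descriptors; the function name deleted_open_files is made up, and it assumes a Linux /proc filesystem and permission to read the target process's fd directory.

```python
import os
import tempfile


def deleted_open_files(pid):
    """Map fd number -> original path for unlinked files a process still holds open."""
    fd_dir = f"/proc/{pid}/fd"
    suffix = " (deleted)"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        if target.endswith(suffix):
            found[int(name)] = target[: -len(suffix)]
    return found


if __name__ == "__main__":
    # Demo against this process itself instead of a real mysqld.
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "centos-bin.000002")  # illustrative name only
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.unlink(path)
    print(deleted_open_files(os.getpid()).get(fd))
    os.close(fd)
    os.rmdir(tmpdir)
```

    If a binlog shows up in that listing, copying /proc/<pid>/fd/<n> to a regular file before the instance is restarted should preserve its contents, and mysqlbinlog with --base64-output=DECODE-ROWS -v can then print row events as commented pseudo-SQL. Treat this as an idea to verify on a test instance first.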

  • Original post: https://www.cnblogs.com/zuoxingyu/p/5369096.html