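Environment: MySQL 5.7.25, master-master replication
Symptom: replication failed in both directions, with error 1236 on both ends. Running show slave status on each of the two masters showed the following key information: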
Master1:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'A slave with the same server_uuid/server_id as this slave has connected to the master; the first event 'mybinlog.000002' at 284776285, the last event read from '/data/mysql/mybinlog.000007' at 769196837, the last byte read from '/data/mysql/mybinlog.000007' at 769196837.'
Master2:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'A slave with the same server_uuid/server_id as this slave has connected to the master; the first event 'mybinlog.000002' at 284777403, the last event read from '/data/mysql/mybinlog.000007' at 790522661, the last byte read from '/data/mysql/mybinlog.000007' at 790522661.'
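As shown, Slave_IO_Running, the indicator we usually watch, has changed to No, and Last_IO_Errno is 1236.
The most conspicuous part of the error text is "A slave with the same server_uuid/server_id as this slave has connected to the master". Yet in this master-master setup the two nodes have different server_id and server_uuid values, so how could a slave with the same server_uuid or server_id exist? Very strange.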
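In the end we took the time of the error to the customer and asked whether any change had been made. It turned out that the environment runs on a virtualization platform, and at exactly that moment the user had cloned Master1 and Master2 into two new virtual machines, New1 and New2. The clone New1 was identical to Master1 and replicated from Master2; New2 was identical to Master2 and replicated from Master1. Each master therefore had two replicas connecting with the same server_uuid and server_id, which caused the failure.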
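A quick way to confirm this kind of conflict is to compare the identity each node reports and to see which hosts currently hold a binlog dump connection on a master. The following is a minimal sketch, not the procedure used in this incident: the function names and the host names in the usage note are hypothetical, and it assumes the mysql client can log in to each node non-interactively (for example with credentials in ~/.my.cnf).

```shell
# Hypothetical helpers; assume the mysql client can log in without a prompt.

# Print the hostname, server_id and server_uuid reported by the given node.
show_identity() {
    mysql -h "$1" -N -e "SELECT @@hostname, @@server_id, @@server_uuid;"
}

# List the client hosts that currently hold a binlog dump connection on the
# given master, with the number of seconds each thread has been in that state.
list_dump_clients() {
    mysql -h "$1" -N -e "SELECT host, time FROM information_schema.processlist WHERE command LIKE 'Binlog Dump%';"
}
```

For example, running show_identity against master1 and then against new1 would show the same server_id and server_uuid on a freshly cloned pair, and running list_dump_clients against master2 repeatedly would likely show the connected host alternating between the original and its clone.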
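Once the cause was known, we agreed with the user to stop the slave threads on New1 and New2 and then restart the slave threads on Master1 and Master2, after which replication returned to normal.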
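The recovery can be sketched roughly as follows. This is an illustration under assumptions, not a record of the exact commands that were run: the function names and host names are hypothetical, and the mysql client is assumed to be able to log in to each node without prompting.

```shell
# Hypothetical helpers; assume the mysql client can log in without a prompt.

# Stop replication on a cloned node so it stops connecting to the master
# with the same identity as the original node.
stop_clone_replication() {
    mysql -h "$1" -e "STOP SLAVE;"
}

# Restart the replication threads on an original master, then print the
# status fields that showed the failure.
restart_replication() {
    mysql -h "$1" -e "STOP SLAVE; START SLAVE;"
    mysql -h "$1" -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:|Last_IO_Errno:'
}
```

Usage would be stop_clone_replication for new1 and new2 first, then restart_replication for master1 and master2. After a successful restart, Slave_IO_Running and Slave_SQL_Running should both read Yes and Last_IO_Errno should be 0.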
If you need to change server_uuid or server_id, they are set in auto.cnf and my.cnf respectively.
[root@test01 mysql]# cat auto.cnf
[auto]
server-uuid=08c887bf-98ab-11ea-b70c-080027c2997a
[root@test01 mysql]# grep server-id /etc/mysql/my.cnf
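#4)server-id = 1121 Make sure it differs on every node of a master-slave or master-master setup; one possible rule is to use the last two segments of the IP address, e.g. 192.168.1.121 gives server-id=1121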
server-id = 1121
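To give a cloned machine its own identity, the two values above have to change before its replication is started again. The sketch below is one possible way to do that, with hypothetical function names. It relies on the documented behavior that mysqld generates a new server_uuid and writes a fresh auto.cnf at startup when the file is missing from the data directory. The data directory path is passed in by the caller and is an assumption about your layout.

```shell
# Derive a server-id from the last two octets of an IPv4 address,
# following the rule in the my.cnf comment: 192.168.1.121 -> 1121.
server_id_from_ip() {
    local ip=$1 third fourth
    third=${ip#*.*.}      # drop the first two octets, e.g. "1.121"
    third=${third%%.*}    # keep only the third octet, e.g. "1"
    fourth=${ip##*.}      # keep only the fourth octet, e.g. "121"
    printf '%s%s\n' "$third" "$fourth"
}

# Set aside auto.cnf in the given data directory so that mysqld generates a
# new server_uuid on its next start. Run this only while mysqld is stopped.
reset_server_uuid() {
    local datadir=$1
    [ -f "$datadir/auto.cnf" ] || { echo "no auto.cnf in $datadir" >&2; return 1; }
    mv "$datadir/auto.cnf" "$datadir/auto.cnf.bak"
}
```

On a clone, one would stop mysqld, call reset_server_uuid with the data directory, set server-id in my.cnf to the value printed by server_id_from_ip for the clone's own address, and start mysqld again. Note that plain concatenation is not collision-free: addresses ending in 1.121 and 11.21 both yield 1121, so the result is still worth checking against the other nodes.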
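Fortunately, the network interface on the cloned machines was renamed from eth2 to eth3. The keepalived log on the clones shows that keepalived failed to start because of the wrong interface name. Otherwise, who knows whether a VIP conflict would have corrupted data; if it had, that would have been rather grim.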