• 11.Redis故障模转移测试


    以下是Redis sentinel的拓扑结构 

    [root@redis ~]# ps -ef|grep redis
    root       1259      1  0 3月18 ?       00:08:38 redis-server *:6379
    root       1289      1  0 3月18 ?       00:08:18 redis-server *:6380
    root       1343      1  0 3月18 ?       00:09:04 redis-server *:6381
    root       2064      1  0 03:36 ?        00:09:29 redis-server *:26379 [sentinel]
    root       2076      1  0 03:38 ?        00:09:29 redis-server *:26380 [sentinel]
    root       2087      1  0 03:38 ?        00:10:19 redis-server *:26381 [sentinel]

    说明:6379/6380/6381是三个Redis数据节点,而26379/26380/26381是三个sentinel 

    26379  run_id:6521bce7ca0d09d99d2d52976e71b73d96eafe52
    26380  run_id:4261696cd87cd788f08dcbf7feffe33a960ae8a7
    26381  run_id:cb0049d5314b388e6dbd4ed425029d752d9eaf42

    开始模拟

      模拟故障方法有很多,比较典型的方法有如下几种:

      .方法一,强制杀掉对应节点的进程号,这样可以模拟出宕机的效果

      .方法二,  使用Redis的debug sleep命令,让节点进入睡眠状态,这样可以模拟阻塞的效果

      .方法三,  使用redis的shutdown命令,模拟正常的停掉redis。

      这里使用方法一来进行测试,因为从实际经验来看,数百上千机器偶尔宕机一两台是会不定期出现的,这里记录一下用kill -9 使主节点(这里的主节点是6381)进程宕机的时间:

      注意一下,这里发现kill掉主节点进程之后,主节点中没有打印什么日志信息(可能和shutdown效果不一样),但是发现其余的另外的两个节点中有一些日志信息显示

      看一下6379日志信息: 

    2845:S 24 Mar 2022 21:58:25.702 * RDB age 0 seconds
    2845:S 24 Mar 2022 21:58:25.702 * RDB memory usage when created 1.97 Mb
    2845:S 24 Mar 2022 21:58:25.702 # Done loading RDB, keys loaded: 4, keys expired: 0.
    2845:S 24 Mar 2022 21:58:25.703 * MASTER <-> REPLICA sync: Finished with success
    2845:S 24 Mar 2022 22:04:25.687 # Connection with master lost.  ##
    2845:S 24 Mar 2022 22:04:25.687 * Caching the disconnected master state.
    2845:S 24 Mar 2022 22:04:25.688 * Reconnecting to MASTER 127.0.0.1:6381
    2845:S 24 Mar 2022 22:04:25.688 * MASTER <-> REPLICA sync started
    2845:S 24 Mar 2022 22:04:25.688 # Error condition on socket for SYNC: Connection refused
    2845:S 24 Mar 2022 22:04:26.141 * Connecting to MASTER 127.0.0.1:6381
    2845:S 24 Mar 2022 22:04:26.142 * MASTER <-> REPLICA sync started
    2845:S 24 Mar 2022 22:04:26.142 # Error condition on socket for SYNC: Connection refused
    2845:S 24 Mar 2022 22:04:27.155 * Connecting to MASTER 127.0.0.1:6381
    2845:S 24 Mar 2022 22:04:27.155 * MASTER <-> REPLICA sync started

      可以看到一个时间点间隔(从21:58跳到了22:04)这个过程就是执行了kill命令后所出现的反应,从  Connection with master lost这句话开始 说明6379这个从节点去6381(主节点)请求就开始失败了

      然后22:04:56时它接到sentinel节点命令,清理原理缓冲的主节点状态信息,sentinel将6379节点晋升为主节点,并重写了配置(config rewrite executed with sucess). 

    2845:M 24 Mar 2022 22:04:56.005 * Discarding previously cached master state.
    2845:M 24 Mar 2022 22:04:56.005 # Setting secondary replication ID to 4e096befcb0b46425e84c0935a44afd848cffe63, valid up to offset: 28505953. New replication ID is 42589dec1d800f42453389730f53bf92ea2fda26
    2845:M 24 Mar 2022 22:04:56.005 * MASTER MODE enabled (user request from 'id=10 addr=127.0.0.1:43084 laddr=127.0.0.1:6379 fd=10 name=sentinel-ab0fbd6d-cmd age=391 idle=1 flags=x db=0 sub=0 psub=0 multi=4 qbuf=188 qbuf-free=40766 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
    2845:M 24 Mar 2022 22:04:56.009 # CONFIG REWRITE executed with success.
    2845:M 24 Mar 2022 22:04:56.709 * Replica 127.0.0.1:6380 asks for synchronization
    2845:M 24 Mar 2022 22:04:56.709 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 289 bytes of backlog starting from offset 28505953.

      然后从 Replica 127.0.0.1:6380 asks for synchronization 这句话可以看到6380这个节点发起了连接主节点的请求,

      这时去看sentinel日志:

      这里我们可以26379的日志:  

    2064:X 24 Mar 2022 22:04:55.814 # +try-failover master mymaster 127.0.0.1 6381
    2064:X 24 Mar 2022 22:04:55.817 # +vote-for-leader ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4
    2064:X 24 Mar 2022 22:04:55.824 # 18e34da0c4ee7c3c285d1d2735d8ab12a44cd72b voted for ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4
    2064:X 24 Mar 2022 22:04:55.825 # eadd4c9ebfa6c1fbcbfdc5d7e1eec13c80774ec5 voted for ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4

      这里可以看到sentinel集群在进行选举:  'ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a' 这个是run_id

      再来看看26380的日志: 

    2076:X 24 Mar 2022 22:04:55.706 # +sdown master mymaster 127.0.0.1 6381
    2076:X 24 Mar 2022 22:04:55.821 # +new-epoch 4
    2076:X 24 Mar 2022 22:04:55.823 # +vote-for-leader ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4
    2076:X 24 Mar 2022 22:04:56.700 # +config-update-from sentinel ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 127.0.0.1 26379 @ mymaster 127.0.0.1 6381
    2076:X 24 Mar 2022 22:04:56.700 # +switch-master mymaster 127.0.0.1 6381 127.0.0.1 6379
    2076:X 24 Mar 2022 22:04:56.701 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
    2076:X 24 Mar 2022 22:04:56.701 * +slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
    2076:X 24 Mar 2022 22:05:26.723 # +sdown slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

      从26380日志可以看出,原master(127.0.0.1 6381)被主观下线了,然后sentinel进行选举了,选择了run_id是‘ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a’这个sentinel做了sentinel集群的主节点,从上面可以看到这个sentinel主节点也就是(23679)

      而26381的日志也和26380内容差不多

    2087:X 24 Mar 2022 22:04:55.795 # +sdown master mymaster 127.0.0.1 6381
    2087:X 24 Mar 2022 22:04:55.822 # +new-epoch 4
    2087:X 24 Mar 2022 22:04:55.825 # +vote-for-leader ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4
    2087:X 24 Mar 2022 22:04:55.853 # +odown master mymaster 127.0.0.1 6381 #quorum 3/2
    2087:X 24 Mar 2022 22:04:55.854 # Next failover delay: I will not start a failover before Thu Mar 24 22:10:55 2022
    2087:X 24 Mar 2022 22:04:56.702 # +config-update-from sentinel ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 127.0.0.1 26379 @ mymaster 127.0.0.1 6381
    2087:X 24 Mar 2022 22:04:56.702 # +switch-master mymaster 127.0.0.1 6381 127.0.0.1 6379
    2087:X 24 Mar 2022 22:04:56.702 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
    2087:X 24 Mar 2022 22:04:56.703 * +slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
    2087:X 24 Mar 2022 22:05:26.765 # +sdown slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

      这里来看一下各个节点客观下线的时间 

    26379 2064:X 24 Mar 2022 22:04:55.814 # +odown master mymaster 127.0.0.1 6381 #quorum 2/2
    
    26380  这个在日志中竟然没有找到
    
    26381 2087:X 24 Mar 2022 22:04:55.853 # +odown master mymaster 127.0.0.1 6381 #quorum 3/2

        从这两个日志的+odown客观下线时间来看,26379是最先客观下线的,

      3)故障转移日志:这里可以通过sentinel的主节点(26379)日志看出整个日志转移时的整个流程。 

    2064:X 24 Mar 2022 22:04:55.736 # +sdown master mymaster 127.0.0.1 6381
    2064:X 24 Mar 2022 22:04:55.814 # +odown master mymaster 127.0.0.1 6381 #quorum 2/2  ##对主节点进行客观下线
    2064:X 24 Mar 2022 22:04:55.814 # +new-epoch 4
    2064:X 24 Mar 2022 22:04:55.814 # +try-failover master mymaster 127.0.0.1 6381  ##这里是尝试故障转移6381这个节点
    2064:X 24 Mar 2022 22:04:55.817 # +vote-for-leader ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4  ##这里进行选举sentinel主节点
    2064:X 24 Mar 2022 22:04:55.824 # 18e34da0c4ee7c3c285d1d2735d8ab12a44cd72b voted for ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4   ##run_id
    2064:X 24 Mar 2022 22:04:55.825 # eadd4c9ebfa6c1fbcbfdc5d7e1eec13c80774ec5 voted for ab0fbd6da2b0a72051dbe3bee8ef3f02b3cc9a2a 4
    2064:X 24 Mar 2022 22:04:55.878 # +elected-leader master mymaster 127.0.0.1 6381     ##这里没有没有搞懂意思。。。
    2064:X 24 Mar 2022 22:04:55.879 # +failover-state-select-slave master mymaster 127.0.0.1 6381  ##这里故障转移原主节点(6381)
    2064:X 24 Mar 2022 22:04:55.939 # +selected-slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381   ##选举6379作为redis数据的主节点
    2064:X 24 Mar 2022 22:04:55.939 * +failover-state-send-slaveof-noone slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381 ##这里通过命令salveof no one使得6379成为主节点
    2064:X 24 Mar 2022 22:04:56.003 * +failover-state-wait-promotion slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381   ##这里是等待6379成为redis数据主节点
    2064:X 24 Mar 2022 22:04:56.645 # +promoted-slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381    ##这里是提升6379作为redis数据主节点
    2064:X 24 Mar 2022 22:04:56.645 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6381   ##这里重配置了原主节点6381的配置文件
    2064:X 24 Mar 2022 22:04:56.695 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381   ##这里是重配置了6380的配置文件
    2064:X 24 Mar 2022 22:04:56.901 # -odown master mymaster 127.0.0.1 6381  ##撤销对原主节点进行客观下线
    2064:X 24 Mar 2022 22:04:57.687 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
    2064:X 24 Mar 2022 22:04:57.687 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
    2064:X 24 Mar 2022 22:04:57.746 # +failover-end master mymaster 127.0.0.1 6381  ##原主节点故障转移结束
    2064:X 24 Mar 2022 22:04:57.747 # +switch-master mymaster 127.0.0.1 6381 127.0.0.1 6379  ##这里最好说明了有原主127.0.0.1 6381  ---> 新主 127.0.0.1 6379  这样的一个过程
    2064:X 24 Mar 2022 22:04:57.747 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379  ##原从节点6380开始连接新主节点
    2064:X 24 Mar 2022 22:04:57.747 * +slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379  ##原从节点6381也开始连接新主节点
    2064:X 24 Mar 2022 22:05:27.798 # +sdown slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379  ##这里又判断了6381为主观下线,
    2064:X 25 Mar 2022 09:47:35.167 # +tilt #tilt mode entered

    这里对日志中的一些订阅做解释:

      原主节点如果重启后:

      6381(原主节点)日志:

    5294:S 25 Mar 2022 22:54:11.963 * Connecting to MASTER 127.0.0.1:6379
    5294:S 25 Mar 2022 22:54:11.964 * MASTER <-> REPLICA sync started
    5294:S 25 Mar 2022 22:54:11.964 * REPLICAOF 127.0.0.1:6379 enabled (user request from 'id=3 addr=127.0.0.1:49004 laddr=127.0.0.1:6381 fd=8 name=sentinel-18e34da0-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=196 qbuf-free=40758 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
    5294:S 25 Mar 2022 22:54:11.967 # CONFIG REWRITE executed with success

      这里启动后接到sentinel节点的命令,让它去复制6379节点

     再来看一下6379(现主节点)日志

    2845:M 25 Mar 2022 22:54:11.972 * Replica 127.0.0.1:6381 asks for synchronization
    2845:M 25 Mar 2022 22:54:11.972 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'f4503c1ab5bec8923b2fb0559e95a58f0cd9f039', my replication IDs are '42589dec1d800f42453389730f53bf92ea2fda26' and '4e096befcb0b46425e84c0935a44afd848cffe63')
    2845:M 25 Mar 2022 22:54:11.972 * Starting BGSAVE for SYNC with target: disk
    2845:M 25 Mar 2022 22:54:11.973 * Background saving started by pid 5309
    5309:C 25 Mar 2022 22:54:11.977 * DB saved on disk
    5309:C 25 Mar 2022 22:54:11.978 * RDB: 2 MB of memory used by copy-on-write
    2845:M 25 Mar 2022 22:54:12.021 * Background saving terminated with success
    2845:M 25 Mar 2022 22:54:12.022 * Synchronization with replica 127.0.0.1:6381 succeeded

      这时可以新的主机点中看到,有来自失联多年的兄弟6381的请求复制,

      然后再来看看sentinel日志:

      26379日志: 撤销了对6381节点主观下线的决定 

    2064:X 26 Mar 2022 11:04:56.643 # -sdown slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

      26380日志:也撤销了对6381节点主观下线的决定

    2076:X 26 Mar 2022 11:04:56.921 # -sdown slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

      26381日志:也是撤销了对6381的主观下线的决定

    注意点:

      部署各个节点的机器时间尽量要同步,否则日志的时序性会很混乱,例如可以给机器添加NTP服务来同步时间

    这里补充一下如果其中一个节点宕掉了,会出现如下日志情况:

     会在redis数据的主节点出现连接丢失的报错 

    2845:M 26 Mar 2022 11:02:43.602 # Connection with replica 127.0.0.1:6381 lost.

      

      

     

        

         

      

        

  • 相关阅读:
    JS数组常用方法参考---5、sort方法
    JS数组常用方法---1、课程介绍
    JS数组常用方法参考---4、unshift方法
    JS数组常用方法参考---3、队列对应的方法
    ES6课程---12、面向对象
    legend3---30、不同类型的网站打包方案
    数据库Sharding的基本思想和切分策略
    用sharding技术来扩展你的数据库(一)sharding 介绍
    什么是Scale Up和Scale Out?
    淘宝网采用什么技术架构来实现网站高负载的
  • 原文地址:https://www.cnblogs.com/zmc60/p/16058114.html
Copyright © 2020-2023  润新知