Redis Source Code Analysis: 21 Sentinel (Part 2): Sending Periodic Messages and Detecting Subjective Down

6: Sending messages periodically

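At regular intervals, a Sentinel sends commands to every instance it monitors in order to learn their state. These commands are "PING", "INFO" and "PUBLISH".

"PING" is mainly used to probe whether an instance is alive. If the instance has not replied to "PING" within the configured time, the Sentinel considers it subjectively down.

"INFO" is mainly used to fetch the instance's current state and information: whether it is currently a master or a slave; whether the IP address and port it reports match what the Sentinel has recorded; if it is a master, which slaves it has; if it is a slave, whether its link to the master is up, what its priority is, what its replication offset is, and so on. This information is essential for judging instance state during failover.

"PUBLISH" is mainly used by the Sentinel to publish information about itself and the master to the instance's HELLO channel, the so-called HELLO message. Because every Sentinel subscribes to the HELLO channel of the master and its slaves, each Sentinel receives what the other Sentinels publish.

As a result, although the configuration file only lists the master, a Sentinel can learn about all slaves from the master's "INFO" reply, and it can learn about all other Sentinels monitoring the same master by subscribing to the HELLO channel and receiving the messages they send with "PUBLISH".

The "main function" sentinelHandleRedisInstance sends these commands by calling sentinelSendPeriodicCommands. Note that each command has its own period. sentinelSendPeriodicCommands does not send all three commands at once; on each call it sends at most one command, the first one whose period has elapsed. The code is as follows: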

void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) {
    mstime_t now = mstime();
    mstime_t info_period, ping_period;
    int retval;

    /* Return ASAP if we have already a PING or INFO already pending, or
     * in the case the instance is not properly connected. */
    if (ri->flags & SRI_DISCONNECTED) return;

    /* For INFO, PING, PUBLISH that are not critical commands to send we
     * also have a limit of SENTINEL_MAX_PENDING_COMMANDS. We don't
     * want to use a lot of memory just because a link is not working
     * properly (note that anyway there is a redundant protection about this,
     * that is, the link will be disconnected and reconnected if a long
     * timeout condition is detected. */
    if (ri->pending_commands >= SENTINEL_MAX_PENDING_COMMANDS) return;

    /* If this is a slave of a master in O_DOWN condition we start sending
     * it INFO every second, instead of the usual SENTINEL_INFO_PERIOD
     * period. In this state we want to closely monitor slaves in case they
     * are turned into masters by another Sentinel, or by the sysadmin. */
    if ((ri->flags & SRI_SLAVE) &&
        (ri->master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS))) {
        info_period = 1000;
    } else {
        info_period = SENTINEL_INFO_PERIOD;
    }

    /* We ping instances every time the last received pong is older than
     * the configured 'down-after-milliseconds' time, but every second
     * anyway if 'down-after-milliseconds' is greater than 1 second. */
    ping_period = ri->down_after_period;
    if (ping_period > SENTINEL_PING_PERIOD) ping_period = SENTINEL_PING_PERIOD;

    if ((ri->flags & SRI_SENTINEL) == 0 &&
        (ri->info_refresh == 0 ||
        (now - ri->info_refresh) > info_period))
    {
        /* Send INFO to masters and slaves, not sentinels. */
        retval = redisAsyncCommand(ri->cc,
            sentinelInfoReplyCallback, NULL, "INFO");
        if (retval == REDIS_OK) ri->pending_commands++;
    } else if ((now - ri->last_pong_time) > ping_period) {
        /* Send PING to all the three kinds of instances. */
        sentinelSendPing(ri);
    } else if ((now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD) {
        /* PUBLISH hello messages to all the three kinds of instances. */
        sentinelSendHello(ri);
    }
}

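If SRI_DISCONNECTED is set in the instance flags, the async context for this instance has not been created yet, so the function returns immediately.

The pending_commands attribute counts the commands sent to the instance whose replies have not yet arrived. It is incremented each time redisAsyncCommand successfully queues a command for the instance, and decremented each time a reply is received.

Therefore, if it has reached SENTINEL_MAX_PENDING_COMMANDS (100), at least 100 commands are still waiting for replies. That means the link to the instance is no longer healthy, so the function returns immediately to avoid wasting memory.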
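
The counter logic can be sketched in isolation. This is a simplified illustration, not Sentinel's code: the type link_state and the three helper functions are hypothetical names, and only the limit of 100 is taken from the text.

```c
#include <stdbool.h>

#define MAX_PENDING 100  /* mirrors SENTINEL_MAX_PENDING_COMMANDS */

/* Hypothetical, reduced stand-in for sentinelRedisInstance. */
typedef struct {
    int pending_commands;  /* commands sent, reply not yet received */
} link_state;

/* Returns true if a non-critical command may be queued right now. */
bool can_send(const link_state *ls) {
    return ls->pending_commands < MAX_PENDING;
}

/* Call after a command was queued successfully. */
void on_command_queued(link_state *ls) {
    ls->pending_commands++;
}

/* Call from every reply callback. */
void on_reply(link_state *ls) {
    ls->pending_commands--;
}
```

Once 100 commands are outstanding, can_send stays false until a reply callback brings the counter back down.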

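Next, info_period and ping_period are computed. They are the periods for sending "INFO" and "PING". If the time since the last "INFO" or "PING" reply exceeds info_period or ping_period, the corresponding command is sent to the instance.

If the instance is a slave and its master is objectively down or has a failover in progress, info_period is set to 1000; otherwise it is set to SENTINEL_INFO_PERIOD (10000). "INFO" is sent more often in that situation because the slave may be promoted to be the new master, so its state has to be tracked more closely.

ping_period is set to ri->down_after_period, which comes from the down-after-milliseconds option in the configuration file. If that value is greater than SENTINEL_PING_PERIOD (1000), ping_period is capped at SENTINEL_PING_PERIOD.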
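
As a small worked example, the two periods can be written as pure functions. The function names and the boolean parameters are assumptions made for this sketch; the constants are the values quoted above.

```c
#include <stdbool.h>

#define INFO_PERIOD_MS 10000  /* mirrors SENTINEL_INFO_PERIOD */
#define PING_PERIOD_MS 1000   /* mirrors SENTINEL_PING_PERIOD */

/* INFO every second for slaves of a master that is O_DOWN or failing over,
 * every 10 seconds otherwise. */
long long info_period_ms(bool is_slave, bool master_odown_or_failover) {
    return (is_slave && master_odown_or_failover) ? 1000 : INFO_PERIOD_MS;
}

/* PING period is down-after-milliseconds, capped at one second. */
long long ping_period_ms(long long down_after_ms) {
    return down_after_ms > PING_PERIOD_MS ? PING_PERIOD_MS : down_after_ms;
}
```

With down-after-milliseconds set to 30000, the ping period is 1000 ms; with 500, it is 500 ms.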

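Then the commands are sent. If the instance is not a Sentinel, and either no "INFO" reply has been received yet (ri->info_refresh is 0) or more than info_period has passed since the last "INFO" reply, "INFO" is sent to the instance asynchronously.

Otherwise, if more than ping_period has passed since the last "PING" reply, sentinelSendPing is called to send "PING" asynchronously.

Otherwise, if more than SENTINEL_PUBLISH_PERIOD (2000) has passed since the last "PUBLISH" reply, sentinelSendHello is called to send "PUBLISH" asynchronously.

In summary, "PING" probes whether an instance is alive and can be sent to every kind of instance. "INFO" fetches instance information and only needs to be sent to masters and slaves. "PUBLISH" publishes information about the Sentinel itself and the master to the HELLO channel; besides masters and slaves, Sentinels implement their own "PUBLISH" handler, so this command is sent to Sentinel instances too.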
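
The selection order described above (INFO first, then PING, then PUBLISH, at most one per call) can be sketched as a pure function. The struct instance_times, the enum next_cmd and pick_command are hypothetical names for illustration; the real function sends the command instead of returning a value.

```c
#include <stdbool.h>

#define PUBLISH_PERIOD_MS 2000 /* mirrors SENTINEL_PUBLISH_PERIOD */

typedef enum { CMD_NONE, CMD_INFO, CMD_PING, CMD_PUBLISH } next_cmd;

/* Hypothetical, reduced view of the per-instance timestamps (ms). */
typedef struct {
    bool is_sentinel;
    long long info_refresh;    /* 0 = no INFO reply received yet */
    long long last_pong_time;
    long long last_pub_time;
} instance_times;

/* Returns the single command that is due, checked in priority order. */
next_cmd pick_command(const instance_times *t, long long now,
                      long long info_period, long long ping_period) {
    if (!t->is_sentinel &&
        (t->info_refresh == 0 || now - t->info_refresh > info_period))
        return CMD_INFO;
    if (now - t->last_pong_time > ping_period) return CMD_PING;
    if (now - t->last_pub_time > PUBLISH_PERIOD_MS) return CMD_PUBLISH;
    return CMD_NONE;
}
```

Because the checks form an if/else chain, a due "INFO" postpones "PING" and "PUBLISH" to a later call.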

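1: The PING message

sentinelSendPing sends "PING" to an instance. Because this command is used to detect whether the instance is subjectively down, it is analyzed later, in the section on subjective-down detection.

2: The HELLO message

sentinelSendHello publishes the HELLO message. Its code is as follows: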

int sentinelSendHello(sentinelRedisInstance *ri) {
    char ip[REDIS_IP_STR_LEN];
    char payload[REDIS_IP_STR_LEN+1024];
    int retval;
    char *announce_ip;
    int announce_port;
    sentinelRedisInstance *master = (ri->flags & SRI_MASTER) ? ri : ri->master;
    sentinelAddr *master_addr = sentinelGetCurrentMasterAddress(master);

    if (ri->flags & SRI_DISCONNECTED) return REDIS_ERR;

    /* Use the specified announce address if specified, otherwise try to
     * obtain our own IP address. */
    if (sentinel.announce_ip) {
        announce_ip = sentinel.announce_ip;
    } else {
        if (anetSockName(ri->cc->c.fd,ip,sizeof(ip),NULL) == -1)
            return REDIS_ERR;
        announce_ip = ip;
    }
    announce_port = sentinel.announce_port ?
                    sentinel.announce_port : server.port;

    /* Format and send the Hello message. */
    snprintf(payload,sizeof(payload),
        "%s,%d,%s,%llu," /* Info about this sentinel. */
        "%s,%s,%d,%llu", /* Info about current master. */
        announce_ip, announce_port, server.runid,
        (unsigned long long) sentinel.current_epoch,
        /* --- */
        master->name,master_addr->ip,master_addr->port,
        (unsigned long long) master->config_epoch);
    retval = redisAsyncCommand(ri->cc,
        sentinelPublishReplyCallback, NULL, "PUBLISH %s %s",
            SENTINEL_HELLO_CHANNEL,payload);
    if (retval != REDIS_OK) return REDIS_ERR;
    ri->pending_commands++;
    return REDIS_OK;
}

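First the master instance that ri belongs to is obtained (ri itself if it is a master, otherwise ri->master), and sentinelGetCurrentMasterAddress is called to get the master's address.

If SRI_DISCONNECTED is set in ri's flags, the function returns immediately.

If sentinel.announce_ip is configured, it is used as this Sentinel's IP address; otherwise anetSockName is called to obtain the local IP address from the socket descriptor.

If sentinel.announce_port is configured, it is used as this Sentinel's port; otherwise server.port is used.

Then the HELLO message is assembled. Its format is: "sentinel_ip,sentinel_port,sentinel_runid,current_epoch,master_name,master_ip,master_port,master_config_epoch"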
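
To make the format concrete, here is a minimal sketch that builds such a payload with snprintf. format_hello is a hypothetical helper, and the addresses and run ID used in the usage note are made-up sample values.

```c
#include <stdio.h>

/* Formats a HELLO payload into buf using the eight-field layout above.
 * Returns what snprintf returns: the length the full payload needs. */
int format_hello(char *buf, size_t buflen,
                 const char *ip, int port, const char *runid,
                 unsigned long long current_epoch,
                 const char *master_name, const char *master_ip,
                 int master_port, unsigned long long master_config_epoch) {
    return snprintf(buf, buflen, "%s,%d,%s,%llu,%s,%s,%d,%llu",
                    ip, port, runid, current_epoch,
                    master_name, master_ip, master_port,
                    master_config_epoch);
}
```

For a Sentinel at 10.0.0.5:26379 with run ID abc123 and epoch 3, monitoring mymaster at 10.0.0.1:6379 with config epoch 2, the payload is "10.0.0.5,26379,abc123,3,mymaster,10.0.0.1,6379,2".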

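Next, the command "PUBLISH __sentinel__:hello <HELLO>" is sent to ri asynchronously, with sentinelPublishReplyCallback as the reply callback.

When the Sentinel receives the instance's reply to this "PUBLISH", sentinelPublishReplyCallback is called. It only updates ri->last_pub_time, and only when the reply is not an error; the content of the reply is otherwise ignored: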

void sentinelPublishReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = c->data;
    redisReply *r;
    REDIS_NOTUSED(privdata);

    if (ri) ri->pending_commands--;
    if (!reply || !ri) return;
    r = reply;

    /* Only update pub_time if we actually published our message. Otherwise
     * we'll retry again in 100 milliseconds. */
    if (r->type != REDIS_REPLY_ERROR)
        ri->last_pub_time = mstime();
}

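As described earlier for sentinelReconnectInstance, when the Sentinel establishes the subscription connection to a master or slave, it sends "SUBSCRIBE __sentinel__:hello" to subscribe to the HELLO channel, with sentinelReceiveHelloMessages as the callback. So whenever a message is published on that channel, sentinelReceiveHelloMessages is called.

The messages on this channel are HELLO messages sent by the other Sentinels monitoring the same instance. Through them, the current Sentinel discovers other Sentinels and exchanges the latest master information with them. The code of sentinelReceiveHelloMessages is as follows: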

void sentinelReceiveHelloMessages(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = c->data;
    redisReply *r;
    REDIS_NOTUSED(privdata);

    if (!reply || !ri) return;
    r = reply;

    /* Update the last activity in the pubsub channel. Note that since we
     * receive our messages as well this timestamp can be used to detect
     * if the link is probably disconnected even if it seems otherwise. */
    ri->pc_last_activity = mstime();

    /* Sanity check in the reply we expect, so that the code that follows
     * can avoid to check for details. */
    if (r->type != REDIS_REPLY_ARRAY ||
        r->elements != 3 ||
        r->element[0]->type != REDIS_REPLY_STRING ||
        r->element[1]->type != REDIS_REPLY_STRING ||
        r->element[2]->type != REDIS_REPLY_STRING ||
        strcmp(r->element[0]->str,"message") != 0) return;

    /* We are not interested in meeting ourselves */
    if (strstr(r->element[2]->str,server.runid) != NULL) return;

    sentinelProcessHelloMessage(r->element[2]->str, r->element[2]->len);
}

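The function first updates ri->pc_last_activity to the current time.

It then decides whether to process the message. Only "message" replies are processed, so the "subscribe" confirmation is ignored.

If the "message" payload contains this Sentinel's own runid, the message was published by this Sentinel itself, so it is skipped and the function returns.

Finally, sentinelProcessHelloMessage is called to handle the HELLO message.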
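
The filtering step can be sketched without hiredis types. In this illustration, should_process_hello is a hypothetical helper; kind stands for the first element of the pub/sub reply and payload for the third.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if a pub/sub reply should be handed to the HELLO parser.
 * kind: first reply element ("message", "subscribe", ...).
 * payload: third reply element. my_runid: the local Sentinel's run ID. */
bool should_process_hello(const char *kind, const char *payload,
                          const char *my_runid) {
    if (kind == NULL || payload == NULL) return false;
    if (strcmp(kind, "message") != 0) return false;       /* e.g. "subscribe" */
    if (strstr(payload, my_runid) != NULL) return false;  /* our own message */
    return true;
}
```

Like the original, this uses a substring search for the run ID rather than comparing the third field exactly, which is cheap and sufficient because run IDs are long random strings.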

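Note: during testing, duplicate HELLO messages were observed on slaves, that is, two identical messages published by the same Sentinel at the same moment. The reason is that the "PUBLISH" a Sentinel sends to the master is propagated to the slaves through replication, while the same Sentinel also sends "PUBLISH" directly to each slave. The slave therefore receives two identical HELLO messages at about the same time and publishes both on the channel.

In addition, once a Sentinel has discovered another Sentinel, it can send "PUBLISH __sentinel__:hello <HELLO>" to it directly. Sentinel implements its own "PUBLISH" handler, sentinelPublishCommand, which is called when a HELLO message arrives directly from another Sentinel. Its code is as follows: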

void sentinelPublishCommand(redisClient *c) {
    if (strcmp(c->argv[1]->ptr,SENTINEL_HELLO_CHANNEL)) {
        addReplyError(c, "Only HELLO messages are accepted by Sentinel instances.");
        return;
    }
    sentinelProcessHelloMessage(c->argv[2]->ptr,sdslen(c->argv[2]->ptr));
    addReplyLongLong(c,1);
}

So whether a HELLO message arrives over a real subscription or as a "PUBLISH" command sent directly by another Sentinel, it is ultimately handled by sentinelProcessHelloMessage. Its code is as follows:

void sentinelProcessHelloMessage(char *hello, int hello_len) {
    /* Format is composed of 8 tokens:
     * 0=ip,1=port,2=runid,3=current_epoch,4=master_name,
     * 5=master_ip,6=master_port,7=master_config_epoch. */
    int numtokens, port, removed, master_port;
    uint64_t current_epoch, master_config_epoch;
    char **token = sdssplitlen(hello, hello_len, ",", 1, &numtokens);
    sentinelRedisInstance *si, *master;

    if (numtokens == 8) {
        /* Obtain a reference to the master this hello message is about */
        master = sentinelGetMasterByName(token[4]);
        if (!master) goto cleanup; /* Unknown master, skip the message. */

        /* First, try to see if we already have this sentinel. */
        port = atoi(token[1]);
        master_port = atoi(token[6]);
        si = getSentinelRedisInstanceByAddrAndRunID(
                        master->sentinels,token[0],port,token[2]);
        current_epoch = strtoull(token[3],NULL,10);
        master_config_epoch = strtoull(token[7],NULL,10);

        if (!si) {
            /* If not, remove all the sentinels that have the same runid
             * OR the same ip/port, because it's either a restart or a
             * network topology change. */
            removed = removeMatchingSentinelsFromMaster(master,token[0],port,
                            token[2]);
            if (removed) {
                sentinelEvent(REDIS_NOTICE,"-dup-sentinel",master,
                    "%@ #duplicate of %s:%d or %s",
                    token[0],port,token[2]);
            }

            /* Add the new sentinel. */
            si = createSentinelRedisInstance(NULL,SRI_SENTINEL,
                            token[0],port,master->quorum,master);
            if (si) {
                sentinelEvent(REDIS_NOTICE,"+sentinel",si,"%@");
                /* The runid is NULL after a new instance creation and
                 * for Sentinels we don't have a later chance to fill it,
                 * so do it now. */
                si->runid = sdsnew(token[2]);
                sentinelFlushConfig();
            }
        }

        /* Update local current_epoch if received current_epoch is greater.*/
        if (current_epoch > sentinel.current_epoch) {
            sentinel.current_epoch = current_epoch;
            sentinelFlushConfig();
            sentinelEvent(REDIS_WARNING,"+new-epoch",master,"%llu",
                (unsigned long long) sentinel.current_epoch);
        }

        /* Update master info if received configuration is newer. */
        if (master->config_epoch < master_config_epoch) {
            master->config_epoch = master_config_epoch;
            if (master_port != master->addr->port ||
                strcmp(master->addr->ip, token[5]))
            {
                sentinelAddr *old_addr;

                sentinelEvent(REDIS_WARNING,"+config-update-from",si,"%@");
                sentinelEvent(REDIS_WARNING,"+switch-master",
                    master,"%s %s %d %s %d",
                    master->name,
                    master->addr->ip, master->addr->port,
                    token[5], master_port);

                old_addr = dupSentinelAddr(master->addr);
                sentinelResetMasterAndChangeAddress(master, token[5], master_port);
                sentinelCallClientReconfScript(master,
                    SENTINEL_OBSERVER,"start",
                    old_addr,master->addr);
                releaseSentinelAddr(old_addr);
            }
        }

        /* Update the state of the Sentinel. */
        if (si) si->last_hello_time = mstime();
    }

cleanup:
    sdsfreesplitres(token,numtokens);
}

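The message is split on commas and is processed only if it has exactly 8 tokens. First, sentinelGetMasterByName is called with the master_name from the message to look up the master instance in the sentinel.masters dictionary. If it is not found, the function exits.

Then getSentinelRedisInstanceByAddrAndRunID is called with the sentinel_ip, sentinel_port and sentinel_runid from the message to find, in the master->sentinels dictionary, a Sentinel instance whose runid, ip and port all match.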
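
The lookups above depend on the payload having been split into its eight fields. The following is a minimal sketch of that split in plain C. It does not use sds; hello_msg, copy_field and parse_hello are hypothetical names, and the fixed buffer sizes are arbitrary choices for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the eight HELLO fields. */
typedef struct {
    char ip[64];
    int port;
    char runid[64];
    unsigned long long current_epoch;
    char master_name[64];
    char master_ip[64];
    int master_port;
    unsigned long long master_config_epoch;
} hello_msg;

/* Copies src into a fixed-size field, truncating if needed. */
static void copy_field(char *dst, size_t dstlen, const char *src) {
    strncpy(dst, src, dstlen - 1);
    dst[dstlen - 1] = '\0';
}

/* Parses a HELLO payload. Returns 0 on success, or -1 if the payload is
 * too long or does not have exactly 8 comma-separated tokens. */
int parse_hello(const char *payload, hello_msg *out) {
    char copy[512];
    char *tok[8];
    int n = 0;

    if (strlen(payload) >= sizeof(copy)) return -1;
    strcpy(copy, payload);

    /* Split the private copy in place on ','. */
    char *p = copy;
    for (;;) {
        if (n == 8) return -1;  /* a ninth token would follow */
        tok[n++] = p;
        char *comma = strchr(p, ',');
        if (comma == NULL) break;
        *comma = '\0';
        p = comma + 1;
    }
    if (n != 8) return -1;

    copy_field(out->ip, sizeof(out->ip), tok[0]);
    out->port = atoi(tok[1]);
    copy_field(out->runid, sizeof(out->runid), tok[2]);
    out->current_epoch = strtoull(tok[3], NULL, 10);
    copy_field(out->master_name, sizeof(out->master_name), tok[4]);
    copy_field(out->master_ip, sizeof(out->master_ip), tok[5]);
    out->master_port = atoi(tok[6]);
    out->master_config_epoch = strtoull(tok[7], NULL, 10);
    return 0;
}
```

As in sentinelProcessHelloMessage, a payload with any other token count is simply ignored.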

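If no matching Sentinel instance is found, either this is a newly discovered Sentinel or the information of a known Sentinel has changed (for example, the Sentinel was restarted and got a new runid, or the network topology changed and its ip or port is different).

In that case, removeMatchingSentinelsFromMaster is first called to delete from master->sentinels any Sentinel instance with the same runid, or with the same ip and port. Then a new Sentinel instance is created from the ip and port in the HELLO message and added to master->sentinels, so that the next call to sentinelReconnectInstance will establish a connection to it.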

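Next, regardless of whether a matching Sentinel instance was found, if the current_epoch in the HELLO message is greater than this Sentinel's current_epoch, the local current_epoch is updated.

If the master_config_epoch in the HELLO message is greater than the config_epoch this Sentinel has recorded for the master, the recorded config_epoch is updated. In addition, if the master_ip or master_port in the message differs from the recorded master address, a failover has probably happened and a slave has been promoted to be the new master, so sentinelResetMasterAndChangeAddress is called to reset the master instance and its slave instances.

Finally, si->last_hello_time is updated to the current time.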
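
The epoch comparison is the heart of how configuration spreads between Sentinels, so it is worth sketching on its own. master_view and apply_hello are hypothetical names; the real code also emits events, rewrites the configuration file and resets instance state, all of which is omitted here.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, reduced view of what a Sentinel records for one master. */
typedef struct {
    unsigned long long current_epoch;  /* Sentinel-wide epoch */
    unsigned long long config_epoch;   /* epoch of the master's config */
    char master_ip[64];
    int master_port;
} master_view;

/* Applies the epochs and master address carried by a HELLO message.
 * Returns true if the recorded master address was switched. */
bool apply_hello(master_view *v, unsigned long long hello_current_epoch,
                 unsigned long long hello_config_epoch,
                 const char *hello_master_ip, int hello_master_port) {
    bool switched = false;

    if (hello_current_epoch > v->current_epoch)
        v->current_epoch = hello_current_epoch;

    /* Only a strictly newer configuration may change the address. */
    if (hello_config_epoch > v->config_epoch) {
        v->config_epoch = hello_config_epoch;
        if (hello_master_port != v->master_port ||
            strcmp(v->master_ip, hello_master_ip) != 0) {
            strncpy(v->master_ip, hello_master_ip, sizeof(v->master_ip) - 1);
            v->master_ip[sizeof(v->master_ip) - 1] = '\0';
            v->master_port = hello_master_port;
            switched = true;
        }
    }
    return switched;
}
```

A message that advertises a different master address but carries the same or an older config epoch is ignored, which keeps a stale Sentinel from undoing a completed failover.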

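3: The "INFO" command

As described at the start of this section, "INFO" is mainly used by the Sentinel to fetch the current state of master and slave instances: whether the instance is currently a master or a slave; whether the IP address and port it reports match what this Sentinel has recorded; if it is a master, which slaves it has; if it is a slave, whether its link to the master is up, what its priority is, what its replication offset is, and so on. This information is essential for judging instance state during failover.

In sentinelSendPeriodicCommands, the callback set for "INFO" is sentinelInfoReplyCallback. That function is simple: it mainly calls sentinelRefreshInstanceInfo to process the reply. So the code to look at is sentinelRefreshInstanceInfo: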

void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    sds *lines;
    int numlines, j;
    int role = 0;

    /* The following fields must be reset to a given value in the case they
     * are not found at all in the INFO output. */
    ri->master_link_down_time = 0;

    /* Process line by line. */
    lines = sdssplitlen(info,strlen(info),"\r\n",2,&numlines);
    for (j = 0; j < numlines; j++) {
        sentinelRedisInstance *slave;
        sds l = lines[j];

        /* run_id:<40 hex chars>*/
        if (sdslen(l) >= 47 && !memcmp(l,"run_id:",7)) {
            if (ri->runid == NULL) {
                ri->runid = sdsnewlen(l+7,40);
            } else {
                if (strncmp(ri->runid,l+7,40) != 0) {
                    sentinelEvent(REDIS_NOTICE,"+reboot",ri,"%@");
                    sdsfree(ri->runid);
                    ri->runid = sdsnewlen(l+7,40);
                }
            }
        }

        /* old versions: slave0:<ip>,<port>,<state>
         * new versions: slave0:ip=127.0.0.1,port=9999,... */
        if ((ri->flags & SRI_MASTER) &&
            sdslen(l) >= 7 &&
            !memcmp(l,"slave",5) && isdigit(l[5]))
        {
            char *ip, *port, *end;

            if (strstr(l,"ip=") == NULL) {
                /* Old format. */
                ip = strchr(l,':'); if (!ip) continue;
                ip++; /* Now ip points to start of ip address. */
                port = strchr(ip,','); if (!port) continue;
                *port = '\0'; /* nul term for easy access. */
                port++; /* Now port points to start of port number. */
                end = strchr(port,','); if (!end) continue;
                *end = '\0'; /* nul term for easy access. */
            } else {
                /* New format. */
                ip = strstr(l,"ip="); if (!ip) continue;
                ip += 3; /* Now ip points to start of ip address. */
                port = strstr(l,"port="); if (!port) continue;
                port += 5; /* Now port points to start of port number. */
                /* Nul term both fields for easy access. */
                end = strchr(ip,','); if (end) *end = '\0';
                end = strchr(port,','); if (end) *end = '\0';
            }

            /* Check if we already have this slave into our table,
             * otherwise add it. */
            if (sentinelRedisInstanceLookupSlave(ri,ip,atoi(port)) == NULL) {
                if ((slave = createSentinelRedisInstance(NULL,SRI_SLAVE,ip,
                            atoi(port), ri->quorum, ri)) != NULL)
                {
                    sentinelEvent(REDIS_NOTICE,"+slave",slave,"%@");
                    sentinelFlushConfig();
                }
            }
        }

        /* master_link_down_since_seconds:<seconds> */
        if (sdslen(l) >= 32 &&
            !memcmp(l,"master_link_down_since_seconds",30))
        {
            ri->master_link_down_time = strtoll(l+31,NULL,10)*1000;
        }

        /* role:<role> */
        if (!memcmp(l,"role:master",11)) role = SRI_MASTER;
        else if (!memcmp(l,"role:slave",10)) role = SRI_SLAVE;

        if (role == SRI_SLAVE) {
            /* master_host:<host> */
            if (sdslen(l) >= 12 && !memcmp(l,"master_host:",12)) {
                if (ri->slave_master_host == NULL ||
                    strcasecmp(l+12,ri->slave_master_host))
                {
                    sdsfree(ri->slave_master_host);
                    ri->slave_master_host = sdsnew(l+12);
                    ri->slave_conf_change_time = mstime();
                }
            }

            /* master_port:<port> */
            if (sdslen(l) >= 12 && !memcmp(l,"master_port:",12)) {
                int slave_master_port = atoi(l+12);

                if (ri->slave_master_port != slave_master_port) {
                    ri->slave_master_port = slave_master_port;
                    ri->slave_conf_change_time = mstime();
                }
            }

            /* master_link_status:<status> */
            if (sdslen(l) >= 19 && !memcmp(l,"master_link_status:",19)) {
                ri->slave_master_link_status =
                    (strcasecmp(l+19,"up") == 0) ?
                    SENTINEL_MASTER_LINK_STATUS_UP :
                    SENTINEL_MASTER_LINK_STATUS_DOWN;
            }

            /* slave_priority:<priority> */
            if (sdslen(l) >= 15 && !memcmp(l,"slave_priority:",15))
                ri->slave_priority = atoi(l+15);

            /* slave_repl_offset:<offset> */
            if (sdslen(l) >= 18 && !memcmp(l,"slave_repl_offset:",18))
                ri->slave_repl_offset = strtoull(l+18,NULL,10);
        }
    }
    ri->info_refresh = mstime();
    sdsfreesplitres(lines,numlines);

    /* ---------------------------- Acting half -----------------------------
     * Some things will not happen if sentinel.tilt is true, but some will
     * still be processed. */

    /* Remember when the role changed. */
    if (role != ri->role_reported) {
        ri->role_reported_time = mstime();
        ri->role_reported = role;
        if (role == SRI_SLAVE) ri->slave_conf_change_time = mstime();
        /* Log the event with +role-change if the new role is coherent or
         * with -role-change if there is a mismatch with the current config. */
        sentinelEvent(REDIS_VERBOSE,
            ((ri->flags & (SRI_MASTER|SRI_SLAVE)) == role) ?
            "+role-change" : "-role-change",
            ri, "%@ new reported role is %s",
            role == SRI_MASTER ? "master" : "slave",
            ri->flags & SRI_MASTER ? "master" : "slave");
    }

    /* None of the following conditions are processed when in tilt mode, so
     * return asap. */
    if (sentinel.tilt) return;

    /* Handle master -> slave role switch. */
    if ((ri->flags & SRI_MASTER) && role == SRI_SLAVE) {
        /* Nothing to do, but masters claiming to be slaves are
         * considered to be unreachable by Sentinel, so eventually
         * a failover will be triggered. */
    }
    ...
}

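The function first parses the "INFO" reply in a for loop:

It first extracts the value after "run_id:" and stores it in ri->runid. If the instance's runid has changed, it also logs the event and publishes a message on the "+reboot" channel.

If the instance is a master, the "slaveN:" lines are parsed to extract each slave's ip and port. sentinelRedisInstanceLookupSlave is then called with that ip and port to check whether the slave is already stored in the ri->slaves dictionary. If not, createSentinelRedisInstance is called to create a slave instance and insert it into ri->slaves. This is how the slaves of a master are discovered; the next call to sentinelReconnectInstance will establish a connection to the new slave.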
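
Here is a minimal sketch of extracting ip and port from a new-format slave line. parse_slave_line is a hypothetical helper. Unlike the original, which writes nul terminators into the line itself, this version leaves the input untouched and copies the ip into a caller-provided buffer.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Extracts ip and port from a "slaveN:ip=...,port=...,..." line.
 * Returns 0 on success, -1 if the line does not have that shape. */
int parse_slave_line(const char *line, char *ip, size_t iplen, int *port) {
    if (strlen(line) < 7 || memcmp(line, "slave", 5) != 0 ||
        !isdigit((unsigned char)line[5]))
        return -1;

    const char *ipstart = strstr(line, "ip=");
    const char *portstart = strstr(line, "port=");
    if (ipstart == NULL || portstart == NULL) return -1;
    ipstart += 3;    /* skip "ip=" */
    portstart += 5;  /* skip "port=" */

    size_t n = strcspn(ipstart, ",");  /* length up to ',' or end of line */
    if (n >= iplen) return -1;
    memcpy(ip, ipstart, n);
    ip[n] = '\0';

    *port = atoi(portstart);           /* atoi stops at the ',' */
    return 0;
}
```

The isdigit check on the sixth character is what keeps lines such as "slave_priority:100" from being mistaken for a slave entry.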

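The "master_link_down_since_seconds" field is parsed next. It is the number of seconds the slave's link to its master has been down. It is converted to an integer, multiplied by 1000 to get milliseconds, and stored in ri->master_link_down_time.

The "role" field is parsed: if the line is "role:master", role is set to SRI_MASTER, meaning the instance reports itself as a master; if it is "role:slave", role is set to SRI_SLAVE, meaning the instance reports itself as a slave.

If role is SRI_SLAVE, the following fields are read from the reply: "master_host:" is stored in ri->slave_master_host; "master_port:" in ri->slave_master_port; "master_link_status:" in ri->slave_master_link_status, depending on whether its value is "up"; "slave_priority:" in ri->slave_priority; and "slave_repl_offset:" in ri->slave_repl_offset.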
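
The line-by-line field matching can be illustrated with a reduced parser. slave_info, value_of and parse_info are hypothetical names, the 1024-byte working buffer is an arbitrary limit for the example, and only a few of the fields are handled.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of the slave-related fields. */
typedef struct {
    bool is_slave;
    char master_host[64];
    int master_port;
    bool master_link_up;
    int priority;
} slave_info;

/* If line starts with key, returns a pointer to the value, else NULL. */
static const char *value_of(const char *line, const char *key) {
    size_t klen = strlen(key);
    return strncmp(line, key, klen) == 0 ? line + klen : NULL;
}

/* Parses an INFO reply whose lines are separated by "\r\n". */
void parse_info(const char *info, slave_info *out) {
    char copy[1024];
    memset(out, 0, sizeof(*out));
    if (strlen(info) >= sizeof(copy)) return;
    strcpy(copy, info);

    char *line = copy;
    while (line != NULL) {
        char *next = strstr(line, "\r\n");
        if (next != NULL) { *next = '\0'; next += 2; }

        const char *v;
        if ((v = value_of(line, "role:")) != NULL) {
            out->is_slave = (strcmp(v, "slave") == 0);
        } else if (out->is_slave) {
            /* Slave-only fields count only after "role:slave" was seen. */
            if ((v = value_of(line, "master_host:")) != NULL)
                strncpy(out->master_host, v, sizeof(out->master_host) - 1);
            else if ((v = value_of(line, "master_port:")) != NULL)
                out->master_port = atoi(v);
            else if ((v = value_of(line, "master_link_status:")) != NULL)
                out->master_link_up = (strcmp(v, "up") == 0);
            else if ((v = value_of(line, "slave_priority:")) != NULL)
                out->priority = atoi(v);
        }
        line = next;
    }
}
```

As in the original, the slave-specific fields are only honored once the role line has been seen, which works because "role" appears before them in the reply.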

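After all lines of the "INFO" reply have been parsed, ri->info_refresh is updated to the current time.

Next, some actions are taken based on the instance's role:

ri->role_reported is initialized from ri->flags. If the role parsed from the "INFO" reply differs from ri->role_reported, the instance's role has changed, for example from master to slave or the reverse. Whenever they differ, ri->role_reported_time is set to the current time and ri->role_reported is set to role. If role is SRI_SLAVE, ri->slave_conf_change_time is also set to the current time. Finally, an event is logged and published: "+role-change" if the new role agrees with the role in ri->flags, "-role-change" if it does not.

If this Sentinel is in TILT mode, the function returns at this point.

If ri->flags says master but role is slave, no action is needed here: a master that claims to be a slave is treated as unreachable, which eventually triggers a failover.

The remaining actions of this function are related to failover and are covered later.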

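7: Detecting whether an instance is subjectively down

First, the difference between subjectively down and objectively down.

Subjectively down means that, from "my" point of view (the current Sentinel), an instance is down. But a single Sentinel's view may be wrong, and declaring an instance down on that basis alone would be rash. So "I" also ask the other Sentinels, by command, whether they consider the instance down as well. If at least quorum Sentinels (including "me") report that the instance is down, "I" conclude that it really is down, which is what objectively down means.

Whether an instance is subjectively down is decided mainly by whether it replies to "PING" in time. So first, here is sentinelSendPing, the function that sends "PING":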

int sentinelSendPing(sentinelRedisInstance *ri) {
    int retval = redisAsyncCommand(ri->cc,
        sentinelPingReplyCallback, NULL, "PING");
    if (retval == REDIS_OK) {
        ri->pending_commands++;
        /* We update the ping time only if we received the pong for
         * the previous ping, otherwise we are technically waiting
         * since the first ping that did not received a reply. */
        if (ri->last_ping_time == 0) ri->last_ping_time = mstime();
        return 1;
    } else {
        return 0;
    }
}

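This function sets sentinelPingReplyCallback as the callback for the "PING" reply.

Note that ri->last_ping_time is set to the current time only if it is currently 0, and it is reset to 0 only after an acceptable reply to "PING" has been received.

Here is the code of the callback sentinelPingReplyCallback: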

void sentinelPingReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = c->data;
    redisReply *r;
    REDIS_NOTUSED(privdata);

    if (ri) ri->pending_commands--;
    if (!reply || !ri) return;
    r = reply;

    if (r->type == REDIS_REPLY_STATUS ||
        r->type == REDIS_REPLY_ERROR) {
        /* Update the "instance available" field only if this is an
         * acceptable reply. */
        if (strncmp(r->str,"PONG",4) == 0 ||
            strncmp(r->str,"LOADING",7) == 0 ||
            strncmp(r->str,"MASTERDOWN",10) == 0)
        {
            ri->last_avail_time = mstime();
            ri->last_ping_time = 0; /* Flag the pong as received. */
        } else {
            /* Send a SCRIPT KILL command if the instance appears to be
             * down because of a busy script. */
            if (strncmp(r->str,"BUSY",4) == 0 &&
                (ri->flags & SRI_S_DOWN) &&
                !(ri->flags & SRI_SCRIPT_KILL_SENT))
            {
                if (redisAsyncCommand(ri->cc,
                        sentinelDiscardReplyCallback, NULL,
                        "SCRIPT KILL") == REDIS_OK)
                    ri->pending_commands++;
                ri->flags |= SRI_SCRIPT_KILL_SENT;
            }
        }
    }
    ri->last_pong_time = mstime();
}

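If the reply is "PONG", "LOADING" or "MASTERDOWN", it counts as an acceptable reply: ri->last_avail_time is set to the current time and ri->last_ping_time is set to 0, so that the next "PING" sent will update ri->last_ping_time.

If the reply starts with "BUSY", the instance is already flagged as subjectively down, and "SCRIPT KILL" has not yet been sent to it, then "SCRIPT KILL" is sent to the instance.

Finally, whatever the reply is, ri->last_pong_time is updated to the current time.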
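
The reply classification is a set of prefix comparisons and can be sketched as a standalone function. ping_reply_kind and classify_ping_reply are hypothetical names used only for this illustration.

```c
#include <string.h>

typedef enum { REPLY_ACCEPTABLE, REPLY_BUSY, REPLY_OTHER } ping_reply_kind;

/* Classifies the text of a status or error reply to PING by its prefix. */
ping_reply_kind classify_ping_reply(const char *str) {
    if (strncmp(str, "PONG", 4) == 0 ||
        strncmp(str, "LOADING", 7) == 0 ||
        strncmp(str, "MASTERDOWN", 10) == 0)
        return REPLY_ACCEPTABLE;
    if (strncmp(str, "BUSY", 4) == 0)
        return REPLY_BUSY;
    return REPLY_OTHER;
}
```

"LOADING" and "MASTERDOWN" are error replies, yet they still prove that the process is running and answering, which is why they count as acceptable.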

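The time attributes related to "PING" can therefore be summarized as follows:

ri->last_ping_time: the time of the earliest "PING" that is still waiting for an acceptable reply. Only after an acceptable reply to "PING" has been received will the next "PING" set this attribute to the then-current timestamp. If a "PING" gets no reply, or no acceptable reply, the next "PING" leaves the attribute unchanged. A value of 0 means the previous "PING" was answered acceptably and the next one has not been sent yet. Subjective-down detection is based mainly on this attribute.

ri->last_pong_time: updated to the current timestamp whenever any reply to "PING" is received, acceptable or not.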
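
The interplay of these attributes is easier to see as a tiny state machine. The sketch below uses hypothetical names (ping_state and its three functions) and takes the clock as a parameter so the behavior can be traced by hand.

```c
/* Hypothetical, reduced ping bookkeeping (all times in ms). */
typedef struct {
    long long last_ping_time;   /* 0 = no PING waiting for a good reply */
    long long last_pong_time;   /* last reply of any kind */
    long long last_avail_time;  /* last acceptable reply */
} ping_state;

/* A PING was queued: start the timer only if none is running. */
void on_ping_sent(ping_state *s, long long now) {
    if (s->last_ping_time == 0) s->last_ping_time = now;
}

/* A reply arrived: only an acceptable one stops the timer. */
void on_ping_reply(ping_state *s, long long now, int acceptable) {
    if (acceptable) {
        s->last_avail_time = now;
        s->last_ping_time = 0;
    }
    s->last_pong_time = now;
}

/* Time spent waiting for an acceptable reply, 0 if not waiting. */
long long ping_elapsed(const ping_state *s, long long now) {
    return s->last_ping_time ? now - s->last_ping_time : 0;
}
```

If "PING" is sent at 1000 and again at 2000 with no acceptable reply in between, the timer still starts at 1000, so the measured wait at 2500 is 1500 ms rather than 500 ms.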

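In the Sentinel "main function" sentinelHandleRedisInstance, sentinelCheckSubjectivelyDown is called to detect whether an instance is subjectively down. The same function also checks whether the TCP connections are healthy. Its code is as follows: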

void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    if (ri->last_ping_time)
        elapsed = mstime() - ri->last_ping_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    if (ri->cc &&
        (mstime() - ri->cc_conn_time) > SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->last_ping_time != 0 && /* Ther is a pending ping... */
        /* The pending ping is delayed, and we did not received
         * error replies as well. */
        (mstime() - ri->last_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->last_pong_time) > (ri->down_after_period/2))
    {
        sentinelKillLink(ri,ri->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    if (ri->pc &&
        (mstime() - ri->pc_conn_time) > SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        sentinelKillLink(ri,ri->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(REDIS_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(REDIS_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}

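ri->cc_conn_time is the time the command-type TCP connection to the instance was last initiated; ri->pc_conn_time is the time the subscription-type TCP connection to the instance was last initiated.

First, elapsed is computed as the difference between the current time and ri->last_ping_time (it stays 0 if ri->last_ping_time is 0).

Then the command connection is checked. It is considered unhealthy if all of the following hold: more than SENTINEL_MIN_LINK_RECONNECT_PERIOD has passed since it was established; a "PING" has been sent and has not yet received an acceptable reply; the time since ri->last_ping_time exceeds ri->down_after_period/2; and the time since the last "PING" reply of any kind also exceeds ri->down_after_period/2.

If the command connection is unhealthy, sentinelKillLink is called to close it and release the async context.

Then the subscription connection is checked. It is considered unhealthy if more than SENTINEL_MIN_LINK_RECONNECT_PERIOD has passed since it was established and more than SENTINEL_PUBLISH_PERIOD*3 has passed since any message was last received on the subscribed channel.

If the subscription connection is unhealthy, sentinelKillLink is likewise called to close it and release the async context.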
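
Both link checks are pure comparisons of timestamps, so they can be sketched as predicates. The function names are hypothetical, and the minimum reconnect period is passed in as a parameter (min_reconnect) because its value is not given in this article.

```c
#include <stdbool.h>

#define PUBLISH_PERIOD_MS 2000 /* mirrors SENTINEL_PUBLISH_PERIOD */

/* True if the command link should be killed and re-established. */
bool cmd_link_stale(long long now, long long cc_conn_time,
                    long long last_ping_time, long long last_pong_time,
                    long long down_after, long long min_reconnect) {
    return (now - cc_conn_time) > min_reconnect &&
           last_ping_time != 0 &&                    /* a PING is pending */
           (now - last_ping_time) > down_after / 2 &&
           (now - last_pong_time) > down_after / 2;  /* and no reply at all */
}

/* True if the pub/sub link should be killed and re-established. */
bool pubsub_link_stale(long long now, long long pc_conn_time,
                       long long pc_last_activity, long long min_reconnect) {
    return (now - pc_conn_time) > min_reconnect &&
           (now - pc_last_activity) > PUBLISH_PERIOD_MS * 3;
}
```

The pub/sub check works because a Sentinel also receives its own HELLO messages on the channel, so a healthy link is never silent for long.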

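An instance is considered subjectively down if either of the following holds: elapsed is greater than ri->down_after_period; or this Sentinel believes the instance is a master, its "INFO" reply reports it as a slave, and more than ri->down_after_period+SENTINEL_INFO_PERIOD*2 has passed since it started reporting that role.

If either condition holds and the instance is not yet flagged as subjectively down, the "+sdown" event is emitted, ri->s_down_since_time is set to the current time, and SRI_S_DOWN is added to the instance flags.

If neither condition holds but the instance was previously flagged as subjectively down, it is considered subjectively up again: the "-sdown" event is emitted and SRI_S_DOWN and SRI_SCRIPT_KILL_SENT are cleared from its flags.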
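
The decision and the flag transition can be sketched separately. The names looks_sdown and update_sdown_flag are hypothetical, and the FLAG_* bit values are illustrative only; they are not the real SRI_* values.

```c
#include <stdbool.h>

#define INFO_PERIOD_MS 10000            /* mirrors SENTINEL_INFO_PERIOD */
#define FLAG_S_DOWN           (1 << 0)  /* illustrative bit values, */
#define FLAG_SCRIPT_KILL_SENT (1 << 1)  /* not the real SRI_* values */

/* True if either subjective-down condition holds (times in ms). */
bool looks_sdown(long long elapsed, long long down_after,
                 bool believed_master, bool reports_slave,
                 long long since_role_report) {
    return elapsed > down_after ||
           (believed_master && reports_slave &&
            since_role_report > down_after + INFO_PERIOD_MS * 2);
}

/* Updates the flags. Returns +1 if the instance was just flagged down
 * ("+sdown"), -1 if it was just cleared ("-sdown"), 0 if nothing changed. */
int update_sdown_flag(int *flags, bool down) {
    if (down) {
        if (*flags & FLAG_S_DOWN) return 0;
        *flags |= FLAG_S_DOWN;
        return 1;
    }
    if (!(*flags & FLAG_S_DOWN)) return 0;
    *flags &= ~(FLAG_S_DOWN | FLAG_SCRIPT_KILL_SENT);
    return -1;
}
```

Returning the transition instead of a plain boolean mirrors the original's behavior of emitting an event only when the state actually changes, not on every check.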

Original article: https://www.cnblogs.com/gqtcgq/p/7247048.html