dledger 有个 preferredLeader 的设置,它的作用是,优先选择某个节点作为 leader,具体怎么实现的?
首先某个节点配置了 --preferred-leader-id 参数(可以在节点启动后通过命令行设置),并且当它取得 leader 地位后,它会把 leader 地位移交给优选的节点
DLedgerServer 的定时任务
// io.openmessaging.storage.dledger.DLedgerServer#startup public void startup() { // 存储 this.dLedgerStore.startup(); // 网络 this.dLedgerRpcService.startup(); // 日志追加 push 给 follower this.dLedgerEntryPusher.startup(); // 选举 this.dLedgerLeaderElector.startup(); // 定期检查 preferredLeader,只有在自己是 leader 时,可以把 leader 转移给指定的节点 executorService.scheduleAtFixedRate(this::checkPreferredLeader, 1000, 1000, TimeUnit.MILLISECONDS); }
// io.openmessaging.storage.dledger.DLedgerServer#checkPreferredLeader private void checkPreferredLeader() { // 当前节点不是 leader,返回 if (!memberState.isLeader()) { return; } String preferredLeaderId = dLedgerConfig.getPreferredLeaderId(); // 没有配置优选 leader,或者当前 leader 已经是优选 leader if (preferredLeaderId == null || preferredLeaderId.equals(memberState.getLeaderId())) { return; } // 配置的优选 leader 不在 peers 中 if (!memberState.isPeerMember(preferredLeaderId)) { logger.warn("preferredLeaderId = {} is not a peer member", preferredLeaderId); return; } // 已经设置了接收 leader 地位的节点,返回 if (memberState.getTransferee() != null) { return; } // 优选 leader 不活跃 if (!memberState.getPeersLiveTable().containsKey(preferredLeaderId) || memberState.getPeersLiveTable().get(preferredLeaderId) == Boolean.FALSE) { logger.warn("preferredLeaderId = {} is not online", preferredLeaderId); return; } // 如果优选 leader 的日志落后太多,返回 long fallBehind = dLedgerStore.getLedgerEndIndex() - dLedgerEntryPusher.getPeerWaterMark(memberState.currTerm(), preferredLeaderId); logger.info("transferee fall behind index : {}", fallBehind); if (fallBehind < dLedgerConfig.getMaxLeadershipTransferWaitIndex()) { LeadershipTransferRequest request = new LeadershipTransferRequest(); request.setTerm(memberState.currTerm()); request.setTransfereeId(dLedgerConfig.getPreferredLeaderId()); try { long startTransferTime = System.currentTimeMillis(); // leader 开始构造请求 LeadershipTransferResponse response = dLedgerLeaderElector.handleLeadershipTransfer(request).get(); logger.info("transfer finished. request={},response={},cost={}ms", request, response, DLedgerUtils.elapsed(startTransferTime)); } catch (Throwable t) { logger.error("[checkPreferredLeader] error, request={}", request, t); } } } // io.openmessaging.storage.dledger.DLedgerLeaderElector#handleLeadershipTransfer public CompletableFuture<LeadershipTransferResponse> handleLeadershipTransfer( LeadershipTransferRequest request) throws Exception { logger.info("handleLeadershipTransfer: {}", request); synchronized (memberState) { if (memberState.currTerm() != request.getTerm()) { logger.warn("[BUG] [HandleLeaderTransfer] currTerm={} != request.term={}", memberState.currTerm(), request.getTerm()); return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.INCONSISTENT_TERM.getCode())); } if (!memberState.isLeader()) { logger.warn("[BUG] [HandleLeaderTransfer] selfId={} is not leader", request.getLeaderId()); return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.NOT_LEADER.getCode())); } if (memberState.getTransferee() != null) { logger.warn("[BUG] [HandleLeaderTransfer] transferee={} is already set", memberState.getTransferee()); return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.LEADER_TRANSFERRING.getCode())); } memberState.setTransferee(request.getTransfereeId()); } LeadershipTransferRequest takeLeadershipRequest = new LeadershipTransferRequest(); takeLeadershipRequest.setGroup(memberState.getGroup()); takeLeadershipRequest.setLeaderId(memberState.getLeaderId()); takeLeadershipRequest.setLocalId(memberState.getSelfId()); takeLeadershipRequest.setRemoteId(request.getTransfereeId()); takeLeadershipRequest.setTerm(request.getTerm()); takeLeadershipRequest.setTakeLeadershipLedgerIndex(memberState.getLedgerEndIndex()); takeLeadershipRequest.setTransferId(memberState.getSelfId()); takeLeadershipRequest.setTransfereeId(request.getTransfereeId()); if (memberState.currTerm() != request.getTerm()) { logger.warn("[HandleLeaderTransfer] term changed, cur={} , request={}", memberState.currTerm(), request.getTerm()); return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.EXPIRED_TERM.getCode())); } // 当 CompletableFutrue 完成后,执行 thenApply 中的函数 // function 有返回值 // consumer 只消费,即没有返回值 return dLedgerRpcService.leadershipTransfer(takeLeadershipRequest).thenApply(response -> { synchronized (memberState) { if (memberState.currTerm() == request.getTerm() && memberState.getTransferee() != null) { logger.warn("leadershipTransfer failed, set transferee to null"); memberState.setTransferee(null); } } return response; }); }
优选节点准备接管 leader
// io.openmessaging.storage.dledger.DLedgerServer#handleLeadershipTransfer public CompletableFuture<LeadershipTransferResponse> handleLeadershipTransfer(LeadershipTransferRequest request) throws Exception { try { PreConditions.check(memberState.getSelfId().equals(request.getRemoteId()), DLedgerResponseCode.UNKNOWN_MEMBER, "%s != %s", request.getRemoteId(), memberState.getSelfId()); PreConditions.check(memberState.getGroup().equals(request.getGroup()), DLedgerResponseCode.UNKNOWN_GROUP, "%s != %s", request.getGroup(), memberState.getGroup()); if (memberState.getSelfId().equals(request.getTransferId())) { //It's the leader received the transfer command. PreConditions.check(memberState.isPeerMember(request.getTransfereeId()), DLedgerResponseCode.UNKNOWN_MEMBER, "transferee=%s is not a peer member", request.getTransfereeId()); PreConditions.check(memberState.currTerm() == request.getTerm(), DLedgerResponseCode.INCONSISTENT_TERM, "currTerm(%s) != request.term(%s)", memberState.currTerm(), request.getTerm()); PreConditions.check(memberState.isLeader(), DLedgerResponseCode.NOT_LEADER, "selfId=%s is not leader=%s", memberState.getSelfId(), memberState.getLeaderId()); // check fall transferee not fall behind much. long transfereeFallBehind = dLedgerStore.getLedgerEndIndex() - dLedgerEntryPusher.getPeerWaterMark(request.getTerm(), request.getTransfereeId()); PreConditions.check(transfereeFallBehind < dLedgerConfig.getMaxLeadershipTransferWaitIndex(), DLedgerResponseCode.FALL_BEHIND_TOO_MUCH, "transferee fall behind too much, diff=%s", transfereeFallBehind); return dLedgerLeaderElector.handleLeadershipTransfer(request); } else if (memberState.getSelfId().equals(request.getTransfereeId())) { // 优选节点收到接管 leader 的通知 // It's the transferee received the take leadership command. PreConditions.check(request.getTransferId().equals(memberState.getLeaderId()), DLedgerResponseCode.INCONSISTENT_LEADER, "transfer=%s is not leader", request.getTransferId()); return dLedgerLeaderElector.handleTakeLeadership(request); } else { return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.UNEXPECTED_ARGUMENT.getCode())); } } catch (DLedgerException e) { logger.error("[{}][handleLeadershipTransfer] failed", memberState.getSelfId(), e); LeadershipTransferResponse response = new LeadershipTransferResponse(); response.copyBaseInfo(request); response.setCode(e.getCode().getCode()); response.setLeaderId(memberState.getLeaderId()); return CompletableFuture.completedFuture(response); } } follower 切换为 candidate 开始拉票 // io.openmessaging.storage.dledger.DLedgerLeaderElector#handleTakeLeadership public CompletableFuture<LeadershipTransferResponse> handleTakeLeadership( LeadershipTransferRequest request) throws Exception { logger.debug("handleTakeLeadership.request={}", request); synchronized (memberState) { if (memberState.currTerm() != request.getTerm()) { logger.warn("[BUG] [handleTakeLeadership] currTerm={} != request.term={}", memberState.currTerm(), request.getTerm()); return CompletableFuture.completedFuture(new LeadershipTransferResponse().term(memberState.currTerm()).code(DLedgerResponseCode.INCONSISTENT_TERM.getCode())); } // 任期加一 long targetTerm = request.getTerm() + 1; memberState.setTermToTakeLeadership(targetTerm); CompletableFuture<LeadershipTransferResponse> responseFuture = new CompletableFuture<>(); // 更新请求响应对 takeLeadershipTask.update(request, responseFuture); // 通常是 follower 转换为 candidate,开始拉票 changeRoleToCandidate(targetTerm); needIncreaseTermImmediately = true; return responseFuture; } }
命令行:
leadershipTransfer --group zhang --peers n0-localhost:10911;n1-localhost:10912 --leader n1 --term 22 --transfereeId n0
leader 日志:
INFO DLedgerLeaderElector:593 - handleLeadershipTransfer: LeadershipTransferRequest{transferId='n0', transfereeId='n1', takeLeadershipLedgerIndex=0, group='zhang', remoteId='n0', localId='null', code=200, leaderId='null', term=22} INFO DLedgerLeaderElector:169 - [n0] [ChangeRoleToCandidate] from term: 23 and currTerm: 22 INFO DLedgerLeaderElector:426 - n0_[INCREASE_TERM] from 22 to 23 INFO DLedgerRpcNettyService:360 - LEADERSHIP_TRANSFER FINISHED. Request=RemotingCommand [code=51005, language=JAVA, version=0, opaque=1, flag(B)=0, remark=null, extFields=null, serializeTypeCurrentRPC=JSON], response=LeadershipTransferResponse{group='null', remoteId='null', localId='null', code=200, leaderId='null', term=23}, cost=119ms INFO DLedgerLeaderElector:184 - [n0][ChangeRoleToFollower] from term: 23 leaderId: n1 and currTerm: 23
优选节点日志:
INFO DLedgerLeaderElector:169 - [n1] [ChangeRoleToCandidate] from term: 23 and currTerm: 22 INFO DLedgerLeaderElector:403 - n1_[INCREASE_TERM] from 22 to 23 INFO DLedgerLeaderElector:434 - [n1][GetVoteResponse] {"code":200,"group":"zhang","leaderId":"n1","remoteId":"n1","term":23,"voteResult":"ACCEPT"} INFO DLedgerLeaderElector:434 - [n1][GetVoteResponse] {"code":200,"group":"zhang","leaderId":"n1","remoteId":"n0","term":22,"voteResult":"REJECT_TERM_NOT_READY"} INFO DLedgerLeaderElector:512 - [n1] [PARSE_VOTE_RESULT] cost=15 term=23 memberNum=2 allNum=2 acceptedNum=1 notReadyTermNum=1 biggerLedgerNum=0 alreadyHasLeader=false maxTerm=-1 result=REVOTE_IMMEDIATELY INFO DLedgerLeaderElector:434 - [n1][GetVoteResponse] {"code":200,"group":"zhang","leaderId":"n1","remoteId":"n1","term":23,"voteResult":"ACCEPT"} INFO DLedgerLeaderElector:434 - [n1][GetVoteResponse] {"code":200,"group":"zhang","leaderId":"n1","remoteId":"n0","term":23,"voteResult":"ACCEPT"} INFO DLedgerLeaderElector:512 - [n1] [PARSE_VOTE_RESULT] cost=5 term=23 memberNum=2 allNum=2 acceptedNum=2 notReadyTermNum=0 biggerLedgerNum=0 alreadyHasLeader=false maxTerm=-1 result=PASSED INFO DLedgerLeaderElector:516 - [n1] [VOTE_RESULT] has been elected to be the leader in term 23 INFO DLedgerLeaderElector:670 - TakeLeadershipTask finished. request=LeadershipTransferRequest{transferId='n0', transfereeId='n1', takeLeadershipLedgerIndex=8, group='zhang', remoteId='n1', localId='n0', code=200, leaderId='n0', term=22}, response=LeadershipTransferResponse{group='null', remoteId='null', localId='null', code=200, leaderId='null', term=23}, term=23, role=LEADER INFO DLedgerLeaderElector:157 - [n1] [ChangeRoleToLeader] from term: 23 and currTerm: 23 INFO DLedgerEntryPusher:115 - Initialize the pending append map in QuorumAckChecker for term=23 INFO DLedgerRpcNettyService:351 - LEADERSHIP_TRANSFER FINISHED. Request=RemotingCommand [code=51005, language=JAVA, version=0, opaque=687, flag(B)=0, remark=null, extFields=null, serializeTypeCurrentRPC=JSON], response=LeadershipTransferResponse{group='null', remoteId='null', localId='null', code=200, leaderId='null', term=23}, cost=107ms INFO DLedgerEntryPusher:104 - Initialize the watermark in QuorumAckChecker for term=23
如上是最简单的一种情况,总结一下步骤:
1. leader 通知优选节点开始接管
2. 优选节点增加任期,切换为 candidate,开始拉票
2.1 首先优选节点自己给自己投票
2.2 优选节点向 leader 拉票,leader 发现它的任期比自己大,首先增加自己的任期,返回一个响应,拒绝这次拉票请求
3. 优选节点,收到 2 个拉票响应,一个是自己,一个 REJECT_TERM_NOT_READY,于是马上再次拉票,抢在原来的 leader 前面
如果优选节点日志落后于 leader,它是不一定能成为 leader 的,需要多轮选举才能决出 leader,但胜出的不一定是优选 leader
根据代码来看,建议在 dledger 写入压力小的时候,切换 leader。