• Apache Kafka Source Code Analysis


    startup

    startup is called from onControllerFailover.

    initializePartitionState

    private def initializePartitionState() {
        for((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) { // iterate over all partitions
          // check if leader and isr path exists for partition. If not, then it is in NEW state
          controllerContext.partitionLeadershipInfo.get(topicPartition) match {
            case Some(currentLeaderIsrAndEpoch) =>
              // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
              controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
                case true => // leader is alive
                  partitionState.put(topicPartition, OnlinePartition)
                case false =>
                  partitionState.put(topicPartition, OfflinePartition)
              }
            case None =>
              partitionState.put(topicPartition, NewPartition)
          }
        }
      }

    Note the difference between OfflinePartition and NewPartition here.
    If controllerContext.partitionLeadershipInfo holds no leader information for the partition, it is a NewPartition.
    If it has a leader but the leader's broker is not alive, it is an OfflinePartition.
    And if the leader's broker is alive, it is an OnlinePartition.
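
    The three-way rule above can be sketched as a small pure function. This is a minimal illustration, not Kafka's code: the object name, the classify function and the use of plain Int broker ids are assumptions made for the example.

```scala
object PartitionStateSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // leader: the leader broker id recorded for the partition, if any (hypothetical simplification)
  // liveBrokers: ids of the brokers that are currently alive
  def classify(leader: Option[Int], liveBrokers: Set[Int]): PartitionState =
    leader match {
      case None                                 => NewPartition     // no leader info recorded yet
      case Some(id) if liveBrokers.contains(id) => OnlinePartition  // leader exists and is alive
      case Some(_)                              => OfflinePartition // leader exists but is dead
    }
}
```

    For instance, classify(Some(3), Set(1, 2)) falls into the last case, because broker 3 is recorded as leader but is not in the live set.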

     

    triggerOnlinePartitionStateChange

    Tries to move every partition in the Offline or New state to Online.

    def triggerOnlinePartitionStateChange() {
        try {
          brokerRequestBatch.newBatch()
          // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
          // that belong to topics to be deleted
          for((topicAndPartition, partitionState) <- partitionState
              if(!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
            if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
              handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                                (new CallbackBuilder).build)
          }
          brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
        } catch {
          case e: Throwable => error("Error while moving some partitions to the online state", e)
          // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
        }
      }

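    brokerRequestBatch shows up here, as it does throughout the controller; it is a ControllerBrokerRequestBatch.
    This class wraps leaderAndIsrRequestMap, stopReplicaRequestMap and updateMetadataRequestMap,
    which record and cache the requests produced in handleStateChange.
    sendRequestsToBrokers then sends those requests out as one batch.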
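
    The accumulate-then-flush pattern described above can be sketched as follows. This is a simplified stand-in, not the real ControllerBrokerRequestBatch: the Request case class, the single pending map and the send callback are all hypothetical, and the real class keeps three separate maps.

```scala
import scala.collection.mutable

object RequestBatchSketch {
  // Hypothetical stand-in for a controller-to-broker request.
  final case class Request(kind: String, partition: String)

  // send is a hypothetical callback that delivers all requests for one broker at once.
  class RequestBatch(send: (Int, Seq[Request]) => Unit) {
    private val pending = mutable.Map.empty[Int, mutable.ArrayBuffer[Request]]

    // In this sketch, starting a batch while requests are still pending is treated as a bug.
    def newBatch(): Unit =
      if (pending.nonEmpty)
        throw new IllegalStateException("previous batch has not been sent yet")

    // Record a request for a broker; nothing goes on the wire yet.
    def add(brokerId: Int, request: Request): Unit =
      pending.getOrElseUpdate(brokerId, mutable.ArrayBuffer.empty[Request]) += request

    // Flush everything, one call per broker, then reset for the next batch.
    def sendRequestsToBrokers(): Unit = {
      pending.foreach { case (brokerId, requests) => send(brokerId, requests.toList) }
      pending.clear()
    }
  }
}
```

    Collecting requests per broker means a state change touching many partitions results in one send per broker rather than one per partition.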

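    The logic of handleStateChange is covered separately below. First look at controller.offlinePartitionSelector, the selector passed in above. It decides how to pick a leader for an OfflinePartition (for a NewPartition, handleStateChange takes the initializeLeaderAndIsrForPartition path shown later and does not consult the selector).
    The code is fairly long and its comment explains it clearly, so only the comment is reproduced here.
    First, if the ISR contains a live broker, there is nothing to decide: it becomes the new leader.
    If not, check whether unclean leader election is tolerated, which really means whether losing data is acceptable. If it is,
    look for a live broker in the AR and make it the leader; since it is not in the ISR, that replica is out of sync, so some data loss should be expected.
    If the AR has no live broker either, the election simply fails.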

    /**
     * Select the new leader, new isr and receiving replicas (for the LeaderAndIsrRequest):
     * 1. If at least one broker from the isr is alive, it picks a broker from the live isr as the new leader and the live
     *    isr as the new isr.
     * 2. Else, if unclean leader election for the topic is disabled, it throws a NoReplicaOnlineException.
     * 3. Else, it picks some alive broker from the assigned replica list as the new leader and the new isr.
     * 4. If no broker in the assigned replica list is alive, it throws a NoReplicaOnlineException
     * Replicas to receive LeaderAndIsr request = live assigned replicas
     * Once the leader is successfully registered in zookeeper, it updates the allLeaders cache
     */
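
    The four numbered rules in the comment can be sketched like this. It is an illustrative reduction under stated assumptions: the object, the Election case class and the Either return type are hypothetical (the comment says the real selector throws NoReplicaOnlineException instead), and replicas are plain Int broker ids.

```scala
object OfflineLeaderSelectionSketch {
  // Hypothetical result type: the chosen leader and the new ISR.
  final case class Election(leader: Int, isr: List[Int])

  def selectLeader(assignedReplicas: List[Int],
                   isr: List[Int],
                   liveBrokers: Set[Int],
                   uncleanElectionEnabled: Boolean): Either[String, Election] = {
    val liveIsr      = isr.filter(liveBrokers.contains)
    val liveAssigned = assignedReplicas.filter(liveBrokers.contains)

    if (liveIsr.nonEmpty)
      Right(Election(liveIsr.head, liveIsr))                       // rule 1: clean election from the live ISR
    else if (!uncleanElectionEnabled)
      Left("no live replica in ISR and unclean election disabled") // rule 2
    else if (liveAssigned.nonEmpty)
      Right(Election(liveAssigned.head, List(liveAssigned.head)))  // rule 3: unclean election, may lose data
    else
      Left("no assigned replica is alive")                         // rule 4
  }
}
```

    Which live replica gets picked within rule 1 or rule 3 is an assumption here (the head of the filtered list); the comment only says "a broker" and "some alive broker".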


    registerListeners

    Called from onControllerFailover.
    It registers the listeners with ZooKeeper; ignore deleteTopicListener for now.
    Start with TopicChangeListener: what happens when the set of topics changes?

     

    registerTopicChangeListener

    private def registerTopicChangeListener() = {
        zkClient.subscribeChildChanges(ZkUtils.BrokerTopicsPath, topicChangeListener) //"/brokers/topics"
      }

    This watches the /brokers/topics path; any change to its children fires topicChangeListener.

     

    TopicChangeListener

      /**
       * This is the zookeeper listener that triggers all the state transitions for a partition
       */
      class TopicChangeListener extends IZkChildListener with Logging {
        this.logIdent = "[TopicChangeListener on Controller " + controller.config.brokerId + "]: "
    
        @throws(classOf[Exception])
        def handleChildChange(parentPath : String, children : java.util.List[String]) {
          inLock(controllerContext.controllerLock) {
            if (hasStarted.get) {
              try {
                val currentChildren = {
                  import JavaConversions._
                  debug("Topic change listener fired for path %s with children %s".format(parentPath, children.mkString(",")))
                  (children: Buffer[String]).toSet
                }
                val newTopics = currentChildren -- controllerContext.allTopics // in zk but not yet in the context: new topics
                val deletedTopics = controllerContext.allTopics -- currentChildren // the reverse: deleted topics
                controllerContext.allTopics = currentChildren // update the context
    
                val addedPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, newTopics.toSeq) // read the new topics' assignment from zk
                controllerContext.partitionReplicaAssignment = controllerContext.partitionReplicaAssignment.filter(p => // drop the deleted topics' assignment from the context
                  !deletedTopics.contains(p._1.topic))
                controllerContext.partitionReplicaAssignment.++=(addedPartitionReplicaAssignment) // add the new topics' assignment to the context
                info("New topics: [%s], deleted topics: [%s], new partition replica assignment [%s]".format(newTopics,
                  deletedTopics, addedPartitionReplicaAssignment))
                if(newTopics.size > 0)
                  controller.onNewTopicCreation(newTopics, addedPartitionReplicaAssignment.keySet.toSet) // ends up in KafkaController.onNewTopicCreation
              } catch {
                case e: Throwable => error("Error while handling new topic", e )
              }
            }
          }
        }
      }
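
    The heart of the listener is a pair of set differences followed by a refresh of the cached assignment. A minimal sketch of just that step, assuming topics are plain strings and a partition is a (topic, partition id) tuple; the object and function names are hypothetical:

```scala
object TopicDiffSketch {
  final case class TopicChanges(added: Set[String], deleted: Set[String])

  // Partition key is (topic, partitionId); value is the assigned replica list.
  type Assignment = Map[(String, Int), Seq[Int]]

  // cached: topics the controller context knows about; inZk: children currently under the watched path.
  def diff(cached: Set[String], inZk: Set[String]): TopicChanges =
    TopicChanges(added = inZk -- cached, deleted = cached -- inZk)

  // Drop the assignments of deleted topics, then merge in those of the new topics.
  def refresh(cached: Assignment, deleted: Set[String], added: Assignment): Assignment =
    cached.filter { case ((topic, _), _) => !deleted.contains(topic) } ++ added
}
```

    With cached topics {a, b} and ZooKeeper children {b, c}, diff reports c as added and a as deleted.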

     

    onNewTopicCreation

    def onNewTopicCreation(topics: Set[String], newPartitions: Set[TopicAndPartition]) {
        info("New topic creation callback for %s".format(newPartitions.mkString(",")))
        // subscribe to partition changes
        topics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic)) // register a partition change listener
        onNewPartitionCreation(newPartitions) // drive the partition and replica state changes
      }

    def registerPartitionChangeListener(topic: String) = {
        addPartitionsListener.put(topic, new AddPartitionsListener(topic))
        zkClient.subscribeDataChanges(ZkUtils.getTopicPath(topic), addPartitionsListener(topic)) // /brokers/topics/topic-name
      }

     

    AddPartitionsListener

    Much like the topic listener: it reads the partition layout from ZooKeeper, compares it with what the context currently holds, finds the new partitions, and calls

    controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)

    So both TopicChangeListener and AddPartitionsListener end up calling onNewPartitionCreation; a topic, after all, is only a logical concept.

    onNewPartitionCreation

    def onNewPartitionCreation(newPartitions: Set[TopicAndPartition]) {
        info("New partition creation callback for %s".format(newPartitions.mkString(",")))
        partitionStateMachine.handleStateChanges(newPartitions, NewPartition)
        replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), NewReplica)
        partitionStateMachine.handleStateChanges(newPartitions, OnlinePartition, offlinePartitionSelector)
        replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), OnlineReplica)
      }

    Straightforward: first set every new partition and its replicas to the New state, then move them to Online.

     

    handleStateChange

    This is the core logic of the state machine.

    private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                                    leaderSelector: PartitionLeaderSelector,
                                    callbacks: Callbacks) {
        val topicAndPartition = TopicAndPartition(topic, partition)
        val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition) // fetch the current state
        try {
          targetState match {
            case NewPartition =>
              // pre: partition did not exist before this
              assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
              assignReplicasToPartitions(topic, partition) // read the AR from zk and update controllerContext.partitionReplicaAssignment
              partitionState.put(topicAndPartition, NewPartition)
              // post: partition has been assigned replicas
            case OnlinePartition =>
              assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
              partitionState(topicAndPartition) match {
                case NewPartition =>
                  // initialize leader and isr path for new partition
                  initializeLeaderAndIsrForPartition(topicAndPartition)
                case OfflinePartition =>
                  electLeaderForPartition(topic, partition, leaderSelector)
                case OnlinePartition => // invoked when the leader needs to be re-elected
                  electLeaderForPartition(topic, partition, leaderSelector)
                case _ => // should never come here since illegal previous states are checked above
              }
              partitionState.put(topicAndPartition, OnlinePartition)
              val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
              stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
                                        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
              // post: partition has a leader
            case OfflinePartition =>
              // pre: partition should be in New or Online state
              assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
              // should be called when the leader for a partition is no longer alive
              stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
                                        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
              partitionState.put(topicAndPartition, OfflinePartition)
              // post: partition has no alive leader
            case NonExistentPartition =>
              // pre: partition should be in Offline state
              assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
              stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
                                        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
              partitionState.put(topicAndPartition, NonExistentPartition)
              // post: partition state is deleted from all brokers and zookeeper
          }
        } catch {
          case t: Throwable =>
            stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
              .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
        }
      }

    As shown, moving to OfflinePartition or NonExistentPartition only sets the state,
    and moving to NewPartition adds just one step on top of that: initializing the AR.
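
    The assertValidPreviousStates calls in the listing amount to a table of allowed predecessors per target state. A sketch of that table, with shortened state names and a hypothetical canMove helper that are not part of Kafka:

```scala
object TransitionSketch {
  sealed trait State
  case object NonExistent extends State
  case object New extends State
  case object Online extends State
  case object Offline extends State

  // Target state -> states it may be entered from, copied from the assertValidPreviousStates arguments.
  val validPrevious: Map[State, Set[State]] = Map(
    New         -> Set[State](NonExistent),
    Online      -> Set[State](New, Online, Offline),
    Offline     -> Set[State](New, Online, Offline),
    NonExistent -> Set[State](Offline)
  )

  def canMove(from: State, to: State): Boolean = validPrevious(to).contains(from)
}
```

    The table makes two properties easy to see: New is only reachable from NonExistent, and a partition has to pass through Offline before it can become NonExistent.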

    Only the move to OnlinePartition is more involved.

    Going from NewPartition -> OnlinePartition needs some initialization work, so initializeLeaderAndIsrForPartition is called.

    initializeLeaderAndIsrForPartition

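    A NewPartition has no leaderAndISR path in ZooKeeper, so initialization has to create it. Once the path exists, the partition can never return to the New state; it can only go to Offline.

    Besides creating the zk path, the logic elects a leader. The election is hard-coded here: at initialization it always acts like a preferred-replica selector, that is, it picks the head of the live AR.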

    /**
       * Invoked on the NewPartition->OnlinePartition state change. When a partition is in the New state, it does not have
       * a leader and isr path in zookeeper. Once the partition moves to the OnlinePartition state, it's leader and isr
       * path gets initialized and it never goes back to the NewPartition state. From here, it can only go to the
       * OfflinePartition state.
       * @param topicAndPartition   The topic/partition whose leader and isr path is to be initialized
       */
      private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
        val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
        val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r)) // find the replicas in the AR that are alive
        liveAssignedReplicas.size match {
          case 0 => // none alive, so the partition cannot go online
            val failMsg = ("encountered error during state change of partition %s from New to Online, assigned replicas are [%s], " +
                           "live brokers are [%s]. No assigned replica is alive.")
                             .format(topicAndPartition, replicaAssignment.mkString(","), controllerContext.liveBrokerIds)
            stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
            throw new StateChangeFailedException(failMsg)
          case _ =>
            debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
            // make the first replica in the list of assigned replicas, the leader
            val leader = liveAssignedReplicas.head  // take the first live replica as the leader replica
            val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList), // build the LeaderIsrAndControllerEpoch
              controller.epoch)
            debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
            try {
              ZkUtils.createPersistentPath(controllerContext.zkClient,  // create the LeaderAndIsr path in zk, the key initialization step
                ZkUtils.getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
                ZkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
              // NOTE: the above write can fail only if the current controller lost its zk session and the new controller
              // took over and initialized this partition. This can happen if the current controller went into a long
              // GC pause
              controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch) // update partitionLeadershipInfo in the context
              brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic, // add a LeaderAndIsrRequest to the request batch
                topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
            } catch {
              case e: ZkNodeExistsException =>
                // read the controller epoch
                val leaderIsrAndEpoch = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkClient, topicAndPartition.topic,
                  topicAndPartition.partition).get
                val failMsg = ("encountered error while changing partition %s's state from New to Online since LeaderAndIsr path already " +
                               "exists with value %s and controller epoch %d")
                                 .format(topicAndPartition, leaderIsrAndEpoch.leaderAndIsr.toString(), leaderIsrAndEpoch.controllerEpoch)
                stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
                throw new StateChangeFailedException(failMsg)
            }
        }
      }
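
    Stripped of the ZooKeeper write and error handling, the initial election above reduces to a filter and a head. A minimal sketch, where the Initial case class and the Option return type are assumptions for the example (the real method throws StateChangeFailedException when no assigned replica is alive):

```scala
object InitialLeaderSketch {
  // Hypothetical stand-in for the leader/ISR pair written to ZooKeeper.
  final case class Initial(leader: Int, isr: List[Int])

  // The first live replica in assignment order becomes leader; all live replicas form the initial ISR.
  def initialLeaderAndIsr(assignedReplicas: Seq[Int], liveBrokers: Set[Int]): Option[Initial] = {
    val live = assignedReplicas.filter(liveBrokers.contains).toList
    live.headOption.map(leader => Initial(leader, live))
  }
}
```

    Note that the order of the assigned replica list matters: with AR [3, 1, 2] and broker 3 down, broker 1 becomes leader and the initial ISR is [1, 2].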

     

    OfflinePartition or OnlinePartition -> OnlinePartition

    This one is simpler: it only needs to re-elect the leader.

    electLeaderForPartition

    def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
        val topicAndPartition = TopicAndPartition(topic, partition)
        try {
          var zookeeperPathUpdateSucceeded: Boolean = false
          var newLeaderAndIsr: LeaderAndIsr = null
          var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
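          while(!zookeeperPathUpdateSucceeded) { // the loop exits only when the zk update succeeds or an exception is thrown; is this a bit risky?
            val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition) // read leaderAndIsr from zk; throw if it is missing, since an offline or online partition should already have data in zk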
            val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
            val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
            if (controllerEpoch > controller.epoch) { // if a controller with a newer epoch has already written leaderAndISR, the current controller is stale: throw
              val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
                             "already written by another controller. This probably means that the current controller %d went through " +
                             "a soft failure and another controller was elected with epoch %d.")
                               .format(topic, partition, controllerId, controllerEpoch)
              stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
              throw new StateChangeFailedException(failMsg)
            }
            // elect new leader or throw exception
            val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr) // ask the selector for a leader; each selector has its own election logic
            val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition, // update leaderAndISR in zk
              leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
            newLeaderAndIsr = leaderAndIsr
            newLeaderAndIsr.zkVersion = newVersion
            zookeeperPathUpdateSucceeded = updateSucceeded
            replicasForThisPartition = replicas
          }
          val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
          // update the leader cache
          controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
          stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
                                    .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
          val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
          // store new leader and isr info in cache
          brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
            newLeaderIsrAndControllerEpoch, replicas)
        } catch {
          case lenne: LeaderElectionNotNeededException => // swallow
          case nroe: NoReplicaOnlineException => throw nroe
          case sce: Throwable =>
            val failMsg = "encountered error while electing leader for partition %s due to: %s.".format(topicAndPartition, sce.getMessage)
            stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
            throw new StateChangeFailedException(failMsg, sce)
        }
        debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
      }
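
    The while loop in electLeaderForPartition is an optimistic read, compute, conditional-write cycle keyed on the znode version. The sketch below shows that pattern against an in-memory cell instead of ZooKeeper. Everything in it is hypothetical (VersionedCell, updateWithRetry), and the maxAttempts bound is an addition of this sketch that the Kafka code above does not have; it is one way to address the concern about the unbounded loop.

```scala
object VersionedUpdateSketch {
  // Hypothetical in-memory stand-in for a single znode with a version counter.
  final class VersionedCell[A](initial: A) {
    private var value: A = initial
    private var version: Int = 0

    def read(): (A, Int) = synchronized { (value, version) }

    // Write only if the caller saw the latest version; returns (succeeded, current version).
    def conditionalUpdate(newValue: A, expectedVersion: Int): (Boolean, Int) = synchronized {
      if (expectedVersion == version) {
        value = newValue
        version += 1
        (true, version)
      } else (false, version)
    }
  }

  // Re-read and retry until the conditional write succeeds, giving up after maxAttempts tries.
  def updateWithRetry[A](cell: VersionedCell[A], maxAttempts: Int)(f: A => A): Option[(A, Int)] = {
    var attempt = 0
    while (attempt < maxAttempts) {
      val (current, version) = cell.read()
      val next = f(current) // recompute from the freshly read value on every attempt
      val (ok, newVersion) = cell.conditionalUpdate(next, version)
      if (ok) return Some((next, newVersion))
      attempt += 1
    }
    None
  }
}
```

    As in the original loop, the new value is recomputed from a fresh read on each attempt, so a write that lost the race never overwrites someone else's update with stale data.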
  • Original article: https://www.cnblogs.com/fxjwind/p/4930882.html