Kafka Idempotence: A Question Big-Tech Interviews Always Ask


    01 Why Idempotence Is So Important

    As a distributed message queue, Kafka is heavily used in distributed systems such as message push systems and business platforms (settlement platforms, for example). Take settlement: upstream business systems push their data to the settlement platform, and if one piece of data gets computed and processed multiple times, the consequences are severe.

    02 What Factors Affect Idempotence

    When using Kafka we need to guarantee exactly-once semantics. In a distributed system, network partitions are unavoidable: if a network fault, or a full GC causing an ack timeout, prevents the broker's ack from reaching the producer, the producer will resend. How do we make sure producer retries cause neither duplicates nor reordering? Or suppose the producer dies: the new producer has none of the old producer's state, so how is idempotence guaranteed then? Even when sending is idempotent, once the consumer pulls messages and hands them to a pool of worker threads, whose handling of a message may involve asynchronous operations, the following cases appear:

    • Commit first, then run the business logic: the commit succeeds but processing fails. The message is lost.
    • Run the business logic first, then commit: the commit fails but processing succeeded. The message is processed twice.
    • Run the business logic first, then commit: the commit succeeds but the asynchronous part of the processing fails. The message is lost.

    This article discusses each of these problems.

    03 How Kafka Guarantees Send-Side Idempotence

    To address these problems, Kafka 0.11 added an idempotent producer and a transactional producer. The former solves single-session idempotence; the latter solves multi-session idempotence.

    Single-session idempotence

    To eliminate the reordering and duplication caused by producer retries, Kafka added a pid (producer id) and a seq (sequence number). On the producer, every RecordBatch carries a monotonically increasing seq; on the broker, every topic-partition maintains a pid-to-seq mapping and updates lastSeq on each commit. When a RecordBatch arrives, the broker validates it before saving the data: if the batch's baseSeq (the seq of its first message) is exactly one greater than the lastSeq the broker holds, the data is saved; otherwise it is rejected (the inSequence method).

    ProducerStateManager.scala

    private def maybeValidateAppend(producerEpoch: Short, firstSeq: Int, offset: Long): Unit = {
        validationType match {
          case ValidationType.None =>
    
          case ValidationType.EpochOnly =>
            checkProducerEpoch(producerEpoch, offset)
    
          case ValidationType.Full =>
            checkProducerEpoch(producerEpoch, offset)
            checkSequence(producerEpoch, firstSeq, offset)
        }
    }
    
    private def checkSequence(producerEpoch: Short, appendFirstSeq: Int, offset: Long): Unit = {
      if (producerEpoch != updatedEntry.producerEpoch) {
        if (appendFirstSeq != 0) {
          if (updatedEntry.producerEpoch != RecordBatch.NO_PRODUCER_EPOCH) {
            throw new OutOfOrderSequenceException(s"Invalid sequence number for new epoch at offset $offset in " +
              s"partition $topicPartition: $producerEpoch (request epoch), $appendFirstSeq (seq. number)")
          } else {
            throw new UnknownProducerIdException(s"Found no record of producerId=$producerId on the broker at offset $offset" +
              s"in partition $topicPartition. It is possible that the last message with the producerId=$producerId has " +
              "been removed due to hitting the retention limit.")
          }
        }
      } else {
        val currentLastSeq = if (!updatedEntry.isEmpty)
          updatedEntry.lastSeq
        else if (producerEpoch == currentEntry.producerEpoch)
          currentEntry.lastSeq
        else
          RecordBatch.NO_SEQUENCE
    
        if (currentLastSeq == RecordBatch.NO_SEQUENCE && appendFirstSeq != 0) {
          // We have a matching epoch, but we do not know the next sequence number. This case can happen if
          // only a transaction marker is left in the log for this producer. We treat this as an unknown
          // producer id error, so that the producer can check the log start offset for truncation and reset
          // the sequence number. Note that this check follows the fencing check, so the marker still fences
          // old producers even if it cannot determine our next expected sequence number.
          throw new UnknownProducerIdException(s"Local producer state matches expected epoch $producerEpoch " +
            s"for producerId=$producerId at offset $offset in partition $topicPartition, but the next expected " +
            "sequence number is not known.")
        } else if (!inSequence(currentLastSeq, appendFirstSeq)) {
          throw new OutOfOrderSequenceException(s"Out of order sequence number for producerId $producerId at " +
            s"offset $offset in partition $topicPartition: $appendFirstSeq (incoming seq. number), " +
            s"$currentLastSeq (current end sequence number)")
        }
      }
    }
    
      private def inSequence(lastSeq: Int, nextSeq: Int): Boolean = {
        nextSeq == lastSeq + 1L || (nextSeq == 0 && lastSeq == Int.MaxValue)
      }
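
    On the application side, single-session idempotence is just configuration. Below is a minimal sketch assuming the standard kafka-clients Java producer; the class name, broker address, and topic are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IdempotentProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Enable the idempotent producer: the client is assigned a pid and attaches a
            // monotonically increasing seq to each RecordBatch, which the broker checks
            // with inSequence to discard duplicates caused by retries.
            props.put("enable.idempotence", "true");
            props.put("acks", "all");  // idempotence requires acks=all

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("settlement-events", "order-1", "100.00"));
            }
        }
    }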
    

    Digression: what does the Kafka producer do about ordering?

    Suppose we have five in-flight requests: batch1, batch2, batch3, batch4, batch5. If only batch2's ack fails while 3, 4, and 5 are all persisted, then batch2 is resent with a later batch and lands behind them, producing duplicated and out-of-order data. We can set max.in.flight.requests.per.connection=1 (the number of unacknowledged requests the client may have outstanding on a single connection) to prevent the reordering, but that lowers throughput.

    In newer Kafka versions, setting enable.idempotence=true lets the client manage the in-flight limit dynamically; under normal conditions max.in.flight.requests.per.connection stays greater than 1. When a retried request comes back, the batch is re-inserted into the queue at the position dictated by its seq, and the producer temporarily behaves as if max.in.flight.requests.per.connection were 1: every batch ahead of it has a smaller sequence number, and it is sent only after all of those have been sent.

        private void insertInSequenceOrder(Deque<ProducerBatch> deque, ProducerBatch batch) {
            // When we are requeing and have enabled idempotence, the reenqueued batch must always have a sequence.
            if (batch.baseSequence() == RecordBatch.NO_SEQUENCE)
                throw new IllegalStateException("Trying to re-enqueue a batch which doesn't have a sequence even " +
                    "though idempotency is enabled.");
    
            if (transactionManager.nextBatchBySequence(batch.topicPartition) == null)
                throw new IllegalStateException("We are re-enqueueing a batch which is not tracked as part of the in flight " +
                    "requests. batch.topicPartition: " + batch.topicPartition + "; batch.baseSequence: " + batch.baseSequence());
    
            ProducerBatch firstBatchInQueue = deque.peekFirst();
            if (firstBatchInQueue != null && firstBatchInQueue.hasSequence() && firstBatchInQueue.baseSequence() < batch.baseSequence()) {
    
                List<ProducerBatch> orderedBatches = new ArrayList<>();
                while (deque.peekFirst() != null && deque.peekFirst().hasSequence() && deque.peekFirst().baseSequence() < batch.baseSequence())
                    orderedBatches.add(deque.pollFirst());
    
                log.debug("Reordered incoming batch with sequence {} for partition {}. It was placed in the queue at " +
                    "position {}", batch.baseSequence(), batch.topicPartition, orderedBatches.size());
                // Either we have reached a point where there are batches without a sequence (ie. never been drained
                // and are hence in order by default), or the batch at the front of the queue has a sequence greater
                // than the incoming batch. This is the right place to add the incoming batch.
                deque.addFirst(batch);
    
                // Now we have to re insert the previously queued batches in the right order.
                for (int i = orderedBatches.size() - 1; i >= 0; --i) {
                    deque.addFirst(orderedBatches.get(i));
                }
    
                // At this point, the incoming batch has been queued in the correct place according to its sequence.
            } else {
                deque.addFirst(batch);
            }
        }
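
    The tradeoff can be captured in two producer configurations. A sketch, assuming recent kafka-clients property names (OrderingConfigs is a hypothetical helper):

    import java.util.Properties;

    public class OrderingConfigs {
        // Strict ordering without idempotence: allow only one unacknowledged request per
        // connection. Retries can never leapfrog other batches, but throughput suffers.
        static Properties strictOrdering() {
            Properties props = new Properties();
            props.put("max.in.flight.requests.per.connection", "1");
            return props;
        }

        // Ordering with idempotence: retried batches are re-queued in seq order (see
        // insertInSequenceOrder above), so recent clients tolerate up to 5 in-flight
        // requests without losing ordering.
        static Properties idempotentOrdering() {
            Properties props = new Properties();
            props.put("enable.idempotence", "true");
            props.put("max.in.flight.requests.per.connection", "5");
            return props;
        }
    }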
    

    Multi-session idempotence

    As described under single-session idempotence, Kafka implements it by introducing the pid and seq. But precisely because the pid is involved, when the application restarts the new producer has none of the old producer's state, and the same data may be saved again.

    Kafka transactions achieve multi-session idempotence through a fencing (isolation) mechanism.

    Kafka transactions introduce a transactionId and an Epoch. Once transactional.id is set, a transactionId maps to exactly one pid, and the server records the latest Epoch. When a new producer initializes, it sends an InitPIDRequest to the TransactionCoordinator; the coordinator already holds the metadata for that transactionId, so it returns the previously assigned PID with the Epoch incremented by 1. When the old producer later recovers and issues requests, it is treated as an invalid producer and an exception is thrown. Without transactions, the TransactionCoordinator hands the new producer a brand-new pid, so there is no fencing and multi-session idempotence cannot be achieved.

    private def maybeValidateAppend(producerEpoch: Short, firstSeq: Int, offset: Long): Unit = {
        validationType match {
          case ValidationType.None =>
    
          case ValidationType.EpochOnly =>
            checkProducerEpoch(producerEpoch, offset)
    
      case ValidationType.Full => // with transactions enabled, this branch is taken
            checkProducerEpoch(producerEpoch, offset)
            checkSequence(producerEpoch, firstSeq, offset)
        }
    }
    
    private def checkProducerEpoch(producerEpoch: Short, offset: Long): Unit = {
        if (producerEpoch < updatedEntry.producerEpoch) {
          throw new ProducerFencedException(s"Producer's epoch at offset $offset is no longer valid in " +
            s"partition $topicPartition: $producerEpoch (request epoch), ${updatedEntry.producerEpoch} (current epoch)")
        }
      }
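
    From the application's point of view, the fencing comes for free once the transactional API is used. A minimal sketch assuming the kafka-clients Java producer; the class name, transactional.id value, and topic are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.ProducerFencedException;

    public class TransactionalProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // A fixed transactional.id ties every session of this application to one pid;
            // on restart the TransactionCoordinator bumps the Epoch, so requests from the
            // old producer fail with ProducerFencedException.
            props.put("transactional.id", "settlement-producer-1");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();  // InitPIDRequest: fetch the pid, bump the Epoch
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("settlement-events", "order-1", "100.00"));
                producer.commitTransaction();
            } catch (ProducerFencedException e) {
                // a newer instance with the same transactional.id is active; just close
            } catch (KafkaException e) {
                producer.abortTransaction();  // recoverable error: roll the transaction back
            } finally {
                producer.close();
            }
        }
    }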
    

    04 Consumer-Side Idempotence

    As noted above, after the consumer pulls messages it hands them to a pool of worker threads, and a worker's handling of a message may involve asynchronous operations, so the following cases appear:

    • Commit first, then run the business logic: the commit succeeds but processing fails. The message is lost.
    • Run the business logic first, then commit: the commit fails but processing succeeded. The message is processed twice.
    • Run the business logic first, then commit: the commit succeeds but the asynchronous part of the processing fails. The message is lost.

    A common remedy is for a worker to run code like the following as soon as it takes a message:

    if (cache.contains(msgId)) {
        // msgId is already in the cache: the message has been handled, skip it
        continue;
    } else {
        lock.lock();
        try {
            cache.put(msgId, timeout);  // record the msgId, with an expiration
            consumer.commitSync();      // commit the offset before processing starts
        } finally {
            lock.unlock();
        }
    }
    // After all downstream work for the message finishes, its msgId can be deleted from
    // the cache; as long as a msgId is present in the cache, the message counts as
    // already processed. Note: cache entries need an expiration time (TTL), so that the
    // entry for a message whose processing failed eventually expires and can be retried.
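
    Putting the pieces together, here is a self-contained sketch of the pattern. An in-process map with a TTL sweep stands in for the shared cache (a production setup would more likely use Redis with SET NX plus an expiry); the class name, topic, and group id are made up:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DedupConsumerDemo {
        // msgId -> insertion time; a stand-in for a shared TTL cache such as Redis.
        private static final ConcurrentHashMap<String, Long> seen = new ConcurrentHashMap<>();
        private static final long TTL_MS = 10 * 60 * 1000L;

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder address
            props.put("group.id", "settlement-workers");       // placeholder group
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("settlement-events"));  // hypothetical topic
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        String msgId = record.key();  // assumes the key uniquely identifies the message
                        long now = System.currentTimeMillis();
                        seen.values().removeIf(t -> now - t > TTL_MS);  // expire stale entries
                        // putIfAbsent is atomic: only the first claimant of msgId proceeds.
                        if (seen.putIfAbsent(msgId, now) != null) {
                            continue;  // already processed (or in progress): skip
                        }
                        consumer.commitSync();
                        process(record);  // hand off to the business logic / worker pool
                    }
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            // business-logic placeholder
        }
    }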
    


Original article: https://www.cnblogs.com/jingangtx/p/11330338.html