• kafka-mirror不稳定问题分析与解决方法


         前段时间,线上环境的kafka多集群在采用mirror组件进行跨机房数据同步时,会偶尔出现hang住不稳定的情况:
    1. 现象
       a. 线上出现返回包序号不一致的现象:"Correlationid for response (13502150) does not match request"而程序hang住,cpu飙高,同步服务停止工作
      b. 发生平均频率:线上分3组group×2共6个实例,平均每天2个实例发生
      c. 类似线上问题请参考
          kafka-mirrormaker问题(https://issues.apache.org/jira/browse/KAFKA-1257)和kafka-producer问题(https://issues.apache.org/jira/browse/KAFKA-4669)

    2. 原因
       a. kafka网络协议背景
         kafka网络协议设计保证连接的请求和响应均是有序的,即对于每个单独的tcp连接下,可以保证发送请求和接收响应包序列均是有序的,同时每个发送请求包和响应包均有唯一递增id关联编号进行关联:“The server guarantees that on a single TCP connection, requests will be processed in the order they are sent and responses will return in that order as well.”出自kafka-network官网介绍;
       b. mirrormaker同步判断成功与否逻辑
        mirrormaker同步给目标kafka集群的每个数据request包均会搁在本地内存池里,直到收到相同CorrelationId的响应包,然后做两种判断: a. 发送成功,则销毁内存池中的数据请求包,b. 发送失败则数据包放回队列重新进行发送;
       c. mirrormaker同步判断线上bug原因
        而在判断函数handleCompletedReceives中: 由于条件a,默认认为每个发送请求包和响应包id号是一致的,而并未处理两者id号不一致的异常情况。所以一旦出现id编号不一致异常,则异常一直向上抛,而导致当前"发送请求包"并未得到任何响应处理,同时不会做内存释放最终导致泄露;
      d. 目前确定0.8.×、0.9.×版本均会存在线上同样问题

    3. 修复方案
         修改mirror-maker中kafka-client-0.8.2的源码: 增加出现了错乱包的异常捕获逻辑:把错乱时的数据请求包扔回内存队列进行重发。处理修改源码如下:

     /**
     * Handle any completed receives and update the response list with the responses received.
     * @param responses The list of responses to update
     * @param now The current time
     */
    private void handleCompletedReceives(List<ClientResponse> responses, long now) {
        for (NetworkReceive receive : this.selector.completedReceives()) {
            int source = receive.source();
            ResponseHeader header = ResponseHeader.parse(receive.payload());
            int compared = 0;
            ClientRequest req = null;
            short apiKey = -1;
            do{
                req = inFlightRequests.fristSent(source);
                if(req == null){
                	break;
                }
                apiKey = req.request().header().apiKey();
                compared = compareCorrelationId(req.request().header(), header, source);
            	if (compared < 0 && apiKey != ApiKeys.METADATA.id) {
            		responses.add(new ClientResponse(req, now, true, null));
            	}
            	if (compared < 0 || compared == 0){
            		req = inFlightRequests.completeNext(source);
            	}
            }while(compared < 0);
            if(req == null || compared > 0){
            	log.error("never go to this line");
            	continue;
            }
            Struct body = (Struct) ProtoUtils.currentResponseSchema(apiKey).read(receive.payload());
            if (apiKey == ApiKeys.METADATA.id) {
                handleMetadataResponse(req.request().header(), body, now);
            } else {
                // need to add body/header to response here
                responses.add(new ClientResponse(req, now, false, body));
            }
        }
    } 
  • 相关阅读:
    工作中,怎么做好规范
    每日一链
    模仿电子商务垂直菜单
    电脑不同的分辨率自适应显示
    怎样成为一位合格的程序员
    巅峰极客线上第一场ctf——RE
    恶意代码分析常见Windows函数
    巅峰极客线上第二场部分ctf
    恶意代码分析:虚拟网络环境配置
    0ctf2017 pwn babyheap
  • 原文地址:https://www.cnblogs.com/gisorange/p/8471173.html
Copyright © 2020-2023  润新知