• 记一次dubbo服务丢失的问题排查


    主要环境与用到的(关键)组件:

    Springboot2.3.2
    其中,dubbo-spring-boot-starter版本为2.7.8
    zookeeper3.5.9

    首先是服务报错:

    No provider available from registry ... for ... on consumer ... use dubbo version 2.7.8, please check status of providers(disabled, not registered or in blacklist).

    很多使用过dubbo的同学对这个报错信息应该都不陌生,一般情况下,如果没有配置好生产端、消费端或者注册中心,启动的时候就会报这个错。但这里的情况是,消费端和生产端都已经正常运行了一段时间了,大概10天左右,突然报这个错。

    这里借助网上总结的一张图,来梳理一下排查No Provider的情况:

    首先就来看看zookeeper下,生产端的provider情况,找到zookeeper的安装bin目录下:

    sh zkCli.sh
    ls /dubbo/xxx/provider

    查看生产端的信息,结果发现[],这时的生产端程序还是running状态(ps -ef进程存在),大概率就是这个生产端出了问题。由于日志有滚动删除,而且出问题的时间点也不确定,所以这个问题感觉没办法在服务本身的日志中查看。于是想到查看zookeeper下的事务日志:

    cd lib/
    java -cp .:zookeeper-3.5.9.jar:slf4j-api-1.7.25.jar:zookeeper-jute-3.5.9.jar org.apache.zookeeper.server.LogFormatter /tmp/zookeeper/version-2/log.7d1 > /tmp/zookeeper/version-2/log.log

    但也看不出什么端倪。这里不太确定是服务本身的问题,还是服务于zookeeper之间链接的问题。

    于是在消费服务中设置了一个定时任务,每秒钟去调用生产端的一个接口,并在生产和消费端都打印日志,同时加个监控,这样在除了问题之后,就能第一时间找到具体的错误信息。
    重启服务8天后,重现了这个错误,这一次直接看到了错误日志。

    首先,消费端爆出了

    2022-02-18 16:23:27.713  WARN 2386027 --- [scheduling-1] o.a.d.r.c.s.FailoverClusterInvoker       :  [DUBBO] Although retry the method heart in the service com.jf.zk.proxy.facade.BindFacade was successful by the provider 192.168.1.13:20010, but there have been failed providers [192.168.1.13:20010] (1/1) from the registry 192.168.1.13:2181 on the consumer 192.168.1.13 using the dubbo version 2.7.8. Last error is: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false&register.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000&timestamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010, dubbo version: 2.7.8, current host: 192.168.1.13
    org.apache.dubbo.rpc.RpcException: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false&register.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000&timestamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010
    ......
    Caused by: java.util.concurrent.ExecutionException: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010
    ......
    2022-02-18 16:23:28.009 ERROR 2386027 --- [scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task
    org.apache.dubbo.rpc.RpcException: No provider available from registry ...

    这里看到是一些请求超过了consumer端的timeout,后续就是No provider,猜想应该是生产端出现了什么问题,响应越来越慢,到后来服务长时间不可用,直至与其它服务断开了链接。这里就要看看生产端到底发生了什么,关键日志如下:

    2022-02-18 16:23:47.942  INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn          : Unable to read additional data from server sessionid 0x100000040fc01ae, likely server has closed socket, closing socket connection and attempting reconnect
    2022-02-18 16:23:32.144  WARN 2385338 --- [NettyServerWorker-6-2] o.a.d.remoting.transport.AbstractServer  :  [DUBBO] All clients has disconnected from /192.168.1.13:20010. You can graceful shutdown now., dubbo version: 2.7.8, current host: 192.168.1.13
    2022-02-18 16:27:15.830  WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-2] o.a.d.r.exchange.support.DefaultFuture   :  [DUBBO] The timeout response finally returned at 2022-02-18 16:27:13.882, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13
    2022-02-18 16:27:38.590  WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-3] o.a.d.r.exchange.support.DefaultFuture   :  [DUBBO] The timeout response finally returned at 2022-02-18 16:26:53.015, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13
    ......
    2022-02-18 16:23:47.117  INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn          : Unable to read additional data from server sessionid 0x100000040fc01ad, likely server has closed socket, closing socket connection and attempting reconnect
    2022-02-18 16:35:32.357  INFO 2385338 --- [NettyClientWorker-1-2] o.a.d.r.t.netty4.NettyClientHandler      :  [DUBBO] The connection of /192.168.1.13:44900 -> /192.168.1.13:8003 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13
    2022-02-18 16:36:13.950  WARN 2385338 --- [dubbo-client-idleCheck-thread-1] o.a.d.r.e.s.header.ReconnectTimerTask    :  [DUBBO] Reconnect to channel HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [/192.168.1.13:35954 -> /192.168.1.13:20004]], because heartbeat read idle time out: 180000ms, dubbo version: 2.7.8, current host: 192.168.1.13
    ......
    2022-02-18 18:46:33.184  WARN 2385338 --- [nioEventLoopGroup-10-3] i.n.c.AbstractChannelHandlerContext      : An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
    java.lang.OutOfMemoryError: Java heap space

    这个服务虽然出现了很多与另一个服务的链接问题,但后续出现了java.lang.OutOfMemoryError: Java heap space,大概率是由于这个内存问题导致的。

    由于服务还是running(ps -ef进程存在),我们可以直接把它的内存dump出来。

    jmap -dump:file=javaDump.hprof,format=b {该服务的pid}
    而后使用jvisualvm.exe(jdk自带工具,其实JDK自带了很多有用的工具,有必要了解一下)将这个hprof文件打开,可以查看很多信息:

    查看到了一个我们代码里定义的SessionModel对象异常地多,最终发现是同事代码里每次新建一个业务会话时,都会生成一个SessionModel对象,将其放在ConcurrentHashMap中,但没有clear或者remove的逻辑,导致相关的对象越来越多,最终导致OOM。
    猜测这个服务在内存资源消耗到一定程度时,已经无法处理其他的请求,发出去的请求也无法应答,所以与之相关的rpc链接会被断开,最终这个服务从注册中心被清理。

    这里还可以看看这个生产端的服务A调用其它服务B的日志:

    ......
    2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler      :  [DUBBO] IdleStateEvent triggered, close channel NettyChannel [channel=[id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954]], dubbo version: 2.7.8, current host: 192.168.1.13
    2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.transport.netty4.NettyChannel    :  [DUBBO] Close netty channel [id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954], dubbo version: 2.7.8, current host: 192.168.1.13
    2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler      :  [DUBBO] The connection of /192.168.1.13:35954 -> /192.168.1.13:20004 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13

    这里找到对应日志的代码:

    猜测是服务B由于长时间没有接收到来自上文提到的服务A(此时应该是客户端的角色)的心跳,持续了timeout这么长时间,于是服务B就把链接关闭了,此时对于A来说,应该就是收不到B服务的响应,所以也会在日志前期报访问服务B的timeout。

  • 相关阅读:
    webjars管理静态资源
    SpringCloud踩坑日记
    ELK日志搜索平台搭建
    新硬盘挂载到目录后目录原先数据消失解决办法
    nginx安装缺少依赖记录
    SpringCloud踩坑日记
    .bashrc配错刷新导致linux基础命令不能用修复
    nginx超时时间配置
    nginx日志切分shell脚本
    2019.10.10 实习日记
  • 原文地址:https://www.cnblogs.com/bruceChan0018/p/15930311.html
Copyright © 2020-2023  润新知