grpc报错call already cancelled

grpc报错call already cancelled
参考：https://sq.163yun.com/blog/article/180400600475590656

参考：https://stackoom.com/question/3rd87/gRPC-RPC%E8%B0%83%E7%94%A8%E4%B8%8A%E7%9A%84%E9%9A%8F%E6%9C%BACANCELED%E5%BC%82%E5%B8%B8

参考：https://blog.csdn.net/qq_14945847/article/details/102466693

GRPC是Google开源的一个高性能、跨语言的RPC框架，基于HTTP2协议，基于protobuf-3.x，基于Netty-4.x。

最近开发一个公司的游戏项目，采用分布式的服务器架构，内部服务之间均采用GRPC进行通信。但是在测试过程中发现一个重大的bug，偶现丢包。

前置信息：ServerA（作为grpc的client端）通过futureStub异步调用ServerB（作为server端）。

查看日志发现，client中有调用发送请求的日志，但server端却没有任何收到请求的日志。于是怀疑server在接收请求的过程中吃掉了异常，对整个过程加上try-catch，然而并没有任何发现，于是怀疑根本没有收到请求。通过tcpdump抓包后发现确实是client没发包到网络层，由此确定了是client端问题。

怀疑是grpc的futureStub有bug，于是在finally中加入future.get()打印出调用结果。结果一天下来一直没有发现任何丢包，”完美“。懵逼了，去掉finally中的future get以后又出现丢包。这个时候，直觉判断，由于调用futue.get()会block当前线程，所以不会出现丢包，而不block就会丢包。有种接近真相的感觉。检查代码后发现client端创建stub的时候没有指定线程池，grpc会默认使用当前线程，结合偶现的表现，觉得非常像是线程在异步发送的过程中被销毁导致。

然而，指定线程池以后，依旧发现丢包。无奈之下只有开始查看grpc源码，并打印出grpc的日志。在日志中发现了如下错误：
```
----------------OUTBOUND--------------------
    [id: 0x7dee8b32, L:/10.82.7.14:60365 - R:/10.82.7.14:9001] RST_STREAM: streamId=4025, errorCode=8
------------------------------------
```
看到errorCode=8，并结合grpc源码中对io.grpc.Status的定义：
```
 /**
     * Some resource has been exhausted, perhaps a per-user quota, or
     * perhaps the entire file system is out of space.
     */
RESOURCE_EXHAUSTED(8),
```
推断是由于系统文件句柄资源耗尽导致，查看服务器系统文件句柄限制为65535，而进程打开的限制远不止这么多。于是找sa将文件句柄限制改为100w，然后信心满满的重启了全部服务，重新进行测试。结果很快又发现了丢包，内心崩溃！

由于当future.get()方法阻塞线程时，就不会出现丢包，无法重现。于是想到把futureStub改为stub，在保持异步的同时，通过observer回调来观察调用结果,这次在回调的onError方法中发现了如下异常：
```
    CANCELLED,
    io.grpc.Status.asRuntimeException(Status.java:543), 
    io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395), 
    io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426), 
    io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:76), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:512), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:429), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:544), 
    io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52), 
    io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117), 
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142), 
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617), 
    java.lang.Thread.run(Thread.java:745
```
终于发现原来是这个call被grpc在底层cancel掉了。查代码发现原来上面的errorCode=8是指http2的code定义：
```
    /**
     * Mapped to call cancellation when sent by a client. 
     * Mapped to CANCELLED when sent by a server. 
     * Note that servers should only use this mechanism when they need to cancel a call but the payload byte sequence is incomplete.
     */
    CANCEL(8)
```
知道了是由于被cancel才导致的丢包，那么就要查找为什么会被cancel，由于grpc的stub调用是通过异步的线性执行器执行io.grpc.ClientCall。所以堆栈中无法看到整个调用过程，只能通过源码来看。不管是哪种stub的调用，最后都会走到ClientCalls.startCall()方法中，然后进ClientCallImpl.start()中，在这个方法中，可以看到与cancel有关的代码：
```
    if (context.isCancelled()) {
      // Context is already cancelled so no need to create a real stream, just notify the observer
      // of cancellation via callback on the executor
      stream = NoopClientStream.INSTANCE;
      class ClosedByContext extends ContextRunnable {
        ClosedByContext() {
          super(context);
        }

        @Override
        public void runInContext() {
          closeObserver(observer, statusFromCancelled(context), new Metadata());
        }
      }

      callExecutor.execute(new ClosedByContext());
      return;
    }
```
可以看出来cancel的原因是由于context.isCancelled()的结果为true
```
  public boolean isCancelled() {
    if (parent == null || !cascadesCancellation) {
      return false;
    } else {
      return parent.isCancelled();
    }
  }
```
可以看出，如果parent==null，则不可能为true，所以必然是由于parent.isCancelled()为true导致的。由此我们基本可以猜测出来，当前的ClientCall被cancel的原因是由于它的parent context canceled导致。

回到ClientCallImpl中，可以看到this.context = Context.current();表明每个ClientCall被创建的时候，context都会被赋值为Context.current()
```
  public static Context current() {
    Context current = storage().current();
    if (current == null) {
      return ROOT;
    }
    return current;
  }
```
如果当前没有context，则用ROOT作为context
```
  /**
   * The logical root context which is the ultimate ancestor of all contexts. This context
   * is not cancellable and so will not cascade cancellation or retain listeners.
   *
   * Never assume this is the default context for new threads, because {@link Storage} may define
   * a default context that is different from ROOT.
   */
  public static final Context ROOT = new Context(null);
```
在注释中可以看出，ROOT是不可以被cancel的，所以可以判断，当前必然有一个grpc的context存在。

最后联想到由于业务代码中，ClientCall是在ServerA处理grpc请求的过程中执行的，那么这个Context必然是由GrpcServer处理请求的时候生成的，并赋值给了ClientCall。而由于ClientCall是线性执行器中异步执行，当真正执行ContextRunnable的时候，GrpcServer的context可能已经被cancel，导致当前ClientCall被cancel。

解决办法：可以对当前context进行fork，或者把每个call都封装为Runnable，通过Context提供的run静态方法执行。

至此，终于弄清楚了grpc丢包的原因。虽然，这个问题通过业务层代码可以避免掉，但是我觉得grpc在底层默默的把这件事做了，并且没有往上层抛出任何信息，这对开发者是非常不友好的。同时，这个cancel逻辑的处理上，我觉得也有待商榷。毕竟在GrpcServer中异步的去执行GrpcCall是一个非常普遍合理的模型。也许以后Google会改吧。

附：
相关阅读:
Java之static理解
 linux安装vsftpd
JS获取当前时间戳的三个方法
 git常用操作
 npm更新包
 springboot的mvn与gradle启动方式
 mac nginx compile
[文摘]那些一心想要离开 BAT 的人，后来怎么样了？
excel 常用法
 chrome 快捷键
原文地址：https://www.cnblogs.com/fnlingnzb-learner/p/13710854.html