问题
观察生产环境 JMX 监控, 发现应用在内存压力不大的情况下每小时会发生一次 full gc, 每次耗时约 300 毫秒. 这显然不是正常情况, 需要进行排查.
排查过程
首先查看 GC 日志, 发现 FGC 是由 System.gc()
触发的, 日志内容大致如下(日志内容来源于周期性Full GC的异常排查):
[GC (System.gc())
Desired survivor size 5767168 bytes, new threshold 1 (max 15)
[PSYoungGen: 518231K->2816K(693760K)] 1188617K->673417K(2092032K), 0.0274540 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
Heap after GC invocations=194 (full 1):
PSYoungGen total 693760K, used 2816K [0x00000007d5500000, 0x0000000800000000, 0x0000000800000000)
eden space 688128K, 0% used [0x00000007d5500000,0x00000007d5500000,0x00000007ff500000)
from space 5632K, 50% used [0x00000007ff500000,0x00000007ff7c0010,0x00000007ffa80000)
to space 5632K, 0% used [0x00000007ffa80000,0x00000007ffa80000,0x0000000800000000)
ParOldGen total 1398272K, used 670601K [0x000000077ff80000, 0x00000007d5500000, 0x00000007d5500000)
object space 1398272K, 47% used [0x000000077ff80000,0x00000007a8e627e8,0x00000007d5500000)
PSPermGen total 262144K, used 125738K [0x000000076ff80000, 0x000000077ff80000, 0x000000077ff80000)
object space 262144K, 47% used [0x000000076ff80000,0x0000000777a4a938,0x000000077ff80000)
}
对项目代码进行搜索后, 排除了代码中调用 System.gc()
的可能. 结合每小时触发一次 FGC 的周期性, 怀疑是有线程在循环调用, 使用了 jstack 找出了两个可疑线程:
"RMI RenewClean-[xxx.xxx.xxx.xxx:35539]" #323 daemon prio=5 os_prio=0 tid=0x00007f82a4022800 nid=0x144 in Object.wait() [0x00007f81d7ebf000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
- locked <0x00000007035c6ba0> (a java.lang.ref.ReferenceQueue$Lock)
at sun.rmi.transport.DGCClient$EndpointEntry$RenewCleanThread.run(DGCClient.java:563)
at java.lang.Thread.run(Thread.java:748)
"GC Daemon" #244 daemon prio=2 os_prio=0 tid=0x00007f8405f69800 nid=0xf5 in Object.wait() [0x00007f81ddb0d000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at sun.misc.GC$Daemon.run(GC.java:117)
- locked <0x00000007035bddc8> (a sun.misc.GC$LatencyL
检查 sun.misc.GC 和 sun.rmi.transport.DGCClient 两个类:
DGCClient:
private static final long gcInterval = (Long)AccessController.doPrivileged(new GetLongAction("sun.rmi.dgc.client.gcInterval", 3600000L));
private static class EndpointEntry {
public static DGCClient.EndpointEntry lookup(Endpoint var0) {
synchronized(endpointTable) {
DGCClient.EndpointEntry var2 = (DGCClient.EndpointEntry)endpointTable.get(var0);
if (var2 == null) {
var2 = new DGCClient.EndpointEntry(var0);
endpointTable.put(var0, var2);
if (gcLatencyRequest == null) {
gcLatencyRequest = GC.requestLatency(DGCClient.gcInterval);
}
}
return var2;
}
}
}
GC:
private static Thread daemon = null;
public static GC.LatencyRequest requestLatency(long var0) {
return new GC.LatencyRequest(var0);
}
由上面的代码可知开启 RMI 会通过 DGCClient
调用 GC#requestLatency(long)
方法, 创建一个守护线程, 周期性执行 System.gc()
. 这个周期为 DGCClient.gcInterval
, 默认为一小时.
另外, 项目还依赖了 cxf, cxf 默认会每 10 小时执行一次 System.gc()
:
if (!skipHack("org.apache.cxf.JDKBugHacks.gcRequestLatency")) {
Class<?> clazz = Class.forName("sun.misc.GC");
Method method = clazz.getDeclaredMethod("currentLatencyTarget");
Long l = (Long)method.invoke(null);
if (l != null && l.longValue() == 0) {
//something already set it, move on
method = clazz.getDeclaredMethod("requestLatency",
new Class[] {Long.TYPE});
method.invoke(null, Long.valueOf(36000000));
}
}
解决方法
- 将 sun.rmi.dgc.server.gcInterval 和 sun.rmi.dgc.client.gcInterval 参数设置为 9223372036854775807L (对这个值做了特殊处理 GC 守护线程会马上结束), 还需要一个参数 org.apache.cxf.JDKBugHacks.gcRequestLatency 设置为 false
-Dsun.rmi.dgc.server.gcInterval=9223372036854775807L -Dsun.rmi.dgc.client.gcInterval=9223372036854775807L -Dorg.apache.cxf.JDKBugHacks.gcRequestLatency=false
- 将 sun.rmi.dgc.server.gcInterval 和 sun.rmi.dgc.client.gcInterval 设置为 9223372036854775806L, 此时守护线程会一直阻塞, 不会再触发 GC.
- 使用 CMS 回收器, 并开启并行回收, 以减少停顿时间
-XX:+ExplicitGCInvokesConcurrent -XX:+UseConcMarkSweepGC