最近又监控到一个高内存的问题,周五下班把系统打开,周末2天没关,周一来看已经涨到5.2G,这次与以往不同,不是.net的内存泄漏,而是非托管引起的。
1. 抓dump,确定高内存的类型
//dump有5.2G,.net的堆只有191M,可以判断是unmanaged high memory
0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000159852b88
generation 1 starts at 0x00000001597d26a0
generation 2 starts at 0x0000000003231000
ephemeral segment allocation context: none
segment begin allocated size
0000000003230000 0000000003231000 000000000dad5f38 0xa8a4f38(176836408)
00000001596b0000 00000001596b1000 0000000159db48f8 0x7038f8(7354616)
Large object heap starts at 0x0000000013231000
segment begin allocated size
0000000013230000 0000000013231000 0000000013906b30 0x6d5b30(7166768)
Total Size: Size: 0xb67e360 (191357792) bytes.
------------------------------
GC Heap Size: Size: 0xb67e360 (191357792) bytes.
2. 用windbg找原因
于是不能用sos ext里的命令了,用CodeProject上的这篇文章里的方法试了一下,如下:
//这个命令能看到所有的内存消耗,确实分配了5G多
0:000> !address -summary
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
*** ERROR: Symbol file could not be found. Defaulted to export symbols for DebussyReverie.dll -
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 412 7fe`897c2000 ( 7.994 Tb) 99.93%
Heap 628 1`24c81000 ( 4.575 Gb) 78.18% 0.06%
<unknown> 1217 0`39381000 ( 915.504 Mb) 15.28% 0.01%
Image 2120 0`14561000 ( 325.379 Mb) 5.43% 0.00%
Stack 200 0`04080000 ( 64.500 Mb) 1.08% 0.00%
Other 13 0`001c6000 ( 1.773 Mb) 0.03% 0.00%
TEB 66 0`00084000 ( 528.000 kb) 0.01% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 1509 1`59760000 ( 5.398 Gb) 92.24% 0.07%
MEM_IMAGE 2683 0`184d0000 ( 388.813 Mb) 6.49% 0.00%
MEM_MAPPED 53 0`04bfe000 ( 75.992 Mb) 1.27% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 412 7fe`897c2000 ( 7.994 Tb) 99.93%
MEM_COMMIT 3267 1`4d29f000 ( 5.206 Gb) 88.96% 0.06%
MEM_RESERVE 978 0`2958f000 ( 661.559 Mb) 11.04% 0.01%
//再查看非托管堆,能看到有1个堆占了4.5G
0:000> !heap -s
LFH Key : 0x000000ff3fbc76ee
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
Virtual block: 0000000002ca0000 - 0000000002ca0000 (size 0000000000000000)
...
000000001fc30000 00001002 4597824 4568704 4597824 568 296 288 0 1 LFH
...
External fragmentation 23 % (59 free blocks)
-------------------------------------------------------------------------------------
//查看有问题的堆,大小为27d1的某个内存块共有6cc72个,27d1*6cc72=10eb24712(4.5G)
0:000> !heap -stat -h 000000001fc30000
heap @ 000000001fc30000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
27d1 6cc72 - 10eb24712 (99.95)
128 18d5 - 1cb648 (0.04)
2740 11 - 29b40 (0.00)
3e80 1 - 3e80 (0.00)
//用size=27d1过滤,查看每个对象的指针UserPtr
0:000> !heap -flt s 27d1
_HEAP @ 90000
...
_HEAP @ 1fc30000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
0000000025217420 0281 0000 [00] 0000000025217430 027d1 - (busy)
0000000025219c30 0281 0281 [00] 0000000025219c40 027d1 - (busy)
//随便找个对象地址,看被哪里引用了,悲剧的是,不管找哪个,都没有堆栈信息stack trace!
0:000> !heap -p -a 0000000025217430
address 0000000025217430 found in
_HEAP @ 1fc30000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
0000000025217420 0281 0000 [00] 0000000025217430 027d1 - (busy)
3. 用DebugDiag找原因
上面的路目前看起来是走不大通了,于是换Debug大神-美女程序猿-Tess的这篇博文里的方法,改用DebugDiag试试。
- 下载DebugDiag1.2,这是目前最新的版本,已经能支持win7和x64位的系统了。根据你的os选择x86/x64的msi版本并安装。
- 重新启动你的程序,因为这个工具是在跟踪tracking内存分配的过程,所以需要在high memory出现之前就开始监控。
- 打开DebugDiag,选择Add Rule、Native (non-.NET) Memory and Handle,选择要分析的进程,Rule->Status会显示目前处于tracking的状态。
- 执行一段时间、1h 或 1天,转到tab页Processes里右键进程、Create Full Userdump。
- 然后转到Advanced Analysis里选中"Memory Pressure Analyzers"脚本,点击Add Data Files把生成的dump加进去。
- 点击"Start Analysis"就开始分析了,要耐心一点,视dump大小一般需要>10min。
因为要重新启动,下面的dump是我重新抓的,跑了一晚上,大概有1.7G,分析的结果显示是一个DebussyReverie.dll的模块在不停的分配内存。
DebussyReverie.dll is responsible for 1.15 GBytes worth of outstanding allocations. The following are the top 2 memory consuming functions:
DebussyReverie!GetLspGuid+79e: 1.07 GBytes worth of outstanding allocations.
DebussyReverie!GetLspGuid+f31df: 83.04 MBytes worth of outstanding allocations.
Image Name: C:Program FilesLeagSoftUniAccess AgentDebussyReverie.dll
Function DebussyReverie!GetLspGuid+79e
Allocation type Heap allocation(s)
Heap handle 0x00000000`00000000
Allocation Count 114716 allocation(s)
Allocation Size 1.07 GBytes
Leak Probability 81%
//堆栈如下,可惜没能显示从我的代码到问题代码之间的调用关系
Function Source Destination
LeakTrack+13277
DebussyReverie!GetLspGuid+79e
DebussyReverie!GetLspGuid+647
DebussyReverie!GetLspGuid+ea3
DebussyReverie!WSPStartup+431
ws2_32!WahOpenApcHelper+20a
ws2_32!_WSAFDIsSet+e02
ws2_32!WSCInstallProviderAndChains64_32+1971
ws2_32!WSCInstallProviderAndChains64_32+1aa0
ws2_32!WSASendTo+21fc
ws2_32!getsockname+4a
DebussyReverie!PublicService+74b
DebussyReverie!GetLspGuid+a41dd
DebussyReverie!PublicService+2c70
ws2_32!WSApSetPostRoutine+298
ws2_32!WSAConnect+af
System_ni+32f3f2
...
0x7FE8925FEA7
看起来,问题来自公司安装的网络监控软件LeagSoft里的DebussyReverie.dll。大约是下班后服务器断开了,项目里Apache.NMS.ActiveMQ连接出错时调用多线程反复重连,然后被监控软件拦截到后出的错。应该是LeagSoft里的一个Bug,如果能和他们交流一下,应该能进一步确定问题所在,甚至修复这个问题。