• Debugging in Practice: Process Hang + High CPU


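    For the past few weeks I have been chasing stability problems in our application: the program (which is multi-process) hangs from time to time, and one of its processes shows high CPU at the same time. After tracing them down, almost all of these turned out to be deadlocks of one kind or another. This post analyzes one typical case.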

    1. Capture a dump and analyze it

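    The problem cannot be reproduced reliably, so the most dependable approach is to capture a dump when it occurs and analyze it afterwards. The usual method: ProcDump -ma [ProcessName]. In this case several processes are hung: when the user clicks Exit in the main process, Main, the child process, Mkt, does not respond. So which process is actually stuck?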

    2. Start with Main

    First, !syncblk:

    0:000> !syncblk
    Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
    244 0000000023d99b68            5         1 0000000029b26bc0 100  41   0000000003c95a30 System.Object
    

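    Thread 41 holds a lock on a System.Object. MonitorHeld=5: the owner contributes 1 and each waiting thread contributes 2, so (5-1)/2=2 threads are waiting for this lock. What is thread 41 doing? Why does it hold the lock without making progress?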
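
The MonitorHeld arithmetic can be written as a small helper. This is only a sketch of the rule stated above (1 for the owner, 2 per waiter); the class and method names are made up for illustration and are not part of SOS or WinDbg.

```csharp
using System;

static class SyncBlockMath
{
    // MonitorHeld encoding as described in the text:
    // the owner adds 1, each waiting thread adds 2.
    public static int WaiterCount(int monitorHeld)
    {
        int owner = monitorHeld & 1;          // 1 if some thread currently owns the lock
        return (monitorHeld - owner) / 2;     // what remains is 2 per waiter
    }

    static void Main()
    {
        // MonitorHeld value taken from the !syncblk output of the Main process.
        Console.WriteLine(SyncBlockMath.WaiterCount(5)); // 2
    }
}
```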

    /*0:000> !threads
    ThreadCount:      974
    UnstartedThread:  0
    BackgroundThread: 965
    PendingThread:    0
    DeadThread:       8
    Hosted Runtime:   no
                                                                                                          Lock  
         ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
    41   45  100 0000000029b26bc0    2b228 Preemptive  0000000000000000:0000000000000000 0000000002222bb0 1     MTA */
    

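    Note: these three IDs are easy to confuse. 41 is the thread's number in WinDbg; 45 (ID) is the thread's number in .NET, which corresponds to the managed ID shown when debugging in VS; 100 (OSID) is the unique ID the operating system assigned to the thread, printed in hexadecimal, which corresponds to the decimal ID shown in VS, here 256 (0x100).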
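
To match an OSID from !threads against the decimal thread ID shown in VS, the hexadecimal value has to be converted. A minimal sketch using only standard .NET conversion calls; the class name is hypothetical:

```csharp
using System;

static class ThreadIdDemo
{
    static void Main()
    {
        // OSID exactly as printed by !threads: hexadecimal, no 0x prefix.
        string osidHex = "100";

        // Hex -> decimal, the form shown in the VS Threads window.
        int osidDecimal = Convert.ToInt32(osidHex, 16);
        Console.WriteLine(osidDecimal); // 256

        // Decimal -> hex, to go from a VS thread ID back to the dump.
        Console.WriteLine(osidDecimal.ToString("x")); // 100
    }
}
```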

    /*0:000> ~41e!clrstack
    OS Thread Id: 0x100 (41)
            Child SP               IP Call Site
    000000002780ea30 0000000077ab133a Microsoft.Win32.UnsafeNativeMethods.WriteFile(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    000000002780ea30 000007fee5a775d2 Microsoft.Win32.UnsafeNativeMethods.WriteFile(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    000000002780e9f0 000007fee5a775d2 DomainBoundILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    000000002780eaf0 000007fee5af4aa0 System.IO.Pipes.PipeStream.WriteFileNative(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte[], Int32, Int32, System.Threading.NativeOverlapped*, Int32 ByRef)
    000000002780eb40 000007fee5af4742 System.IO.Pipes.PipeStream.WriteCore(Byte[], Int32, Int32)*/
    

    As the stack shows, thread 41 is stuck in PipeStream.Write. MSDN notes that if the other end of the pipe does not read all of the byte[] that was written, Write keeps blocking:

    Calling the PipeStream.Write() method blocks until count bytes are read or the end of the stream is reached.

    In other words, the main process Main is writing to the child process, but the child never reads the data.

    3. Then look at the Mkt process

    Again, !syncblk:

    Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
       42 0000000002358828            3         1 000000002063b9b0 34f8  17   0000000003909d28 System.Object
    

    Thread 17 holds the lock and one other thread is waiting for it (MonitorHeld=3, so (3-1)/2=1). What is thread 17 doing?

    /*OS Thread Id: 0x34f8 (17)
            Child SP               IP Call Site
    0000000022c8b2f0 0000000077ab131a Microsoft.Win32.UnsafeNativeMethods.ReadFile(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    0000000022c8b2f0 000007fee5a775d2 Microsoft.Win32.UnsafeNativeMethods.ReadFile(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    0000000022c8b2b0 000007fee5a775d2 DomainBoundILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte*, Int32, Int32 ByRef, IntPtr)
    0000000022c8b3b0 000007fee5af4980 System.IO.Pipes.PipeStream.ReadFileNative(Microsoft.Win32.SafeHandles.SafePipeHandle, Byte[], Int32, Int32, System.Threading.NativeOverlapped*, Int32 ByRef)
    0000000022c8b400 000007fee5af44d4 System.IO.Pipes.PipeStream.ReadCore(Byte[], Int32, Int32)
    0000000022c8b470 000007fe88a05852 IPC.PipeTransport.Send(Byte[])
    0000000022c8b8a0 000007fe89f17b1f UI.WPF.WindowBase.UnRegisterAllIPCSubscriber()
    0000000022c8b900 000007fe89f170df UI.WPF.WindowBase.WindowBase_Closed(System.Object, System.EventArgs)*/
    

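    It looks like thread 17 acquired the lock while the window was closing and tried to unregister its IPC subscription, but the Read never receives a reply. Next, which thread is waiting for the lock? Run ~*e!clrstack and search the output for Monitor.Enter; it turns out to be thread 8: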

    /*0:000> ~8e!clrstack
            Child SP               IP Call Site
    00000000211ae908 0000000077ab186a System.Threading.Monitor.Enter(System.Object)
    00000000211aea00 000007fe89b09f06 UI.WPF.WindowBase.OnReciveIPCMessqage(MessageRoute.IMessage)*/
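
With several hundred threads in the dump, searching the ~*e!clrstack output by eye is tedious. If the output is saved to a text file, the same search can be sketched offline as below. This assumes each stack starts with an "OS Thread Id:" header line, as in the outputs shown earlier; the helper name is made up, and the OS thread ID used for thread 8 in the sample is a placeholder, since the source does not show it.

```csharp
using System;
using System.Collections.Generic;

static class ClrStackScan
{
    // Returns the "OS Thread Id" header of every stack that contains the marker text.
    public static List<string> ThreadsContaining(string dumpText, string marker)
    {
        var hits = new List<string>();
        string currentHeader = null;
        bool alreadyAdded = false;

        foreach (string raw in dumpText.Split('\n'))
        {
            string line = raw.Trim();
            if (line.StartsWith("OS Thread Id:", StringComparison.Ordinal))
            {
                // A new thread's stack begins here.
                currentHeader = line;
                alreadyAdded = false;
            }
            else if (currentHeader != null && !alreadyAdded && line.Contains(marker))
            {
                hits.Add(currentHeader);
                alreadyAdded = true; // report each thread once
            }
        }
        return hits;
    }

    static void Main()
    {
        // Shortened sample; 0x1234 is a placeholder OSID, not a value from the real dump.
        string sample =
            "OS Thread Id: 0x34f8 (17)\n" +
            "0000000022c8b400 000007fee5af44d4 System.IO.Pipes.PipeStream.ReadCore(Byte[], Int32, Int32)\n" +
            "OS Thread Id: 0x1234 (8)\n" +
            "00000000211ae908 0000000077ab186a System.Threading.Monitor.Enter(System.Object)\n";

        foreach (string header in ThreadsContaining(sample, "System.Threading.Monitor.Enter"))
        {
            Console.WriteLine(header); // OS Thread Id: 0x1234 (8)
        }
    }
}
```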
    

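    With the lock ownership and wait relationships known, we still have to go back to the code. The source shows that once thread 8 in Mkt gets the lock, its job is to Read the byte array that Main has written (that code is omitted here). It cannot get the lock because thread 17 holds it: thread 17 sent Main an unregister message and is now waiting in Read for the reply. Meanwhile, Main's thread 41 is blocked in Write, waiting for thread 8 in Mkt to Read.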

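    private void UnRegisterAllIPCSubscriber()
    {
        lock (this)
        {
            // Stop() makes a blocking IPC call while the lock is still held.
            _subscriber.Stop();
            // Only _ipcSubscribers is shared data that needs the lock.
            _ipcSubscribers.Clear();
        }
    }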
    

    4. Summary

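    And that is how this cross-process deadlock came about. The crux is that the lock in the unregister method above covers too much: _subscriber.Stop spans an IPC call, while only _ipcSubscribers is actually shared data. Narrowing the scope of the lock avoids the problem. Finally, thanks to Tess Ferrandez: the main approach comes from her post "A Hang Scenario, Locks and Critical Sections".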
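
One way the narrowed lock might look is sketched below. It is not the author's actual fix: the ISubscriber interface, the FakeSubscriber class, the private _sync lock object (used here in place of lock (this)), and the element type of _ipcSubscribers are all assumptions added to make the example self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the type of _subscriber; only Stop() is known from the article.
interface ISubscriber
{
    void Stop();
}

// Fake used only to make the sketch runnable; the real Stop() blocks on an IPC round trip.
class FakeSubscriber : ISubscriber
{
    public void Stop() { Console.WriteLine("Stop: IPC call made with no lock held"); }
}

class WindowSketch
{
    // Private lock object: an assumption, the original code locks on this.
    private readonly object _sync = new object();
    private readonly List<string> _ipcSubscribers = new List<string> { "a", "b" };
    private readonly ISubscriber _subscriber;

    public WindowSketch(ISubscriber subscriber) { _subscriber = subscriber; }

    public int SubscriberCount
    {
        get { lock (_sync) { return _ipcSubscribers.Count; } }
    }

    public void UnRegisterAllIPCSubscriber()
    {
        // The blocking IPC call happens outside the lock, so the thread that
        // handles incoming IPC messages can still take the lock and keep reading.
        _subscriber.Stop();

        // Only the shared collection is protected.
        lock (_sync)
        {
            _ipcSubscribers.Clear();
        }
    }
}

static class Program
{
    static void Main()
    {
        var window = new WindowSketch(new FakeSubscriber());
        window.UnRegisterAllIPCSubscriber();
        Console.WriteLine(window.SubscriberCount); // 0
    }
}
```

For this to help, the message handler that currently waits in Monitor.Enter would have to lock the same object. Whether Stop() is safe to call without the lock depends on code that is not shown in the article.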

  • Original article: https://www.cnblogs.com/AlexanderYao/p/5289324.html