Analyzing a Hang in a .NET Industrial Vision System

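I: Background
1. The story
A while ago a friend came to me saying their industrial vision software had frozen, and asked me to help figure out what was going on. Haha, a hang is actually fairly easy to locate: you look at the main thread's stack and then analyze the specific problem case by case. How hard that turns out to be is, of course, a matter of luck.
A few days ago I read an article claiming that today's .NET programmers don't need to learn WinDbg, on the grounds that plenty of good analysis tools such as VS, DNSpy, and PerfView can replace it. I could only smile. In their view, perhaps, a .NET program is a self-contained world that never has to interact with other languages.
Enough talk. Back to the topic, and let WinDbg do the talking.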
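II: Why it hangs
1. What is the main thread doing
As mentioned above, a hang is fairly easy to locate: switch to the main thread and look at its stack. The simplified output is as follows: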
0:000> ~0s;k
ntdll!NtDelayExecution+0x14:
00007ffc`7d45fcf4 c3              ret
 # Child-SP          RetAddr               Call Site
00 00000000`007fd628 00007ffc`79a15631     ntdll!NtDelayExecution+0x14
01 00000000`007fd630 00007ffc`40b7b116     KERNELBASE!SleepEx+0xa1
02 00000000`007fd6d0 00007ffc`40b7372e     cogxstd+0x13b116
03 00000000`007fd700 00007ffc`40b73ece     cogxstd+0x13372e
...
09 00000000`007fd9b0 00007ffc`7d1c77e3     CogDisplay!DllUnregisterServer+0x1833f
0a 00000000`007fdab0 00007ffc`7d16436c     rpcrt4!Invoke+0x73
0b 00000000`007fdb00 00007ffc`7cdbc473     rpcrt4!NdrStubCall2+0x42c
0c 00000000`007fe130 00007ffc`7c451bf0     combase!CStdStubBuffer_Invoke+0x73 [onecore\com\combase\ndr\ndrole\stub.cxx @ 1446]
...
11 00000000`007fe230 00007ffc`7cdc2df6     combase!DefaultStubInvoke+0x1c4 [onecore\com\combase\dcomrem\channelb.cxx @ 1769]
12 (Inline Function) --------`--------     combase!SyncStubCall::Invoke+0x22 [onecore\com\combase\dcomrem\channelb.cxx @ 1826]
13 00000000`007fe380 00007ffc`7cd62e55     combase!SyncServerCall::StubInvoke+0x26 [onecore\com\combase\dcomrem\servercall.hpp @ 825]
14 (Inline Function) --------`--------     combase!StubInvoke+0x265 [onecore\com\combase\dcomrem\channelb.cxx @ 2052]
15 00000000`007fe3c0 00007ffc`7cd8ded2     combase!ServerCall::ContextInvoke+0x435 [onecore\com\combase\dcomrem\ctxchnl.cxx @ 1532]
...
31 00000000`007fff60 00000000`00000000     ntdll!RtlUserThreadStart+0x21
The stack shows that the main thread is currently in Sleep, which is rather odd, and the call comes from Cognex's cogxstd DLL. I am confident that Cognex would not make such a basic mistake. Next, let's find out how long the Sleep actually lasts by looking closely at the assembly, trimmed down as follows:
ntdll!NtDelayExecution:
00007ffc`7d45fce0 4c8bd1           mov     r10, rcx
00007ffc`7d45fce3 b834000000       mov     eax, 34h
00007ffc`7d45fce8 f604250803fe7f01 test    byte ptr [7FFE0308h], 1
00007ffc`7d45fcf0 7503             jne     ntdll!NtDelayExecution+0x15 (7ffc7d45fcf5)
00007ffc`7d45fcf2 0f05             syscall
00007ffc`7d45fcf4 c3               ret
00007ffc`7d45fcf5 cd2e             int     2Eh
00007ffc`7d45fcf7 c3               ret
00007ffc`7d45fcf8 0f1f840000000000 nop     dword ptr [rax+rax]
KERNELBASE!SleepEx:
00007ffc`79a15590 89542410         mov     dword ptr [rsp+10h], edx
00007ffc`79a15594 4c8bdc           mov     r11, rsp
00007ffc`79a15597 53               push    rbx
00007ffc`79a15598 56               push    rsi
00007ffc`79a15599 57               push    rdi
00007ffc`79a1559a 4881ec80000000   sub     rsp, 80h
00007ffc`79a155a1 8bda             mov     ebx, edx
00007ffc`79a155a3 8bf9             mov     edi, ecx
...
00007ffc`79a155f4 488b9424b8000000 mov     rdx, qword ptr [rsp+0B8h]
00007ffc`79a155fc 85db             test    ebx, ebx
00007ffc`79a155fe 0f8592000000     jne     KERNELBASE!SleepEx+0x106 (7ffc79a15696)
00007ffc`79a15604 83ffff           cmp     edi, 0FFFFFFFFh
00007ffc`79a15607 7443             je      KERNELBASE!SleepEx+0xbc (7ffc79a1564c)
00007ffc`79a15609 4869cf10270000   imul    rcx, rdi, 2710h
00007ffc`79a15610 48894c2420       mov     qword ptr [rsp+20h], rcx
00007ffc`79a15615 48f7d9           neg     rcx
...
00007ffc`79a15622 488d542420       lea     rdx, [rsp+20h]
00007ffc`79a15627 0fb6cb           movzx   ecx, bl
00007ffc`79a1562a 48ff15ef641400   call    qword ptr [KERNELBASE!__imp_NtDelayExecution (7ffc79b5bb20)]
For reference, here are the corresponding C++ method signatures from ReactOS.
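DWORD WINAPI SleepEx(IN DWORD dwMilliseconds, IN BOOL bAlertable) {}
NTSTATUS NTAPI NtDelayExecution(IN BOOLEAN Alertable, IN PLARGE_INTEGER DelayInterval) {}
What we need to focus on is how rdx, the second argument (DelayInterval) of NtDelayExecution, is calculated. The key is the following two instructions.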
imul    rcx, rdi, 2710h
neg     rcx
What do these two instructions mean? Translated into C++, they amount to:
interval = - (milliseconds * 0x2710);
From the dump we already know interval: it is the two's complement of the scaled milliseconds value, which is what the Binary: row below shows.
0:000> r
rax=0000000000000034 rbx=0000000000000000 rcx=0000000000000000
rdx=00000000007fd650 rsi=0000000000000000 rdi=0000000000000001
rip=00007ffc7d45fcf4 rsp=00000000007fd628 rbp=00000000bf1efcf8
 r8=00000000007fd628  r9=00000000bf1efcf8 r10=0000000000000000
r11=0000000000000246 r12=0000000000000000 r13=0000000000000798
r14=000000003bd064b0 r15=00000000bf1efce0
0:000> dp 00000000007fd650 L1
00000000`007fd650  ffffffff`ffffd8f0
0:000> .formats ffffffff`ffffd8f0
Evaluate expression:
  Hex:     ffffffff`ffffd8f0
  Binary:  11111111 11111111 11111111 11111111 11111111 11111111 11011000 11110000
  ...
So how do we recover milliseconds? The two's complement of a two's complement is the original value, so negate the interval and then divide by 0x2710 to get milliseconds.
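The decoding step can be sketched in C++ as below. This is only an illustration of the arithmetic, not the actual KERNELBASE source: the names kTicksPerMillisecond, to_delay_interval, and to_milliseconds are made up for this example, and it assumes the usual NT convention that the interval is counted in 100-nanosecond units, with a negative value meaning a delay relative to now.

```cpp
#include <cstdint>

// Assumption: one millisecond is 0x2710 (10000) units of 100 ns,
// which is the constant seen in `imul rcx, rdi, 2710h`.
constexpr std::int64_t kTicksPerMillisecond = 0x2710;

// Forward direction (hypothetical helper): mirrors `imul` followed by `neg`,
// i.e. interval = -(milliseconds * 0x2710).
constexpr std::int64_t to_delay_interval(std::uint32_t milliseconds) {
    return -(static_cast<std::int64_t>(milliseconds) * kTicksPerMillisecond);
}

// Reverse direction (hypothetical helper): reinterpret the raw 64-bit value
// as signed, negate it to undo the two's complement, then divide by 0x2710.
constexpr std::int64_t to_milliseconds(std::uint64_t raw_interval) {
    return -static_cast<std::int64_t>(raw_interval) / kTicksPerMillisecond;
}

// Compile-time checks against the value read by `dp 00000000007fd650 L1`.
static_assert(to_delay_interval(1) == -10000,
              "1 ms encodes to -10000 ticks");
static_assert(static_cast<std::uint64_t>(to_delay_interval(1)) == 0xffffffffffffd8f0ULL,
              "-10000 has the bit pattern shown by .formats");
static_assert(to_milliseconds(0xffffffffffffd8f0ULL) == 1,
              "the dumped interval decodes to 1 ms");
```

Under these assumptions, ffffffff`ffffd8f0 is -10000 in decimal, and 10000 / 0x2710 gives 1 millisecond. That appears to agree with rdi=0000000000000001 in the register dump, since SleepEx copies its millisecond argument into edi (mov edi, ecx).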

