I was asked to investigate why our application was sometimes forcefully terminated by Windows. Within the first few minutes I realized that Windows was terminating the application with STATUS_HEAP_CORRUPTION. I then asked our QA to follow https://docs.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps and set DumpType to 2 to collect full crash dumps. Unfortunately, after checking a few of these dumps, I found that the application was crashing in pretty much random places: WPF rendering, the DirectX engine, or sometimes even the network stack. Clearly, the corruption had happened some time before the crash, so the dumps contained no useful information. I needed a different approach.
Windows has built-in features that help find a few common problems. To enable them, you need to install Debugging Tools for Windows from the Windows SDK and run gflags. Here is the link: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/gflags-commands.
I executed the following command:
gflags.exe /p /enable myapp.exe /full
Then I restarted my application. This command enables heap validation, corruption checks and many other things. One of the coolest is that it catches buffer overwrites: every allocation is placed at the end of a page, and the next page is marked No Access. As a result, if code allocates 100 bytes but writes 101 bytes, it crashes with an Access Violation right at the offending write.
Please note that if you start gflags without parameters, it runs with a UI that lets you set most things, but at least a few options are missing.
Anyway, after I executed that command and ran my application, I immediately caught a crash. That one was easy, because it was a memory overwrite, but unfortunately it was only my first problem.
The second problem was quite tricky to reproduce, but eventually I was able to get this:
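To make the page layout concrete, here is a minimal sketch that only models the address arithmetic; it does not touch the real heap. The page address is hypothetical, and the alignment parameter is an assumption: as far as I know, page heap keeps the start address aligned by default (gflags has an /unaligned option for this), so a one-byte overrun may land in padding instead of the guard page.

```python
PAGE_SIZE = 0x1000  # 4 KiB pages, typical for x86/x64 Windows


def place_block(page_base, size, alignment=1):
    """Model where a block of `size` bytes lands when it is pushed against
    the end of the page starting at `page_base`.

    alignment=1 is the idealized picture (the block ends exactly at the page end).
    A larger alignment rounds the start address down, which leaves padding
    between the end of the block and the guard page.
    """
    if not 0 < size <= PAGE_SIZE:
        raise ValueError("this model only covers blocks that fit in one page")
    guard_page = page_base + PAGE_SIZE  # first byte of the No Access page
    user_addr = (guard_page - size) // alignment * alignment
    return user_addr, guard_page


def bytes_until_fault(user_addr, guard_page):
    """How many bytes can be written from user_addr before hitting the guard page."""
    return guard_page - user_addr


page = 0x10000  # hypothetical page address, for illustration only

# Idealized case: the block ends exactly where the guard page begins.
addr, guard = place_block(page, 100)
print(hex(addr), bytes_until_fault(addr, guard))

# Assumed 16-byte alignment: the start address is rounded down.
addr16, guard16 = place_block(page, 100, alignment=16)
print(hex(addr16), bytes_until_fault(addr16, guard16))
```

In the idealized case exactly 100 bytes fit, so writing byte 101 faults. With the assumed 16-byte alignment 112 bytes fit, so a small overrun would not fault immediately and would have to be caught by the marker checks instead.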
VERIFIER STOP 0000000000000010: pid 0x63D8: corrupted start stamp
0000000002001000 : Heap handle
0000000158356F60 : Heap block
00000000000000A0 : Block size
00000000ABCDBBBA : Corrupted stamp
And the application stopped here:
Let me explain what happens here. With these special flags, every memory allocation reserves a bit more memory than requested, and Windows puts markers before and after the memory block. Windows later checks these markers; if they have changed, there was a memory overwrite or underwrite. It can also mean that a corrupted pointer wrote to the wrong place.
You can execute the following command in WinDbg to see the possible values of these markers:
!heap -p -?
In my case, it looks like the block had already been freed (stamp value ABCDBBBA).
It took me a while, but I found a WinDbg command that returns some information about that allocation:
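For quick reference, the stamp values can be put into a small lookup table. This is a sketch: the values are transcribed from what !heap -p -? prints as far as I remember it, so check them against the output of your own debugger version. The function name is made up for this example.

```python
# Header stamps of page heap blocks, as listed by `!heap -p -?`.
# Transcribed by hand; verify against your own debugger output.
STAMPS = {
    0xABCDAAAA: "light page heap, allocated block (start stamp)",
    0xDCBAAAAA: "light page heap, allocated block (end stamp)",
    0xABCDAAA9: "light page heap, freed block (start stamp)",
    0xDCBAAAA9: "light page heap, freed block (end stamp)",
    0xABCDBBBB: "full page heap, allocated block (start stamp)",
    0xDCBABBBB: "full page heap, allocated block (end stamp)",
    0xABCDBBBA: "full page heap, freed block (start stamp)",
    0xDCBABBBA: "full page heap, freed block (end stamp)",
}


def describe_stamp(value):
    """Return a human-readable description of a page heap header stamp."""
    return STAMPS.get(value, "unknown stamp (possibly overwritten)")


# The "Corrupted stamp" value from the verifier stop above.
print(describe_stamp(0xABCDBBBA))
```

Read this way, the "corrupted" stamp is not random garbage. It is a valid stamp for a block in a different state than the one the verifier expected.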
!heap -p -a 0000000158356F60
I got this response:
address 0000000158356f60 found in
_DPH_HEAP_ROOT @ 2001000
in busy allocation ( DPH_HEAP_BLOCK: UserAddr UserSize - VirtAddr VirtSize)
158b76f08: 158356f30 d0 - 158356000 2000
To be honest, I do not really know what WinDbg found, because as you can see the address is off by 0x30 and the user size is different. In any case, none of these addresses could be read. It looks like that memory was de-committed. As you remember, I found earlier that the block looks freed, so it seems logical that the memory for that block is de-committed.
But why does Windows check the header of a block that was de-committed? Good question. Keep reading and the answer will be revealed shortly. I spent quite some time trying to figure out what else I could do. I also found a useful command to see the heap block:
But it did not show anything useful to help me solve my problem. At that moment, my hypothesis was that the heap block header was somehow corrupted and I just could not find the proper WinDbg command to show me what was really going on. Another hypothesis was that this was a random crash or that memory got corrupted elsewhere, but I was able to reproduce the crash a few times, and every time it crashed at exactly the same place.
I also checked the other threads and found that another thread was executing code close to this thread’s code, but I didn’t find anything suspicious.
As I mentioned before, by default after you enable heap checks, Windows places each memory block at the end of a page to reveal overwrites, and the application crashes in the code that does the overwrite. But it is possible to change this to track underwrites instead. Here is the command:
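The numbers in the two outputs can be compared with a bit of arithmetic. The sketch below uses only values printed above. The 4 KiB page size and the reading of VirtSize as one data page plus one guard page are my assumptions.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages

# From the VERIFIER STOP message
reported_block = 0x158356F60  # "Heap block"
reported_size = 0xA0          # "Block size"

# From the !heap -p -a output
found_user_addr = 0x158356F30  # UserAddr
found_user_size = 0xD0         # UserSize
virt_addr = 0x158356000        # VirtAddr
virt_size = 0x2000             # VirtSize

offset = reported_block - found_user_addr      # distance between the two addresses
reported_end = reported_block + reported_size  # end of the block the verifier complained about
found_end = found_user_addr + found_user_size  # end of the block WinDbg found
last_page = virt_addr + virt_size - PAGE_SIZE  # start of the last page of the region

print(hex(offset), hex(reported_end), hex(found_end), hex(last_page))
```

All three end values come out as 0x158357000: both blocks end exactly where the last page of the region starts, and they differ only in size (0xA0 versus 0xD0, which is the 0x30 offset). One possible reading is that the same page first held a 0xA0-byte block and later a 0xD0-byte block, for example after a reallocation. That is an inference from the numbers, not something the debugger states.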
gflags.exe /p /enable myapp.exe /full /backwards
Please note that you can check either overwrites or underwrites, but not both at once. Also, it is not possible to set this option from the UI.
After I restarted the application, it crashed at this rather strange place:
cmp dword ptr [rdx-40h],0ABCDAAAAh ds:00000002`009f9fc0=????????
The call stack in my application was the same as last time, but this time it crashed here:
If you ran the !heap -p -? command, you will see that 0xABCDAAAA is the stamp of a “Light page heap allocated block” (its header starts with 0xABCDAAAA and ends with 0xDCBAAAAA). My first thought was that internal heap data was corrupted, and I found one more command:
gflags.exe /p /enable myapp.exe /full /backwards /protect
By the way, this setting is also missing from the UI. Anyway, this command changed nothing, and it still crashed at the same place. Then I decided to examine what was located at rdx-40h, but as some of you have already guessed, the page before rdx has the No Access flag to detect underwrites. But why does Windows check it? I again started to think that internal heap structures were corrupted, but I could not find any other commands to harden the checks.
I spent quite a bit of time trying different things, but finally I decided to check the other threads again and bingo, another thread was executing exactly the same code:
After some investigation I found that two threads were reallocating exactly the same string at exactly the same time. My understanding is that one thread was releasing the block while the other thread was checking it. As a result, the second thread believed that the block had a different type, so Windows checked it as a regular heap block instead of a page heap block and crashed.
But there is another question: why could I not see the second thread modifying the same string with the regular checks? To be honest, I do not know the answer. But before I saw the “VERIFIER STOP” message and the debugger stopped on the error, I saw a few other events from the debugger in the Command window. So I think the second thread had executed a few hundred more instructions by then, which is why I could not see it at the same place. Perhaps VerifierCaptureContextAndReportStop does additional verification or capturing. With the backwards option the application crashes instantly, and that allowed me to find the problem.
I hope this helps someone.
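The faulting address from the cmp instruction can be checked with the same kind of arithmetic. This sketch assumes 4 KiB pages and that rdx holds the address of the user block; both are my assumptions, and the rest is taken from the instruction above.

```python
PAGE_SIZE = 0x1000    # assumed 4 KiB pages
HEADER_OFFSET = 0x40  # from the operand [rdx-40h]

faulting_address = 0x2009F9FC0  # ds:00000002`009f9fc0 in the crash output

rdx = faulting_address + HEADER_OFFSET           # value rdx must have had
rdx_is_page_aligned = rdx % PAGE_SIZE == 0       # block starts at a page boundary?
pages_apart = rdx // PAGE_SIZE - faulting_address // PAGE_SIZE

print(hex(rdx), rdx_is_page_aligned, pages_apart)
```

rdx works out to 0x2009FA000, which sits exactly on a page boundary, as expected for a block placed at the start of a page by /backwards. The header read at rdx-40h therefore falls into the preceding page, which is the No Access page.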