We had an incident last Friday: a couple of Hyper-V cluster nodes blue-screened and rebooted themselves. With the Windows debugging tool and some knowledge of clustering, I think I have figured it out.
1) Run the Windows debugging tool and set the symbol path: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
2) Copy the dump file to your local machine; it is located at \\Server\c$\windows\Minidump\. Then run the following commands to find the crashed process:
!analyze -v
lmvm netft
According to the following records, the crashed process is rhs.exe (the Resource Hosting Subsystem of the cluster):
!process fffffa80291d5b30
It is expected that the Cluster service reboots Windows when a critical process crashes. You can confirm this by running the following command on a cluster node and checking the value of HangRecoveryAction:
cluster /cluster:<cluster-name> /prop
3) Now we know the issue is related to the cluster. Generate the cluster log by running the following command, then copy the log file to your local machine. The file is located at: \\Server\c$\Windows\Cluster\Reports\Cluster.log
Cluster log /g
4) Let’s see what happened at that time (I use trace32 to open the log file). The ISO-Images disk was deadlocked for some reason. (I confirmed with the network admin that an abrupt network outage happened that day around that time.) Why did this happen only to the ISO-Images disk (it is in the Available Storage group), while all the CSV disks were fine? I think the only shared disks in a Hyper-V cluster should be CSVs, so we decided to remove the ISO-Images disk to prevent the issue from happening again.
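If you would rather not scroll through the whole log, a small script can narrow it down to the minutes around the crash. This is only a rough sketch in Python: it assumes each log line carries a timestamp of the form ::YYYY/MM/DD-HH:MM:SS.mmm after the process and thread IDs, and the function name lines_near and its parameters are made up for illustration. Pass the incident time in GMT, since that is what the log uses.

```python
import re
from datetime import datetime, timedelta

# Assumption: a log line looks like
#   <pid>.<tid>::2012/03/09-14:23:45.123 INFO  <message>
# Adjust the pattern if your Cluster.log is laid out differently.
STAMP_RE = re.compile(r"::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})")


def lines_near(lines, incident_gmt, minutes=5, keyword=None):
    """Yield log lines stamped within +/- `minutes` of `incident_gmt`.

    `incident_gmt` is a naive datetime expressed in GMT.
    If `keyword` is given, only lines containing it (case-insensitive) are kept.
    """
    window = timedelta(minutes=minutes)
    for line in lines:
        match = STAMP_RE.search(line)
        if match is None:
            continue  # skip lines that have no timestamp of their own
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        if abs(stamp - incident_gmt) > window:
            continue
        if keyword is None or keyword.lower() in line.lower():
            yield line.rstrip("\n")
```

For example, open the copied Cluster.log with open("Cluster.log", errors="replace") and loop over lines_near(f, datetime(2012, 3, 9, 14, 25), keyword="ISO-Images"). The date here is a placeholder; use your own incident time.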
Don’t forget that the cluster log timestamps are in GMT; you need to convert them to your local time.
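The conversion is easy to get wrong by hand, so here is a minimal Python sketch of it. It assumes the timestamp text has the form YYYY/MM/DD-HH:MM:SS.mmm; the function name cluster_time_to_local is hypothetical, and the sample date in the comment is a placeholder rather than the real incident time.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of the timestamp text in Cluster.log, e.g. 2012/03/09-14:23:45.123
CLUSTER_LOG_TIME_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def cluster_time_to_local(stamp, tz=None):
    """Convert a GMT cluster log timestamp to local time.

    With tz=None the system's local time zone is used; pass a tzinfo
    (for example timezone(timedelta(hours=-5))) to target a fixed offset.
    """
    gmt_time = datetime.strptime(stamp, CLUSTER_LOG_TIME_FORMAT)
    gmt_time = gmt_time.replace(tzinfo=timezone.utc)  # mark the parsed value as GMT/UTC
    return gmt_time.astimezone(tz)


# Example: print(cluster_time_to_local("2012/03/09-14:23:45.123"))
```

Run it on the workstation where you read the log, so the default conversion uses the time zone you actually care about.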