We have a host that recently hit the purple screen twice, once with GP Exception 13 and once with PF Exception 14. I am not sure what the root cause is; it could be faulty hardware or a software bug. This server has been running for more than 6 months without any such issue, and we have not made any changes recently, so I suspect faulty hardware. I ran a quick memory diagnostic and found nothing. For now I am leaving it running to see what happens next.
The purpose of this post is to keep a record of the issue. I will review and update it as I find clues; updates are added at the bottom. If you have seen this before or have a suggestion, please let me know.
Part 1: ESXi version. This is ESXi 5.0.0 Update 2.
Part 2: Error messages.
Part 3: The values in the CPU register at the time of the failure.
Part 4: The physical CPU that was running an operation at the time of the failure
Part 5: VMK uptime
Part 6: Stack trace shows what the VMkernel was doing at the time of the failure
Part 7: Core dump
Updates
[12/11/2013] The purple screen comes back with PF Exception 14.
The stack traces differ between the 3 purple screen failures, which suggests the software is not hitting the same error each time. I still suspect faulty hardware. A ticket has been opened with VMware.
I found a few MCE messages saying “Memory Controller Error”. An MCE (Machine Check Exception) is raised by the MCA (Machine Check Architecture) within the CPU, which detects and reports hardware errors.
~ # zcat /var/run/log/vmkernel.0.gz | grep MCE
2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1278: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1282: Status bits: “Memory Controller Error.”
TSC: 104284 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
TSC: 104284 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
0:00:00:05.582 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
0:00:00:06.572 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
0:00:00:06.573 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
0:00:00:06.573 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
~ # zcat /var/run/log/vmkernel.1.gz | grep MCE
~ # zcat /var/run/log/vmkernel.2.gz | grep MCE
0:00:00:05.583 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
0:00:00:06.574 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
0:00:00:06.575 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
0:00:00:06.575 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
TSC: 104424 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
0:00:00:05.585 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
0:00:00:06.578 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
0:00:00:06.579 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
0:00:00:06.579 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
~ # zcat /var/run/log/vmkernel.3.gz | grep MCE
2013-11-05T22:35:05.092Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank8: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
2013-11-05T22:35:05.092Z cpu26:8218)MCE: 1282: Status bits: “Memory Controller Error.”
2013-11-05T22:37:02.349Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
2013-11-05T22:37:02.349Z cpu26:8218)MCE: 1282: Status bits: “Memory Controller Error.”
2013-11-05T22:38:20.052Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank8: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
2013-11-05T22:38:20.052Z cpu26:8218)MCE: 1282: Status bits: “Memory Controller Error.”
TSC: 104048 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
0:00:00:05.583 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
0:00:00:06.574 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
0:00:00:06.575 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
0:00:00:06.575 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
TSC: 104424 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
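To see at a glance which CPUs and banks are reporting, the CMCI lines can be tallied. The following is only a rough sketch: the regular expression is written against the line format in the excerpts above, the function name `tally_cmci` is my own, and the sample lines are copied from those excerpts.

```python
import re
from collections import Counter

# Matches the CMCI lines as they appear in the vmkernel log excerpts above.
CMCI_RE = re.compile(
    r"^(?P<ts>\S+) cpu\d+:\d+\)MCE: \d+: CMCI on cpu(?P<cpu>\d+) "
    r"bank(?P<bank>\d+): Status:(?P<status>0x[0-9a-fA-F]+)"
)


def tally_cmci(lines):
    """Count CMCI events per (cpu, bank, status) and return a Counter."""
    counts = Counter()
    for line in lines:
        m = CMCI_RE.match(line.strip())
        if m:
            counts[(int(m["cpu"]), int(m["bank"]), m["status"])] += 1
    return counts


# Sample input: CMCI lines from the excerpts, plus one boot line that
# should be ignored by the pattern.
sample = [
    "2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1278: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.",
    "2013-11-05T22:35:05.092Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank8: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.",
    "2013-11-05T22:37:02.349Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.",
    "2013-11-05T22:38:20.052Z cpu26:8218)MCE: 1278: CMCI on cpu26 bank8: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.",
    "0:00:00:05.583 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18",
]

for (cpu, bank, status), n in sorted(tally_cmci(sample).items()):
    print(f"cpu{cpu} bank{bank} status={status}: {n} event(s)")
```

To run it against the real logs, the same function can be fed the lines of each `vmkernel.N.gz` file opened with Python's `gzip.open(path, "rt")`.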
I am currently running a full memory test, which will take more than 22 hours to complete…
[13/11/2013] The full memory test passed. I guess it might be the memory controller within the processor. I am going to open a ticket with IBM.
[14/11/2013] A call has been logged with IBM.
[25/11/2013] The logs were sent to IBM last week, but there has been no feedback so far. I ran a full hardware test and did not see any errors, so I decided to rebuild the host and see how it behaves.
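The Status value in the CMCI lines (0x900000400009008f) can also be decoded by hand to see what the CPU actually reported. Below is a minimal sketch based on my reading of the IA32_MCi_STATUS layout in Intel's Software Developer's Manual; the function and field names are my own, and the corrected-error count is only meaningful when bit 10 of MCG_CAP is set (it is in the 0x1000c18 value logged at boot).

```python
# Request types for the "memory controller error" compound code
# (bit pattern 0000 0000 1MMM CCCC in the low 16 bits of the status).
TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
    }
    # Only interpret MMM/CCCC when the code has the memory controller shape.
    if (mca_code & 0xFF80) == 0x0080:
        channel = mca_code & 0xF
        fields["transaction"] = TRANSACTIONS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return fields


for name, value in decode_mci_status(0x900000400009008F).items():
    print(f"{name}: {value}")
```

For this value the decode shows a valid, enabled, corrected (not uncorrected) error with no address or misc information, which agrees with the "Valid.Err enabled." text and the zero Misc and Addr fields in the log. The low 16 bits (0x008f) have the memory controller error shape with no channel specified, so the status alone does not point at a particular DIMM.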
[06/12/2013] Given that the host has not run into any issues after the rebuild, I believe the issue has been fixed.
Did you ever have the problem reoccur? I have this issue and VMware are saying memory controller error is hardware fault and needs Dell to fix