We have a new Windows 2012 R2 file server cluster running on ESXi 5.0 Update 2. A colleague told me that disk performance was terrible when he used robocopy to migrate a large amount of data to the new cluster. In Resource Monitor, the worst disk response time reached 36,000 ms!!
So I ran esxtop and found that the latency was coming from the kernel, not from the device.
The log also contained warning messages about the NMP path:
2014-04-04T03:12:50.824Z cpu52:8244)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x88 (0x41250142d600, 2134331) to dev "naa.60050768028081713c0000000000007d" on path "vmhba1:C0:T3:L46" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x6 0x2a 0x5. Act:FAILOVER
2014-04-04T03:12:50.824Z cpu52:8244)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60050768028081713c0000000000007d": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2014-04-04T03:12:50.824Z cpu52:8244)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60050768028081713c0000000000007d" is blocked. Not starting I/O from device.
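To see which devices are affected, the device IDs can be pulled out of the "is blocked" warnings. This is only a rough sketch: blocked_devices is a made-up helper name, it assumes the log text arrives on stdin with the device ID in plain double quotes, and the log path in the usage comment is an assumption about where the vmkernel messages live on your host.

```shell
# Hypothetical helper: read log text on stdin and print each device ID
# that NMP reported as blocked, one per line, without duplicates.
blocked_devices() {
  sed -n 's/.*NMP Device "\([^"]*\)" is blocked.*/\1/p' | sort -u
}

# Example usage (the path is an assumption, adjust it for your host):
# blocked_devices < /var/log/vmkernel.log
```

Lines that do not match the "NMP Device ... is blocked" pattern are ignored, so the whole log can be fed in unfiltered.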
Then I checked the storage path settings: the device was using Round Robin. According to VMware, PSP_RR is supported for MSCS only when the cluster is hosted on ESXi 5.5 (and the guest has to be Windows 2008 or 2012). Once I changed the path selection policy to Most Recently Used, performance improved right away. The disk response time inside Windows 2012 R2 dropped to 20 ms.
The command is:
esxcli storage nmp device set -d naa.60050768028081713c0000000000007d -P VMW_PSP_MRU
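If several shared LUNs need the same change, something like the following might save some typing. It is a sketch under assumptions: print_mru_cmds is a hypothetical helper name, and it deliberately only prints the commands so they can be reviewed before anything is run on the host.

```shell
# Hypothetical helper: print (do not run) one "set PSP to MRU" command
# for each device ID passed as an argument.
print_mru_cmds() {
  for dev in "$@"; do
    printf 'esxcli storage nmp device set -d %s -P VMW_PSP_MRU\n' "$dev"
  done
}

# Review the output first, then run the printed lines on the ESXi host.
print_mru_cmds naa.60050768028081713c0000000000007d
```

Afterwards, running esxcli storage nmp device list -d with the same device ID should report VMW_PSP_MRU as the Path Selection Policy.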
The reason is that MSCS uses SCSI-3 reservations on its disks. A SCSI-3 registration sent down one path allows the MSCS cluster to make SCSI-3 reservations only on that path. When PSP_RR later switches to another path, MSCS receives an error if it tries to make a reservation or issue other SCSI-3 commands. The good news is that this has been fixed in ESXi 5.5, where MSCS (Windows 2008 and 2012) can release the SCSI reservation from any path.