I think it is worth sharing what I learned today. We found that HA was failing on all hosts in a vSphere cluster. Reconfiguring a host for vSphere HA always failed with an ‘Operation timed out’ error. I checked fdm.log on a couple of servers and found the same message on all of them:
error ‘Election’ opID=SWI-cb1a0483] ReadMsg: [120 times] Wrong fault domain ID: 9148BCE8-A6E7-45D7-B591-76C15A3F6470-26-9e10b65-my-vCenter!= 9148BCE8-A6E7-45D7-B591-76C15A3F6470-26-8284ae7-my-vCenter from 192.168.1.102
My understanding of this message is that the master election failed because the local host and the remote host (192.168.1.102) had different fault domain IDs. The weird thing is that 192.168.1.102 was a host that had been placed into maintenance mode. So my guess was that 192.168.1.102 had been the master in the fault domain, but somehow failed to tell the other hosts when it left the domain and entered maintenance mode. To test my guess, I took the host out of maintenance mode, and all the red HA failure alarms disappeared right away!
I checked the HA status and everything looked good: a new master had been elected successfully. Then I placed 192.168.1.102 into maintenance mode again, and HA on all the other hosts kept functioning well.
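
If you need to check fdm.log on many hosts, a small script can pull out the two fault domain IDs and the sender's IP from lines like the one quoted above. This is only a minimal sketch in Python: the pattern is inferred from that single log line, so other builds may format the message differently, and the function name find_fault_domain_mismatches is my own, not part of any VMware tool.

```python
import re

# Pattern inferred from the one fdm.log line quoted in this post (an assumption;
# other vSphere builds may word or space the message differently).
MISMATCH_RE = re.compile(
    r"Wrong fault domain ID:\s*(?P<local>\S+?)\s*!=\s*(?P<remote>\S+)"
    r"\s+from\s+(?P<sender>\d{1,3}(?:\.\d{1,3}){3})"
)


def find_fault_domain_mismatches(lines):
    """Return a list of (local_id, remote_id, sender_ip) tuples,
    one for each line that reports a fault domain ID mismatch."""
    results = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            results.append(
                (match.group("local"), match.group("remote"), match.group("sender"))
            )
    return results


if __name__ == "__main__":
    sample = (
        "error 'Election' opID=SWI-cb1a0483] ReadMsg: [120 times] "
        "Wrong fault domain ID: "
        "9148BCE8-A6E7-45D7-B591-76C15A3F6470-26-9e10b65-my-vCenter!= "
        "9148BCE8-A6E7-45D7-B591-76C15A3F6470-26-8284ae7-my-vCenter "
        "from 192.168.1.102"
    )
    for local_id, remote_id, sender in find_fault_domain_mismatches([sample]):
        print(f"{sender} advertises {remote_id}, local host expects {local_id}")
```

If every mismatch points at the same sender IP, as it did for me, that host is a good first suspect.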