vSphere 5.0 Clustering Basic Design Principles


1) Although HA is not dependent on DNS any more, it is still recommended to register the hosts with their FQDN.

2) Ensure syslog is correctly configured and log files are offloaded to a safe location, so that a root cause analysis can be performed if disaster strikes.

3) In a stateless environment, ensure vCenter and Auto Deploy are highly available, as the recovery time of your virtual machines might depend on them.

4) To maximize the chance of restarting virtual machines after a failure, we recommend masking datastores on a cluster basis. Although sharing datastores across clusters will work, it increases complexity from an administrative perspective.

5) Network heartbeating is key for determining the state of a host. Ensure the management network is highly resilient to enable proper state determination.

6) Datastore heartbeating adds a new level of resiliency but is not the be-all and end-all. In converged networking environments, the use of datastore heartbeating adds little value, because a NIC failure may result in both the network and storage becoming unavailable.

7) Virtual machines can be dependent on the availability of agent virtual machines or other virtual machines. Although HA will do its best to ensure all virtual machines are started in the correct order, this is not guaranteed. Document the proper recovery process.

8) Configuring restart priority of a virtual machine is not a guarantee that virtual machines will actually be restarted in this order. Ensure proper operational procedures are in place for restarting services or virtual machines in the appropriate order in the event of a failure.

9) Before upgrading an environment to later versions, ensure you validate the best practices and default settings. Document them, including justification, to ensure all people involved understand your reasons.

10) Select a reliable secondary isolation address. Try to minimize the number of “hops” between the host and this address.

11) Take advantage of some of the basic features vSphere has to offer like NIC teaming. Combining different physical NICs will increase overall resiliency of your solution.

12) Know your network environment. Talk to the network administrators and ensure advanced features like Link State Tracking are used when possible to increase resiliency.

13) Be careful with reservations. If there is no need to set them on a per-virtual machine basis, don’t configure them, especially when using the “Host failures cluster tolerates” admission control policy. If reservations are needed, resort to resource pool-based reservations.

14) Avoid using advanced settings to decrease the slot size, as doing so could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in virtual machine size and reservations, we recommend using the percentage-based admission control policy.

15) When using admission control, balance your clusters and be conservative with reservations, as reservations lead to decreased consolidation ratios.

16) Although vSphere 5.0 will use DRS to try to accommodate the resource requirements of your largest virtual machine, this cannot be guaranteed. Do the math: verify that any single host has enough resources to power on your largest virtual machine. Also take restart priority into account for this virtual machine.

17) Admission control guarantees enough capacity is available for virtual machine failover. As such we recommend enabling it.

18) Do the math, and take customer requirements into account. We recommend using a “percentage” based admission control policy, as it is the most flexible.

19) In order to avoid wasting resources we recommend carefully selecting your N+X resiliency architecture. Calculate the required percentage based on this architecture.

20) Align das.iostatsInterval with the failure interval.

21) VM and Application monitoring can substantially increase availability. It is part of the HA stack and we strongly recommend using it!

22) We recommend enabling DRS to achieve higher consolidation ratios.

23) Configure vMotion to fully benefit from the capabilities of DRS.

24) Set automation level to Fully automated to fully benefit from DRS capabilities.

25) Separate VDI workloads and server workloads and assign different clusters to each workload to reduce the DRS invocations.

26) When sizing your environment, take virtual machine sizes into account when designing the network infrastructure. Having sufficient bandwidth for vMotion will result in faster migration times.

27) By enabling EVC, DRS-FT integration will be supported.

28) Do not use resource pools as a folder structure in the Hosts and Clusters view; use the appropriate folder view instead.

29) Attempt to keep the resource pool tree depth to a minimum.

30) Deploying virtual machines at the same hierarchical level as resource pools is not recommended.

31) Set per-VM reservations only if a virtual machine absolutely requires guaranteed resources.

32) We recommend right-sizing resource pool-level reservations to avoid unnecessarily reducing the pool of unreserved resources. Adjust resource pool-level reservations according to the requirements of the current virtual machines.

33) Memory overhead reservations need to be taken into account while designing the cluster and resource pool structure.

34) Enable expandable reservation if no reservations are set at the resource pool-level.

35) When calculating the limit on a resource pool, take both reservations and memory overhead reservations into account.

36) When sizing vCenter take into account the number of clusters and virtual machines. DRS migration calculations for large numbers of virtual machines can impact the performance of vCenter.

37) The default moderate migration threshold provides a higher tolerance for migrations, while offering sufficient balance. The algorithm considers the threshold and creates only recommendations that are worthwhile according to the tolerance. It is typically aggressive enough to maintain workload balance across hosts without creating unnecessary overhead caused by too-frequent migrations.

38) Size a virtual machine according to its current or near-future workload.

39) When designing a DRS cluster, take the possibilities of vMotion into account. By providing enough bandwidth, the cluster can reach a balanced state more quickly, resulting in better resource allocation (performance) for the virtual machines.

40) Use VM-Host and VM-VM affinity rules sparingly, as rules can reduce the effectiveness of the load balancing calculation. The DRS algorithm has less choice when rules are configured.

41) Leave virtual machine automation mode set to default to minimize administration overhead and possibility of human error. Set automation level to Manual instead of Partially Automated if more control over placement is required.

42) Register templates on a single host and disable DPM on this host.

43) Enable EVC when adjusting MinPoweredOnCapacity settings with heterogeneous CPU/Memory configurations inside a cluster.

44) Use homogeneous clusters, as DPM will operate more efficiently.

45) Do not set DRS to the conservative migration threshold if DPM is required.

46) Even when array based features like deduplication are used, we recommend enabling Storage DRS. Its initial placement feature will help find the best location for your virtual machine.

47) Validate your array supports the Thin Provisioning primitive before migrating virtual machines.

48) Confirm with your storage vendor that they are adhering to the T10 standards to ensure VAAI offload capabilities like full copy, used by Storage vMotion, can be leveraged.

49) Understand the possible impact of upgrading VMFS-3 volumes to VMFS-5. In the case where various block sizes were used, migration to newly-created VMFS-5 datastores should be considered.

50) Leverage information provided through VASA to create your Datastore Clusters. This will help ensure that your virtual machines are provisioned on the correct storage tier.

51) Define VM Storage Profiles to ease provisioning of virtual machines and the associated virtual disks. Define tiers based on your SLAs with your (internal) customers.

52) Use the tools provided to make the right decisions during the creation of your datastore clusters and the provisioning process. This will reduce the chances of human error and unexpected behavior.

53) Design for performance and efficiency by using a single blocksize for all volumes. Remember that VMFS-3 volumes which are upgraded will carry over the blocksize.

54) When acquiring new storage hardware, verify the device is VAAI capable or will be in the near future. Leveraging VAAI will increase efficiency and decrease the overall time to complete storage tasks.

55) As SIOC is responsible for applying fairness during contention, make use of your array’s capabilities: consult your storage vendor and, when supported, set the queue depth to a minimum of 64.

56) Avoid mixing datastores with different block sizes.

57) Although SDRS is capable of avoiding vMotion incompatibilities and will avoid creating CPU or memory imbalances, it is recommended to connect each datastore to each ESXi host in the cluster. Failing to do so may result in compatibility islands that introduce constraints for SDRS and make it less efficient.
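
Principles 16, 18 and 19 ask you to do the math for admission control. The arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a VMware tool or API: the function names, the worst-case choice of reserving capacity equal to the largest hosts, and the sample figures are all hypothetical, and real sizing should use the capacity and reservation numbers from your own cluster.

```python
import math
from typing import Sequence


def failover_percentage(host_capacities: Sequence[float], host_failures: int) -> int:
    """Percentage of cluster capacity to reserve for an N+X design.

    Assumption (hypothetical, worst case): the X hosts that fail are the
    largest ones. The result is rounded up to a whole percent.
    """
    if not 0 < host_failures < len(host_capacities):
        raise ValueError("host_failures must be between 1 and len(hosts) - 1")
    total = sum(host_capacities)
    # Capacity lost if the X largest hosts fail at the same time.
    worst_case_loss = sum(sorted(host_capacities, reverse=True)[:host_failures])
    return math.ceil(100 * worst_case_loss / total)


def largest_vm_fits(host_free: Sequence[float],
                    vm_reservation: float,
                    vm_overhead: float = 0.0) -> bool:
    """True if at least one host has enough unreserved capacity for the
    virtual machine's reservation plus its memory overhead reservation."""
    required = vm_reservation + vm_overhead
    return any(free >= required for free in host_free)
```

For example, calling failover_percentage([64, 64, 64, 64], 1) on four equal hosts in an N+1 design yields 25, while a cluster with one host twice the size of the others needs a larger percentage. The second function reflects the point of principle 16: the percentage policy reserves capacity across the cluster, so it is still worth checking that a single host can hold the largest virtual machine.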
