Five Silly VMware Administrator Mistakes


by Mike Nelson

VMware administrator mistake No. 1: Virtual machine renames
This mistake is a classic. In vCenter, it’s very easy to rename virtual machines (VMs): simply right-click the guest object, select Rename and type a new name.

But that process just renames the object pointer in the vCenter database. The directories and files associated with that guest still carry the old name. So it’s easy for a VMware administrator, in the midst of quickly cleaning up the datastores, to delete a machine directory and its files in one click, especially if he or she doesn’t match the guests to the directories. I’ve seen it happen, and the aftermath isn’t pretty.
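
If you want to spot renames that left old directory names behind before you start deleting anything, you can compare each VM’s display name to the path of its .vmx file. Here is a minimal sketch using the pyVmomi SDK; the vCenter hostname and credentials are placeholders, and error handling is omitted.

# Minimal pyVmomi sketch: print each VM's display name next to its .vmx path,
# so a VM renamed only in vCenter (old directory name on disk) is easy to spot.
# The vCenter host, user and password below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience; use valid certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        vmx_path = vm.config.files.vmPathName if vm.config else "unknown"
        print(f"{vm.name:30s} -> {vmx_path}")
    view.Destroy()
finally:
    Disconnect(si)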

VMware administrator mistake No. 2: Cramming LUNs
At a conference many years back, I attended a session on a new feature in VMware ESX 3. The presenter created a 100 GB logical unit number (LUN) on a storage area network and presented it to a two-node cluster, which he used for the demo machines.

He had three servers on the LUN, each with 32 GB drives and a shared ISO data store of 2 GB. Now do the math: (32 GB x 3) + 2 GB = 98 GB. With a 100 GB LUN, he had more than enough room, right?

One by one, he fired up all three machines. When the third one booted, all of them displayed the Purple Screen of Death. He had forgotten about the swap files, which are created when each VM powers on. Those files filled the LUN. It was even funnier because he had no idea why it happened, and he tried to start the machines again.

And yes, he was a VMware engineer.
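
The math only works out if you remember that each VM also gets a .vswp swap file at power-on, sized at its configured memory minus any memory reservation. A quick back-of-the-envelope check in Python, assuming for illustration that each demo VM had 4 GB of vRAM and no reservation (the session didn’t say), shows how the LUN overflows:

# Back-of-the-envelope datastore math, with and without the VM swap files.
# The 32 GB disks, the 2 GB ISO store and the 100 GB LUN come from the story;
# the 4 GB of vRAM per VM is an assumption for illustration only.
LUN_GB = 100
VM_COUNT = 3
DISK_GB_PER_VM = 32
ISO_STORE_GB = 2
VRAM_GB_PER_VM = 4        # assumed
MEM_RESERVATION_GB = 0    # no reservation, so the swap file equals the full vRAM size

# Each VM gets a .vswp file at power-on sized (configured memory - reservation).
swap_gb_per_vm = VRAM_GB_PER_VM - MEM_RESERVATION_GB

without_swap = VM_COUNT * DISK_GB_PER_VM + ISO_STORE_GB
with_swap = without_swap + VM_COUNT * swap_gb_per_vm

print(f"Without swap files: {without_swap} GB of {LUN_GB} GB")  # 98 GB -- looks fine
print(f"With swap files:    {with_swap} GB of {LUN_GB} GB")     # 110 GB -- LUN overflows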

VMware administrator mistake No. 3: Network names
I once worked as a consultant on a Citrix Systems Inc. project for a small organization. One day I got a call from the organization’s storage person, who was in charge of the new virtual environment. He was having problems with vMotion, and Distributed Resource Scheduler (DRS) was generating all kinds of errors. (Did I mention he was a storage guy?)

So I went into vCenter and found that the ESX hosts weren’t all set up with the same networking. Each virtual switch had a different name on each host, which is a common mistake when the hosts are not set up at the same time or when no naming standards are followed (or even exist). vMotion requires that the virtual switch and port group names match on all the hosts in a DRS cluster.
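
A quick audit can catch this before DRS starts complaining. The sketch below, again using pyVmomi with placeholder connection details, prints the virtual switch and port group names on every host in each cluster so mismatches are obvious at a glance.

# Minimal pyVmomi sketch: list the virtual switch and port group names on every
# host in each cluster, so naming mismatches that break vMotion/DRS stand out.
# The vCenter host, user and password below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        print(f"Cluster: {cluster.name}")
        for host in cluster.host:
            vswitches = sorted(vs.name for vs in host.config.network.vswitch)
            portgroups = sorted(pg.spec.name for pg in host.config.network.portgroup)
            print(f"  {host.name}: vSwitches={vswitches} port groups={portgroups}")
    view.Destroy()
finally:
    Disconnect(si)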

VMware administrator mistake No. 4: Honeymoons and roles
I know a VMware administrator who had to fix a virtualization issue on his honeymoon in Mexico. Before he left, he decided to lock down the infrastructure by removing people from roles in vCenter.

But he removed the roles from the permissions on the vCenter object itself, not just the VMs or the cluster. Because permissions at that level propagate down the inventory, the change cut off access for everyone who had it.
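
Before removing anything at that level, it’s worth dumping what is actually defined there. A minimal pyVmomi sketch along these lines (placeholder connection details, no error handling) lists the permissions set directly on the vCenter root object and the roles they map to:

# Minimal pyVmomi sketch: list the permissions defined directly on the vCenter
# root object and the roles they map to, before changing any of them.
# The vCenter host, user and password below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    authz = content.authorizationManager
    roles = {r.roleId: r.name for r in authz.roleList}
    # Permissions set on the root folder propagate to everything beneath it.
    for perm in authz.RetrieveEntityPermissions(content.rootFolder, False):
        role_name = roles.get(perm.roleId, perm.roleId)
        print(f"{perm.principal}: role={role_name} propagate={perm.propagate}")
finally:
    Disconnect(si)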

For the record, I heard this story from the new bride, who was not at all happy about the interruptions.

VMware administrator mistake No. 5: Network interface card wipeout
I could not wait for VMware’s Host Profiles. I heard about the feature a year before it materialized, and I was champing at the bit to quickly deploy standardized hosts in an infrastructure with more than 500 hosts. But when I finally used Host Profiles, it all went very wrong, very fast.

I generated a new host profile and tested it on a lab host. It went well, and the host didn’t appear to have any issues after I tested a few VMs on it. So I decided to try it in a 16-host cluster in a production environment.

Soon after, vCenter reported that everything had gone well. I was smiling for about five seconds, and then the alarms went off. All my guests and hosts were inaccessible through the network. One of the issues with Host Profiles in ESX is that no matter what the network interface card (NIC) speed settings are on the host the profile was captured from, all the hosts provisioned from that profile are set to auto-negotiate by default. (VMware calls it a feature, of course.)

That setting won’t work on a network where every switch port is hardcoded to 1000/Full with no failback. (The lab network had auto-negotiate on its ports, so it worked there.) Applied to all the hosts, the setting brought down the whole cluster. I had to manually redo the 14 NICs on each host, which made for a very long day.
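
If you want to verify NIC speed settings across a cluster before (or after) applying a profile, a sketch like this pyVmomi script (placeholder connection details again) reports each physical NIC’s configured speed, where an empty spec means auto-negotiate, next to its actual link speed on every host.

# Minimal pyVmomi sketch: report each physical NIC's configured speed/duplex
# (spec.linkSpeed of None means auto-negotiate) and its actual link speed per host.
# The vCenter host, user and password below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        for pnic in host.config.network.pnic:
            if pnic.spec.linkSpeed is None:
                configured = "auto-negotiate"
            else:
                duplex = "Full" if pnic.spec.linkSpeed.duplex else "Half"
                configured = f"{pnic.spec.linkSpeed.speedMb}/{duplex}"
            actual = f"{pnic.linkSpeed.speedMb} Mb" if pnic.linkSpeed else "link down"
            print(f"{host.name} {pnic.device}: configured={configured}, actual={actual}")
    view.Destroy()
finally:
    Disconnect(si)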
