Be careful with snapshots, whether they are taken intentionally or unintentionally: they not only degrade virtual machine performance but can also prevent the host from being added to a vCenter DRS cluster. Most of the time we focus on the snapshots that are taken intentionally and pay less attention to those taken unintentionally, or, more precisely, not manually. Here is a real example that I experienced.
One night I found that a host had been disconnected from vCenter, and an error popped up when I tried to reconnect it within vCenter. According to VMware KB 2009217, this error is generally caused by a virtual machine running on the host having more than 32 snapshots.
Then I opened Snapshot Manager, and there were none! Here is the interesting part: I SSHed into the host and browsed to the virtual machine folder, and there were heaps of delta vmdk files, which means lots of snapshots. Why the inconsistency?
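The check I did by hand over SSH can also be scripted. Below is a minimal Python sketch that counts files that look like snapshot delta extents in a VM folder. The file-name pattern (`<vm>-00000N-delta.vmdk`, plus the `-sesparse.vmdk` variant) and the function names are my assumptions for illustration, and the limit of 32 is simply the figure quoted from the KB above. A VM with several disks gets one delta file per disk per snapshot, so treat the count as a rough indicator rather than an exact snapshot count.

```python
import re
from pathlib import Path

# Assumed naming pattern for snapshot delta extents,
# e.g. "myvm-000001-delta.vmdk" or "myvm-000001-sesparse.vmdk".
DELTA_PATTERN = re.compile(r"-\d{6}-(delta|sesparse)\.vmdk$")

# Figure quoted in the post from VMware KB 2009217.
SNAPSHOT_LIMIT = 32


def count_delta_files(vm_dir):
    """Return how many files in vm_dir have names that look like delta extents."""
    return sum(
        1
        for entry in Path(vm_dir).iterdir()
        if entry.is_file() and DELTA_PATTERN.search(entry.name)
    )


def exceeds_limit(vm_dir, limit=SNAPSHOT_LIMIT):
    """Return True when the delta-file count is above the given limit."""
    return count_delta_files(vm_dir) > limit
```

For example, `count_delta_files("/vmfs/volumes/datastore1/myvm")` would report the count for a VM folder at that (hypothetical) path.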
To fix it, I tried to consolidate the snapshots, but I received another error:
This virtual machine has more than 100 redo logs in a single branch of its snapshot tree. Deleting some of the snapshots or consolidating the redo logs will improve performance. The maximum number of redo logs supported is 255.
The consolidation failed because a file was locked. What was locking the file? Why was it locked? And why did this happen at night?
Backup! That was the first thing that came to my mind. We use TSM for backup, and the process is:
1) TSM VMware proxy server notifies vCenter to take a snapshot of the target VM.
2) TSM VMware proxy server mounts the vmdk file and makes a copy of it once the snapshot is taken. Any changes made after that are written to the delta vmdk file.
3) TSM VMware proxy server notifies vCenter to consolidate the snapshot and remove the delta file when the backup is finished.
What happens if the TSM VMware proxy server fails to unmount the vmdk file for some reason? The snapshot will not be consolidated, and the delta vmdk file will stay. Day after day, more and more snapshots pile up.
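To make the accumulation concrete, here is a toy Python model of the three backup steps above. It is only a sketch of the reasoning, not how TSM or vCenter actually work internally: the `ToyVM` class, the `nightly_backup` function, and the rule that a disk still mounted on the proxy blocks consolidation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Toy stand-in for a VM that only tracks its leftover delta files."""
    name: str
    deltas: list = field(default_factory=list)
    counter: int = 0

    def take_snapshot(self):
        """Step 1: a new snapshot adds one delta file."""
        self.counter += 1
        self.deltas.append(f"{self.name}-{self.counter:06d}-delta.vmdk")

    def consolidate(self, disk_locked):
        """Step 3: merge all deltas into the base disk unless the disk is locked."""
        if disk_locked:
            return False
        self.deltas.clear()
        return True


def nightly_backup(vm, unmount_ok):
    """Run one backup cycle; return True if consolidation succeeded."""
    vm.take_snapshot()
    # Step 2 (proxy mounts and copies the base disk) is not modelled.
    # Assumption: a disk left mounted on the proxy stays locked.
    disk_locked = not unmount_ok
    return vm.consolidate(disk_locked)
```

Running `nightly_backup(vm, unmount_ok=False)` for five nights leaves five delta files behind in this model, and the first night with `unmount_ok=True` clears them all, which mirrors what I saw.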
I logged in to the TSM VMware proxy server and, as expected, the target VM's VMDK was still mounted there. I unmounted the VMDK file and ran the consolidation again. Problem solved!
Hi Jackie, can you please clarify whether this is a VMware issue or a TSM issue? In my environment there was a lot of blame-shifting between these two teams.