Why aren't VMware snapshots suitable for backing up guest state?

 

First, let's look at how snapshots work. Each time you take a snapshot, the VMware server creates a new file, usually with "delta" in its name. From then on it writes all changes to this file and leaves the original .vmdk file (which represents the virtual machine's hard drive) untouched. Changes are tracked at the block level, so even moving a file from one folder to another inside the guest OS counts as a change to the VM and is added to the "delta" file.

Basically, this means that the server just keeps adding changes, making the "delta" file bigger and bigger. There is no limit on its growth (as long as there is free space on the storage), and a snapshot can become several times bigger than the VM's hard drive itself (see image below).


Now, if you create one more snapshot, the server creates another "delta" file and starts writing changes there.

We are facing two potential problems here. The first is that a large "delta" file starts causing performance issues. The second is that a growing "delta" file can use up all the available space on the storage.

The size of the "delta" file becomes a problem especially when the guest is a database server, where small portions of data are added and moved frequently.
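To keep an eye on how large the "delta" files have become, something like the following sketch could be used. It is not a VMware tool. It assumes only what the text above says, namely that snapshot files have "delta" in their name, and that the script runs somewhere the VM's directory is readable. The function names are made up for illustration.

```python
import os

def find_delta_files(vm_dir):
    """Return (path, size_in_bytes) for every file with "delta" in its name."""
    found = []
    for name in sorted(os.listdir(vm_dir)):
        path = os.path.join(vm_dir, name)
        # Assumption from the text: snapshot files carry "delta" in their name.
        if "delta" in name and os.path.isfile(path):
            found.append((path, os.path.getsize(path)))
    return found

def total_delta_bytes(vm_dir):
    """Sum the sizes of all "delta" files in one VM directory."""
    return sum(size for _, size in find_delta_files(vm_dir))
```

Calling total_delta_bytes() on a VM's directory from time to time gives a rough picture of how fast its snapshots are growing.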

All of the above means that snapshots are good when you want to test something and then revert to the previous state if there are problems, but they are not meant for backing up the state of the VM. And of course, snapshots should be deleted right after testing.

 

What happens if there is no more free space?

 

If you forget to delete a snapshot and it takes up all the free space on the storage partition, the guest will shut down with an error saying that there is no more space for the vmdk file to grow. This is what recently happened to the qa-db1 server.

A logical step in this situation is to try to delete one or more snapshots to free some space... and that's where all the problems start.

In order to delete a snapshot, the server creates one more temporary "delta" file to consolidate the changes from the snapshot with the previous state file. In the worst-case scenario, this temporary "delta" file will be the same size as the snapshot itself.

As soon as there is no free space for this temporary "delta" file, you will probably get an error message:

there is no more space for the redo log of -0000xx.vmdk.

 

You are given the option to abort or retry.

If you choose Abort, the virtual machine is powered off, the snapshot is aborted, and a Consolidate Helper snapshot is created. The Snapshot Manager UI displays that Consolidate Helper snapshot. You can delete the Consolidate Helper snapshot after you have made space available.

If you click Retry, the Snapshot Manager returns to Consolidate Helper snapshot mode unless you have made more disk space available.

(c) VMware knowledge base

 

So, if you didn't know that it was necessary to free space before trying to remove a snapshot, you have to free some space now and click Retry. However, this option does not always work as described in the documentation. You can get a series of unpleasant surprises, ranging from the server not even trying to remove the snapshot although free space is available to unresponsive VMs and frozen VM processes on the server.

 

The bad thing about this is that you never know how much free space will be necessary to successfully remove a snapshot. To be on the safe side, you have to free at least the same space as the size of the snapshot you are removing. Unfortunately, this is not always possible.
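The rule of thumb above (free at least as much space as the snapshot occupies) can be written down as a small check. This is only a sketch of that rule, not a guarantee that the removal will succeed. The function names are hypothetical, and it assumes the datastore is visible as an ordinary mounted path.

```python
import shutil

def free_bytes(path):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free

def enough_room_to_delete(snapshot_bytes, available_bytes):
    """Rule of thumb from the text: free space must be at least the snapshot size."""
    return available_bytes >= snapshot_bytes

def shortfall_bytes(snapshot_bytes, available_bytes):
    """How many more bytes must be freed before deleting (0 if none)."""
    return max(0, snapshot_bytes - available_bytes)
```

For example, for a 30 GB snapshot on storage with 10 GB free, shortfall_bytes() reports that 20 GB still have to be freed before the removal should be attempted.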

 

What if it's not possible to free enough space?

 

There are still some things that can be done:

  1. Power off the virtual machine
    This operation deallocates the swap file. A virtual machine's swap file is usually the same size as its allocated amount of RAM. The machine you power off does not have to be the one that is running off the snapshot. If there are non-critical machines residing on the same partition, they can be powered off to free up storage for the commit operation.
  2. Add an extent to the existing partition
    If there is a lack of disk space, the Add Extent wizard can be used to increase the amount of space available. The Add Extent operation is irreversible, and it makes a single partition depend on multiple LUNs.
  3. Clone the virtual machine to a partition or storage that has more space
    1. Right-click the virtual machine in question
    2. Select Clone
    3. Go through the clone wizard and select a storage with adequate space
    4. Start up the clone of the original virtual machine and verify that it is functioning in the same capacity as the original virtual machine.

      According to VMware support, cloning consolidates all snapshots and creates one vmdk file. After that you will be good to go, though this new file can be quite big, so choose a storage drive with enough space.
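For the first option, a rough estimate of how much space powering off machines would free can be worked out from their RAM sizes, since the text says a swap file is usually the same size as the allocated RAM. The sketch below is only that arithmetic. The VM names and RAM figures are invented, and the real amount freed may differ.

```python
MB = 1024 * 1024

def swap_bytes_freed(ram_mb_by_vm, vms_to_power_off):
    """Estimate space freed, assuming each swap file equals the VM's allocated RAM."""
    return sum(ram_mb_by_vm[name] for name in vms_to_power_off) * MB

def pick_vms_to_power_off(ram_mb_by_vm, needed_bytes):
    """Pick VMs, largest RAM first, until the estimate covers needed_bytes.

    Returns (chosen_names, estimated_bytes_freed). If the estimate is still
    below needed_bytes, powering off every listed VM is not enough.
    """
    chosen = []
    freed = 0
    by_ram = sorted(ram_mb_by_vm.items(), key=lambda item: item[1], reverse=True)
    for name, ram_mb in by_ram:
        if freed >= needed_bytes:
            break
        chosen.append(name)
        freed += ram_mb * MB
    return chosen, freed

# Hypothetical non-critical machines on the same partition (RAM in MB).
candidates = {"test-vm-a": 2048, "test-vm-b": 4096, "test-vm-c": 1024}
chosen, freed = pick_vms_to_power_off(candidates, 5 * 1024 * MB)
```

Only machines that really are non-critical should go into the candidates dictionary, and the caller should compare the returned estimate with the space actually needed.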

 

Another task is in progress

 

If you are already stuck with all those issues, do not power on the guest until you are done removing snapshots. If you try, it will only make the situation worse, as you will get a hung or unresponsive client process. This means that the vmware-mgmt service will be unable to get any response from the client process and will assume that the task is still in progress (even if it finished long ago).

In this situation you will not be able to do anything with the guest through the VMware client. It will be impossible to shut down or reset the guest (as the management service still thinks that another task is in progress). And if you shut down the running OS from the console or over remote access, you will not be able to start it again. The guest will never reach the shutdown state because of the same error message: There is another task in progress.

 

Operation is timed out

 

Often, if the snapshot you are trying to delete is big enough, you will get an "Operation timed out" message. This is because the VMware client has a default timeout of 15 minutes. You will get this message after 15 minutes regardless of the task status, and it does not mean that the operation was interrupted, failed or aborted. Most likely it is still running and will finish successfully.

The only way to check whether the task is still running is to ssh to the ESX server, cd to /vmfs/volumes// and check whether there are any files with "delta" in their name. If not, the snapshot was deleted successfully. If there are still some "delta" files, either you have other snapshots for this VM or the task is still running. Check whether the size and date/time of the files are changing.
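The manual check described above (look for "delta" files, then see whether their size and date/time change) could be sketched as two readings taken some time apart. The directory argument stands for the VM's folder under /vmfs/volumes, the function names are hypothetical, and the 60-second wait is an arbitrary choice.

```python
import os
import time

def delta_state(vm_dir):
    """Map each "delta" file name to its (size, mtime) pair."""
    state = {}
    for name in os.listdir(vm_dir):
        path = os.path.join(vm_dir, name)
        if "delta" in name and os.path.isfile(path):
            try:
                info = os.stat(path)
            except OSError:
                continue  # the file vanished between listing and stat
            state[name] = (info.st_size, info.st_mtime)
    return state

def describe_progress(before, after):
    """Compare two delta_state() results taken some time apart."""
    if not after:
        return "no delta files left"
    if before != after:
        return "delta files are changing"
    return "delta files unchanged"

def watch_once(vm_dir, wait_seconds=60):
    """Take two readings wait_seconds apart and describe what happened."""
    before = delta_state(vm_dir)
    time.sleep(wait_seconds)
    return describe_progress(before, delta_state(vm_dir))
```

"delta files are changing" suggests the task is still running. "delta files unchanged" may mean the remaining files belong to other snapshots, or that the wait was simply too short, so it is worth repeating the check before drawing conclusions.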

Also, do not choose the "Delete all" option if there are several snapshots. This operation requires more time and more free space. It's much better to delete snapshots one by one.

 

Summary

 

  1. Do not use snapshots to back up a guest's state. Snapshots are for testing and they have to be deleted right after testing is done. Storage should have backup capabilities and it's advised to use them instead of snapshots.
  2. Monitor free space on datastores. In the VMware Infrastructure Client go to Datastores, choose the datacenter and go to the Datastores tab. From the ESX server console type vdf -h. This shows all mounted partitions with their size and available free space.

    Another useful command is du -h --max-depth=1
    This command shows how much space each directory directly under the current one occupies. Starting at the root (/) directory and finding the largest directories, you can then drill down into them (using cd) and run the same command again, until you find the files that are actually occupying the space.
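Where du is not at hand, the same drill-down can be approximated with a short script. This is a rough stand-in and not a replacement for du: it adds up file sizes as reported by the filesystem, whereas du counts the disk blocks actually used, so the numbers can differ. The function names are made up for this example.

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                pass  # unreadable or vanished file: leave it out of the total
    return total

def largest_subdirs(parent):
    """List (size, name) for each immediate subdirectory, biggest first."""
    sizes = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size(entry.path), entry.name))
    return sorted(sizes, reverse=True)
```

As with du -h --max-depth=1, files lying directly in the starting directory are not listed on their own. Call largest_subdirs() again on the biggest entry to drill down further.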

 

Sources

 

  1. kb.vmware.com
  2. VMware support (though it's not as responsive as expected for a Gold support plan)