HyperFlex: VM Becomes Unresponsive During Native VMware Snapshot Deletion
I recently encountered an issue while using Avamar to perform a snapshot-based backup of a nested vCenter VM inside a HyperFlex cluster. The issue occurs with native VMware snapshot backups on NFSv3 datastores. It is more prevalent with a nested vCenter on HyperFlex because HyperFlex presents its storage to the ESXi hosts as NFSv3 datastores.
The Avamar-initiated snapshot of the vCenter VM succeeded and Avamar started its backup run. However, as the backup neared completion, Avamar initiated a snapshot deletion before validating whether the backup had succeeded. During the snapshot deletion, the VM became unresponsive and Avamar marked the backup as failed. The ESXi host running the nested vCenter also became unresponsive for a few minutes and then recovered automatically. Further investigation showed that the snapshot deletion does complete once the VM becomes responsive again, but by then the Avamar backup has already failed with the error “unable to connect to host” because the ESXi host was unresponsive.
After I logged a case with VMware support, they referred me to VMware KB article 2010953.
In short, the bug affects all VM snapshots on NFSv3 datastores and is triggered when the backup appliance and the VM are on different ESXi hosts. The workaround according to the VMware KB is to use NFSv4 datastores, but this is not available with HyperFlex at the moment.
The workaround for HyperFlex is to create a VM affinity rule that keeps the nested vCenter and the Avamar backup appliance together on the same host.
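The affinity rule is normally created in the vSphere Client, but it can also be scripted. Below is a minimal sketch, assuming pyVmomi is installed and that you already hold managed-object references for the cluster and the two VMs. The rule name, the VM names in the sample data, and both helper functions are hypothetical and exist only for illustration. The first helper is pure Python and checks that the VMs share a host; the second builds the cluster spec for a VM-VM "keep together" rule.

```python
def is_colocated(placement, vm_names):
    """Return True if every VM in vm_names runs on the same ESXi host.

    placement maps VM name -> ESXi host name (hypothetical sample data;
    in practice you would read it from the vCenter inventory).
    """
    hosts = {placement[name] for name in vm_names}
    return len(hosts) == 1


def build_affinity_rule_spec(vms, rule_name="keep-vcenter-with-avamar"):
    """Build a cluster config spec that adds a VM-VM affinity rule.

    vms is a list of vim.VirtualMachine objects. pyVmomi is imported
    lazily so the rest of this sketch runs without it.
    """
    from pyVmomi import vim

    rule = vim.cluster.AffinityRuleSpec(
        name=rule_name,   # hypothetical rule name
        enabled=True,
        vm=list(vms),
    )
    rule_spec = vim.cluster.RuleSpec(info=rule, operation="add")
    return vim.cluster.ConfigSpecEx(rulesSpec=[rule_spec])


# Pure-Python check against made-up placement data.
placement = {"nested-vcenter": "esxi-01", "avamar-proxy": "esxi-02"}
if not is_colocated(placement, ["nested-vcenter", "avamar-proxy"]):
    print("vCenter and the Avamar appliance are on different hosts")
```

With a live connection, the spec would be applied with `cluster.ReconfigureComputeResource_Task(spec, True)`, where `cluster` is the `vim.ClusterComputeResource` of the HyperFlex cluster.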
While troubleshooting this issue, I also researched best practices for snapshot-based backups of a nested vCenter on HyperFlex.
Follow these recommendations when configuring an Avamar backup of a nested vCenter on HyperFlex:
- Create a separate policy for the nested vCenter appliance in HyperFlex.
- Disable VMware “quiesced snapshots” in the backup policy.
- Disable CBT (Changed Block Tracking) in the backup policy.
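
The checklist above can be captured as a small sanity check. This is only a sketch: the dictionary keys are hypothetical names chosen for illustration and do not correspond to any Avamar API or configuration file, so the policy values would have to be filled in by hand from the Avamar policy settings.

```python
# Recommended settings for the nested vCenter backup policy.
# The keys are hypothetical labels, not Avamar field names.
RECOMMENDED = {
    "dedicated_policy": True,         # separate policy for the nested vCenter
    "quiesced_snapshots": False,      # VMware quiesced snapshots disabled
    "changed_block_tracking": False,  # CBT disabled
}


def policy_deviations(policy):
    """Return {setting: (actual, recommended)} for each setting that differs."""
    return {
        key: (policy.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if policy.get(key) != wanted
    }


# Example: a policy that still has CBT enabled.
current = {
    "dedicated_policy": True,
    "quiesced_snapshots": False,
    "changed_block_tracking": True,
}
for setting, (actual, wanted) in policy_deviations(current).items():
    print(f"{setting}: is {actual}, should be {wanted}")
```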