A lesson in backup testing!

This is a real-world example of why you should always test your backups.

About two weeks ago, during a backup restore test, an error began occurring that caused every restore of the fileserver drive to fail, no matter how far back the restore point was. We opened a ticket with Veeam support about the error, and troubleshooting began later that day. After three to four more days of troubleshooting, we ran an entirely new full backup of the fileserver drive (which takes about four days), and it also reported an error after completing.

With no way to back up or restore the fileserver drive, the issue became priority #1 and was escalated to tier 1 support (the highest tier) within Veeam's ticketing system. After another few days of troubleshooting, it turned out that an obscure default timeout in the Veeam software was causing the backups and restores to fail before completing, because of the overall size of the fileserver drive backups. The timeout presented itself as a network issue, despite being completely unrelated to the network, which made it much harder to detect.

Veeam then provided a patch that changed the timeout from the 30-minute default to 2 hours. Once the patch was in place, backups and restores of the fileserver drive began running normally again. After the first successful restore I tested several older backups of the G: drive (one month back and two months back), and as of yesterday evening those were working again as well.

The lesson here is to always test your backups _AND_ your restores before you need them. Had we not tested the restores and then suddenly needed to recover fileserver drive data from scratch after a DR event, we would have had to start the troubleshooting from the beginning, delaying the recovery of critical files by possibly weeks.
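For anyone who wants to script part of a restore test, here is a minimal Python sketch of one way to check a restored folder against its source. It is not what we ran and it has nothing Veeam-specific in it; the function names and folder arguments are made up for illustration. It also assumes the source files have not changed since the backup was taken, so it is best pointed at a static test folder.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_restore(source_dir: str, restored_dir: str) -> list[str]:
    """List files that are missing from, or differ in, the restored copy.

    Hypothetical helper: walks every file under source_dir and looks for
    the same relative path under restored_dir.
    """
    source, restored = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list means every source file was found in the restore with identical contents. Anything else is worth investigating before a real DR event forces the question.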

 

Note: For anyone curious about the actual error, Veeam would report a network reset exactly 30 minutes after an instant recovery (IR) started. Because the instant recovery took longer than 30 minutes to complete, Veeam assumed the restore had failed and closed the connection. Adding a DWORD named remotingTimeout with a value of 7200 (2 hours, in seconds) under HKLM\SOFTWARE\Veeam\Veeam Backup and Replication in the registry and then restarting the backup server keeps the RPC connection open long enough for the file to finish processing, so the restore completes properly.
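To make the registry change concrete, here is a small Python sketch that only builds the text of a .reg file for that value; it does not touch the registry. The key path and value name are the ones from the note above, and the helper name is made up. Treat it as an illustration and only apply a change like this on the advice of Veeam support.

```python
# Key path and value name come from the note above; spelled out in the
# long HKEY_LOCAL_MACHINE form that .reg files use.
VEEAM_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication"


def build_reg_file(key_path: str, value_name: str, seconds: int) -> str:
    """Return .reg file text that sets a DWORD value (hypothetical helper)."""
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 unsigned bits")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # .reg files store DWORDs as eight hexadecimal digits.
        f'"{value_name}"=dword:{seconds:08x}',
        "",
    ])


if __name__ == "__main__":
    two_hours = 2 * 60 * 60  # 7200 seconds
    print(build_reg_file(VEEAM_KEY, "remotingTimeout", two_hours))
```

Note that a .reg file stores the value in hexadecimal, so 7200 shows up as 00001c20. If you enter the value by hand in the Registry Editor instead, make sure the base is set to decimal before typing 7200.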

