k3s.live

Based on the IT journey of Michael Rickert

The case of the grind-to-a-halt backups

Or, investigating a slow network 101

Prologue:

The office had just completed a migration from an all bare-metal server environment to an all-virtualized one. This was the first time they were performing snapshot-style backups using Veeam, as opposed to copying entire server drives 1:1 onto tape. The reasons for the change were many, but one of the biggest was to speed up server backups across the board, as the backup time for even a single server had grown far too long.

The story:

The workday was coming to an end, 6:00pm was rolling around, and I had decided to kick off the new Veeam backup system before leaving for the day. With a new JBOD, all-new virtual servers, and 10G switches, it was quite the robust and expandable backup solution, one we had spent a few weeks working toward. A coworker and I spent the next few minutes watching the backups go: 1 gig per second, then 2, then 5… these were the speeds we had been working so hard to achieve. The backup held steady at around the 6 Gbps mark, but with 70TB to back up we decided to call it a night and head home, expecting to see a full backup completed, or at least mostly finished, by the next morning.

Later that evening, my mind wandered and I found myself wondering how our new backup system was getting along with the large job. Wiping some sleep from my eyes, I VPN’d into our corporate network and logged into the backup system. Things still seemed to be going well, so I moved the work laptop to the side of my desk, where I could occasionally glance at the progress while I got a gaming match in. Then I noticed something strange out of the corner of my eye. At about 10:30pm the backups went from pulling 1+ gigs of network traffic to less than 20 Mb/s, grinding them to an almost complete halt. At that rate, the backups would take over a week to make real progress, let alone finish. Something must have gone very wrong.

With all-new hardware from blade center to JBOD, I doubted the physical hardware itself was the problem; everything had been tested and was humming along nicely. Upon further investigation, I found VMware giving some strange warnings. One specific LUN, on one specific blade, had shot up from 10ms of latency to 1+ seconds, a more than one-hundred-fold increase! I investigated configs, pored over routes and their respective switches, and tried moving data off the affected LUN, but nothing would bring it back from its one-second-delayed slumber. Meanwhile, all other backup jobs not touching that LUN were fine and moving quickly, just as they had been earlier in the evening. I began to suspect a drive failure in the SAN, but saw no alerts for drive errors, nor any SMART errors from the HDDs themselves.

Finally, digging deeper into the latency graphs built into VMware vCenter, I noticed it wasn’t so much the LUN failing, or the disks, or the backup server. Pulling up the iSCSI pathing and following the breadcrumbs, I tried changing the path selection (load balancing) policy VMware uses to talk to the LUN/SAN from Round Robin to Fixed. BOOM, instant 1+ gig speeds again and backups started flying! The policy on its own shouldn’t cause such a collapse in traffic speeds, though; it only chooses which path, and therefore which switch port, to use, so the search continued… I didn’t have to wait long (about 15 minutes) before the LUN started tanking again, knocking out backups with it.
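For anyone wanting to try the same experiment, the policy change can also be made per device from an ESXi shell with esxcli; the sketch below assumes standard NMP multipathing, and the naa identifier is just a placeholder for the affected LUN:

    # list devices and the path selection policy each one is currently using
    esxcli storage nmp device list

    # switch the affected LUN from Round Robin to Fixed
    esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED

The same change can be made from the vSphere client through the datastore’s Manage Paths dialog, which is where I made it that night.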

This time, instead of changing the path selection policy, I ‘blipped’ the switch port that LUN’s traffic was riding on (shut/no shut), forcing it over to another port, and again, bam, traffic speeds came right back up to normal. Then, another 30 minutes later, the same issue arose on the new port, which left only one explanation I could think of: one of the ports on the blade center’s 10-gig switch, which carries traffic from the SAN to the backup server, was filling its buffer, and once that port buffer filled up, the port choked, causing massive bottlenecks.

An issue ticket was opened with Dell, and after quite a long support call with multiple support teams, they agreed that what I had described was likely what was happening. A barrage of firmware updates ensued, which led to a final fix for the immensely slow backups we had experienced both before and after the new hardware was installed. Between that firmware fix, properly sized and scalable backup storage hardware, and proper snapshotting/deduplication software, the company now has some blazing fast (and stable!) automated backups year-round.
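For reference, the port ‘blip’ mentioned above is just an administrative bounce of the interface. On a Cisco-style CLI it looks roughly like the following; the interface name here is hypothetical, and the exact syntax on the Dell blade switches differs a little:

    configure terminal
    interface TenGigabitEthernet 0/12
    ! take the port down; multipathing fails traffic over to a surviving path
    shutdown
    ! then bring it right back up
    no shutdown
    end

It is a blunt instrument, but it is a quick way to prove whether a problem follows a specific port or follows the traffic, which is exactly what pointed at the buffer issue here.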
