When a client asked me to back up two Synology boxes totaling “about 40TB,” I had no idea I was about to experience the backup from hell. What followed was months of battling failing tape drives, kilobyte-per-second transfer speeds, and the discovery of directories containing millions of files – a perfect storm of everything that could possibly go wrong with a backup.

This blog post summarizes the main points of my latest podcast episode. If you’d like, you can listen to it or watch it at https://www.backupwrapup.com/.
Initial Assessment: Where Everything Started Going Wrong
The first mistake was trusting the client’s estimate of 40TB. In my defense, getting accurate sizing information would have taken days due to the extreme number of files involved. A simple ‘du’ command would hang for days, and ‘find’ commands were even worse. What we actually had was closer to 400TB – a discovery that completely invalidated our initial backup architecture.
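To give a sense of the problem: a filesystem-level check returns instantly because it reads the volume's counters, while du has to stat every file it finds. A minimal sketch, assuming shell access to the Synology (the paths are illustrative):

```bash
# Instant: df reads filesystem counters, so file count doesn't matter
df -h /volume1

# Slow: du stats every file under the path, which is what runs for days
# on trees with tens of millions of entries (shown only for contrast)
du -sh /volume1/some-share
```

Neither tells you the file count, though, which turned out to matter even more than the capacity.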
I started with NetBackup and tape – a solid combination that’s worked for decades. The setup included a Windows server and tape library perfectly sized for 40TB. Everything looked good on paper. But then reality hit, and it hit hard.
The First Signs of the Backup from Hell
The backup speeds were the first red flag – we’re talking 3.5 kilobytes per second. Yes, kilobytes. Even when multiplexing 99 backup jobs simultaneously (NetBackup’s maximum), we only achieved 30-40 megabytes per second aggregate throughput. The math was brutal: at these speeds, we were looking at months or even years to complete the backup.
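Back-of-the-envelope, using the ~400TB that later turned out to be the real number and a generous 35 megabytes per second aggregate (so this is illustrative, not exact):

```bash
# Time to move 400 TB at ~35 MB/s aggregate, streaming 24/7 with no failures:
echo "400 * 10^12 / (35 * 10^6) / 86400" | bc -l   # ~132 days
# A single 3.5 KB/s stream for the same data is simply absurd:
echo "400 * 10^12 / 3500 / 86400 / 365" | bc -l    # ~3,600 years
```

And that 132 days assumes nothing ever breaks, which, as you'll see, was not a safe assumption.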
Tape Drives: Not Your Friend for Marathon Sessions
The half-height LTO drives weren’t designed for continuous operation over months. They wanted nice, fast data streams and regular breaks. Instead, I was giving them a trickle of data 24/7. The result? Constant shoe-shining and drive failures. We went through multiple drives, endless cleaning cycles, and countless reboots. When a drive failed after weeks of backup, we had to start those jobs over – pure torture.
The Million File Problem Reveals Itself
As I dug deeper, I discovered the true villain: directories containing millions of files. One directory alone had 99 million files. This is when I remembered the dreaded “million file problem” from my early backup days. But this was worse – we were dealing with this over SMB, which requires multiple round-trip conversations for each file. The network wasn’t saturated in terms of bandwidth; it was drowning in latency.
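Here's a rough illustration of why latency rather than bandwidth was the killer. The per-file operation count and round-trip time below are assumptions made for the arithmetic, not measurements:

```bash
# 99 million files, assuming ~4 SMB round trips per file (open, getattr, read, close)
# at ~0.5 ms of network latency each, handled serially:
echo "99 * 10^6 * 4 * 0.0005 / 86400" | bc -l   # ~2.3 days spent just waiting on round trips
```

Multiply that across hundreds of millions of files and every extra protocol operation per file, and the 3.5 kilobytes per second stops being mysterious.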
Breaking Down the Technical Nightmare
The real issue wasn’t disk I/O – during all those hundreds of simultaneous backups, I/O wait never exceeded 4%. The Synology boxes weren’t overtaxed either – no high CPU, no RAM issues. It was purely the SMB protocol overhead combined with an astronomical number of files. Each file required multiple network round-trips, creating a cascade of latency that brought everything to a crawl.
Finding Solutions Through Trial and Error
After several failed approaches, I finally developed a multi-pronged strategy:
- Switched from tape to disk backup – expensive but necessary
- Split the backup into 2,400 separate policies for better management
- Used local tar backups for the 20 most problematic directories
- Ran multiple tar backups simultaneously (see the sketch after this list)
- Created extensive scripts to manage the whole process
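Here's the shape of the local-tar piece, heavily simplified. The directory names, the target mount point, and the parallelism count are all hypothetical, and the real scripts did far more bookkeeping:

```bash
#!/bin/bash
# Run local tar backups of the worst directories in parallel,
# writing the archives to disk-based backup storage at /mnt/backup (hypothetical).
PROBLEM_DIRS=(/volume1/app/cache /volume1/app/thumbs /volume2/exports)  # examples only
MAX_JOBS=4

for dir in "${PROBLEM_DIRS[@]}"; do
  # Throttle: wait for a free slot before launching the next tar
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do sleep 30; done

  name=$(echo "$dir" | tr '/' '_')
  tar -cf "/mnt/backup/${name}_$(date +%Y%m%d).tar" "$dir" \
      > "/mnt/backup/${name}.log" 2>&1 &
done
wait
echo "All local tar jobs finished."
```

The point isn't the script itself; it's that tar reading the filesystem locally skips the per-file SMB conversation entirely.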
The Scripting Salvation
Cygwin became both my best friend and occasional nemesis. I wrote scripts to monitor job progress, balance loads across the target filers, and handle the constant Windows/Unix path translation issues: backslashes versus forward slashes, deep directory structures, and all the other quirks of moving between the two conventions.
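The one saving grace on the path front is that Cygwin ships cygpath, which handles most of the conversion for you (the paths here are just examples):

```bash
# Convert between Windows and Cygwin/Unix path styles
cygpath -u 'E:\backups\volume1'         # -> /cygdrive/e/backups/volume1
cygpath -w /cygdrive/e/backups/volume1  # -> E:\backups\volume1
```

It's invaluable when a bash script has to hand a path to a Windows-native command line, or vice versa.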
What Finally Worked
The breakthrough came when I realized that backing up locally using tar and then transferring the archives was orders of magnitude faster than backing up over SMB. What had taken 60+ days with negligible progress could now complete in about a day. The difference was staggering.
Lessons from the Backup from Hell
- Never trust data size estimates – verify them yourself if possible
- Always check file counts, not just data volume
- Watch out for applications that create millions of files
- Consider local backup options for extreme file count scenarios
- Keep your scripting skills sharp – they’re your last line of defense
- Standard backup tools may fail in extreme scenarios – have backup plans
- Document everything – you’ll need it when explaining why a “simple” backup took months
The Next Time Around
If I had to do it again (please, no), I’d spend more time upfront analyzing the environment. Running find and du commands might take days, but that’s better than discovering these issues mid-backup. I’d also push harder for local backup access rather than relying on network protocols for extreme file count scenarios.
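In practice that upfront analysis doesn't need to be fancy. Even a crude per-share survey will surface million-file directories before you commit to an architecture; a sketch, with illustrative paths:

```bash
# Count files one top-level share at a time so a single monster directory
# doesn't stall the whole survey, and keep a log as you go
for share in /volume1/*/; do
  count=$(find "$share" -type f 2>/dev/null | wc -l)
  printf '%-40s %12d files\n' "$share" "$count" | tee -a file_counts.txt
done
```

Let it run over a weekend if it has to; it's still cheaper than finding out sixty days into the backup.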
Remember: in backup, it’s not just about the total data size. The number of files can break you faster than the number of terabytes. And when someone says “it’s about 40TB,” make sure to verify that yourself – preferably before committing to any timelines.
Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at S2|DATA, which helps companies manage their legacy data.

