When a DR Test Goes Wrong

Sometimes a DR test goes wrong through no fault of our own; sometimes we bring the trouble on ourselves. Just ask Paul Van Dyke, the IT supervisor at Kodiak Island Borough in Alaska, who decided to perform what might be the boldest storage reorganization project I’ve ever heard of.

This blog post summarizes the main points of my latest podcast episode. If you’d like, you can listen to it or watch it at https://www.backupwrapup.com/

The Setup: A Real Disaster Recovery Test Begins

Back in 2001, Paul had five Compaq ML370 servers. He noticed his storage usage was uneven across them – some barely touched their 45 GB RAID arrays while others were running out of space. His solution? Completely reorganize the drives across all servers at once – which, of course, meant deleting all of his data and then restoring it from backup.

When Testing Goes Nuclear

Paul intentionally destroyed his entire environment. We’re talking about taking down both domain controllers, the email server, the file server, and the application server simultaneously. The entire borough’s IT infrastructure went dark, with only backup tapes standing between Paul and unemployment.

Learning from Someone Else’s Test

What should have been a weekend project stretched into five days. The critical servers (email, file server, logins) came back by Monday morning, but only because Paul spent Sunday night sleeping on his office floor, swapping tapes. The application server took another three days to fully restore.

Lessons Learned

Here’s what Paul learned – and what we can learn without having to try it ourselves:

  1. Always test your restores before betting your job on them
  2. The RAID write penalty makes restores take longer than backups, because parity RAID turns each logical write into several disk I/Os (see the sketch after this list)
  3. Even when backups work perfectly, restoration takes time
  4. Always have a fallback plan
  5. Sometimes the best way to save $30,000 isn’t the smartest way
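
To put some rough numbers on lesson #2: a restore writes to the array, and on parity RAID like RAID 5 each logical write classically costs about four disk I/Os (read data, read parity, write data, write parity), so effective write throughput – and therefore restore speed – can be a fraction of read speed. Here’s a back-of-the-envelope sketch in Python. Every figure in it is an illustrative assumption of mine (DLT-era tape speed, a guessed array read rate), not a spec from Paul’s actual ML370s.

```python
# Back-of-the-envelope restore-time estimate.
# All numbers below are illustrative assumptions, not specs
# from the actual Compaq ML370s in Paul's story.

RAID5_WRITE_PENALTY = 4  # RAID 5: each logical write costs ~4 disk I/Os
                         # (read data, read parity, write data, write parity)

def restore_hours(data_gb: float, tape_read_mb_s: float,
                  array_write_mb_s: float) -> float:
    """A restore is bounded by the slower of the tape drive's read
    speed and the RAID array's effective write speed."""
    bottleneck_mb_s = min(tape_read_mb_s, array_write_mb_s)
    return (data_gb * 1024) / bottleneck_mb_s / 3600

# Assume ~6 MB/s tape reads and an array that reads at ~20 MB/s
# but writes at roughly 1/4 of that due to the RAID 5 penalty.
array_write_mb_s = 20.0 / RAID5_WRITE_PENALTY  # ~5 MB/s effective writes
print(f"~{restore_hours(45, 6.0, array_write_mb_s):.1f} hours per 45 GB server")
```

Even with those generous assumptions, that’s roughly two and a half hours per server before you count tape swaps, verification, or the machines that have to come back in a particular order. Multiply by five servers and it’s easy to see how a weekend project stretches into five days.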

The Aftermath

While Paul successfully achieved his goal of better storage allocation, he admits that when he looked back just two weeks later, he realized how incredibly risky his approach had been. The backups worked, but the process could have gone sideways in countless ways.

Twenty-three years later, Paul’s still the IT supervisor at Kodiak Island Borough. He’s got a longer beard now, and I’d like to think this experience contributed to a few of those gray hairs. But his story serves as a perfect example of why we need to think through our recovery processes carefully – before we need them.

Your disaster recovery test doesn’t have to be as dramatic as Paul’s. The key is to test your restores regularly, understand your recovery time objectives, and always have a plan B. And maybe don’t intentionally destroy your entire environment just to reorganize some storage.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at S2|DATA, which helps companies manage their legacy data.