Hard-Earned Disaster Recovery Lessons From 9/11 and Other Disasters

When your building is on fire, your data center is underwater, or ransomware has locked up all your systems, it’s too late to figure out how disaster recovery should work. The disaster recovery lessons we’ve learned from major events like 9/11 and significant hurricanes have shaped how we should approach DR planning today. These lessons weren’t theoretical – they came at a high cost to the organizations that experienced them firsthand.

This blog post summarizes the main points of my latest podcast episode. If you’d like, you can listen to it or watch it at https://www.backupwrapup.com/

Critical Disaster Recovery Lessons About Geographic Separation

One of the most shocking disaster recovery lessons from 9/11 was the realization that many companies had put their primary data center in one World Trade Center tower and their hot backup site in the other. This made perfect sense from a technical standpoint: minimal latency, high-bandwidth connections between sites, and the assumption that nothing would take out both buildings simultaneously.

But when both towers fell, these companies lost everything – primary and backup. This taught us that geographic separation isn’t just a good idea – it’s absolutely necessary. However, this created a new challenge: how far apart should your sites be?

Regulatory attempts after 9/11 suggested financial institutions needed DR sites at least 200 miles away from primary sites. But physics doesn’t care about regulations. The speed of light creates latency limitations that made synchronous replication almost impossible over such distances with the technology available at the time.
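To put rough numbers on that physics problem, here’s a quick back-of-the-envelope sketch. The 200-mile figure comes from the regulatory guidance above; the fiber speed and the route-padding factor are assumptions for illustration, not measurements of any real circuit.

```python
# Back-of-the-envelope round-trip latency for a 200-mile synchronous write.
# Assumptions: light travels roughly 200 km per millisecond in fiber, and
# real circuits rarely run in a straight line, so a route factor pads the path.

DISTANCE_MILES = 200
KM_PER_MILE = 1.609
FIBER_KM_PER_MS = 200        # approximate speed of light in glass
ROUTE_FACTOR = 1.5           # assumed padding for non-straight fiber routes

one_way_km = DISTANCE_MILES * KM_PER_MILE * ROUTE_FACTOR
round_trip_ms = 2 * one_way_km / FIBER_KM_PER_MS

print(f"Best-case round trip: ~{round_trip_ms:.1f} ms per acknowledged write")
# Every synchronous write waits at least this long, before counting any
# switch, array, or protocol overhead on top of the raw distance.
```

A few milliseconds per write sounds trivial until you multiply it by every transaction a busy database has to commit in order.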

Disaster Recovery Lessons: Synchronous vs. Asynchronous Replication

The distance limitation brings us to another set of disaster recovery lessons about data replication methods. In synchronous replication, a write isn’t acknowledged to the client until it’s committed at both the primary and secondary sites. This guarantees zero data loss if done properly (what Prasanna called “domino mode” in our discussion), but requires low latency between sites.

Asynchronous replication acknowledges writes to the client before they’re committed at the secondary site. This creates a potential data loss window but allows for much greater distances between sites. With asynchronous replication, you can configure how far behind you’re willing to let the secondary site fall – measured in transactions or seconds – before putting back pressure on the primary site.
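To make the distinction concrete, here’s a minimal sketch of the two acknowledgment models. It’s a toy illustration only, not any vendor’s replication engine; the class name, the pending queue, and the max_lag threshold are all assumptions made for the example.

```python
import collections

class ReplicatedVolume:
    """Toy model of a primary volume replicating writes to a secondary site.

    mode="sync"  : acknowledge the client only after both sites commit.
    mode="async" : acknowledge after the primary commit, queue the write for
                   the secondary, and push back on the client if the queue
                   (the potential data-loss window) grows past max_lag writes.
    """

    def __init__(self, mode="async", max_lag=100):
        self.mode = mode
        self.max_lag = max_lag
        self.pending = collections.deque()    # writes not yet on the secondary

    def write(self, data):
        self._commit_primary(data)
        if self.mode == "sync":
            # Client waits for the full inter-site round trip on every write,
            # but nothing is acknowledged that isn't already at both sites.
            self._commit_secondary(data)
            return "ack"
        # Async: acknowledge now, ship the write later.
        self.pending.append(data)
        if len(self.pending) > self.max_lag:
            return "throttled"                # back pressure on the primary
        return "ack"                          # fast, but this write could be lost

    def _commit_primary(self, data):
        pass                                  # stand-in for a local disk commit

    def _commit_secondary(self, data):
        pass                                  # stand-in for the WAN round trip
```

In the synchronous case, every write pays the inter-site round trip before the application moves on; in the asynchronous case, the application stays fast but accepts that up to max_lag writes could be lost if the primary site disappears.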

Today, cloud-based DR solutions make much greater distances possible, but we still need to carefully consider what consistency model works for our applications.

Overlooked Disaster Recovery Lessons: The Human Element

Some of the most important disaster recovery lessons aren’t about technology at all – they’re about people. In our podcast, we discussed a case where a hurricane devastated an island, and the recovery team flew in to restore systems from their secondary data center (which survived because it was on higher ground).

What they hadn’t planned for:

  • Where would the recovery team sleep when hotels were destroyed?
  • What would they eat when restaurants were closed?
  • How would they communicate when cell towers were down?

In that case, they converted conference rooms to sleeping quarters, ate rice and beans for weeks, and had to use satellite communications. These are the disaster recovery lessons that don’t make it into technical manuals but can make or break your recovery efforts.

Another human factor: dependency on key personnel. What happens if your primary DR expert is unavailable? Documentation becomes critical, and so do tabletop exercises where you work through scenarios with different team members.

Practical Disaster Recovery Lessons About Self-Sufficiency

When planning for disaster recovery, you need to assume you’ll be isolated from the rest of your computing infrastructure. In the island disaster example I mentioned, their authentication and authorization systems (Active Directory) were located on the mainland. When connectivity was lost, this created significant recovery complications.

The disaster recovery lessons here are clear: you can’t count on utilities (power, internet, water) being available. You need to plan for self-sufficiency or explicitly accept the risk of not being able to recover if those utilities are unavailable.

This doesn’t mean you need to spend millions on generators and satellite systems – but you should have the conversation and make conscious decisions about what level of self-sufficiency you need.

Testing: Where Disaster Recovery Lessons Get Real

I shared a personal disaster recovery lesson about assuming my backup system’s throughput would be sufficient for recovery, only to find out during an actual disaster that it fell far short of what I needed. Don’t make this mistake.

Remember that restore speeds are typically slower than backup speeds, whether due to legacy issues like tape multiplexing or modern challenges like deduplication. The only way to know your actual recovery capabilities is to test them under realistic conditions.
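If you want to turn a restore test into a planning number, a back-of-the-envelope calculation like the sketch below helps. The throughput figure has to come from timing a real restore of representative data; every number shown is a placeholder, not a recommendation.

```python
# Rough recovery-time estimate from a measured restore test.
# Assumption: measured_restore_mb_per_s came from timing a real restore of
# representative data. Backup-job throughput is not an acceptable substitute.

protected_data_tb = 40              # placeholder: how much you must restore
measured_restore_mb_per_s = 250     # placeholder: from your own timed test
target_rto_hours = 24               # placeholder: what the business expects

restore_hours = (protected_data_tb * 1_000_000) / measured_restore_mb_per_s / 3600
print(f"Full restore at the measured rate: ~{restore_hours:.0f} hours")

if restore_hours > target_rto_hours:
    print("That misses the RTO. Better to learn this in a test than in a disaster.")
```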

The 3-2-1 rule remains one of the most fundamental disaster recovery lessons: maintain at least three copies of your data, on two different media types, with one copy stored offsite. This principle still holds true whether you’re using cloud storage, tape, or other technologies.
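If it helps to treat the rule as a checklist, here’s a tiny sketch that scores a set of copies against it. The inventory shown is hypothetical; your own copy list goes in its place.

```python
def satisfies_3_2_1(copies):
    """Score a list of copies against the 3-2-1 rule.

    Each copy is a dict like {"media": "disk", "offsite": False}.
    Rule: at least 3 copies, on at least 2 media types, with 1 copy offsite.
    """
    media_types = {c["media"] for c in copies}
    has_offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite

# Hypothetical inventory: primary disk, local backup appliance, cloud copy.
copies = [
    {"media": "disk",   "offsite": False},
    {"media": "disk",   "offsite": False},
    {"media": "object", "offsite": True},
]
print(satisfies_3_2_1(copies))   # True: 3 copies, 2 media types, 1 offsite
```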

Disasters will happen. They’re never convenient, and they never follow your plans exactly. But by applying these disaster recovery lessons, testing thoroughly, and planning for both technical and human factors, you can dramatically improve your chances of successful recovery when – not if – disaster strikes.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at S2|DATA, which helps companies manage their legacy data.