Disaster recovery (DR) is the discipline of bringing a whole system back to work after a catastrophe: the loss of a Region or data center, or data corruption. Unlike runtime resilience, which masks the failure of individual components, DR restores the system from discrete copies when the in-band mechanisms no longer help.
At the heart of the discipline are two numbers: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). They are derived from business impact: the cost of a minute of downtime and the value of the data you can afford to lose. The smaller the RTO and RPO, the more expensive and complex the solution, from backup & restore through pilot light and warm standby to active/active.
But a recovery plan is worth exactly as much as the exercise that proves it. An untested backup is a hypothesis, and a paper RTO diverges from real failover. Regular game days and DR drills measure actual RTO with a stopwatch and expose forgotten dependencies, configuration drift, and split-brain risk before a real disaster strikes.
Practical value of this chapter
Design in practice
Turn DR guidance (target RTO/RPO, recovery strategies, regular game days) into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make the DR trade-offs explicit: weigh target RTO/RPO, recovery strategies, and regular game days against release speed, automation level, observability cost, and operational complexity.
Source
AWS Well-Architected
Reliability Pillar: planning disaster recovery, RTO/RPO objectives, and strategies.
Runtime fault tolerance is built for losing a node. When an entire Region disappears, or data arrives corrupted in every replica, the in-band mechanisms no longer help — and that is where disaster recovery (DR) begins: the discipline of bringing a whole system back to work after a disaster. It is not the same as runtime resilience (which masks the failure of individual components) and not the same as incident management (coordinating people during an outage). DR sets target RTO and RPO, picks a strategy against their cost, and regularly validates the plan in exercises — game days.
Why: catastrophe vs. ordinary fault tolerance
Loss of a Region or data center
An earthquake, fire, prolonged power loss, or network isolation takes out not a single node but an entire geographic placement zone.
Logical data corruption
A bad migration, a release bug, or replication of the error itself spreads corruption to every healthy replica. Fault tolerance does not help here — replicas faithfully store the corrupted data.
Ransomware and deletion
An attack or compromised credentials can encrypt or delete both production data and the online copies reachable by the same access path.
How DR differs from resilience
Runtime resilience masks the failure of individual components and keeps the service within its availability target. DR is about recovering discrete copies of the whole system after a disaster, when the in-band mechanisms no longer help.
RTO and RPO: what exactly we promise
RTO
Recovery Time Objective
How fast you must be back
The maximum acceptable delay between the interruption of service and its restoration. It answers the question "how fast do we have to be running again".
RPO
Recovery Point Objective
How much data you can lose
The maximum acceptable amount of data, measured in time, that you can lose. It answers the question "how much of the most recent data is safe to lose".
The smaller RTO and RPO are, the less downtime and data loss you accept, but the higher the cost and complexity. Neither number defaults to zero: both are set by business impact. The cost of a minute of downtime and the value of the data you can lose determine which DR strategy is worth building at all. The link to money is easiest to see in the calculator below.
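To make the two definitions concrete, here is a minimal sketch that computes the achieved RTO and RPO from an incident timeline and compares them with targets. The function names, the timestamps, and the 1 h / 15 min targets are illustrative assumptions, not values from any specific system.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time the service was unavailable: from interruption to restoration."""
    return service_restored - outage_start


def achieved_rpo(outage_start: datetime, last_recovery_point: datetime) -> timedelta:
    """Window of lost data: everything written after the last recovery point."""
    return outage_start - last_recovery_point


def meets_target(actual: timedelta, target: timedelta) -> bool:
    """An objective is met when the measured value does not exceed it."""
    return actual <= target


# Hypothetical incident timeline (assumed values, for illustration only).
outage_start = datetime(2024, 1, 10, 10, 0)
last_recovery_point = datetime(2024, 1, 10, 9, 52)  # last copy that reached the recovery site
service_restored = datetime(2024, 1, 10, 10, 50)

rto = achieved_rto(outage_start, service_restored)
rpo = achieved_rpo(outage_start, last_recovery_point)

print("RTO met:", meets_target(rto, timedelta(hours=1)))
print("RPO met:", meets_target(rpo, timedelta(minutes=15)))
```

The same two subtractions are what a drill measures: the targets live in a document, while the timestamps come from the exercise.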
Strategies by cost and speed (the AWS model)
AWS groups DR approaches into four tiers, from cheap and slow to expensive and near-instant. The first three are active/passive (traffic goes to the primary Region, the standby waits for failover); the last one runs in all Regions at once.
Backup & Restore
Cheapest, slowest
Data is periodically copied to another Region; infrastructure and application are redeployed during a disaster via infrastructure as code. The cheapest option, but the slowest recovery.
RTO: hours; RPO: hours; cost: lowest
Pilot Light
Core data live, compute off
Core data replicates continuously and base infrastructure exists, but application servers are switched off and are only started on failover.
RTO: tens of minutes; RPO: minutes; cost: low
Warm Standby
Scaled-down but live
A scaled-down but fully functional copy of the environment keeps running in the recovery Region. On disaster you do not stand it up from scratch; you scale it up to production load, hence minutes, not hours, to recover.
RTO: minutes; RPO: seconds to minutes; cost: medium
Multi-Site Active/Active
Near-zero RTO/RPO
The workload runs in multiple Regions at once and traffic is spread across them. There is essentially no failover, but it is the most expensive and complex path.
RTO: near zero; RPO: near zero; cost: high
The RTO/RPO estimates here are qualitative (order of magnitude from AWS documentation), not guarantees for a specific system — real numbers depend on data volume, replication lag, and automation.
Calculator: what RTO and RPO cost
Downtime loss: $20,000 over 1 h of downtime
Data loss: $22,500 over a 15 min window
Expected annual loss: $85,000
Tier for the targets: Warm Standby
Model: loss = RTO × downtime_cost + RPO × data_value. This is a simplified order-of-magnitude estimate to compare a DR budget against expected damage, not a precise forecast for a specific system.
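The model can be written out as a few lines of arithmetic. The sketch below reproduces the sample figures in this section under stated assumptions: the data value of $90,000 per hour and the frequency of two incidents per year do not appear in the text and are inferred here only so that the numbers add up ($22,500 over 15 min, and $85,000 per year from $42,500 per incident). The function names are hypothetical.

```python
def incident_loss(rto_hours: float, downtime_cost_per_hour: float,
                  rpo_hours: float, data_value_per_hour: float) -> float:
    """loss = RTO * downtime_cost + RPO * data_value (order-of-magnitude model)."""
    return rto_hours * downtime_cost_per_hour + rpo_hours * data_value_per_hour


def expected_annual_loss(loss_per_incident: float, incidents_per_year: float) -> float:
    """Scale the per-incident loss by how often a disaster is expected."""
    return loss_per_incident * incidents_per_year


per_incident = incident_loss(
    rto_hours=1.0,
    downtime_cost_per_hour=20_000,
    rpo_hours=0.25,
    data_value_per_hour=90_000,  # assumption: inferred from $22,500 over 15 min
)
annual = expected_annual_loss(per_incident, incidents_per_year=2)  # assumption: inferred

print(f"per incident: ${per_incident:,.0f}, per year: ${annual:,.0f}")
```

In this model a DR tier pays for itself when its annual cost is lower than the expected annual loss it removes.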
Failover and failback: switching there and back
In an active/passive strategy, failing over to the standby and failing back to the primary are two separate, risky transitions. Between them, split-brain becomes possible if both Regions decide they are primary.
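One common guard against split-brain is fencing: every promotion issues a new epoch, and a Region may accept writes only while it holds the current one. The sketch below is a simplified illustration, not a production design. It assumes a single arbiter that lives outside both Regions, and the class names and Region names are hypothetical. Real systems usually rely on leases with timeouts rather than a check on every write.

```python
from typing import List, Optional


class Arbiter:
    """Single source of truth for who is primary (assumed to sit outside both Regions)."""

    def __init__(self) -> None:
        self.epoch = 0
        self.primary: Optional[str] = None

    def promote(self, region: str) -> int:
        """Make `region` primary and invalidate every earlier epoch."""
        self.epoch += 1
        self.primary = region
        return self.epoch

    def is_current(self, region: str, epoch: int) -> bool:
        return region == self.primary and epoch == self.epoch


class Region:
    def __init__(self, name: str, arbiter: Arbiter) -> None:
        self.name = name
        self.arbiter = arbiter
        self.epoch = 0  # 0 means "never promoted"
        self.can_reach_arbiter = True
        self.accepted: List[str] = []

    def become_primary(self) -> None:
        self.epoch = self.arbiter.promote(self.name)

    def accept_write(self, record: str) -> bool:
        # Fail closed: a Region that cannot confirm its role must refuse writes.
        if not self.can_reach_arbiter:
            return False
        # A stale epoch means another Region was promoted in the meantime.
        if not self.arbiter.is_current(self.name, self.epoch):
            return False
        self.accepted.append(record)
        return True


arbiter = Arbiter()
east, west = Region("us-east", arbiter), Region("us-west", arbiter)

east.become_primary()
east.accept_write("order-1")      # normal operation

east.can_reach_arbiter = False    # disaster: east is isolated
west.become_primary()             # failover
west.accept_write("order-2")
east.accept_write("order-3")      # refused: east cannot confirm its role

east.can_reach_arbiter = True     # connectivity returns, east still holds a stale epoch
east.accept_write("order-4")      # refused: stale epoch

east.become_primary()             # failback is an explicit promotion
east.accept_write("order-5")
west.accept_write("order-6")      # refused: west was fenced by the failback
```

The point of the design is that both transitions go through the same promotion step, so at no moment do two Regions hold a valid epoch.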
Backups: types, 3-2-1, and protection from corruption
- Combine full and incremental backups: a full backup is the anchor recovery point, while incrementals between them save space and shorten the backup window but lengthen the restore chain.
- Keep the 3-2-1 rule: three copies of the data, on two different media, one of them off-site. This is the baseline defense against a local disaster (recommended by CISA / US-CERT).
- Protect against logical corruption and ransomware: immutable copies, object versioning, and point-in-time recovery give a restore point before the moment of damage that an attacker cannot erase.
- Isolate copies by access and account: a backup that can be deleted with the same permissions as production data does not protect against an insider or compromised credentials.
- Regularly test restoration itself, not the fact that a copy was created. An untested backup is a hypothesis, not a recovery plan.
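The trade-off between full and incremental backups can be illustrated by computing the restore chain for a given point. This is a minimal sketch under stated assumptions: each incremental holds the changes since the previous backup of any kind, time is an illustrative hour counter, and the `Backup` type and `restore_chain` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at_hour: int  # hours since the start of the schedule (illustrative clock)
    kind: str           # "full" or "incremental"


def restore_chain(backups: List[Backup], target_hour: int) -> List[Backup]:
    """Return the backups to apply, in order, to reach the latest
    recovery point at or before target_hour.

    The chain is the most recent full backup plus every incremental after it.
    """
    usable = sorted(
        (b for b in backups if b.taken_at_hour <= target_hour),
        key=lambda b: b.taken_at_hour,
    )
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target to anchor the restore")
    return usable[full_positions[-1]:]


# Hypothetical schedule: weekly full backup, daily incrementals.
schedule = [Backup(0, "full")]
schedule += [Backup(h, "incremental") for h in (24, 48, 72, 96, 120, 144)]
schedule += [Backup(168, "full"), Backup(192, "incremental")]

late_in_week = restore_chain(schedule, target_hour=150)
after_new_full = restore_chain(schedule, target_hour=200)

print(len(late_in_week), "backups to apply late in the week")
print(len(after_new_full), "backups to apply right after a new full backup")
```

The longer the chain, the more restore steps there are and the more links whose loss or corruption makes the target point unreachable, which is why the chain length matters for RTO.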
Data: copy consistency and RPO for distributed stores
- Distinguish crash-consistent from application-consistent copies: a disk snapshot taken at an arbitrary moment is like pulling the plug. A consistent database backup needs quiesced snapshots or copies via the database engine itself.
- In a distributed store the RPO is set by replication lag. Asynchronous replication yields an RPO greater than zero: anything that has not yet shipped to the recovery Region is lost when the source suffers a disaster.
- For sharded and multi-table systems a time-consistent copy requires a common point: without it different shards restore to different moments and break referential consistency between them.
- Continuous replication gives a near-zero loss window but does not by itself protect against corruption — the error replicates just as fast. That is why point-in-time copies are always added alongside replication.
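The link between replication lag, a common restore point, and RPO can be sketched as follows. The example assumes that every shard supports point-in-time restore to any moment up to its replicated position and that shard clocks are comparable; both are simplifications. Shard names, timestamps, and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def common_restore_point(replicated_up_to: Dict[str, datetime]) -> datetime:
    """Latest moment that every shard can restore to: the slowest shard decides."""
    if not replicated_up_to:
        raise ValueError("no shards given")
    return min(replicated_up_to.values())


def effective_rpo(disaster_at: datetime, replicated_up_to: Dict[str, datetime]) -> timedelta:
    """Data written after the common restore point is lost on every shard."""
    return disaster_at - common_restore_point(replicated_up_to)


disaster_at = datetime(2024, 1, 10, 10, 0, 0)
replicated_up_to = {
    "shard-a": disaster_at - timedelta(seconds=2),
    "shard-b": disaster_at - timedelta(seconds=5),
    "shard-c": disaster_at - timedelta(seconds=40),  # lagging replica
}

restore_point = common_restore_point(replicated_up_to)
lag_rpo = effective_rpo(disaster_at, replicated_up_to)

print("restore every shard to:", restore_point.isoformat())
print("effective RPO in seconds:", lag_rpo.total_seconds())
```

Restoring each shard to its own latest position would recover more data on the fast shards but leave them inconsistent with the lagging one, so the system-wide RPO is set by the worst lag, not the average.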
Game days and DR exercises: test the plan, don't trust it
- Run failure exercises and DR days regularly: Google runs regular DiRT (Disaster Recovery Testing), creating controlled emergencies without impacting customers.
- Measure real RTO with a stopwatch, not the number in a document. An exercise reveals the actual time to recovery and its gap from the paper target.
- Keep failover runbooks current and drill people on them: a plan known only to one person, who may be on vacation, is not a plan.
- Catch configuration drift at the recovery site: over time the DR Region falls behind production in versions, quotas, and secrets. An exercise exposes this before a real disaster.
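Drift between the primary and recovery sites can also be checked between exercises by comparing inventories. The sketch below assumes each site can export a flat key/value inventory; the keys and values are hypothetical, and secrets are compared by version identifier, never by value.

```python
from typing import Dict, Optional, Tuple


def find_drift(primary: Dict[str, str],
               recovery: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (primary_value, recovery_value).

    None marks a key that is missing on that side.
    """
    drift = {}
    for key in sorted(primary.keys() | recovery.keys()):
        p, r = primary.get(key), recovery.get(key)
        if p != r:
            drift[key] = (p, r)
    return drift


# Hypothetical inventories exported from each site.
primary_cfg = {
    "app_version": "4.12.0",
    "db_engine": "15.4",
    "vcpu_quota": "512",
    "api_key_secret_version": "7",
}
recovery_cfg = {
    "app_version": "4.9.1",   # stale release
    "db_engine": "15.4",
    "vcpu_quota": "64",       # too small to scale up to production load
    # api_key_secret_version is missing entirely
}

drift = find_drift(primary_cfg, recovery_cfg)
for key, (p, r) in drift.items():
    print(f"{key}: primary={p} recovery={r}")
```

A check like this complements a drill but does not replace it: it only finds differences in what was inventoried.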
Trade-offs and common mistakes
Untested backups: copies are produced for years, but nobody has ever restored a working system from them — and on the day it matters they turn out incomplete or unreadable.
Paper RTO: the target exists in a document, but real failover depends on manual steps, passwords nobody remembers, and a person who cannot be reached.
Forgotten dependencies: DNS, secrets, queues, third-party APIs, or failover via control-plane operations that are themselves unavailable during the disaster.
No failback plan: the team recovered in the standby Region but never designed a safe return, risking a second outage and loss of fresh data.
Split-brain risk: both Regions consider themselves primary and accept writes at the same time — once connectivity returns, the data conflicts.
Takeaways
- Derive RTO and RPO from business impact, not a default of zero — that drives the budget.
- Pick a strategy (backup & restore → pilot light → warm standby → active/active) for those targets and their cost.
- Protect copies from logical corruption: immutability, versioning, point-in-time, access isolation.
- Run game days regularly and measure real RTO with a stopwatch, keeping runbooks up to date.
- Design failback and split-brain protection as carefully as the failover itself.
References
Source map: AWS Well-Architected and Prescriptive Guidance support the DR strategies, RTO/RPO, and backup practices; the Google SRE Book supports reliability validation through disaster testing; CISA anchors the basic 3-2-1 backup frame. Concrete RTO/RPO targets, downtime cost, and recovery strategy should come from business risk and be tested through game days, not copied from a table.
- AWS — Disaster Recovery of Workloads on AWS: Recovery in the Cloud (DR strategies)
- AWS Well-Architected — Reliability Pillar: Plan for Disaster Recovery (DR)
- Google SRE Book — Chapter 17: Testing for Reliability (disaster & production testing)
- AWS Prescriptive Guidance — Backup and recovery approaches on AWS (RTO/RPO, immutability)
- CISA / US-CERT — Data Backup Options (3-2-1 backup rule)
Related chapters
- Incident Management - The neighboring incident-response discipline: coordination, roles, and communication once the disaster has already happened and the recovery plan must be executed.
- Resilience Patterns - Runtime resilience: timeouts, retries, circuit breakers, and bulkheads that mask local failures before they become a disaster.
- Multi-Region & Global Systems - How cross-region replication and traffic routing work — the foundation that warm standby and active/active strategies stand on.
- Chaos Engineering Tooling - The tooling for controlled failures and game days used to validate the recovery plan and measure real RTO.
