Disaster Recovery: RTO, RPO, and Game Days

Disaster recovery (DR) is the discipline of bringing a whole system back to work after a catastrophe such as the loss of a Region or data center, or data corruption. Unlike runtime resilience, which masks the failure of individual components, DR restores the entire system from discrete copies once the in-band mechanisms no longer help.

At the heart of the discipline are two numbers: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). They are derived from business impact: the cost of a minute of downtime and the value of the data you can afford to lose. The smaller the RTO and RPO, the more expensive and complex the solution, from backup and restore through pilot light and warm standby to active/active.

But a recovery plan is worth exactly as much as the exercise that proves it. An untested backup is a hypothesis, and a paper RTO rarely matches a real failover. Regular game days and DR drills measure actual RTO with a stopwatch and expose forgotten dependencies, configuration drift, and split-brain risk before a real disaster strikes.

Practical value of this chapter

Design in practice

Turn DR guidance (target RTO/RPO, recovery strategies, and regular game days) into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make the trade-offs behind DR choices (target RTO/RPO, recovery strategies, and regular game days) explicit: release speed, automation level, observability cost, and operational complexity.

Source

AWS Well-Architected

Reliability Pillar: planning disaster recovery, RTO/RPO objectives, and strategies.


Runtime fault tolerance is built for losing a node. When an entire Region disappears, or data arrives corrupted in every replica, the in-band mechanisms no longer help — and that is where disaster recovery (DR) begins: the discipline of bringing a whole system back to work after a disaster. It is not the same as runtime resilience (which masks the failure of individual components) and not the same as incident management (coordinating people during an outage). DR sets target RTO and RPO, picks a strategy against their cost, and regularly validates the plan in exercises — game days.

Why: catastrophe vs. ordinary fault tolerance

Loss of a Region or data center

An earthquake, fire, prolonged power loss, or network isolation takes out not a single node but an entire geographic placement zone.

Logical data corruption

A bad migration, a release bug, or replication of the error itself spreads corruption to every healthy replica. Fault tolerance does not help here — replicas faithfully store the corrupted data.

Ransomware and deletion

An attack or compromised credentials can encrypt or delete both production data and the online copies reachable by the same access path.

How DR differs from resilience

Runtime resilience masks the failure of individual components and keeps the service within its availability target. DR restores the whole system from discrete copies after a disaster, when the in-band mechanisms no longer help.

RTO and RPO: what exactly we promise

RTO

Recovery Time Objective

How fast you must be back

The maximum acceptable delay between the interruption of service and its restoration. It answers the question "how fast do we have to be running again".

RPO

Recovery Point Objective

How much data you can lose

The maximum acceptable amount of data, measured in time, that you can lose. It answers the question "how much of the most recent data is safe to lose".

The smaller RTO and RPO are, the less downtime and data loss you accept, but the higher the cost and complexity. Neither number should default to zero: both are set by business impact. The cost of a minute of downtime and the value of the data you can lose determine which DR strategy is worth building at all. The link to money is easiest to see in the calculator below.
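A minimal sketch of how the two definitions turn into measurements after a drill or an incident. The function names and the timeline are hypothetical; the only assumption is that you can timestamp the interruption, the restoration, and the last write that survived in a recoverable copy.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime actually experienced: from interruption to restoration."""
    return service_restored - outage_start


def achieved_rpo(outage_start: datetime, last_recoverable_write: datetime) -> timedelta:
    """Data-loss window: writes made after the last recoverable point are lost."""
    return outage_start - last_recoverable_write


def meets_targets(rto: timedelta, rpo: timedelta,
                  target_rto: timedelta, target_rpo: timedelta) -> bool:
    """True only if both measured values are within their objectives."""
    return rto <= target_rto and rpo <= target_rpo


# Hypothetical drill timeline (illustrative values only).
outage = datetime(2024, 5, 1, 10, 0)
restored = datetime(2024, 5, 1, 10, 47)
last_copy = datetime(2024, 5, 1, 9, 50)

rto = achieved_rto(outage, restored)   # 47 minutes of downtime
rpo = achieved_rpo(outage, last_copy)  # 10 minutes of lost writes

print(meets_targets(rto, rpo, timedelta(hours=1), timedelta(minutes=15)))
```

Comparing measured values against the objectives, rather than quoting the objectives themselves, is what separates a promise from a result.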

Strategies by cost and speed (the AWS model)

AWS groups DR approaches into four tiers, from cheap and slow to expensive and near-instant. The first three are active/passive (traffic goes to the primary Region, and the standby waits for failover); the last one runs in all Regions at once.

Backup & Restore

Cheapest, slowest

Data is periodically copied to another Region; infrastructure and application are redeployed during a disaster via infrastructure as code. The cheapest option, but the slowest recovery.

RTO

hours

RPO

hours

Cost

lowest

Pilot Light

Core data live, compute off

Core data replicates continuously and base infrastructure exists, but application servers are switched off and are only started on failover.

RTO

tens of minutes

RPO

minutes

Cost

low

Warm Standby

Scaled-down but live

A scaled-down but fully functional copy of the environment runs continuously in the recovery Region. On disaster you do not stand it up from scratch; you scale it up to production load, which is why recovery takes minutes, not hours.

RTO

minutes

RPO

seconds-minutes

Cost

medium

Multi-Site Active/Active

Near-zero RTO/RPO

The workload runs in multiple Regions at once and traffic is spread across them. There is essentially no failover, but it is the most expensive and complex path.

RTO

near zero

RPO

near zero

Cost

high

The RTO/RPO estimates here are qualitative (order of magnitude from AWS documentation), not guarantees for a specific system — real numbers depend on data volume, replication lag, and automation.
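One way to make the tier choice mechanical is to pick the cheapest tier whose worst-case recovery still fits the targets. The sketch below does that; the numeric bounds per tier are placeholder assumptions, not AWS figures, and should be replaced with values measured in your own drills. The `Tier` class and `cheapest_tier` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    worst_rto_min: float  # ASSUMED upper bound on recovery time, in minutes
    worst_rpo_min: float  # ASSUMED upper bound on data-loss window, in minutes
    relative_cost: int    # 1 = lowest cost, 4 = highest


# Placeholder bounds for illustration only; measure your own in exercises.
TIERS: List[Tier] = [
    Tier("Backup & Restore", 24 * 60, 24 * 60, 1),
    Tier("Pilot Light", 90, 10, 2),
    Tier("Warm Standby", 15, 5, 3),
    Tier("Multi-Site Active/Active", 1, 1, 4),
]


def cheapest_tier(target_rto_min: float, target_rpo_min: float) -> Optional[Tier]:
    """Return the lowest-cost tier whose assumed bounds fit both targets."""
    candidates = [
        t for t in TIERS
        if t.worst_rto_min <= target_rto_min and t.worst_rpo_min <= target_rpo_min
    ]
    return min(candidates, key=lambda t: t.relative_cost, default=None)


choice = cheapest_tier(target_rto_min=120, target_rpo_min=15)
print(choice.name if choice else "no tier satisfies these targets")
```

A result of `None` is itself useful: it signals that the targets are tighter than any tier in the table can promise and need to be renegotiated with the business.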

Calculator: what RTO and RPO cost

Inputs: downtime cost ($/hour), target RTO (min), target RPO (min), data value ($/min), disasters per year (est.)

Downtime loss

$20,000

over 1 h of downtime

Data loss

$22,500

over a 15 min window

Expected annual loss

$85,000

Tier for the targets

Warm Standby

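Model: loss per disaster = RTO × downtime_cost + RPO × data_value, with RTO and RPO expressed in the time unit of the matching rate (hours for downtime cost, minutes for data value); expected annual loss = loss per disaster × disasters per year. This is a simplified order-of-magnitude estimate to compare a DR budget against expected damage, not a precise forecast for a specific system.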
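The model can be sketched in a few lines. The input values below are not stated in the text; they are inferred so that the output matches the figures shown above ($20,000 downtime loss over 1 h, $22,500 data loss over 15 min, $85,000 expected annual loss). The function names are hypothetical.

```python
def loss_per_disaster(rto_min: float, rpo_min: float,
                      downtime_cost_per_hour: float,
                      data_value_per_min: float):
    """Return (downtime_loss, data_loss) in dollars for one disaster."""
    downtime_loss = rto_min / 60 * downtime_cost_per_hour  # RTO converted to hours
    data_loss = rpo_min * data_value_per_min
    return downtime_loss, data_loss


def expected_annual_loss(rto_min: float, rpo_min: float,
                         downtime_cost_per_hour: float,
                         data_value_per_min: float,
                         disasters_per_year: float) -> float:
    """Loss per disaster multiplied by the estimated disaster frequency."""
    downtime_loss, data_loss = loss_per_disaster(
        rto_min, rpo_min, downtime_cost_per_hour, data_value_per_min)
    return (downtime_loss + data_loss) * disasters_per_year


# Inferred inputs: $20,000/hour, RTO 60 min, RPO 15 min, $1,500/min, 2 disasters/year.
downtime_loss, data_loss = loss_per_disaster(60, 15, 20_000, 1_500)
annual = expected_annual_loss(60, 15, 20_000, 1_500, 2)

print(f"Downtime loss: ${downtime_loss:,.0f}")
print(f"Data loss: ${data_loss:,.0f}")
print(f"Expected annual loss: ${annual:,.0f}")
```

If the annual cost of a DR tier is well above the expected annual loss it prevents, the targets are probably tighter than the business impact justifies.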

Failover and failback: switching there and back

In an active/passive strategy, failing over to the standby and failing back to the primary are two separate, risky transitions. Between them, split-brain becomes possible if both Regions decide they are primary.
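A common guard against split-brain is a fencing token: every promotion increments an epoch, and shared downstream components reject writes that carry an older epoch. The source does not prescribe this technique; the sketch below is an in-memory illustration with hypothetical class and Region names, and it assumes promotions go through a single authority (in practice, a quorum-backed store) and that writes pass through a component able to compare epochs.

```python
from typing import Dict, Optional


class PrimaryAuthority:
    """Hypothetical single source of truth for which Region is primary."""

    def __init__(self) -> None:
        self.epoch = 0
        self.primary: Optional[str] = None

    def promote(self, region: str) -> int:
        """Make `region` primary and hand it a new, strictly larger epoch."""
        self.epoch += 1
        self.primary = region
        return self.epoch


class FencedStore:
    """Accepts a write only if its epoch is not older than the newest one seen."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: Dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale primary: fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True


authority = PrimaryAuthority()
store = FencedStore()

old_epoch = authority.promote("region-a")        # normal operation
store.write(old_epoch, "order:1", "created")

new_epoch = authority.promote("region-b")        # failover
store.write(new_epoch, "order:2", "created")

# region-a comes back and still believes it is primary: its write is rejected.
accepted = store.write(old_epoch, "order:1", "cancelled")
print(accepted)
```

Failback is simply another promotion: the original Region receives a fresh epoch, and the standby is fenced in turn.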

Backups: types, 3-2-1, and protection from corruption

Combine full and incremental backups: a full backup is the anchor recovery point, while incrementals between them save space and shorten the backup window but lengthen the restore chain.
Keep the 3-2-1 rule: three copies of the data, on two different media, one of them off-site. This is the baseline defense against a local disaster (recommended by CISA / US-CERT).
Protect against logical corruption and ransomware: immutable copies, object versioning, and point-in-time recovery give a restore point before the moment of damage that an attacker cannot erase.
Isolate copies by access and account: a backup that can be deleted with the same permissions as production data does not protect against an insider or compromised credentials.
Regularly test restoration itself, not the fact that a copy was created. An untested backup is a hypothesis, not a recovery plan.
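The checks in the list above can be encoded as a small audit over a backup inventory. This is a sketch under assumptions: the `BackupCopy` fields and the sample inventory are hypothetical, and the three copies are counted including the production copy, which is one common reading of 3-2-1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    location: str
    medium: str      # e.g. "disk", "object-storage", "tape"
    offsite: bool
    immutable: bool  # cannot be altered or deleted during retention
    account: str     # account or credential boundary that owns the copy


def audit_backups(copies: List[BackupCopy], production_account: str) -> List[str]:
    """Return a list of violated rules; an empty list means all checks pass."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    if not any(c.immutable and c.account != production_account for c in copies):
        problems.append("no immutable copy outside the production account")
    return problems


# Hypothetical inventory, production copy included.
inventory = [
    BackupCopy("primary-db", "disk", False, False, "prod"),
    BackupCopy("local-snapshot", "disk", False, False, "prod"),
    BackupCopy("remote-archive", "object-storage", True, True, "backup"),
]

print(audit_backups(inventory, production_account="prod"))
```

An audit like this only verifies that copies exist and are isolated. It says nothing about whether they restore, which still has to be proven by an actual restore test.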

Data: copy consistency and RPO for distributed stores

Distinguish crash-consistent from application-consistent copies: a disk snapshot taken at an arbitrary moment is like pulling the plug. A consistent database backup needs quiesced snapshots or copies via the database engine itself.
In a distributed store the RPO is set by replication lag. Asynchronous replication yields an RPO greater than zero: anything that has not yet shipped to the recovery Region is lost when the source suffers a disaster.
For sharded and multi-table systems a time-consistent copy requires a common point: without it different shards restore to different moments and break referential consistency between them.
Continuous replication gives a near-zero loss window but does not by itself protect against corruption — the error replicates just as fast. That is why point-in-time copies are always added alongside replication.
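For a sharded store, the common point is the earliest of the per-shard latest restorable times: restoring every shard to that moment keeps them mutually consistent, and the distance from the disaster to that moment is the effective RPO. The sketch below assumes each shard supports point-in-time recovery to any moment up to its latest restorable time; the shard names, timestamps, and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def common_restore_point(latest_restorable: Dict[str, datetime]) -> datetime:
    """Latest moment to which every shard can be restored."""
    return min(latest_restorable.values())


def effective_rpo(disaster_time: datetime,
                  latest_restorable: Dict[str, datetime]) -> timedelta:
    """Data-loss window for the system as a whole, set by the most lagging shard."""
    return disaster_time - common_restore_point(latest_restorable)


# Hypothetical replication state in the recovery Region at the time of disaster.
disaster = datetime(2024, 5, 1, 12, 0, 5)
shards = {
    "shard-0": datetime(2024, 5, 1, 12, 0, 0),
    "shard-1": datetime(2024, 5, 1, 11, 59, 30),
    "shard-2": datetime(2024, 5, 1, 11, 58, 0),  # most lagging replica
}

point = common_restore_point(shards)
window = effective_rpo(disaster, shards)
print(point, window)
```

The most lagging shard sets the RPO for the whole system, so replication lag is worth alerting on per shard rather than as an average.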

Game days and DR exercises: test the plan, don't trust it

Run failure exercises and DR days regularly. Google, for example, runs its DiRT (Disaster Recovery Testing) program as a series of controlled emergencies designed not to impact customers.
Measure real RTO with a stopwatch, not the number in a document. An exercise reveals the actual time to recovery and its gap from the paper target.
Keep failover runbooks current and drill people on them: a plan that only one person knows, and that person is on vacation, is not a plan.
Catch configuration drift at the recovery site: over time the DR Region falls behind production in versions, quotas, and secrets. An exercise exposes this before a real disaster.
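Drift detection can start as a plain diff between what production runs and what the recovery site has. The sketch below assumes both environments can export their settings as flat key/value maps; the keys, values, and the `find_drift` helper are hypothetical.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key present on only one side


def find_drift(primary: Dict[str, Any],
               recovery: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to its (primary_value, recovery_value) pair."""
    drift = {}
    for key in sorted(primary.keys() | recovery.keys()):
        p = primary.get(key, MISSING)
        r = recovery.get(key, MISSING)
        if p != r:
            drift[key] = (p, r)
    return drift


# Hypothetical exports from the two Regions.
primary_cfg = {
    "app_version": "4.12.0",
    "db_engine": "15.4",
    "vcpu_quota": 512,
    "tls_secret_version": 9,
}
recovery_cfg = {
    "app_version": "4.9.1",
    "db_engine": "15.4",
    "vcpu_quota": 64,
    "feature_flag_x": True,
}

for key, (p, r) in find_drift(primary_cfg, recovery_cfg).items():
    print(f"{key}: primary={p} recovery={r}")
```

Running a diff like this on a schedule, and again before each game day, turns drift from a surprise during failover into a routine ticket.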

Trade-offs and common mistakes

Untested backups: copies are produced for years, but nobody has ever restored a working system from them — and on the day it matters they turn out incomplete or unreadable.

Paper RTO: the target exists in a document, but real failover depends on manual steps, passwords nobody remembers, and a person who cannot be reached.

Forgotten dependencies: DNS, secrets, queues, third-party APIs, or failover via control-plane operations that are themselves unavailable during the disaster.

No failback plan: the team recovered in the standby Region but never designed a safe return, risking a second outage and loss of fresh data.

Split-brain risk: both Regions consider themselves primary and accept writes at the same time — once connectivity returns, the data conflicts.

Takeaways

Derive RTO and RPO from business impact, not a default of zero — that drives the budget.
Pick a strategy (backup and restore → pilot light → warm standby → active/active) for those targets and their cost.
Protect copies from logical corruption: immutability, versioning, point-in-time, access isolation.
Run game days regularly and measure real RTO with a stopwatch, keeping runbooks up to date.
Design failback and split-brain protection as carefully as the failover itself.

References

Source map: AWS Well-Architected and Prescriptive Guidance support the DR strategies, RTO/RPO, and backup practices; the Google SRE Book supports reliability validation through disaster testing; CISA anchors the basic 3-2-1 backup frame. Concrete RTO/RPO targets, downtime cost, and recovery strategy should come from business risk and be tested through game days, not copied from a table.

Related chapters

Incident Management - The neighboring incident-response discipline: coordination, roles, and communication once the disaster has already happened and the recovery plan must be executed.
Resilience Patterns - Runtime resilience: timeouts, retries, circuit breakers, and bulkheads that mask local failures before they become a disaster.
Multi-Region & Global Systems - How cross-region replication and traffic routing work — the foundation that warm standby and active/active strategies stand on.
Chaos Engineering Tooling - The tooling for controlled failures and game days used to validate the recovery plan and measure real RTO.