Multi-region / Global Systems — System Design Space

Global systems do not begin with a pretty map of regions. They begin with an honest discussion about failure geography, latency, and the limits of consistency.

In real design work, the chapter shows how multi-region topology, routing policy, replication strategy, disaster recovery, and legal constraints have to form one system model rather than a set of independent regional choices.

In interviews and engineering discussions, it helps explain the cost of going global through write conflicts, operational complexity, infrastructure spend, and the heavier discipline such systems demand.

Practical value of this chapter

Design in practice

Select multi-region topology based on latency targets, legal constraints, and recovery objectives.

Decision quality

Align routing policy, replication strategy, and consistency model across regions.

Interview articulation

Show the decision flow: active-active vs active-passive, failover policy, and data residency boundaries.

Trade-off framing

Explain globalization costs: operational complexity, write conflicts, and infrastructure budget growth.

Context

Cloud Native Overview

The cloud-native foundation on which multi-region design is built.

Multi-region / Global Systems are services that have to survive the loss of an entire region while still holding predictable latency for users around the world. Deploying copies into several regions is the easy part; the hard part begins when latency, availability, consistency, compliance and cost pull the design in different directions and every gain has to be paid for somewhere.

Basic multi-region topology

Diagram: Users (web / mobile) → Global routing (Geo DNS + health) → requests → Primary region (Application on the primary path, Database with a single writer) → async replication → Secondary region (standby app on the failover path, DB replica).

Best fit

A simple starting point for most B2B/B2C systems with one primary region.

Trade-off

Lower operational complexity, but higher RTO/RPO and higher latency for users far from the primary region.

Operations focus

Automate failover and failback, then verify data integrity after the secondary region is promoted.

Shape: One region serves the primary load while the secondary region stays ready and takes over during regional degradation.

These three shapes are not a classroom abstraction: each is implemented by a specific production system, and we dissect them later in this chapter. Keep the mapping in sight:

Active-Passive

Aurora Global Database

One writer region, storage replication to secondaries, managed promotion of the standby — a textbook implementation of the pattern.

Active-Active

DynamoDB Global Tables; Spanner — the quorum variant

Global Tables accept writes in every region and pay with conflicts (LWW); Spanner makes active-active strongly consistent — at the cost of a quorum RTT and commit-wait on every write.

Geo-partitioned

CockroachDB REGIONAL BY ROW; application-level tenant region-pinning

Data is pinned to the home region of a row or a tenant: local operations are fast, cross-region ones are explicit and rare.

Data

Azure latency stats

A public P50 RTT matrix between Azure regions — updated from real backbone measurements.

The physics of distance: inter-region RTTs

Every conversation about global architecture starts with one number — the round-trip time between regions. It is the floor for every synchronous operation that crosses a region boundary: no quorum, no replication, no cross-region transaction can beat it. Reference points from public P50 measurements by cloud providers (Azure, June 2025; the AWS matrix at cloudping.co shows similar values):

US East ↔ US West

~60–70 ms

Same continent: synchronous replication is feasible, but it already eats a visible share of the latency budget.

US East ↔ Western Europe

~75–85 ms

The classic transatlantic pair: P50 East US ↔ North Europe ≈ 74 ms, ↔ West Europe ≈ 85 ms.

US East ↔ Tokyo

~165 ms

Transpacific route: a quorum write across this pair will not fit an interactive budget.

US East ↔ Singapore

~220–230 ms

Almost a quarter of a second per exchange — any two-round protocol already costs ~0.5 s.

Western Europe ↔ Singapore

~150–165 ms

The route runs through the Middle East or around India — longer than the great-circle arc.

US East ↔ São Paulo

~115–120 ms

North-south routes are more expensive than the map suggests: fewer cables, fewer direct paths.

What these numbers do to an architecture

Synchronous replication = at least 1 RTT per commit

A write that must be acknowledged by a second region before answering the client cannot beat the inter-region RTT. Across the Atlantic that is ≥75 ms before any application logic.

A quorum pays the RTT to the nearest majority

With three regions, the leader only needs an ack from one of the other two replicas: write latency is set by the closest neighbor, not the farthest. That is why quorum regions are placed close together — e.g., three regions on one side of the world.

Two-phase commit across regions ≥ 2 RTT

Prepare and commit are two inter-region rounds. A transaction touching shards in Europe and Asia costs ≥300 ms in network alone — before any locks or disk writes.

Physics does not negotiate

Light in fiber travels at ~200,000 km/s: ~1 ms of round-trip time per 100 km of distance (~0.5 ms one way). New York to Singapore is ~15,000 km along the arc, so the theoretical RTT is ~150 ms; real cable routes yield 220+.

Hence the central design rule: the latency budget is spent not on the “average” operation but on the deepest chain of synchronous cross-region exchanges. If a write must be acknowledged across the Pacific (~165–230 ms per round trip), a 200 ms p95 budget is consumed by the network alone. Everything else (caches, asynchronous replication, local quorums) exists to keep such exchanges off the hot path.
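As a rough illustration of the cards above, the following sketch estimates the network-only floor of each protocol from an RTT table. The RTT values are midpoints of the ranges quoted in this chapter; the region labels and function names are hypothetical and exist only for this example.

```python
# Network-only latency floors for cross-region protocols (illustrative sketch).
# RTT values are midpoints of the P50 ranges quoted above; region labels and
# helper names are assumptions made for this example.

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s in fiber, i.e. 200 km per millisecond

RTT_MS = {
    frozenset({"us-east", "us-west"}): 65.0,
    frozenset({"us-east", "west-europe"}): 80.0,
    frozenset({"us-east", "tokyo"}): 165.0,
    frozenset({"us-east", "singapore"}): 225.0,
    frozenset({"west-europe", "singapore"}): 157.5,
}


def rtt(a: str, b: str) -> float:
    """Round-trip time between two regions in milliseconds."""
    return 0.0 if a == b else RTT_MS[frozenset({a, b})]


def fiber_floor_rtt_ms(distance_km: float) -> float:
    """Theoretical RTT over a straight fiber run of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def sync_replication_ms(primary: str, secondary: str) -> float:
    """One acknowledgement from the second region: at least 1 RTT."""
    return rtt(primary, secondary)


def quorum_commit_ms(leader: str, followers: list[str]) -> float:
    """The leader votes for itself and waits for the fastest remaining acks."""
    voters = 1 + len(followers)
    majority = voters // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0
    follower_rtts = sorted(rtt(leader, f) for f in followers)
    return follower_rtts[acks_needed - 1]


def two_phase_commit_ms(coordinator: str, participants: list[str]) -> float:
    """Prepare and commit rounds, each bounded by the slowest participant."""
    return 2 * max(rtt(coordinator, p) for p in participants)


if __name__ == "__main__":
    print(fiber_floor_rtt_ms(15_000))                              # New York to Singapore
    print(quorum_commit_ms("us-east", ["us-west", "west-europe"]))
    print(two_phase_commit_ms("west-europe", ["singapore"]))
```

Comparing the result of `two_phase_commit_ms` with a per-request budget is a quick way to see whether a design keeps cross-region rounds off the hot path; the numbers ignore locks, disk writes, and queueing, so real latencies are higher.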

Decision framework

Latency budget: what p95/p99 response time is acceptable in each region.
Availability target: what degradation is acceptable when an entire region fails.
Consistency model: where strict consistency is needed, and where eventual is sufficient.
Data sovereignty: where data can be physically stored and processed.
Cost profile: how much cross-region replication, egress, and backup capacity cost.

Theory

CAP Theorem

When regions are partitioned from each other, the architecture must explicitly choose CP or AP priorities.

Data layer: replication and consistency

Single-writer + read replicas

A good baseline for OLTP with a clear consistency model and manageable failover.

Multi-primary with conflict resolution

Suitable for write-heavy global workloads, but requires a clear merge/last-write-wins/CRDT strategy.

Regional sharding

Partitioning tenants or data domains by region reduces latency and cost and improves failure isolation.

Dual-write is not allowed without coordination

Write through an outbox/eventing pipeline or an agreed replication layer; otherwise the copies drift out of sync and writes get lost.
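A minimal sketch of the outbox idea, assuming a single local database (SQLite stands in for the regional store) and a hypothetical `publish` callback that ships events to the other region. The table and function names are invented for the example.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    """Write the order and its event in ONE local transaction (no dual-write)."""
    event = {"type": "order_placed", "id": order_id, "total_cents": total_cents}
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Ship unpublished events in order; a failed publish is retried next run."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # may raise: the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

The relay gives at-least-once delivery: if it crashes between `publish` and the `UPDATE`, the event is sent again, so the consumer in the other region has to be idempotent.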

Mechanics

Consensus: Paxos and Raft

The quorum writes in Spanner and CockroachDB are Paxos/Raft stretched across regions.

How real systems solve it

Four production systems cover almost the entire trade-off space: from strong consistency that pays for every quorum, to eventual consistency with last-writer-wins. Read each card along the same axis: consistency model → mechanism → price.

Google Spanner

Consistency model

External consistency: if transaction T2 starts after T1 commits, T2's timestamp is strictly greater — no client can ever observe T2's effects without T1's.

Mechanism

TrueTime: the time API returns not an instant but an uncertainty interval [earliest, latest], which GPS receivers and atomic clocks in every datacenter keep within ~7 ms. A write goes through the shard's Paxos group, and before making it visible the leader performs commit-wait — it waits until the upper bound of the interval is provably in the past (a few milliseconds on average).

Price

Every commit pays the quorum RTT between replica regions plus commit-wait; specialized clock infrastructure is required; shard leader placement has to be designed around the geography of writes.

Topology

Quorum-based active-active: shard leaders are spread across regions, reads come from the nearest replica (strong or deliberately stale).
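The commit-wait step from the Spanner card can be sketched with a simulated clock. This is not Spanner's API: `SimulatedTrueTime`, `commit`, and the fixed uncertainty bound are assumptions for illustration, and the Paxos round is reduced to a comment.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Stand-in for TrueTime: local clock plus/minus a fixed uncertainty."""

    def __init__(self, epsilon_s: float, clock: Callable[[], float] = time.monotonic):
        self.epsilon_s = epsilon_s
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp: float) -> bool:
        """True once the timestamp is in the past even on the slowest clock."""
        return self.now().earliest > timestamp


def commit(tt: SimulatedTrueTime, apply_write: Callable[[float], None],
           sleep: Callable[[float], None] = time.sleep) -> float:
    """Pick a commit timestamp, wait out the uncertainty, then expose the write."""
    commit_ts = tt.now().latest  # not earlier than true time on any replica
    # A real system would run the Paxos round here, overlapping with the wait.
    while not tt.after(commit_ts):
        sleep(tt.epsilon_s / 4)
    apply_write(commit_ts)  # visible only after commit_ts is provably past
    return commit_ts
```

With an uncertainty of ε on each side the wait is roughly 2ε, which is why the card describes it as a few milliseconds on average and why tighter clocks translate directly into faster commits.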

DynamoDB Global Tables

Consistency model

Eventual consistency between regions (the default MREC mode); strongly consistent reads are available only within a region.

Mechanism

Fully active-active replication: each regional replica accepts writes locally, and changes propagate to the other regions asynchronously, typically in under a second. Concurrent writes to the same item are resolved last-writer-wins — by an internal timestamp, per item.

Price

The losing concurrent write is discarded silently — the application must either prevent conflicts (region-pinning of keys) or tolerate them. The MRSC mode adds strong consistency: a write is synchronously acknowledged by at least one more region — at the cost of an inter-region RTT and explicit conflict errors (ReplicatedWriteConflictException).

Topology

The canonical active-active: every region accepts both reads and writes as a peer.
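To make the "discarded silently" behavior from the DynamoDB card concrete, here is a small last-writer-wins sketch. It is a model of the idea, not DynamoDB's implementation: the `Version` fields, the region-name tie-breaker, and the caller-supplied timestamps are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int
    region: str  # assumed tie-breaker so every replica picks the same winner


def lww_merge(local: Version, incoming: Version) -> Version:
    """Keep the version with the highest (timestamp, region) pair."""
    return max(local, incoming, key=lambda v: (v.timestamp_ms, v.region))


class RegionalReplica:
    def __init__(self, region: str):
        self.region = region
        self.items: Dict[str, Version] = {}

    def apply(self, key: str, version: Version) -> None:
        """Apply a local or replicated version; the loser leaves no trace."""
        current: Optional[Version] = self.items.get(key)
        self.items[key] = version if current is None else lww_merge(current, version)

    def write(self, key: str, value: str, now_ms: int) -> Version:
        """Accept a write locally and return the version to replicate."""
        version = Version(value, now_ms, self.region)
        self.apply(key, version)
        return version


if __name__ == "__main__":
    us, eu = RegionalReplica("us-east-1"), RegionalReplica("eu-west-1")
    v_us = us.write("cart", "A", now_ms=1000)
    v_eu = eu.write("cart", "B", now_ms=1005)
    us.apply("cart", v_eu)
    eu.apply("cart", v_us)
    print(us.items["cart"].value, eu.items["cart"].value)
```

Both replicas converge on the same value, and neither client is told that a write was overwritten; that is the property the application has to design around.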

Aurora Global Database

Consistency model

Single-writer: one primary region accepts writes; up to five secondary regions serve reads only, typically lagging by less than a second.

Mechanism

Storage-level replication: changes travel to secondary regions over dedicated storage-layer infrastructure without stealing resources from the databases themselves. Write forwarding lets an application in a secondary region issue writes — Aurora transparently forwards them to the primary.

Price

Asynchrony means RPO ≈ 1 s: losing the primary region can lose the most recent writes. RTO < 1 min with managed failover. Users far from the primary region pay a full inter-region RTT for every write.

Topology

Active-passive (more precisely, active + read-only secondaries): the main pattern is DR and local reads, not global writes.

CockroachDB

Consistency model

Serializable transactions, always — the question is not "which consistency" but "where the latency lives". Instead of TrueTime: hybrid logical clocks with a configurable maximum clock offset.

Mechanism

Tables with REGIONAL BY ROW locality: every row gets a hidden crdb_region column — a home region that hosts the leaseholder of its range and the voting replicas. The survival goal sets the price: ZONE keeps the whole quorum in the home region (local writes, but the region is a single point of failure), REGION stretches the quorum across regions (survives a region, but every write pays the RTT to the nearest neighbor). Follower reads (AS OF SYSTEM TIME) serve slightly historical data from the closest replica without visiting the leaseholder.

Price

An explicit database-level choice: either region survivability or local write latency — not both. Reads from the nearest replica trade freshness.

Topology

Geo-partitioned with a tunable skew: rows live where their users are.

Note the pattern: Spanner and CockroachDB put the cost of consistency into write latency (a quorum plus bounded clock skew), DynamoDB puts it into lost conflicts, Aurora into the RPO and a single writer region. There is no free row in this table — only a choice of which currency to pay in.

Global traffic routing

Geo-DNS / latency-based routing sends an incoming request to the lowest-latency region, but the answer is cached by client-side resolvers you do not control, so the reaction to a change is measured in minutes, not seconds.
Health-aware traffic steering pulls a degraded region out of rotation quickly — but only if the checks catch its actual sickness rather than a bare ping.
Region affinity (sticky routing) keeps a user's session and cache in one region: fewer cache misses and cross-region hops, but on failover the affinity has to be reset explicitly.
A global API gateway / edge layer gives the fallback and partial-outage policy one place to live — otherwise every service invents its own degradation, and the system's behavior in an incident becomes unpredictable.
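The first three policies in the list above can be combined in a few lines. The sketch below is a simplified model, assuming the router already has a health flag and a measured latency per region; the `Region` type and `pick_region` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str
    healthy: bool
    latency_ms: float  # measured from the client's vantage point


def pick_region(regions: List[Region], sticky: Optional[str] = None) -> Region:
    """Prefer the sticky region while it is healthy, else the fastest healthy one."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    for region in healthy:
        if region.name == sticky:
            return region  # keep session and cache locality
    # The affinity target is missing or unhealthy: reset it explicitly.
    return min(healthy, key=lambda r: r.latency_ms)


if __name__ == "__main__":
    fleet = [
        Region("eu-west", healthy=True, latency_ms=20.0),
        Region("us-east", healthy=True, latency_ms=95.0),
    ]
    print(pick_region(fleet, sticky="us-east").name)
    fleet[1].healthy = False
    print(pick_region(fleet, sticky="us-east").name)
```

The `healthy` flag is only as good as the check behind it, which is the point of the second item: a region that answers a ping but cannot reach its database should be reported as unhealthy.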

Context

Google Global Network

Anycast front ends and a private WAN are what make fast regional failover physically possible.

The mechanics of regional failover

"The region went down — traffic moved to the neighbor" sounds like one action, but it is three different problems: how fast the world learns about the switch, what happens to the data during a partition, and how much the capacity that waits for a disaster costs. Start with the first one — failover comes in two fundamentally different kinds.

Health-based DNS

Minutes

Health checks remove a degraded region from DNS answers (Route 53 style). But the decision is executed by client-side resolvers: until the TTL expires — and some resolvers ignore TTLs — traffic keeps flowing into the dead region.

Failover speed is limited by caches you do not control.

Anycast

Seconds to tens of seconds

The same IP is announced via BGP from many locations worldwide; the switch happens at the network layer, bypassing client DNS caches. AWS Global Accelerator detects an unhealthy endpoint within sub-second intervals and redirects new connections, typically within tens of seconds; Google's and Cloudflare's front ends work the same way.

The address never changes — what changes is where the network takes it.

The first relies on health checks layered on top of ordinary DNS, the second on anycast routing. In practice they are combined: an anycast front end for speed, DNS as an independent backup lever that still works even if the anycast layer itself breaks.
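A back-of-the-envelope estimate shows why DNS failover lands in the "minutes" bucket. The formula below is a simplification that assumes resolvers honor the TTL; the parameter names and the example values are illustrative, not taken from any provider's documentation.

```python
def dns_failover_window_s(check_interval_s: float,
                          failure_threshold: int,
                          record_ttl_s: float) -> float:
    """Rough upper bound on how long traffic keeps reaching a dead region.

    Detection takes `failure_threshold` consecutive failed checks; after that,
    a resolver that cached the record just before the change keeps serving it
    for up to one full TTL.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + record_ttl_s


if __name__ == "__main__":
    # Illustrative values: a check every 30 s, 3 failures to trip, 60 s TTL.
    print(dns_failover_window_s(30, 3, 60))
```

Resolvers that ignore TTLs push the real tail beyond this bound, which is the argument for keeping an anycast front end in the path.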

Split brain: two regions, both halves "alive"

The worst case is not a region failure but a network partition between regions in which both sides believe they survived. If each keeps accepting writes as the primary, you get split brain: two diverging data histories that cannot be merged automatically and losslessly once connectivity returns.

A quorum arbiter. An odd number of participants: a third region (or a lightweight witness node) hands the majority to exactly one side of the partition — the other must stop accepting writes.
Fencing. Promoting a new primary comes with an epoch token; the old primary holding a stale token physically cannot write — even while it still "thinks" it is in charge.
Asymmetry by default. For the CP parts of a system the rule is simple: better to reject a write in the minority than to accept it twice. AP parts (DynamoDB Global Tables) deliberately choose the opposite — and pay with conflicts.
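The first two defenses in the list above can be sketched in a few lines. `FencedStore`, `StaleEpochError`, and `has_quorum` are hypothetical names; a real system would persist the epoch and obtain it from its consensus layer.

```python
from typing import Dict


class StaleEpochError(Exception):
    """Raised when a writer presents an epoch older than one already seen."""


class FencedStore:
    """Storage that rejects writes from a primary with a stale epoch token."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: Dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> None:
        if epoch < self.highest_epoch:
            raise StaleEpochError(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch  # remember the newest primary seen so far
        self.data[key] = value


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Only a strict majority may keep accepting writes during a partition."""
    return reachable_voters > total_voters // 2


if __name__ == "__main__":
    store = FencedStore()
    store.write(epoch=1, key="balance", value="100")  # old primary
    store.write(epoch=2, key="balance", value="90")   # promoted primary
    try:
        store.write(epoch=1, key="balance", value="100")
    except StaleEpochError as err:
        print("rejected:", err)
    print(has_quorum(2, 3), has_quorum(1, 3), has_quorum(1, 2))
```

The `has_quorum(1, 2)` case shows why two regions alone cannot arbitrate a partition: neither half holds a strict majority, hence the third region or witness.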

Region evacuation: failover as a routine

Mature operators treat taking a region out of service not as a catastrophe but as a managed procedure — region evacuation: gradually draining traffic into neighboring regions, watching errors and saturation, with an instant rollback available. Google rehearses such evacuations regularly in its DiRT exercises, and game days without a real traffic drain are considered incomplete: a procedure executed only on paper will not work in a real disaster. The key question of an evacuation is not "where will the traffic go" but "do the neighbors have the capacity to take it".

2 regions active-active

Each region utilized ≤50%

Either of the two must instantly absorb 100% of global load — half of the purchased capacity idles on a normal day.

3 regions active-active

Each region utilized ≤66%

The surviving pair splits the failed region's load: the capacity reserve is 1/(N−1) of each region's normal share.

Active-passive with a warm standby

Cheaper standby, longer RTO

A minimal footprint in the standby (data + reduced compute) cuts the bill but adds scale-up time during a disaster — and the risk that the standby fails to scale when everyone rushes into one region.

This is the price of a hot standby: the fewer the regions, the more invisible capacity is paid for every day. It competes with the egress bill for replication — together they often cost more than the compute itself, and they are the first numbers to run when choosing between active-active and a warm standby.
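The utilization ceilings in the cards above follow from one line of arithmetic. The sketch assumes N identical regions sharing the load evenly and planning for the loss of exactly one region; the function names are invented for the example.

```python
def max_safe_utilization(n_regions: int) -> float:
    """Highest steady-state utilization that still survives losing one region."""
    if n_regions < 2:
        raise ValueError("need at least two regions to absorb a failure")
    return (n_regions - 1) / n_regions


def reserve_vs_normal_share(n_regions: int) -> float:
    """Extra load each survivor takes on, as a fraction of its normal share."""
    if n_regions < 2:
        raise ValueError("need at least two regions to absorb a failure")
    return 1 / (n_regions - 1)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, max_safe_utilization(n), reserve_vs_normal_share(n))
```

Going from two to three regions lowers the idle share from one half to one third of purchased capacity, which is the trade the paragraph above weighs against the replication egress bill.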

DR, failover and operational readiness

Define RTO/RPO for each critical service and database.

Exercise failover and failback regularly in game days, not just on paper.

Automate promotion of the secondary region and the data-integrity check after switching.

Keep a runbook: who, when and based on what signals initiates regional failover.

Separate the dependency graph so that a regional failure does not cascade globally.

Untested failover is not fault tolerance but a hypothesis about it: without regular drills the architecture stays “theoretically” reliable right up to the first real disaster.

Theory

PACELC

Residency adds a third axis to the latency/consistency trade-off: where the data is allowed to live.

Data residency as an architectural constraint

Data residency is the one requirement in this chapter that cannot be bought with latency or money: if a law or a contract demands that EU residents' data physically never leaves the EU, that is an invariant of the data layer, not a setting. Technically it is implemented through region-pinning — anchoring rows or tenants to a home region, i.e. the same mechanics as geo-partitioning, except the region is dictated by the regulator rather than by latency.

Row

CockroachDB REGIONAL BY ROW: a row's home region determines where its quorum and leaseholder live. An EU resident's data physically stays in EU regions.

Tenant

Tenant routing at the gateway: each tenant is pinned to a home region, its requests and data never leave it; the only global piece is the "tenant → region" directory.

Deployment

Separate regional stacks (e.g., an isolated EU environment): maximum isolation and a simple story for the regulator — at the price of duplicated operations.

The tenant variant relies on tenant routing at the global gateway — and that is the deceptively easy part. The hard part is what residency imposes on everything else:

Global uniqueness (e-mail, username) requires cross-region coordination or a global directory — the one place where a residency architecture still pays an inter-region RTT.
Moving a user or a tenant between regions is an explicit business process with a data migration, not a flag flip.
Analytics and ML collect aggregates or anonymized data, not raw records: pipelines must respect the boundaries too.
Backups and replication logs inherit the constraint — a backup "to another region for safety" can itself be a violation.
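A minimal sketch of the "tenant → region" directory, extended with a check that backups and replicas inherit the constraint. All names (`Tenant`, `TenantDirectory`, `check_copy`, the region labels) are hypothetical; a real directory would be a replicated, highly available service.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


class ResidencyViolation(Exception):
    """Raised when data would be placed outside a tenant's allowed regions."""


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    home_region: str
    allowed_regions: FrozenSet[str]  # where this tenant's data may be stored


class TenantDirectory:
    """The only global piece: maps a tenant to its home region."""

    def __init__(self) -> None:
        self._tenants: Dict[str, Tenant] = {}

    def register(self, tenant: Tenant) -> None:
        if tenant.home_region not in tenant.allowed_regions:
            raise ResidencyViolation(f"{tenant.tenant_id}: home region not allowed")
        self._tenants[tenant.tenant_id] = tenant

    def route(self, tenant_id: str) -> str:
        """Region that must serve every request of this tenant."""
        return self._tenants[tenant_id].home_region

    def check_copy(self, tenant_id: str, target_region: str) -> None:
        """Guard for backups, replicas, and exports: they inherit the constraint."""
        tenant = self._tenants[tenant_id]
        if target_region not in tenant.allowed_regions:
            raise ResidencyViolation(f"{tenant_id}: copy to {target_region} forbidden")


if __name__ == "__main__":
    directory = TenantDirectory()
    directory.register(Tenant("acme-eu", "eu-west", frozenset({"eu-west", "eu-central"})))
    print(directory.route("acme-eu"))
    directory.check_copy("acme-eu", "eu-central")  # allowed: stays inside the EU
    try:
        directory.check_copy("acme-eu", "us-east")
    except ResidencyViolation as err:
        print("blocked:", err)
```

Calling `check_copy` from backup and replication jobs, not only from the request path, is what turns residency into an invariant of the data layer rather than a routing setting.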

References

Related chapters

Cloud Native Overview - Basic principles of cloud-native architectures and operational practices.
Kubernetes Fundamentals - Mechanics of multi-cluster, rollout and workload stability.
CAP Theorem - Fundamental limits of distributed systems under network partitions.
PACELC - Why a system pays for the latency-versus-consistency choice in normal operation too, not only during a partition.
Consensus Protocols - How nodes in a fault-tolerant cluster agree on state.
Google Global Network - The evolution of the global network and approaches to WAN as a strategic asset.
Cost Optimization & FinOps - How to calculate the cost of multi-region solutions and long-term trade-offs.
SRE and operational reliability - Incidents, SLOs and operation of complex production systems.