System Design Space

Updated: May 11, 2026 at 7:10 PM

Multi-region / Global Systems

hard

Design of global cloud-native systems: multi-region topologies, consistency trade-offs, global traffic routing and disaster recovery.

Global systems do not begin with a pretty map of regions. They begin with an honest discussion about failure geography, latency, and the limits of consistency.

In real design work, the chapter shows how multi-region topology, routing policy, replication strategy, disaster recovery, and legal constraints have to form one system model rather than a set of independent regional choices.

In interviews and engineering discussions, it helps explain the cost of going global through write conflicts, operational complexity, infrastructure spend, and the heavier discipline such systems demand.

Practical value of this chapter

Design in practice

Select a multi-region topology based on latency targets, legal constraints, and recovery objectives.

Decision quality

Align routing policy, replication strategy, and consistency model across regions.

Interview articulation

Show the decision flow: active-active vs active-passive, failover policy, and data residency boundaries.

Trade-off framing

Explain globalization costs: operational complexity, write conflicts, and infrastructure budget growth.

Context

Cloud Native Overview

The cloud-native foundation on which multi-region design is built.

Multi-region / global system design means building services that keep operating when a region fails and that give users around the world predictable latency. The main engineering question is not only "how to deploy in many regions" but how to balance latency, availability, consistency, compliance, and cost.

Basic multi-region topology

Best fit

A simple starting point for most B2B/B2C systems with one primary region.

Trade-off

Lower operational complexity, but higher RTO/RPO and higher latency for users far from the primary region.

Operations focus

Automate failover and failback, then verify data integrity after the secondary region is promoted.

Shape: One region serves the primary load while the secondary region stays ready and takes over during regional degradation.
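The active-passive shape described above can be sketched as a small state machine. This is a minimal illustration, not a production failover controller: the region names, the `ActivePassivePair` class, and the threshold of three consecutive failed probes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ActivePassivePair:
    """Tracks which region serves traffic in an active-passive setup (hypothetical helper)."""

    primary: str
    secondary: str
    failure_threshold: int = 3  # assumed: consecutive failed probes before failover
    active: str = field(default="", init=False)
    _consecutive_failures: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.active = self.primary

    def record_probe(self, primary_healthy: bool) -> str:
        """Feed one health-probe result for the primary and return the active region."""
        if self.active != self.primary:
            # Already failed over. Failback is a separate, deliberate step.
            return self.active
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self.active = self.secondary  # promote the standby region
        return self.active

    def failback(self) -> str:
        """Return traffic to the primary, for example after data integrity is verified."""
        self._consecutive_failures = 0
        self.active = self.primary
        return self.active


pair = ActivePassivePair(primary="eu-west", secondary="us-east")
history = [pair.record_probe(ok) for ok in (True, False, False, False)]
```

Requiring several consecutive failures before promotion trades a slightly longer RTO for fewer spurious failovers. Keeping failback explicit reflects the operations focus above: the switch back should follow an integrity check rather than the first healthy probe.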

Decision framework

  • Latency budget: what p95/p99 response time is acceptable in each region.
  • Availability target: what degradation is acceptable when an entire region fails.
  • Consistency model: where strict consistency is needed, and where eventual is sufficient.
  • Data sovereignty: where data can be physically stored and processed.
  • Cost profile: how much cross-region replication, egress, and backup capacity cost.
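The latency-budget item can be made concrete with a small check against measured samples. This is a sketch under assumptions: the nearest-rank percentile definition is one of several in common use, and the function names and budget values are hypothetical.

```python
def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def within_budget(samples_ms: list[float], p95_budget_ms: float, p99_budget_ms: float) -> bool:
    """True when both the p95 and p99 of the samples fit the stated budget."""
    return (
        percentile(samples_ms, 95) <= p95_budget_ms
        and percentile(samples_ms, 99) <= p99_budget_ms
    )


# Hypothetical per-region measurements: 1 ms .. 100 ms, one sample each.
samples = [float(ms) for ms in range(1, 101)]
ok = within_budget(samples, p95_budget_ms=100.0, p99_budget_ms=120.0)
```

Running the same check per region, rather than on a global aggregate, keeps a fast region from hiding a slow one.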

Theory

CAP Theorem

When a network partition separates regions, the architecture must explicitly choose between CP and AP behavior.

Data layer: replication and consistency

Single-writer + read replicas

A good baseline for OLTP with a clear consistency model and manageable failover.

Multi-primary with conflict resolution

Suitable for write-heavy global workloads, but requires an explicit conflict-resolution strategy: application-level merge, last-write-wins, or CRDTs.
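Two of the conflict-resolution strategies named above can be sketched in a few lines. This is an illustration under assumptions, not a replication layer: `LWWRegister` and `merge_gcounter` are hypothetical names, and the timestamp is assumed to come from a logical or hybrid clock, since wall clocks can skew between regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Last-write-wins register: the write with the highest (timestamp, region) survives."""

    value: str
    timestamp: int  # assumed: logical or hybrid clock value
    region: str     # tie-breaker so every replica picks the same winner

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.region))


def merge_gcounter(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """G-Counter (grow-only counter CRDT) merge: take the per-region maximum."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


# Concurrent writes in two regions with the same timestamp.
eu = LWWRegister("blue", timestamp=10, region="eu")
us = LWWRegister("green", timestamp=10, region="us")
winner = eu.merge(us)

# Each region increments only its own slot; the counter value is the sum of slots.
merged = merge_gcounter({"eu": 3, "us": 1}, {"eu": 2, "us": 4})
total = sum(merged.values())
```

Both merges are commutative and idempotent, so replicas converge regardless of delivery order. The difference is in what is lost: last-write-wins silently discards the losing write, while the counter preserves every increment.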

Regional sharding

Partitioning tenants or data domains by region reduces latency and cost and improves failure isolation.
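Regional sharding usually starts with a mapping from tenant to home region. The sketch below assumes hypothetical tenant IDs, region names, and an allow-list per tenant that stands in for data residency rules; a real system would load these from configuration or a directory service.

```python
# Hypothetical placement data: where each tenant lives and where its data may be served from.
HOME_REGION = {"tenant-a": "eu-central", "tenant-b": "us-east"}
ALLOWED_REGIONS = {
    "tenant-a": {"eu-central", "eu-west"},            # e.g. must stay inside the EU
    "tenant-b": {"us-east", "us-west", "eu-west"},
}


def region_for(tenant: str, healthy_regions: set[str]) -> str:
    """Return the region that should serve this tenant, honoring its residency allow-list."""
    home = HOME_REGION[tenant]
    if home in healthy_regions:
        return home
    # Home region is down: fall back only to regions the tenant's data may live in.
    candidates = sorted(ALLOWED_REGIONS[tenant] & healthy_regions)
    if not candidates:
        raise RuntimeError(f"no permitted healthy region for {tenant}")
    return candidates[0]
```

For example, with `eu-central` down and only `us-east` healthy, `region_for("tenant-a", {"us-east"})` raises rather than moving EU data to a US region. Failing closed is a deliberate choice in this sketch; whether it is the right one depends on the legal constraints in play.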

No dual writes without coordination

Write through an outbox/eventing pattern or an agreed replication layer. Otherwise the stores will drift out of sync and data will be lost.
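A minimal sketch of the outbox approach, using SQLite from the Python standard library as a stand-in for the service database. The table names, the `place_order` and `relay_once` functions, and the `publish` callback are assumptions for illustration; a real relay would publish to a message broker or replication stream.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: int) -> None:
    """Write the state change and its event in ONE local transaction."""
    with conn:  # commits on success, rolls back both inserts on error
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in insertion order; mark each only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the event stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

Because an event is marked as published only after `publish` returns, a crash between the two steps causes a redelivery rather than a loss. Delivery is therefore at-least-once, and consumers in other regions need to be idempotent.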

Global traffic routing

  • Geo DNS / latency-based routing for incoming global traffic.
  • Health-aware traffic steering with rapid exclusion of degraded regions.
  • Region affinity (sticky routing) to keep sessions and cache local.
  • Global API gateway/edge layer with explicit policy for fallback and partial outage.
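The first three routing ideas in the list above (latency-based selection, health-aware exclusion, and region affinity) can be combined in one decision function. This is a sketch with hypothetical region names and latency figures; managed DNS and edge products implement the same ideas with their own policies and probes.

```python
from typing import Mapping, Optional


def choose_region(
    latency_ms: Mapping[str, float],
    healthy: set[str],
    sticky: Optional[str] = None,
) -> str:
    """Pick the serving region for one client.

    latency_ms: measured client-to-region latency (assumed to be available).
    healthy:    regions currently passing health checks.
    sticky:     the region that served this client last, if any.
    """
    if sticky in healthy:
        return sticky  # region affinity: keep session and cache local
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)  # lowest-latency healthy region


latency = {"eu-west": 20.0, "us-east": 95.0, "ap-south": 180.0}
```

With every region healthy, `choose_region(latency, {"eu-west", "us-east", "ap-south"})` returns `"eu-west"`. If `eu-west` is excluded, a client that was sticky to it is moved to `"us-east"`, the next-best healthy option, so affinity never overrides health.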

DR, failover and operational readiness

Determine RTO/RPO for each critical service and database.

Exercise failover and failback regularly in game days, not just on paper.

Automate secondary-region promotion and the data integrity check that follows the switch.

Keep a runbook that states who initiates regional failover, when, and based on which signals.

Separate the dependency graph so that a regional failure does not cascade globally.

Without regular failover drills, a multi-region architecture often remains fault-tolerant in theory but not in practice.
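A game day is more useful when its outcome is compared against the stated objectives. The sketch below assumes three timestamps are recorded during a drill; the class and field names are hypothetical, and the achieved RPO is approximated as the gap between the last write replicated to the secondary region and the start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost writes


@dataclass(frozen=True)
class DrillResult:
    outage_started: datetime
    service_restored: datetime
    last_replicated_write: datetime  # newest write present in the secondary region


def evaluate(drill: DrillResult, objectives: Objectives) -> dict:
    """Compare what a failover drill achieved with the service's RTO/RPO."""
    achieved_rto = drill.service_restored - drill.outage_started
    achieved_rpo = max(drill.outage_started - drill.last_replicated_write, timedelta(0))
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= objectives.rto,
        "rpo_met": achieved_rpo <= objectives.rpo,
    }


t0 = datetime(2026, 1, 1, 12, 0, 0)
drill = DrillResult(
    outage_started=t0,
    service_restored=t0 + timedelta(minutes=12),
    last_replicated_write=t0 - timedelta(seconds=30),
)
report = evaluate(drill, Objectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=1)))
```

Tracking these numbers across drills shows whether automation is actually shortening recovery or whether the objectives only hold on paper.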
