Global systems do not begin with a pretty map of regions. They begin with an honest discussion about failure geography, latency, and the limits of consistency.
In real design work, the chapter shows how multi-region topology, routing policy, replication strategy, disaster recovery, and legal constraints have to form one system model rather than a set of independent regional choices.
In interviews and engineering discussions, it helps explain the cost of going global through write conflicts, operational complexity, infrastructure spend, and the heavier discipline such systems demand.
Practical value of this chapter
Design in practice
Select multi-region topology based on latency targets, legal constraints, and recovery objectives.
Decision quality
Align routing policy, replication strategy, and consistency model across regions.
Interview articulation
Show the decision flow: active-active vs active-passive, failover policy, and data residency boundaries.
Trade-off framing
Explain globalization costs: operational complexity, write conflicts, and infrastructure budget growth.
Context
Cloud Native Overview
The cloud-native foundation on which multi-region design is built.
Multi-region / Global Systems design is about building services that keep operating when a region fails and that deliver predictable latency to users around the world. The main engineering question is not only "how to deploy in many regions," but how to balance latency, availability, consistency, compliance, and cost.
Basic multi-region topology
When it fits
A simple starting point for most B2B/B2C systems with a single primary region.
Trade-off
Low operational complexity, but higher RTO/RPO and suboptimal latency for users outside the primary region.
Operational focus
Automate failover/failback and verify data integrity after promoting the secondary.
One region serves the main load; the second is kept on standby and activated when the primary degrades.
Decision framework
- Latency budget: what p95/p99 response time is acceptable in each region.
- Availability target: what degradation is acceptable when an entire region fails.
- Consistency model: where strict consistency is needed, and where eventual is sufficient.
- Data sovereignty: where data can be physically stored and processed.
- Cost profile: how much cross-region replication, egress, and backup capacity cost.
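The latency-budget item can be made concrete with a small check. The following is a minimal sketch, assuming latency samples per region are already collected; the function names, the region keys, and the nearest-rank percentile method are illustrative choices, not something the chapter prescribes.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_latency_budget(samples_by_region, budget_ms):
    """Compare measured p95/p99 per region against that region's budget.

    samples_by_region: dict of region name -> list of latencies in ms (hypothetical input shape)
    budget_ms: dict of region name -> {"p95": limit_ms, "p99": limit_ms}
    """
    report = {}
    for region, samples in samples_by_region.items():
        p95 = percentile(samples, 95)
        p99 = percentile(samples, 99)
        budget = budget_ms[region]
        report[region] = {
            "p95": p95,
            "p99": p99,
            "within_budget": p95 <= budget["p95"] and p99 <= budget["p99"],
        }
    return report
```

A region that misses its budget is a candidate for a closer replica or a regional shard; a region that passes comfortably may not justify the extra replication cost.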
Theory
CAP Theorem
When a network partition separates regions, the architecture must explicitly choose which to prioritize: consistency (CP) or availability (AP).
Data layer: replication and consistency
Single-writer + read replicas
A good baseline for OLTP with a clear consistency model and manageable failover.
Multi-primary with conflict resolution
Suitable for write-heavy global workloads, but requires a clear merge/last-write-wins/CRDT strategy.
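To illustrate two of the strategies named above, here is a minimal sketch of a last-write-wins register and a grow-only counter (a simple CRDT). The class and function names are hypothetical, and the timestamps are assumed to be comparable across regions, which real clocks do not guarantee without extra care.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: int  # assumed comparable across regions (logical or well-synchronized clock)
    region: str     # tie-breaker so every replica picks the same winner

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Compare (timestamp, region) so concurrent writes resolve deterministically.
        mine = (self.timestamp, self.region)
        theirs = (other.timestamp, other.region)
        return self if mine >= theirs else other


def merge_gcounter(a, b):
    """Merge two grow-only counters (dict of region -> count) by taking the per-region maximum."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}
```

Both merges are commutative and idempotent, so replicas converge regardless of the order in which they exchange state. Note that last-write-wins silently discards the losing write, which is acceptable for some fields (a profile name) and not for others (an account balance).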
Regional sharding
Partitioning tenants or data domains by region reduces latency and cost and improves failure isolation.
No dual writes without coordination
Write through an outbox/eventing mechanism or an agreed replication layer; otherwise regions drift out of sync and data is lost.
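The outbox approach can be sketched as follows. This is a minimal illustration using SQLite from the Python standard library; the `orders` and `outbox` tables, the event shape, and the `publish` callback are assumptions made for the example, not part of any specific product.

```python
import json
import sqlite3


def init_db(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, order_id):
    # One local transaction: the state change and its event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "placed"))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay_outbox(conn, publish):
    """Publish pending events in order; mark each one only after publish() returns."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending and is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

The application never writes to the remote region directly; a relay ships the outbox rows. Delivery is at-least-once (a crash between publishing and marking causes a resend), so consumers in the other region need to handle duplicates.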
Global traffic routing
- Geo DNS / latency-based routing for incoming global traffic.
- Health-aware traffic steering with rapid exclusion of degraded regions.
- Region affinity (sticky routing) to keep sessions and cache local.
- Global API gateway/edge layer with explicit policy for fallback and partial outage.
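The first three routing ideas above can be combined in a few lines. This is a minimal sketch, assuming per-client latency measurements and a set of healthy regions are available from elsewhere; `pick_region` and its parameters are hypothetical names, and real traffic steering usually happens in DNS or an edge layer rather than in application code.

```python
def pick_region(latency_ms, healthy, sticky=None):
    """Choose a serving region for one client.

    latency_ms: dict of region -> measured latency from this client, in ms
    healthy: set of regions currently passing health checks
    sticky: region the client's session is pinned to, if any
    """
    # Health-aware steering: degraded regions are excluded before anything else.
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    # Region affinity: keep the session where its cache and state already are.
    if sticky in candidates:
        return sticky
    # Latency-based routing, with the region name as a deterministic tie-breaker.
    return min(candidates, key=lambda region: (candidates[region], region))
```

Affinity is honored only while the pinned region is healthy; once it is excluded, the client falls back to the lowest-latency healthy region.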
DR, failover and operational readiness
Determine RTO/RPO for each critical service and database.
Exercise failover/failback regularly (game days), not just on paper.
Automate promotion of the secondary region and the data integrity check after switchover.
Keep a runbook: who, when and based on what signals initiates regional failover.
Separate the dependency graph so that a regional failure does not cascade globally.
Without regular failover drills, a multi-region architecture often remains fault-tolerant in theory but not in practice.
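One way to make a runbook's "based on what signals" explicit is to encode the decision as a small function that a game day can exercise. The sketch below is an assumption-heavy illustration: the signal names, the thresholds, and the three outcomes are hypothetical, and a real policy would weigh more inputs.

```python
from dataclasses import dataclass


@dataclass
class FailoverSignals:
    consecutive_failed_checks: int  # failed health checks against the primary region
    replica_lag_seconds: float      # how far the secondary is behind the primary


def decide_failover(signals, failure_threshold, rpo_seconds):
    """Return 'stay', 'promote_secondary', or 'escalate_to_operator'."""
    if signals.consecutive_failed_checks < failure_threshold:
        return "stay"
    # Promoting now loses at most replica_lag_seconds of writes; automate only within the RPO.
    if signals.replica_lag_seconds <= rpo_seconds:
        return "promote_secondary"
    # Lag exceeds the RPO: a person has to accept the data loss or wait for recovery.
    return "escalate_to_operator"
```

Keeping the rule in code means drills test the same logic that runs during a real incident, and changes to thresholds go through review like any other change.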
References
Related chapters
- Cloud Native Overview - Basic principles of cloud-native architectures and operational practices.
- Kubernetes Fundamentals - Mechanics of multi-cluster, rollout and workload stability.
- CAP Theorem - Fundamental limitations of distributed systems under network partitions.
- PACELC - The latency vs consistency trade-off, which applies not only during failures.
- Consensus Protocols - How nodes agree on state in a failover cluster.
- Google Global Network - The evolution of the global network and approaches to WAN as a strategic asset.
- Cost Optimization & FinOps - How to calculate the cost of multi-region solutions and long-term trade-offs.
- SRE and operational reliability - Incidents, SLOs and operation of complex production systems.
