System Design Space

Updated: February 21, 2026 at 11:59 PM

Multi-region / Global Systems

hard

Design of global cloud-native systems: multi-region topologies, consistency trade-offs, global traffic routing and disaster recovery.

Context

Cloud Native Overview

The cloud-native foundation on which multi-region design is built.


Multi-region / global system design is about building services that keep operating when a region fails and that give users around the world predictable latency. The main engineering question is not only "how to deploy in many regions" but how to balance latency, availability, consistency, compliance and cost.

Basic multi-region topology

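[Diagram: basic multi-region topology. Global users (web / mobile clients) send requests to a global traffic manager (geo DNS / health routing). The primary route leads to the primary region, where the app cluster serves all traffic and the DB primary is the single writer. The failover route leads to the secondary region, where a standby app is activated on failover and reads from a DB replica that is asynchronously replicated from the primary.]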

When it fits

A simple starting point for most B2B/B2C systems with a single main region.

Trade-off

Low operational complexity, but higher RTO/RPO and suboptimal latency for users outside the primary region.

Operational focus

Automating failover/failback and verifying data integrity after the secondary is promoted.

One region serves the main load; the second is kept on standby and activated when the primary degrades.
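
The active-passive switch described here can be sketched as a small state machine. This is a minimal illustration, not a production controller: the class name `FailoverController`, the `failure_threshold` of three consecutive failed checks, and the choice to keep failback manual are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks primary health and decides which region receives traffic."""

    failure_threshold: int = 3  # assumed: consecutive failed checks before failover
    active: str = "primary"
    consecutive_failures: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check result and return the region to route to."""
        if self.active == "secondary":
            # Failback is deliberately manual in this sketch: once the
            # secondary is promoted, traffic stays there until failback().
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active

    def failback(self) -> None:
        """Operator-initiated return to the primary after integrity checks."""
        self.active = "primary"
        self.consecutive_failures = 0


controller = FailoverController(failure_threshold=3)
checks = [True, False, False, True, False, False, False, True]
route = [controller.observe(ok) for ok in checks]
print(route)
```

Requiring several consecutive failures avoids flapping on a single missed health check, at the cost of a slower switch, which adds directly to RTO.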

Decision framework

  • Latency budget: what p95/p99 response time is acceptable in each region.
  • Availability target: what degradation is acceptable when an entire region fails.
  • Consistency model: where strict consistency is needed, and where eventual is sufficient.
  • Data sovereignty: where data can be physically stored and processed.
  • Cost profile: how much cross-region replication, egress and backup capacity cost.

Theory

CAP Theorem

When a network partition separates regions, the architecture must explicitly choose whether to prioritize consistency (CP) or availability (AP).


Data layer: replication and consistency

Single-writer + read replicas

A good baseline for OLTP with a clear consistency model and manageable failover.

Multi-primary with conflict resolution

Suitable for write-heavy global workloads, but requires an explicit conflict-resolution strategy: application-level merge, last-write-wins or CRDTs.
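
To make the difference between conflict-resolution strategies concrete, here is a minimal sketch of two of them: a last-write-wins register and a grow-only counter (a simple CRDT). The `Versioned` type, the region names and the sample values are illustrative assumptions; the timestamp is assumed to come from a logical or hybrid clock rather than raw wall time.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed: logical or hybrid clock, not raw wall time
    region: str     # used only as a deterministic tie-breaker


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write. Symmetric, so regions converge."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


def gcounter_merge(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Grow-only counter: per-region maximum, so no increment is lost."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


# Two regions updated the same record concurrently.
eu = Versioned("alice@eu.example", timestamp=10, region="eu")
us = Versioned("alice@us.example", timestamp=12, region="us")
winner = lww_merge(eu, us)

# Each region holds its own view of per-region increment counts.
eu_view = {"eu": 3, "us": 1}
us_view = {"eu": 2, "us": 4}
merged = gcounter_merge(eu_view, us_view)
total = sum(merged.values())
print(winner.value, total)
```

Last-write-wins silently discards the losing concurrent write, which is acceptable for something like a profile field but not for balances or counters; the counter merge keeps every region's contribution.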

Regional sharding

Partitioning tenants or data domains by region reduces latency and cost and improves failure isolation.
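
A minimal sketch of region-based routing for sharded tenants follows. The tenant directory, region names and endpoint URLs are hypothetical placeholders; in a real system the directory would live in a replicated metadata store rather than in code.

```python
from typing import Dict

# Hypothetical tenant directory: each tenant has exactly one home region.
HOME_REGION: Dict[str, str] = {"acme": "eu-west", "globex": "us-east"}

# Hypothetical per-region data endpoints.
ENDPOINTS: Dict[str, str] = {
    "eu-west": "https://eu-west.db.example.internal",
    "us-east": "https://us-east.db.example.internal",
}


def endpoint_for(tenant_id: str) -> str:
    """Resolve the data endpoint of the tenant's home region."""
    region = HOME_REGION.get(tenant_id)
    if region is None:
        # Failing loudly is safer than guessing a region, because a wrong
        # guess could place data outside its permitted jurisdiction.
        raise KeyError(f"unknown tenant: {tenant_id}")
    return ENDPOINTS[region]


def tenants_affected_by(region: str) -> list:
    """List tenants whose data lives in a failed region (the blast radius)."""
    return sorted(t for t, r in HOME_REGION.items() if r == region)


print(endpoint_for("acme"))
print(tenants_affected_by("us-east"))
```

Because each tenant is pinned to one region, an outage in `us-east` affects only the tenants listed by `tenants_affected_by("us-east")`.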

Do not dual-write without coordination

Write through an outbox/eventing pattern or an agreed replication layer; otherwise the stores drift out of sync and writes get lost.
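
The outbox approach can be sketched with Python's standard `sqlite3` module. This is a simplified illustration under stated assumptions: the `orders` and `outbox` tables, the `place_order` and `relay` functions and the event shape are invented for the example, and an in-memory list stands in for a real message broker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(order_id: int) -> None:
    """Write the business row and its event in one local transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay(publish) -> int:
    """Send unpublished events in order; mark each one after a successful send."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []  # stand-in for a message broker or cross-region replication stream
place_order(1)
place_order(2)
first_run = relay(sent.append)
second_run = relay(sent.append)
print(first_run, second_run, sent)
```

The order row and its event commit together or not at all, so the two can never diverge. Delivery is at-least-once: if the relay crashes between publishing and marking a row, the event is sent again, so consumers should be idempotent.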

Global traffic routing

  • Geo DNS / latency-based routing for incoming global traffic.
  • Health-aware traffic steering with rapid exclusion of degraded regions.
  • Region affinity (sticky routing) to keep sessions and cache local.
  • Global API gateway/edge layer with explicit policy for fallback and partial outage.
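
The routing rules above can be combined into one selection function. The sketch below is an assumption-laden illustration, not the behavior of any particular DNS or gateway product: `RegionStatus`, `pick_region` and the latency figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RegionStatus:
    healthy: bool
    latency_ms: float  # assumed: measured latency from the client's location


def pick_region(regions: Dict[str, RegionStatus], sticky: Optional[str] = None) -> str:
    """Choose a region: exclude unhealthy ones, honor affinity, then minimize latency."""
    healthy = {name: s for name, s in regions.items() if s.healthy}
    if not healthy:
        raise RuntimeError("no healthy region available")
    if sticky in healthy:
        # Region affinity: keep the session and cache local while it is safe.
        return sticky
    return min(healthy, key=lambda name: healthy[name].latency_ms)


regions = {
    "eu-west": RegionStatus(healthy=True, latency_ms=20),
    "us-east": RegionStatus(healthy=True, latency_ms=95),
    "ap-south": RegionStatus(healthy=False, latency_ms=10),
}
print(pick_region(regions))                     # lowest-latency healthy region
print(pick_region(regions, sticky="us-east"))   # affinity is honored
print(pick_region(regions, sticky="ap-south"))  # affinity dropped: region is degraded
```

Health takes precedence over affinity, and affinity over latency. A system that values latency over session locality might order the last two differently.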

DR, failover and operational readiness

Define RTO/RPO for each critical service and database.
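
One way to make these targets checkable is to compare them with measured values. The sketch below assumes asynchronous replication, where the replication lag at the moment of failure approximates the data-loss window, and treats failover time as the sum of its steps. `DrTarget`, `evaluate`, the service name and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DrTarget:
    service: str
    rto_s: float  # maximum acceptable time to restore service
    rpo_s: float  # maximum acceptable window of lost writes


def evaluate(target: DrTarget, replication_lag_s: float,
             failover_steps_s: Dict[str, float]) -> dict:
    """Compare measured lag and failover step durations with the targets."""
    estimated_rto = sum(failover_steps_s.values())
    return {
        "service": target.service,
        "estimated_rto_s": estimated_rto,
        "rto_ok": estimated_rto <= target.rto_s,
        # With async replication, unreplicated writes are lost on failover.
        "rpo_ok": replication_lag_s <= target.rpo_s,
    }


target = DrTarget(service="orders-db", rto_s=300, rpo_s=30)
report = evaluate(
    target,
    replication_lag_s=12,
    failover_steps_s={"detect": 60, "promote": 90, "reroute": 120},
)
print(report)
```

Feeding this check with durations measured during real failover drills, rather than estimates, is what makes the result trustworthy.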

Test failover and failback regularly in game days, not just on paper.

Automate promotion of the secondary region and the data integrity check after the switch.

Keep a runbook: who, when and based on what signals initiates regional failover.

Isolate dependencies by region so that a regional failure does not cascade globally.

Without regular failover drills, a multi-region architecture often remains fault-tolerant in theory but not in practice.


© 2026 Alexander Polomodov