Context
Cloud Native Overview
The cloud-native foundation on which multi-region design is built.
Multi-region (global) system design is about building services that keep operating when a region fails and that give users around the world predictable latency. The main engineering question is not only “how to deploy in many regions,” but how to balance latency, availability, consistency, compliance, and cost.
Basic multi-region topology
When it fits
A simple starting point for most B2B/B2C systems with a single primary region.
Trade-off
Low operational complexity, but higher RTO/RPO and suboptimal latency for users outside the primary region.
Operational focus
Automating failover/failback and verifying data integrity after the secondary is promoted.
One region serves the main load; the second is kept on standby and is activated when the primary degrades.
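The standby arrangement described above can be sketched as a small controller that promotes the standby only after several consecutive failed health probes. This is a minimal illustration, not a production failover mechanism: the class name, the region names, and the threshold of three probes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks primary health and decides when to promote the standby."""
    primary: str
    standby: str
    failure_threshold: int = 3      # assumed: consecutive failed probes before failover
    consecutive_failures: int = 0

    def record_probe(self, primary_healthy: bool) -> str:
        """Record one probe of the current primary; return the region that should serve traffic."""
        if primary_healthy:
            self.consecutive_failures = 0
            return self.primary
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Promote the standby by swapping roles. Failback stays a separate,
            # deliberate step rather than happening automatically.
            self.primary, self.standby = self.standby, self.primary
            self.consecutive_failures = 0
        return self.primary


# Hypothetical region names, used only for illustration.
controller = FailoverController(primary="eu-west", standby="us-east")
probes = [True, False, False, False, True]
active = [controller.record_probe(p) for p in probes]
print(active)  # ['eu-west', 'eu-west', 'eu-west', 'us-east', 'us-east']
```

Requiring several consecutive failures trades a slightly longer RTO for fewer spurious failovers. A real controller would also run the integrity check after promotion before admitting traffic.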
Decision framework
- Latency budget: what p95/p99 response time is acceptable in each region.
- Availability target: what degradation is acceptable when an entire region fails.
- Consistency model: where strict consistency is needed, and where eventual is sufficient.
- Data sovereignty: where data can be physically stored and processed.
- Cost profile: how much cross-region replication, egress, and backup capacity cost.
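The latency-budget item in the list above can be made concrete by checking measured samples against per-region p95/p99 limits. The sketch below uses the nearest-rank percentile definition. The function names, the budget values, and the sample data are invented for illustration and are not taken from any real system.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples_ms, budget):
    """Return pass/fail per percentile; budget maps a label to (percentile, limit in ms)."""
    return {
        label: percentile(samples_ms, p) <= limit_ms
        for label, (p, limit_ms) in budget.items()
    }


# Made-up latency samples (ms) for one region: mostly fast, with a slow tail.
samples = [50] * 90 + [180] * 8 + [450] * 2
budget = {"p95": (95, 200), "p99": (99, 400)}  # assumed targets
result = within_budget(samples, budget)
print(result)  # {'p95': True, 'p99': False}
```

Here the region meets its p95 target but misses p99, which is the kind of signal that would push a decision toward adding a closer region or an edge cache.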
Theory
CAP Theorem
When a network partition separates regions, the architecture must explicitly prioritize either consistency (CP) or availability (AP).
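The CP/AP choice can be illustrated with a toy replica that behaves differently once it is cut off from its peers. This is a deliberately simplified sketch: the class, the `partitioned` flag, and the `pending` queue are hypothetical names, and a real system would detect partitions through quorum or lease mechanisms rather than a boolean.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses a write it cannot coordinate."""


class RegionalReplica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False  # assumed to be set by some external failure detector
        self.data = {}
        self.pending = []         # writes accepted locally but not yet replicated

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: reject rather than risk divergent state.
                raise Unavailable("peers unreachable; refusing write")
            # Availability first: accept now, reconcile after the partition heals.
            self.pending.append((key, value))
        self.data[key] = value


ap = RegionalReplica("AP")
ap.partitioned = True
ap.write("cart:42", "3 items")  # accepted locally, queued for reconciliation

cp = RegionalReplica("CP")
cp.partitioned = True
try:
    cp.write("balance:42", 100)
except Unavailable as exc:
    print("CP replica rejected the write:", exc)
```

The choice is usually made per data domain rather than once for the whole system: a shopping cart may tolerate AP behavior while an account balance may not.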
Data layer: replication and consistency
Single-writer + read replicas
A good baseline for OLTP with a clear consistency model and manageable failover.
Multi-primary with conflict resolution
Suitable for write-heavy global workloads, but requires an explicit conflict-resolution strategy: application-level merge, last-write-wins, or CRDTs.
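Two of the conflict-resolution strategies named above can be shown in a few lines: a last-write-wins register and a grow-only counter (G-Counter) CRDT. This is a minimal sketch. The type and function names are illustrative, and the integer `timestamp` is assumed to come from a logical or hybrid clock, since wall clocks across regions can disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwwValue:
    value: str
    timestamp: int  # assumed: logical or hybrid clock tick
    region: str     # tie-breaker so every replica picks the same winner


def merge_lww(a, b):
    """Last-write-wins: keep the value with the higher (timestamp, region) pair."""
    return a if (a.timestamp, a.region) >= (b.timestamp, b.region) else b


def merge_gcounter(a, b):
    """G-Counter CRDT: each region only increments its own slot; merge is element-wise max."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


# Hypothetical concurrent writes from two regions.
eu = LwwValue("draft", timestamp=10, region="eu-west")
us = LwwValue("final", timestamp=12, region="us-east")
print(merge_lww(eu, us).value)  # final

likes_eu = {"eu-west": 3, "us-east": 1}
likes_us = {"eu-west": 2, "us-east": 4}
merged = merge_gcounter(likes_eu, likes_us)
print(sum(merged.values()))  # 7
```

Both merges give the same result regardless of argument order, which is what lets replicas converge. Note that last-write-wins silently discards the losing write, so it only suits data where that loss is acceptable.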
Regional sharding
Partitioning tenants or data domains by region reduces latency and cost and improves failure isolation.
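Regional sharding usually comes down to a lookup from tenant to home region, consulted before any data access. The sketch below assumes a static in-memory map. The tenant names, region names, and the default region are hypothetical, and a real deployment would typically keep this mapping in a replicated directory service.

```python
# Hypothetical tenant-to-region assignments.
TENANT_HOME_REGION = {
    "acme": "eu-central",
    "globex": "us-east",
}
DEFAULT_REGION = "us-east"  # assumed fallback for tenants without an explicit home


def home_region(tenant_id):
    """Return the region that owns this tenant's data."""
    return TENANT_HOME_REGION.get(tenant_id, DEFAULT_REGION)


def route(tenant_id, local_region):
    """Serve locally if this region owns the tenant, otherwise forward to the owner."""
    target = home_region(tenant_id)
    return "local" if target == local_region else f"forward:{target}"


print(route("acme", "eu-central"))  # local
print(route("acme", "us-east"))     # forward:eu-central
```

Because each tenant's data lives in exactly one region, an outage in `us-east` leaves `acme` unaffected, and the same map doubles as the enforcement point for data-residency rules.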
No dual writes without coordination
Write through an outbox/eventing pipeline or an agreed replication layer; otherwise the stores drift out of sync and data is lost.
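One common way to avoid uncoordinated dual writes is the transactional outbox: the business row and the event describing it are committed in a single local transaction, and a separate relay publishes the events afterwards. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely for illustration. The table names, the event shape, and the `publish` callback are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id, total_cents):
    """Write the business row and its event in one local transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        event = {"type": "order_created", "id": order_id, "total_cents": total_cents}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay(publish):
    """Publish pending events in order; mark each one only after publish returns."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # at-least-once delivery: consumers must be idempotent
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent = []  # stands in for a cross-region message bus
create_order("o-1", 1999)
print(relay(sent.append), sent)
```

If the relay crashes between publishing and marking, the event is sent again on the next run. That is why the consuming region has to deduplicate, for example by event `seq` or order id.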
Global traffic routing
- Geo DNS / latency-based routing for incoming global traffic.
- Health-aware traffic steering with rapid exclusion of degraded regions.
- Region affinity (sticky routing) to keep sessions and cache local.
- Global API gateway/edge layer with explicit policy for fallback and partial outage.
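The routing rules in the list above can be combined into one selection function: keep the user's sticky region while it is healthy, otherwise fall back to the lowest-latency healthy region. This is a simplified sketch of the decision only. The function name, the region names, and the latency figures are illustrative, and real traffic steering happens in DNS or at the edge rather than in application code.

```python
def pick_region(latency_ms, healthy, sticky=None):
    """Prefer the sticky region if healthy, else the lowest-latency healthy region."""
    if sticky in healthy:
        return sticky  # region affinity keeps sessions and caches local
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latency measurements from one client's vantage point.
latency = {"eu-west": 40, "us-east": 110, "ap-south": 180}

print(pick_region(latency, healthy={"eu-west", "us-east"}))                     # eu-west
print(pick_region(latency, healthy={"us-east", "ap-south"}, sticky="eu-west"))  # us-east
```

In the second call the client's sticky region has been excluded as unhealthy, so affinity is dropped and the next-closest region takes over.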
DR, failover and operational readiness
Determine RTO/RPO for each critical service and database.
Exercise failover/failback regularly (game days), not just on paper.
Automate promotion of the secondary region and the data integrity check after switchover.
Keep a runbook: who, when and based on what signals initiates regional failover.
Separate the dependency graph so that a regional failure does not cascade globally.
Without regular failover drills, a multi-region architecture often stays fault-tolerant in theory but not in practice.
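A game day is only useful if its outcome is compared against the stated targets. The sketch below records one drill and checks it against RTO/RPO. The field names, the service name, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    service: str
    seconds_to_recover: float    # from failure injection until healthy traffic resumed
    seconds_of_data_lost: float  # replication lag at the moment of promotion


def evaluate(result, rto_s, rpo_s):
    """Compare a measured drill against the service's RTO/RPO targets (in seconds)."""
    return {
        "rto_met": result.seconds_to_recover <= rto_s,
        "rpo_met": result.seconds_of_data_lost <= rpo_s,
    }


# Hypothetical drill: recovery was fast enough, but too much data was lost.
drill = DrillResult("orders-db", seconds_to_recover=240, seconds_of_data_lost=12)
report = evaluate(drill, rto_s=300, rpo_s=5)
print(report)  # {'rto_met': True, 'rpo_met': False}
```

A missed RPO like this one points at replication lag rather than at the failover procedure itself, which is the kind of finding that rarely surfaces without an actual drill.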
