System Design Space

Updated: March 24, 2026 at 5:36 PM

Replication and sharding

Difficulty: medium

How to design replication and sharding: topologies, read/write path, rebalancing, hot shards, consistency and operational trade-offs.

Replication and sharding are easy to confuse precisely because both sound like 'data scaling', yet the cost of getting each one wrong is very different.

The chapter separates read scale and availability from write scale, showing when primary-replica is enough, when a multi-primary or leaderless topology becomes necessary, how shard keys should be chosen, and why re-sharding is almost always an expensive project.

In design reviews, this is especially useful because it helps you discuss lag, failover, hot shards, rebalancing cost, and evolving access patterns as parts of one decision rather than as a pile of isolated infrastructure tricks.

Practical value of this chapter

Data boundary

Choose partition strategy from access patterns to reduce hot shards and expensive cross-shard operations.

Replication mode

Select sync vs async replication from RPO/RTO goals and read-your-writes needs on critical paths.

Rebalance without pain

Treat re-sharding as a standard operation: dual writes, backfill, and controlled traffic cutover.

Interview rigor

Explain where the design reaches limits and how you would evolve the scheme for 10x growth.
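The 'dual writes, backfill, and controlled traffic cutover' sequence from the rebalancing card can be sketched in a few lines. The `Store` and `MigratingRouter` names below are hypothetical, and a real migration would throttle the backfill and verify checksums out of band:

```python
# Dual-write migration sketch; Store and MigratingRouter are hypothetical names.

class Store:
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class MigratingRouter:
    """Phase 1: dual-write, read old. Phase 2 (after backfill): read new."""
    def __init__(self, old, new):
        self.old, self.new = old, new
        self.phase = "dual_write"

    def write(self, key, value):
        self.old.put(key, value)  # old store stays authoritative
        self.new.put(key, value)  # shadow write keeps the new store warm

    def read(self, key):
        store = self.old if self.phase == "dual_write" else self.new
        return store.get(key)

    def backfill(self):
        # Copy rows missing from the new store; real pipelines throttle this.
        for key, value in self.old.data.items():
            if self.new.get(key) is None:
                self.new.put(key, value)

    def cutover(self):
        self.backfill()
        assert self.old.data == self.new.data  # stand-in for checksum verification
        self.phase = "read_new"  # traffic now served by the new store
```

The point of the phase flag is that every step is reversible until cutover: while dual-writing, dropping the new store loses nothing.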

Context

Principles of Designing Scalable Systems

This chapter deep-dives into one of the core scaling topics: replication + sharding.

Open chapter

Replication and sharding define data scale, availability, and cost profile. Replication improves resilience and read scaling; sharding increases write throughput and total capacity. Mistakes here usually surface not at launch, but during traffic growth and product complexity expansion.

Replication Models

Primary-Replica

One write leader with multiple read replicas. Works well for read-heavy workloads and straightforward operations.

Replica lag and read/write asymmetry; failover must be automated and tested.
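One common way to handle lag on critical paths is read-your-writes routing: pin a user's reads to the primary for a short window after their own write, and let everyone else read from the least-lagged replica. A minimal sketch; the `ReadYourWritesRouter` name and the 500 ms stickiness window are illustrative assumptions, not a specific product's API:

```python
# Read-your-writes routing sketch; names below are illustrative.

class Replica:
    def __init__(self, lag_ms):
        self.lag_ms = lag_ms  # observed replication lag

class ReadYourWritesRouter:
    """Pin a user's reads to the primary briefly after their own writes."""
    def __init__(self, primary, replicas, stickiness_ms=500):
        self.primary = primary
        self.replicas = replicas
        self.stickiness_ms = stickiness_ms
        self.last_write_ms = {}  # user id -> time of that user's last write

    def on_write(self, user, now_ms):
        self.last_write_ms[user] = now_ms

    def pick_read_target(self, user, now_ms):
        last = self.last_write_ms.get(user)
        if last is not None and now_ms - last < self.stickiness_ms:
            return self.primary  # a replica may not have this user's write yet
        return min(self.replicas, key=lambda r: r.lag_ms)  # least-lagged replica
```

The stickiness window should be chosen from the observed lag distribution, not a constant; the per-user map is what makes the guarantee cheap compared to sending all reads to the primary.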

Multi-Primary

Writes are accepted in multiple regions/nodes, reducing latency for geo-distributed writes.

Conflict resolution and merge logic are hard; significantly higher platform complexity.
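A common building block for that conflict detection is the version vector: each primary increments its own counter on a write, and two versions where neither dominates the other are concurrent and must be merged by the application. A minimal sketch, with the `compare` helper as an illustrative name:

```python
# Version-vector comparison sketch for multi-primary conflict detection.

def compare(vc_a, vc_b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two version vectors,
    given as dicts mapping node id -> counter (missing entries count as 0)."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # b has seen everything a has, and more
    if b_le_a:
        return "after"
    return "concurrent"       # neither dominates: the application must merge
```

Last-write-wins avoids writing the merge logic but silently discards one side of every concurrent pair, which is exactly the platform-complexity trade-off this model forces you to make explicitly.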

Leaderless

Quorum-based reads/writes with high availability for distributed KV/NoSQL scenarios.

Harder consistency reasoning, read repair, hinted handoff, and tombstone behavior.
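The core quorum rule is that read and write quorums must overlap: with N replicas, choosing W write acks and R read replicas such that R + W > N guarantees every read touches at least one node holding the latest write. A toy sketch of that rule, with last-write-wins by timestamp and node failures modeled by an `up_nodes` list (all names here are illustrative):

```python
# Toy leaderless store; quorum sizes are chosen so read/write quorums overlap.

class LeaderlessStore:
    def __init__(self, n, w, r):
        assert w + r > n, "R + W > N: every read quorum meets every write quorum"
        self.w, self.r = w, r
        self.nodes = [dict() for _ in range(n)]  # each node stores key -> (ts, value)

    def write(self, key, value, ts, up_nodes):
        acks = 0
        for i in up_nodes:                # only reachable nodes accept the write
            self.nodes[i][key] = (ts, value)
            acks += 1
        return acks >= self.w             # success only with a write quorum

    def read(self, key, up_nodes):
        versions = [self.nodes[i][key]
                    for i in up_nodes[: self.r] if key in self.nodes[i]]
        return max(versions)[1] if versions else None  # newest timestamp wins
```

Real systems add read repair (writing the newest version back to stale quorum members) and hinted handoff on top of exactly this overlap property.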

Replication Visualization

[Interactive replication simulator: a queue of read/write requests (e.g. user:42 from web, order:981 from mobile) flows through a replication router to a primary (write leader) and two read replicas with simulated lag of 8–11 ms; per-node read/write counters show how each topology routes traffic.]

Sharding Models

  • Hash-based sharding: even distribution, but range queries are harder.
  • Range-based sharding: efficient for range queries, but hot-range risk is high.
  • Directory-based sharding: flexible data placement, requires a reliable metadata service.
  • Geo/tenant sharding: strong sovereignty and tenant isolation, but cross-tenant workflows are harder.
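The trade-off between the first two models shows up directly in the routing function. A minimal sketch, assuming three shards and made-up range boundaries:

```python
# Hash vs range shard routing; shard count and boundaries below are made up.
import hashlib

NUM_SHARDS = 3

def hash_shard(key: str) -> int:
    """Hash-based: even spread, but adjacent keys scatter across shards."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

RANGE_BOUNDS = [1000, 5000]  # ids < 1000 -> shard 0, < 5000 -> shard 1, else shard 2

def range_shard(entity_id: int) -> int:
    """Range-based: range scans stay on one shard, but a hot id range overloads it."""
    for shard, bound in enumerate(RANGE_BOUNDS):
        if entity_id < bound:
            return shard
    return len(RANGE_BOUNDS)
```

Note the asymmetry: `hash_shard` spreads sequential ids evenly, so a range query must fan out to all shards; `range_shard` keeps a range on one shard, so monotonically growing ids hammer the last shard.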

Sharding Visualization

[Interactive sharding simulator: tenant-tagged keys (e.g. acme · user:42 · id=1200, globex · order:981 · id=3520) are hashed onto a ring of three shards; per-shard request and hotness counters show how each sharding model distributes load.]

Related

Performance Engineering

Hot shards and migration windows directly affect p95/p99 latency.


Shard Rebalancing and Hot-Shard Control

Use consistent hashing and virtual nodes to reduce migration volume when node count changes.
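A minimal consistent-hash ring with virtual nodes might look like the following sketch; the `ConsistentHashRing` name and 64 vnodes per node are illustrative assumptions, and production rings also handle replication and weighted nodes:

```python
# Consistent-hash ring with virtual nodes; names and vnode count are illustrative.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash point, node)
        for node in nodes:
            self.add(node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node owns many small arcs, smoothing the distribution.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def locate(self, key):
        # A key belongs to the first vnode clockwise from its hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]
```

The property that bounds migration volume: removing a node only remaps the keys that lived on it, while every other key stays where it was.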

Mitigate hot shards with adaptive partitioning, write buffering, workload-key routing, and temporal split.

Run backfill and migrations through throttled pipelines with checksum verification.

Every migration operation must support pause/resume, rollback plan, and observable progress.

Cloud Native

Cost Optimization & FinOps

Replication topology and rebalancing directly affect cloud cost and unit economics.


Practical Checklist

RPO/RTO are defined per data class and failure scenario.

Operations requiring strong consistency vs eventual consistency are explicitly documented.

There is a shard-key evolution plan for product growth and changing access patterns.

Replication, failover, and rebalancing are validated regularly through game days.

Cross-zone/cross-region replication cost is accounted for in the FinOps model.

Common anti-pattern: picking a shard key for the current feature set without an evolution plan.
