A key-value database looks simple only at the API level. Underneath, it quickly becomes a discussion about partitioning, replication, quorum behavior, and background storage work.
The chapter breaks down write path, read path, storage-engine choice, membership changes, and recovery after node loss.
For interviews and engineering discussions, this case is useful because it quickly reveals whether you understand where interface simplicity ends and the real cost of reliable storage begins.
Quorum
Choosing R and W shapes not only consistency guarantees, but also the latency profile of reads and writes.
Compaction
Background storage work is not secondary: it directly affects p95/p99 behavior and the real cost of writes.
Hot Partitions
Key skew can turn a single shard into the bottleneck, which is why key design matters as much as replica count.
Rebalancing
Adding nodes and moving partitions should not create long availability drops or uncontrolled traffic shifts.
Key-value databases look simple only at the API level. Underneath, they are distributed storage systems where latency, consistency, data durability, and operating cost pull in different directions. This case is useful because it forces you to reason about horizontal scaling, storage-engine choice, and failure recovery without hiding behind product complexity.
Source
Acing the System Design Interview
Chapter 8: designing a key-value database with emphasis on partitioning, quorum, and fault tolerance.
Examples of key-value systems
- Amazon DynamoDB: managed KV storage with partitioning, replication, and global tables.
- Redis: an in-memory system built for very low latency and rich data structures.
- Cassandra: a distributed wide-column database that behaves like a KV store in many workloads.
- Riak: an AP-oriented architecture with vector clocks and explicit repair behavior.
- etcd/Consul: strongly consistent KV stores used for service discovery and configuration.
Functional requirements
Core API
PUT /kv/:key— write a value for a keyGET /kv/:key— read a value by keyDELETE /kv/:key— delete a keyBATCH /kv— execute batched operations
Extended capabilities
- TTL support and background cleanup for expired keys
- Compare-and-swap for conditional updates
- Idempotent write handling for safe retries
- Operational APIs for health checks, metrics, replica lag, and repair backlog
Non-functional requirements
| Requirement | Target | Why it matters |
|---|---|---|
| Read latency (p95) | < 20ms | The store often sits on a hot serving path for both product and infrastructure traffic. |
| Write latency (p95) | < 40ms | The system must absorb a stable stream of updates even during bursts. |
| Availability | 99.99% | For many services this is a foundational dependency, not an optional subsystem. |
| Scalability | Near-linear growth by partition count | Traffic and data growth should not force a full redesign. |
| Data durability | Survive node and zone loss | Losing configuration, sessions, or critical shared state is unacceptable. |
High-level architecture
Theory
Replication and sharding
Practical framing for data partitioning, rebalancing, and consistency choices in distributed storage.
At a high level, the platform separates write path, read path, hash-based routing, replicated shard groups, and background maintenance jobs. That separation keeps hot request handling away from heavier repair and rebalancing work.
Architecture Map
partitioning + replication + quorumThe map separates request handling, shard groups, and background maintenance and recovery processes.
Data Model Map
Logical record structure and its physical placement inside a distributed KV cluster.
Logical Record
key
user:123:session
value
opaque blob / json / bytes
metadata
Physical Placement
partitioning
hash(key) -> partition_id: 17
replica set
A-leader, A-r1, A-r2
lifecycle
active -> ttl-expired -> background sweep
Identity
The key defines storage placement and request routing inside the cluster.
Consistency
The version field supports safe CAS updates and conflict handling.
Reliability
Checksums and replicas help detect and recover corrupted data.
Read and write paths through the components
The interactive view shows how a request moves from the client into a shard group and back: writes are persisted through WAL and replica acknowledgements, while reads involve source selection, version checks, and optional repair of lagging replicas.
Key-value read/write path explorer
Interactive walkthrough of how a request flows through the coordinator, WAL, and replica group.
Write path
- The key maps into a shard group through consistent hashing, which enables horizontal growth.
- WAL protects durability between request acceptance and storage-engine application.
- W defines the latency versus durability trade-off for writes.
- Idempotency and CAS prevent duplicates and lost updates during retries.
Storage engine: B-Tree vs LSM-Tree
The storage-engine choice shapes the whole system profile. Some workloads care more about fast point lookups and range reads, while others are dominated by write throughput, compaction cost, and SSD efficiency.
B-Tree vs LSM tree: choosing a storage structure
B-Tree architecture
✓ Advantages
- Fast reads: O(log N)
- Efficient range queries
- In-place updates
✗ Drawbacks
- Write amplification
- Random I/O on writes
Used in:
Consistency and fault tolerance
Go deeper
Designing Data-Intensive Applications, 2nd Edition
Consistency, replica lag, anti-entropy, quorum behavior, and CAP/PACELC trade-offs.
There is no single consistency model that fits every KV workload. You need to choose a consistency mode, define where eventual consistency is acceptable, and design explicit failure handling around replica loss and topology changes.
Quorum reads and writes
With replica factor N, you choose read quorum R and write quorum W. The familiar rule for strong tunable consistency is:
R + W > N
- W↑, R↓ — writes become more expensive, reads become cheaper
- W↓, R↑ — writes get faster, but reads need more replica coordination
- R=1, W=1 — lowest latency, but also the weakest freshness guarantees
Repair and maintenance
Background jobs handle compaction, anti-entropy, hinted handoff, and rebalancing after node loss or cluster growth. Those paths matter just as much as the happy-path API.
- Hinted handoff: temporarily stores writes for an unavailable replica
- Read repair: converges stale copies during read traffic
- Merkle trees: compare replica state efficiently without transferring full shards
- Rebalancing: moves partitions while keeping the cluster available
Risks and common mistakes
- Hot partitions: a bad key design creates load skew and turns a single shard into the cluster bottleneck.
- Latency spikes: compaction, repair jobs, and disk contention can easily inflate p95 and p99 latency.
- Stale reads: under weaker consistency settings, clients may observe outdated state.
- Blind retries: without idempotency and versioning, correctness falls apart quickly.
- Premature complexity: teams often overdesign consistency or choose the wrong storage engine before validating the real workload profile.
What to cover in an interview
A strong answer makes the trade-off between latency, consistency, and operating cost explicit for different classes of data instead of pretending one mode solves everything.
- Which consistency mode is required for each data class and why.
- How the system behaves during node loss, zone loss, or temporary replica unavailability.
- How key design, value-size limits, and hot-partition mitigation are handled in practice.
- Which SLOs and metrics matter most: p95/p99, error rate, replica lag, compaction backlog, and rebalance duration.
Related chapters
- Replication and sharding - Core partitioning, rebalancing, and load-distribution patterns behind horizontally scaled KV stores.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - Foundational model for consistency, replication, and failure behavior in distributed storage.
- Redis: in-memory database and architecture - A practical low-latency storage example with explicit operational trade-offs.
- Cassandra: distributed wide-column database - A distributed KV-style approach with tunable consistency and strong write scalability.
- Acing the System Design Interview (short summary) - Interview framework and step-by-step walkthrough for the key-value database design case.
- System design case studies overview - Cross-case context for comparing KV storage with adjacent infrastructure and product domains.
