Key-Value Database — System Design Space

A key-value database looks simple only at the API level. Underneath, it quickly becomes a discussion about partitioning, replication, quorum behavior, and background storage work.

The chapter breaks down the write path, the read path, storage-engine choice, membership changes, and recovery after node loss.

For interviews and engineering discussions, this case is useful because it quickly reveals whether you understand where interface simplicity ends and the real cost of reliable storage begins.

Quorum

Choosing R and W shapes not only consistency guarantees, but also the latency profile of reads and writes.

Compaction

Background storage work is not secondary: it directly affects p95/p99 behavior and the real cost of writes.

Hot Partitions

Key skew can turn a single shard into the bottleneck, which is why key design matters as much as replica count.

Rebalancing

Adding nodes and moving partitions should not create long availability drops or uncontrolled traffic shifts.

Key-value databases look simple only at the API level. Underneath sits a distributed storage system where latency, consistency, data durability, and operating cost cannot all be turned up at once: pushing one of them harder is almost always paid for by another. You reach for this kind of store when you need low latency, simple key access, and horizontal growth — without rewriting the architecture at every traffic jump.

Source

Acing the System Design Interview

Chapter 8: designing a key-value database with emphasis on partitioning, quorum, and fault tolerance.


Examples of key-value systems

Amazon DynamoDB: managed KV storage with partitioning, replication, and global tables.
Redis: an in-memory system built for very low latency and rich data structures.
Cassandra: a distributed wide-column database that behaves like a KV store in many workloads.
Riak: an AP-oriented architecture with vector clocks and explicit repair behavior.
etcd/Consul: strongly consistent KV stores used for service discovery and configuration.

Functional requirements

Core API

PUT /kv/:key — write a value for a key
GET /kv/:key — read a value by key
DELETE /kv/:key — delete a key
BATCH /kv — execute batched operations

Extended capabilities

TTL support and background cleanup for expired keys
Compare-and-swap for conditional updates
Idempotent write handling for safe retries
Operational APIs for health checks, metrics, replica lag, and repair backlog
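
The conditional-update and TTL capabilities can be sketched on a single node. This is a minimal in-memory illustration, not a distributed implementation: the `TinyKV` class, its method names, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    value: bytes
    version: int
    expires_at: Optional[float]  # absolute clock time in seconds; None means no TTL


class TinyKV:
    """Hypothetical single-node store illustrating CAS and TTL semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Entry] = {}
        self._clock = clock  # injectable so expiry can be exercised deterministically

    def get(self, key: str) -> Optional[Entry]:
        entry = self._data.get(key)
        if entry is None:
            return None
        if entry.expires_at is not None and self._clock() >= entry.expires_at:
            return None  # expired entries are invisible even before the sweep runs
        return entry

    def put(self, key: str, value: bytes, ttl: Optional[float] = None) -> int:
        previous = self._data.get(key)
        version = (previous.version if previous else 0) + 1
        expires_at = self._clock() + ttl if ttl is not None else None
        self._data[key] = Entry(value, version, expires_at)
        return version

    def compare_and_swap(self, key: str, expected_version: int, value: bytes) -> bool:
        """Write only if the caller saw the latest version; otherwise reject."""
        current = self.get(key)
        if current is None or current.version != expected_version:
            return False
        self._data[key] = Entry(value, current.version + 1, current.expires_at)
        return True

    def sweep(self) -> int:
        """Background cleanup: physically remove expired keys and return the count."""
        now = self._clock()
        dead = [k for k, e in self._data.items()
                if e.expires_at is not None and now >= e.expires_at]
        for k in dead:
            del self._data[k]
        return len(dead)
```

A client that loses a CAS race re-reads the entry, reapplies its change on top of the new version, and tries again, which is what prevents lost updates without a lock.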

Non-functional requirements

Requirement	Target	Why it matters
Read latency (p95)	< 20ms	The store often sits on a hot serving path for both product and infrastructure traffic.
Write latency (p95)	< 40ms	The system must absorb a stable stream of updates even during bursts.
Availability	99.99%	For many services this is a foundational dependency, not an optional subsystem.
Scalability	Near-linear growth by partition count	Traffic and data growth should not force a full redesign.
Data durability	Survive node and zone loss	Losing configuration, sessions, or critical shared state is unacceptable.
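
To make the availability target concrete, it helps to convert the percentage into a downtime budget. The helper below is a small illustrative calculation; the function name is an assumption, and it ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)
```

For the 99.99% target this comes to roughly 52.6 minutes per year, so a single slow failover or a badly throttled rebalance can consume a large share of the annual budget.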

High-level architecture

Theory

Replication and sharding

Practical framing for data partitioning, rebalancing, and consistency choices in distributed storage.


At a high level, the platform separates write path, read path, hash-based routing, replicated shard groups, and background maintenance jobs. That separation keeps hot request handling away from heavier repair and rebalancing work, so background jobs do not eat the latency budget on the user-facing path.

Architecture Map: partitioning + replication + quorum

Request Plane: Client (service or API); Router / Coordinator (route + quorum); Consistent Hash Ring (partition map)

Shard Groups: Shard Group A, Shard Group B, Shard Group C (each: leader + replicas)

Storage Maintenance Plane: WAL (durable write log); Compaction (SST merge and cleanup); Repair / Rebalance (anti-entropy jobs)

The map separates request handling, shard groups, and background maintenance and recovery processes.
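
The hash-ring routing step in the request plane can be sketched as follows. This is a minimal consistent-hashing example under stated assumptions: the `HashRing` class, the use of SHA-256 for token placement, and the 64 virtual nodes per shard group are illustrative choices, not details taken from the design above.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical ring mapping keys to shard groups via virtual nodes."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []        # sorted token positions
        self._owner: Dict[int, str] = {}  # token -> owning node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            token = _hash(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._ring, token)

    def remove_node(self, node: str) -> None:
        self._ring = [t for t in self._ring if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first token clockwise from hash(key)."""
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]
```

The property that matters for rebalancing is that adding a node only moves keys onto the new node; keys never shuffle between the existing ones, which keeps data movement proportional to the capacity added.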

Data Model Map

Logical record structure and its physical placement inside a distributed KV cluster.

Logical Record

key

user:123:session

value

opaque blob / json / bytes

metadata

version: 42, ttl: 24h, checksum

Physical Placement

partitioning

hash(key) -> partition_id: 17

replica set

A-leader, A-r1, A-r2

lifecycle

active -> ttl-expired -> background sweep

Identity

The key defines storage placement and request routing inside the cluster.

Consistency

The version field supports safe CAS updates and conflict handling.

Reliability

Checksums and replicas help detect and recover corrupted data.
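
The logical record and its checksum can be modeled in a few lines. This is a sketch only: the `Record` field names and the choice of CRC32 are assumptions for illustration, and a production store might use a stronger or differently scoped checksum.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: str
    value: bytes
    version: int
    ttl_seconds: Optional[int]
    checksum: int  # CRC32 of the value, stored alongside it


def make_record(key: str, value: bytes, version: int,
                ttl_seconds: Optional[int] = None) -> Record:
    return Record(key, value, version, ttl_seconds, zlib.crc32(value))


def is_intact(record: Record) -> bool:
    """Recompute the checksum on read; a mismatch means the stored bytes changed."""
    return zlib.crc32(record.value) == record.checksum
```

When `is_intact` fails on one replica, the copy can be discarded and refetched from a healthy replica, which is how checksums and replication work together.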

Read and write paths through the components

The interactive view shows how a request moves from the client into a shard group and back: writes are persisted through WAL and replica acknowledgements, while reads involve source selection, version checks, and optional repair of lagging replicas.

Key-value read/write path explorer

Interactive walkthrough of how a request flows through the coordinator, WAL, and replica group.

Write Request (PUT / DELETE / CAS) -> Coordinator (hash + route) -> Shard Leader (WAL append) -> Replica ACKs (quorum W) -> Write Response (version + metadata)

Write path: the coordinator routes the key to the right shard group, persists the update via the WAL, and waits for quorum acknowledgements.

Write path

The key maps into a shard group through consistent hashing, which enables horizontal growth.
WAL protects durability between request acceptance and storage-engine application.
W defines the latency versus durability trade-off for writes.
Idempotency and CAS prevent duplicates and lost updates during retries.
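
The write-path steps above can be sketched as a small in-process simulation. It is deliberately simplified: replicas are plain objects called synchronously, the leader is not distinguished from followers, and the `Replica` and `ShardGroup` names plus the `request_id` deduplication table are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Replica:
    name: str
    up: bool = True
    wal: List[Tuple[str, bytes, int]] = field(default_factory=list)  # append-only log
    table: Dict[str, Tuple[bytes, int]] = field(default_factory=dict)

    def apply(self, key: str, value: bytes, version: int) -> bool:
        if not self.up:
            return False                        # a down replica sends no ACK
        self.wal.append((key, value, version))  # 1. log first, for durability
        self.table[key] = (value, version)      # 2. then apply to the storage engine
        return True


class ShardGroup:
    def __init__(self, replicas: List[Replica], w: int) -> None:
        self.replicas = replicas
        self.w = w                         # write quorum
        self._versions: Dict[str, int] = {}
        self._seen: Dict[str, int] = {}    # request_id -> version, for safe retries

    def put(self, request_id: str, key: str, value: bytes) -> int:
        if request_id in self._seen:       # retry of an already acknowledged write
            return self._seen[request_id]
        version = self._versions.get(key, 0) + 1
        acks = sum(r.apply(key, value, version) for r in self.replicas)
        if acks < self.w:
            # Replicas that did apply keep the write; a real system needs repair here.
            raise RuntimeError(f"quorum not reached: {acks}/{self.w} ACKs")
        self._versions[key] = version
        self._seen[request_id] = version
        return version
```

With three replicas and W=2, the group keeps accepting writes with one replica down and starts rejecting them when a second one fails, which is the latency-versus-durability dial the list describes.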

Storage engine: B-Tree vs LSM-Tree

The storage engine sets the profile of the whole system, and you have to pick it against a concrete workload. At one end, fast point lookups and range reads matter most; at the other, what wins is write throughput — paid back later in compaction cost and SSD wear.

B-Tree vs LSM tree: choosing a storage structure

B-Tree architecture

[10 | 20 | 30]

[3|5|8]

[12|15|18]

[22|25|28]

Leaves contain pointers to data

✓ Advantages

Fast reads: O(log N)
Efficient range queries
In-place updates

✗ Drawbacks

Write amplification
Random I/O on writes

Used in:

PostgreSQL, MySQL InnoDB, Oracle, SQL Server, SQLite
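
For contrast with the in-place updates of a B-Tree, the write-optimized LSM approach can be sketched in a few lines. This is a toy model, not a real engine: plain dictionaries stand in for on-disk sorted files, and the `TinyLSM` class and its `memtable_limit` parameter are assumptions for illustration.

```python
from typing import Dict, List, Optional

TOMBSTONE = None  # deletions are recorded as markers, not removed in place


class TinyLSM:
    """Hypothetical LSM sketch: memtable, immutable sorted runs, full compaction."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, Optional[bytes]] = {}
        self.sstables: List[Dict[str, Optional[bytes]]] = []  # oldest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: Optional[bytes]) -> None:
        self.memtable[key] = value            # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key: str) -> None:
        self.put(key, TOMBSTONE)

    def flush(self) -> None:
        if self.memtable:
            # A flushed run is sorted by key and never modified again.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key: str) -> Optional[bytes]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            if key in table:
                return table[key]
        return None

    def compact(self) -> None:
        """Merge all runs, keep the newest value per key, drop tombstones."""
        merged: Dict[str, Optional[bytes]] = {}
        for table in self.sstables:            # oldest to newest, so newer overwrite
            merged.update(table)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.sstables = [live] if live else []
```

Writes stay cheap because they never rewrite existing runs, but reads may have to check several runs until `compact` merges them, and that merge is the deferred cost that shows up in tail latency and SSD wear.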

Consistency and fault tolerance

Go deeper

Designing Data-Intensive Applications, 2nd Edition

Consistency, replica lag, anti-entropy, quorum behavior, and CAP/PACELC trade-offs.


No single consistency model fits every KV workload, and this is a call to make before the first incident, not during it. First you pick a consistency mode per class of data, then mark where eventual consistency is acceptable, and only after that design explicit failure handling around replica loss and topology changes — otherwise a lost replica catches the system without a plan.

Quorum reads and writes

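With replication factor N, you choose a read quorum R and a write quorum W. The familiar rule for tunable strong consistency is that every read quorum must overlap every write quorum in at least one replica, which holds when: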

R + W > N

W↑, R↓ — writes become more expensive, reads become cheaper
W↓, R↑ — writes get faster, but reads need more replica coordination
R=1, W=1 — lowest latency, but also the weakest freshness guarantees
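
The reason R + W > N works is a pigeonhole argument: if the two quorums together contain more than N replicas, they must share at least one, so a read always touches a replica that saw the latest acknowledged write. The brute-force check below (the function name is an assumption) confirms this by enumerating every possible pair of quorums.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r shares a replica with every write quorum of size w."""
    replicas = range(n)
    return all(set(read_q) & set(write_q)
               for read_q in combinations(replicas, r)
               for write_q in combinations(replicas, w))
```

For N=3, the settings R=2, W=2 always overlap, while R=1, W=1 can read from a replica the write never reached, which is the stale-read case listed above.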

Repair and maintenance

Background jobs handle compaction, anti-entropy, hinted handoff, and rebalancing after node loss or cluster growth. These paths matter just as much as the happy-path API: they are what keeps the system standing when something breaks.

Hinted handoff: temporarily stores writes for an unavailable replica
Read repair: converges stale copies during read traffic
Merkle trees: compare replica state efficiently without transferring full shards
Rebalancing: moves partitions while keeping the cluster available
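
The Merkle-tree comparison can be sketched as follows. This is a simplified illustration under stated assumptions: keys are grouped into a fixed number of hash buckets, each replica is represented as a key-to-version mapping, and the helper names are hypothetical.

```python
import hashlib
from typing import Dict, List

LEAVES = 8  # number of key buckets; a power of two keeps the tree balanced


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str) -> int:
    return _h(key.encode())[0] % LEAVES


def build_tree(data: Dict[str, int]) -> List[List[bytes]]:
    """Return tree levels, leaves first and root last. `data` maps key -> version."""
    buckets: List[List[str]] = [[] for _ in range(LEAVES)]
    for key in sorted(data):
        buckets[bucket_of(key)].append(f"{key}={data[key]}")
    level = [_h("\n".join(b).encode()) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_buckets(a: List[List[bytes]], b: List[List[bytes]]) -> List[int]:
    """Walk from the root down, descending only into subtrees whose hashes differ."""
    top = len(a) - 1
    suspects = [0] if a[top][0] != b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [child
                    for node in suspects
                    for child in (2 * node, 2 * node + 1)
                    if a[depth][child] != b[depth][child]]
    return suspects
```

Two replicas exchange only hashes along the differing path and then transfer just the keys in the buckets returned, instead of streaming the whole shard.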

Risks and common mistakes

Hot partitions: a bad key design creates load skew and turns a single shard into the cluster bottleneck.
Latency spikes: compaction, repair jobs, and disk contention can easily inflate p95 and p99 latency.
Stale reads: under weaker consistency settings, clients may observe outdated state.
Blind retries: without idempotency and versioning, correctness falls apart quickly.
Premature complexity: teams often overdesign consistency or choose the wrong storage engine before validating the real workload profile.
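
The hot-partition risk is easy to quantify before it happens in production. The sketch below measures load skew for a given request mix and then applies key salting, which is one common mitigation and an assumption here rather than something the design above prescribes. All function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict


def partition_of(key: str, partitions: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def load_by_partition(requests: Dict[str, int], partitions: int) -> Counter:
    load: Counter = Counter()
    for key, count in requests.items():
        load[partition_of(key, partitions)] += count
    return load


def skew(load: Counter, partitions: int) -> float:
    """Ratio of the hottest partition's load to the mean; 1.0 is perfectly even."""
    mean = sum(load.values()) / partitions
    return max(load.values()) / mean


def salt_hot_key(requests: Dict[str, int], hot_key: str, fanout: int) -> Dict[str, int]:
    """Spread one hot key's traffic over `fanout` sub-keys (readers must then fan out)."""
    spread = {k: v for k, v in requests.items() if k != hot_key}
    share = requests[hot_key] // fanout
    for i in range(fanout):
        spread[f"{hot_key}#{i}"] = share
    return spread
```

Salting trades a simpler write path for more expensive reads, since a reader has to query every sub-key and combine the results, so it fits counters and append-style data better than single-value lookups.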

What to cover in an interview

A strong answer does not pretend one mode solves everything. It names the trade-off between latency, consistency, and operating cost explicitly — and shows that it differs by class of data.

Which consistency mode is required for each data class and why.
How the system behaves during node loss, zone loss, or temporary replica unavailability.
How key design, value-size limits, and hot-partition mitigation are handled in practice.
Which SLOs and metrics matter most: p95/p99, error rate, replica lag, compaction backlog, and rebalance duration.
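
Tail percentiles are worth computing explicitly, because an average hides exactly the requests that compaction and repair slow down. The helper below uses the nearest-rank definition; the function name is an assumption, and monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With 98 requests at 5 ms and 2 requests at 500 ms, the mean is 14.9 ms and p95 is still 5 ms, but p99 is 500 ms, which is why the SLO list tracks p99 alongside p95.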

References

Amazon Web Services: Core components of Amazon DynamoDB (AWS Docs)
Apache Cassandra: Dynamo-style architecture (Apache Cassandra Documentation)
Martin Kleppmann: Designing Data-Intensive Applications (O’Reilly)

Related chapters

Replication and sharding - Where shard groups come from, and why rebalancing and load distribution get settled before the engine choice.
Designing Data-Intensive Applications, 2nd Edition (short summary) - Foundational model for consistency, replication, and failure behavior in distributed storage.
Redis: in-memory database and architecture - A practical low-latency storage example with explicit operational trade-offs.
Cassandra: distributed wide-column database - A distributed KV-style approach with tunable consistency and strong write scalability.
Acing the System Design Interview (short summary) - Interview framework and step-by-step walkthrough for the key-value database design case.
System design case studies overview - Where a KV store is the right call, and where the problem runs into other infrastructure and product constraints.