System Design Space
Knowledge graphSettings

Updated: May 2, 2026 at 5:42 PM

Key-Value Database

medium

Distributed storage with partitioning, replication, quorum reads and writes, and background repair.

A key-value database looks simple only at the API level. Underneath, it quickly becomes a discussion about partitioning, replication, quorum behavior, and background storage work.

The chapter breaks down write path, read path, storage-engine choice, membership changes, and recovery after node loss.

For interviews and engineering discussions, this case is useful because it quickly reveals whether you understand where interface simplicity ends and the real cost of reliable storage begins.

Quorum

Choosing R and W shapes not only consistency guarantees, but also the latency profile of reads and writes.

Compaction

Background storage work is not secondary: it directly affects p95/p99 behavior and the real cost of writes.

Hot Partitions

Key skew can turn a single shard into the bottleneck, which is why key design matters as much as replica count.

Rebalancing

Adding nodes and moving partitions should not create long availability drops or uncontrolled traffic shifts.

Key-value databases look simple only at the API level. Underneath, they are distributed storage systems where latency, consistency, data durability, and operating cost pull in different directions. This case is useful because it forces you to reason about horizontal scaling, storage-engine choice, and failure recovery without hiding behind product complexity.

Source

Acing the System Design Interview

Chapter 8: designing a key-value database with emphasis on partitioning, quorum, and fault tolerance.

Читать обзор

Examples of key-value systems

  • Amazon DynamoDB: managed KV storage with partitioning, replication, and global tables.
  • Redis: an in-memory system built for very low latency and rich data structures.
  • Cassandra: a distributed wide-column database that behaves like a KV store in many workloads.
  • Riak: an AP-oriented architecture with vector clocks and explicit repair behavior.
  • etcd/Consul: strongly consistent KV stores used for service discovery and configuration.

Functional requirements

Core API

  • PUT /kv/:key — write a value for a key
  • GET /kv/:key — read a value by key
  • DELETE /kv/:key — delete a key
  • BATCH /kv — execute batched operations

Extended capabilities

  • TTL support and background cleanup for expired keys
  • Compare-and-swap for conditional updates
  • Idempotent write handling for safe retries
  • Operational APIs for health checks, metrics, replica lag, and repair backlog

Non-functional requirements

RequirementTargetWhy it matters
Read latency (p95)< 20msThe store often sits on a hot serving path for both product and infrastructure traffic.
Write latency (p95)< 40msThe system must absorb a stable stream of updates even during bursts.
Availability99.99%For many services this is a foundational dependency, not an optional subsystem.
ScalabilityNear-linear growth by partition countTraffic and data growth should not force a full redesign.
Data durabilitySurvive node and zone lossLosing configuration, sessions, or critical shared state is unacceptable.

High-level architecture

Theory

Replication and sharding

Practical framing for data partitioning, rebalancing, and consistency choices in distributed storage.

Читать обзор

At a high level, the platform separates write path, read path, hash-based routing, replicated shard groups, and background maintenance jobs. That separation keeps hot request handling away from heavier repair and rebalancing work.

Architecture Map

partitioning + replication + quorum

The map separates request handling, shard groups, and background maintenance and recovery processes.

Client
service or API
Coordinator + Hash Ring
route + partition map
Shard Groups
A/B/C: leader + replicas
WAL + Compaction + Repair
durability + maintenance

Data Model Map

Logical record structure and its physical placement inside a distributed KV cluster.

Logical Record

key

user:123:session

value

opaque blob / json / bytes

metadata

version: 42ttl: 24hchecksum

Physical Placement

partitioning

hash(key) -> partition_id: 17

replica set

A-leader, A-r1, A-r2

lifecycle

active -> ttl-expired -> background sweep

Identity

The key defines storage placement and request routing inside the cluster.

Consistency

The version field supports safe CAS updates and conflict handling.

Reliability

Checksums and replicas help detect and recover corrupted data.

Read and write paths through the components

The interactive view shows how a request moves from the client into a shard group and back: writes are persisted through WAL and replica acknowledgements, while reads involve source selection, version checks, and optional repair of lagging replicas.

Key-value read/write path explorer

Interactive walkthrough of how a request flows through the coordinator, WAL, and replica group.

1
Write Request
PUT / DELETE / CAS
2
Coordinator
hash + route
3
Shard Leader
WAL append
4
Replica ACKs
quorum W
5
Write Response
version + metadata
Write path: coordinator routes the key into the right shard group, persists the update via WAL, and waits for quorum acknowledgements.

Write path

  1. The key maps into a shard group through consistent hashing, which enables horizontal growth.
  2. WAL protects durability between request acceptance and storage-engine application.
  3. W defines the latency versus durability trade-off for writes.
  4. Idempotency and CAS prevent duplicates and lost updates during retries.

Storage engine: B-Tree vs LSM-Tree

The storage-engine choice shapes the whole system profile. Some workloads care more about fast point lookups and range reads, while others are dominated by write throughput, compaction cost, and SSD efficiency.

B-Tree vs LSM tree: choosing a storage structure

B-Tree architecture

[10 | 20 | 30]
[3|5|8]
[12|15|18]
[22|25|28]
Leaves contain pointers to data
✓ Advantages
  • Fast reads: O(log N)
  • Efficient range queries
  • In-place updates
✗ Drawbacks
  • Write amplification
  • Random I/O on writes
Used in:
PostgreSQLMySQL InnoDBOracleSQL ServerSQLite

Consistency and fault tolerance

Go deeper

Designing Data-Intensive Applications, 2nd Edition

Consistency, replica lag, anti-entropy, quorum behavior, and CAP/PACELC trade-offs.

Читать обзор

There is no single consistency model that fits every KV workload. You need to choose a consistency mode, define where eventual consistency is acceptable, and design explicit failure handling around replica loss and topology changes.

Quorum reads and writes

With replica factor N, you choose read quorum R and write quorum W. The familiar rule for strong tunable consistency is:

R + W > N
  • W↑, R↓ — writes become more expensive, reads become cheaper
  • W↓, R↑ — writes get faster, but reads need more replica coordination
  • R=1, W=1 — lowest latency, but also the weakest freshness guarantees

Repair and maintenance

Background jobs handle compaction, anti-entropy, hinted handoff, and rebalancing after node loss or cluster growth. Those paths matter just as much as the happy-path API.

  • Hinted handoff: temporarily stores writes for an unavailable replica
  • Read repair: converges stale copies during read traffic
  • Merkle trees: compare replica state efficiently without transferring full shards
  • Rebalancing: moves partitions while keeping the cluster available

Risks and common mistakes

  • Hot partitions: a bad key design creates load skew and turns a single shard into the cluster bottleneck.
  • Latency spikes: compaction, repair jobs, and disk contention can easily inflate p95 and p99 latency.
  • Stale reads: under weaker consistency settings, clients may observe outdated state.
  • Blind retries: without idempotency and versioning, correctness falls apart quickly.
  • Premature complexity: teams often overdesign consistency or choose the wrong storage engine before validating the real workload profile.

What to cover in an interview

A strong answer makes the trade-off between latency, consistency, and operating cost explicit for different classes of data instead of pretending one mode solves everything.

  • Which consistency mode is required for each data class and why.
  • How the system behaves during node loss, zone loss, or temporary replica unavailability.
  • How key design, value-size limits, and hot-partition mitigation are handled in practice.
  • Which SLOs and metrics matter most: p95/p99, error rate, replica lag, compaction backlog, and rebalance duration.

Related chapters

Enable tracking in Settings