Acing SDI
Practice task from chapter 8
A key-value storage design case focused on consistency and scaling trade-offs.
A key-value database is a core infrastructure interview problem. It checks whether a candidate can design a storage layer with explicit trade-offs among latency, consistency, durability, and operational simplicity.
Functional requirements
- get/put/delete operations for keys.
- TTL support and background expiration cleanup.
- Batch operations for platform use cases.
- Operational APIs for metrics and health checks.
Non-functional requirements
- 99.99% availability for the read path.
- p95 latency: read < 20ms, write < 40ms.
- Horizontal scaling without downtime.
- Predictable behavior under node and network failures.
High-Level Architecture
The diagram below shows a baseline KV setup: request plane, consistent-hash routing, shard groups with replication, and background storage-maintenance processes.
Architecture Map
Partitioning + replication + quorum. The map separates the request plane, shard groups, and background maintenance/recovery processes.
Read / Write Path through components
This view walks through how writes are persisted via the WAL and quorum ACKs, and how reads pass through replica selection, version checks, and response assembly.
Write path
- The key maps to a shard group through consistent hashing, enabling horizontal scaling.
- WAL protects durability between request acceptance and storage engine flush.
- Quorum W sets the latency vs durability trade-off for writes.
- Idempotency and CAS prevent duplicates and lost updates during retries.
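The steps above can be sketched as a minimal write coordinator. This is an illustration, not a production design; the class and method names (`WriteCoordinator`, `Replica`, `apply`) are assumptions, and the WAL is a plain in-memory list standing in for a durable log.

```python
import time

# Minimal sketch of the write path: append to a WAL first for durability,
# then fan out to replicas and succeed once W of N have ACKed.
class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return True  # ACK

class WriteCoordinator:
    def __init__(self, replicas, w):
        self.replicas = replicas  # N replica objects
        self.w = w                # write quorum size W
        self.wal = []             # stand-in for a durable write-ahead log

    def put(self, key, value):
        # 1. Record the intent before touching replicas (durability first).
        self.wal.append((time.time(), key, value))
        # 2. Fan out; the write commits once W replicas acknowledge.
        acks = sum(1 for r in self.replicas if r.apply(key, value))
        return acks >= self.w

coord = WriteCoordinator([Replica() for _ in range(3)], w=2)
assert coord.put("user:42", "profile")  # ACKed by at least 2 of 3 replicas
```

Raising `w` toward N tightens durability at the cost of write latency, which is exactly the trade-off the bullet list describes.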
Data model
- key: string/bytes, mapped into a stable hash slot.
- value: blob/document up to bounded size.
- version: monotonic counter or timestamp.
- expires_at: for TTL behavior and cleanup.
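A minimal record type for this data model might look as follows; the field and method names are assumptions chosen to mirror the list above, with `expires_at` stored as absolute epoch seconds.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Sketch of the data model above: key, bounded value, version, optional TTL.
@dataclass
class Record:
    key: str
    value: bytes
    version: int                 # monotonic counter bumped on each write
    expires_at: Optional[float]  # absolute epoch seconds; None = no TTL

    def is_expired(self, now: Optional[float] = None) -> bool:
        # Used by both the read path (filter stale data) and the
        # background expiration sweep (reclaim space).
        if self.expires_at is None:
            return False
        return (now if now is not None else time.time()) >= self.expires_at

rec = Record("session:abc", b"{}", version=1, expires_at=time.time() + 30)
assert not rec.is_expired()  # 30s TTL has not elapsed yet
```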
Consistency and scaling
- Quorum reads/writes: R + W > N for tunable consistency.
- Hinted handoff and anti-entropy (Merkle trees) for repair.
- Virtual nodes to smooth rebalancing and reduce hot partitions.
- Read-repair for hot keys with stale replicas.
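The quorum and read-repair items above can be combined in a short sketch. Assumptions: replicas expose hypothetical `get`/`put` methods over `(version, value)` pairs, and last-write-wins by version is an acceptable conflict rule; when R + W > N, at least one of the R contacted replicas holds the latest committed version.

```python
# Sketch of a quorum read with read-repair over versioned replicas.
class Replica:
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)  # (version, value) or None

    def put(self, key, versioned):
        self.store[key] = versioned

def quorum_read(replicas, key, r):
    # Contact R replicas and keep the freshest (highest-version) copy.
    contacted = replicas[:r]
    results = [rep.get(key) for rep in contacted]
    latest = max((x for x in results if x is not None), default=None)
    if latest is None:
        return None
    # Read-repair: push the freshest copy back to stale contacted replicas.
    for rep, res in zip(contacted, results):
        if res != latest:
            rep.put(key, latest)
    return latest

reps = [Replica(), Replica(), Replica()]
reps[0].put("k", (1, "old"))
reps[1].put("k", (2, "new"))           # replica 3 never saw the write
assert quorum_read(reps, "k", r=2) == (2, "new")
assert reps[0].get("k") == (2, "new")  # stale replica repaired on read
```

Anti-entropy with Merkle trees covers the replicas a read never touches; read-repair only fixes divergence on keys that are actually read.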
Interview discussion points
- Why this consistency mode was chosen and which business flows depend on it.
- How the system behaves under node loss, partitions, and massive retries.
- How hot keys are mitigated and which value-size limits are enforced.
- What backup/restore strategy is used and what real RTO/RPO looks like.
