Foundation
Database Internals
Understanding the internal mechanisms of a DBMS helps you make meaningful architectural decisions.
The Database Selection Framework is a practical way to choose a DBMS based on a specific product and its constraints, rather than on the team's personal preferences. The purpose of this chapter is to walk structurally through the choice between OLTP and OLAP, pin down the read/write profile and consistency requirements, and then make replication and sharding decisions with operational complexity in mind.
OLTP vs OLAP
OLTP
Many short transactions, high concurrency, and strict requirements for latency and write correctness.
- p95/p99 latency is critical for user operations.
- We need ACID guarantees and predictable transactional semantics.
- Frequent point reads/writes with key-based access patterns.
PostgreSQL / MySQL / distributed SQL depending on scale and consistency requirements.
OLAP
Heavy analytical queries, aggregations over large ranges, columnar scans, and read-path savings.
- The main load is read-heavy analytics and reports.
- We need materialized views (data marts) and high-volume batch/stream ingestion.
- A trade-off on the write path in exchange for fast analytics.
ClickHouse, or a DWH/lakehouse approach with a columnar engine.
Decision Framework (5 axes)
1. Read/Write profile
Evaluate the R/W ratio, operation sizes, burst patterns, and the required latency SLA.
- How many writes/sec and reads/sec are expected at the start and after a year?
- How much data is in the hot-set and what is the nature of the access keys?
- Are there traffic spikes, and how often are backfills/bulk loads needed?
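The questions above reduce to simple arithmetic that is worth writing down explicitly. Below is a minimal back-of-the-envelope sketch; all numbers are hypothetical placeholders, not recommendations.

```python
def workload_profile(reads_per_sec: float, writes_per_sec: float,
                     growth_factor: float = 1.0) -> dict:
    """Project the R/W ratio and total ops/sec after applying growth."""
    r = reads_per_sec * growth_factor
    w = writes_per_sec * growth_factor
    return {
        "reads_per_sec": r,
        "writes_per_sec": w,
        "rw_ratio": r / w if w else float("inf"),
        "total_ops_per_sec": r + w,
    }

# Example: 900 reads/s and 100 writes/s today, 3x growth expected in a year.
today = workload_profile(900, 100)
next_year = workload_profile(900, 100, growth_factor=3.0)
print(today["rw_ratio"])               # 9.0 -- read-heavy; read replicas may fit
print(next_year["total_ops_per_sec"])  # 3000.0
```

A ratio like 9:1 suggests an OLTP store plus read replicas; a write-heavy profile shifts the conversation toward write scaling and ingestion design instead.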
2. Consistency and transactions
Decide where strict consistency is needed and where eventual consistency is sufficient.
- Are stale reads acceptable, and in which user flows are they safe?
- Are multi-row/multi-entity transactions required?
- What level of read/write concern does the business require?
3. Replication
Determine the purpose of replication: HA, geodistribution, read scaling, disaster recovery.
- Is sync replication necessary for critical data?
- What are the RPO/RTO target metrics for disaster recovery?
- What lag on async replicas is acceptable for the product?
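One way to connect the lag and RPO questions is to treat the worst observed async-replica lag as an approximation of worst-case data loss on failover. A hedged sketch, with made-up lag samples and RPO targets:

```python
def max_potential_data_loss_seconds(lag_samples: list[float]) -> float:
    """Worst observed replication lag approximates worst-case loss (RPO)."""
    return max(lag_samples)

def meets_rpo(lag_samples: list[float], rpo_seconds: float) -> bool:
    """Does async replication stay within the RPO target?"""
    return max_potential_data_loss_seconds(lag_samples) <= rpo_seconds

lag = [0.4, 1.2, 0.8, 2.5]  # seconds, e.g. sampled from replica metrics
print(meets_rpo(lag, rpo_seconds=5.0))  # True  -- async replication may suffice
print(meets_rpo(lag, rpo_seconds=1.0))  # False -- consider sync replication
```

If the business RPO is tighter than the observed lag ever gets, async replication alone cannot meet it, which is exactly when sync replication for critical data enters the discussion.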
4. Sharding
Consider a sharding strategy only when vertical scaling no longer solves the problem.
- Which shard key minimizes hotspots and cross-shard queries?
- How painful are re-sharding and key migrations?
- Which operations will become distributed transactions?
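The shard-key and cross-shard questions can be made concrete with a small routing sketch. Hash routing and the `user:` key format here are illustrative assumptions:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Stable hash routing: the same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def is_cross_shard(keys: list[str], n_shards: int) -> bool:
    """A query touching keys on different shards becomes a distributed one."""
    return len({shard_for(k, n_shards) for k in keys}) > 1

# Routing by user_id keeps all of one user's rows on a single shard, so
# "all orders of user 42" stays single-shard; a query spanning several
# users may fan out and turn into a distributed operation.
print(shard_for("user:42", 8) == shard_for("user:42", 8))  # True, deterministic
```

Checking your top queries against `is_cross_shard` before committing to a key is a cheap way to surface which operations will become distributed transactions.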
5. Operational complexity
Compare not only technical fit but also cost of ownership: teams, tooling, runbooks.
- Does the team have expertise in the selected DBMS?
- How do backup/restore, observability, and on-call practices work?
- How much does a cluster cost at the target data volume and SLA?
Quick selection matrix
Transactional backend (orders, payments, accounts)
OLTP-first
Strict consistency, transactions, and low-latency reads/writes.
Product analytics, BI and ad-hoc queries
OLAP-first
Large scans and aggregations are more efficient in columnar stores.
Mixed workload (operations + analytics)
Polyglot persistence
Separate the OLTP write path from OLAP serving via a CDC/ETL/ELT pipeline.
Global scale with regional SLAs
Replication + selective sharding
A combination of read locality, HA, and controlled operational cost.
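The "polyglot persistence" row above can be sketched in miniature: writes hit the transactional store and are also captured in a change log, which a batch job later folds into an analytics-friendly aggregate. Everything here (`oltp_rows`, `change_log`, the functions) is an illustrative stand-in, not a specific tool's API.

```python
from collections import defaultdict

oltp_rows: dict[str, float] = {}  # stands in for the transactional table
change_log: list[tuple] = []      # stands in for a CDC stream / outbox

def write_order(order_id: str, amount: float) -> None:
    """Low-latency OLTP write; the change is also captured for analytics."""
    oltp_rows[order_id] = amount
    change_log.append(("insert", order_id, amount))

def run_etl_batch() -> dict:
    """Folds captured changes into an OLAP-style aggregate."""
    agg = defaultdict(float)
    for _op, _order_id, amount in change_log:
        agg["revenue"] += amount
    return dict(agg)

write_order("order:1", 10.0)
write_order("order:2", 5.5)
print(run_etl_batch())  # {'revenue': 15.5}
```

The point of the split is that heavy aggregation runs against the change log (or a columnar copy built from it), never against the hot OLTP tables.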
Practice
Replication and sharding
Practical models and visualizations: primary-replica, multi-leader, shard keys, and rebalancing.
Replication and sharding: minimum rules
Replication
- Define a sync/async strategy based on data criticality.
- Measure replication lag and impact on read consistency.
- Test failover not on paper but in regular game days.
Sharding
- Choose a shard key based on real query patterns, not abstractions.
- Evaluate cross-shard joins and distributed transaction cost.
- Plan a migration path for re-sharding in advance.
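To see why planning for re-sharding matters, compare naive modulo routing with a consistent-hash ring when a shard is added. This is a hedged sketch; shard names, vnode counts, and key formats are arbitrary.

```python
import hashlib
from bisect import bisect

def h64(s: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def modulo_route(key: str, n: int) -> int:
    return h64(key) % n

class Ring:
    """Consistent-hash ring with virtual nodes per shard."""
    def __init__(self, shards: list[str], vnodes: int = 64):
        self.points = sorted((h64(f"{s}#{v}"), s)
                             for s in shards for v in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def route(self, key: str) -> str:
        i = bisect(self.hashes, h64(key)) % len(self.points)
        return self.points[i][1]

keys = [f"user:{i}" for i in range(10_000)]

# Modulo routing: going from 4 to 5 shards remaps most keys (~80%).
moved_mod = sum(modulo_route(k, 4) != modulo_route(k, 5) for k in keys)

# Consistent hashing: the same change moves only roughly 1/5 of keys.
r4 = Ring([f"s{i}" for i in range(4)])
r5 = Ring([f"s{i}" for i in range(5)])
moved_ring = sum(r4.route(k) != r5.route(k) for k in keys)

print(moved_mod > moved_ring)  # True -- the ring moves far fewer keys
```

This is why the migration path deserves planning up front: with modulo routing, adding one shard forces a near-total data shuffle, while a ring-based scheme keeps re-sharding incremental.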
Common mistakes in choosing a DBMS
Choosing a database based on hype alone, without a workload profile or SLA.
Sharding prematurely, before vertical scaling and indexes are exhausted.
Ignoring operational costs: backups, migrations, on-call, observability.
Mixing OLTP and heavy analytics in one cluster without resource isolation.
Treating CAP/PACELC as purely theoretical and leaving consistency assumptions implicit.
