Introduction to Data Storage — System Design Space

This chapter shows storage evolution without mythology: teams move from files and simple OLTP models to NoSQL, NewSQL, and HTAP because real limits start piling up, not because the technology is fashionable.

In practice, it helps you describe where state actually lives, how it moves between queues, object storage, and databases, and why that immediately creates requirements around idempotency, retries, and event ordering.

In interviews and architecture discussions, it helps you explain why a data architecture grew more complex step by step, and why jumping to a heavyweight stack too early would have been a mistake.

Practical value of this chapter

State model

Make state location explicit: in app memory, queues, databases, or object storage, with clear guarantees per step.

API from data shape

Design API contracts around storage behavior: idempotency, retries, event ordering, and deduplication.

NewSQL and HTAP fit

Know when NewSQL/HTAP simplify architecture and when it is safer to separate transactional and analytical paths.

Interview narrative

Explain data evolution from simple persistence to distributed architecture without introducing unnecessary complexity.

Decision frame and editorial focus

Chapter focus

State-storage evolution, API contracts, and external storage choices.

Workload profile

Start from the data profile: source of truth, OLTP, analytics, search, cache, and event-stream responsibilities.

Good fit

Use this chapter as the entry point: it sets the reading order before you jump to a favorite database engine.

Boundary and risk

The main risk is mixing taxonomy, technology choice, and operational guarantees into one implicit recommendation.

Connect next

Connect conclusions to the database-selection framework, DDIA, and practical engine overviews.

Source

Essential Architecture - Data

Transcript of the 4 October 2021 lecture on data storage and how storage choices shape APIs.

This chapter connects stateless applications, stateful components, data models, API contracts, and consistency guarantees. A storage choice rarely stays inside the database: it leaks outward and decides latency, retry behavior, idempotency, deduplication, and which team owns the data when something breaks.

We then move from files and classic OLTP to NoSQL, NewSQL, HTAP, object storage, data lakes, and Data Mesh. One idea runs through all of it: moving state outside the process is only half the job. After that you still have to answer honestly where that state lives and what guarantees the API can offer because of it.

Why Data Drives APIs

Related topic

The Twelve-Factor App

Stateless applications as a foundation for scaling and resilience.

A storage decision does not stay under the hood — it surfaces in the interface as a set of measurable properties:

Response latency and throughput
Boundaries between strong and eventual consistency
Error, retry, and deduplication model
Limitations on filtering/search/pagination
Idempotency for repeatable operations
Team ownership of data and contracts

Stateless as a foundation

Twelve-Factor principle: an application keeps no state inside the process. A new instance then comes up without any data migration, but the question of where state lives does not disappear — it simply has to be answered explicitly.
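
The idea can be sketched in a few lines. This is a minimal illustration, not a production pattern: `CounterStore` and `AppInstance` are hypothetical names, and the in-memory dictionary only stands in for a real external service such as a database or cache.

```python
class CounterStore:
    """Hypothetical external store; stands in for a database or cache service."""

    def __init__(self):
        self._values = {}

    def increment(self, key):
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]


class AppInstance:
    """Holds no request state of its own, only a handle to the shared store."""

    def __init__(self, store):
        self.store = store

    def handle_visit(self, user_id):
        return self.store.increment(f"visits:{user_id}")


store = CounterStore()
instance_a = AppInstance(store)
instance_a.handle_visit("u1")

# A replacement instance starts with nothing to migrate and still sees the count.
instance_b = AppInstance(store)
visits = instance_b.handle_visit("u1")
```

The second instance continues the count because the state never lived in the first one. The cost is that every guarantee now depends on what the store itself promises.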

The Evolution of State Storage

File systems

File formats and read logic quickly leak into application code, and changing them later is expensive.

Relational databases (OLTP)

SQL and transactions provide strong guarantees and an expressive interface — the cost is that write scaling has to be solved separately.

OLAP and analytics

Cubes and star and snowflake schemas come in when the operational store can no longer carry heavy aggregates for BI.
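
A star schema can be shown at toy scale with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are invented for the sketch. A real warehouse would use a columnar engine rather than SQLite, but the shape of the query is the same: one fact table joined to small dimension tables and then aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount_cents INTEGER
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (1, '2021-09'), (2, '2021-10');
INSERT INTO fact_sales VALUES (1, 1, 1000), (1, 2, 1500), (2, 2, 4000), (2, 2, 500);
""")

# Typical BI question: revenue per month and category.
revenue = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount_cents)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()

for row in revenue:
    print(row)
```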

Big Data / Hadoop

Once data stops fitting on one machine, MapReduce and the Hadoop ecosystem take over large-scale batch processing.

Object storage

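Objects instead of a rigid directory hierarchy, with the S3 API as the de facto standard. Storage is cheap and almost unbounded, but there are no multi-object transactions, and consistency guarantees vary by provider (Amazon S3 itself has offered strong read-after-write consistency since December 2020).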

NoSQL

Horizontal scaling in exchange for explicit consistency and query trade-offs.

NewSQL

SQL and ACID guarantees in a distributed architecture for transactional workloads.

HTAP

Convergence of OLTP and OLAP: near-real-time analytics next to operational data.

NewSQL and HTAP in architecture decisions

When NewSQL is the right fit

Fits when you need SQL semantics, strong transactions, and horizontal growth without manual shard management.

When HTAP is the right fit

Pays off where product workflows need both operational transactions and near-real-time analytics over the same data.

Key risks

Higher operational complexity, expensive cross-region queries, and limits for heavy analytical workloads.

How to frame this in interviews

A strong answer names the concrete pain, states the trade-offs being accepted, and shows which constraints keep the risk under control.

Practical rule of thumb: use NewSQL for stateful core workflows where correctness is expensive to get wrong, and HTAP for product domains that need analytics almost in sync with operational traffic.

Related chapter: NewSQL: TiDB, CockroachDB, and YDB.

Relational databases: key concepts

Related topic

Database Internals

B-trees, LSM trees, and transactions inside the DBMS.

Normalization

Data shape drives schema and query behavior. Normalization removes redundancy and keeps updates consistent, but push it too far and every read turns into a cascade of joins.

SQL

A declarative language separates the “what” from the “how” and leaves the execution plan to the engine.

Indexes

They speed up reads, but each one is extra work on every write and update.
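
The read side of that trade-off is easy to observe with the standard-library `sqlite3` module. The sketch below uses an invented `users` table and index name, and compares the query plan before and after creating an index. The exact plan wording differs between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT name FROM users WHERE email = ?"
before = plan(query, ("user42@example.com",))  # full table scan

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query, ("user42@example.com",))   # lookup through the index

print(before)
print(after)
```

The write side does not show up in the plan: once the index exists, every insert into `users` and every update of `email` also has to maintain `idx_users_email`.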

Transactions and ACID

Atomicity, consistency, isolation, and durability shape system contracts.
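
Atomicity is the easiest of these properties to demonstrate. The sketch below uses `sqlite3` with an invented `accounts` table. The credit is deliberately applied before the debit so that a failed debit shows the rollback undoing work that had already been done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


ok = transfer("alice", "bob", 60)        # succeeds: alice 40, bob 60
rejected = transfer("alice", "bob", 60)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

After the rejected transfer, bob's balance is unchanged even though his credit was executed first. That is the contract an API built on this store can pass on to its callers.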

Replication

Buys you failover and read scaling, but you pay for it with replica lag and explicit consistency trade-offs.
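
One common way to live with replica lag is to let the client carry the position of its last write and to read from a replica only when the replica has reached it. The simulation below is a sketch under simplifying assumptions: `Primary`, `Replica`, and `read` are hypothetical names, and replication is a manual `apply_from` call instead of a background stream.

```python
class Primary:
    def __init__(self):
        self.data, self.version, self.log = {}, 0, []

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        self.log.append((self.version, key, value))
        return self.version  # acts like a log position handed back to the client


class Replica:
    def __init__(self):
        self.data, self.version = {}, 0

    def apply_from(self, primary, up_to):
        # Replay log entries the replica has not seen yet, up to version `up_to`.
        for version, key, value in primary.log[self.version:up_to]:
            self.data[key] = value
            self.version = version


def read(key, primary, replica, min_version):
    """Serve from the replica only if it has reached the caller's last write."""
    source = replica if replica.version >= min_version else primary
    return source.data.get(key), type(source).__name__


primary, replica = Primary(), Replica()
token = primary.write("profile:1", "v1")

lagging = read("profile:1", primary, replica, min_version=token)
replica.apply_from(primary, up_to=primary.version)
caught_up = read("profile:1", primary, replica, min_version=token)
```

While the replica lags, the read falls back to the primary. Once it catches up, the same read is served by the replica. Without the token, the first read could have returned nothing for a profile the client had just written.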

Sharding

Spreads load by shard key — and immediately complicates any query that touches more than one shard.
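
Both halves of that statement fit into one sketch. It assumes a fixed number of shards and hash-based routing, and the in-memory dictionaries stand in for separate database nodes. A lookup by shard key touches one shard, while a query on any other attribute has to visit all of them.

```python
import hashlib

NUM_SHARDS = 4  # assumed fixed shard count for this sketch
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """Stable hash of the shard key; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def put(user_id: str, record: dict) -> None:
    shards[shard_for(user_id)][user_id] = record


def get(user_id: str):
    # Single-shard read: the key tells us exactly where to look.
    return shards[shard_for(user_id)].get(user_id)


def count_by_country(country: str) -> int:
    # Not keyed by user_id, so every shard must be asked (scatter-gather).
    return sum(
        1 for shard in shards for rec in shard.values() if rec["country"] == country
    )


for i in range(100):
    put(f"user-{i}", {"country": "DE" if i % 2 == 0 else "FR"})
```

Plain modulo routing also makes resharding expensive, because changing `NUM_SHARDS` moves most keys. Real systems often use consistent hashing or range-based partitioning for that reason.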

Go deeper: Designing Data-Intensive Applications, 2nd Edition and Database Internals.

Integration between systems

Related topic

Enterprise Integration Patterns

Files, RPC, and messaging as integration patterns.

File transfer

A simple, cheap exchange pattern, but encapsulation is weak: the file format effectively becomes the contract.

Shared database

A shared schema couples teams tightly and slows delivery — you cannot change a table without affecting your neighbors.

RPC

RPC gives strong contracts, but the price is versioning discipline.

Messaging

Fits asynchronous workflows and flexible integrations — at the cost of having to reason about ordering and delivery explicitly.
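
What reasoning about ordering and delivery explicitly can look like on the consumer side is sketched below. It assumes at-least-once delivery and that the producer stamps each message with a per-key sequence number starting at 1. Both are assumptions of this example, not something every broker provides.

```python
from collections import defaultdict


class OrderedConsumer:
    """Applies each (key, seq) once and in sequence order, per key."""

    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)  # next expected seq per key
        self.pending = defaultdict(dict)        # early arrivals, buffered per key
        self.applied = []                       # stand-in for the real side effect

    def handle(self, key, seq, payload):
        if seq < self.next_seq[key]:
            return  # duplicate delivery: already applied
        self.pending[key][seq] = payload  # a repeated early arrival just overwrites
        # Drain everything that is now contiguous with what was applied.
        while self.next_seq[key] in self.pending[key]:
            current = self.next_seq[key]
            self.applied.append((key, current, self.pending[key].pop(current)))
            self.next_seq[key] = current + 1


consumer = OrderedConsumer()
deliveries = [
    ("order-1", 1, "created"),
    ("order-1", 3, "shipped"),  # arrives early
    ("order-1", 1, "created"),  # redelivered
    ("order-1", 2, "paid"),
    ("order-1", 3, "shipped"),  # redelivered after being applied
]
for message in deliveries:
    consumer.handle(*message)
```

The buffer is unbounded here. A real consumer would also need a timeout or a gap-handling policy for sequence numbers that never arrive.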

A shared database creates high coupling and quietly breaks contracts between teams: any schema change becomes everyone's problem. The way out is to give each dataset a single owning team and to move integration into explicit interfaces.

Data Lake vs Data Mesh

Related topic

Big Data

The evolution of analytics and architectural layers.

Data Lake

Centralized data collection from OLTP systems through ETL processes. As scale grows, the bottleneck is not volume but ownership and quality: one central team can no longer keep up with every domain.

Data Mesh

Domain-centric decentralization
Data as a product
Self-service platform
Federated computational governance

DDD and domain boundaries

Related topic

Learning Domain-Driven Design

Bounded contexts and domain contracts.

Once domain boundaries and contracts are drawn between bounded contexts, an API survives internal rework without breaking its neighbors. DDD is exactly what helps separate the data models owned by different teams.

How data is turned into a convenient API

Bridge data -> API

Predictable guarantees (ACID vs BASE)
Clear sources of truth
Clear error and retry model
Domain and contract boundaries
Idempotency and deduplication
Isolation from a shared database

NoSQL through the lens of CAP/BASE

Understanding CAP and BASE tells you where you can only honestly promise a client eventual consistency, and where a retry without idempotency will produce a duplicate.
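
The duplicate-on-retry problem and its usual fix can be sketched with an idempotency key. `PaymentService` and its fields are hypothetical, and a real service would keep the key-to-response map in a durable store and write it in the same transaction as the charge.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Toy in-memory service used only to illustrate idempotency keys."""

    charges: list = field(default_factory=list)
    # Hypothetical dedup table: idempotency key -> stored response.
    responses: dict = field(default_factory=dict)

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A retried request carries the same key, so return the stored
        # response instead of applying the side effect a second time.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"charge_id": len(self.charges), "status": "ok"}
        self.responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-123", "acct-1", 500)
retry = service.charge("key-123", "acct-1", 500)  # client timed out and retried
```

The retry returns the original response and the account is charged once. Without the key, the same two calls would be indistinguishable from two intentional payments.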

Mini-checklist of a convenient API

It is clear what consistency guarantees the system provides.
The client understands where eventual consistency is possible.
Operations that can be repeated are idempotent.
Errors, retries, and timeouts are documented explicitly, including which errors are safe to retry.
There is no shared database as a hidden integration channel.
Domain boundaries are reflected in the API contract.
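
On the client side, the retry item of the checklist often turns into a bounded retry loop. The sketch below shows one reasonable policy, exponential backoff with full jitter. `TransientError` and `call_with_retries` are hypothetical names, and the loop is only safe when the wrapped operation is idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_retries(operation, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Retry `operation` on TransientError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the error to the caller
            # Full jitter: sleep a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("temporary failure")
    return "done"


# The sleep function is injected so the example runs instantly.
result = call_with_retries(flaky, sleep=lambda seconds: None)
```

Errors other than `TransientError` propagate immediately. Deciding which failures belong in the retryable class is part of the API contract and should not be left for each client to guess.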

Practical storage-selection scenarios

FinTech ledger / billing

Relational DB or NewSQL

Strong consistency, strict transactions, and deterministic handling of retries, idempotency, and audit trails.

Real-time product reporting

HTAP or OLTP + streaming + OLAP

Fast analytics with minimal ETL lag while keeping operational workflows responsive.

Telemetry and monitoring

TSDB + object storage

High-ingest writes, retention controls, and cost-efficient long-term historical storage.

Content + search + recommendations

Polyglot persistence

One database is rarely optimal for transactions, full-text search, and vector retrieval at the same time.

Related chapters

DB Guide - Practical playbook for selecting and operating data stores across different workload profiles.
Database selection framework: how to make architecture decisions - Decision model for OLTP, OLAP, and NoSQL choices under concrete non-functional requirements.
Designing Data-Intensive Applications, 2nd Edition (short summary) - Core concepts on data models, replication, and consistency that shape API behavior.
Database Internals (short summary) - Storage engine internals: B-Tree, LSM, WAL, latency, and throughput behavior.
Enterprise Integration Patterns (short summary) - Integration patterns for choosing between file exchange, RPC, and messaging.
CAP theorem - Baseline consistency-versus-availability trade-offs under network partition scenarios.
Data Mesh in Action (short summary) - What changes when a data platform moves from a central lake toward domain ownership, and who pays for that shift.
The Twelve-Factor App: cloud-native principles - Stateless app principle as the starting point for external state storage architecture.