This chapter shows storage evolution without mythology: teams move from files and simple OLTP models to NoSQL, NewSQL, and HTAP because real limits start piling up, not because the technology is fashionable.
In practice, it helps you describe where state actually lives, how it moves between queues, object storage, and databases, and why that immediately creates requirements around idempotency, retries, and event ordering.
In interviews and architecture discussions, it helps explain why the data architecture became more complex in that sequence instead of jumping to a heavyweight stack too early.
Practical value of this chapter
State model
Make state location explicit: in app memory, queues, databases, or object storage, with clear guarantees per step.
API from data shape
Design API contracts around storage behavior: idempotency, retries, event ordering, and deduplication.
NewSQL and HTAP fit
Know when NewSQL/HTAP simplify architecture and when it is safer to separate transactional and analytical paths.
Interview narrative
Explain data evolution from simple persistence to distributed architecture without introducing unnecessary complexity.
Source
Essential Architecture - Data
Transcript of the 4 October 2021 lecture on data storage and how storage choices shape APIs.
This chapter connects stateless applications, stateful components, data models, API contracts, and consistency guarantees. Storage choices directly affect latency, retries, idempotency, deduplication, and team ownership of data.
We then move from files and classic OLTP to NoSQL, NewSQL, HTAP, object storage, data lakes, and Data Mesh. The core idea stays simple: move state outside the process, but be explicit about where it lives and what guarantees the API receives.
Why Data Drives APIs
Related topic
The Twelve-Factor App
Stateless applications as a foundation for scaling and resilience.
Architectural decisions about data turn into properties of interfaces.
- Response latency and throughput
- Boundaries between strong and eventual consistency
- Error, retry, and deduplication model
- Limitations on filtering/search/pagination
- Idempotency for repeatable operations
- Team ownership of data and contracts
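The retry, deduplication, and idempotency items above can be sketched as a handler that remembers results by idempotency key. This is a minimal illustration, not a production design: `PaymentService`, `charge`, and the key format are hypothetical names, and the in-memory dict stands in for a durable store that a real service would need.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Toy service; the dicts stand in for a durable store (assumption)."""

    processed: dict = field(default_factory=dict)  # idempotency key -> stored result
    charges: list = field(default_factory=list)    # side effects actually performed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A retried request carries the same key, so return the stored result
        # instead of performing the side effect a second time.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges.append(amount)
        result = {"status": "charged", "amount": amount}
        self.processed[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("req-1", 100)
retry = service.charge("req-1", 100)  # client retry after a timeout
```

The client sees the same response for both calls, while only one charge is recorded. In a real system the key lookup and the side effect would have to be committed together, which is exactly where the storage choice shows up in the API contract.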
Stateless as a foundation
The Twelve-Factor principle: application processes keep no state of their own, and anything that must persist goes to a backing service. Scaling becomes easier, but the storage location must be chosen explicitly.
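A small sketch of the stateless idea, under the assumption that `ExternalStore` stands in for a real backing service such as a database or cache. `Worker` and `handle_visit` are hypothetical names used only for illustration.

```python
class ExternalStore:
    """Stand-in for an external backing service (assumption, not a real client)."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    def increment(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class Worker:
    """A stateless worker: it keeps no per-user state between calls."""

    def __init__(self, store: ExternalStore) -> None:
        self.store = store

    def handle_visit(self, user_id: str) -> int:
        # All state changes go through the external store.
        return self.store.increment(f"visits:{user_id}")


store = ExternalStore()
worker_a, worker_b = Worker(store), Worker(store)
worker_a.handle_visit("u1")
count = worker_b.handle_visit("u1")  # a different instance sees the same state
```

Because neither worker holds the counter, any instance can serve any request, and instances can be added or replaced without losing data. The cost is that every guarantee now depends on what the store provides.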
The Evolution of State Storage
File systems
File formats and read logic quickly leak into application code.
Relational databases (OLTP)
SQL and transactions provide strong guarantees and an expressive interface.
OLAP and analytics
Cubes, star and snowflake schemas, and aggregates support BI workloads.
Big Data / Hadoop
MapReduce and Hadoop ecosystem tools support large-scale batch processing.
Object storage
Objects without rigid hierarchies, with the S3 API as the de facto standard.
NoSQL
Horizontal scaling in exchange for explicit consistency and query trade-offs.
NewSQL
SQL and ACID guarantees in a distributed architecture for transactional workloads.
HTAP
Convergence of OLTP and OLAP: near-real-time analytics next to operational data.
NewSQL and HTAP in architecture decisions
When NewSQL is the right fit
When you need SQL semantics, strong transactions, and horizontal growth without manual shard management.
When HTAP is the right fit
When product workflows need operational transactions and near-real-time analytics over the same data.
Key risks
Higher operational complexity, expensive cross-region queries, and limits for heavy analytical workloads.
How to frame this in interviews
Explain the pain being solved, the trade-offs being accepted, and the constraints that reduce risk.
Practical rule of thumb: use NewSQL for stateful core workflows where correctness is expensive to get wrong, and HTAP for product domains that need analytics almost in sync with operational traffic.
Relational databases: key concepts
Related topic
Database Internals
B-Trees, LSM trees, and transactions inside the DBMS.
Normalization
Normalization removes redundancy, and the resulting shape of the data influences schema design and query behavior.
SQL
A declarative language separates the “what” from the “how.”
Indexes
They speed up reads, but slow down writes and updates.
Transactions and ACID
Atomicity, consistency, isolation, and durability shape system contracts.
Replication
Failover and read scaling with explicit consistency trade-offs.
Sharding
Routing by shard key and distributing load across partitions.
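Routing by shard key, as described in the sharding item above, can be sketched with a stable hash and a modulo. This is one simple scheme among several, not the method of any particular database. The names `shard_for`, `put`, and `get` are hypothetical, and the list of dicts stands in for separate database partitions.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a shard key to a shard index with a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would route differently after a restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Four in-memory dicts stand in for four partitions (assumption).
shards = [dict() for _ in range(4)]


def put(key: str, value: dict) -> None:
    shards[shard_for(key, len(shards))][key] = value


def get(key: str):
    return shards[shard_for(key, len(shards))].get(key)


put("user:42", {"name": "Ada"})
```

Reads and writes for the same key always land on the same partition. Note that plain modulo routing moves most keys when the shard count changes, which is why consistent hashing or range-based schemes are often preferred when resharding is expected.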
Go deeper: Designing Data-Intensive Applications, 2nd Edition and Database Internals.
Integration between systems
Related topic
Enterprise Integration Patterns
Files, RPC, and messaging as integration patterns.
File transfer
A simple exchange pattern, but with weak encapsulation.
Shared database
High coupling and slower delivery because teams share the same schema.
RPC
Strong contracts, but requires versioning discipline.
Messaging
Asynchronous workflows and flexible integration boundaries.
A shared database couples teams through the schema and lets them bypass each other's contracts. Modern systems prefer explicit ownership boundaries and interfaces.
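With messaging, the consumer usually has to tolerate redelivered and late events. The sketch below shows one common approach, assuming each event carries the full new state and a per-entity sequence number set by the producer. `Event`, `Consumer`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    entity_id: str
    sequence: int  # per-entity, increasing; assumed to be assigned by the producer
    balance: int   # full new state, so an older event can simply be skipped


@dataclass
class Consumer:
    last_seen: dict = field(default_factory=dict)  # entity_id -> last applied sequence
    state: dict = field(default_factory=dict)      # entity_id -> latest balance

    def handle(self, event: Event) -> bool:
        """Apply the event unless it is a duplicate or older than what was applied."""
        if event.sequence <= self.last_seen.get(event.entity_id, 0):
            return False
        self.state[event.entity_id] = event.balance
        self.last_seen[event.entity_id] = event.sequence
        return True


consumer = Consumer()
results = [
    consumer.handle(Event("acct-1", 1, 100)),
    consumer.handle(Event("acct-1", 2, 80)),
    consumer.handle(Event("acct-1", 2, 80)),   # redelivery of the same event
    consumer.handle(Event("acct-1", 1, 100)),  # older event arriving late
]
```

This version drops stale events rather than buffering them, which only works when each event carries complete state. Events that carry deltas would need gap detection and reordering instead.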
Data Lake vs Data Mesh
Related topic
Big Data
The evolution of analytics and architectural layers.
Data Lake
Centralized data collection from OLTP systems through ETL processes. As scale grows, ownership and data quality become harder to manage.
Data Mesh
- Domain-centric decentralization
- Data as a product
- Self-service platform
- Federated computational governance
DDD and domain boundaries
Related topic
Learning Domain-Driven Design
Bounded contexts and domain contracts.
Domain boundaries and contracts between bounded contexts make APIs resilient. DDD helps separate data models owned by different teams.
How data is turned into a convenient API
Bridge data -> API
- Predictable guarantees (ACID vs BASE)
- Clear sources of truth
- Clear error and retry model
- Domain and contract boundaries
- Idempotency and deduplication
- Isolation from a shared database
NoSQL through the lens of CAP/BASE
Understanding CAP and BASE helps explain eventual consistency to clients and build correct retries.
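A hedged sketch of what "correct retries" can look like on the client side: bounded attempts with exponential backoff, applied only to errors known to be transient. `TransientError`, `call_with_retries`, and `flaky_write` are hypothetical names, and jitter is left out to keep the example short.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_retries(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Retry `operation` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


attempts = []


def flaky_write():
    # Simulates a write that fails twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("replica not reachable")
    return "ok"


delays = []
result = call_with_retries(flaky_write, sleep=delays.append)  # record delays instead of sleeping
```

Retrying like this is only safe when the operation is idempotent, because the client cannot tell whether a failed attempt was applied on the server before the error was returned.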
Mini-checklist of a convenient API
- It is clear what consistency guarantees the system provides.
- The client understands where eventual consistency is possible.
- Operations that can be repeated are idempotent.
- Error, retry, and timeout behavior is documented and deterministic.
- There is no shared database as a hidden integration channel.
- Domain boundaries are reflected in the API contract.
Practical storage-selection scenarios
FinTech ledger / billing
Relational DB or NewSQL
Strong consistency, strict transactions, and deterministic handling of retries, idempotency, and audit trails.
Real-time product reporting
HTAP or OLTP + streaming + OLAP
Fast analytics with minimal ETL lag while keeping operational workflows responsive.
Telemetry and monitoring
TSDB + object storage
High-ingest writes, retention controls, and cost-efficient long-term historical storage.
Content + search + recommendations
Polyglot persistence
One database is rarely optimal for transactions, full-text search, and vector retrieval at the same time.
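For the ledger scenario above, the combination of strict transactions and idempotent retries can be sketched with SQLite from the Python standard library. The schema, the `transfer` function, and the return values are assumptions made for illustration. A real ledger would also need an audit trail and a production-grade database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    CREATE TABLE transfers (
        idempotency_key TEXT PRIMARY KEY, src TEXT, dst TEXT, amount INTEGER
    );
    INSERT INTO accounts VALUES ('alice', 100), ('bob', 0);
""")


def transfer(key: str, src: str, dst: str, amount: int) -> str:
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            seen = conn.execute(
                "SELECT 1 FROM transfers WHERE idempotency_key = ?", (key,)
            ).fetchone()
            if seen:
                return "duplicate"  # a retry of an already applied transfer
            conn.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)", (key, src, dst, amount))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        return "applied"
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the whole transfer was rolled back.
        return "insufficient_funds"


r1 = transfer("t-1", "alice", "bob", 60)
r2 = transfer("t-1", "alice", "bob", 60)  # client retry with the same key
r3 = transfer("t-2", "alice", "bob", 60)  # would overdraw alice
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The transfer record and both balance updates commit or roll back together, so a retry never moves money twice and a rejected transfer leaves no partial write behind.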
Related chapters
- DB Guide - Practical playbook for selecting and operating data stores across different workload profiles.
- Database selection framework: how to make architecture decisions - Decision model for OLTP, OLAP, and NoSQL choices under concrete non-functional requirements.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - Core concepts on data models, replication, and consistency that shape API behavior.
- Database Internals (short summary) - Storage engine internals: B-Tree, LSM, WAL, latency, and throughput behavior.
- Enterprise Integration Patterns (short summary) - Integration patterns for choosing between file exchange, RPC, and messaging.
- CAP theorem - Baseline consistency-versus-availability trade-offs under network partition scenarios.
- Data Mesh in Action (short summary) - How data platforms evolve from centralized lake models toward domain ownership.
- The Twelve-Factor App: cloud-native principles - Stateless app principle as the starting point for external state storage architecture.
