Design principles for scalable systems

Scalability rarely fails at one fixed limit. As load grows, the bottleneck tends to move from compute into data, then into coordination and resilience.

This chapter pulls Scale Cube, CAP, PACELC, caching, async processing, CQRS, and resilience patterns into one picture so it becomes clearer which lever actually matters at the current stage of growth.

That framing is especially useful in architecture discussions and interviews: it lets you explain scale through quality priorities, the current bottleneck, and the cost of the next decision instead of through a pile of familiar buzzwords.

Practical value of this chapter

Quality Priorities

Decide early what matters most for the system: availability, data freshness, response time, or overload resilience.

Scaling Levers

Separate the moments where cloning services helps from the moments where data architecture or coordination changes matter more.

Growth Patterns

Treat caching, asynchrony, CQRS, and resilience as a connected growth path rather than as isolated tricks.

Decision Rationale

In interviews and design reviews, explain which limit the decision removes and what new cost the system accepts in return.

Growth levers

How the Bottleneck Moves

System growth usually looks like a sequence of shifting limits: compute first, then data, coordination, and resilience.

Compute

Clone services, remove unnecessary request-path work, and watch CPU saturation.

RPS growth and tail latency

Data

Add caches, replicas, shards, and separate read models when one path no longer holds.

data volume and hot keys

Coordination

Loosen coupling through queues, events, idempotency, and explicit consistency boundaries.

expensive synchronous coupling

Resilience

Contain blast radius with degradation, retries, fallback paths, and manual controls.

partial failures and overload

Main idea

Scalability is not one trick. At each stage you choose the lever for the current bottleneck and accept its cost deliberately.

Theory

Designing Data-Intensive Applications, 2nd Edition

One of the strongest books on data growth, consistency, and architectural trade-offs.

Scalability is not a checklist of fashionable techniques. It is the ability of a system to keep growing without becoming unmanageable. In practice, the bottleneck rarely stays in one place: first it sits in the compute layer, then in data, then in coordination and resilience. A list of techniques is not enough on its own; what matters is the order in which they become the next useful lever.

Scale Cube

The Scale Cube is useful because it turns “we need to scale” into a concrete decision: do we clone the same application, split the system by function, or partition the data so different parts of the workload stop colliding with each other?

X-axis: cloning

Copies

The fastest lever: run multiple identical instances behind a load balancer and spread requests across them.

Great for early web-layer growth

Does not remove shared data or domain bottlenecks

Y-axis: split by function

Services

Break the system into independent pieces. Catalog, payments, notifications, and search can now grow at different speeds.

Lets expensive domains scale independently

Raises the cost of coordination and integration

Z-axis: partition the data

Segments

The same service owns only one slice of the data space, for example by region, tenant, or key range.

Helps once a single data set stops being enough

Makes rebalancing and cross-partition work much harder

Deep Dive

CAP theorem

A formal frame for reasoning about distributed-system limits.

CAP theorem

CAP is useful not because it sounds academic, but because it forces a real conversation: under a network split, a distributed system cannot fully preserve both consistency and availability at the same time.

Consistency

All nodes observe the same state after the operation completes.

Availability

Every request that reaches a non-failing node still gets a response, even under partial failure.

Partition tolerance

The system keeps operating even when some nodes temporarily lose contact with others.

What matters in practice: when the network breaks, the real question is not “which architecture is correct”, but which failure hurts the product more right now: stale reads or missing responses.

Deep Dive

PACELC theorem

The CAP extension that keeps the normal-mode trade-off visible.

PACELC theorem

PACELC makes the model more honest: even when the network is healthy, the system still chooses between lower latency and stronger consistency.

Partition → Availability vs Consistency, Else → Latency vs Consistency

PA/EL systems

Under partitions they try not to drop responses, and in normal mode they optimize for fast replies.

Typical examples: Cassandra and DynamoDB in their default, eventually consistent configurations

PC/EC systems

Both under splits and in normal operation they lean toward stronger consistency and accept a more expensive read/write path.

Typical examples: HBase, synchronously replicated Postgres and MySQL

Why it matters: this is where you explain why a fast system is not always strict, and why a strict system is rarely the fastest option.
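The "Else" branch can be made concrete with quorum sizes. The following is a minimal sketch, assuming a leaderless store with N replicas where a write waits for W acknowledgements and a read waits for R responses. The function names and the latency numbers are illustrative assumptions, not taken from any specific database.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """A read is guaranteed to see the latest acknowledged write when R + W > N."""
    return r + w > n


def quorum_latency(replica_latencies_ms, quorum: int) -> float:
    """An operation completes once the `quorum`-th fastest replica has answered."""
    return sorted(replica_latencies_ms)[quorum - 1]


# Hypothetical per-replica response times: two nearby replicas, one remote.
latencies = [2.0, 5.0, 40.0]

fast_read = quorum_latency(latencies, 1)     # R=1: fastest reply wins
strict_read = quorum_latency(latencies, 2)   # R=2: wait for a second replica
```

With N=3 and W=2, a read at R=1 returns quickly but may miss the latest write, while R=2 guarantees overlap with the write quorum and pays for the slower replica. That is the latency-versus-consistency choice in miniature.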

Stateless architecture

Stateless compute is usually the first clean scaling lever. Once instances become replaceable, they are easier to autoscale, easier to roll forward, and easier to recover. Anything critical to correctness should live outside the process in an explicit stateful layer.

Autoscaling becomes real

Instances can be added and removed without heavy manual coordination.

Rollouts get safer

You can deploy in stages without carrying fragile user state inside the process.

Recovery speeds up

A failed node is replaced instead of being treated like a special snowflake.

Important: stateless does not mean “no in-memory optimizations.” It only means correctness must not depend on the memory of one process.
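A small sketch of what "correctness must not depend on the memory of one process" can look like. `ExternalSessionStore` is a hypothetical in-memory stand-in for a shared store such as Redis or a database, and `AppInstance` is an assumed name for one replica of the service.

```python
class ExternalSessionStore:
    """Stand-in for a shared stateful layer (hypothetical; a real one is networked)."""

    def __init__(self):
        self._carts = {}

    def append(self, session_id, item):
        self._carts.setdefault(session_id, []).append(item)

    def read(self, session_id):
        return list(self._carts.get(session_id, []))


class AppInstance:
    """One replaceable replica: it keeps no user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def add_to_cart(self, session_id, item):
        self._store.append(session_id, item)

    def view_cart(self, session_id):
        return self._store.read(session_id)


store = ExternalSessionStore()
instance_a = AppInstance("a", store)
instance_b = AppInstance("b", store)

instance_a.add_to_cart("session-1", "book")
del instance_a  # the instance that served the write goes away
cart_after_failover = instance_b.view_cart("session-1")
```

Because the cart lives in the shared store, any surviving instance can answer the next request, which is what makes autoscaling and node replacement routine.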

Related chapter

Replication and sharding

A deeper look at data topologies, trade-offs, and operational cost.

Database scaling patterns

Once one data set and one write path stop being enough, the architecture has to choose between leader-follower, multi-leader, and leaderless topologies. At that point, replication lag stops being a background detail and becomes a product-visible design constraint.

Sharding

Horizontal partitioning helps with write growth and data volume, but only if you think through routing keys and the future cost of rebalancing.

Range-based - good for ordered access patterns

Hash-based - better distribution under skew

Directory-based - more flexible but adds a routing layer

Cross-shard work and later reshaping are usually more expensive than the first diagram suggests.
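The routing options above can be sketched in a few lines. This is an illustration under stated assumptions: the boundary letters, the shard counts, and the function names are hypothetical, and the hash scheme is plain modulo rather than consistent hashing, chosen to show why rebalancing is costly.

```python
import bisect
import hashlib


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based routing: a stable digest of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based routing: shard 0 owns keys below "h", shard 1 owns "h".."o",
# shard 2 owns everything from "p" upward (assumed boundaries).
RANGE_BOUNDARIES = ["h", "p"]


def range_shard(key: str) -> int:
    return bisect.bisect_right(RANGE_BOUNDARIES, key[:1].lower())


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if hash_shard(k, old_shards) != hash_shard(k, new_shards)
    )
    return moved / len(keys)


sample_keys = [f"user-{i}" for i in range(1000)]
fraction = moved_fraction(sample_keys, 4, 5)
```

With plain modulo routing, going from 4 to 5 shards relocates most keys, which is one reason the cost of later reshaping deserves attention before the first shard key is chosen.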

Replication

Replication usually solves read scale and resilience first, but it immediately raises the question of how fresh reads need to be after writes.

Leader-follower - simpler operations, one write center

Multi-leader - more flexible writes, harder conflict handling

Leaderless - fits geo-distribution, but reads cost more

The real design question is not “which topology sounds modern,” but how you handle lag, failover, and stale reads at the client boundary.
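One common way to handle stale reads at the client boundary is to pin a user's reads to the leader for a short window after that user writes. The sketch below assumes a leader-follower topology. `ReadRouter`, `pin_seconds`, and the injectable `clock` are illustrative names, and the window length is an assumption about the upper bound of replication lag.

```python
import time


class ReadRouter:
    """Routes reads to the leader shortly after a user's own write."""

    def __init__(self, pin_seconds: float = 2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # assumed upper bound on replication lag
        self._clock = clock
        self._last_write = {}            # user_id -> time of the last write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self._clock()

    def choose_replica(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self.pin_seconds:
            return "leader"      # the follower may not have the write yet
        return "follower"        # stale reads are acceptable again


# Usage with a fake clock so the behavior is deterministic.
now = [100.0]
router = ReadRouter(pin_seconds=2.0, clock=lambda: now[0])
router.record_write("user-1")
right_after_write = router.choose_replica("user-1")
now[0] += 3.0
after_window = router.choose_replica("user-1")
```

The trade-off is explicit: reads after writes stay fresh for the writer, at the cost of extra leader load during the pin window.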

Related chapter

Caching strategies

A deeper look at cache patterns, invalidation, and the cost of buying speed.

Caching

Cache-aside (lazy loading), write-through, write-behind, and refresh-ahead are all useful ways to buy speed, but each one changes where staleness risk lives and how complicated the write path becomes.

CDN

~5 ms from the client to the edge

Redis / Memcached

~0.2-1 ms within the data center

Local cache

~0.1 ms in process memory

Database

~1-10 ms for indexed reads, 10-100+ ms for complex queries

Lazy loading

Read from cache first, then fall back to the database and backfill the cache on a miss.

Write-through

Writes pass through cache and synchronously update the backing store.

Write-behind

Cache batches writes and flushes them later, buying speed at the cost of more risk.

Refresh-ahead

Refresh hot objects before TTL expiry so the first slow read never reaches the user.
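Lazy loading (cache-aside) is the simplest of these patterns to sketch. The example below is a minimal in-process version with a TTL. `CacheAside`, `load_from_db`, and the injectable `clock` are assumed names, and a real deployment would typically put the entries in Redis or Memcached.

```python
import time


class CacheAside:
    """Read from cache first; on a miss, load from the source and backfill."""

    def __init__(self, load_from_db, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                      # fresh hit
        value = self._load(key)                  # miss or expired: go to the database
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read does not serve the old value."""
        self._entries.pop(key, None)


db_calls = []


def fake_db_lookup(key):
    db_calls.append(key)
    return key.upper()


now = [0.0]
cache = CacheAside(fake_db_lookup, ttl_seconds=10.0, clock=lambda: now[0])
first = cache.get("profile")    # miss: hits the database
second = cache.get("profile")   # hit: served from memory
```

The staleness risk lives in the TTL window and in any write path that forgets to call `invalidate`.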

Latency and throughput

Throughput tells you how much work the system can process, while p99 and tail latency tell you how painful the slow end of the distribution becomes for real users. As QPS rises, averages stop describing reality unless the system has clear queueing and backpressure behavior.

Latency

Time from request to response. On the user path, distribution tails matter much more than the mean.

Main metrics: p50, p95, p99
Main causes of pain: queues, locks, and extra network hops

Throughput

Operations or bytes per second. It becomes the dominant concern in batch systems, queues, and heavy background pipelines.

Main metrics: RPS, ops/sec, bytes/sec
Main cost drivers: blocking I/O and missing parallelism

Optimize for response time

Caching, shorter critical paths, parallelism, and pre-computation.

Optimize for work volume

Batching, queues, async execution, and managed horizontal growth.

Core trade-off

As utilization climbs, latency grows non-linearly, so the system has to learn to slow down before it collapses.

Practical rule: interactive flows usually care more about response time, while background pipelines care more about steady work volume.
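Two small calculations make these claims tangible. The first uses the textbook M/M/1 queue as a simplifying assumption (one server, random arrivals and service times), where mean time in the system is the service time divided by one minus utilization. The second computes nearest-rank percentiles to show how a mean can hide the tail. The sample numbers are invented for illustration.

```python
import math


def mm1_mean_latency(service_time_ms: float, utilization: float) -> float:
    """Mean time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# A 10 ms service at rising utilization.
latency_at_50 = mm1_mean_latency(10.0, 0.5)
latency_at_90 = mm1_mean_latency(10.0, 0.9)

# 98 fast requests and 2 slow ones (invented data).
samples = [10] * 98 + [500, 900]
mean_ms = sum(samples) / len(samples)
p50_ms = percentile(samples, 50)
p99_ms = percentile(samples, 99)
```

Under this model, moving from 50% to 90% utilization multiplies mean latency by five. In the sample, the mean sits near 24 ms while p99 is 500 ms, which is why averages alone are a poor guide on the user path.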

Related chapter

Async processing and Event-Driven Architecture

A deeper discussion of queues, events, and delivery guarantees.

Asynchronous processing

Asynchrony helps when heavy work must leave the user path without becoming invisible. That is why queue design is less about throughput and more about delivery semantics: what happens when the same message shows up twice, arrives late, or is retried after partial success.

Message queues

Best when heavy work must be pulled away from the response path and short spikes need to be absorbed without overwhelming downstream services.

Producer → Queue → Consumer

Kafka

for high-volume streams and durable logs

RabbitMQ

for routing flexibility and priority handling

SQS

for a managed cloud path without running brokers yourself
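Whichever broker is used, a consumer has to tolerate seeing the same message more than once. The sketch below shows one hedged approach: deduplicate by message ID. `IdempotentConsumer` is an assumed name, and the in-memory set stands in for a durable store that would be updated atomically with the side effect in a real system.

```python
class IdempotentConsumer:
    """Skips messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._processed = set()  # stand-in for a durable deduplication table

    def handle(self, message_id: str, payload) -> bool:
        if message_id in self._processed:
            return False                     # duplicate delivery: do nothing
        self._handler(payload)               # may raise; then the ID stays unmarked
        self._processed.add(message_id)      # mark only after success
        return True


sent_emails = []
consumer = IdempotentConsumer(sent_emails.append)

consumer.handle("msg-1", "welcome alice")
consumer.handle("msg-1", "welcome alice")   # redelivered by the broker
consumer.handle("msg-2", "welcome bob")
```

Marking the ID only after the handler succeeds means a failed attempt can be retried, which matches at-least-once delivery.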

Event-driven interaction

Lets services react to state changes without hard synchronous coupling and makes it easier to scale consumers independently.

OrderCreated → InventoryService, PaymentService, NotificationService

Kafka

for durable event streams

Pulsar

for multi-tenant and geo-aware scenarios

NATS

for very fast pub/sub workflows

EventBridge

for a managed event bus in AWS

Pub/Sub

for managed cloud event flows

CQRS: split commands and reads

CQRS is helpful when the write path and read path differ radically in frequency, cost, and latency expectations. It stops one data model from trying to serve two different worlds at once.

Write model

Optimized for correctness, validation, and transactional changes.

Read model

Optimized for fast lookups, aggregation, and client-facing views.

Most useful when reads and writes stress the system in different ways.
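A compact sketch of the split, assuming an order domain. `OrderWriteModel`, `CustomerSpendReadModel`, and the `OrderPlaced` event are hypothetical names. In a real system the events would travel through a queue or log, so the read model would lag the write model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    total_cents: int


class OrderWriteModel:
    """Command side: validates and records changes, then emits events."""

    def __init__(self):
        self._orders = {}
        self.events = []  # stand-in for an outbox or event log

    def place_order(self, order_id: str, customer_id: str, total_cents: int):
        if order_id in self._orders:
            raise ValueError("duplicate order")
        if total_cents <= 0:
            raise ValueError("total must be positive")
        self._orders[order_id] = (customer_id, total_cents)
        event = OrderPlaced(order_id, customer_id, total_cents)
        self.events.append(event)
        return event


class CustomerSpendReadModel:
    """Query side: a denormalized view built only from events."""

    def __init__(self):
        self._spend = defaultdict(int)

    def apply(self, event: OrderPlaced) -> None:
        self._spend[event.customer_id] += event.total_cents

    def total_spend(self, customer_id: str) -> int:
        return self._spend.get(customer_id, 0)


writes = OrderWriteModel()
reads = CustomerSpendReadModel()
writes.place_order("o-1", "alice", 2500)
writes.place_order("o-2", "alice", 1000)
for event in writes.events:   # in production this hop is asynchronous
    reads.apply(event)
```

The read side answers "how much has this customer spent" with a single lookup, while validation and uniqueness checks stay on the write side.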

Related chapter

Resilience patterns

A deeper discussion of degradation, overload, and cascade containment.

Resilience patterns

Circuit breakers, bulkheads, retries, and idempotency are not decorative extras. They are the control mechanisms that limit blast radius once a dependency has already degraded and the system must survive overload without cascading failure.

Circuit breaker

Temporarily stops calling an unstable dependency and gives it a chance to recover without dragging down the primary request flow.

Closed → Open → Half-Open
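The three states can be sketched as a small state machine. This is a simplified illustration, not a production implementation: `CircuitBreaker`, `failure_threshold`, and `reset_timeout` are assumed names, the breaker counts consecutive failures only, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is cooling down")
            self.state = "half_open"         # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"          # trip, or re-trip after a failed trial
                self._opened_at = self._clock()
            raise
        self._failures = 0                   # success closes the circuit
        self.state = "closed"
        return result


def unstable_dependency():
    raise RuntimeError("dependency down")


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(unstable_dependency)
    except RuntimeError:
        pass
state_after_failures = breaker.state
```

While the breaker is open, callers fail fast and can serve a fallback instead of waiting on timeouts.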

Backpressure

Teaches the system to reject overload before queues become unbounded and everything starts timing out together.

bounded queues
admission control such as HTTP 429
degraded but predictable responses
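A bounded queue plus an explicit rejection is the smallest form of this idea. The sketch assumes an HTTP-style front end that answers 202 when a job is accepted and 429 when the queue is full. `AdmissionController` and `max_pending` are illustrative names.

```python
import queue


class AdmissionController:
    """Accepts work only while the pending queue has room."""

    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, job) -> int:
        try:
            self._pending.put_nowait(job)
        except queue.Full:
            return 429       # reject now instead of queueing without bound
        return 202           # accepted for later processing

    def take(self):
        """Called by a worker; raises queue.Empty when nothing is pending."""
        return self._pending.get_nowait()


controller = AdmissionController(max_pending=2)
statuses = [controller.submit(job) for job in ("a", "b", "c")]
```

The third job is rejected immediately, so the caller learns about the overload in milliseconds rather than through a timeout.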

Retries and idempotency

Retries are unavoidable, but they should never create double charges, duplicate orders, or other costly side effects.

Idempotency-Key: abc-123-def
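The header above only helps if the client reuses one key across all attempts and the server remembers the result for that key. The following sketch shows both halves under stated assumptions: `ChargeService`, `TransientError`, and `call_with_retries` are hypothetical names, the server keeps results in memory, and the backoff uses exponential delays with full jitter.

```python
import random
import time
import uuid


class TransientError(Exception):
    """A failure the client may safely retry."""


class ChargeService:
    """Server side: the same idempotency key always returns the same result."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # replay, no new side effect
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


def call_with_retries(operation, max_attempts: int = 4, base_delay: float = 0.1,
                      max_delay: float = 2.0, sleep=time.sleep, rng=random.random):
    """Client side: one key for every attempt, exponential backoff with jitter."""
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return operation(key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)


service = ChargeService()
keys_seen = []


def charge_with_lost_response(key):
    result = service.charge(key, 500)
    keys_seen.append(key)
    if len(keys_seen) == 1:
        raise TransientError("response lost after the charge succeeded")
    return result


outcome = call_with_retries(charge_with_lost_response, sleep=lambda seconds: None)
```

The first attempt charges the customer but the response is lost. The retry carries the same key, so the server replays the stored result instead of charging again.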

Bulkheads

Separate critical resources so one overloaded path cannot drain every thread, connection, and timeout budget from its neighbors.

The key idea is simple: isolate consequences instead of assuming dependencies will behave well.
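A bulkhead can be as small as a per-dependency concurrency limit. This sketch uses a semaphore and rejects when all slots are taken. `Bulkhead`, `BulkheadFullError`, and the limits are assumed for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls into one dependency so it cannot drain shared capacity."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments for a slow and a critical dependency (hypothetical split).
reports = Bulkhead(max_concurrent=1)
checkout = Bulkhead(max_concurrent=1)


def second_report_while_first_is_running():
    try:
        return reports.run(lambda: "second report")
    except BulkheadFullError:
        return "rejected"


report_outcome = reports.run(second_report_while_first_is_running)
checkout_outcome = reports.run(lambda: checkout.run(lambda: "order placed"))
```

The second report is rejected because the reports compartment is full, while checkout still runs because it has its own slots.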

Foundations

HTTP protocol

L7 balancing is built on top of HTTP semantics and routing rules.

Related chapter

Load balancing algorithms

A separate chapter on distribution algorithms and their behavior under different loads.

Load balancing

Algorithms

Round Robin - sequential distribution

Weighted RR - capacity-aware routing

Least Connections - route to the least busy instance

IP Hash - keeps sessions sticky
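The four algorithms are short enough to sketch side by side. These are simplified illustrations with assumed class and function names. Real balancers also track health checks and connection draining, and they usually interleave weighted picks more smoothly than the naive expansion used here.

```python
import hashlib
import itertools


class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    """Naive weighting: a backend with weight 2 appears twice per cycle."""

    def __init__(self, weights: dict):
        expanded = [b for b, weight in weights.items() for _ in range(weight)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)  # fewest open connections
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


def ip_hash(client_ip: str, backends):
    """The same client IP maps to the same backend while the list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


rr = RoundRobin(["a", "b", "c"])
rr_picks = [rr.pick() for _ in range(4)]

wrr = WeightedRoundRobin({"big": 2, "small": 1})
wrr_picks = [wrr.pick() for _ in range(6)]
```

Note that the IP Hash mapping changes for many clients whenever the backend list changes, so stickiness holds only while the pool is stable.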

Layers

L4 (Transport)

fast and simple at the TCP/UDP layer

L7 (Application)

gives routing by URL, headers, and other request attributes

DNS-based

used for geo-routing and regional failover

Pattern summary table

Pattern	What it buys	What it costs
Horizontal scaling	Fast elastic compute growth	More coordination across shared dependencies
Sharding	Growth in writes and data volume	Rebalancing, complex queries, and operational cost
Replication	More read scale and higher resilience	Lag, failover complexity, and stale reads after writes
Caching	Lower latency and lighter database load	Invalidation, staleness, and extra failure modes
Async processing	Higher throughput off the critical path	Delayed results, duplicates, and coordination cost
Circuit breakers and bulkheads	Better containment of cascading failure	More control logic and more edge states
CQRS	Independent optimization of reads and writes	Harder model evolution and synchronization

Key takeaways

Find the current growth lever first. Not every system needs sharding early, but almost every system benefits from replaceable compute.

Say the trade-offs out loud. CAP, PACELC, caching, and replication only help when their consequences for the product are explicit.

Scaling is always also about data. Growth quickly moves from services into write paths, freshness, and state management.

Async work reduces critical-path pressure, not responsibility. Queues and events help users, but they introduce duplicates, delays, and a new operating discipline.

Resilience must be designed up front. Circuit breakers, bounded queues, and resource isolation only help if the team knows how the system should degrade.

A good architecture explanation starts from the bottleneck. In interviews and real projects, the strongest answer is the one that shows which limit the next decision actually removes.

References

Martin Kleppmann, Designing Data-Intensive Applications (official book site)
Google, Site Reliability Engineering (free online book, O'Reilly)
Chris Richardson, The Scale Cube (microservices.io)

Related chapters

CAP theorem - gives the formal frame for discussing consistency, availability, and system behavior during network splits.
PACELC theorem - extends the CAP discussion into normal operation and makes the latency-versus-consistency choice explicit.
Replication and sharding - covers data topologies, rebalancing, and the operational cost that appears once simple growth stops being enough.
Caching strategies - goes deeper into cache patterns, invalidation, and the real cost of buying latency with extra layers.
Load balancing algorithms - shows how traffic distribution choices depend on workload shape, state, and failure behavior.
Event-Driven Architecture - expands the async part of the story: queues, events, delivery guarantees, and the cost of looser coupling.
Resilience patterns - adds the degradation and failure-containment layer that keeps scaling decisions usable under stress.
Why distributed systems and consistency matter - connects system growth with the limits of networking, coordination, and state management.