System Design Space
Updated: February 21, 2026 at 11:59 PM

Acing the System Design Interview (short summary)

Analysis of the book

Review: Acing the System Design Interview

Detailed analysis of the book by Alexander Polomodov on the Code of Architecture blog

Acing the System Design Interview

Author: Zhiyong Tan
Publisher: Manning Publications
Length: 472 pages

Analysis of Zhiyong Tan's book: design methodology, distributed transactions, and common services.

"Acing the System Design Interview" by Zhiyong Tan is a practical alternative to popular System Design materials. The author, an engineering manager at PayPal with experience at Uber and Teradata, offers a structured approach to preparation that I found even more useful than Alex Xu's book.

Key Difference

Unlike Alex Xu, who focuses on analyzing specific systems, Zhiyong Tan pays more attention to the design process and the interview structure. The first 6 chapters are a methodological framework, not a set of ready-made solutions.

About the author

Zhiyong Tan is an engineering manager at PayPal. Before that, he was a senior full-stack engineer at Uber, a data engineer at startups, and an engineer at Teradata.

This varied experience lets him look at system design and hiring across companies of different sizes and levels of maturity, from startups to enterprise giants.

Book structure

Part 1: Introduction to the System Design Interview (6 chapters)

A structured introduction to the design process: from basic concepts to distributed transactions. This is the methodological core of the book.

Part 2: Practical problems (11 cases)

Examples of System Design tasks with analysis. Classic and non-standard scenarios for practicing design skills.

Appendices

Additional materials that may come up during the interview:

Monolith vs Microservices
OAuth 2.0 & OIDC
C4 Model
2-Phase Commit

Detailed analysis of the first part

The first six chapters form the methodological basis of the book. Let's look at each of them in detail.

1

A Walkthrough of System Design Concepts

In the first chapter, the author introduces the basic concepts of System Design and explains the main idea: System Design is a discussion of the trade-offs that have to be made when designing a solution.

Topics covered:

Scaling services
GeoDNS and global distribution
Caching and CDN
Horizontal vs vertical scaling
ETL and Analytics
Bare metal vs Cloud vs FaaS
2

A Typical System Design Interview Flow

The second chapter is devoted to the structure of the interview and the important division of requirements:

Functional Requirements

What the system should do: features, use cases, user stories

Non-Functional Requirements

How the system should work: performance, scalability, reliability

Also covered: API specification, data modeling, and high-level architecture.

3

Non-Functional Requirements

The third chapter dives deep into non-functional requirements, the so-called "-ilities":

Scalability

Ability to grow with load

Availability

Uptime targets of 99.9% and above

Reliability

Correct operation

Maintainability

Easy to support

Performance

Latency and throughput

Security

Data protection
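
An availability target is easier to reason about when converted into a downtime budget. The sketch below is illustrative arithmetic, not taken from the book; the function name `allowed_downtime_minutes` and the 365-day default period are assumptions.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted per period at a given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The downtime budget is the fraction of the period the system may be unavailable.
    return total_minutes * (1.0 - availability_pct / 100.0)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target is worth clarifying early in an interview.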

4

Scaling Databases

Chapter four focuses on database scaling, one of the key topics in any System Design interview.

Key techniques:

Replication

Master-slave, multi-master replication

Sharding

Horizontal data partitioning

Event Aggregation

Event aggregation for analytics

Caching Strategies

Read-through, write-through, write-behind
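
To make two of the caching strategies above concrete, here is a minimal in-memory sketch of read-through reads combined with write-through writes. It is an illustration under simplifying assumptions: a plain dict stands in for the database, there is no eviction or TTL, and the class name `ReadThroughCache` is hypothetical.

```python
class ReadThroughCache:
    """Cache in front of a backing store (a dict standing in for a database)."""

    def __init__(self, store: dict):
        self.store = store
        self.cache = {}
        self.store_reads = 0  # counts how often a read had to hit the store

    def get(self, key):
        # Read-through: on a miss, load from the store and populate the cache.
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write-through: update the store first, then the cache, in the same call.
        self.store[key] = value
        self.cache[key] = value


db = {"user:1": "Ada"}
cache = ReadThroughCache(db)
cache.get("user:1")          # miss, loaded from the store
cache.get("user:1")          # hit, served from the cache
cache.put("user:2", "Lin")   # written to both store and cache
```

Write-behind would differ only in `put`: the cache is updated immediately and the store write is queued for later, trading durability for write latency.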

5

Distributed Transactions

The fifth chapter is one of the most valuable. It deals with the complex topic of distributed transactions, which is rarely covered well in other books.

Patterns considered:

Event-Driven Architecture

Asynchronous communication through events

Change Data Capture (CDC)

Capturing changes from the database

Saga Pattern

Compensating transactions

Transaction Supervisor

Coordination of distributed operations
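
The saga idea can be sketched in a few lines: run local steps in order and, if one fails, run the compensating actions of the already completed steps in reverse. This is a simplified, single-process illustration rather than the book's implementation; `run_saga` and the demo step names are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


# Demo: the shipping step fails, so the charge and the reservation are undone.
state = {"stock": 1, "charged": False}

def reserve(): state["stock"] -= 1
def release(): state["stock"] += 1
def charge(): state["charged"] = True
def refund(): state["charged"] = False
def ship(): raise RuntimeError("carrier unavailable")

ok, saga_log = run_saga([
    ("reserve", reserve, release),
    ("charge", charge, refund),
    ("ship", ship, lambda: None),
])
```

In a real distributed setting each action and compensation would be a call to a different service, and the log would need to be durable so that recovery can resume after a crash.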

6

Common Services

The sixth chapter examines common services that are found in almost every system:

Authentication

JWT, Sessions, OAuth

Error Handling

Retry, Circuit Breaker

Rate Limiting

Token Bucket, Leaky Bucket

Service Mesh

Istio, Linkerd, Sidecars

API Protocols

REST, RPC, GraphQL

Logging & Monitoring

Observability stack
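
As an illustration of the rate limiting entry above, here is a minimal token bucket. It is a single-process sketch with an explicit clock argument so the behavior is deterministic; the class name and parameters are assumptions, and a production limiter would also need shared state and thread safety.

```python
class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # third request exceeds the burst
after_refill = bucket.allow(now=1.0)               # one token has been refilled
```

The capacity bounds the burst size and the refill rate bounds the sustained rate, which are the two numbers worth stating explicitly in an interview.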

Part 2: Practical problems (chapters 7-17)

The book includes 11 practical cases. Below is the complete list in the same order as in the book.

7

Design URL Shortener

URL Shortener

API shape, short ID strategy, redirects, anti-abuse controls, and scaling.

Focus: Short links for sharing flows, fast redirects, and collision safety.

Product context: users and campaign tools create short URLs, while the dominant traffic pattern is low-latency redirects.

What to clarify in the interview

  • What is the read/write ratio and redirect SLA?
  • Do we need custom aliases, TTL, and deletion?
  • Do we need near-realtime click analytics?

Architecture focus

  • ID generation strategy (counter/snowflake/hash) + collision handling.
  • Hot link caching and edge/CDN acceleration.
  • Anti-abuse controls: rate limiting, blacklist, URL validation.

Typical risks

  • ID enumeration attacks.
  • Hot keys for viral links.
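
One common way to realize the counter-based ID strategy is to encode a numeric ID in base 62. The sketch below is an assumption-laden illustration (the alphabet order and function names are choices made here, not the book's).

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a short base-62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_base62(s: str) -> int:
    """Decode a base-62 string back to the integer ID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


short_id = encode_base62(123456789)
```

Sequential counters encoded this way are collision-free but easy to enumerate, which is exactly the ID enumeration risk listed above; randomizing or salting the ID space is one mitigation.
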
8

Design Key-Value Database

Key-Value Database

Sharding, replication, quorum choices, and failure recovery.

Focus: A core storage engine for high-volume reads/writes.

Product context: internal platform service used by multiple teams for simple, scalable key-value workloads.

What to clarify in the interview

  • What consistency guarantees are required?
  • What are value size limits and workload profile?
  • Do we need multi-region, backup/restore, and TTL?

Architecture focus

  • Replication + quorum reads/writes to balance latency and correctness.
  • Sharding and online rebalancing.
  • Storage engine choices (LSM/B-tree), compaction, write amplification trade-offs.

Typical risks

  • Hot partitions due to poor key design.
  • Slow recovery/re-sync after failures.
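
The quorum trade-off can be stated as a single inequality: with N replicas, a write quorum W, and a read quorum R, every read overlaps the latest acknowledged write when R + W > N. The helper names below are hypothetical, and versions are modeled as plain integers.

```python
from typing import List, Tuple


def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when any read quorum must intersect any write quorum."""
    return r + w > n


def read_latest(replies: List[Tuple[int, str]]) -> str:
    """Given (version, value) replies from R replicas, return the newest value."""
    version, value = max(replies, key=lambda reply: reply[0])
    return value


strong = quorum_overlaps(n=3, w=2, r=2)   # overlapping quorums
fast = quorum_overlaps(n=3, w=1, r=1)     # lower latency, may read stale data
value = read_latest([(1, "old"), (2, "new")])
```

Lowering R or W reduces latency at the cost of possibly stale reads, which is the "latency and correctness" balance named in the architecture focus.
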
9

Design Distributed Message Queue

Distributed Message Queue

Partitioned log, delivery semantics, retry/DLQ, and lag control.

Focus: Reliable async backbone for services and background jobs.

Product context: decoupling service interactions, absorbing bursts, and preserving delivery guarantees.

What to clarify in the interview

  • Which semantics are required: at-most-once / at-least-once / effectively-once?
  • Is ordering global or per partition?
  • What are latency and retention targets?

Architecture focus

  • Partitioning + consumer groups for scale.
  • Retry policy, DLQ, and idempotent consumers.
  • Backpressure and flow control under spikes.

Typical risks

  • Poison messages breaking consumers.
  • Growing consumer lag under uneven load.
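
The retry, DLQ, and idempotent-consumer items can be combined in one small sketch. It is an in-memory illustration under at-least-once delivery; `consume`, the message shape, and the attempt limit are assumptions, and a real consumer would record handled IDs durably and atomically with the side effect.

```python
from typing import Callable, Iterable, List, Set, Tuple

Message = Tuple[int, str]  # (message id, payload)


def consume(messages: Iterable[Message],
            handler: Callable[[str], None],
            max_attempts: int = 3) -> List[Message]:
    """Process messages once per ID; return the messages sent to the dead-letter queue."""
    handled_ids: Set[int] = set()
    dead_letters: List[Message] = []
    for msg_id, payload in messages:
        if msg_id in handled_ids:
            continue  # duplicate delivery: skip, the effect was already applied
        for attempt in range(1, max_attempts + 1):
            try:
                handler(payload)
                break
            except Exception:
                if attempt == max_attempts:
                    # Poison message: park it so it cannot block the partition.
                    dead_letters.append((msg_id, payload))
        handled_ids.add(msg_id)
    return dead_letters


processed: List[str] = []

def demo_handler(payload: str) -> None:
    if payload == "poison":
        raise ValueError("cannot parse payload")
    processed.append(payload)

dead = consume([(1, "a"), (2, "poison"), (1, "a"), (3, "b")], demo_handler)
```

The duplicate of message 1 is ignored and the poison message ends up in the dead-letter list instead of stalling the messages behind it.
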
10

Design Social Media App

Twitter/X

Social feed design: fanout strategy, caching, and high-load trade-offs.

Focus: Consumer social product with feed and interactions.

Product context: content publishing, following graph, personalized timeline, and viral traffic behavior.

What to clarify in the interview

  • Which actions are core: post, follow, like, comment?
  • What are DAU and p95 feed-open targets?
  • Do we need ranking/personalization in v1?

Architecture focus

  • Feed strategy: fanout-on-write vs fanout-on-read vs hybrid.
  • Multi-layer caching for timeline/media/metadata.
  • Async pipelines for media processing and counters.

Typical risks

  • Celebrity fanout explosion.
  • Cache inconsistency vs source of truth.
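
A hybrid fanout strategy can be sketched as follows: posts from ordinary authors are pushed into follower timelines at write time, while posts from authors above a follower threshold are pulled at read time. This is a toy in-memory model; the class `Feed`, the threshold value, and the logical clock are assumptions made for illustration.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff; real systems use far larger values


class Feed:
    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.timelines = defaultdict(list)  # user -> entries pushed at write time
        self.posts = defaultdict(list)      # author -> all of their entries
        self.clock = 0                      # logical timestamp for ordering

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self.clock += 1
        entry = (self.clock, author, text)
        self.posts[author].append(entry)
        if not self.is_celebrity(author):
            # Fanout-on-write: cheap because the follower list is small.
            for user in self.followers[author]:
                self.timelines[user].append(entry)

    def timeline(self, user: str) -> list:
        # Fanout-on-read for celebrities; a set removes entries seen via both paths.
        entries = set(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.posts[author])
        return [text for _, _, text in sorted(entries, reverse=True)]


feed = Feed()
feed.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    feed.follow(fan, "star")
feed.post("bob", "b1")    # pushed to alice's timeline
feed.post("star", "s1")   # not pushed: star is above the threshold
alice_view = feed.timeline("alice")
```

The celebrity post costs nothing at write time and is merged only when a follower opens the feed, which is how the hybrid approach avoids the fanout explosion listed above.
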
11

Design Ad Click Event Aggregator

Ad Click Event Aggregator

Streaming aggregation pipeline with freshness and billing accuracy constraints.

Focus: Analytics pipeline for ad clicks and reporting.

Product context: near-realtime event aggregation with strict quality requirements for reporting and billing.

What to clarify in the interview

  • Do we target realtime dashboards or hourly batch?
  • What accuracy is required for billing use-cases?
  • How do we handle out-of-order and late events?

Architecture focus

  • Event ingestion + idempotent deduplication.
  • Windowed aggregates + watermark strategy.
  • Realtime + historical recomputation compatibility.

Typical risks

  • Double counting from retries.
  • Drift between online and offline numbers.
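
Deduplication and windowed aggregation can be illustrated with a batch-style sketch that buckets clicks into tumbling windows by event time. It assumes every event carries a unique `event_id` and an integer timestamp in seconds; watermarks and late-event cutoffs are deliberately not modeled.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, int]  # (event_id, ad_id, event time in seconds)


def aggregate_clicks(events: Iterable[Event], window_sec: int = 60) -> Dict[Tuple[int, str], int]:
    """Count clicks per (window start, ad), ignoring repeated event IDs."""
    seen = set()
    counts: Counter = Counter()
    for event_id, ad_id, ts in events:
        if event_id in seen:
            continue  # a retry delivered the same click again
        seen.add(event_id)
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, ad_id)] += 1
    return dict(counts)


clicks = [
    ("e1", "ad1", 5),
    ("e3", "ad1", 60),
    ("e2", "ad1", 59),   # arrives out of order, still lands in the first window
    ("e1", "ad1", 5),    # duplicate from a retry
    ("e4", "ad2", 61),
]
totals = aggregate_clicks(clicks)
```

Because bucketing uses event time rather than arrival time, out-of-order events are counted in the correct window; a streaming version additionally needs a watermark to decide when a window may be closed.
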
12

Design Object Storage Service

Object Storage

Object storage architecture, metadata/data split, and durability mechanisms.

Focus: Durable large-object storage for media and backups.

Product context: cost-efficient, high-durability storage with simple API and lifecycle controls.

What to clarify in the interview

  • What durability/availability targets are needed?
  • What object size distribution and R/W profile do we expect?
  • Do we need versioning, lifecycle, and storage tiering?

Architecture focus

  • Metadata/data separation with independent scaling.
  • Erasure coding/replication and background repair.
  • Multipart upload, checksum validation, pre-signed URLs.

Typical risks

  • Metadata bottleneck at namespace scale.
  • High cross-region replication cost.
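
Multipart upload with checksum validation can be sketched using only the standard library. The part size, function names, and the choice of SHA-256 are assumptions for illustration; real object stores define their own part limits and checksum algorithms.

```python
import hashlib
from typing import List, Tuple

Part = Tuple[bytes, str]  # (chunk, hex SHA-256 of the chunk)


def split_into_parts(data: bytes, part_size: int) -> List[Part]:
    """Split an object into fixed-size parts, each with its own checksum."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    for offset in range(0, len(data), part_size):
        chunk = data[offset:offset + part_size]
        parts.append((chunk, hashlib.sha256(chunk).hexdigest()))
    return parts


def assemble(parts: List[Part]) -> bytes:
    """Verify every part against its checksum, then join the parts in order."""
    for index, (chunk, expected) in enumerate(parts):
        if hashlib.sha256(chunk).hexdigest() != expected:
            raise ValueError(f"checksum mismatch in part {index}")
    return b"".join(chunk for chunk, _ in parts)


payload = b"hello object storage"
uploaded = split_into_parts(payload, part_size=8)
restored = assemble(uploaded)
```

Per-part checksums let a client retry only the corrupted part instead of the whole object, and the same check can drive background repair.
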
13

Design Online Payment App

Payment System

Idempotency, auth/capture/refund flow, payment orchestration, and reconciliation.

Focus: Payment processing with strict correctness guarantees.

Product context: money movement where correctness and auditability are more important than raw latency.

What to clarify in the interview

  • Which payment flows are required: auth/capture/refund/chargeback?
  • What compliance and audit constraints apply?
  • How are duplicate requests and partial failures handled?

Architecture focus

  • Idempotency keys + transaction state machine.
  • Double-entry ledger as source of truth.
  • PSP reconciliation and compensating actions.

Typical risks

  • Duplicate charges from retries.
  • Ledger vs PSP mismatch.
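
The idempotency-key idea is small enough to show directly: the first request with a given key performs the charge, and any retry with the same key gets the stored result back. This is an in-memory sketch; `PaymentService` and its fields are hypothetical, and a real service would persist the key and the charge in one transaction.

```python
from typing import Dict, List, Tuple


class PaymentService:
    def __init__(self):
        self.results: Dict[str, dict] = {}             # idempotency key -> stored result
        self.charges: List[Tuple[int, str, int]] = []  # (charge id, account, amount in cents)

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A retry with a known key returns the original result without charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        charge_id = len(self.charges) + 1
        self.charges.append((charge_id, account, amount_cents))
        result = {"charge_id": charge_id, "status": "captured"}
        self.results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42", "acct-1", 1999)
retry = service.charge("order-42", "acct-1", 1999)   # e.g. the client timed out and retried
```

Amounts are kept in integer cents to avoid floating-point rounding in money calculations.
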
14

Design Social Media App (Infrastructure View)

Social Media Infrastructure View

SLO-driven social platform operations: degradation, isolation, and observability.

Focus: Same domain, but from platform/operations perspective.

Product context: moving from feature-level design to SLO-driven runtime architecture and operability.

What to clarify in the interview

  • What SLO/error budget applies to key user journeys?
  • Where do we need autoscaling and graceful degradation?
  • How do we limit blast radius across services?

Architecture focus

  • Service boundaries, API contracts, and versioning.
  • Observability baseline: logs/metrics/traces + alert routing.
  • Deployment topology: multi-AZ rollout and rollback.

Typical risks

  • Cascading failures without bulkheads.
  • Opaque incidents without end-to-end tracing.
15

Design Room Reservation and Marketplace App

Airbnb

Marketplace search, availability calendar, and contention on booking slots.

Focus: Reservation marketplace with high contention on inventory.

Product context: search + atomic booking under race conditions and strict user trust expectations.

What to clarify in the interview

  • What are hold/booking/cancel rules and confirmation SLA?
  • How much concurrent contention per slot is expected?
  • How should search/ranking/filtering work?

Architecture focus

  • Inventory model + optimistic/pessimistic locking.
  • Reservation workflow with hold timeout.
  • Search index with eventual consistency vs transactional store.

Typical risks

  • Overbooking under contention.
  • Poor UX from slow confirmations.
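
Optimistic locking for a booking slot can be sketched as a compare-and-set on a version number: a booking succeeds only if the slot is unchanged since it was read. The `SlotStore` class is a hypothetical in-memory stand-in for a database row with a version column.

```python
from typing import Dict, Optional, Tuple


class SlotStore:
    def __init__(self):
        self.rows: Dict[str, dict] = {}

    def add_slot(self, slot_id: str) -> None:
        self.rows[slot_id] = {"version": 0, "booked_by": None}

    def read(self, slot_id: str) -> Tuple[int, Optional[str]]:
        row = self.rows[slot_id]
        return row["version"], row["booked_by"]

    def book_if_unchanged(self, slot_id: str, expected_version: int, guest: str) -> bool:
        """Book only if nobody modified the slot since `expected_version` was read."""
        row = self.rows[slot_id]
        if row["version"] != expected_version or row["booked_by"] is not None:
            return False  # lost the race: the caller should re-read and retry or give up
        row["booked_by"] = guest
        row["version"] += 1
        return True


store = SlotStore()
store.add_slot("room-7:2026-03-01")
version_a, _ = store.read("room-7:2026-03-01")   # guest A reads the slot
version_b, _ = store.read("room-7:2026-03-01")   # guest B reads the same version
a_ok = store.book_if_unchanged("room-7:2026-03-01", version_a, "guest-a")
b_ok = store.book_if_unchanged("room-7:2026-03-01", version_b, "guest-b")
```

In a relational database the same check is typically a conditional `UPDATE ... WHERE version = ?` whose affected-row count tells the caller whether it won; pessimistic locking would instead hold a lock from read to write.
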
16

Design Access Control and Authorization for Media App

Access Control for Media App

RBAC/ABAC/ReBAC, PDP/PEP split, auditability, and safe cache invalidation.

Focus: Authorization layer for media platform workflows.

Product context: unified policy model for web/mobile/admin with least-privilege defaults.

What to clarify in the interview

  • What initial roles/resources are business-critical?
  • Do we need RBAC only, or RBAC + ABAC/ReBAC?
  • Do we need explain/audit APIs for investigations?

Architecture focus

  • Policy decision point + policy enforcement point.
  • Decision caching with correct invalidation.
  • Tenant isolation and immutable audit trail.

Typical risks

  • Privilege escalation from policy gaps.
  • Stale authorization cache after policy changes.
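
The PDP/PEP split can be illustrated with a tiny RBAC example: one function decides, another enforces. The roles, resources, and function names below are invented for the sketch, and it covers RBAC only, with deny as the default.

```python
from typing import Dict, Iterable, Set, Tuple

Permission = Tuple[str, str]  # (resource, action)

# Hypothetical role-to-permission mapping for a media app.
ROLE_PERMISSIONS: Dict[str, Set[Permission]] = {
    "viewer": {("video", "read")},
    "editor": {("video", "read"), ("video", "update")},
}


def decide(roles: Iterable[str], resource: str, action: str) -> bool:
    """Policy decision point: allow only if some role grants the permission."""
    return any((resource, action) in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def enforce(roles: Iterable[str], resource: str, action: str) -> None:
    """Policy enforcement point: called by the service before doing the work."""
    if not decide(roles, resource, action):
        raise PermissionError(f"{action} on {resource} denied")


viewer_can_read = decide(["viewer"], "video", "read")
viewer_can_update = decide(["viewer"], "video", "update")
unknown_role = decide(["guest"], "video", "read")   # unknown roles get nothing
```

Keeping the decision in one place makes it possible to cache, audit, and explain decisions without touching every service that enforces them.
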
17

Design Top Products Dashboard

Top Products Dashboard

Analytics serving layer with pre-aggregations, metric consistency, and freshness controls.

Focus: Analytical dashboard for product and operations teams.

Product context: decision-making UI where metric explainability and freshness transparency are essential.

What to clarify in the interview

  • What refresh frequency and latency are acceptable?
  • Which dimensions and filters are required?
  • Do we need drill-down to raw events?

Architecture focus

  • Dedicated serving layer for dashboard workloads.
  • Pre-aggregations/materialized views for p95 latency.
  • Combination of realtime stream + scheduled backfill.

Typical risks

  • Heavy ad-hoc queries impacting OLTP.
  • Metric mismatch across dashboards.
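
Pre-aggregation for a top-products view can be sketched as a per-day counter that is built once and then queried cheaply. The data shape and function names are assumptions; a real system would maintain this in a materialized view or an OLAP store rather than in memory.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Sale = Tuple[str, str, int]  # (day, product, units sold)


def build_preaggregate(sales: Iterable[Sale]) -> Dict[str, Counter]:
    """Roll raw sales up into units per product per day."""
    per_day: Dict[str, Counter] = defaultdict(Counter)
    for day, product, units in sales:
        per_day[day][product] += units
    return per_day


def top_products(preagg: Dict[str, Counter], day: str, k: int) -> List[Tuple[str, int]]:
    """Serve the dashboard query from the rollup instead of scanning raw events."""
    return preagg.get(day, Counter()).most_common(k)


raw_sales = [
    ("2026-02-20", "lamp", 3),
    ("2026-02-20", "desk", 1),
    ("2026-02-20", "lamp", 2),
    ("2026-02-20", "chair", 4),
    ("2026-02-21", "desk", 7),
]
rollup = build_preaggregate(raw_sales)
best = top_products(rollup, "2026-02-20", k=2)
```

Serving every dashboard from the same rollup is also one way to reduce the metric mismatch risk, since all views share one definition of the number.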

Conclusions and recommendations

Who is this book suitable for?

Those who want to understand the "why": the book provides a methodological basis, not just a set of ready-made solutions.

Experienced engineers: the deep analysis of distributed transactions and common services will be especially useful.

Those preparing for the long term: the book builds a systematic understanding of architecture rather than superficial knowledge.

Recommendation: Use this book as a primary source for understanding the process and methodology, and Alex Xu's book as a reference for specific systems.

© 2026 Alexander Polomodov