Acing the System Design Interview (short summary)

Zhiyong Tan’s book matters not because it offers another universal answer template, but because it teaches you to treat a system design problem as an engineering situation with context, clarifying questions, and deliberate depth choices. This chapter focuses on that more methodical side of the material.

In real engineering work, the book's method is valuable because it helps you break a system into practical layers (interfaces, data, async flows, failure paths, security, and operations) and then decide which parts deserve deeper analysis.

For interview prep, the value of this chapter is that it shows how to turn a generic diagram into a structured engineering walkthrough: constraints first, architecture frame next, critical deep dives after that, and only then trade-offs and system evolution.

Practical value of this chapter

Problem Decomposition

Helps split a system problem into practical layers: API, data, async flow, and failure handling.

Depth Control

Teaches where to go deep based on interviewer signals and time constraints.

Risk-First Reasoning

Makes failure points and operational risks explicit before finalizing architecture.

Decision Communication

Improves answer clarity around assumptions, constraints, choices, and evolution path.

Analysis of the book

Review: Acing the System Design Interview

Detailed analysis of the book by Alexander Polomodov on the Code of Architecture blog

Acing the System Design Interview

Author: Zhiyong Tan
Publisher: Manning Publications
Length: 472 pages

Analysis of Zhiyong Tan's book: interview structure, design methodology, practical cases, and common platform services.

The easy trap in a system design interview is to fall back on memorized patterns and a diagram drawn from habit. “Acing the System Design Interview” by Zhiyong Tan is valuable because it pulls the conversation back to engineering: start from context and constraints, then the architecture skeleton, and only then the nodes that genuinely deserve depth. Knowing where to go deep and where to stop is its own skill, and the book trains exactly that.

Key Difference

Alex Xu moves faster into concrete systems; Zhiyong Tan spends more time on the design process and interview rhythm. The first six chapters are a methodology layer, not a catalog of ready-made answers — they address how you reach an answer, not what the answer looks like.

About the author

Zhiyong Tan is an engineering manager at PayPal. Before that, he worked across Uber, startups, and Teradata in roles spanning application engineering, platform work, and data systems.

That mix of experience shows up in the book. Drawing the architecture is never enough for him — he keeps coming back to why it is shaped that way, how it behaves in production, and how to hold it together once the interviewer starts pressing on the trade-offs.

Book structure

Part 1: Methodology foundation (6 chapters)

The design process step by step: from first principles and requirements to distributed transactions and common services. This is the methodological core of the book — the answer format here is the one you later carry into every practical case.

Part 2: Practical cases (11 chapters)

Classic and less obvious cases where what gets practiced is the answer structure, depth selection, and an honest conversation about risk — not memorizing one specific diagram.

Appendices

Supporting topics worth revisiting before interviews:

Monolith vs Microservices

OAuth 2.0 & OIDC

C4 Model

2-Phase Commit

Part 1 breakdown

The first six chapters establish the method behind the book. Requirements, scalability, availability, and reliability are not treated as separate topics here — the author folds them into one conversation about holding the interview structure when time is short and requirements keep shifting.

1. Overview of Core System Design Concepts

The first chapter introduces the core language of system design and sets the tone for the whole book: a strong answer is really a conversation about trade-offs, not a performance of memorized components.

Topics covered:

Scaling services

GeoDNS and global distribution

Caching and CDN

Horizontal vs vertical scaling

ETL and analytics pipelines

Bare metal vs Cloud vs FaaS

2. Typical System Design Interview Flow

The second chapter gives the answer rhythm Tan wants you to internalize: clarify the problem, set the boundaries, sketch the design, and only then decide where deeper analysis is justified.

Functional Requirements

What the system should do: features, use cases, user stories

Non-Functional Requirements

How the system should work: performance, scalability, reliability

It also highlights three recurring anchors of a strong answer: API shape, data model, and high-level architecture.

3. Non-Functional Requirements

Non-functional requirements collapse easily into a stock “99.99%” mantra. The third chapter takes them apart in detail and shows which system properties actually drive the architecture and which you can name and defer.

Scalability

Ability to grow with load

Availability

Uptime targets such as 99.9%+

Reliability

Correct operation

Maintainability

Easy to support

Performance

Latency and throughput

Security

Data protection
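
To make an availability target less of a mantra, it can help to translate it into a downtime budget. The sketch below is only an illustration: the function name and the 30-day period are assumptions chosen here, not figures taken from the book.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime budget in minutes for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over a 30-day period.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is usually the point where the conversation shifts to redundancy, failover, and deployment practices.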

4. Scaling Databases

Chapter four covers database scaling, one of the recurring pressure points in system design interviews, and explains when replication, sharding, and caching actually change the design.

Key techniques:

Replication

Primary-replica and multi-leader approaches

Sharding

Horizontal data partitioning

Event Aggregation

Analytical pipelines and aggregated views

Caching Strategies

Cache-aside, read-through, write-through, write-back
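
Of the caching strategies listed, cache-aside is the simplest to show in a few lines. The following is a minimal in-memory sketch, assuming a plain dictionary stands in for both the cache and the backing store; the class and method names are hypothetical and not taken from the book.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside: the caller checks the cache first and loads from the store on a miss."""

    def __init__(self, load_from_store: Callable[[str], Optional[str]]) -> None:
        self._cache: Dict[str, str] = {}
        self._load = load_from_store
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._cache[key] = value  # populate the cache for later reads
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry, for example after the store has been updated."""
        self._cache.pop(key, None)


store = {"user:1": "Ada"}        # stand-in for a database
cache = CacheAside(store.get)
cache.get("user:1")              # miss: loaded from the store
cache.get("user:1")              # hit: served from the cache
```

The sketch also shows the main trade-off: until `invalidate` is called, a read can return a stale value after the store changes.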

5. Distributed Transactions

The fifth chapter is one of the strongest parts of the book. It explains distributed transactions as a practical coordination problem rather than a purely academic one.

Patterns considered:

Event-Driven Architecture

Asynchronous communication through events

Change Data Capture (CDC)

Capturing changes from the database

Saga Pattern

Compensating steps across services

Transaction Supervisor

Explicit coordination of distributed work
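
The saga idea of compensating steps can be sketched as a small orchestrator. This is an illustrative toy, assuming each step is a local function with a matching undo function; the step names and the `run_saga` helper are hypothetical, and a real saga would also persist its progress.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            break
    return log


def decline() -> None:
    raise RuntimeError("payment declined")


def noop() -> None:
    pass


log = run_saga([
    ("reserve_inventory", noop, noop),
    ("charge_payment", decline, noop),
    ("ship_order", noop, noop),
])
print(log)
```

Here the failed payment step triggers the compensation for the inventory reservation, and the shipping step never runs.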

6. Common Services

The sixth chapter covers the common services that show up in almost every system and ties them back to the interview conversation instead of treating them as unrelated side topics.

Authentication

JWT, Sessions, OAuth

Error Handling

Retries, timeouts, circuit breakers

Rate Limiting

Token Bucket, Leaky Bucket

Service Mesh

Istio, Linkerd, Sidecars

API Protocols

REST, RPC, GraphQL

Logging & Monitoring

Observability and incident analysis
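
As an illustration of the rate limiting entry, here is a minimal token bucket sketch. It assumes the caller passes the current time explicitly so the behavior is easy to reason about; the class name and parameters are hypothetical, and the leaky bucket variant is not shown.

```python
class TokenBucket:
    """Token bucket: tokens refill at a fixed rate up to a capacity; each request spends one."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity      # start with a full bucket
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_sec=1)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

The capacity controls how large a burst is tolerated, while the refill rate sets the sustained request rate.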

Part 2: Practical cases

The book includes 11 practical cases. Below they appear in the same order as in the book, with emphasis on what is most useful to train for interviews.

Design URL Shortener

URL Shortener

API shape, short ID strategy, redirects, anti-abuse controls, and scaling.

Focus: Short links for sharing flows, fast redirects, and collision safety.

Product context: users and campaign tools create short URLs, while the dominant traffic pattern is low-latency redirects.

What to clarify in the interview

• What is the read/write ratio and redirect SLA?
• Do we need custom aliases, TTL, and deletion?
• Do we need near-realtime click analytics?

Architecture focus

• ID generation strategy (counter/snowflake/hash) + collision handling.
• Hot link caching and edge/CDN acceleration.
• Anti-abuse controls: rate limiting, blacklist, URL validation.

Typical risks

• ID enumeration attacks.
• Hot keys for viral links.
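
One way to picture the counter-based ID strategy is to encode a numeric counter in base 62. The sketch below is an assumption-laden illustration: the alphabet order and function names are choices made here, not the book's design.

```python
import string

# 62 symbols: digits, lowercase, uppercase (the order is an arbitrary choice for this sketch).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Turn a non-negative counter value into a short ID."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(short_id: str) -> int:
    """Recover the counter value from a short ID."""
    n = 0
    for ch in short_id:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_base62(125), decode_base62(encode_base62(125)))
```

A counter never collides, and seven base-62 characters cover about 3.5 trillion IDs, but sequential IDs are easy to enumerate, which is exactly the risk listed above. Hash-based or randomized IDs trade that risk for collision handling.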

Design Key-Value Database

Key-Value Database

Sharding, replication, quorum choices, and failure recovery.

Focus: A core storage engine for high-volume reads/writes.

Product context: internal platform service used by multiple teams for simple, scalable key-value workloads.

What to clarify in the interview

• What consistency guarantees are required?
• What are value size limits and workload profile?
• Do we need multi-region, backup/restore, and TTL?

Architecture focus

• Replication + quorum reads/writes to balance latency and correctness.
• Sharding and online rebalancing.
• Storage engine choices (LSM/B-tree), compaction, write amplification trade-offs.

Typical risks

• Hot partitions due to poor key design.
• Slow recovery/re-sync after failures.
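
The quorum trade-off comes down to simple arithmetic: with N replicas, a write quorum W and a read quorum R are guaranteed to share at least one replica when R + W > N. The helper names below are hypothetical; this is a sketch of the rule, not of a storage engine.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("quorum sizes must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """Number of replicas that can be down while the quorum is still reachable."""
    return n - quorum


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5)]:
    print(n, w, r, quorums_overlap(n, w, r), tolerated_failures(n, w), tolerated_failures(n, r))
```

Lowering W or R reduces latency for that operation, at the cost of either losing the overlap or tolerating fewer failures on the other side.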

Design Distributed Message Queue

Distributed Message Queue

Partitioned log, delivery semantics, retry/DLQ, and lag control.

Focus: Reliable async backbone for services and background jobs.

Product context: decoupling service interactions, absorbing bursts, and preserving delivery guarantees.

What to clarify in the interview

• Which semantics are required: at-most-once / at-least-once / effectively-once?
• Is ordering global or per partition?
• What are latency and retention targets?

Architecture focus

• Partitioning + consumer groups for scale.
• Retry policy, DLQ, and idempotent consumers.
• Backpressure and flow control under spikes.

Typical risks

• Poison messages breaking consumers.
• Growing consumer lag under uneven load.
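
The combination of retries, a dead-letter queue, and idempotent consumers can be sketched in a few lines. This toy runs in memory and assumes every message carries a unique `id` field; the `consume` function and the message shape are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def consume(
    messages: List[Dict[str, str]],
    handler: Callable[[Dict[str, str]], None],
    max_attempts: int = 3,
) -> Tuple[Set[str], List[Dict[str, str]]]:
    """Process messages at-least-once, skipping duplicates and parking poison messages."""
    processed_ids: Set[str] = set()
    dead_letters: List[Dict[str, str]] = []
    for msg in messages:
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: already handled, so skip it
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed_ids.add(msg["id"])
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letters.append(msg)  # give up and move on to the next message
    return processed_ids, dead_letters


handled: List[str] = []


def handler(msg: Dict[str, str]) -> None:
    if msg["body"] == "poison":
        raise ValueError("cannot parse message")
    handled.append(msg["id"])


batch = [
    {"id": "a", "body": "ok"},
    {"id": "a", "body": "ok"},      # redelivered duplicate
    {"id": "b", "body": "poison"},  # always fails
    {"id": "c", "body": "ok"},
]
processed, dead = consume(batch, handler)
print(sorted(processed), [m["id"] for m in dead])
```

The poison message ends up in the dead-letter list instead of blocking message "c", and the duplicate of "a" is handled only once.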

Design Social Media App

Twitter/X

Social feed design: fanout strategy, caching, and high-load trade-offs.

Focus: Consumer social product with feed and interactions.

Product context: content publishing, following graph, personalized timeline, and viral traffic behavior.

What to clarify in the interview

• Which actions are core: post, follow, like, comment?
• What are DAU and p95 feed-open targets?
• Do we need ranking/personalization in v1?

Architecture focus

• Feed strategy: fanout-on-write vs fanout-on-read vs hybrid.
• Multi-layer caching for timeline/media/metadata.
• Async pipelines for media processing and counters.

Typical risks

• Celebrity fanout explosion.
• Cache inconsistency vs source of truth.
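
A hybrid feed strategy can be sketched as: push posts into follower timelines for ordinary authors, and pull posts at read time for authors with very large audiences. Everything below is an in-memory toy; the `Feed` class and the tiny `CELEBRITY_THRESHOLD` are assumptions made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff, kept tiny for the demo

Entry = Tuple[int, str, str]  # (sequence number, author, text)


class Feed:
    def __init__(self) -> None:
        self.followers: DefaultDict[str, Set[str]] = defaultdict(set)
        self.following: DefaultDict[str, Set[str]] = defaultdict(set)
        self.timelines: DefaultDict[str, List[Entry]] = defaultdict(list)
        self.posts_by_author: DefaultDict[str, List[Entry]] = defaultdict(list)
        self._seq = 0

    def follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        entry = (self._seq, author, text)
        self.posts_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Fanout-on-write: copy the post into each follower's timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append(entry)

    def read_timeline(self, user: str) -> List[str]:
        entries = set(self.timelines[user])
        # Fanout-on-read: merge in posts from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.posts_by_author[author])
        return [text for _, _, text in sorted(entries, reverse=True)]


feed = Feed()
for fan in ("alice", "bob", "carol"):
    feed.follow(fan, "star")
feed.follow("alice", "dave")
feed.post("dave", "hi from dave")
feed.post("star", "launch day")
print(feed.read_timeline("alice"))
```

The celebrity post is stored once and merged on read, so a single post does not trigger writes to every follower's timeline.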

Design Ad Click Event Aggregator

Ad Click Event Aggregator

Streaming aggregation pipeline with freshness and billing accuracy constraints.

Focus: Analytics pipeline for ad clicks and reporting.

Product context: near-realtime event aggregation with strict quality requirements for reporting and billing.

What to clarify in the interview

• Do we target realtime dashboards or hourly batch?
• What accuracy is required for billing use-cases?
• How do we handle out-of-order and late events?

Architecture focus

• Event ingestion + idempotent deduplication.
• Windowed aggregates + watermark strategy.
• Realtime + historical recomputation compatibility.

Typical risks

• Double counting from retries.
• Drift between online and offline numbers.
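
Deduplication and windowed aggregation can be illustrated with a small batch function. The sketch assumes each event has `event_id`, `ad_id`, and an integer event-time `ts` in seconds; these field names and the 60-second window are hypothetical, and watermark handling is deliberately left out.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

WINDOW_SECONDS = 60  # tumbling window size, chosen for the example


def aggregate_clicks(events: Iterable[Dict]) -> Dict[Tuple[str, int], int]:
    """Count clicks per (ad_id, window_start), ignoring repeated event IDs."""
    seen: Set[str] = set()
    counts: Counter = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue  # a retry delivered the same click again
        seen.add(event["event_id"])
        # Bucket by event time, so out-of-order events still land in the right window.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        counts[(event["ad_id"], window_start)] += 1
    return dict(counts)


events = [
    {"event_id": "e1", "ad_id": "ad1", "ts": 5},
    {"event_id": "e2", "ad_id": "ad1", "ts": 65},
    {"event_id": "e1", "ad_id": "ad1", "ts": 5},   # duplicate from a retry
    {"event_id": "e3", "ad_id": "ad1", "ts": 30},  # arrives out of order
]
totals = aggregate_clicks(events)
print(totals)
```

In a streaming system the open question is when a window may be closed and billed, which is what the watermark strategy decides.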

Design Object Storage Service

Object Storage

Object storage architecture, metadata/data split, and durability mechanisms.

Focus: Durable large-object storage for media and backups.

Product context: cost-efficient, high-durability storage with simple API and lifecycle controls.

What to clarify in the interview

• What durability/availability targets are needed?
• What object size distribution and R/W profile do we expect?
• Do we need versioning, lifecycle, and storage tiering?

Architecture focus

• Metadata/data separation with independent scaling.
• Erasure coding/replication and background repair.
• Multipart upload, checksum validation, pre-signed URLs.

Typical risks

• Metadata bottleneck at namespace scale.
• High cross-region replication cost.
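
The cost side of the replication versus erasure coding choice is easy to estimate. The functions below compute raw bytes stored per logical byte; the names and the 6+3 layout are illustrative assumptions, not a recommendation from the book.

```python
def storage_overhead_replication(copies: int) -> float:
    """Raw bytes stored per logical byte when keeping full copies."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return float(copies)


def storage_overhead_erasure(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte with k data shards plus m parity shards."""
    if data_shards < 1 or parity_shards < 0:
        raise ValueError("invalid shard counts")
    return (data_shards + parity_shards) / data_shards


print("3x replication:", storage_overhead_replication(3))
print("6+3 erasure coding:", storage_overhead_erasure(6, 3))
```

With a Reed-Solomon style code, a 6+3 layout survives the loss of any three shards at 1.5x storage, while three full copies survive the loss of two at 3x. The price is more expensive reads and repairs, which is why background repair appears in the architecture focus.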

Design Online Payment App

Payment System

Idempotency, auth/capture/refund flow, payment orchestration, and reconciliation.

Focus: Payment processing with strict correctness guarantees.

Product context: money movement where correctness and auditability are more important than raw latency.

What to clarify in the interview

• Which payment flows are required: auth/capture/refund/chargeback?
• What compliance and audit constraints apply?
• How are duplicate requests and partial failures handled?

Architecture focus

• Idempotency keys + transaction state machine.
• Double-entry ledger as source of truth.
• PSP reconciliation and compensating actions.

Typical risks

• Duplicate charges from retries.
• Ledger vs PSP mismatch.
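
Idempotency keys and a transaction state machine fit together naturally. The sketch below keeps payments in a dictionary and uses a simplified set of states; the `PaymentService` class, the state names, and the allowed transitions are assumptions made for illustration.

```python
from typing import Dict

# Simplified state machine: which states may follow which.
ALLOWED_TRANSITIONS = {
    "created": {"authorized", "failed"},
    "authorized": {"captured", "voided"},
    "captured": {"refunded"},
}


class PaymentService:
    def __init__(self) -> None:
        self._by_key: Dict[str, Dict] = {}

    def create_payment(self, idempotency_key: str, amount_cents: int) -> Dict:
        """Create a payment once per key; a retry with the same key returns the original."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            if existing["amount_cents"] != amount_cents:
                raise ValueError("idempotency key reused with a different request")
            return existing
        payment = {"key": idempotency_key, "amount_cents": amount_cents, "state": "created"}
        self._by_key[idempotency_key] = payment
        return payment

    def transition(self, idempotency_key: str, new_state: str) -> Dict:
        """Move a payment to a new state, rejecting transitions the machine does not allow."""
        payment = self._by_key[idempotency_key]
        if new_state not in ALLOWED_TRANSITIONS.get(payment["state"], set()):
            raise ValueError(f"illegal transition {payment['state']} -> {new_state}")
        payment["state"] = new_state
        return payment


service = PaymentService()
first = service.create_payment("order-42", 1999)
retry = service.create_payment("order-42", 1999)  # client retried after a timeout
service.transition("order-42", "authorized")
service.transition("order-42", "captured")
print(first is retry, first["state"])
```

The retry does not create a second charge, and an out-of-order request such as capturing a payment that was never authorized is rejected instead of silently applied.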

Design Social Media App (Infrastructure View)

Social Media Infrastructure View

SLO-driven social platform operations: degradation, isolation, and observability.

Focus: Same domain, but from platform/operations perspective.

Product context: moving from feature-level design to SLO-driven runtime architecture and operability.

What to clarify in the interview

• What SLO/error budget applies to key user journeys?
• Where do we need autoscaling and graceful degradation?
• How do we limit blast radius across services?

Architecture focus

• Service boundaries, API contracts, and versioning.
• Observability baseline: logs/metrics/traces + alert routing.
• Deployment topology: multi-AZ rollout and rollback.

Typical risks

• Cascading failures without bulkheads.
• Opaque incidents without end-to-end tracing.
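
One common way to limit blast radius is a circuit breaker in front of a struggling dependency. The sketch below is a simplified, single-threaded version with an explicit clock argument; the class name, thresholds, and error messages are assumptions for the example rather than a specific library's API.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], now: float) -> T:
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            raise RuntimeError("circuit open: failing fast")
        # Closed, or half-open after the cool-down: let the call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on timeouts, which keeps threads and connections free for the parts of the system that still work.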

Design Room Reservation and Marketplace App

Airbnb

Marketplace search, availability calendar, and contention on booking slots.

Focus: Reservation marketplace with high contention on inventory.

Product context: search + atomic booking under race conditions and strict user trust expectations.

What to clarify in the interview

• What are hold/booking/cancel rules and confirmation SLA?
• How much concurrent contention per slot is expected?
• How should search/ranking/filtering work?

Architecture focus

• Inventory model + optimistic/pessimistic locking.
• Reservation workflow with hold timeout.
• Search index with eventual consistency vs transactional store.

Typical risks

• Overbooking under contention.
• Poor UX from slow confirmations.
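
Optimistic locking for a booking slot can be shown with a version check. This is a single-process sketch: the `Slot` dataclass and `try_book` helper are hypothetical, and in a real store the check and the write must happen atomically, for example as one conditional update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    version: int = 0
    booked_by: Optional[str] = None


def try_book(slot: Slot, guest: str, expected_version: int) -> bool:
    """Book the slot only if nobody changed it since the caller read it."""
    if slot.version != expected_version or slot.booked_by is not None:
        return False  # stale read or already taken: the caller must re-read and retry
    slot.booked_by = guest
    slot.version += 1
    return True


slot = Slot()
seen_by_ann = slot.version   # both guests read the slot at version 0
seen_by_ben = slot.version
ann_ok = try_book(slot, "ann", seen_by_ann)
ben_ok = try_book(slot, "ben", seen_by_ben)
print(ann_ok, ben_ok, slot.booked_by)
```

Only the first writer succeeds, which prevents overbooking without holding a lock while the user is still deciding. Under heavy contention the retries become costly, and that is when pessimistic locking or short holds with a timeout start to look better.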

Design Access Control and Authorization for Media App

Access Control for Media App

RBAC/ABAC/ReBAC, PDP/PEP split, auditability, and safe cache invalidation.

Focus: Authorization layer for media platform workflows.

Product context: unified policy model for web/mobile/admin with least-privilege defaults.

What to clarify in the interview

• What initial roles/resources are business-critical?
• Do we need RBAC only, or RBAC + ABAC/ReBAC?
• Do we need explain/audit APIs for investigations?

Architecture focus

• Policy decision point + policy enforcement point.
• Decision caching with correct invalidation.
• Tenant isolation and immutable audit trail.

Typical risks

• Privilege escalation from policy gaps.
• Stale authorization cache after policy changes.
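
The split between a policy decision point and a policy enforcement point can be sketched with plain RBAC. The roles, permission strings, and function names below are hypothetical examples; ABAC and ReBAC would replace the body of `decide` while leaving `enforce` unchanged.

```python
from typing import Callable, Dict, List, Set, Tuple, TypeVar

T = TypeVar("T")

ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"media:read"},
    "editor": {"media:read", "media:write"},
    "admin": {"media:read", "media:write", "media:delete"},
}
USER_ROLES: Dict[str, Set[str]] = {"alice": {"editor"}, "bob": {"viewer"}}
AUDIT_LOG: List[Tuple[str, str, bool]] = []


def decide(user: str, permission: str) -> bool:
    """Policy decision point: unknown users and unknown roles are denied by default."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def enforce(user: str, permission: str, action: Callable[[], T]) -> T:
    """Policy enforcement point: ask for a decision, record it, then run or refuse the action."""
    allowed = decide(user, permission)
    AUDIT_LOG.append((user, permission, allowed))
    if not allowed:
        raise PermissionError(f"{user} lacks {permission}")
    return action()


result = enforce("alice", "media:write", lambda: "saved")
print(result, AUDIT_LOG)
```

Keeping the decision in one place makes it possible to cache, audit, and change policy without touching every call site, although any cache placed in front of `decide` needs invalidation when roles change.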

Design Top Products Dashboard

Conclusions and recommendations

Who is this book suitable for?

For people who want to understand the logic behind the answer — the book gives you a method, not just a library of familiar diagrams.

For experienced engineers — the chapters on distributed transactions and common services are especially useful if you want more than a surface-level review.

For long-term preparation — the book helps build durable architecture instincts rather than short-lived pattern memorization.

Recommendation: keep this book as the primary source for process and answer rhythm, and Alex Xu nearby as the faster reference for classic system cases. One teaches you how to think in the interview; the other saves time on familiar cases.

Related chapters

Why Read System Design Interview Books - Section map and placement of Zhiyong Tan's book in the broader interview preparation path.
System Design Interview: An Insider's Guide (short summary) - Companion source with a faster entry into classic interview cases and a simpler answer frame.
Hacking the System Design Interview (short summary) - Alternative framework and extra case set for structured interview rehearsal.
System Design Primer (short summary) - Foundation source for recurring review of core patterns and regular interview practice.
Distributed Message Queue - Book case on partitioning, delivery semantics, retries, and consumer lag control.
Top Products Dashboard - Analytics case from the book: KPI freshness, serving-layer choices, and reconciliation loops.
Social Media Infrastructure View - Platform view shaped by SLOs, fault isolation, and graceful degradation choices.
Access Control for Media App - Authorization models, policy enforcement, and audit trail design under scale.

Where to find the book

Original

oreilly.com

Acing the System Design Interview

Translated

piter.com

System Design: пережить интервью