Zhiyong Tan’s book matters not because it offers another universal answer template, but because it teaches you to treat a system design problem as an engineering situation with context, clarifying questions, and deliberate depth choices. This chapter focuses on that more methodical side of the material.
In real engineering work, it is valuable because it helps you break a system into practical layers: interfaces, data, async flows, failure paths, security, and operations, and then decide which parts deserve deeper analysis.
For interview prep, the value of this chapter is that it shows how to turn a generic diagram into a structured engineering walkthrough: constraints first, architecture frame next, critical deep dives after that, and only then trade-offs and system evolution.
Practical value of this chapter
Problem Decomposition
Helps split a system problem into practical layers: API, data, async flow, and failure handling.
Depth Control
Teaches where to go deep based on interviewer signals and time constraints.
Risk-First Reasoning
Makes failure points and operational risks explicit before finalizing architecture.
Decision Communication
Improves answer clarity around assumptions, constraints, choices, and evolution path.
Analysis of the book
Review: Acing the System Design Interview
Detailed analysis of the book by Alexander Polomodov on the Code of Architecture blog
Acing the System Design Interview
Author: Zhiyong Tan
Publisher: Manning Publications
Length: 472 pages
Analysis of Zhiyong Tan's book: interview structure, design methodology, practical cases, and common platform services.
"Acing the System Design Interview" is valuable because it teaches you to approach an interview problem as an engineering situation with context, constraints, and deliberate depth choices. Zhiyong Tan is less interested in giving you ready-made diagrams and more interested in showing how a strong answer is built step by step.
Key Difference
Unlike Alex Xu, who moves faster into concrete systems, Zhiyong Tan spends more time on design process and interview rhythm. The first six chapters are a methodology layer, not a catalog of ready-made answers.
About the author
Zhiyong Tan is an engineering manager at PayPal. Before that, he worked across Uber, startups, and Teradata in roles spanning application engineering, platform work, and data systems.
That mix of experience shows up in the book. It is written by someone who cares not only about drawing the architecture, but also about explaining why it is shaped that way and how it will behave in production.
Book structure
Part 1: Methodology foundation (6 chapters)
A structured introduction to the design process: from first principles and requirements to distributed transactions and common services. This is the methodological core of the book.
Part 2: Practical cases (11 chapters)
A set of classic and less obvious cases you can use to practice structure, depth selection, and risk-first design reasoning.
Appendices
Supporting topics worth revisiting before interviews:
Part 1 breakdown
The first six chapters establish the method behind the book. They are the reason this material feels more like an interview playbook than a loose collection of case studies.
1. Overview of Core System Design Concepts
The first chapter introduces the core language of system design and sets the tone for the whole book: a strong answer is really a conversation about trade-offs, not a performance of memorized components.
Topics covered:
2. Typical System Design Interview Flow
The second chapter gives the answer rhythm Tan wants you to internalize: clarify the problem, set the boundaries, sketch the design, and only then decide where deeper analysis is justified.
Functional Requirements
What the system should do: features, use cases, user stories
Non-Functional Requirements
How the system should work: performance, scalability, reliability
It also highlights three recurring anchors of a strong answer: API shape, data model, and high-level architecture.
3. Non-Functional Requirements
The third chapter goes beyond the vague habit of saying “99.99%” and instead turns non-functional requirements into architectural drivers.
Scalability
Ability to grow with load
Availability
Uptime targets such as 99.9% and above
Reliability
Correct operation
Maintainability
Easy to support
Performance
Latency and throughput
Security
Data protection
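One way to turn an availability figure into an architectural driver is to convert it into a downtime budget. The sketch below is only an illustration of that arithmetic; the function name and the 30-day window are assumptions, not something taken from the book.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Each extra "nine" shrinks the budget tenfold, which changes what the
    # design must tolerate (manual failover vs automated failover, and so on).
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days -> {minutes:.2f} minutes of downtime")
```

With a budget of a few minutes per month, a design that depends on a human being paged and reacting is unlikely to fit, which is the kind of conclusion the number should lead to.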
4. Scaling Databases
Chapter four covers database scaling, one of the recurring pressure points in system design interviews, and explains when replication, sharding, and caching actually change the design.
Key techniques:
Replication
Primary-replica and multi-leader approaches
Sharding
Horizontal data partitioning
Event Aggregation
Analytical pipelines and aggregated views
Caching Strategies
Cache-aside, read-through, write-through, write-back
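Of the caching strategies listed, cache-aside is the easiest to show in a few lines. This is a minimal in-memory sketch under stated assumptions: the `CacheAside` class and the `load_from_db` callable are hypothetical names, and a plain dict stands in for both the cache and the database.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """The application checks the cache first and loads from the store on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]):
        self._cache: Dict[str, str] = {}
        self._load = load_from_db
        self.db_reads = 0  # counted only to make the behaviour observable

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]          # cache hit
        self.db_reads += 1
        value = self._load(key)              # cache miss: go to the source of truth
        if value is not None:
            self._cache[key] = value         # populate for later readers
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads the fresh value."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    db = {"user:1": "Ada"}
    cache = CacheAside(db.get)
    cache.get("user:1")
    cache.get("user:1")
    print("database reads:", cache.db_reads)
```

The write path is the part worth discussing in an interview: here the writer updates the store and then invalidates the key, which leaves a short window in which readers can still see the old value.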
5. Distributed Transactions
The fifth chapter is one of the strongest parts of the book. It explains distributed transactions as a practical coordination problem rather than a purely academic one.
Patterns considered:
Event-Driven Architecture
Asynchronous communication through events
Change Data Capture (CDC)
Capturing changes from the database
Saga Pattern
Compensating steps across services
Transaction Supervisor
Explicit coordination of distributed work
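The saga idea of compensating steps can be sketched as a small orchestrator. This is an illustrative simplification, not the book's implementation: `run_saga` and the step names are hypothetical, and real compensations would need to be idempotent and durable.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation). All names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse order."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True


def decline_payment() -> None:
    raise RuntimeError("payment declined")


if __name__ == "__main__":
    events: List[str] = []
    ok = run_saga(
        [
            ("reserve_inventory", lambda: None, lambda: None),
            ("charge_payment", decline_payment, lambda: None),
        ],
        events,
    )
    print(ok, events)
```

The step that failed is not compensated, because it never completed; only earlier steps are rolled back.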
6. Common Services
The sixth chapter covers the common services that show up in almost every system and ties them back to the interview conversation instead of treating them as unrelated side topics.
Authentication
JWT, Sessions, OAuth
Error Handling
Retries, timeouts, circuit breakers
Rate Limiting
Token Bucket, Leaky Bucket
Service Mesh
Istio, Linkerd, Sidecars
API Protocols
REST, RPC, GraphQL
Logging & Monitoring
Observability and incident analysis
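As an example of one of these common services, a token bucket rate limiter fits in a few lines. The sketch below is an assumption-laden illustration: the `TokenBucket` class is hypothetical, and the caller passes timestamps explicitly so the behaviour is deterministic rather than tied to the wall clock.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_sec` rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity   # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
    print([bucket.allow(0.0) for _ in range(3)])  # burst of two, then a rejection
```

In a distributed deployment the bucket state would live in a shared store, and that placement decision is usually the more interesting part of the interview discussion.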
Part 2: Practical cases
The book includes 11 practical cases. Below they appear in the same order as in the book, with emphasis on what is most useful to train for interviews.
Design URL Shortener
URL Shortener
API shape, short ID strategy, redirects, anti-abuse controls, and scaling.
Focus: Short links for sharing flows, fast redirects, and collision safety.
Product context: users and campaign tools create short URLs, while the dominant traffic pattern is low-latency redirects.
What to clarify in the interview
- What is the read/write ratio and redirect SLA?
- Do we need custom aliases, TTL, and deletion?
- Do we need near-realtime click analytics?
Architecture focus
- ID generation strategy (counter/snowflake/hash) + collision handling.
- Hot link caching and edge/CDN acceleration.
- Anti-abuse controls: rate limiting, blacklist, URL validation.
Typical risks
- ID enumeration attacks.
- Hot keys for viral links.
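For the counter-based ID strategy, a common approach is to encode a numeric ID in base 62. The sketch below assumes a 62-character alphabet of digits and letters; the function names are hypothetical, and where the counter comes from (a database sequence, a snowflake-style generator) is left out.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n: int) -> str:
    """Turn a non-negative counter value into a short string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(s: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    print(encode_base62(125), decode_base62("21"))
    print("7-character codes available:", BASE ** 7)
```

A unique counter cannot collide, but sequential codes are trivially guessable, which is exactly the enumeration risk listed above. Hashing or randomizing the ID trades that risk for explicit collision handling.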
Design Key-Value Database
Key-Value Database
Sharding, replication, quorum choices, and failure recovery.
Focus: A core storage engine for high-volume reads/writes.
Product context: internal platform service used by multiple teams for simple, scalable key-value workloads.
What to clarify in the interview
- What consistency guarantees are required?
- What are value size limits and workload profile?
- Do we need multi-region, backup/restore, and TTL?
Architecture focus
- Replication + quorum reads/writes to balance latency and correctness.
- Sharding and online rebalancing.
- Storage engine choices (LSM/B-tree), compaction, write amplification trade-offs.
Typical risks
- Hot partitions due to poor key design.
- Slow recovery/re-sync after failures.
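The quorum trade-off can be made concrete with the usual rule that read and write sets overlap when R + W > N. The sketch below is a simplified illustration: replica replies are modelled as `(version, value)` tuples, and both function names are hypothetical.

```python
from typing import List, Tuple


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """With N replicas, any read set of R meets any write set of W iff R + W > N."""
    return r + w > n


def quorum_read(replies: List[Tuple[int, str]], r: int) -> str:
    """Return the value with the highest version among at least `r` replica replies."""
    if len(replies) < r:
        raise TimeoutError(f"only {len(replies)} of {r} required replicas answered")
    _version, value = max(replies, key=lambda reply: reply[0])
    return value


if __name__ == "__main__":
    print(quorums_overlap(n=3, w=2, r=2))   # overlapping quorums
    print(quorums_overlap(n=3, w=1, r=1))   # fast, but reads may miss the latest write
    print(quorum_read([(1, "old"), (2, "new")], r=2))
```

Lower R or W reduces latency, at the cost of possibly reading a replica that has not yet seen the latest write.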
Design Distributed Message Queue
Distributed Message Queue
Partitioned log, delivery semantics, retry/DLQ, and lag control.
Focus: Reliable async backbone for services and background jobs.
Product context: decoupling service interactions, absorbing bursts, and preserving delivery guarantees.
What to clarify in the interview
- Which semantics are required: at-most-once / at-least-once / effectively-once?
- Is ordering global or per partition?
- What are latency and retention targets?
Architecture focus
- Partitioning + consumer groups for scale.
- Retry policy, DLQ, and idempotent consumers.
- Backpressure and flow control under spikes.
Typical risks
- Poison messages breaking consumers.
- Growing consumer lag under uneven load.
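Retries, a dead-letter queue, and idempotent consumption fit together roughly as follows. This is a single-process sketch under assumptions: messages are dicts with an `id` field, `consume` is a hypothetical helper, and the processed-ID set would be a durable store in practice.

```python
from typing import Callable, Dict, List, Set, Tuple


def consume(
    messages: List[Dict],
    handler: Callable[[Dict], None],
    max_attempts: int = 3,
) -> Tuple[Set[str], List[Dict]]:
    """Process at-least-once deliveries; return (processed IDs, dead letters)."""
    processed_ids: Set[str] = set()
    dead_letters: List[Dict] = []
    for msg in messages:
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: skipping it keeps the consumer idempotent
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed_ids.add(msg["id"])
                break
            except Exception:
                if attempt == max_attempts:
                    # Poison message: park it instead of blocking the partition.
                    dead_letters.append(msg)
    return processed_ids, dead_letters


if __name__ == "__main__":
    def handle(msg: Dict) -> None:
        if msg["body"] == "poison":
            raise ValueError("cannot parse")

    inbox = [
        {"id": "a", "body": "ok"},
        {"id": "a", "body": "ok"},       # redelivered duplicate
        {"id": "b", "body": "poison"},
        {"id": "c", "body": "ok"},
    ]
    done, dlq = consume(inbox, handle)
    print(sorted(done), [m["id"] for m in dlq])
```

Deduplicating on the message ID is what turns at-least-once delivery into effectively-once processing from the handler's point of view.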
Design Social Media App
Twitter/X
Social feed design: fanout strategy, caching, and high-load trade-offs.
Focus: Consumer social product with feed and interactions.
Product context: content publishing, following graph, personalized timeline, and viral traffic behavior.
What to clarify in the interview
- Which actions are core: post, follow, like, comment?
- What are DAU and p95 feed-open targets?
- Do we need ranking/personalization in v1?
Architecture focus
- Feed strategy: fanout-on-write vs fanout-on-read vs hybrid.
- Multi-layer caching for timeline/media/metadata.
- Async pipelines for media processing and counters.
Typical risks
- Celebrity fanout explosion.
- Cache inconsistency vs source of truth.
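A hybrid feed strategy can be sketched as push for ordinary authors and pull for authors with many followers. Everything below is an illustrative assumption: the `Feed` class, the in-memory dictionaries, and the deliberately tiny `CELEBRITY_THRESHOLD`.

```python
from collections import defaultdict
from typing import List, Tuple

CELEBRITY_THRESHOLD = 3  # hypothetical cut-off, kept tiny for the demo

Post = Tuple[int, str, str]  # (timestamp, author, text)


class Feed:
    def __init__(self) -> None:
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.inbox = defaultdict(list)      # user -> posts pushed at write time
        self.outbox = defaultdict(list)     # author -> their own posts

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, ts: int, text: str) -> None:
        item = (ts, author, text)
        self.outbox[author].append(item)
        if not self.is_celebrity(author):
            for follower in self.followers[author]:   # fanout-on-write
                self.inbox[follower].append(item)

    def timeline(self, user: str, limit: int = 10) -> List[Post]:
        items = set(self.inbox[user])
        for author in self.following[user]:
            if self.is_celebrity(author):              # fanout-on-read
                items.update(self.outbox[author])
        return sorted(items, reverse=True)[:limit]     # newest first


if __name__ == "__main__":
    feed = Feed()
    for fan in ("u1", "u2", "u3"):
        feed.follow(fan, "star")
    feed.follow("u1", "bob")
    feed.post("bob", 1, "hi")
    feed.post("star", 2, "hello")
    print(feed.timeline("u1"))
```

Merging through a set also covers an author who crosses the threshold over time, since their older pushed posts would otherwise appear twice.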
Design Ad Click Event Aggregator
Ad Click Event Aggregator
Streaming aggregation pipeline with freshness and billing accuracy constraints.
Focus: Analytics pipeline for ad clicks and reporting.
Product context: near-realtime event aggregation with strict quality requirements for reporting and billing.
What to clarify in the interview
- Do we target realtime dashboards or hourly batch?
- What accuracy is required for billing use-cases?
- How do we handle out-of-order and late events?
Architecture focus
- Event ingestion + idempotent deduplication.
- Windowed aggregates + watermark strategy.
- Realtime + historical recomputation compatibility.
Typical risks
- Double counting from retries.
- Drift between online and offline numbers.
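Deduplication, tumbling windows, and a watermark can be combined in one small loop. This sketch makes several assumptions that are not from the book: events are `(event_id, ad_id, event_time)` tuples in arrival order, windows are 60 seconds, and the watermark trails the largest event time seen by a fixed 30 seconds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 60          # tumbling window size (assumed)
ALLOWED_LATENESS = 30        # how far the watermark trails the max event time (assumed)

Event = Tuple[str, str, int]  # (event_id, ad_id, event_time in seconds)


def aggregate_clicks(events: Iterable[Event]) -> Tuple[Dict[Tuple[int, str], int], List[Event]]:
    """Return (clicks per (window_start, ad_id), events that arrived too late)."""
    seen_ids = set()
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    late: List[Event] = []
    max_event_time = 0
    for event_id, ad_id, ts in events:
        if event_id in seen_ids:
            continue                      # a retry must not be counted twice
        seen_ids.add(event_id)
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = ts - ts % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            late.append((event_id, ad_id, ts))   # window already closed
            continue
        counts[(window_start, ad_id)] += 1
    return dict(counts), late


if __name__ == "__main__":
    stream = [
        ("e1", "ad1", 10),
        ("e1", "ad1", 10),    # duplicate from a client retry
        ("e2", "ad1", 70),
        ("e3", "ad1", 50),    # out of order, but inside the lateness allowance
        ("e4", "ad1", 130),
        ("e5", "ad1", 20),    # too late: its window closed long ago
    ]
    print(aggregate_clicks(stream))
```

Late events are returned rather than dropped, so a batch recomputation can fold them in later and keep online and offline numbers from drifting apart.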
Design Object Storage Service
Object Storage
Object storage architecture, metadata/data split, and durability mechanisms.
Focus: Durable large-object storage for media and backups.
Product context: cost-efficient, high-durability storage with simple API and lifecycle controls.
What to clarify in the interview
- What durability/availability targets are needed?
- What object size distribution and R/W profile do we expect?
- Do we need versioning, lifecycle, and storage tiering?
Architecture focus
- Metadata/data separation with independent scaling.
- Erasure coding/replication and background repair.
- Multipart upload, checksum validation, pre-signed URLs.
Typical risks
- Metadata bottleneck at namespace scale.
- High cross-region replication cost.
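Multipart upload with checksum validation can be illustrated without any storage backend. In this sketch a dict plays the role of the part store, and the helper names are hypothetical; it uses SHA-256 from the standard library for the per-part checksum.

```python
import hashlib
from typing import Dict, List


def split_parts(data: bytes, part_size: int) -> List[bytes]:
    """Cut an object into fixed-size parts (the last one may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def upload_part(store: Dict[int, bytes], part_no: int, chunk: bytes, expected_sha256: str) -> None:
    """Accept a part only if its checksum matches what the client declared."""
    if hashlib.sha256(chunk).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch on part {part_no}")
    store[part_no] = chunk   # re-uploading the same part simply overwrites it


def complete_upload(store: Dict[int, bytes], total_parts: int) -> bytes:
    """Assemble the object once every part has arrived."""
    missing = [n for n in range(total_parts) if n not in store]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    return b"".join(store[n] for n in range(total_parts))


if __name__ == "__main__":
    payload = b"abcdefghij"
    parts = split_parts(payload, part_size=4)
    staged: Dict[int, bytes] = {}
    # Parts can arrive in any order and be retried independently.
    for number in reversed(range(len(parts))):
        upload_part(staged, number, parts[number], hashlib.sha256(parts[number]).hexdigest())
    print(complete_upload(staged, len(parts)) == payload)
```

Because each part is validated and stored independently, a failed transfer only requires resending that part rather than the whole object.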
Design Online Payment App
Payment System
Idempotency, auth/capture/refund flow, payment orchestration, and reconciliation.
Focus: Payment processing with strict correctness guarantees.
Product context: money movement where correctness and auditability are more important than raw latency.
What to clarify in the interview
- Which payment flows are required: auth/capture/refund/chargeback?
- What compliance and audit constraints apply?
- How are duplicate requests and partial failures handled?
Architecture focus
- Idempotency keys + transaction state machine.
- Double-entry ledger as source of truth.
- PSP reconciliation and compensating actions.
Typical risks
- Duplicate charges from retries.
- Ledger vs PSP mismatch.
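Idempotency keys and a transaction state machine can be shown together in a small in-memory model. The state names, the `ALLOWED_TRANSITIONS` table, and the `Payments` class are assumptions made for illustration; a real system would persist both the key mapping and the state changes atomically.

```python
from typing import Dict

# Hypothetical lifecycle: which state may follow which.
ALLOWED_TRANSITIONS = {
    "created": {"authorized", "failed"},
    "authorized": {"captured", "voided"},
    "captured": {"refunded"},
}


class Payments:
    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}   # idempotency key -> payment record
        self.authorizations = 0              # how many times we actually authorized

    def authorize(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retried request with the same key returns the original result.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        payment = {"id": len(self._by_key) + 1, "amount_cents": amount_cents, "state": "created"}
        self._transition(payment, "authorized")
        self.authorizations += 1
        self._by_key[idempotency_key] = payment
        return payment

    def capture(self, payment: dict) -> None:
        self._transition(payment, "captured")

    @staticmethod
    def _transition(payment: dict, new_state: str) -> None:
        if new_state not in ALLOWED_TRANSITIONS.get(payment["state"], set()):
            raise ValueError(f"illegal transition {payment['state']} -> {new_state}")
        payment["state"] = new_state


if __name__ == "__main__":
    service = Payments()
    first = service.authorize("order-42", 500)
    retry = service.authorize("order-42", 500)   # client retried after a timeout
    print(first is retry, service.authorizations)
```

The state machine rejects a second capture outright, which complements the idempotency key: one guards against duplicate requests, the other against out-of-order or repeated lifecycle steps.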
Design Social Media App (Infrastructure View)
Social Media Infrastructure View
SLO-driven social platform operations: degradation, isolation, and observability.
Focus: Same domain, but from platform/operations perspective.
Product context: moving from feature-level design to SLO-driven runtime architecture and operability.
What to clarify in the interview
- What SLO/error budget applies to key user journeys?
- Where do we need autoscaling and graceful degradation?
- How do we limit blast radius across services?
Architecture focus
- Service boundaries, API contracts, and versioning.
- Observability baseline: logs/metrics/traces + alert routing.
- Deployment topology: multi-AZ rollout and rollback.
Typical risks
- Cascading failures without bulkheads.
- Opaque incidents without end-to-end tracing.
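A bulkhead combined with a fallback is one simple way to limit blast radius and degrade gracefully. The sketch below is a single-threaded illustration with hypothetical names (`Bulkhead`, `call_with_fallback`); a production version would need a thread-safe counter or a semaphore.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps in-flight calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.rejected = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_concurrent:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        if self.in_flight > 0:
            self.in_flight -= 1


def call_with_fallback(bulkhead: Bulkhead, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run the operation if a slot is free; otherwise degrade to the fallback."""
    if not bulkhead.try_acquire():
        return fallback()
    try:
        return operation()
    finally:
        bulkhead.release()


if __name__ == "__main__":
    recommendations = Bulkhead("recommendations", max_concurrent=1)
    recommendations.try_acquire()   # simulate one slow call already in flight
    print(call_with_fallback(recommendations, lambda: "fresh", lambda: "cached"))
```

The `rejected` counter is worth exporting as a metric, since a rising value is an early signal that a dependency is saturating.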
Design Room Reservation and Marketplace App
Airbnb
Marketplace search, availability calendar, and contention on booking slots.
Focus: Reservation marketplace with high contention on inventory.
Product context: search + atomic booking under race conditions and strict user trust expectations.
What to clarify in the interview
- What are hold/booking/cancel rules and confirmation SLA?
- How much concurrent contention per slot is expected?
- How should search/ranking/filtering work?
Architecture focus
- Inventory model + optimistic/pessimistic locking.
- Reservation workflow with hold timeout.
- Search index with eventual consistency vs transactional store.
Typical risks
- Overbooking under contention.
- Poor UX from slow confirmations.
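Optimistic locking and a hold timeout can be modelled with a version number and an expiry timestamp on each slot. This is an in-memory sketch with assumed names (`Slot`, `try_hold`, `confirm`) and an assumed 600-second hold; in a database the version check would be a conditional update.

```python
from typing import Optional


class Slot:
    """One bookable unit, e.g. a room for a given night."""

    def __init__(self) -> None:
        self.version = 0
        self.held_by: Optional[str] = None
        self.hold_expires_at = 0.0
        self.booked_by: Optional[str] = None


def try_hold(slot: Slot, user: str, expected_version: int, now: float, hold_seconds: float = 600) -> bool:
    """Place a temporary hold if the slot is free and unchanged since it was read."""
    hold_active = slot.held_by is not None and now < slot.hold_expires_at
    if slot.booked_by is not None or hold_active:
        return False
    if slot.version != expected_version:
        return False          # someone else modified the slot: re-read and retry
    slot.held_by = user
    slot.hold_expires_at = now + hold_seconds
    slot.version += 1
    return True


def confirm(slot: Slot, user: str, now: float) -> bool:
    """Turn a still-valid hold into a booking."""
    if slot.booked_by is not None or slot.held_by != user or now >= slot.hold_expires_at:
        return False
    slot.booked_by = user
    slot.version += 1
    return True


if __name__ == "__main__":
    room = Slot()
    seen_version = room.version           # both users read the same state
    print(try_hold(room, "alice", seen_version, now=0))
    print(try_hold(room, "bob", seen_version, now=1))
```

The expiry means an abandoned checkout releases inventory on its own, without a cleanup job having to run before the next hold is attempted.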
Design Access Control and Authorization for Media App
Access Control for Media App
RBAC/ABAC/ReBAC, PDP/PEP split, auditability, and safe cache invalidation.
Focus: Authorization layer for media platform workflows.
Product context: unified policy model for web/mobile/admin with least-privilege defaults.
What to clarify in the interview
- What initial roles/resources are business-critical?
- Do we need RBAC only, or RBAC + ABAC/ReBAC?
- Do we need explain/audit APIs for investigations?
Architecture focus
- Policy decision point + policy enforcement point.
- Decision caching with correct invalidation.
- Tenant isolation and immutable audit trail.
Typical risks
- Privilege escalation from policy gaps.
- Stale authorization cache after policy changes.
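The split between a policy decision point and a policy enforcement point, plus a cache that cannot go stale, can be sketched with RBAC only. All names here are hypothetical, and the invalidation technique shown (putting a policy version into the cache key) is just one option among several.

```python
from typing import Dict, Set, Tuple

Permission = Tuple[str, str]  # (action, resource_type)


class PolicyDecisionPoint:
    """Answers 'may this user do this?' and caches decisions per policy version."""

    def __init__(self, role_permissions: Dict[str, Set[Permission]], user_roles: Dict[str, Set[str]]) -> None:
        self.role_permissions = role_permissions
        self.user_roles = user_roles
        self.policy_version = 1
        self._cache: Dict[tuple, bool] = {}

    def is_allowed(self, user: str, action: str, resource_type: str) -> bool:
        key = (self.policy_version, user, action, resource_type)
        if key not in self._cache:
            roles = self.user_roles.get(user, set())
            self._cache[key] = any(
                (action, resource_type) in self.role_permissions.get(role, set())
                for role in roles
            )
        return self._cache[key]

    def revoke_role(self, user: str, role: str) -> None:
        self.user_roles.get(user, set()).discard(role)
        # Bumping the version makes every older cache entry unreachable.
        self.policy_version += 1


def enforce(pdp: PolicyDecisionPoint, user: str, action: str, resource_type: str) -> None:
    """Policy enforcement point: called by the service before doing the work."""
    if not pdp.is_allowed(user, action, resource_type):
        raise PermissionError(f"{user} may not {action} {resource_type}")


if __name__ == "__main__":
    pdp = PolicyDecisionPoint({"editor": {("publish", "video")}}, {"ann": {"editor"}})
    print(pdp.is_allowed("ann", "publish", "video"))
    pdp.revoke_role("ann", "editor")
    print(pdp.is_allowed("ann", "publish", "video"))
```

Old entries stay in memory until evicted, so a real cache would pair the version key with a TTL or size bound.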
Design Top Products Dashboard
Top Products Dashboard
Analytics serving layer with pre-aggregations, metric consistency, and freshness controls.
Focus: Analytical dashboard for product and operations teams.
Product context: decision-making UI where metric explainability and freshness transparency are essential.
What to clarify in the interview
- What refresh frequency and latency are acceptable?
- Which dimensions and filters are required?
- Do we need drill-down to raw events?
Architecture focus
- Dedicated serving layer for dashboard workloads.
- Pre-aggregations/materialized views for p95 latency.
- Combination of realtime stream + scheduled backfill.
Typical risks
- Heavy ad-hoc queries impacting OLTP.
- Metric mismatch across dashboards.
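Pre-aggregation for a top-products view can be sketched as hourly buckets that are merged at query time. The event shape `(hour, product_id, quantity)` and the function names are assumptions for illustration; the point is that the dashboard reads small rollups instead of scanning raw events.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple

SaleEvent = Tuple[int, str, int]  # (hour bucket, product_id, quantity)


def pre_aggregate(events: Iterable[SaleEvent]) -> Dict[int, Counter]:
    """Roll raw events up into one counter per hour (the materialized view)."""
    buckets: Dict[int, Counter] = {}
    for hour, product_id, quantity in events:
        buckets.setdefault(hour, Counter())[product_id] += quantity
    return buckets


def top_products(buckets: Dict[int, Counter], hours: Iterable[int], k: int) -> List[Tuple[str, int]]:
    """Merge the requested hourly buckets and return the k best sellers."""
    total: Counter = Counter()
    for hour in hours:
        total.update(buckets.get(hour, Counter()))
    # Sort by quantity descending, then product ID, so ties are deterministic.
    return heapq.nsmallest(k, total.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    raw = [(0, "a", 2), (0, "b", 1), (1, "b", 3), (1, "c", 1)]
    rollups = pre_aggregate(raw)
    print(top_products(rollups, hours=[0, 1], k=2))
```

If every dashboard reads from the same rollups, they also share one definition of the metric, which reduces the mismatch risk listed above.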
Conclusions and recommendations
Who is this book suitable for?
For people who want to understand the logic behind the answer — the book gives you a method, not just a library of familiar diagrams.
For experienced engineers — the chapters on distributed transactions and common services are especially useful if you want more than a surface-level review.
For long-term preparation — the book helps build durable architecture instincts rather than short-lived pattern memorization.
Recommendation: Use this book as the primary source for process and answer rhythm, and keep Alex Xu nearby as the faster reference for classic system cases.
Related chapters
- Why Read System Design Interview Books - Section map and placement of Zhiyong Tan's book in the broader interview preparation path.
- System Design Interview: An Insider's Guide (short summary) - Companion source with a faster entry into classic interview cases and a simpler answer frame.
- Hacking the System Design Interview (short summary) - Alternative framework and extra case set for structured interview rehearsal.
- System Design Primer (short summary) - Foundation source for recurring review of core patterns and regular interview practice.
- Distributed Message Queue - Book case on partitioning, delivery semantics, retries, and consumer lag control.
- Top Products Dashboard - Analytics case from the book: KPI freshness, serving-layer choices, and reconciliation loops.
- Social Media Infrastructure View - Platform view shaped by SLOs, fault isolation, and graceful degradation choices.
- Access Control for Media App - Authorization models, policy enforcement, and audit trail design under scale.
