Review: Acing the System Design Interview
Detailed analysis of the book by Alexander Polomodov on the Code of Architecture blog
Acing the System Design Interview
Author: Zhiyong Tan
Publisher: Manning Publications
Length: 472 pages
An analysis of Zhiyong Tan's book: design methodology, distributed transactions, and common services.
Zhiyong Tan's "Acing the System Design Interview" is a practical alternative to the popular System Design materials. The author, an engineering manager at PayPal with experience at Uber and Teradata, offers a structured approach to preparation that I found even more useful than Alex Xu's book.
Key Difference
Unlike Alex Xu, who focuses on analyzing specific systems, Zhiyong Tan pays more attention to the design process and the structure of the interview. The first 6 chapters are a methodological framework, not a set of ready-made solutions.
About the author
Zhiyong Tan is an engineering manager at PayPal. Before that, he was a senior full-stack engineer at Uber, a data engineer at startups, and an engineer at Teradata.
This varied experience lets him look at system design and hiring from the perspective of companies of different sizes and levels of maturity, from startups to enterprise giants.
Book structure
Part 1: Introduction to System Design Interview (6 chapters)
A structured introduction to the design process: from basic concepts to distributed transactions. This is the methodological core of the book.
Part 2: Practical problems (11 cases)
Examples of System Design tasks with analysis. Classic and non-standard scenarios for practicing design skills.
Appendices
Additional materials that may be covered during the interview.
Detailed analysis of the first part
The first six chapters form the methodological basis of the book. Let's look at each of them in detail.
A Walkthrough of System Design Concepts
In the first chapter, the author introduces the basic concepts of System Design and explains the main idea: System Design is a discussion of the trade-offs that must be made when designing a solution.
Topics covered:
A Typical System Design Interview Flow
The second chapter is devoted to the structure of the interview and the important division of requirements:
Functional Requirements
What the system should do: features, use cases, user stories
Non-Functional Requirements
How the system should work: performance, scalability, reliability
Also covered: API specification, data modeling, and high-level architecture.
Non-Functional Requirements
The third chapter dives deep into non-functional requirements - the so-called "-ilities":
Scalability
Ability to grow with load
Availability
Uptime targets of 99.9% and above
Reliability
Correct operation
Maintainability
Easy to support
Performance
Latency and throughput
Security
Data protection
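An availability target such as 99.9% is easier to discuss once it is converted into a downtime budget. The sketch below is only an illustration of that arithmetic, assuming a 365-day year; the function name is hypothetical and not taken from the book.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Return the downtime allowed over `days` for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes per year")
```

Under these assumptions, 99.9% leaves roughly 525.6 minutes (about 8.76 hours) of downtime per year, and each additional nine cuts that budget by a factor of ten.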
Scaling Databases
Chapter four focuses on database scaling, one of the key topics in any System Design interview.
Key techniques:
Replication
Master-slave, multi-master replication
Sharding
Horizontal data partitioning
Event Aggregation
Event aggregation for analytics
Caching Strategies
Read-through, write-through, write-behind
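To make two of the caching strategies listed above concrete, here is a minimal sketch of read-through and write-through behavior. It assumes a plain dictionary standing in for the database; the class and attribute names are hypothetical, and eviction, TTLs, and write-behind are deliberately left out.

```python
from typing import Dict, Optional


class ReadThroughWriteThroughCache:
    """Cache in front of a backing store (a dict stands in for the database)."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.store = store
        self.cache: Dict[str, str] = {}
        self.store_reads = 0  # counts how often we had to hit the store

    def get(self, key: str) -> Optional[str]:
        # Read-through: on a miss, load from the store and populate the cache.
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key: str, value: str) -> None:
        # Write-through: update the store and the cache in the same call.
        self.store[key] = value
        self.cache[key] = value
```

A write-behind variant would acknowledge the write after updating only the cache and flush to the store later, trading durability for lower write latency.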
Distributed Transactions
The fifth chapter is one of the most valuable. It deals with the complex topic of distributed transactions, which is rarely covered well in other books.
Patterns considered:
Event-Driven Architecture
Asynchronous communication through events
Change Data Capture (CDC)
Capturing changes from the database
Saga Pattern
Compensating transactions
Transaction Supervisor
Coordination of distributed operations
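The saga idea can be illustrated with a small orchestrator: run the steps in order and, if one fails, undo the already completed steps in reverse. This is a simplified in-process sketch, not the book's implementation; `run_saga` and the step names in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
    return log
```

For example, if a hypothetical "reserve" step succeeds and a "charge" step raises, the log reads done:reserve, failed:charge, compensated:reserve. A production saga would also persist this log so that recovery can continue after a crash of the orchestrator itself.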
Common Services
The sixth chapter examines common services that are found in almost every system:
Authentication
JWT, Sessions, OAuth
Error Handling
Retry, Circuit Breaker
Rate Limiting
Token Bucket, Leaky Bucket
Service Mesh
Istio, Linkerd, Sidecars
API Protocols
REST, RPC, GraphQL
Logging & Monitoring
Observability stack
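As an illustration of the token bucket algorithm named under Rate Limiting, the sketch below refills tokens at a fixed rate and admits a request only when enough tokens are available. The clock is passed in explicitly to keep the example deterministic; the class name and parameters are assumptions made for this example.

```python
class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add tokens for the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a capacity of 2 and a refill rate of 1 token per second, two requests at time 0 pass, a third is rejected, and one more passes a second later. A leaky bucket differs in that it smooths output to a constant rate instead of allowing bursts up to the capacity.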
Part 2: Practical problems (chapters 7-17)
The book includes 11 practical cases. Below is the complete list in the same order as in the book.
Design URL Shortener
URL Shortener
API shape, short ID strategy, redirects, anti-abuse controls, and scaling.
Focus: Short links for sharing flows, fast redirects, and collision safety.
Product context: users and campaign tools create short URLs, while the dominant traffic pattern is low-latency redirects.
What to clarify in the interview
- What is the read/write ratio and redirect SLA?
- Do we need custom aliases, TTL, and deletion?
- Do we need near-realtime click analytics?
Architecture focus
- ID generation strategy (counter/snowflake/hash) + collision handling.
- Hot link caching and edge/CDN acceleration.
- Anti-abuse controls: rate limiting, blacklist, URL validation.
Typical risks
- ID enumeration attacks.
- Hot keys for viral links.
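One common way to implement the counter-based ID strategy is to encode a numeric ID in base 62. The sketch below assumes that IDs come from some unique counter (not shown); the alphabet order and function names are choices made for this example.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a short base-62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_base62(s: str) -> int:
    """Decode a base-62 string back into the integer ID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because sequential counters produce guessable short links, this approach on its own is exposed to the enumeration risk listed above; random or hashed IDs with a collision check are the usual alternative.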
Design Key-Value Database
Key-Value Database
Sharding, replication, quorum choices, and failure recovery.
Focus: A core storage engine for high-volume reads/writes.
Product context: internal platform service used by multiple teams for simple, scalable key-value workloads.
What to clarify in the interview
- What consistency guarantees are required?
- What are value size limits and workload profile?
- Do we need multi-region, backup/restore, and TTL?
Architecture focus
- Replication + quorum reads/writes to balance latency and correctness.
- Sharding and online rebalancing.
- Storage engine choices (LSM/B-tree), compaction, write amplification trade-offs.
Typical risks
- Hot partitions due to poor key design.
- Slow recovery/re-sync after failures.
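The quorum trade-off can be stated as a simple rule: with N replicas, a write quorum W and a read quorum R are guaranteed to overlap when R + W > N, so a read sees at least one replica holding the latest acknowledged write. The sketch below checks that condition and resolves a read by version number; the function names and the (version, value) reply format are assumptions for this example.

```python
from typing import List, Tuple


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def read_latest(replies: List[Tuple[int, str]]) -> str:
    """Pick the value with the highest version among the replies from R replicas."""
    _version, value = max(replies, key=lambda reply: reply[0])
    return value


print(quorums_overlap(n=3, w=2, r=2))  # overlapping quorums
print(quorums_overlap(n=3, w=1, r=1))  # fast, but reads may be stale
print(read_latest([(1, "old"), (2, "new")]))
```

Lower values of R and W reduce latency, which is why the choice is usually driven by the consistency requirement clarified at the start of the interview.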
Design Distributed Message Queue
Distributed Message Queue
Partitioned log, delivery semantics, retry/DLQ, and lag control.
Focus: Reliable async backbone for services and background jobs.
Product context: decoupling service interactions, absorbing bursts, and preserving delivery guarantees.
What to clarify in the interview
- Which semantics are required: at-most-once / at-least-once / effectively-once?
- Is ordering global or per partition?
- What are latency and retention targets?
Architecture focus
- Partitioning + consumer groups for scale.
- Retry policy, DLQ, and idempotent consumers.
- Backpressure and flow control under spikes.
Typical risks
- Poison messages breaking consumers.
- Growing consumer lag under uneven load.
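The combination of retries, a dead-letter queue (DLQ), and an idempotent consumer can be sketched in a few lines. This in-memory example assumes each message carries a unique `id` field and that a failing handler raises an exception; the function and field names are hypothetical, and a real consumer would persist the processed IDs.

```python
from typing import Callable, Dict, List, Set, Tuple

Message = Dict[str, str]


def consume(
    messages: List[Message],
    handler: Callable[[Message], None],
    max_attempts: int = 3,
) -> Tuple[Set[str], List[Message]]:
    """Process messages at-least-once; skip duplicates, park poison messages in a DLQ."""
    processed_ids: Set[str] = set()
    dead_letters: List[Message] = []
    for msg in messages:
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: already handled, so ignore it
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed_ids.add(msg["id"])
                break
            except Exception:
                if attempt == max_attempts:
                    # Give up and move on so one bad message cannot block the rest.
                    dead_letters.append(msg)
    return processed_ids, dead_letters
```

Deduplicating by message ID is what turns at-least-once delivery into effectively-once processing, and the DLQ keeps a poison message from stalling the partition.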
Design Social Media App
Twitter/X
Social feed design: fanout strategy, caching, and high-load trade-offs.
Focus: Consumer social product with feed and interactions.
Product context: content publishing, following graph, personalized timeline, and viral traffic behavior.
What to clarify in the interview
- Which actions are core: post, follow, like, comment?
- What are DAU and p95 feed-open targets?
- Do we need ranking/personalization in v1?
Architecture focus
- Feed strategy: fanout-on-write vs fanout-on-read vs hybrid.
- Multi-layer caching for timeline/media/metadata.
- Async pipelines for media processing and counters.
Typical risks
- Celebrity fanout explosion.
- Cache inconsistency vs source of truth.
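The hybrid feed strategy can be sketched as follows: posts from ordinary authors are pushed into follower timelines at write time, while posts from authors above a follower threshold are pulled and merged at read time. Everything here is an in-memory illustration; the `Feed` class and the tiny `CELEBRITY_THRESHOLD` are assumptions for the example.

```python
from collections import defaultdict
from itertools import count
from typing import Dict, List, Set, Tuple

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff, kept tiny for the demo
Post = Tuple[int, str]  # (sequence number, text)


class Feed:
    def __init__(self) -> None:
        self.followers: Dict[str, Set[str]] = defaultdict(set)
        self.following: Dict[str, Set[str]] = defaultdict(set)
        self.timelines: Dict[str, List[Post]] = defaultdict(list)
        self.posts_by_author: Dict[str, List[Post]] = defaultdict(list)
        self._seq = count(1)

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def _is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        item = (next(self._seq), text)
        self.posts_by_author[author].append(item)
        if not self._is_celebrity(author):
            # Fanout-on-write: push into each follower's precomputed timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append(item)

    def read(self, user: str) -> List[str]:
        # Fanout-on-read for celebrities: merge their posts in at read time.
        items = set(self.timelines[user])
        for author in self.following[user]:
            if self._is_celebrity(author):
                items.update(self.posts_by_author[author])
        return [text for _, text in sorted(items, reverse=True)]
```

The write path stays cheap for accounts with many followers, at the cost of extra merge work on every read for users who follow them.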
Design Ad Click Event Aggregator
Ad Click Event Aggregator
Streaming aggregation pipeline with freshness and billing accuracy constraints.
Focus: Analytics pipeline for ad clicks and reporting.
Product context: near-realtime event aggregation with strict quality requirements for reporting and billing.
What to clarify in the interview
- Do we target realtime dashboards or hourly batch?
- What accuracy is required for billing use-cases?
- How do we handle out-of-order and late events?
Architecture focus
- Event ingestion + idempotent deduplication.
- Windowed aggregates + watermark strategy.
- Realtime + historical recomputation compatibility.
Typical risks
- Double counting from retries.
- Drift between online and offline numbers.
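Deduplication and windowed aggregation can be illustrated together. The sketch below drops repeated event IDs and counts clicks per ad in fixed (tumbling) windows keyed by event time, so out-of-order events still land in the correct window. The field names `event_id`, `ad_id`, and `ts` are assumptions, and watermark handling (deciding when a window is final) is not modeled.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def aggregate_clicks(
    events: List[dict], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count unique clicks per (window start, ad_id) using event time."""
    seen: Set[str] = set()
    counts: Counter = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue  # a retry of an already counted click
        seen.add(event["event_id"])
        # Align the event timestamp to the start of its tumbling window.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["ad_id"])] += 1
    return dict(counts)
```

In a streaming system the `seen` set would be bounded, for example kept only for the windows that are still open, since remembering every event ID forever does not scale.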
Design Object Storage Service
Object Storage
Object storage architecture, metadata/data split, and durability mechanisms.
Focus: Durable large-object storage for media and backups.
Product context: cost-efficient, high-durability storage with simple API and lifecycle controls.
What to clarify in the interview
- What durability/availability targets are needed?
- What object size distribution and R/W profile do we expect?
- Do we need versioning, lifecycle, and storage tiering?
Architecture focus
- Metadata/data separation with independent scaling.
- Erasure coding/replication and background repair.
- Multipart upload, checksum validation, pre-signed URLs.
Typical risks
- Metadata bottleneck at namespace scale.
- High cross-region replication cost.
Design Online Payment App
Payment System
Idempotency, auth/capture/refund flow, payment orchestration, and reconciliation.
Focus: Payment processing with strict correctness guarantees.
Product context: money movement where correctness and auditability are more important than raw latency.
What to clarify in the interview
- Which payment flows are required: auth/capture/refund/chargeback?
- What compliance and audit constraints apply?
- How are duplicate requests and partial failures handled?
Architecture focus
- Idempotency keys + transaction state machine.
- Double-entry ledger as source of truth.
- PSP reconciliation and compensating actions.
Typical risks
- Duplicate charges from retries.
- Ledger vs PSP mismatch.
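Idempotency keys are the standard defense against duplicate charges from retries: the client sends the same key with every retry, and the server returns the stored result instead of charging again. The sketch below keeps everything in memory; `PaymentService`, the key format, and the result fields are hypothetical.

```python
from typing import Dict, List, Tuple


class PaymentService:
    """Charges an account at most once per idempotency key."""

    def __init__(self) -> None:
        self.results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges: List[Tuple[str, str, int]] = []  # (charge_id, account, cents)

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A retry with a known key returns the original result unchanged.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, account, amount_cents))
        result = {"charge_id": charge_id, "status": "captured"}
        self.results[idempotency_key] = result
        return result
```

In a real system the key lookup and the charge record would have to be written in one transaction (or guarded by a unique constraint), otherwise two concurrent retries could both pass the check.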
Design Social Media App (Infrastructure View)
Social Media Infrastructure View
SLO-driven social platform operations: degradation, isolation, and observability.
Focus: Same domain, but from platform/operations perspective.
Product context: moving from feature-level design to SLO-driven runtime architecture and operability.
What to clarify in the interview
- What SLO/error budget applies to key user journeys?
- Where do we need autoscaling and graceful degradation?
- How do we limit blast radius across services?
Architecture focus
- Service boundaries, API contracts, and versioning.
- Observability baseline: logs/metrics/traces + alert routing.
- Deployment topology: multi-AZ rollout and rollback.
Typical risks
- Cascading failures without bulkheads.
- Opaque incidents without end-to-end tracing.
Design Room Reservation and Marketplace App
Airbnb
Marketplace search, availability calendar, and contention on booking slots.
Focus: Reservation marketplace with high contention on inventory.
Product context: search + atomic booking under race conditions and strict user trust expectations.
What to clarify in the interview
- What are hold/booking/cancel rules and confirmation SLA?
- How much concurrent contention per slot is expected?
- How should search/ranking/filtering work?
Architecture focus
- Inventory model + optimistic/pessimistic locking.
- Reservation workflow with hold timeout.
- Search index with eventual consistency vs transactional store.
Typical risks
- Overbooking under contention.
- Poor UX from slow confirmations.
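Optimistic locking prevents overbooking by making each booking conditional on the version the client last read. The sketch below models this as a compare-and-set on an in-memory slot; the `Inventory` class is hypothetical, and in a database the same check would typically be a single conditional update on a version column.

```python
from typing import Dict, Optional, Tuple


class Inventory:
    """Bookable slots guarded by a version number (optimistic locking)."""

    def __init__(self) -> None:
        # slot_id -> (version, booked_by)
        self._slots: Dict[str, Tuple[int, Optional[str]]] = {}

    def add_slot(self, slot_id: str) -> None:
        self._slots[slot_id] = (0, None)

    def read(self, slot_id: str) -> Tuple[int, Optional[str]]:
        return self._slots[slot_id]

    def book(self, slot_id: str, guest: str, expected_version: int) -> bool:
        """Book only if nobody changed the slot since `expected_version` was read."""
        version, booked_by = self._slots[slot_id]
        if version != expected_version or booked_by is not None:
            return False  # stale read or already taken: caller must re-read
        self._slots[slot_id] = (version + 1, guest)
        return True
```

When two guests read version 0 and both try to book, only the first write succeeds; the second fails the version check and has to be shown that the slot is gone. Pessimistic locking would instead block the second guest while the first holds the lock.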
Design Access Control and Authorization for Media App
Access Control for Media App
RBAC/ABAC/ReBAC, PDP/PEP split, auditability, and safe cache invalidation.
Focus: Authorization layer for media platform workflows.
Product context: unified policy model for web/mobile/admin with least-privilege defaults.
What to clarify in the interview
- What initial roles/resources are business-critical?
- Do we need RBAC only, or RBAC + ABAC/ReBAC?
- Do we need explain/audit APIs for investigations?
Architecture focus
- Policy decision point + policy enforcement point.
- Decision caching with correct invalidation.
- Tenant isolation and immutable audit trail.
Typical risks
- Privilege escalation from policy gaps.
- Stale authorization cache after policy changes.
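The split between a policy decision point (PDP) and a policy enforcement point (PEP) can be shown with a minimal RBAC example: `decide` answers yes or no from a role table, and `enforce` is what a service calls before doing the work. The roles, resources, and function names are invented for this sketch, and the default is deny.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical role table: role -> allowed (resource type, action) pairs.
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "viewer": {("media", "read")},
    "editor": {("media", "read"), ("media", "write")},
}


def decide(roles: Iterable[str], resource: str, action: str) -> bool:
    """PDP: allow only if some role explicitly grants the permission."""
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set()) for role in roles
    )


def enforce(roles: Iterable[str], resource: str, action: str) -> None:
    """PEP: called by the service; raises instead of performing the action."""
    if not decide(roles, resource, action):
        raise PermissionError(f"{action} on {resource} denied")
```

Keeping `decide` separate makes it possible to cache or audit decisions in one place, which is also where the stale-cache risk listed above has to be handled.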
Design Top Products Dashboard
Top Products Dashboard
Analytics serving layer with pre-aggregations, metric consistency, and freshness controls.
Focus: Analytical dashboard for product and operations teams.
Product context: decision-making UI where metric explainability and freshness transparency are essential.
What to clarify in the interview
- What refresh frequency and latency are acceptable?
- Which dimensions and filters are required?
- Do we need drill-down to raw events?
Architecture focus
- Dedicated serving layer for dashboard workloads.
- Pre-aggregations/materialized views for p95 latency.
- Combination of realtime stream + scheduled backfill.
Typical risks
- Heavy ad-hoc queries impacting OLTP.
- Metric mismatch across dashboards.
Conclusions and recommendations
Who is this book suitable for?
For those who want to understand "why": the book provides a methodological basis, not just a set of ready-made solutions.
For experienced engineers: the deep analysis of distributed transactions and common services will be especially useful.
For long-term learning: the book builds a systematic understanding of architecture rather than superficial knowledge.
Recommendation: Use this book as a primary source for understanding the process and methodology, and Alex Xu's book as a reference for specific systems.
