This Theme 3 chapter focuses on data-intensive flows, indexing, and feed mechanics. The goal is not only to propose a working design, but also to explain behavior under scale and failure pressure.
Use a stable structure: requirements -> architecture -> critical deep dive -> evolution. This makes the solution clear, defensible, and interview-ready.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Case-Solving Playbook
Map the data lifecycle
Phase 1: Describe the full pipeline from event source to serving layer explicitly.
Separate read/write paths
Phase 2: Optimize ingest throughput and user-facing query latency independently.
Define consistency contract
Phase 3: Set acceptable staleness windows and compensation mechanisms.
Plan for 10x/100x growth
Phase 4: Show partitioning, indexing, and storage-tier evolution under growth.
A recommendation system is a classic multi-stage case where you must optimize relevance, latency, and cost simultaneously. Interviewers expect you to split the system into candidate generation, ranking, and policy layers, then justify which metrics actually map to business outcomes.
Functional requirements
- Generate personalized recommendations for the home/feed surface.
- Support candidate generation, ranking, and re-ranking with business constraints.
- Use both implicit feedback (views, likes, watch time, clicks) and explicit signals.
- Expose explainability hints: why a recommendation was shown to the user.
Non-functional requirements
- p95 recommendation latency below 200 ms on the online path.
- Safe model evolution without downtime and with holdout quality controls.
- Predictable inference and feature-storage cost as MAU grows.
- Failure isolation with fallback modes for model serving and feature-store incidents.
Scale and assumptions
| Parameter | Assumption | Why it matters |
|---|---|---|
| DAU | 12M | Large consumer platform with a personalized feed as a core product surface. |
| Recommendation QPS | 180k (peak) | Strong session peaks and high fan-out across recommendation surfaces. |
| Candidate pool | 10M+ items | Large and frequently changing catalog of content/products. |
| Feature freshness | 1-5 minutes | Recent intent strongly changes ranking quality in many scenarios. |
| Availability | 99.95% | Recommendation quality directly impacts conversion and retention. |
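The assumptions above can be sanity-checked with quick back-of-envelope math. The candidate fan-out and feature-vector size below are illustrative assumptions, not values from the table:

```python
# Back-of-envelope sizing from the scale assumptions above.
DAU = 12_000_000
PEAK_QPS = 180_000
CANDIDATES_PER_REQUEST = 1_000   # assumed retrieval fan-out per request
FEATURES_PER_USER_BYTES = 2_048  # assumed online feature vector size

# Items scored per second at peak if every candidate reached the ranker.
# This is why candidate pruning before the heavy ranker is non-negotiable.
scored_per_sec = PEAK_QPS * CANDIDATES_PER_REQUEST
print(f"candidates scored/s at peak: {scored_per_sec:,}")

# Hot-tier online feature store footprint for all daily-active users.
hot_store_gb = DAU * FEATURES_PER_USER_BYTES / 1e9
print(f"hot feature store: ~{hot_store_gb:.0f} GB")
```

Even with these modest per-user numbers, the peak scoring load (180M candidate scores per second) makes clear that the full candidate pool cannot flow into the ranking model.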
High-Level Architecture
Stage 1: Candidate Generation
Fast retrieval from multiple sources: collaborative candidates, content-based retrieval, trending/popular, and editorial boosts.
Stage 2: Ranking
ML ranking with online/offline feature stores, user/item/context features, and multi-objective scoring (CTR, watch-time, conversion).
Stage 3: Re-ranking & Policy Layer
Diversification, business rules, caps, safety/abuse filters, cold-start fallback, and final response shaping.
Typical write/read cycle: user events enter a streaming bus, update online features, and a ranking service fetches candidates from retrieval paths before returning a policy-filtered list.
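The three stages and the read path can be sketched as a minimal cascade. The function names and the dedup/truncation details are illustrative, not tied to any specific production system:

```python
from typing import Callable

def recommend(user_id: str,
              retrievers: list[Callable[[str], list[str]]],
              rank: Callable[[str, list[str]], list[tuple[str, float]]],
              policy: Callable[[list[tuple[str, float]]], list[str]],
              k: int = 20) -> list[str]:
    # Stage 1: union candidates from several retrieval sources, deduplicated.
    candidates: list[str] = []
    seen: set[str] = set()
    for retrieve in retrievers:
        for item in retrieve(user_id):
            if item not in seen:
                seen.add(item)
                candidates.append(item)
    # Stage 2: model-based scoring of the pruned candidate set.
    scored = rank(user_id, candidates)
    # Stage 3: policy layer (business rules, safety filters), then truncate.
    return policy(sorted(scored, key=lambda s: s[1], reverse=True))[:k]
```

Keeping each stage behind a narrow callable interface makes it easy to swap retrieval sources or ranking models independently, which is exactly the read/write-path separation the playbook asks for.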
Deep Dives and trade-offs
Freshness vs stability
More frequent model/feature refresh improves adaptation to intent but increases quality drift risk and operational pressure on serving pipelines.
Exploration vs exploitation
Aggressive exploitation improves short-term CTR but can limit discovery. Controlled exploration (for example, bandit-based) reduces feedback-loop bias.
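One of the simplest controlled-exploration schemes is epsilon-greedy slate filling: with probability epsilon a slot goes to a random exploration candidate instead of the next exploited item. A minimal sketch (parameter names and the pool-deduplication detail are assumptions):

```python
import random

def epsilon_greedy_slate(ranked, explore_pool, epsilon=0.1, k=10, rng=None):
    """Fill each slot with the next-best exploited item, or with a random
    exploration candidate with probability epsilon."""
    rng = rng or random.Random()
    exploit = iter(ranked)
    pool = [i for i in explore_pool if i not in ranked]  # avoid duplicates
    slate = []
    while len(slate) < k:
        if pool and rng.random() < epsilon:
            slate.append(pool.pop(rng.randrange(len(pool))))
        else:
            nxt = next(exploit, None)
            if nxt is None:
                break  # exploited list exhausted
            slate.append(nxt)
    return slate
```

Bandit approaches such as Thompson sampling replace the fixed epsilon with posterior sampling per candidate, but the slate-filling shape stays the same.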
Model quality vs latency/cost
A heavier model can improve ranking quality but may break latency budgets and increase inference cost. Multi-stage models and budget-aware routing are common mitigations.
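Budget-aware routing can be sketched as a two-tier cascade: score everything with a cheap model, then re-score only the head with the heavy model while the request's latency budget holds. Function names, the budget check, and the top-slice size are illustrative assumptions:

```python
import time

def budget_aware_rank(candidates, light_score, heavy_score,
                      budget_ms=50.0, heavy_top=100):
    """Cheap-model pass over everything, heavy-model pass over the top
    slice only, falling back to the cheap score once the budget is spent."""
    start = time.monotonic()
    prelim = sorted(candidates, key=light_score, reverse=True)
    head, tail = prelim[:heavy_top], prelim[heavy_top:]
    rescored = []
    for item in head:
        if (time.monotonic() - start) * 1000 > budget_ms:
            rescored.append((item, light_score(item)))  # budget hit: keep cheap score
        else:
            rescored.append((item, heavy_score(item)))
    rescored.sort(key=lambda p: p[1], reverse=True)
    return [i for i, _ in rescored] + tail
```

The key property is graceful degradation: when the budget is exhausted mid-request, quality drops to the light model's level instead of the request timing out.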
Personalization vs explainability
Deep personalization is harder to explain to users and stakeholders. Teams usually add reason codes and explicit policy boundaries in the final layer.
Common anti-patterns
- Using a single heavy ranker without candidate pruning, which breaks latency budgets at peak.
- Training only on clicks while ignoring delayed signals such as retention, long watch time, and churn.
- No fallback strategy: recommendation output disappears during feature-store incidents.
- No distribution-shift monitoring: offline metrics look good while online KPIs degrade for weeks.
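The missing-fallback anti-pattern has a small structural fix: a tiered degradation chain that never returns an empty surface. The source names here are hypothetical placeholders:

```python
def recommend_with_fallback(user_id, personalized, popular_cache, editorial_defaults):
    """Degrade tier by tier: personalized model -> cached popular items ->
    static editorial list, so the surface never returns empty."""
    for source in (personalized, popular_cache):
        try:
            items = source(user_id)
            if items:
                return items
        except Exception:
            continue  # e.g. feature-store or model-serving incident: try next tier
    return editorial_defaults
```

In practice each tier should also emit a metric tagging which fallback level served the request, so degraded traffic is visible on dashboards rather than silently blending into normal traffic.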
Interview prompts to cover
- How does the online path work end to end, and where is the most expensive component?
- Which metrics do you choose: offline (NDCG/Recall@K) and online (CTR, dwell time, conversion)?
- How do you handle cold start for both new users and new catalog items?
- What is your degradation plan if the feature store, ANN index, or model serving layer is unavailable?
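For the metrics question, it helps to be able to write Recall@K and NDCG@K from memory. A minimal sketch using the standard definitions (graded gains for NDCG, binary relevance for recall):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant set retrieved in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """DCG of the top k, normalized by the DCG of an ideal ordering."""
    dcg = sum(gains.get(item, 0.0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Offline, these rank the model; online, CTR, dwell time, and conversion decide whether offline gains actually transfer, which is exactly the distribution-shift risk listed under anti-patterns.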
References
- Deep Neural Networks for YouTube Recommendations - Classic publication describing the two-stage retrieval + ranking architecture.
- Netflix Tech Blog - Production-oriented posts about recommendation platform evolution.
Related chapters
- Search System - A close retrieval/ranking pattern with similar relevance trade-offs.
- Twitter/X - Feed personalization and fan-out under high QPS.
- A/B Testing platform - Experimentation workflow for recommendation quality and guardrail metrics.
- Precision and Recall - Metrics foundation for ranking quality and threshold selection.
