A recommendation system is less about the model itself and more about candidate generation, ranking, feedback loops, and a hard user-facing latency budget.
The case helps connect offline training, online features, retrieval, re-ranking, cache locality, and signal freshness into a single serving system.
For interviews and design reviews, it is useful because it reveals whether you can discuss cold start, evaluation trade-offs, and the way product goals shape architecture.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Related chapter
- Machine Learning System Design - ML case-study framework: problem framing, metrics, data, and production risks.
A recommendation system is a classic multi-stage design case where you must optimize relevance, latency, and cost at the same time. Interviewers expect you to split the system into candidate generation, ranking, and policy layers, then justify which metrics actually map to business outcomes.
Functional requirements
- Generate personalized recommendations for the home/feed surface.
- Support candidate generation, ranking, and re-ranking with business constraints.
- Use both implicit feedback (views, likes, watch time, clicks) and explicit signals.
- Expose explainability hints: why a recommendation was shown to the user.
Non-functional requirements
- p95 recommendation latency below 200 ms on the online path.
- Safe model evolution without downtime and with holdout quality controls.
- Predictable inference and feature-storage cost as MAU grows.
- Failure isolation with fallback modes for model serving and feature-store incidents.
Scale and assumptions
| Parameter | Assumption | Why it matters |
|---|---|---|
| DAU | 12M | Large consumer platform with a personalized feed as a core product surface. |
| Recommendation QPS | 180k (peak) | Strong session peaks and high fan-out across recommendation surfaces. |
| Candidate pool | 10M+ items | Large and frequently changing catalog of content/products. |
| Feature freshness | 1-5 minutes | Recent intent strongly changes ranking quality in many scenarios. |
| Availability | 99.95% | Recommendation quality directly impacts conversion and retention. |
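The assumptions above can be sanity-checked with quick back-of-envelope arithmetic. The per-stage budget split and the candidates-per-request figure below are illustrative assumptions, not numbers from the table:

```python
# Back-of-envelope check against the scale assumptions above.
peak_qps = 180_000      # recommendation QPS at peak
p95_budget_ms = 200     # end-to-end online latency budget

# An assumed split of the 200 ms budget across the online stages:
stage_budget_ms = {
    "candidate_generation": 50,   # ANN / retrieval fan-out
    "ranking": 100,               # feature fetch + model inference
    "re_ranking_policy": 30,      # diversification, rules, safety filters
    "network_serialization": 20,  # transport and response shaping
}
assert sum(stage_budget_ms.values()) <= p95_budget_ms

# If ranking scores ~500 candidates per request (assumed), peak model
# throughput is the product of QPS and candidates per request:
candidates_per_request = 500
scores_per_second = peak_qps * candidates_per_request
print(f"peak model scores/sec: {scores_per_second:,}")  # 90,000,000
```

Even this rough math shows why a single heavy ranker over the full pool is infeasible: the scoring throughput requirement forces aggressive candidate pruning.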
High-Level Architecture
Stage 1: Candidate Generation
Fast retrieval from multiple sources: collaborative candidates, content-based retrieval, trending/popular, and editorial boosts.
Stage 2: Ranking
ML ranking with online/offline feature stores, user/item/context features, and multi-objective scoring (CTR, watch-time, conversion).
Stage 3: Re-ranking & Policy Layer
Diversification, business rules, caps, safety/abuse filters, cold-start fallback, and final response shaping.
Typical event-to-response cycle: user events enter a streaming bus and update online features; on each request, the ranking service fetches candidates from the retrieval paths, scores them, and returns a policy-filtered list.
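The three stages can be sketched as a minimal pipeline. The function signatures, the union-with-dedup candidate merge, and the final top-k cut are illustrative assumptions, not a prescribed API:

```python
def recommend(user_id, retrievers, rank, apply_policy, k=20):
    """Three-stage flow: retrieve -> rank -> re-rank/policy."""
    # Stage 1: union candidates from several cheap retrieval sources,
    # deduplicating while preserving first-seen order.
    candidates, seen = [], set()
    for retrieve in retrievers:
        for item in retrieve(user_id):
            if item not in seen:
                seen.add(item)
                candidates.append(item)

    # Stage 2: the ML ranker returns (item, score) pairs over the pruned pool.
    scored = rank(user_id, candidates)
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # Stage 3: the policy layer shapes the final list (diversity, caps,
    # safety), then the response is cut to the surface's slot count.
    return apply_policy(scored)[:k]
```

With stub retrievers and an identity policy, `recommend("u1", [collab, trending], rank, policy)` returns the policy-filtered top-k; in production each stage would be a separate service behind its own latency budget.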
Deep Dives and trade-offs
Freshness vs stability
More frequent model/feature refresh improves adaptation to current intent but raises the risk of quality regressions and adds operational pressure on serving pipelines.
Exploration vs exploitation
Aggressive exploitation improves short-term CTR but can limit discovery. Controlled exploration (for example, bandit-based) reduces feedback-loop bias.
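A minimal epsilon-greedy sketch of controlled exploration: with small probability, one ranked slot is swapped for an exploration candidate. The slot choice and the 5% default are assumptions for illustration; production systems typically use contextual bandits rather than a flat epsilon:

```python
import random

def epsilon_greedy_slot(ranked_items, explore_pool, epsilon=0.05, rng=None):
    """With probability epsilon, replace the last slot with an exploration
    candidate so under-exposed items can collect feedback."""
    rng = rng or random.Random()
    result = list(ranked_items)
    if explore_pool and result and rng.random() < epsilon:
        result[-1] = rng.choice(explore_pool)
    return result
```

Logging which slots were exploratory is essential: without that flag, exploration traffic contaminates the training data the exploiter learns from.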
Model quality vs latency/cost
A heavier model can improve ranking quality but may break latency budgets and increase inference cost. Multi-stage models and budget-aware routing are common mitigations.
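The multi-stage mitigation can be sketched as a two-model cascade: a cheap model prunes the pool, and the heavy model scores only the survivors. The scoring functions and cut sizes here are stand-in assumptions:

```python
def cascade_rank(candidates, cheap_score, heavy_score, keep_top=100, final_k=20):
    """Two-stage cascade: a cheap model prunes the candidate pool,
    then an expensive model ranks only the survivors."""
    pruned = sorted(candidates, key=cheap_score, reverse=True)[:keep_top]
    return sorted(pruned, key=heavy_score, reverse=True)[:final_k]
```

The latency win comes from invoking the heavy model `keep_top` times instead of once per candidate; budget-aware routing extends this by shrinking `keep_top` dynamically when the request is close to its deadline.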
Personalization vs explainability
Deep personalization is harder to explain to users and stakeholders. Teams usually add reason codes and explicit policy boundaries in the final layer.
Common anti-patterns
- Using a single heavy ranker without candidate pruning, which breaks latency budgets at peak.
- Training only on clicks while ignoring delayed metrics such as retention, long watch-time, and churn signals.
- No fallback strategy: recommendation output disappears during feature-store incidents.
- No distribution-shift monitoring: offline metrics look good while online KPIs degrade for weeks.
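The missing-fallback anti-pattern is avoidable with a simple degradation ladder. The tier names and the returned mode tag below are illustrative assumptions:

```python
def recommend_with_fallback(user_id, personalized, popular_cache, default_items):
    """Degradation ladder: personalized -> cached popular -> static editorial.
    Returns (items, mode) so the caller can log which tier served the request."""
    try:
        items = personalized(user_id)
        if items:
            return items, "personalized"
    except Exception:
        pass  # feature-store or model-serving incident: degrade, don't fail
    if popular_cache:
        return list(popular_cache), "popular"
    return list(default_items), "editorial"
```

Tracking the served-mode ratio doubles as an availability signal: a spike in `"popular"` responses is often the first visible symptom of a feature-store incident.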
Interview prompts to cover
- How does the online path work end to end, and where is the most expensive component?
- Which metrics do you choose: offline (NDCG/Recall@K) and online (CTR, dwell time, conversion)?
- How do you handle cold start for both new users and new catalog items?
- What is your degradation plan if the feature store, ANN index, or model serving layer is unavailable?
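The offline metrics named above can be computed as follows, using binary relevance for NDCG (a common simplification; graded-relevance variants replace the 0/1 gain with per-item labels):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: position-discounted gain over the ideal gain."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

A good interview answer pairs these offline metrics with online guardrails: a model can raise NDCG on logged data while hurting dwell time or diversity once deployed.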
References
- Deep Neural Networks for YouTube Recommendations - Classic publication describing the two-stage retrieval + ranking architecture.
- Netflix Tech Blog - Production-oriented posts about recommendation platform evolution.
Related chapters
- Search System - A close retrieval/ranking pattern with similar relevance trade-offs.
- Twitter/X - Feed personalization and fan-out under high QPS.
- A/B Testing platform - Experimentation workflow for recommendation quality and guardrail metrics.
- Precision and Recall - Metrics foundation for ranking quality and threshold selection.
