Recommendation System — System Design Space

A recommendation system is less about the model itself and more about candidate generation, ranking, feedback loops, and a hard user-facing latency budget.

The case helps connect offline training, online features, retrieval, re-ranking, cache locality, and signal freshness into a single serving system.

For interviews and design reviews, it is useful because it reveals whether you can discuss cold start, evaluation trade-offs, and the way product goals shape architecture.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Related chapter

Machine Learning System Design

ML case-study framework: problem framing, metrics, data, and production risks.


A recommendation system is a classic multi-stage design case in which you must optimize relevance, latency, and cost at the same time. Interviewers expect you to split the system into candidate generation, ranking, and policy layers, then justify which metrics actually map to business outcomes.

Functional requirements

Generate personalized recommendations for the home/feed surface.
Support candidate generation, ranking, and re-ranking with business constraints.
Use both implicit feedback (views, likes, watch time, clicks) and explicit signals.
Expose explainability hints: why a recommendation was shown to the user.

Non-functional requirements

p95 recommendation latency below 200 ms on the online path.
Safe model rollout without downtime, gated by holdout-based quality checks.
Predictable inference and feature-storage cost as MAU grows.
Failure isolation with fallback modes for model serving and feature-store incidents.
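
The fallback requirement can be illustrated with a minimal sketch of a degradation chain. The provider names and the `recommend_with_fallback` helper are hypothetical, chosen only for illustration. The idea is to try the best source first and step down to cheaper, less personalized ones when a call fails or returns nothing.

```python
from typing import Callable, Sequence

# A provider takes a user id and returns a list of item ids (hypothetical interface).
Provider = Callable[[str], list[str]]

def recommend_with_fallback(
    user_id: str,
    providers: Sequence[tuple[str, Provider]],
) -> tuple[str, list[str]]:
    """Return (provider_name, items) from the first provider that succeeds.

    Providers are ordered from most to least personalized. A provider that
    raises or returns an empty list is skipped.
    """
    for name, provider in providers:
        try:
            items = provider(user_id)
        except Exception:
            # A real service would log the error and emit a metric here.
            continue
        if items:
            return name, items
    return "empty", []

# Illustrative providers: the personalized path is failing, the cache is cold.
def personalized(user_id: str) -> list[str]:
    raise TimeoutError("feature store unavailable")

def cached(user_id: str) -> list[str]:
    return []

def popular(user_id: str) -> list[str]:
    return ["p1", "p2", "p3"]

CHAIN = [("personalized", personalized), ("cached", cached), ("popular", popular)]
```

Calling `recommend_with_fallback("u1", CHAIN)` returns the popular list together with the name of the mode that served it, so the degraded mode stays visible in logs and dashboards.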

Scale and assumptions

Parameter	Assumption	Why it matters
DAU	12M	Large consumer platform with a personalized feed as a core product surface.
Recommendation QPS	180k (peak)	Strong session peaks and high fan-out across recommendation surfaces.
Candidate pool	10M+ items	Large and frequently changing catalog of content/products.
Feature freshness	1-5 minutes	Recent intent strongly changes ranking quality in many scenarios.
Availability	99.95%	Recommendation quality directly impacts conversion and retention.
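
A rough back-of-envelope sketch shows what these assumptions imply. The helper names are illustrative. The concurrency estimate applies Little's law with the 200 ms p95 target standing in for average latency, so it is a conservative figure rather than a measurement.

```python
# Assumptions copied from the table and the non-functional requirements.
PEAK_QPS = 180_000
AVAILABILITY = 0.9995
LATENCY_BUDGET_MS = 200

def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of allowed unavailability over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60

def in_flight_requests(qps: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return qps * latency_ms / 1000.0

def requests_per_hour(qps: float) -> float:
    """Request volume if the given rate were sustained for one hour."""
    return qps * 3600
```

Under these assumptions, 99.95% availability leaves about 21.6 minutes of downtime per 30 days, and 180k QPS at 200 ms corresponds to roughly 36,000 requests in flight at peak.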

High-Level Architecture

Stage 1: Candidate Generation

Fast retrieval from multiple sources: collaborative candidates, content-based retrieval, trending/popular, and editorial boosts.
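
One possible way to combine several retrieval sources is a round-robin merge with deduplication, sketched below. The function name and the source labels are assumptions made for illustration. Production systems often use per-source quotas or learned blending instead.

```python
from itertools import zip_longest

def merge_candidates(
    sources: dict[str, list[str]],
    limit: int,
) -> list[tuple[str, str]]:
    """Interleave candidates from all sources and drop duplicates.

    Returns (item_id, source_name) pairs. The source that surfaced an item
    first is kept as its provenance, which later helps with reason codes.
    """
    if limit <= 0:
        return []
    seen: set[str] = set()
    merged: list[tuple[str, str]] = []
    names = list(sources.keys())
    # Take the 1st item of every source, then the 2nd of every source, etc.
    for tier in zip_longest(*sources.values()):
        for name, item in zip(names, tier):
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append((item, name))
            if len(merged) >= limit:
                return merged
    return merged

SOURCES = {
    "collaborative": ["a", "b", "c"],
    "content": ["b", "d"],
    "trending": ["a", "e"],
}
```

Round-robin keeps any single source from crowding out the others before the ranker sees the pool, at the cost of ignoring per-source retrieval scores.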

Stage 2: Ranking

ML ranking with online/offline feature stores, user/item/context features, and multi-objective scoring (CTR, watch-time, conversion).
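
Multi-objective scoring is often implemented as a weighted blend of per-objective predictions. The sketch below assumes the model already outputs a click probability, an expected watch time, and a conversion probability. The weights are hypothetical placeholders that would normally be tuned through online experiments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predictions:
    p_click: float                 # predicted click probability
    expected_watch_seconds: float  # predicted watch time
    p_conversion: float            # predicted conversion probability

# Hypothetical weights; they also put the objectives on a comparable scale.
W_CLICK = 1.0
W_WATCH = 0.01
W_CONVERSION = 5.0

def blended_score(p: Predictions) -> float:
    """Linear combination of the per-objective predictions."""
    return (
        W_CLICK * p.p_click
        + W_WATCH * p.expected_watch_seconds
        + W_CONVERSION * p.p_conversion
    )

def rank(candidates: dict[str, Predictions]) -> list[str]:
    """Return item ids sorted by blended score, best first."""
    return sorted(candidates, key=lambda i: blended_score(candidates[i]), reverse=True)

EXAMPLE = {
    "a": Predictions(0.10, 30.0, 0.01),
    "b": Predictions(0.20, 5.0, 0.00),
    "c": Predictions(0.05, 60.0, 0.02),
}
```

In this example item `b` has the highest click probability but ranks last, because the blend also rewards watch time and conversion.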

Stage 3: Re-ranking & Policy Layer

Diversification, business rules, caps, safety/abuse filters, cold-start fallback, and final response shaping.
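
A minimal sketch of a policy pass over an already ranked list is shown below. It assumes a simple per-category cap as the diversification rule and a blocklist as the safety filter. Both are illustrative simplifications and the names are hypothetical.

```python
def rerank(
    ranked: list[str],
    category_of: dict[str, str],
    blocked: set[str],
    max_per_category: int,
    k: int,
) -> list[str]:
    """Walk the ranked list in order, applying filters and caps.

    Items keep their relative ranking order. Blocked items are dropped and
    no category may take more than `max_per_category` slots.
    """
    result: list[str] = []
    per_category: dict[str, int] = {}
    for item in ranked:
        if len(result) >= k:
            break
        if item in blocked:
            continue
        category = category_of[item]
        if per_category.get(category, 0) >= max_per_category:
            continue
        per_category[category] = per_category.get(category, 0) + 1
        result.append(item)
    return result

RANKED = ["a", "b", "c", "d", "e"]
CATEGORY = {"a": "sports", "b": "sports", "c": "sports", "d": "news", "e": "music"}
```

A greedy pass like this is cheap and predictable, which matters because it runs on every request after the expensive ranking step.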

Typical write/read cycle: on the write path, user events enter a streaming bus and update online features; on the read path, the ranking service fetches candidates from the retrieval paths, scores them, and returns a policy-filtered list.
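
The write side of this cycle can be sketched as a sliding-window counter, a common shape for a fresh online feature such as "clicks in the last five minutes". The class below is an in-memory illustration with hypothetical names. It assumes events for a user arrive in timestamp order, and a real deployment would keep this state in a stream processor or online feature store.

```python
from collections import defaultdict, deque

class RecentEventCounter:
    """Counts events per user within a trailing time window (in seconds)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, user_id: str, timestamp: float) -> None:
        """Write path: called by the stream consumer for each event."""
        self._events[user_id].append(timestamp)

    def count(self, user_id: str, now: float) -> int:
        """Read path: called by the ranking service at request time."""
        events = self._events.get(user_id)
        if not events:
            return 0
        cutoff = now - self.window_seconds
        # Evict events that fell out of the window (oldest are on the left).
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

The window length is the knob behind the freshness target: a shorter window reacts faster to intent changes but gives a noisier signal.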

Deep Dives and trade-offs

Freshness vs stability

Refreshing models and features more often improves adaptation to user intent, but it raises the risk of quality drift and puts more operational pressure on serving pipelines.

Exploration vs exploitation

Aggressive exploitation improves short-term CTR but can limit discovery. Controlled exploration (for example, bandit-based) reduces feedback-loop bias.
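
As a hedged illustration of controlled exploration, the sketch below uses epsilon-greedy slot mixing, one of the simplest bandit-style schemes. It is not a full bandit with reward estimates. The function name, the exploration pool, and the `epsilon` value are assumptions made for the example.

```python
import random

def fill_slots(
    ranked: list[str],
    explore_pool: list[str],
    k: int,
    epsilon: float,
    rng: random.Random,
) -> list[str]:
    """Fill k slots, taking each one from the exploration pool with
    probability epsilon and from the ranked list otherwise."""
    exploit = list(ranked)
    explore = list(explore_pool)
    rng.shuffle(explore)  # explore uniformly at random within the pool
    chosen: list[str] = []
    while len(chosen) < k and (exploit or explore):
        use_explore = bool(explore) and (not exploit or rng.random() < epsilon)
        source = explore if use_explore else exploit
        item = source.pop(0)
        if item not in chosen:  # an item may appear in both lists
            chosen.append(item)
    return chosen
```

With `epsilon=0.0` the output is pure exploitation, and raising it trades short-term CTR for coverage of items the ranker has little feedback on. Logging which slots were exploratory is what later allows less biased training data.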

Model quality vs latency/cost

A heavier model can improve ranking quality but may break latency budgets and increase inference cost. Multi-stage models and budget-aware routing are common mitigations.
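
Budget-aware routing can be sketched as choosing the best model tier whose expected latency still fits the time left in the request. The tier names and latency figures below are hypothetical and serve only to show the mechanism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    p95_latency_ms: float  # measured serving latency for this tier

# Ordered from highest to lowest expected ranking quality (illustrative values).
TIERS = [
    ModelTier("heavy_ranker", 80.0),
    ModelTier("light_ranker", 20.0),
    ModelTier("popularity_heuristic", 1.0),
]

def choose_tier(remaining_budget_ms: float, tiers: list[ModelTier]) -> ModelTier:
    """Pick the first (best) tier that fits the remaining latency budget.

    Falls back to the last tier when nothing fits, so the request always
    gets some answer.
    """
    for tier in tiers:
        if tier.p95_latency_ms <= remaining_budget_ms:
            return tier
    return tiers[-1]
```

For example, if retrieval and feature fetches have already consumed most of a 200 ms budget, the router drops to the light ranker instead of letting the heavy one push the request past its deadline.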

Personalization vs explainability

Deep personalization is harder to explain to users and stakeholders. Teams usually add reason codes and explicit policy boundaries in the final layer.

Common anti-patterns

Using a single heavy ranker without candidate pruning, which breaks latency budgets at peak.

Training only on clicks while ignoring delayed metrics such as retention, long watch-time, and churn signals.

No fallback strategy: recommendation output disappears during feature-store incidents.

No distribution-shift monitoring: offline metrics look good while online KPIs degrade for weeks.
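
One common way to watch for distribution shift is the Population Stability Index (PSI), which compares the binned distribution of a feature or score at training time with the one seen in production. The sketch below assumes both histograms use the same bins. The often-quoted thresholds of roughly 0.1 and 0.25 are rules of thumb, not guarantees.

```python
import math

def psi(expected_counts: list[float], actual_counts: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms with identical bins.

    PSI = sum over bins of (actual_frac - expected_frac) * ln(actual_frac / expected_frac)
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must have the same number of bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Clamp fractions away from zero so the logarithm stays defined.
        expected_frac = max(expected / expected_total, eps)
        actual_frac = max(actual / actual_total, eps)
        value += (actual_frac - expected_frac) * math.log(actual_frac / expected_frac)
    return value
```

Running this per feature on a schedule and alerting on sustained increases is one way to catch the case where offline metrics look fine while live traffic has moved away from the training data.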

Interview prompts to cover

How does the online path work end to end, and where is the most expensive component?
Which metrics do you choose: offline (NDCG/Recall@K) and online (CTR, dwell time, conversion)?
How do you handle cold start for both new users and new catalog items?
What is your degradation plan if the feature store, ANN index, or model serving layer is unavailable?
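
For the metrics prompt, it helps to be able to write the offline metrics down. The sketch below gives plain-Python versions of Recall@K and NDCG@K with graded relevance and a log2 position discount. The function names are illustrative.

```python
import math

def recall_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains[:k]))

def ndcg_at_k(recommended: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the list divided by the DCG of the ideal ordering."""
    gains = [relevance.get(item, 0.0) for item in recommended]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Recall@K fits the candidate generation stage, where the question is whether good items made it into the pool at all. NDCG@K fits the ranking stage, where the order within the top of the list matters.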

References

Deep Neural Networks for YouTube Recommendations - Classic publication describing the two-stage retrieval + ranking architecture.
Netflix Tech Blog - Production-oriented posts about recommendation platform evolution.

Related chapters

Search System - A close retrieval/ranking pattern with similar relevance trade-offs.
Twitter/X - Feed personalization and fan-out under high QPS.
A/B Testing platform - Experimentation workflow for recommendation quality and guardrail metrics.
Precision and Recall - Metrics foundation for ranking quality and threshold selection.