Ranking and Recommendation Architecture for ML Systems

Ranking and recommendation systems are a useful case study because they show ML not as a single model but as an operating loop of candidate generation, ranking, policy, and feedback.

The chapter ties freshness, product constraints, degraded modes, and experimentation into one serving architecture.

That is especially useful in interviews where you need to explain why recommendation quality depends on more than the model alone.

Practical value of this chapter

Ranking loop

Break candidate generation, the ranker, policy, and list assembly into separate controlled layers.

Feedback and experiments

Understand how exposures, reactions, and experiments shape the next system release.

Freshness and constraints

Connect freshness, diversity, product rules, and latency budgets in one serving architecture.

Interview material

Use a concrete ranking case instead of vague recommender-system talk.

Related chapter

Recommendation System

A broader case where ranking lives inside the full product architecture of a recommender system.


Ranking and recommendation architecture is not “train on clicks and sort the list.” It is an operating loop in which candidate generation, ranking, product rules, and feedback shape the outcome as much as the model itself.

Problem and context

The product has several surfaces: a home feed, similar-item recommendations, checkout suggestions, and blended search. They all draw on one shared ranking loop, and that loop has to fit the latency budget, survive cold start, respect product rules, and avoid learning from its own distortions.

Functional requirements

Build personalized lists for feeds, recommendation blocks, similar items, and blended search surfaces.
Support a multi-stage loop: candidate generation, context fetch, ranking, policy, and final list assembly.
Respect product constraints such as diversity, freshness, safety, inventory limits, ad separation, and editorial rules.
Collect exposures, clicks, skips, hides, dwell time, and delayed outcomes for training and experiments.
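The multi-stage loop in the requirements above can be sketched as a chain of pluggable stages. This is a minimal illustration under stated assumptions, not the chapter's implementation: the `Candidate` type, the stage functions, and the toy popularity scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    item_id: str
    category: str
    score: float = 0.0
    features: dict = field(default_factory=dict)


def serve_list(user_id, generate, fetch_context, rank, apply_policy, k=3):
    """Run the stages in order; each stage can be swapped or degraded on its own."""
    candidates = generate(user_id)                  # cheap, recall-oriented
    enriched = fetch_context(user_id, candidates)   # attach features
    ranked = rank(enriched)                         # model score only
    allowed = apply_policy(ranked)                  # product rules after scoring
    return [c.item_id for c in allowed[:k]]         # list assembly


# Hypothetical toy stages, only to make the sketch runnable.
def generate(user_id):
    return [Candidate("a", "shoes"), Candidate("b", "shoes"),
            Candidate("c", "hats"), Candidate("d", "bags")]


def fetch_context(user_id, candidates):
    popularity = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.6}
    for c in candidates:
        c.features["popularity"] = popularity.get(c.item_id, 0.0)
    return candidates


def rank(candidates):
    for c in candidates:
        c.score = c.features["popularity"]
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def apply_policy(ranked):
    blocked = {"b"}  # e.g. an out-of-stock or unsafe item
    return [c for c in ranked if c.item_id not in blocked]


result = serve_list("u1", generate, fetch_context, rank, apply_policy, k=3)
print(result)
```

Keeping each stage behind its own function boundary is what lets the system measure, replace, or bypass one layer without touching the others.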

Non-functional requirements

Keep p95 end-to-end list latency below 180 ms for user-facing surfaces.
Provide a fallback path: a popular list, a cached list, or a lighter ranking route when dependencies degrade.
Maintain stable freshness SLAs for new items and catalog changes so the surface does not feel stale.
Expose observability for segment quality, exposure coverage, policy override rate, and tail latency at every stage.

Load and scale assumptions

DAU

20M+

The list changes by surface, segment, device, and session context, so ranking behaves like a shared serving runtime.

Candidate pool

100K-10M items

You cannot send the whole catalog into an expensive ranker, so the first stage must stay cheap and recall-oriented.
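A cheap first stage might look like the sketch below: hard filters first, then a dot-product score against a user embedding, keeping only the top `n`. The vectors, the `in_stock` filter, and the function names are assumptions made for illustration. A brute-force scan like this would not hold up on a catalog of millions of items, where an approximate nearest-neighbor index is the usual substitute.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve_candidates(user_vec, item_vecs, in_stock, n):
    """Return the n in-stock item ids with the highest dot-product score."""
    eligible = ((item_id, vec) for item_id, vec in item_vecs.items()
                if item_id in in_stock)            # hard filter before scoring
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in eligible)
    return [item_id for _, item_id in heapq.nlargest(n, scored)]


# Hypothetical two-dimensional embeddings.
item_vecs = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.2],
    "c": [0.0, 1.0],
    "d": [0.9, 0.1],  # strong match, but filtered out as unavailable
}
top = retrieve_candidates([1.0, 0.0], item_vecs, in_stock={"a", "b", "c"}, n=2)
print(top)
```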

Peak QPS

80K

Peaks depend on campaigns, prime time, notifications, and search bursts.

Freshness target

<= 5 min for hot inventory

New items, price changes, and availability updates must reach candidate generation and policy quickly.
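One way to make the freshness target checkable is to compare catalog change times with index update times. The sketch below assumes both are available as Unix timestamps keyed by item id and takes the 5-minute target as `sla_seconds=300`. The function and data are hypothetical.

```python
def stale_items(catalog_updated_at, index_updated_at, now, sla_seconds=300):
    """Return ids whose latest catalog change is missing from the index for longer than the SLA."""
    stale = []
    for item_id, changed_at in catalog_updated_at.items():
        indexed_at = index_updated_at.get(item_id)
        missing = indexed_at is None or indexed_at < changed_at
        if missing and now - changed_at > sla_seconds:
            stale.append(item_id)
    return sorted(stale)


catalog = {"a": 1000, "b": 1200, "c": 1500}
index = {"a": 1010, "b": 900}   # "b" was indexed before its last change, "c" not at all
violations = stale_items(catalog, index, now=1600)
print(violations)
```

Item "c" is not reported yet because its change is only 100 seconds old and still inside the SLA window.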

Label delay

hours to weeks

Retention and long-term value appear much later than clicks, so instant CTR is never enough by itself.

Reference ranking architecture

It helps to read the system as a stack of layers: from catalog signals through features and ranking to list assembly and the next learning cycle.

Signals and catalog

catalog, user profile, session context, editorial signals

Candidate generation

candidate retrieval, embeddings, graph signals, hard filters

Context and feature layer

freshness, feature cache, user signals, item signals

Ranking and policy

ranker, diversity, business rules, safety constraints

List assembly and fallback

list assembly, popular list, cached list, empty state

Feedback and experimentation

exposure logging, A/B, delayed signals, next training cycle

What to keep under control

It helps to view ranking not only as a chain of models, but as a balance of list quality, live constraints, and how fast the next learning cycle can move.

Ranking economics

CTR vs retention, conversion, complaint rate, ad separation

Live constraints

p95 latency, freshness SLA, cache hit rate, tail latency

Learning loop

cold start, exploration budget, counterfactual evaluation, segment review

Below, the chapter separates the read path from the write path. The second one matters just as much, because that is where exposures, reactions, and the next release actually take shape.

How the ranking system serves a list and writes feedback

Comparing the synchronous serving path with the delayed feedback path


Synchronous list-serving path

1. Request and surface context

The system receives the request, identifies the product surface, the user, the session, and the active product constraints.


Tightly constrained by latency.
Good candidates must not be dropped too early.
Policy and fallback can change the final ordering a lot.

Latency budget, Freshness, Fallback

Key deep dives

This topic becomes much clearer once you separate cold start, exposure bias, and degraded modes instead of treating ranking as a single-model problem.

Cold start and fallback lists

For new users and new items, the system must survive without rich behavioral history: popular lists, editorial priors, content features, and a lightweight exploration path are essential from day one.
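One common way to let a prior stand in for missing history is additive smoothing of the click-through rate. This is a swapped-in illustration rather than a method the chapter prescribes. The sketch assumes `prior_ctr` comes from something like a category average or an editorial estimate, and `prior_weight` is a hypothetical tuning knob for how many impressions the prior is worth.

```python
def smoothed_ctr(clicks, impressions, prior_ctr, prior_weight=100):
    """Blend observed CTR with a prior; the prior dominates until real impressions accumulate."""
    return (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)


new_item = smoothed_ctr(clicks=0, impressions=0, prior_ctr=0.05)
lucky_item = smoothed_ctr(clicks=2, impressions=2, prior_ctr=0.05)
mature_item = smoothed_ctr(clicks=900, impressions=10_000, prior_ctr=0.05)
print(new_item, lucky_item, mature_item)
```

A brand-new item scores exactly the prior, and an item with two clicks from two impressions stays below a mature item with a 9% observed CTR, so a handful of early clicks cannot dominate the list.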

Freshness vs stability

Letting new items and signals into the loop quickly makes the product more responsive, but you pay for it in result stability: regressions get harder to reproduce and trace by segment and surface.

Exploration vs exploitation

If the system shows only what it already knows how to sell, it stops learning. Exploration budgets must be controlled, segment-aware, and measurable rather than random noise.
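A controlled exploration budget can be as simple as reserving one slot with a fixed probability and recording when that happened. The sketch below is an epsilon-greedy illustration with hypothetical names. A segment-aware version would look up `epsilon` per segment instead of passing one constant.

```python
import random


def assemble_with_exploration(ranked, explore_pool, k, epsilon, rng):
    """Fill k slots from ranked; with probability epsilon give the last slot to an exploration item."""
    slate = list(ranked[:k])
    explored = None
    if slate and explore_pool and rng.random() < epsilon:
        explored = rng.choice(explore_pool)
        slate[-1] = explored
    return slate, explored   # `explored` should be logged alongside the exposure


rng = random.Random(7)       # seeded so the simulation is repeatable
ranked = ["a", "b", "c", "d"]
pool = ["x", "y"]

trials = 10_000
explored_count = 0
for _ in range(trials):
    slate, explored = assemble_with_exploration(ranked, pool, k=3, epsilon=0.05, rng=rng)
    explored_count += explored is not None

explore_rate = explored_count / trials
print(f"share of lists with an exploration slot: {explore_rate:.3f}")
```

Because the budget is an explicit parameter and every explored slot is flagged, the realized exploration share can be monitored against the intended one.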

Exposure bias and feedback traps

Click logs reflect not only user intent, but also what the system chose to show. Exposure logging, counterfactual evaluation, and segment review are therefore required for honest learning.
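Counterfactual evaluation is often illustrated with inverse propensity scoring (IPS), which reweights logged rewards by how likely the logging policy was to show each item. The sketch below assumes that exposure logs record this probability as `propensity`. The log rows and the `always_b` target policy are made up for the example.

```python
def ips_estimate(logs, target_policy_prob):
    """Estimate the average reward of a target policy from logs of a different policy."""
    total = 0.0
    for row in logs:
        weight = target_policy_prob(row["context"], row["item"]) / row["propensity"]
        total += weight * row["reward"]
    return total / len(logs)


# Hypothetical logs: the logging policy showed "a" or "b" with probability 0.5 each.
logs = [
    {"context": "u1", "item": "a", "reward": 0, "propensity": 0.5},
    {"context": "u2", "item": "b", "reward": 0, "propensity": 0.5},
    {"context": "u3", "item": "b", "reward": 1, "propensity": 0.5},
    {"context": "u4", "item": "a", "reward": 0, "propensity": 0.5},
]


def always_b(context, item):
    """Target policy that would always show item "b"."""
    return 1.0 if item == "b" else 0.0


naive = sum(row["reward"] for row in logs) / len(logs)
estimate = ips_estimate(logs, always_b)
print(naive, estimate)
```

The naive average over the logs is 0.25, while the IPS estimate for the always-"b" policy is 0.5. Without logged propensities this correction is not possible, which is one reason exposure logging has to exist before training on clicks.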

Degraded modes

You need an explicit degraded path: cached lists, popular results, reduced feature sets, or a lighter ranker. Otherwise ranking becomes an all-or-nothing dependency and hurts UX during incidents.
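An explicit degraded path can be expressed as an ordered chain of routes, each tried only if the previous one failed or returned nothing. The sketch below simulates a dependency failure by raising `TimeoutError`. The route names and functions are hypothetical, and a real service would enforce deadlines on the calls themselves.

```python
def serve_with_fallback(routes):
    """Try each (name, fn) route in order; return the first non-empty list and its source."""
    for name, fn in routes:
        try:
            items = fn()
        except Exception:
            continue  # a real service would log and count this failure
        if items:
            return items, name
    return [], "empty_state"


def full_ranker():
    raise TimeoutError("feature store did not answer in time")


def cached_list():
    return []  # simulated cache miss


def popular_list():
    return ["p1", "p2", "p3"]


items, source = serve_with_fallback([
    ("full_ranker", full_ranker),
    ("cached_list", cached_list),
    ("popular_list", popular_list),
])
print(items, source)
```

Returning the route name with the list makes the share of degraded responses observable per surface.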

Offline metrics

Recall@K, NDCG, MAP, calibration, segment quality, and fairness checks support iteration, but they do not capture the full product impact without fresh context and a real exposure mix.
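For reference, two of these metrics can be computed as in the sketch below. It uses the linear-gain form of DCG (gain divided by log2 of position plus one); the exponential-gain variant is also common. The ranked list and graded relevance labels are invented for the example.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def dcg_at_k(gains, k):
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(item, 0.0) for item in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevance = {"a": 3.0, "c": 2.0, "e": 1.0}   # "e" was never retrieved

recall = recall_at_k(ranked, set(relevance), k=3)
ndcg = ndcg_at_k(ranked, relevance, k=3)
print(f"Recall@3={recall:.3f} NDCG@3={ndcg:.3f}")
```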

Online metrics

CTR, add-to-cart, conversion, retention, session depth, complaint rate, and policy override rate show how ranking actually changes product behavior and operations.

Failure and degraded modes

Feature fetch stalls or cache hit rate collapses on a hot surface.

Policy override rate suddenly spikes and the model score barely affects the final list.

A new ranker improves CTR but hurts retention and diversity over a longer horizon.

Fresh inventory never reaches the candidate pool because ingestion or index freshness lags.

Key trade-offs

A stronger ranker improves quality but increases latency and cost per request.
Aggressive personalization can improve short-term engagement while harming diversity and explainability.
High freshness improves responsiveness but complicates cache strategy and regression analysis.
Too little exploration freezes the learning loop, while too much can temporarily hurt product metrics.

Anti-patterns

Treating ranking as a one-model problem without candidate generation, a policy layer, and a fallback path.

Optimizing only CTR while ignoring retention, complaint rate, and business-side objectives.

Training on clicks without exposure logging and pretending the feedback data is objective by default.

Hiding business rules inside features instead of applying them in an explicit policy layer after ranking.
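An explicit policy layer after ranking might look like the sketch below: it takes the model-ordered list, applies a block list and a per-category cap, and reports how often the final list differs from the model's own top positions. The rule set and the position-wise definition of override rate are assumptions made for illustration.

```python
from collections import Counter


def apply_policy(ranked, blocked, max_per_category, k):
    """ranked: (item_id, category, score) tuples sorted by score, highest first."""
    final, per_category = [], Counter()
    for item_id, category, _score in ranked:
        if item_id in blocked or per_category[category] >= max_per_category:
            continue                      # rule applied outside the model
        final.append(item_id)
        per_category[category] += 1
        if len(final) == k:
            break
    model_top = [item_id for item_id, _, _ in ranked[:k]]
    overridden = sum(1 for shown, wanted in zip(final, model_top) if shown != wanted)
    override_rate = overridden / len(final) if final else 0.0
    return final, override_rate


ranked = [
    ("a", "shoes", 0.9),
    ("b", "shoes", 0.8),
    ("c", "shoes", 0.7),   # dropped by the diversity cap
    ("d", "hats", 0.6),
    ("e", "bags", 0.5),
]
final, override_rate = apply_policy(ranked, blocked={"e"}, max_per_category=2, k=3)
print(final, override_rate)
```

Because the rules live outside the model, the override rate stays measurable, and a spike in it is visible as its own signal rather than as an unexplained drop in model quality.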

Recommendations

Keep ranking as a multi-stage system: candidate generation, features, the ranker, policy, and list assembly should each remain controllable.

Separate model score from the final decision so diversity, safety, and business constraints do not disappear into one formula.

Treat offline and online quality as different planes and review segment-level regressions in both.

Design fallback, exploration budgets, and exposure logging before the first product incident, not after it.

What to explain in an interview

How would you design candidate generation so you preserve recall without breaking the latency budget?
Where should diversity, freshness, and business rules live: inside the model or in a separate policy layer?
How do you separate true model improvement from a feedback trap that the system created through its own exposure?
What happens when feature freshness drops or the ranking runtime is unavailable, and what does the user see?


Related chapters

Recommendation System - A broader product case where ranking and candidate generation are embedded in the full recommender architecture.
Search System - A neighboring domain where candidate retrieval and ranking live under their own latency and relevance constraints.
Model Release, Calibration, and Experiment Loops - How to change ranking models, thresholds, and rule layers safely in production.
Feature Store & Model Serving - How low-latency feature delivery, parity, and degraded modes affect ranking runtime.
Fraud / Risk Scoring ML System - Useful for comparing list ranking with threshold-driven decision systems and their feedback loops.