System Design Space

Updated: April 5, 2026 at 8:34 PM

Ranking and Recommendation Architecture for ML Systems

Difficulty: medium

How to design a recommendation loop: candidate generation, ranking, policy layers, freshness, feedback, and the next training cycle.

Ranking and recommendation systems are valuable because they show ML not as one model, but as an operating loop of candidate generation, policy, and feedback.

The chapter ties freshness, product constraints, degraded modes, and experimentation into one serving architecture.

That is especially useful in interviews where you need to explain why recommendation quality depends on more than the model alone.

Practical value of this chapter

Ranking loop

Break candidate generation, the ranker, policy, and list assembly into separate controlled layers.

Feedback and experiments

Understand how exposures, reactions, and experiments move the next system release.

Freshness and constraints

Connect freshness, diversity, product rules, and latency budgets in one serving architecture.

Interview material

Use a concrete ranking case instead of vague recommender-system talk.

Related chapter

Recommendation System

A broader case where ranking lives inside the full product architecture of a recommender system.

Read the overview

Ranking and recommendation architecture is not “train on clicks and sort the list.” It is an operating loop in which candidate generation, ranking, product rules, and feedback shape the outcome as much as the model itself.

Problem and context

Imagine a product with several surfaces: a home feed, similar-item recommendations, checkout suggestions, and blended search. We need a ranking system that fits the latency budget, survives cold start, respects product rules, and does not learn from its own distortions.

Functional requirements

  • Build personalized lists for feeds, recommendation blocks, similar items, and blended search surfaces.
  • Support a multi-stage loop: candidate generation, context fetch, ranking, policy, and final list assembly.
  • Respect product constraints such as diversity, freshness, safety, inventory limits, ad separation, and editorial rules.
  • Collect exposures, clicks, skips, hides, dwell time, and delayed outcomes for training and experiments.
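The last requirement is easier to reason about with a concrete record shape. The sketch below is a minimal illustration, not a prescribed schema: the `Exposure` and `Reaction` types, their field names, and `build_labels` are all hypothetical. It shows one property worth preserving in any design: labels are produced only for items that were actually exposed, so a reaction without a matching exposure never becomes a training row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    # Hypothetical fields: one row per item actually shown to the user.
    request_id: str
    item_id: str
    position: int
    surface: str


@dataclass(frozen=True)
class Reaction:
    # Hypothetical fields: kind could be "click", "hide", "skip", and so on.
    request_id: str
    item_id: str
    kind: str


def build_labels(exposures, reactions, positive_kinds=("click",)):
    """Label every exposed item; items that were never shown get no row."""
    positives = {
        (r.request_id, r.item_id) for r in reactions if r.kind in positive_kinds
    }
    return [
        {
            "request_id": e.request_id,
            "item_id": e.item_id,
            "position": e.position,
            "surface": e.surface,
            "label": int((e.request_id, e.item_id) in positives),
        }
        for e in exposures
    ]


exposures = [Exposure("r1", "a", 0, "home"), Exposure("r1", "b", 1, "home")]
# The click on "z" has no matching exposure, so it produces no training row.
reactions = [Reaction("r1", "b", "click"), Reaction("r1", "z", "click")]
rows = build_labels(exposures, reactions)
```

Keeping `position` and `surface` on each row is what later makes segment review and position-aware analysis possible.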

Non-functional requirements

  • Keep p95 end-to-end list latency below 180 ms for user-facing surfaces.
  • Provide a fallback path: a popular list, a cached list, or a lighter ranking route when dependencies degrade.
  • Maintain stable freshness SLAs for new items and catalog changes so the surface does not feel stale.
  • Expose observability for segment quality, exposure coverage, policy override rate, and tail latency at every stage.

Load and scale assumptions

DAU

20M+

The list changes by surface, segment, device, and session context, so ranking behaves like a shared serving runtime.

Candidate pool

100K-10M items

You cannot send the whole catalog into an expensive ranker, so the first stage must stay cheap and recall-oriented.

Peak QPS

80K

Peaks depend on campaigns, prime time, notifications, and search bursts.

Freshness target

<= 5 min for hot inventory

New items, price changes, and availability updates must reach candidate generation and policy quickly.

Label delay

hours to weeks

Retention and long-term value appear much later than clicks, so instant CTR is never enough by itself.
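A quick back-of-the-envelope calculation shows why the first stage has to be cheap. The peak QPS and catalog size below come from the assumptions above; the figure of 500 candidates per request is an illustrative assumption, not a number from this chapter.

```python
PEAK_QPS = 80_000              # from the load assumptions above
CATALOG_SIZE = 10_000_000      # upper end of the candidate pool above
CANDIDATES_PER_REQUEST = 500   # assumption for illustration only


def scores_per_second(qps, items_per_request):
    """Number of ranker invocations per second at a given fan-out."""
    return qps * items_per_request


ranked = scores_per_second(PEAK_QPS, CANDIDATES_PER_REQUEST)
full_catalog = scores_per_second(PEAK_QPS, CATALOG_SIZE)
reduction = full_catalog // ranked

print(f"{ranked:,} scores/s after retrieval")                  # 40,000,000
print(f"{full_catalog:,} scores/s if ranking the full catalog")  # 800,000,000,000
print(f"retrieval cuts ranker work {reduction:,}x")            # 20,000x
```

Under these assumptions the expensive ranker still performs tens of millions of scorings per second at peak, which is why candidate count is usually one of the first knobs in a latency and cost discussion.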

Reference ranking architecture

It helps to read the system as a stack of layers: from catalog signals through features and ranking to list assembly and the next learning cycle.

  • Signals and catalog: catalog, user profile, session context, editorial signals
  • Candidate generation: candidate retrieval, embeddings, graph signals, hard filters
  • Context and feature layer: freshness, feature cache, user signals, item signals
  • Ranking and policy: ranker, diversity, business rules, safety constraints
  • List assembly and fallback: list assembly, popular list, cached list, empty state
  • Feedback and experimentation: exposure logging, A/B, delayed signals, next training cycle
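The serving half of this stack can be sketched as a chain of small functions, one per layer. This is a toy illustration under stated assumptions: the `Item` type, the popularity-based retrieval, the affinity-weighted "ranker", and every function name are hypothetical stand-ins for real retrieval and model components.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    category: str
    in_stock: bool
    popularity: float


def generate_candidates(catalog, limit):
    """Cheap, recall-oriented stage: hard filters plus a coarse popularity cut."""
    eligible = [it for it in catalog if it.in_stock]
    return sorted(eligible, key=lambda it: it.popularity, reverse=True)[:limit]


def rank(candidates, user_affinity):
    """Stand-in ranker: popularity weighted by the user's category affinity."""
    scored = [
        (it.popularity * user_affinity.get(it.category, 0.1), it)
        for it in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def apply_policy(scored, blocked_categories):
    """Explicit policy layer: rules run after scoring, not inside the score."""
    return [(s, it) for s, it in scored if it.category not in blocked_categories]


def assemble(scored, size, fallback_ids):
    """List assembly: pad with a fallback list when too few items survive."""
    ids = [it.item_id for _, it in scored][:size]
    for fid in fallback_ids:
        if len(ids) >= size:
            break
        if fid not in ids:
            ids.append(fid)
    return ids


def serve(catalog, user_affinity, blocked_categories, size, fallback_ids):
    candidates = generate_candidates(catalog, limit=100)
    scored = rank(candidates, user_affinity)
    allowed = apply_policy(scored, blocked_categories)
    return assemble(allowed, size, fallback_ids)


catalog = [
    Item("a", "shoes", True, 0.9),
    Item("b", "books", True, 0.8),
    Item("c", "shoes", False, 0.99),  # out of stock: removed by the hard filter
    Item("d", "games", True, 0.5),    # removed by the policy layer below
]
result = serve(
    catalog,
    user_affinity={"books": 1.0, "shoes": 0.5},
    blocked_categories={"games"},
    size=3,
    fallback_ids=["p1", "p2"],
)
print(result)  # ['b', 'a', 'p1']
```

Because each stage has its own input and output, each can be measured, replaced, or degraded independently.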

What to keep under control

It helps to view ranking not only as a chain of models, but as a balance of list quality, live constraints, and how fast the next learning cycle can move.

Ranking economics

CTR vs retention, conversion, complaint rate, ad separation

Live constraints

p95 latency, freshness SLA, cache hit rate, tail latency

Learning loop

cold start, exploration budget, counterfactual evaluation, segment review

Below, the chapter separates the read path from the write path. The second one matters just as much, because that is where exposures, reactions, and the next release actually take shape.

How the ranking system serves a list and writes feedback

Comparing the synchronous serving path with the delayed feedback path

Synchronous list-serving path

1. Request and surface context

The system receives the request, identifies the product surface, the user, the session, and the active product constraints.

  • Tightly constrained by latency.
  • Good candidates must not be dropped too early.
  • Policy and fallback can change the final ordering a lot.
Latency budget, Freshness, Fallback

Key deep dives

This topic becomes much clearer once you separate cold start, exposure bias, and degraded modes instead of treating ranking as a single-model problem.

Cold start and fallback lists

For new users and new items, the system must survive without rich behavioral history: popular lists, editorial priors, content features, and a lightweight exploration path are essential from day one.
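One common way to combine a prior with a personalized score is to weight the personalized part by how much history backs it. The sketch below assumes a simple shrinkage formula; the function name and the smoothing constant `k` are hypothetical, and the chapter does not prescribe this particular blend.

```python
def blended_score(personal_score, prior_score, n_interactions, k=20.0):
    """Weight the personalized score by the amount of history behind it.

    k is a hypothetical smoothing constant: when n_interactions == k,
    the personalized score and the prior contribute equally.
    """
    if n_interactions < 0:
        raise ValueError("n_interactions must be non-negative")
    weight = n_interactions / (n_interactions + k)
    return weight * personal_score + (1.0 - weight) * prior_score


# A brand-new user falls back entirely to the prior (popular or editorial).
new_user = blended_score(0.9, 0.3, n_interactions=0)
# With k interactions, the two sources are averaged.
regular = blended_score(0.9, 0.3, n_interactions=20)
# With a long history, the personalized score dominates.
heavy = blended_score(0.9, 0.3, n_interactions=1980)
```

The same shape applies to new items: the prior can come from content features or editorial signals until enough exposures accumulate.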

Freshness vs stability

The faster new items and signals enter the loop, the more responsive the system becomes, but the harder it is to keep results stable and investigate regressions by segment and surface.

Exploration vs exploitation

If the system shows only what it already knows how to sell, it stops learning. Exploration budgets must be controlled, segment-aware, and measurable rather than random noise.
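A controlled budget can be as simple as capping the share of slots that exploration may occupy and flagging those slots for the exposure log. The sketch below is one possible shape, assuming the exploration pool is disjoint from the ranked list; `mix_exploration` and its parameters are hypothetical names.

```python
import random


def mix_exploration(ranked_ids, explore_ids, budget, rng):
    """Replace at most a `budget` share of slots with exploration items.

    Returns (final_list, flags); flags[i] is True for exploration slots so
    the exposure log can separate explored from exploited impressions.
    Assumes explore_ids is a list that does not overlap ranked_ids.
    """
    n_slots = len(ranked_ids)
    # Cap by budget, by pool size, and keep the top slot untouched.
    n_explore = min(int(n_slots * budget), len(explore_ids), max(n_slots - 1, 0))
    positions = set(rng.sample(range(1, max(n_slots, 1)), n_explore))
    picks = iter(rng.sample(explore_ids, n_explore))
    final, flags = [], []
    for pos, item in enumerate(ranked_ids):
        if pos in positions:
            final.append(next(picks))
            flags.append(True)
        else:
            final.append(item)
            flags.append(False)
    return final, flags


ranked = [f"r{i}" for i in range(10)]
explore = ["e1", "e2", "e3"]
# A seeded generator keeps the example reproducible.
final, flags = mix_exploration(ranked, explore, budget=0.2, rng=random.Random(7))
```

In a segment-aware setup, `budget` would be looked up per surface or user segment rather than passed as one global constant.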

Exposure bias and feedback traps

Click logs reflect not only user intent, but also what the system chose to show. Exposure logging, counterfactual evaluation, and segment review are therefore required for honest learning.
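Inverse propensity scoring (IPS) is one standard counterfactual estimator; it is shown here as an example of the idea, not as the method this chapter mandates. The sketch assumes the logging policy recorded the probability with which each item was shown and that the new policy is deterministic. The log layout, the `clip` value, and the function name are assumptions.

```python
def ips_estimate(logs, new_policy, clip=10.0):
    """Inverse propensity scoring estimate of a new policy's mean reward.

    Each log row is (context, shown_item, propensity, reward), where
    propensity is the probability that the logging policy showed shown_item.
    Weights are clipped at `clip` to limit variance from rare exposures.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, shown_item, propensity, reward in logs:
        if propensity <= 0.0:
            raise ValueError("propensity must be positive")
        # Only rows where the new policy agrees with the logged action count.
        if new_policy(context) == shown_item:
            total += reward * min(1.0 / propensity, clip)
    return total / len(logs)


logs = [
    ("u1", "a", 0.5, 1.0),
    ("u2", "b", 0.5, 0.0),
    ("u3", "a", 0.5, 0.0),
    ("u4", "b", 0.5, 1.0),
]


def always_a(context):
    return "a"


estimate = ips_estimate(logs, always_a)
print(estimate)  # 0.5
```

Without logged propensities this estimate cannot be computed at all, which is the practical reason exposure logging has to record how an item came to be shown, not only that it was.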

Degraded modes

You need an explicit degraded path: cached lists, popular results, reduced feature sets, or a lighter ranker. Otherwise ranking becomes an all-or-nothing dependency and hurts UX during incidents.
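An explicit degraded path can be expressed as an ordered list of routes, from the full ranker down to a static popular list. The following is a minimal sketch with hypothetical route names. It checks the budget only between routes and does not interrupt a route that is already running; a production version would also need per-call timeouts.

```python
import time


def serve_with_fallback(routes, budget_ms, now=time.monotonic):
    """Try routes in order; move on when one fails, is empty, or time is up.

    `routes` is a list of (name, callable) pairs ordered from best to
    cheapest. The last route is assumed to be always available.
    Returns (route_name, items) so the degraded-mode rate can be monitored.
    """
    if not routes:
        raise ValueError("at least one route is required")
    deadline = now() + budget_ms / 1000.0
    for name, route in routes[:-1]:
        if now() >= deadline:
            break
        try:
            items = route()
        except Exception:
            continue  # a real system would log and count this failure
        if items:
            return name, items
    last_name, last_route = routes[-1]
    return last_name, last_route()


def full_ranker():
    raise TimeoutError("feature store stalled")  # simulated incident


def cached_list():
    return []  # simulated cache miss


def popular_list():
    return ["p1", "p2", "p3"]


route, items = serve_with_fallback(
    [("full", full_ranker), ("cached", cached_list), ("popular", popular_list)],
    budget_ms=180,
)
print(route, items)  # popular ['p1', 'p2', 'p3']
```

Returning the route name alongside the list makes it straightforward to alert on how often users are served a degraded result.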

Offline metrics

Recall@K, NDCG, MAP, calibration, segment quality, and fairness checks help iteration, but they do not capture the full product impact without fresh context and a real exposure mix.
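For reference, two of these metrics can be computed in a few lines. The sketch uses the linear-gain form of DCG (gain equals the relevance grade, discount is log2 of position plus one); other variants, such as exponential gain, exist. The function names are illustrative, and the ranked list is assumed to contain no duplicates.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(item, 0.0) for item in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevance = {"b": 1.0, "d": 1.0}  # graded relevance; missing items count as 0

recall = recall_at_k(ranked, set(relevance), k=2)  # 1 of 2 relevant items found
ndcg = ndcg_at_k(ranked, relevance, k=2)
perfect = ndcg_at_k(["b", "d", "a", "c"], relevance, k=2)
```

Computing these per segment and per surface, rather than as one global number, is what lets them support the segment review mentioned above.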

Online metrics

CTR, add-to-cart, conversion, retention, session depth, complaint rate, and policy override rate show how ranking actually changes product behavior and operations.

Failure and degraded modes

  • Feature fetch stalls or cache hit rate collapses on a hot surface.
  • Policy override rate suddenly spikes and model score barely affects the final list.
  • A new ranker improves CTR but hurts retention and diversity over a longer horizon.
  • Fresh inventory never reaches the candidate pool because ingestion or index freshness lags.

Key trade-offs

  • A stronger ranker improves quality but increases latency and cost per request.
  • Aggressive personalization can improve short-term engagement while harming diversity and explainability.
  • High freshness improves responsiveness but complicates cache strategy and regression analysis.
  • Too little exploration freezes the learning loop, while too much can temporarily hurt product metrics.

Anti-patterns

  • Treating ranking as a one-model problem without candidate generation, a policy layer, and a fallback path.
  • Optimizing only CTR while ignoring retention, complaint rate, and business-side objectives.
  • Training on clicks without exposure logging and pretending the feedback data is objective by default.
  • Hiding business rules inside features instead of applying them in an explicit policy layer after ranking.

Recommendations

  • Keep ranking as a multi-stage system: candidate generation, features, the ranker, policy, and list assembly should each remain controllable.
  • Separate model score from the final decision so diversity, safety, and business constraints do not disappear into one formula.
  • Treat offline and online quality as different planes and review segment-level regressions in both.
  • Design fallback, exploration budgets, and exposure logging before the first product incident, not after it.
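To make the separation between model score and final decision concrete, here is a small sketch of a post-ranking policy step with a per-category cap, plus a simple policy override rate. The cap rule, the tuple layout, and both function names are assumptions chosen for illustration; real policy layers usually combine several such rules.

```python
from collections import Counter


def apply_category_cap(scored_items, max_per_category, size):
    """Greedy re-rank: walk items in score order and skip those over the cap.

    Each item is a (item_id, category, score) tuple.
    """
    counts = Counter()
    final = []
    for item_id, category, _score in sorted(
        scored_items, key=lambda row: row[2], reverse=True
    ):
        if counts[category] >= max_per_category:
            continue
        counts[category] += 1
        final.append(item_id)
        if len(final) == size:
            break
    return final


def override_rate(scored_items, final, size):
    """Share of slots where the final list differs from pure score order.

    Assumes both orderings contain at least `size` items.
    """
    by_score = [
        row[0] for row in sorted(scored_items, key=lambda row: row[2], reverse=True)
    ][:size]
    changed = sum(1 for a, b in zip(by_score, final) if a != b)
    return changed / size


scored = [
    ("a", "shoes", 0.9),
    ("b", "shoes", 0.8),
    ("c", "shoes", 0.7),  # third shoes item: dropped by the cap of 2
    ("d", "books", 0.6),
    ("e", "games", 0.5),
]
final = apply_category_cap(scored, max_per_category=2, size=4)
rate = override_rate(scored, final, size=4)
print(final, rate)  # ['a', 'b', 'd', 'e'] 0.5
```

Tracking this rate over time gives an early signal when rules, rather than the model, start to determine most of the list.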

What to explain in an interview

  • How would you design candidate generation so you preserve recall without breaking the latency budget?
  • Where should diversity, freshness, and business rules live: inside the model or in a separate policy layer?
  • How do you separate true model improvement from a feedback trap that the system created through its own exposure?
  • What happens when feature freshness drops or the ranking runtime is unavailable, and what does the user see?
