Fraud / Risk Scoring ML System — System Design Space

Risk scoring is one of the best ML cases for system design because model quality collides immediately with error costs and with decisions made under tight latency budgets.

The chapter ties threshold choice, delayed labels, human review, and fallback behavior into one production system.

That is especially useful in interviews where you need to connect ML metrics, architecture decisions, and business outcomes.

Practical value of this chapter

Decision cost

Connect the model score to the product action and the real cost of a wrong decision.

Online path

Design scoring, thresholds, and fallback under a strict latency budget.

Delayed labels

Handle feedback that arrives too late for naive online-only schemes.

Interview material

Use a concrete risk-system story instead of generic ML theory.

Related chapter

Precision and recall basics

Foundation for discussing threshold choice and the cost of errors in risk systems.

A fraud / risk scoring ML system is a classic case where the cost of a mistake is immediate: a policy that is too lenient leaks losses, while one that is too strict hurts conversion and user trust. In interviews, the goal is to show how you connect threshold tuning, realtime scoring, delayed labels, and human review into one working system.

Functional requirements

Compute a risk score for payments, logins, transfers, and new-device events in near real time.
Support a decision policy that can approve, challenge, route to manual review, or block based on the risk tier.
Use features from user history, device graph, geo patterns, velocity checks, and external risk signals.
Collect delayed labels from chargebacks, confirmed fraud, analyst review, and customer disputes.

Non-functional requirements

p95 scoring latency below 120 ms on the synchronous critical path.
The system must survive provider or feature-store degradation through fallback rules and safe default thresholds.
Full auditability of which features, model, and threshold produced a decision.
Support frequent recalibration and threshold updates without rebuilding the whole system.

Scale assumptions

Transactions/day

250M+

The event rate requires a cheap scoring path and streaming feature updates.

Peak scoring QPS

75k

Peaks align with marketing campaigns, payroll windows, and holiday traffic.

Label delay

days to weeks

Chargebacks and confirmed-fraud events arrive long after the original decision.

False-positive cost

very high

Over-blocking legitimate activity hurts conversion, trust, and support load.

Reference architecture

Splitting the system into layers separates what lives on the hot decision path from what runs in the background: from incoming signals and feature state to review operations and the loop that updates the next release.

Signals and ingress

payments and logins, device events, external signals, pre-checks

Feature and state layer

online aggregates, device graph, velocity checks, freshness control

Scoring and decisioning

scoring, thresholds, policy rules, fallback

Review and case ops

manual review, reason codes, step-up checks, case outcomes

Feedback and tuning

delayed labels, drift, replay, recalibration

What to keep under control

It helps to view this system not only as a request path, but as a balance of error cost, live constraints, and how quickly the next tuning cycle can happen.

Error economics

false-positive cost, fraud leakage, conversion hit, support load

Live constraints

p95 latency, feature-store SLA, provider degradation, audit trail

Improvement loop

label delay, segment drift, threshold tuning, policy/model updates

Below, the chapter separates the synchronous decision path from the write path that carries delayed feedback, labels, and the next round of tuning.

How the system reads and writes fraud signals

Comparing the synchronous decision path with the delayed feedback path

Synchronous decision path

1. Event intake and pre-check

The system receives the event, validates the required fields, and decides whether it can enter the critical path.

Latency budget, thresholds, fallback

Latency-sensitive.

Must survive feature and provider degradation.

The cost of a false positive is immediately visible to the user.

Related chapter

Model Release, Calibration, and Experiment Loops

How label delay changes the interpretation of the first days after a model release.

Delayed labels: training on incomplete truth

The decision is made in milliseconds, but the truth about it arrives days or weeks later. Label delay is not an annoying detail — it is the central design constraint of the training loop: the model always sees an incomplete picture, and a naive retrain-on-yesterday approach systematically underestimates risk.

Manual review queue

hours to days

The fastest near-truth: analyst decisions are precise, but they only cover the traffic the model itself routed to review.

Customer complaints and disputes

days to weeks

The signal arrives earlier than a formal chargeback, but it is noisy: some complaints are misunderstandings or forgotten subscriptions, not fraud.

Issuer chargebacks

up to 120 days

Visa and Mastercard rules give cardholders up to 120 days to dispute, so the final label can mature four months after the decision was made.

Card-network and partner confirmations

weeks

Compromised-card lists and confirmed-fraud feeds arrive in batches and retroactively — the pipeline must be able to backfill them into already-built training slices.

A two-speed label loop

Proxy labels — analyst verdicts, early disputes, failed step-up challenges — give a fast but biased signal for rapid retraining. The full model retrains on labels that have matured past the chargeback window. Neither loop replaces the other: the fast one catches new attacks, the slow one corrects the fast one's bias.

Label maturation window

An honest training and comparison set contains only transactions older than the maturation window, typically 90-120 days. The price: the model lags behind fresh attack patterns, which is exactly why the fast proxy-label loop is mandatory, not optional.
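
The maturation filter can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the record fields (`scored_at`, `is_fraud`) and the 120-day constant are assumptions chosen to match the upper end of the range above.

```python
from datetime import datetime, timedelta

MATURATION_DAYS = 120  # assumption: upper end of the 90-120 day range


def matured_only(transactions, now, maturation_days=MATURATION_DAYS):
    """Keep only transactions old enough for their labels to count as final."""
    cutoff = now - timedelta(days=maturation_days)
    return [t for t in transactions if t["scored_at"] <= cutoff]


# Hypothetical records; the field names are illustrative.
now = datetime(2024, 6, 1)
transactions = [
    {"id": "t1", "scored_at": datetime(2024, 1, 10), "is_fraud": True},
    {"id": "t2", "scored_at": datetime(2024, 5, 20), "is_fraud": False},  # too fresh
]
mature = matured_only(transactions, now)
```

Only `t1` survives the filter: `t2` may still turn into a chargeback, so its current "not fraud" value cannot be trusted as a negative.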

Bias correction: importance weighting

Training on fresh data systematically under-represents positives: some fraud is not labeled yet and looks like a good transaction. The classic fix is to model the label delay with a separate model, as in Chapelle's work on delayed conversions (KDD 2014), and to reweight examples accordingly (importance weighting), as later delayed-feedback work does.

Time-sliced validation

Historical replays must use only the labels that were known at scoring time, otherwise the evaluation leaks information from the future. Models can be compared honestly only on periods where labels have matured; last month's quality is always an optimistic estimate.
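
One way to keep a replay honest is to resolve every label "as of" a given date. The sketch below assumes hypothetical fields `scored_at` and `fraud_reported_at` and a 120-day maturation window; it returns `None` while the truth is still unknown instead of silently treating the transaction as clean.

```python
from datetime import datetime, timedelta


def label_as_of(txn, as_of, maturation=timedelta(days=120)):
    """Return the label as it was known at `as_of`: True, False, or None (unknown)."""
    reported_at = txn.get("fraud_reported_at")
    if reported_at is not None and reported_at <= as_of:
        return True  # the fraud report had already arrived
    if as_of - txn["scored_at"] >= maturation:
        return False  # the window closed with no report by `as_of`
    return None  # too early to tell; not a clean negative


# Hypothetical transactions for illustration.
fraud_txn = {"scored_at": datetime(2024, 1, 1), "fraud_reported_at": datetime(2024, 2, 15)}
clean_txn = {"scored_at": datetime(2024, 1, 1)}
```

Evaluating `fraud_txn` on 20 January yields `None`, not `False`: a replay that counted it as a negative would reward a model for missing fraud that simply had not been reported yet.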

The selective labels problem

Blocked transactions never receive a label: there is no way to learn whether they were fraud. Without a control slice of risky traffic that is deliberately let through, or counterfactual evaluation, the model learns only from its own misses — and each generation sees a more distorted picture.

The most treacherous part is selective labels: the system itself decides which transactions will ever receive the truth. A blocked event disappears from the dataset, so without control traffic and counterfactual evaluation, every next model generation trains on an increasingly distorted picture of the world.
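
A control slice can be sketched as a randomized release of a small share of would-be-blocked traffic, with the probability of each action recorded so the results can later be reweighted. The function names, the 2% holdout rate, and the action labels below are illustrative assumptions, not a prescribed policy.

```python
import random


def decide(score, block_threshold, holdout_rate, rng):
    """Return (action, propensity): the action taken and its probability."""
    if score < block_threshold:
        return "approve", 1.0
    # High-risk traffic: release a small random share so it can earn a label.
    if rng.random() < holdout_rate:
        return "approve_holdout", holdout_rate
    return "block", 1.0 - holdout_rate


def estimate_fraud_above_threshold(holdout_labels, holdout_rate):
    """Inverse-propensity estimate of fraud cases in ALL traffic above the threshold."""
    return sum(1 for is_fraud in holdout_labels if is_fraud) / holdout_rate


rng = random.Random(7)  # seeded only to make the sketch reproducible
action, propensity = decide(score=0.97, block_threshold=0.9, holdout_rate=0.02, rng=rng)
```

Each labeled holdout transaction stands in for `1 / holdout_rate` transactions in the same score band, which is what lets blocked traffic be evaluated at all. The holdout rate is itself an economic decision: every released transaction is a deliberately accepted risk.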

Model choice: GBDT, neural networks, and ensembles

Fraud features are classic tabular data: counters, amounts, categories, window aggregates. So the model conversation starts not with a neural architecture but with a strong gradient-boosting baseline — and an honest answer to the question of which problem GBDT does not solve.

GBDT: XGBoost / LightGBM

Mechanism: A gradient-boosted tree ensemble over tabular features: amounts, counters, categories, window aggregates, graph fan-out.

Trade-off: Trees are robust to uninformative features and easily learn the step-like dependencies typical of fraud, but they handle incremental retraining poorly and cannot be trained jointly with embeddings.

When to choose: The default baseline: on small and medium tabular data, tree ensembles still beat neural networks (Grinsztajn et al. benchmark, NeurIPS 2022) at a much lower training, tuning, and serving cost.

Neural networks

Mechanism: A DNN over the same tabular features plus embeddings of categories, event sequences, and graph entities; a single stack for fine-tuning and representation reuse.

Trade-off: They scale better with data volume and retraining cadence, but demand more data and monitoring discipline, and the quality win is not guaranteed: when Stripe migrated Radar from a Wide & Deep ensemble to a pure DNN, it first had to reproduce the XGBoost contribution inside the new architecture — dropping it outright cost 1.5 points of recall.

When to choose: When the dataset is large, features include sequences and embeddings, and the bottleneck is retraining speed and ensemble scaling rather than baseline quality.

An ensemble with specialist models

Mechanism: A global model scores overall risk, while specialist models cover individual attack classes: account takeover, card testing, merchant fraud. A decision policy and rules operate on top of the scores.

Trade-off: Segment models are more accurate in their niches but multiply the cost of ownership: each one needs its own calibration, monitoring, retraining loop, and rollback plan.

When to choose: When attack classes have different error economics and different features, and a single model consistently underperforms on niche but expensive segments.

Calibrating probabilities for business thresholds

Raw GBDT and neural-network scores are not calibrated probabilities. Add calibration after training (Platt scaling or isotonic regression); otherwise a business threshold like 'block above 0.9' means nothing stable across releases.
A threshold is economics, not a magic number: block when p times the expected fraud loss exceeds the false-positive cost weighted by (1 - p). The operating point differs across amounts, products, and segments.
With fraud below one percent of traffic, ROC-AUC looks optimistic for almost any model. Watch PR-AUC and recall at a fixed precision around the operating threshold: these metrics actually reflect class imbalance.
Recalibration is needed more often than retraining: the score distribution drifts with traffic, and the same threshold a month later means a different block rate and a different manual-review load.

Calibration is what connects the model to the decision policy: until the score can be read as a probability, the threshold cannot be derived from the economics of errors — it can only be guessed.
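
The economic rule above can be written out directly. Blocking pays off when p × fraud_loss > (1 − p) × false_positive_cost, which rearranges to p > false_positive_cost / (false_positive_cost + fraud_loss). The sketch assumes p is already a calibrated probability; the cost figures are made-up illustrations.

```python
def block_threshold(fraud_loss, false_positive_cost):
    """Probability above which blocking has the lower expected cost."""
    return false_positive_cost / (false_positive_cost + fraud_loss)


def should_block(p, fraud_loss, false_positive_cost):
    """Compare the expected loss of approving with the expected cost of blocking."""
    return p * fraud_loss > (1.0 - p) * false_positive_cost


# Illustrative numbers only: a small payment where a wrongly blocked
# customer is expensive, and a large transfer where fraud loss dominates.
small_payment = block_threshold(fraud_loss=20.0, false_positive_cost=180.0)
large_transfer = block_threshold(fraud_loss=5000.0, false_positive_cost=180.0)
```

With these assumed costs the small payment is blocked only above p = 0.9, while the large transfer is blocked from roughly p = 0.035. That gap is why one global threshold rarely fits all amounts and segments.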

Graph features and fraud rings

Organized fraud is not about anomalous transactions but about anomalous links: dozens of accounts on one device, one card across a hundred profiles, shared shipping addresses. The device-card-account graph turns those links into features at three maturity levels — from cheap counters to community detection and graph embeddings.

Fan-out counters (1 hop)

Mechanism: Node degree in the device-card-account graph: how many cards this device has seen in 30 days, how many accounts share an email or shipping address, how many devices one card touches.

Trade-off: Cheap to serve: precomputed counters are read from the online store by key in milliseconds, but they only see immediate neighbors and miss the structure of the ring as a whole.

When to apply: The foundation of any fraud model: anomalous fan-out is one of the most stable signals of account factories and stolen-card testing.
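
A fan-out counter is a count of distinct neighbors per node. The sketch below computes cards per device from a list of hypothetical (device, card) edges. It leaves out the 30-day window and the online store; in production the resulting values would be precomputed and read by key.

```python
from collections import defaultdict


def cards_per_device(edges):
    """edges: iterable of (device_id, card_id). Return distinct cards seen per device."""
    cards_by_device = defaultdict(set)
    for device_id, card_id in edges:
        cards_by_device[device_id].add(card_id)  # a set deduplicates repeat payments
    return {device: len(cards) for device, cards in cards_by_device.items()}


# Hypothetical edges: dev-A has touched three different cards, dev-B only one.
edges = [
    ("dev-A", "card-1"), ("dev-A", "card-2"), ("dev-A", "card-3"),
    ("dev-A", "card-1"),  # repeat use of the same card does not raise fan-out
    ("dev-B", "card-9"),
]
fan_out = cards_per_device(edges)
```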

Communities and fraud rings

Mechanism: Offline search for dense clusters (connected components, Louvain): fraud rings appear as isolated communities of moderate size, separated from the giant component of legitimate users. Confirmed-fraud labels propagate to community neighbors.

Trade-off: Recomputation is batch, every few hours or daily: a new ring becomes visible only after the next run. In exchange, the online path keeps a cheap lookup of the community id and its risk status.

When to apply: Against organized rings, mule intermediaries, and shared identifiers: each account looks harmless on its own, and the anomaly is visible only in the link structure.
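
The simplest form of community search named above, connected components, fits in a short union-find sketch. The node names and the propagation rule (flag every node that shares a component with confirmed fraud) are illustrative assumptions; a real pipeline would also cap component size so the giant component of legitimate users is never flagged as a whole.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes of an undirected graph into components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def risky_nodes(components, confirmed_fraud):
    """Propagate confirmed-fraud labels to every node in the same component."""
    flagged = set()
    for component in components:
        if component & confirmed_fraud:
            flagged |= component
    return flagged


edges = [("acct-1", "dev-1"), ("acct-2", "dev-1"), ("acct-3", "dev-2")]
components = connected_components(edges)
flagged = risky_nodes(components, confirmed_fraud={"acct-1"})
```

Here `acct-2` has no fraud history of its own but shares `dev-1` with a confirmed case, so it is flagged; `acct-3` sits in a separate component and stays clean.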

Graph embeddings and GNNs

Mechanism: Node embeddings (FastRP, node2vec) or graph neural networks encode the structural role of a node into a vector that feeds the main model alongside tabular features.

Trade-off: The most expressive level and the most expensive one: recomputing embeddings, versioning them, and explaining them to analysts is much harder than with counters and communities.

When to apply: When counters and communities are exhausted and rings have learned to split their links so they never cross simple fan-out thresholds.

The cost of real-time graph feature serving: a full graph traversal on the synchronous path does not fit a 120 ms budget at 75k QPS. The working rule is that the online path only reads precomputed values from the feature store by key; all traversal depth moves into batch and streaming recomputation. The trade-off of that rule is freshness: the deeper a feature looks into the graph, the further it lags behind reality.

Related chapter

Redis: In-Memory Database and Architecture

Data structures and trade-offs of the hot store where velocity counters live.

Velocity features: windows, storage, consistency

Velocity counters — how many events per window — are the strongest realtime features and the most demanding on infrastructure: they must be incremented on every event, read on every scoring call, and reproduced exactly in training. They are computed over sliding windows of different lengths, and each length catches its own class of attacks.

Window

5 minutes

Payment attempts per card, share of failed authorizations, new cards on a device.

Catches: Stolen-card testing with small amounts and burst bot attacks.

Window

1 hour

Spend per card and account, number of unique recipients and merchants.

Catches: Fast cash-out right after an account takeover.

Window

24 hours

Cards per device, geo jumps, spend relative to the weekly average.

Catches: Device farms and distributed attacks smeared over time.

Window

7-30 days

Deviation from the spending profile, share of new recipients, device-change frequency.

Catches: Slow fraud, mule accounts, and gradual cash-out in small portions.

Exact sliding windows

A Redis sorted set per counter gives an exact window but pays with memory per event and logarithmic cost per operation. At 75k scoring QPS that is 4.5 million scoring calls per minute, so with several counters touched per call, velocity features alone generate tens of millions of operations per minute.

Sub-buckets instead of exact windows

A daily window is assembled from 24 hourly buckets with TTLs: a bucket-boundary error in exchange for predictable memory and a cheap increment. For most velocity features that precision is enough — the threshold is chosen with a margin anyway.
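
The bucket approach can be illustrated with an in-memory stand-in for the hot store. This is a sketch under assumptions: the class name and key format are hypothetical, and TTL expiry is not modeled (in a real store, buckets older than the window would simply expire).

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # one-hour buckets


class BucketedCounter:
    """Approximate sliding-window counter built from fixed hourly buckets."""

    def __init__(self, window_buckets=24):
        self.window_buckets = window_buckets
        self.buckets = defaultdict(int)  # (key, bucket index) -> event count

    def increment(self, key, ts):
        self.buckets[(key, int(ts // BUCKET_SECONDS))] += 1

    def count(self, key, now):
        """Sum the current bucket and the previous window_buckets - 1 buckets."""
        current = int(now // BUCKET_SECONDS)
        first = current - self.window_buckets + 1
        return sum(self.buckets.get((key, b), 0) for b in range(first, current + 1))


counter = BucketedCounter()
for ts in (0, 5 * 3600, 30 * 3600):  # timestamps in seconds
    counter.increment("card:1", ts)
```

Because the current bucket is only partly filled, the effective window covers between 23 and 24 hours. That is the bucket-boundary error being traded for a fixed number of keys per counter and an O(1) increment.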

Who increments the counter

A stream processor (Kafka to Flink) is cheaper and stays off the critical path, but the counter lags by the pipeline delay — precisely during a burst attack, when it matters most. A synchronous increment on the scoring path closes that gap at the cost of an extra write to the hot store per event.

Train/serve consistency

An offline recomputation of windows from raw logs almost never matches what the online counter saw: late events, retries, materialization lag. The reliable pattern is to log the feature vector used at scoring time and train on that log, rather than rebuilding features after the fact.

The main invisible risk here is training/serving skew: a feature with the same name offline and online can mean different things. Logging the feature vector at scoring time solves this more reliably than any attempt to achieve point-in-time correctness by recomputing raw events after the fact.
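
Logging at scoring time can be as simple as writing one record per decision. In the sketch below the record layout, the model callable, and the list used as a log are all assumptions for illustration; the same record also serves the audit requirement of knowing which features, model, and threshold produced a decision.

```python
import json
import time


def score_and_log(event_id, features, model, model_version, threshold, log):
    """Score an event and persist exactly the inputs that produced the decision."""
    score = model(features)
    record = {
        "event_id": event_id,
        "scored_at": time.time(),
        "model_version": model_version,
        "threshold": threshold,
        "features": features,  # the values the model saw, not a later rebuild
        "score": score,
        "decision": "block" if score >= threshold else "approve",
    }
    log.append(json.dumps(record, sort_keys=True))
    return record


# Hypothetical stand-ins: a toy model and an in-memory log.
def toy_model(features):
    return 0.95 if features["cards_per_device_24h"] > 5 else 0.10


log = []
features = {"cards_per_device_24h": 8, "amount": 42.0}
record = score_and_log("evt-1", features, toy_model, "toy-v1", 0.9, log)
```

Training then joins these logged vectors with matured labels by `event_id`, so the offline features are identical to the online ones by construction.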

Adversarial dynamics: the opponent adapts

In recommender systems, drift is a change of audience taste; in anti-fraud it is concept drift actively produced by an adversary. The operating loop is therefore designed with adversarial thinking: every model release is a move in a game, and a counter-move will follow.

Drift as the adversary's strategy

Every blocked pattern teaches the attacker: fraudsters probe thresholds with small amounts and change tactics after the first block. Concept drift here is faster than in ordinary ML systems, so score distributions and block rates per segment are monitored daily, not reviewed quarterly.
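
One common way to put a number on a daily score-distribution check is the Population Stability Index (PSI). It is a technique chosen here for illustration and is not named in the chapter. The sketch assumes scores in [0, 1], non-empty samples, ten equal-width bins, and a small floor to avoid dividing by zero.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""

    def shares(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1  # a score of 1.0 goes in the last bin
        return [max(c / len(scores), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative samples: the share of high scores triples between the two days.
baseline = [0.05] * 90 + [0.95] * 10
today = [0.05] * 70 + [0.95] * 30
drift = psi(baseline, today)
```

Computed per segment (country, product, channel), the same function shows where the distribution moved, which an aggregate number would hide.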

Champion / challenger

The champion serves production traffic while challengers score the same stream in shadow mode. Promotion decisions are made on matured labels, not on the first days — otherwise the winner is simply the model that looks best on incomplete truth.

Rules as the fast response and the fallback

An analyst ships a blocking rule in hours; retraining and releasing a model takes days — rules cover a new attack while the model catches up. The same conservative rule set is the fallback when the model degrades or online features become unavailable.

The review queue as a labeling source

Analyst throughput is a hard limit of the whole system. The queue is prioritized by score uncertainty and expected error cost, not FIFO. A random sample of approved transactions is mixed into labeling; otherwise the training set drifts toward what the current model already catches.
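
A non-FIFO queue can be sketched with a heap keyed by a priority score. The formula used here, p × (1 − p) × amount, is only one plausible reading of "uncertainty times expected error cost", and the case fields are hypothetical.

```python
import heapq


def review_priority(p_fraud, amount):
    """Higher means review sooner: uncertain scores on expensive transactions."""
    uncertainty = p_fraud * (1.0 - p_fraud)  # peaks at p = 0.5
    return uncertainty * amount


def build_queue(cases):
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    heap = [(-review_priority(c["p_fraud"], c["amount"]), c["id"]) for c in cases]
    heapq.heapify(heap)
    return heap


def next_case(heap):
    return heapq.heappop(heap)[1]


cases = [
    {"id": "a", "p_fraud": 0.5, "amount": 100.0},
    {"id": "b", "p_fraud": 0.9, "amount": 1000.0},
    {"id": "c", "p_fraud": 0.5, "amount": 10.0},
]
queue = build_queue(cases)
order = [next_case(queue) for _ in range(len(cases))]
```

Case `b` comes first even though its score is less uncertain than that of `a`, because the amount at stake is ten times larger.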

It pays to keep challengers in shadow mode permanently, not just before a release: that gives early drift diagnostics and a ready replacement candidate when the champion degrades. And rules remain the last fallback that must keep working even when everything else in the ML stack does not.
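
The fallback path can be sketched as a wrapper around the model call. Everything concrete below is an assumption for illustration: the rule thresholds, the exception types treated as degradation, and the function names.

```python
def conservative_rules(event):
    """Hypothetical safe-default rules used when the ML path is unavailable."""
    if event["amount"] > 1000 or event.get("new_device", False):
        return "challenge"
    return "approve"


def decide_with_fallback(event, fetch_features, model, threshold=0.9):
    """Return (action, source); fall back to rules if features cannot be fetched."""
    try:
        features = fetch_features(event)
        score = model(features)
    except (TimeoutError, ConnectionError):
        # Degraded mode: the decision must not depend on the failing component.
        return conservative_rules(event), "fallback_rules"
    return ("block" if score >= threshold else "approve"), "model"


def unavailable_feature_store(event):
    """Stand-in for a feature store that has timed out."""
    raise TimeoutError("feature store unavailable")


degraded = decide_with_fallback({"amount": 5000}, unavailable_feature_store, lambda f: 0.0)
healthy = decide_with_fallback({"amount": 10}, lambda e: {"x": 1}, lambda f: 0.95)
```

Returning the decision source alongside the action keeps the audit trail honest: analysts can see which decisions were made in degraded mode.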

Key trade-offs

A lower threshold reduces fraud leakage but increases false positives and friction for legitimate users.
More realtime features improve quality but make freshness SLAs, debugging, and fallback paths more complex.
One global model is easier to operate, but segmented models are often more accurate for different markets and products.
Hard-blocking high scores lowers loss risk but raises the cost of wrong decisions and increases support pressure.
Deeper graph features catch rings better, but a full graph traversal does not fit the latency budget — depth has to move into offline recomputation at the cost of freshness.
Synchronous velocity-counter increments close the gap during burst attacks but add a hot-store write per scoring call; the asynchronous path is cheaper but lags exactly when it matters most.

Common mistakes

Optimizing only ROC-AUC without translating quality into the business cost of false positives and fraud leakage.

Mixing online scoring with after-the-fact labels without explicit handling of delayed feedback and label leakage.

Making the risk engine a black box with no explainability or audit trail for support and analysts.

Running without a fallback policy when online features, external providers, or the primary model are unavailable.

Training and comparing models on fresh transactions whose positive labels have not matured: a model that underestimates risk always looks better than it is.

Labeling only what the model routed to review, then wondering why the next model generation finds no new attack types.

Recommendations

Separate score generation from decision policy: the model predicts risk, while product and risk policy decide the action.

An aggregate quality metric hides local failures: track drift across countries, products, channels, and device cohorts separately.

Maintain replay sets and a calibration pipeline because thresholds and score distributions age faster than teams expect.

Treat analyst review as part of the system design, not as a manual afterthought once an incident has happened.

What to explain in an interview

How would you choose the threshold, and who should own the trade-off between false positives and false negatives?
How do you design the scoring path when labels arrive weeks after the transaction?
What happens when online features or an external risk provider become unavailable?
How do you explain to analysts and support staff why the system made a decision?
Why does GBDT remain a strong baseline for tabular fraud features, and at what point do neural networks or an ensemble of specialist models become worth it?
How do you compute velocity features so that training values match the values seen on the scoring path?
How do you evaluate the model honestly when blocked transactions never receive a label?

Sources

How we built it: Stripe Radar — The evolution of the Radar model: from logistic regression and XGBoost through a Wide & Deep ensemble to a pure DNN scoring in about 100 ms.
Stripe Radar: a primer on machine learning for fraud detection — Features, quality metrics, and the product logic of thresholds in an industrial anti-fraud system.
Chapelle. Modeling Delayed Feedback in Display Advertising (KDD 2014) — The classic work on correcting bias under delayed labels: a separate delay model fitted alongside the conversion model; later delayed-feedback work builds importance weighting on the same idea.
Grinsztajn et al. Why do tree-based models still outperform deep learning on typical tabular data? (NeurIPS 2022) — A benchmark across 45 tabular datasets: why tree ensembles remain state of the art on medium-sized tabular data.
Neo4j: graph-based approach to financial fraud detection — A graph of transactions, cards, devices, and addresses: fan-out features, community detection, and propagation of confirmed-fraud labels.
Visa chargeback time limits (Chargebacks911) — Dispute windows under Visa rules: the standard 120 days and extended timelines for delayed delivery.
Redis: real-time fraud detection tutorial — Data structures for velocity counters: sorted sets, TTL buckets, and risk profiles in the hot store.

Related chapters

Precision and recall basics - The core language for thresholds, false positives, and false negatives.
ML Ops Pipeline - How to assemble the full working loop: release, monitoring, feedback, and the next training cycle.
Feature Store & Model Serving - How to build the online feature path, keep data consistent, and run fallback paths for scoring.
Human-in-the-Loop, Data Quality, and the Operational AI Loop - How analyst review and delayed labels become part of the operating loop.
T-Bank ML platform interview - A platform view on operating ML systems and standardizing how teams build and release them.