Risk scoring is one of the most instructive ML cases for system design because model quality collides directly with the cost of errors and with decisions made under tight latency budgets.
The chapter ties threshold choice, delayed labels, human review, and fallback behavior into one production system.
That is especially useful in interviews where you need to connect ML metrics, architecture decisions, and business outcomes.
Practical value of this chapter
- Decision cost: Connect the model score to the product action and the real cost of a wrong decision.
- Online path: Design scoring, thresholds, and fallback under a strict latency budget.
- Delayed labels: Handle feedback that arrives too late for naive online-only schemes.
- Interview material: Use a concrete risk-system story instead of generic ML theory.
Related chapter
- Precision and recall basics - Foundation for discussing threshold choice and the cost of errors in risk systems.
A fraud/risk scoring ML system is a classic case where the cost of a mistake is immediate: a policy that is too lenient lets fraud losses through, while one that is too strict hurts conversion and user trust. In interviews, the goal is to show how you connect threshold tuning, realtime scoring, delayed labels, and human review into one working system.
Functional requirements
- Compute a risk score for payments, logins, transfers, and new-device events in near real time.
- Support a decision policy that can approve, challenge, route to manual review, or block based on the risk tier.
- Use features from user history, device graph, geo patterns, velocity checks, and external risk signals.
- Collect delayed labels from chargebacks, confirmed fraud, analyst review, and customer disputes.
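The decision policy from the requirements above can be sketched as a small mapping from score to action. This is a minimal illustration, not a reference design: the `Policy` class, the tier boundaries, and the version string are all hypothetical, and real boundaries would come from tuning against error costs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    CHALLENGE = "challenge"
    REVIEW = "manual_review"
    BLOCK = "block"


@dataclass(frozen=True)
class Policy:
    """Hypothetical risk tiers; the boundaries are illustrative assumptions."""
    version: str
    challenge_at: float
    review_at: float
    block_at: float

    def decide(self, score: float) -> Action:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # Check the strictest tier first so higher scores never get a softer action.
        if score >= self.block_at:
            return Action.BLOCK
        if score >= self.review_at:
            return Action.REVIEW
        if score >= self.challenge_at:
            return Action.CHALLENGE
        return Action.APPROVE


# Example values only; they are not taken from any real system.
policy = Policy(version="example-v1", challenge_at=0.30, review_at=0.60, block_at=0.90)
```

Keeping the boundaries in a versioned object separate from the model is one way to update thresholds without redeploying the model itself.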
Non-functional requirements
- p95 scoring latency below 120 ms on the synchronous critical path.
- The system must survive provider or feature-store degradation through fallback rules and safe default thresholds.
- Full auditability of which features, model, and threshold produced a decision.
- Support frequent recalibration and threshold updates without rebuilding the whole system.
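The degradation requirement can be illustrated with a scorer that falls back to rules when the model call exceeds its budget or fails. The sketch below is an assumption-heavy illustration: `rule_score`, its cutoffs, and the default 100 ms budget are hypothetical, and a thread pool is only one of several ways to enforce a deadline.

```python
import concurrent.futures as cf

# A shared pool; a timed-out call keeps running in the background until it finishes.
_EXECUTOR = cf.ThreadPoolExecutor(max_workers=4)


def rule_score(event: dict) -> float:
    """Hypothetical conservative rule used only when the model is unavailable."""
    if event.get("new_device") and event.get("amount", 0) > 500:
        return 0.7
    return 0.2


def score_with_fallback(event: dict, model_fn, budget_s: float = 0.1):
    """Return (score, source), where source records which path produced the score."""
    future = _EXECUTOR.submit(model_fn, event)
    try:
        return future.result(timeout=budget_s), "model"
    except cf.TimeoutError:
        future.cancel()  # best effort; an already running call is not interrupted
        return rule_score(event), "fallback_timeout"
    except Exception:
        return rule_score(event), "fallback_error"
```

Returning the source alongside the score lets the audit trail and monitoring distinguish model decisions from fallback decisions.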
Scale assumptions
- Transactions/day: 250M+. The event rate requires a cheap scoring path and streaming feature updates.
- Peak scoring QPS: 75k. Peaks align with marketing campaigns, payroll windows, and holiday traffic.
- Label delay: days to weeks. Chargebacks and confirmed-fraud events arrive long after the original decision.
- False-positive cost: very high. Over-blocking legitimate activity hurts conversion, trust, and support load.
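A quick back-of-the-envelope check shows how far the peak sits above the daily average for the scale assumptions above. The per-replica throughput below is a hypothetical number chosen only to make the sizing step concrete.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = 250_000_000
peak_qps = 75_000

avg_qps = transactions_per_day / SECONDS_PER_DAY  # roughly 2,894 events per second
peak_to_avg = peak_qps / avg_qps                  # roughly 25.9x the daily average

# Assumption: one scoring replica sustains 500 QPS within the latency budget.
assumed_qps_per_replica = 500
replicas_at_peak = math.ceil(peak_qps / assumed_qps_per_replica)  # 150, before headroom
```

The large gap between average and peak is the reason capacity is planned against the peak figure rather than the daily total.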
Reference architecture
It helps to read this system as a stack of layers: from incoming signals and feature state to review operations and the loop that updates the next release.
What to keep under control
It helps to view this system not only as a request path, but as a balance of error cost, live constraints, and how quickly the next tuning cycle can happen.
Below, the chapter separates the synchronous decision path from the write path that carries delayed feedback, labels, and the next round of tuning.
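One way to handle the delayed feedback path is to treat an unlabeled decision as a negative only after a maturity window has passed. The sketch below assumes a hypothetical 45-day window and simple in-memory structures; the field names `event_id` and `decided_at` are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: after 45 days without a chargeback or dispute, treat the event as legitimate.
MATURITY = timedelta(days=45)


def build_training_rows(decisions, fraud_event_ids, now):
    """Return (event_id, label) pairs, skipping decisions that are too recent to trust."""
    rows = []
    for decision in decisions:
        if decision["event_id"] in fraud_event_ids:
            rows.append((decision["event_id"], 1))  # confirmed fraud, usable immediately
        elif now - decision["decided_at"] >= MATURITY:
            rows.append((decision["event_id"], 0))  # matured without a fraud label
        # Otherwise the label may still arrive, so the row is left out for now.
    return rows
```

Skipping immature rows avoids training on events that only look legitimate because their chargebacks have not arrived yet.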
How the system reads and writes fraud signals
Comparing the synchronous decision path with the delayed feedback path
Synchronous decision path
1. Event intake and pre-check
The system receives the event, validates the required fields, and decides whether it can enter the critical path.
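The intake step can be sketched as a cheap validation that runs before any feature lookup. The required fields and the function name `pre_check` are assumptions for illustration; the event types mirror the ones listed in the functional requirements.

```python
# Hypothetical minimal schema; a real system would likely validate more fields.
REQUIRED_FIELDS = ("event_id", "event_type", "user_id")
SUPPORTED_TYPES = {"payment", "login", "transfer", "new_device"}


def pre_check(event: dict) -> list:
    """Return a list of validation errors; an empty list means the event may be scored."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = event.get(field)
        if value is None:
            errors.append(f"missing:{field}")
        elif not isinstance(value, str) or not value:
            errors.append(f"invalid:{field}")
    event_type = event.get("event_type")
    if isinstance(event_type, str) and event_type and event_type not in SUPPORTED_TYPES:
        errors.append(f"unsupported_type:{event_type}")
    return errors
```

Rejecting malformed events here keeps the latency budget for events that can actually be scored.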
Key trade-offs
- A lower threshold reduces fraud leakage but increases false positives and friction for legitimate users.
- More realtime features improve quality but make freshness SLAs, debugging, and fallback paths more complex.
- One global model is easier to operate, but segmented models are often more accurate for different markets and products.
- Hard-blocking high scores lowers fraud loss but makes each wrong decision more costly and increases support pressure.
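The threshold trade-off can be made concrete by picking the candidate threshold with the lowest total error cost on a labeled validation set. The sketch below is illustrative only: the scores, labels, and per-error costs are made-up values, and `pick_threshold` is a hypothetical helper.

```python
def pick_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return (threshold, total_cost) for the cheapest candidate; ties keep the first."""
    best = None
    for threshold in candidates:
        pairs = list(zip(scores, labels))
        false_pos = sum(1 for s, y in pairs if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in pairs if s < threshold and y == 1)
        cost = false_pos * cost_fp + false_neg * cost_fn
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Made-up validation data: 1 marks confirmed fraud, 0 marks legitimate activity.
scores = [0.05, 0.20, 0.40, 0.55, 0.70, 0.95]
labels = [0, 0, 0, 1, 0, 1]
candidates = [0.3, 0.5, 0.8]

# Missed fraud is expensive, so a lower threshold wins.
fraud_heavy = pick_threshold(scores, labels, cost_fp=5, cost_fn=100, candidates=candidates)
# Blocking a legitimate user is expensive, so a higher threshold wins.
friction_heavy = pick_threshold(scores, labels, cost_fp=200, cost_fn=100, candidates=candidates)
```

With these inputs the first call selects 0.5 and the second selects 0.8, which shows that the same model yields different operating points once the cost of each error type changes.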
What to explain in an interview
- How would you choose the threshold, and who should own the trade-off between false positives and false negatives?
- How do you design the scoring path when labels arrive weeks after the transaction?
- What happens when online features or an external risk provider become unavailable?
- How do you explain to analysts and support staff why the system made a decision?
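For the explainability and auditability questions, one option is to persist a self-contained record for every decision. The sketch below assumes a hypothetical `DecisionRecord` schema; the field names and the idea of caller-supplied reason codes are illustrative, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical audit record: enough to reconstruct why a decision was made."""
    event_id: str
    model_version: str
    policy_version: str
    score: float
    threshold: float
    action: str
    features: dict = field(default_factory=dict)
    reasons: list = field(default_factory=list)  # reason codes supplied by the caller


def to_audit_line(record: DecisionRecord) -> str:
    """Serialize with sorted keys so identical records produce identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


record = DecisionRecord(
    event_id="e1",
    model_version="model-example-1",
    policy_version="example-v1",
    score=0.72,
    threshold=0.60,
    action="manual_review",
    features={"txn_count_1h": 9, "new_device": True},
    reasons=["high_velocity", "new_device"],
)
audit_line = to_audit_line(record)
```

Storing the feature values alongside the model and policy versions lets an analyst or support agent see what the system saw at decision time, even after thresholds change.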
Related chapters
- Precision and recall basics - The core language for thresholds, false positives, and false negatives.
- ML Ops Pipeline - End-to-end lifecycle: deployment, monitoring, feedback, and retraining.
- Feature Store & Model Serving - Online feature path, parity, and fallback strategy for realtime scoring.
- Human-in-the-Loop, Data Quality, and the Operational AI Loop - How analyst review and delayed labels become part of the operating loop.
- T-Bank ML platform interview - A platform view on operating ML systems and standardizing how teams build and release them.
