Precision and recall matter not as classroom formulas, but as the language a product uses to negotiate the cost of errors with a model.
The chapter grounds the topic in thresholds, false positives, false negatives, and the kinds of scenarios where metric choice directly changes user experience and operational consequences.
In interviews, it helps you explain the trade-off between error types quickly and clearly, and show that model tuning is always tied to task context rather than to an abstract maximum of quality.
Practical value of this chapter
Design in practice
Translate guidance on precision/recall metrics and ML-system quality evaluation into architecture decisions for data flow, model serving, and quality control points.
Decision quality
Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.
Interview articulation
Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.
Trade-off framing
Make trade-offs explicit for precision/recall metrics and ML-system quality evaluation: experiment speed, quality, explainability, resource budget, and maintenance complexity.
Source
Precision and recall at your fingertips
The original post on which this chapter is based.
Precision and recall capture different aspects of classifier quality: precision measures the reliability of positive predictions, recall the completeness of detection. In real systems there is almost always a trade-off between the two.
Formulas in simple language
Precision
Of all the things the model has labeled as positive, how many are actually positive.
Precision = TP / (TP + FP)
Recall (completeness)
Of all the truly positive cases, how many the model managed to find.
Recall = TP / (TP + FN)
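The two formulas above can be sketched as small helper functions (a minimal illustration; the function names are my own, not from the original post):

```python
def precision(tp: int, fp: int) -> float:
    # Of everything flagged positive, the share that truly is positive.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of everything truly positive, the share the model found.
    return tp / (tp + fn)
```

Note that precision never looks at FN and recall never looks at FP, which is exactly why each metric alone can hide one whole class of errors.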
Visualization: Vasya, the sheep, and the wolf
Vasya keeps a balance between false alarms and missed wolves.
TP = 17: Vasya shouted "wolf" and the wolf really was there
FP = 10: False alarm: "wolf", but there is no wolf
FN = 13: The wolf was there, but Vasya stayed silent
TN = 60: Silence, and there really is no wolf
Metrics at the current threshold
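Plugging Vasya's confusion-matrix counts into the formulas gives the operating point at the current threshold:

```python
# Counts from the Vasya example above.
tp, fp, fn, tn = 17, 10, 13, 60

precision = tp / (tp + fp)  # 17/27 ≈ 0.63: share of "wolf!" shouts that were right
recall = tp / (tp + fn)     # 17/30 ≈ 0.57: share of real wolves Vasya caught

print(f"precision={precision:.2f} recall={recall:.2f}")
```

So at this threshold Vasya is somewhat more trustworthy when he shouts than he is thorough: more than a third of wolves go unnoticed.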
Production threshold: step by step
Step 1. Quantify error cost
Translate FP and FN into money, support load, and reputational risks. Without this, threshold tuning is blind.
Step 2. Define an operational target
For example: FN <= 5% with FP <= 15%. This turns metrics into a concrete product contract.
Step 3. Validate by segments
Compare quality across languages, customer types, and time windows so local degradation is not hidden.
Step 4. Revisit threshold regularly
After release, monitor drift and recalculate the operating point on fresh data, not only on train/test.
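The four steps above can be sketched as a threshold sweep against an operational contract. This is a toy example on synthetic scores (the data, the 5% FN target, and the selection rule are illustrative assumptions, not a prescription):

```python
import numpy as np

# Hypothetical validation slice; in practice use fresh labeled production data.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = np.clip(labels * 0.3 + rng.normal(0.4, 0.2, 1000), 0.0, 1.0)

def rates(threshold: float) -> tuple[float, float]:
    """FN rate (missed positives) and FP rate at a given threshold."""
    pred = scores >= threshold
    fn_rate = np.sum(~pred & (labels == 1)) / np.sum(labels == 1)
    fp_rate = np.sum(pred & (labels == 0)) / np.sum(labels == 0)
    return fn_rate, fp_rate

# Step 2 as code: keep thresholds whose FN rate honors the contract (FN <= 5%),
# then take the highest one, since a higher threshold means fewer false alarms.
candidates = [t for t in np.linspace(0.0, 1.0, 101) if rates(t)[0] <= 0.05]
best = max(candidates) if candidates else None
```

In a real system you would also check the FP side of the contract, repeat the sweep per segment (step 3), and rerun it on fresh data after release (step 4).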
When precision/recall is not enough
PR-AUC
When: comparing several models across threshold ranges.
Better captures positive ranking quality under class imbalance.
F-beta score
When: FN is significantly more expensive than FP.
Lets you explicitly increase recall weight with the beta parameter.
Probability calibration
When: confidence scores drive product-level actions.
Reduces sharp quality swings after model and data updates.
If missing positives is more expensive than false alarms, use F-beta with beta > 1 so recall has a stronger impact on threshold selection.
Where each metric matters more
Code review assistant
Precision: High | Recall: Moderate
False alarms quickly lead to fatigue and disregard for recommendations.
Medical screening
Precision: Moderate | Recall: Very high
It is critical not to miss real cases of the disease (FN is more expensive than FP).
Antifraud in payments
Precision: High | Recall: High
You need to balance: do not block legitimate transactions unnecessarily, and do not let fraud through.
Practical recommendations
Always pin down the cost of FP and FN for the specific product before choosing a threshold.
Show precision/recall together with the confusion matrix, not in isolation from each other.
Check metrics separately by segment (clients, data types, languages) so that degradation is not hidden in the aggregate.
For review assistants, it usually pays to keep precision higher to maintain user trust.
Common mistakes
Looking only at accuracy when there is strong class imbalance.
Comparing models by precision without controlling recall (and vice versa).
Fixing the threshold once and never revising it after the data changes.
Ignoring user reactions to false positives in production.
Related chapters
- AI Engineering (short summary) - Practice of evaluation and product-level validation of AI systems.
- AI Engineering Interviews (short summary) - Frequently asked questions and scenarios around the quality of ML/AI solutions.
- Machine Learning System Design (short summary) - Deep dive into metrics, model errors, and ML lifecycle decisions in production.
- ML platform in T-Bank - How quality metrics are used in platform practice.
- SRE Evolution: AI Assistant Rollout at T-Bank - A production case where precision/recall shape operational reliability.
