Precision and recall sound like textbook metrics right up to the moment when you must choose between a false alarm and a dangerous miss.
This chapter turns the formulas into operating decisions: where to place the threshold, when to add human review, and how error cost reshapes the workflow.
It is a strong foundation for fraud detection, moderation, medical screening, and any system where the impact of a decision matters more than an abstract score.
Practical value of this chapter
Metric intuition
Explain the difference between a false alarm and a dangerous miss without heavy math.
Threshold language
Connect threshold choice to manual review, support load, and product policy.
Clear interview answer
Explain in a compact way why one metric is rarely enough on its own.
Case foundation
Build the base for fraud, moderation, ranking, and other applied ML cases.
Source
A simple post about precision and recall
The short source post this chapter grew out of.
Precision and recall answer different questions. Precision tells you how much you can trust positive predictions, while recall tells you what share of real positive cases the model actually catches. As soon as you choose a threshold, the discussion turns into a practical trade-off between extra alerts and dangerous misses.
Formulas in plain language
Precision
Of all the cases the model marked as positive, how many were actually positive.
Precision = TP / (TP + FP)
Recall
Of all the real positive cases, how many the model actually managed to find.
Recall = TP / (TP + FN)
When one combined reference point is needed, people often use the F1-score, but it still does not tell you which type of mistake is more expensive for the product.
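A minimal sketch of why F1 alone cannot settle that question. The two operating points below are hypothetical numbers chosen for illustration, not results from this chapter, and `f1_score` is a helper name introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical operating points (illustrative numbers only).
cautious = {"precision": 0.9, "recall": 0.5}    # few false alarms, many misses
aggressive = {"precision": 0.5, "recall": 0.9}  # few misses, many false alarms

f1_cautious = f1_score(**cautious)
f1_aggressive = f1_score(**aggressive)

print(f"cautious:   F1 = {f1_cautious:.3f}")
print(f"aggressive: F1 = {f1_aggressive:.3f}")
```

Both points score the same F1 even though one misses half of the real positives and the other raises a false alarm every second time. Which one is acceptable depends on error cost, not on the combined number.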
Quick glossary
Precision
The share of correct positive predictions among all cases the model marked as positive.
Recall
The share of real positive cases that the model actually managed to find.
F1-score
One combined metric that balances precision and recall through the harmonic mean.
False positive
The model raised a positive signal even though the case was actually negative.
False negative
A positive case existed, but the model failed to catch it.
True positive
The model correctly identified a real positive case.
True negative
The model correctly left a negative case negative.
Interactive example: Vasya, sheep, and the wolf
Imagine Vasya deciding when to shout “wolf”. In this toy dataset there are one hundred observations, but the wolf really appears only ten times. A low threshold helps avoid missed danger, while a high threshold cuts down on unnecessary panic.
At the current threshold of 50%, Vasya keeps a workable balance between unnecessary panic and a dangerous missed wolf.
The four outcomes below are a true positive, a false positive, a false negative, and a true negative.
TP
6
Vasya raised the alarm, and the wolf really appeared.
FP
11
False alarm: Vasya shouted “wolf”, but there was no wolf.
FN
4
The wolf appeared, but Vasya stayed silent.
TN
79
Vasya stayed quiet, and there really was no danger.
Metrics at the current threshold
Precision: 35.3% | Recall: 60.0%
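As a quick check, the metrics follow directly from the four outcome counts in Vasya's example. This is a small sketch using only those counts; the variable names are chosen here for readability.

```python
# Outcome counts from Vasya's toy example (100 observations, 10 wolves).
tp, fp, fn, tn = 6, 11, 4, 79

precision = tp / (tp + fp)            # share of alarms that were real wolves
recall = tp / (tp + fn)               # share of wolves that triggered an alarm
false_positive_rate = fp / (fp + tn)  # share of calm observations with an alarm

print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
print(f"FPR       = {false_positive_rate:.1%}")
```

Roughly one alarm in three is real, while four wolves out of ten still slip through.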
Threshold in a working system: step by step
Step 1. Price both error types
Translate false alarms and misses into money, support load, manual review cost, and reputational risk.
Step 2. Write down an operating target
For example: missed positives below 5% while false alarms stay below 15%. That turns metrics into a product requirement.
Step 3. Check quality by segment
Compare metrics across languages, customer groups, countries, and time windows so local degradation does not hide behind the average.
Step 4. Revisit the threshold on fresh data
An operating threshold rarely stays valid for long: after release, recalculate it on fresh production data rather than relying only on the original train/test splits.
Watch for drift as well: a shift in data or user behavior can make yesterday’s threshold misleading very quickly.
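One way to connect the pricing from Step 1 to the threshold itself is to sweep candidate thresholds and keep the cheapest one, then rerun the same sweep on each fresh batch of labeled data. The sketch below assumes that a score at or above the threshold means "positive"; the scores, labels, cost values, and the function names `confusion_counts` and `cheapest_threshold` are all hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when score >= threshold is treated as positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def cheapest_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, total_cost) with the lowest cost among observed scores."""
    best = None
    for threshold in sorted(set(scores)):
        _, fp, fn, _ = confusion_counts(scores, labels, threshold)
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical model scores and true labels (1 = positive case).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Hypothetical prices: first a miss costs 10x an alarm, then the reverse.
miss_is_expensive = cheapest_threshold(scores, labels, cost_fp=1, cost_fn=10)
alarm_is_expensive = cheapest_threshold(scores, labels, cost_fp=10, cost_fn=1)

print("miss costs more: ", miss_is_expensive)
print("alarm costs more:", alarm_is_expensive)
```

The same model ends up with a low threshold when misses are expensive and a high one when false alarms are expensive, which is why the prices have to be agreed on before the threshold.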
When two metrics are no longer enough
If one operating point is not enough, look at the whole curve. ROC AUC (ROC stands for Receiver Operating Characteristic) shows how well a model separates positive and negative cases across all thresholds. PR AUC is especially useful when the positive class is rare, because precision directly reflects how many false alarms the ninety calm observations generate.
ROC curve
The X axis shows the false-positive rate, and the Y axis shows the true-positive rate: the share of wolves the model still catches.
Current threshold 50%: FPR 12.2%, TPR 60.0%.
ROC AUC is useful as a separability metric, but with a rare wolf it can look more optimistic than the product experience: an FPR of 12.2% sounds modest, yet at the same threshold only 35.3% of alarms are real.
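ROC AUC can also be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair, which is fine for small samples. The scores and labels are hypothetical, and `roc_auc` is a name introduced here rather than a library function.

```python
def roc_auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores and true labels (1 = positive case).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"ROC AUC = {auc:.3f}")
```

No threshold appears anywhere in the calculation, which is why the number says how separable the classes are but not how the system behaves at the operating point you ship.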
PR curve
The X axis shows recall, and the Y axis shows precision: the share of raised alarms that really lead to a wolf.
Current threshold 50%: Recall 60.0%, Precision 35.3%.
The baseline precision here is only 10.0%: the wolf appears in just 10 out of 100 cases. That is why PR AUC better shows whether the model adds real value beyond random guessing.
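A common way to summarize the PR curve in one number is average precision: the mean of the precision values measured at the rank of each real positive. The sketch below is a simplified version that assumes no tied scores. The data is hypothetical, and `average_precision` is a helper name introduced here.

```python
def average_precision(scores, labels):
    """Mean of precision at the rank of each real positive.

    Simplification: assumes no tied scores; ties would need grouping first.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / total_positives


# Hypothetical model scores and true labels (1 = positive case).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

ap = average_precision(scores, labels)
baseline = sum(labels) / len(labels)  # precision of alerting on everything

print(f"average precision = {ap:.3f}")
print(f"baseline          = {baseline:.3f}")
```

The useful comparison is against the baseline, which equals the share of positives in the data. The rarer the positive class, the lower that baseline and the more visible any real lift becomes.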
ROC AUC
When: You want to know whether the model separates the wolf from harmless noise across the full threshold range.
It gives a compact view of separability even if you choose the operating threshold later.
PR AUC
When: The positive class is rare, as it is here: the wolf appears only 10 times out of 100.
It penalizes extra alerts more clearly and better reflects practical value under imbalance.
F-beta score
When: Missing a positive case is more expensive than triggering an extra alert.
It lets you intentionally increase the weight of recall through the beta parameter.
Probability calibration
When: Model confidence directly drives product actions.
It brings predicted probabilities closer to observed frequencies and keeps threshold decisions more stable.
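A simple way to inspect calibration is to bin predictions by probability and compare the mean predicted probability in each bin with the share of positives actually observed there. The sketch below assumes probabilities in the range 0 to 1 and equal-width bins. The data and the function name `reliability_table` are hypothetical.

```python
def reliability_table(probabilities, labels, n_bins=5):
    """Return (mean predicted, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        table.append((mean_predicted, observed_rate, len(members)))
    return table


# Hypothetical predictions: a low-confidence group and a high-confidence group.
probabilities = [0.1] * 5 + [0.9] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

table = reliability_table(probabilities, labels)
for mean_predicted, observed_rate, count in table:
    print(f"predicted {mean_predicted:.2f} | observed {observed_rate:.2f} | n={count}")
```

In this made-up sample the model says 0.9 where only 0.6 of cases turn out positive. A threshold tuned on such scores would act on more confidence than the data supports.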
The orange point on both charts marks the current threshold. ROC AUC and PR AUC do not move with that point: they summarize model behavior across the entire threshold range, not one chosen operating point.
If missing a positive case is more expensive than a false alarm, choose beta > 1 so recall has more influence on the final score. If product actions depend on raw probabilities, also watch probability calibration.
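The effect of beta can be seen on Vasya's own numbers. The sketch below uses the standard F-beta formula, (1 + beta²) · P · R / (beta² · P + R), with the precision and recall from the toy example; `f_beta` is a helper name introduced here.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall from Vasya's example: TP = 6, FP = 11, FN = 4.
precision = 6 / (6 + 11)
recall = 6 / (6 + 4)

f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
f_two = f_beta(precision, recall, beta=2.0)

print(f"F0.5 = {f_half:.3f}")
print(f"F1   = {f_one:.3f}")
print(f"F2   = {f_two:.3f}")
```

Because recall is higher than precision at this threshold, the score rises as beta grows: the same confusion matrix looks better once misses are declared the more expensive error.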
Where each metric matters more
Code review assistant
Precision: High | Recall: Medium
If the system raises too many wrong alerts, the team quickly stops trusting it.
Medical screening
Precision: Medium | Recall: Very high
Missing a real case is usually far more expensive than sending someone to an extra check.
Payment fraud detection
Precision: High | Recall: High
You need to reduce both unnecessary blocks and missed fraudulent transactions.
Practical recommendations
Estimate the cost of a false alarm and the cost of a miss before choosing the operating threshold.
Look at precision and recall together with the confusion matrix instead of treating them as isolated numbers.
Check quality by segment so an average score does not hide local degradation.
With a rare positive class, report PR AUC alongside the metrics at the operating threshold so a nice ROC curve does not make you overestimate real quality.
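The segment check from the recommendations can be sketched as a small grouping step. The records below are hypothetical `(segment, label, prediction)` tuples, the segment names are invented, and `metrics_by_segment` is a name introduced here.

```python
from collections import defaultdict


def metrics_by_segment(records):
    """Precision and recall per segment from (segment, label, prediction) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for segment, label, prediction in records:
        c = counts[segment]  # registers the segment even if it has only negatives
        if prediction == 1 and label == 1:
            c["tp"] += 1
        elif prediction == 1:
            c["fp"] += 1
        elif label == 1:
            c["fn"] += 1
    result = {}
    for segment, c in counts.items():
        alerts = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        result[segment] = {
            "precision": c["tp"] / alerts if alerts else None,
            "recall": c["tp"] / positives if positives else None,
        }
    return result


# Hypothetical records for two language segments.
records = (
    [("en", 1, 1)] * 4 + [("en", 0, 1)] * 1 + [("en", 0, 0)] * 5
    + [("de", 1, 1)] * 1 + [("de", 1, 0)] * 3 + [("de", 0, 0)] * 6
)

per_segment = metrics_by_segment(records)
overall = metrics_by_segment([("all", y, p) for _, y, p in records])["all"]

print("overall:", overall)
for segment, metrics in per_segment.items():
    print(segment, metrics)
```

In this made-up sample the overall recall is 0.625, which hides the fact that one segment catches every positive case while the other catches only one in four.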
Common mistakes
Assuming that accuracy alone is enough when classes are heavily imbalanced.
Comparing models by one metric without tying it back to error cost and the confusion matrix.
Picking a threshold once and never revisiting it after changes in data, traffic, or product behavior.
Reading ROC AUC in isolation and skipping PR AUC in settings where the positive class is rare and false alarms are expensive.
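The accuracy trap is easy to reproduce with the wolf example. The sketch below compares Vasya's confusion matrix with a hypothetical shepherd who never shouts at all; that second set of counts is an assumption derived from the same 10 wolves in 100 observations.

```python
def accuracy(counts):
    """Share of all observations that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def recall(counts):
    """Share of real positives that were caught; 0.0 if there were none."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else 0.0


vasya = {"tp": 6, "fp": 11, "fn": 4, "tn": 79}          # counts from the example
never_shouts = {"tp": 0, "fp": 0, "fn": 10, "tn": 90}   # hypothetical silent baseline

for name, counts in [("Vasya", vasya), ("never shouts", never_shouts)]:
    print(f"{name:13s} accuracy = {accuracy(counts):.0%}, recall = {recall(counts):.0%}")
```

The silent shepherd wins on accuracy, 90% against 85%, while missing every single wolf.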
Related chapters
- AI Engineering (short summary) - shows how model evaluation connects to product value and validation in real workflows.
- AI Engineering Interviews (short summary) - collects common interview questions about metrics, error cost, and quality trade-offs.
- Machine Learning System Design (short summary) - extends the discussion to system-level concerns: metrics, model errors, data, and ML lifecycle decisions.
- ML platform in T-Bank - shows how a platform team helps product teams work with model quality at company scale.
