Precision and recall at your fingertips

Precision and recall sound like textbook metrics right up to the moment when you must choose between a false alarm and a dangerous miss.

This chapter turns the formulas into operating decisions: where to place the threshold, when to add human review, and how error cost reshapes the workflow.

It is a strong foundation for fraud detection, moderation, medical screening, and any system where the impact of a decision matters more than an abstract score.

Practical value of this chapter

Metric intuition

Explain the difference between a false alarm and a dangerous miss without heavy math.

Threshold language

Connect threshold choice to manual review, support load, and product policy.

Clear interview answer

Explain in a compact way why one metric is rarely enough on its own.

Case foundation

Build the base for fraud, moderation, ranking, and other applied ML cases.

Source

A simple post about precision and recall

The short source post this chapter grew out of.


Precision and recall answer different questions, and confusing them is expensive. Precision tells you how much you can trust positive predictions; recall tells you what share of real positive cases the model catches at all. Until you pick a threshold, both numbers stay abstract. Fix a threshold, and the discussion immediately turns into a trade-off: every step toward fewer misses is paid for with more false alarms.

Formulas in plain language

Precision

Of all the cases the model marked as positive, how many were actually positive.

Precision = TP / (TP + FP)

Recall

Of all the real positive cases, how many the model actually managed to find.

Recall = TP / (TP + FN)

When you need one combined reference point, people often look at F1-score. Convenient, but risky: a single number hides the question of which type of mistake is more expensive for the product — and that is exactly where you should start.
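The two formulas and F1 fit in a few lines of Python. This is a minimal sketch: the counts are hypothetical, and returning 0.0 for an empty denominator is an assumed convention rather than a rule from the chapter. The two cases show how one F1 value can hide opposite error profiles.

```python
def precision(tp: int, fp: int) -> float:
    """Share of raised alerts that were real positives."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0  # assumed convention for "no alerts"


def recall(tp: int, fn: int) -> float:
    """Share of real positives that the model caught."""
    actual = tp + fn
    return tp / actual if actual else 0.0  # assumed convention for "no positives"


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: a cautious model and a trigger-happy one.
cautious = dict(tp=8, fp=2, fn=8)       # few false alarms, many misses
trigger_happy = dict(tp=8, fp=8, fn=2)  # many false alarms, few misses

for name, c in (("cautious", cautious), ("trigger_happy", trigger_happy)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    f1 = f1_score(c["tp"], c["fp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# cautious: precision=0.80 recall=0.50 f1=0.615
# trigger_happy: precision=0.50 recall=0.80 f1=0.615
```

Both models get the same F1, yet one mostly misses positives and the other mostly raises false alarms, so the product cost can differ sharply.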

Quick glossary

Precision

The share of correct positive predictions among all cases the model marked as positive.

Recall

The share of real positive cases that the model actually managed to find.

F1-score

One combined metric that balances precision and recall through the harmonic mean.

False positive

The model raised a positive signal even though the case was actually negative.

False negative

A positive case existed, but the model failed to catch it.

True positive

The model correctly identified a real positive case.

True negative

The model correctly left a negative case negative.

Interactive example: Vasya, sheep, and the wolf

Vasya decides when to shout “wolf”. In this toy dataset there are one hundred observations, but the wolf really appears only ten times — the positive class is rare. A low threshold misses danger less often at the price of a stream of false alarms; a high threshold mutes the alarms but lets the wolf through more and more. There is no free middle here — you have to choose it.
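The trade-off can be reproduced with a small threshold sweep. The chapter does not publish the scores behind the toy dataset, so the sketch below generates synthetic ones: the beta distributions and the seed are assumptions, and only the 10-versus-90 class split comes from the text.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for the toy dataset: 10 wolves, 90 calm observations.
# The score distributions are assumptions, not the chapter's real data.
wolf_scores = [random.betavariate(4, 2) for _ in range(10)]
calm_scores = [random.betavariate(2, 4) for _ in range(90)]


def confusion_at(threshold: float) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when Vasya shouts at score >= threshold."""
    tp = sum(s >= threshold for s in wolf_scores)
    fp = sum(s >= threshold for s in calm_scores)
    return tp, fp, len(wolf_scores) - tp, len(calm_scores) - fp


for threshold in (0.2, 0.35, 0.5, 0.65, 0.8):
    tp, fp, fn, tn = confusion_at(threshold)
    print(f"threshold={threshold:.2f}  caught={tp:2d}  missed={fn:2d}  false_alarms={fp:2d}")
```

Whatever the exact scores, raising the threshold can only reduce or keep both the caught wolves and the false alarms, which is why neither error type can be improved for free.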

Operating confidence threshold: 50%

Vasya keeps a workable balance between unnecessary panic and a dangerous missed wolf.

The four outcomes below are a true positive, a false positive, a false negative, and a true negative.

True positive: Vasya raised the alarm, and the wolf really appeared.

False positive (false alarm): Vasya shouted “wolf”, but there was no wolf.

False negative (miss): the wolf appeared, but Vasya stayed silent.

True negative: Vasya stayed quiet, and there really was no danger.

Metrics at the current threshold

Precision: 35.3%

Recall: 60.0%

F1-score: 44.4%

This example uses 100 observations, and only 10 of them contain a real wolf. Move the threshold and watch how false alarms, misses, and the final metrics change under a rare positive class.
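The page shows percentages but not the underlying counts. They can be inferred: with 10 wolves, 60.0% recall means 6 caught and 4 missed, and 35.3% precision matches 6 / (6 + 11), which leaves 79 true negatives. The sketch below treats those counts as a reconstruction, not as data quoted from the chapter, and checks that they reproduce the figures at the 50% threshold.

```python
total, wolves = 100, 10

# Counts inferred from the reported metrics (an assumption, see lead-in).
tp = 6                      # recall 60.0% of 10 wolves
fn = wolves - tp            # 4 missed wolves
fp = 11                     # 6 / (6 + 11) rounds to 35.3% precision
tn = total - wolves - fp    # 79 calm observations left alone

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)        # false-positive rate used on the ROC chart

print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%} fpr={fpr:.1%}")
# precision=35.3% recall=60.0% f1=44.4% fpr=12.2%
```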

Threshold in a working system: step by step

Step 1. Price both error types

Translate false alarms and misses into money, support load, manual review cost, and reputational risk.

Step 2. Write down an operating target

For example: fewer than 5% of real positives missed (recall above 95%) while false alarms stay below 15% of raised alerts (precision above 85%). That turns metrics into a product requirement.

Step 3. Check quality by segment

Compare metrics across languages, customer groups, countries, and time windows so local degradation does not hide behind the average.

Step 4. Revisit the threshold on fresh data

An operating threshold rarely stays valid forever: recalculate it on fresh production data, not only on the original train/test splits.

After release, watch for drift: a shift in data or user behavior can make yesterday’s threshold misleading very quickly.
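The first two steps can be sketched as a small search: price each error type, then pick the threshold with the lowest total cost and check it against the operating target. Everything numeric here is an assumption made for illustration: the synthetic scores, the two cost constants, and the 5% miss-rate target.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic validation set (an assumption): (score, label), label 1 = positive.
data = [(random.betavariate(4, 2), 1) for _ in range(50)]
data += [(random.betavariate(2, 4), 0) for _ in range(450)]

COST_FALSE_ALARM = 1.0   # hypothetical: one manual review
COST_MISS = 20.0         # hypothetical: one missed positive case
MAX_MISS_RATE = 0.05     # illustrative operating target


def errors_at(threshold: float) -> tuple[int, int]:
    """Return (false_alarms, misses) when flagging score >= threshold."""
    fp = sum(1 for score, label in data if label == 0 and score >= threshold)
    fn = sum(1 for score, label in data if label == 1 and score < threshold)
    return fp, fn


def expected_cost(threshold: float) -> float:
    """Total cost of both error types at this threshold."""
    fp, fn = errors_at(threshold)
    return fp * COST_FALSE_ALARM + fn * COST_MISS


candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=expected_cost)
fp, fn = errors_at(best)
miss_rate = fn / 50  # 50 positives in the synthetic set

print(f"cheapest threshold={best:.2f} cost={expected_cost(best):.0f}")
print(f"miss rate={miss_rate:.1%} meets target: {miss_rate <= MAX_MISS_RATE}")
```

Rerunning the same search on fresh data after release is one simple way to notice that the cheapest threshold has moved.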

When two metrics are no longer enough

If one operating point is not enough, look at the whole curve. ROC AUC shows how well a model separates positive and negative cases across all thresholds; ROC stands for Receiver Operating Characteristic. PR AUC is especially useful when the positive class is rare, because it exposes the cost of false alarms against ninety calm observations.

ROC curve

The X axis shows the false-positive rate (FPR), and the Y axis shows the true-positive rate (TPR): the share of wolves the model still catches.

AUC 0.80

Current threshold 50%: FPR 12.2%, TPR 60.0%.

ROC AUC is useful as a separability metric, but with a rare wolf it can look more optimistic than the product experience.
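One way to read ROC AUC is as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it directly from that definition. The scores are hypothetical, and the double loop is meant for intuition on small samples rather than for production use.

```python
def roc_auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: three wolves and four calm observations.
wolf = [0.9, 0.8, 0.4]
calm = [0.7, 0.3, 0.2, 0.1]

print(f"ROC AUC = {roc_auc(wolf, calm):.3f}")
# ROC AUC = 0.917
```

Only one of the twelve pairs is ranked wrongly (the 0.4 wolf against the 0.7 calm score), hence 11/12. Nothing in this number depends on how many calm observations there are overall, which is why it can look optimistic under a rare positive class.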

PR curve

The X axis shows recall, and the Y axis shows precision: the share of raised alarms that really lead to a wolf.

AUC 0.54

Current threshold 50%: Recall 60.0%, Precision 35.3%.

The baseline precision here is only 10.0%: the wolf appears in just 10 out of 100 cases. That is why PR AUC better shows whether the model adds real value beyond random guessing.
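A common step-wise summary of the PR curve is average precision: the mean of the precision values at each rank where a real positive appears. The sketch below assumes no tied scores and uses hypothetical data. It also prints the random-guess baseline, which is simply the share of positives.

```python
def average_precision(scored: list[tuple[float, int]]) -> float:
    """Mean precision at the ranks of the positives (assumes no tied scores)."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


# Hypothetical (score, label) pairs: 3 positives among 10 observations.
scored = [
    (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0), (0.50, 0),
    (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]

ap = average_precision(scored)
baseline = sum(label for _, label in scored) / len(scored)
print(f"average precision={ap:.3f} baseline={baseline:.3f}")
# average precision=0.722 baseline=0.300
```

The gap between the two numbers is the useful signal: a model that only matches the baseline adds nothing over random alerts.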

ROC AUC

When: you want to know whether the model separates the wolf from harmless noise across the full threshold range, before picking an operating point.

It gives a compact view of separability even if you choose the operating threshold later.

PR AUC

When: the positive class is rare, as here, where the wolf appears only 10 times out of 100.

It penalizes extra alerts more clearly and better reflects practical value under imbalance.

F-beta score

When: missing a positive case costs more than an extra alert, and you want that baked straight into the metric.

It lets you intentionally increase recall weight through the beta parameter.

Probability calibration

When: the product action depends on the probability value itself, not just on whether the threshold was crossed.

It makes probabilities closer to reality and keeps threshold decisions more stable.
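A quick way to inspect calibration is a reliability table: group predictions into probability bins and compare the mean predicted probability with the observed share of positives in each bin. This is a minimal sketch with hypothetical predictions. Equal-width bins are one common choice, not the only one.

```python
def reliability_table(probs: list[float], labels: list[int], n_bins: int = 5):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_pred = sum(p for p, _ in items) / len(items)
            observed = sum(y for _, y in items) / len(items)
            rows.append((mean_pred, observed, len(items)))
    return rows


# Hypothetical predictions: confident scores that are slightly too extreme.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

for mean_pred, observed, count in reliability_table(probs, labels):
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={count}")
# predicted=0.10 observed=0.25 n=4
# predicted=0.90 observed=0.75 n=4
```

In a well-calibrated model the two columns roughly match. Here the model is overconfident at both ends.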

The orange point on both charts is the current threshold from the slider. ROC AUC and PR AUC themselves do not move with that point: they summarize model behavior across the entire threshold range, not one chosen operating point.

When missing a positive case costs more than a false alarm, take beta > 1 — the metric itself then values recall more highly. And where raw probabilities drive product actions, watch probability calibration separately: without it the threshold drifts as the probabilities shift.
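The effect of beta can be checked with the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below plugs in the chapter's values at the 50% threshold. Taking precision as exactly 6/17 is an assumption consistent with the rounded 35.3%.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values at the 50% threshold: precision about 35.3% (assumed 6/17), recall 60.0%.
precision, recall = 6 / 17, 0.6

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(precision, recall, beta):.3f}")
# beta=0.5: F=0.385
# beta=1.0: F=0.444
# beta=2.0: F=0.526
```

Because recall is the stronger of the two numbers here, the score rises as beta grows. A model with high precision and weak recall would move the other way.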

Where each metric matters more

Code review assistant

Precision: High | Recall: Medium

If the system raises too many wrong alerts, the team quickly stops trusting it.

Medical screening

Precision: Medium | Recall: Very high

Missing a real case is usually far more expensive than sending someone to an extra check.

Payment fraud detection

Precision: High | Recall: High

You need to reduce both unnecessary blocks and missed fraudulent transactions.

Practical recommendations

Estimate the cost of a false alarm and the cost of a miss before choosing the operating threshold.

Look at precision and recall together with the confusion matrix instead of treating them as isolated numbers.

Check quality by segment so an average score does not hide local degradation.

With a rare positive class, add PR AUC next to the operating threshold so a nice ROC curve does not make you overestimate real quality.
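The segment check from the recommendations above can be sketched as a simple group-by. The records and segment names are hypothetical, and the example tracks recall only. The same pattern applies to precision.

```python
from collections import defaultdict

# Hypothetical labelled predictions: (segment, true_label, predicted_label).
records = [
    ("en", 1, 1), ("en", 1, 1), ("en", 0, 0), ("en", 0, 1),
    ("de", 1, 0), ("de", 1, 0), ("de", 1, 1), ("de", 0, 0),
]


def recall_by_segment(rows):
    """Recall per segment, computed only over real positives."""
    caught = defaultdict(int)
    actual = defaultdict(int)
    for segment, truth, predicted in rows:
        if truth == 1:
            actual[segment] += 1
            caught[segment] += predicted
    return {segment: caught[segment] / actual[segment] for segment in actual}


positives = [(t, p) for _, t, p in records if t == 1]
overall = sum(p for _, p in positives) / len(positives)
per_segment = recall_by_segment(records)

print(f"overall recall={overall:.2f}")
for segment, value in per_segment.items():
    print(f"{segment}: recall={value:.2f}")
# overall recall=0.60
# en: recall=1.00
# de: recall=0.33
```

The overall 0.60 looks tolerable, while the "de" segment quietly misses two out of three positives.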

Common mistakes

Assuming that accuracy alone is enough when classes are heavily imbalanced.

Comparing models by one metric without tying it back to error cost and the confusion matrix.

Picking a threshold once and never revisiting it after changes in data, traffic, or product behavior.

Reading ROC AUC in isolation and skipping PR AUC in settings where the positive class is rare and false alarms are expensive.
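The accuracy trap is easy to show on the wolf dataset. A "model" that never shouts is compared with Vasya at the 50% threshold. Vasya's counts (6 true positives, 11 false alarms, 4 misses, 79 true negatives) are inferred from the chapter's reported metrics and should be read as an assumption.

```python
labels = [1] * 10 + [0] * 90   # 10 wolves among 100 observations
always_silent = [0] * 100      # a "model" that never shouts

pairs = list(zip(always_silent, labels))
accuracy_silent = sum(p == y for p, y in pairs) / len(labels)
recall_silent = sum(p == 1 and y == 1 for p, y in pairs) / sum(labels)

# Counts inferred from the chapter's metrics at the 50% threshold (an assumption).
tp, fp, fn, tn = 6, 11, 4, 79
accuracy_vasya = (tp + tn) / (tp + fp + fn + tn)
recall_vasya = tp / (tp + fn)

print(f"always silent: accuracy={accuracy_silent:.0%} recall={recall_silent:.0%}")
print(f"Vasya at 50%:  accuracy={accuracy_vasya:.0%} recall={recall_vasya:.0%}")
# always silent: accuracy=90% recall=0%
# Vasya at 50%:  accuracy=85% recall=60%
```

By accuracy the silent model wins, even though it lets every wolf through.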

Related chapters

AI Engineering (short summary) - shows how model evaluation connects to product value and validation in real workflows.
AI Engineering Interviews (short summary) - collects common interview questions about metrics, error cost, and quality trade-offs.
Machine Learning System Design (short summary) - extends the discussion to system-level concerns: metrics, model errors, data, and ML lifecycle decisions.
ML platform in T-Bank - lifts the quality question to the platform level: how product teams are helped to keep their metrics under control at company scale.