System Design Space

Updated: April 4, 2026 at 8:17 PM

Precision and recall at your fingertips

Difficulty: easy

A simple explanation of precision, recall, threshold choice, ROC AUC, and PR AUC built around the story of Vasya and the wolf.

Precision and recall sound like textbook metrics right up to the moment when you must choose between a false alarm and a dangerous miss.

This chapter turns the formulas into operating decisions: where to place the threshold, when to add human review, and how error cost reshapes the workflow.

It is a strong foundation for fraud detection, moderation, medical screening, and any system where the impact of a decision matters more than an abstract score.

Practical value of this chapter

Metric intuition

Explain the difference between a false alarm and a dangerous miss without heavy math.

Threshold language

Connect threshold choice to manual review, support load, and product policy.

Clear interview answer

Explain in a compact way why one metric is rarely enough on its own.

Case foundation

Build the base for fraud, moderation, ranking, and other applied ML cases.

Source

A simple post about precision and recall

The short source post this chapter grew out of.


Precision and recall answer different questions. Precision tells you how much you can trust positive predictions, while recall tells you what share of real positive cases the model actually catches. As soon as you choose a threshold, the discussion turns into a practical trade-off between extra alerts and dangerous misses.

Formulas in plain language

Precision

Of all the cases the model marked as positive, how many were actually positive.

Precision = TP / (TP + FP)

Recall

Of all the real positive cases, how many the model actually managed to find.

Recall = TP / (TP + FN)

When one combined number is needed, people often use the F1-score, but it still does not say which type of mistake is more expensive for the product.
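The formulas above translate directly into code. The following is a minimal sketch; the function names and the counts in the usage lines are illustrative assumptions, and returning 0.0 for an empty denominator is one possible convention rather than a standard.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 if nothing was marked positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 if there were no real positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts (assumed, not taken from a real system).
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
f1 = f1_score(tp=8, fp=2, fn=4)
```

Note that the harmonic mean pulls F1 toward the weaker of the two metrics, so a model cannot reach a high F1 by maximizing only one of them.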

Quick glossary

Precision

The share of correct positive predictions among all cases the model marked as positive.

Recall

The share of real positive cases that the model actually managed to find.

F1-score

One combined metric that balances precision and recall through the harmonic mean.

FP

False positive

The model raised a positive signal even though the case was actually negative.

FN

False negative

A positive case existed, but the model failed to catch it.

TP

True positive

The model correctly identified a real positive case.

TN

True negative

The model correctly left a negative case negative.

Interactive example: Vasya, sheep, and the wolf

Imagine Vasya deciding when to shout “wolf”. In this toy dataset there are one hundred observations, but the wolf really appears only ten times. A low threshold helps avoid missed danger, while a high threshold cuts down on unnecessary panic.

Threshold: 50%

Vasya keeps a workable balance between unnecessary panic and a dangerous missed wolf.

The four outcomes below are a true positive, a false positive, a false negative, and a true negative.

TP

6

Vasya raised the alarm, and the wolf really appeared.

FP

11

False alarm: Vasya shouted “wolf”, but there was no wolf.

FN

4

The wolf appeared, but Vasya stayed silent.

TN

79

Vasya stayed quiet, and there really was no danger.

Metrics at the current threshold

Precision: 35.3%
Recall: 60.0%
F1-score: 44.4%

This example uses 100 observations, and only 10 of them contain a real wolf. Move the threshold and watch how false alarms, misses, and the final metrics change under a rare positive class.
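The effect of moving the threshold can be sketched in a few lines of Python. The scores and labels below are made-up illustrative values, not the data behind the interactive example, and `confusion_counts` is a hypothetical helper name.

```python
# Hypothetical model scores and true labels (1 = wolf, 0 = no wolf).
SCORES = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
LABELS = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]


def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when 'score >= threshold' means 'shout wolf'."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        alarm = score >= threshold
        if alarm and label == 1:
            tp += 1
        elif alarm and label == 0:
            fp += 1
        elif not alarm and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


for threshold in (0.3, 0.5, 0.8):
    tp, fp, fn, tn = confusion_counts(SCORES, LABELS, threshold)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={prec:.2f} recall={rec:.2f}")
```

On this toy data, lowering the threshold from 0.8 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.50, which is the same trade-off the slider demonstrates.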

Threshold in a working system: step by step

Step 1. Price both error types

Translate false alarms and misses into money, support load, manual review cost, and reputational risk.

Step 2. Write down an operating target

For example: miss fewer than 5% of real positives (recall of at least 95%) while keeping the false-alarm rate below 15%. That turns metrics into a product requirement.

Step 3. Check quality by segment

Compare metrics across languages, customer groups, countries, and time windows so local degradation does not hide behind the average.

Step 4. Revisit the threshold on fresh data

An operating threshold rarely stays valid forever: after release you need to recalculate it on new data, not only on train/test splits.

After release, watch for drift: a shift in data or user behavior can make yesterday’s threshold misleading very quickly.
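Once both error types have a price, the threshold choice from the steps above can be sketched as a small search. This is a minimal illustration under stated assumptions: the scores, labels, candidate thresholds, and unit costs are all hypothetical, and real systems would also apply the operating target and segment checks.

```python
# Hypothetical validation scores and labels (1 = positive case).
SCORES = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
LABELS = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]


def error_counts(scores, labels, threshold):
    """Return (false alarms, misses) at the given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Price both error types and add them up."""
    fp, fn = error_counts(scores, labels, threshold)
    return fp * cost_fp + fn * cost_fn


def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


candidates = [0.3, 0.5, 0.8]
# A miss is assumed to cost ten times more than a false alarm.
miss_is_expensive = best_threshold(SCORES, LABELS, candidates, cost_fp=1, cost_fn=10)
# The reverse assumption: a false alarm costs ten times more than a miss.
alarm_is_expensive = best_threshold(SCORES, LABELS, candidates, cost_fp=10, cost_fn=1)
```

The same model ends up with a low threshold when misses are expensive and a high one when false alarms are expensive. Rerunning the search on fresh data is one simple way to revisit the threshold after release.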

When two metrics are no longer enough

If one operating point is not enough, look at the whole curve. ROC AUC shows how well a model separates positive and negative cases across all thresholds; ROC stands for Receiver Operating Characteristic. PR AUC is especially useful when the positive class is rare, because precision shows how quickly false alarms from the ninety calm observations swamp the ten real wolves.

ROC curve

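The X axis shows the false-positive rate (the share of calm observations that trigger a false alarm), and the Y axis shows the true-positive rate, or recall (the share of wolves the model still catches).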

AUC 0.80
[ROC curve chart: X axis FPR from 0 to 100%, Y axis TPR / Recall from 0 to 100%]

Current threshold 50%: FPR 12.2%, TPR 60.0%.

ROC AUC is useful as a separability metric, but with a rare wolf it can look more optimistic than the product experience.
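One way to see ROC AUC as a separability metric: it equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing all pairs; the function name and the toy data are illustrative assumptions, and the pairwise loop is only practical for small datasets.

```python
def roc_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and labels (1 = wolf).
SCORES = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
LABELS = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
auc = roc_auc(SCORES, LABELS)
```

Because the result depends only on how positives rank against negatives, it does not change when the same model is applied to a population with fewer wolves, which is why it can look optimistic under a rare positive class.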

PR curve

The X axis shows recall, and the Y axis shows precision: the share of raised alarms that really lead to a wolf.

AUC 0.54
[PR curve chart: X axis Recall from 0 to 100%, Y axis Precision from 0 to 100%, baseline at 10%]

Current threshold 50%: Recall 60.0%, Precision 35.3%.

The baseline precision here is only 10.0%: the wolf appears in just 10 out of 100 cases. That is why PR AUC better shows whether the model adds real value beyond random guessing.
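A common way to summarize the PR curve in one number is average precision: the mean of the precision values measured at each rank where a real positive is found. The sketch below is an illustration on made-up data; it assumes distinct scores, and the chart's AUC value may be computed with a different integration method.

```python
def average_precision(scores, labels):
    """Mean of precision at each rank where a true positive appears.

    Assumes scores are distinct; tied scores would need extra handling.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / total_positives


# Hypothetical scores and labels (1 = wolf).
SCORES = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
LABELS = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
ap = average_precision(SCORES, LABELS)
baseline = sum(LABELS) / len(LABELS)  # share of positives in the data
```

Comparing `ap` with `baseline` mirrors the comparison in the text: the summary is only meaningful relative to the share of positives in the data.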

ROC AUC

When: you want to know whether the model separates the wolf from harmless noise across the full threshold range.

It gives a compact view of separability even if you choose the operating threshold later.

PR AUC

When: the positive class is rare, like here, where the wolf appears only 10 times out of 100.

It penalizes extra alerts more clearly and better reflects practical value under imbalance.

F-beta score

When: missing a positive case is more expensive than triggering an extra alert.

It lets you intentionally increase recall weight through the beta parameter.

Probability calibration

When: model confidence directly drives product actions.

It makes probabilities closer to reality and keeps threshold decisions more stable.
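A simple way to check calibration is to group predictions into probability bins and compare the average predicted probability in each bin with the observed share of positives. The sketch below is a minimal illustration; `reliability_table`, the bin count, and the sample data are assumptions, not part of the original chapter.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return (mean predicted probability, observed positive rate, count)
    for every non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for prob, label in zip(probs, labels):
        # Clamp so that prob == 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, label))
    rows = []
    for bucket in bins:
        if bucket:
            mean_prob = sum(p for p, _ in bucket) / len(bucket)
            positive_rate = sum(y for _, y in bucket) / len(bucket)
            rows.append((round(mean_prob, 3), round(positive_rate, 3), len(bucket)))
    return rows


# Hypothetical predictions: the low bin is overconfident about "no wolf".
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1]
table = reliability_table(probs, labels, n_bins=2)
```

For a well-calibrated model the first two numbers in each row are close. In this toy data the low bin predicts 0.1 on average while the observed positive rate is 0.25, so a threshold chosen on raw probabilities would be less trustworthy there.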

The orange point on both charts is the current threshold from the slider. ROC AUC and PR AUC themselves do not move with that point: they summarize model behavior across the entire threshold range, not one chosen operating point.

If missing a positive case is more expensive than a false alarm, choose beta > 1 so recall has more influence on the final score. If product actions depend on raw probabilities, also watch probability calibration.
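The effect of beta can be sketched with the counts from the Vasya example at the 50% threshold (TP = 6, FP = 11, FN = 4). The function name is illustrative; the formula is the standard F-beta written in terms of confusion-matrix counts.

```python
def fbeta_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)."""
    beta_sq = beta * beta
    denominator = (1 + beta_sq) * tp + beta_sq * fn + fp
    return (1 + beta_sq) * tp / denominator if denominator else 0.0


# Counts from the Vasya example at the 50% threshold.
f1 = fbeta_score(tp=6, fp=11, fn=4, beta=1.0)       # equal weight
f2 = fbeta_score(tp=6, fp=11, fn=4, beta=2.0)       # recall weighs more
f_half = fbeta_score(tp=6, fp=11, fn=4, beta=0.5)   # precision weighs more
```

With recall at 60.0% and precision at 35.3%, beta = 2 moves the score toward recall (about 0.526), while beta = 0.5 moves it toward precision (about 0.385). With beta = 1 the result matches the F1-score of 44.4% shown in the example.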

Where each metric matters more

Code review assistant

Precision: High | Recall: Medium

If the system raises too many wrong alerts, the team quickly stops trusting it.

Medical screening

Precision: Medium | Recall: Very high

Missing a real case is usually far more expensive than sending someone to an extra check.

Payment fraud detection

Precision: High | Recall: High

You need to reduce both unnecessary blocks and missed fraudulent transactions.

Practical recommendations

Estimate the cost of a false alarm and the cost of a miss before choosing the operating threshold.

Look at precision and recall together with the confusion matrix instead of treating them as isolated numbers.

Check quality by segment so an average score does not hide local degradation.

With a rare positive class, add PR AUC next to the operating threshold so a nice ROC curve does not make you overestimate real quality.
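The segment check from the recommendations above can be sketched as a small grouping step. The segment names and records are hypothetical, and `metrics_by_segment` is an illustrative helper rather than an established API.

```python
from collections import defaultdict


def metrics_by_segment(records):
    """records: iterable of (segment, true_label, predicted_label) with 0/1 labels.
    Returns {segment: {"precision": ..., "recall": ...}}."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for segment, y_true, y_pred in records:
        c = counts[segment]
        if y_pred == 1 and y_true == 1:
            c["tp"] += 1
        elif y_pred == 1 and y_true == 0:
            c["fp"] += 1
        elif y_pred == 0 and y_true == 1:
            c["fn"] += 1
    result = {}
    for segment, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        result[segment] = {
            "precision": c["tp"] / predicted if predicted else 0.0,
            "recall": c["tp"] / actual if actual else 0.0,
        }
    return result


# Hypothetical records: (language segment, true label, predicted label).
records = [
    ("en", 1, 1), ("en", 1, 1), ("en", 0, 0), ("en", 0, 1),
    ("de", 1, 0), ("de", 1, 0), ("de", 1, 1), ("de", 0, 0),
]
per_segment = metrics_by_segment(records)
overall = metrics_by_segment([("all", y, p) for _, y, p in records])["all"]
```

In this toy data the overall recall is 0.60, but it splits into 1.00 for "en" and about 0.33 for "de", which is exactly the kind of local degradation an average hides.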

Common mistakes

Assuming that accuracy alone is enough when classes are heavily imbalanced.

Comparing models by one metric without tying it back to error cost and the confusion matrix.

Picking a threshold once and never revisiting it after changes in data, traffic, or product behavior.

Reading ROC AUC in isolation and skipping PR AUC in settings where the positive class is rare and false alarms are expensive.
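The first mistake is easy to demonstrate with the counts from the Vasya example at the 50% threshold, compared with a hypothetical "never shout wolf" rule on the same 100 observations. The function name is illustrative.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all observations that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Vasya at the 50% threshold: TP=6, FP=11, FN=4, TN=79.
vasya_accuracy = accuracy(tp=6, fp=11, fn=4, tn=79)

# A rule that never raises the alarm: all 10 wolves are missed.
silent_accuracy = accuracy(tp=0, fp=0, fn=10, tn=90)
silent_recall = 0 / (0 + 10)
```

The silent rule reaches 90% accuracy against Vasya's 85%, yet its recall is zero: it never catches a single wolf. Under heavy imbalance, accuracy mostly rewards agreeing with the majority class.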
