System Design Space

Updated: March 24, 2026 at 2:56 PM

Precision and recall at your fingertips

Difficulty: easy

A simple and practical explanation of precision/recall, their trade-off and threshold selection using the example of “Vasya and the Wolf”.

Precision and recall matter not as classroom formulas, but as the language a product uses to negotiate the cost of errors with a model.

The chapter grounds the topic in thresholds, false positives, false negatives, and the kinds of scenarios where metric choice directly changes user experience and operational consequences.

In interviews, it helps you explain the trade-off between error types quickly and clearly, and show that model tuning is always tied to task context rather than to an abstract maximum of quality.

Practical value of this chapter

Design in practice

Translate guidance on precision/recall metrics and ML-system quality evaluation into architecture decisions for data flow, model serving, and quality control points.

Decision quality

Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.

Interview articulation

Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.

Trade-off framing

Make trade-offs explicit for precision/recall metrics and ML-system quality evaluation: experiment speed, quality, explainability, resource budget, and maintenance complexity.

Source

Precision and recall at your fingertips

The original post on which this chapter is based.


Precision and recall capture different aspects of classifier quality. Precision reflects the quality of positive predictions; recall reflects the completeness of detection. In real systems, there is almost always a trade-off between these metrics.

Formulas in simple language

Precision

Of everything the model has labeled as positive, how much is actually positive.

Precision = TP / (TP + FP)

Recall (completeness)

Of all the truly positive cases, how many the model managed to find.

Recall = TP / (TP + FN)
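The two formulas can be sketched as tiny helper functions (the names `precision` and `recall` and the zero fallback for an empty denominator are illustrative choices, not from the original post):

```python
def precision(tp: int, fp: int) -> float:
    """Of all predicted positives, the share that is truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, the share the model managed to find."""
    return tp / (tp + fn) if tp + fn else 0.0

print(precision(tp=17, fp=10))  # 17/27 ≈ 0.63
print(recall(tp=17, fn=13))     # 17/30 ≈ 0.567
```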

Visualization: Vasya, sheep and wolf

Vasya keeps a balance between false alarms and missed wolves (decision threshold at 50%).

TP = 17: Vasya shouted, and the wolf was really there
FP = 10: false alarm ("wolf!", but there is no wolf)
FN = 13: the wolf was there, but Vasya stayed silent
TN = 60: silence, and there really is no wolf

Metrics at the current threshold

Precision: 63.0%
Recall: 56.7%
F1-score: 59.6%

The example uses a fixed stream of 100 events in which a real wolf appears 30 times. In the interactive version, moving the threshold shows how TP, FP, and FN grow and shrink.

Production threshold: step by step

Step 1. Quantify error cost

Translate FP and FN into money, support load, and reputational risks. Without this, threshold tuning is blind.

Step 2. Define an operational target

For example: FN <= 5% with FP <= 15%. This turns metrics into a concrete product contract.

Step 3. Validate by segments

Compare quality across languages, customer types, and time windows so local degradation is not hidden.

Step 4. Revisit threshold regularly

After release, monitor drift and recalculate the operating point on fresh data, not only on train/test.
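The steps above can be sketched as a simple threshold search. This is a sketch assuming binary 0/1 labels and per-event scores; the function name `pick_threshold` is illustrative, and the defaults use the example target from Step 2 (FN <= 5% of real positives, FP <= 15% of real negatives):

```python
import numpy as np

def pick_threshold(y_true, scores, max_fn_rate=0.05, max_fp_rate=0.15):
    """Return all score thresholds whose operating point meets both targets.

    The FN rate is measured over real positives (max_fn_rate=0.05 means
    recall >= 95%); the FP rate is measured over real negatives. An empty
    result means the model cannot meet the product contract at any threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    feasible = []
    for t in np.unique(scores):
        pred = scores >= t
        fn_rate = ((y_true == 1) & ~pred).sum() / pos
        fp_rate = ((y_true == 0) & pred).sum() / neg
        if fn_rate <= max_fn_rate and fp_rate <= max_fp_rate:
            feasible.append(float(t))
    return feasible
```

Per Step 4, this search should be rerun on fresh production data, not just once on the original test split.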

When precision/recall is not enough

PR-AUC

When: comparing several models across threshold ranges.

Better captures positive ranking quality under class imbalance.

F-beta score

When: FN is significantly more expensive than FP.

Lets you explicitly increase recall weight with the beta parameter.

Probability calibration

When: confidence drives product-level actions.

Reduces sharp quality swings after model and data updates.

If missing positives is more expensive than false alarms, use F-beta with beta > 1 so recall has a stronger impact on threshold selection.
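A minimal F-beta sketch (the helper name `f_beta` is illustrative) makes the weighting concrete: with beta > 1 the score tracks recall more closely, with beta < 1 it tracks precision:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: beta > 1 emphasizes recall, beta < 1 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A high-recall operating point scores better under F2 than its
# mirror-image high-precision counterpart:
print(f_beta(0.5, 1.0, beta=2))  # ≈ 0.833
print(f_beta(1.0, 0.5, beta=2))  # ≈ 0.556
```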

Which metric matters more where?

Code review assistant

Precision: High | Recall: Medium

False alarms quickly lead to fatigue and disregard for recommendations.

Medical screening

Precision: Medium | Recall: Very high

It is critical not to miss real cases of the disease (FN is more expensive than FP).

Antifraud in payments

Precision: High | Recall: High

You need a balance: do not block legitimate transactions and do not let fraud through.

Practical recommendations

Always pin down the cost of an FP and an FN for the specific product before choosing a threshold.

Show precision and recall together with the confusion matrix, not in isolation from each other.

Check metrics separately by segment (clients, data types, languages) so that local degradation is not hidden.

For review assistants, it often pays to keep precision higher in order to preserve user trust.
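The segment-wise check can be sketched as a small aggregation; the `(segment, y_true, y_pred)` row schema and the function name `metrics_by_segment` are assumptions for illustration, not from the original post:

```python
from collections import defaultdict

def metrics_by_segment(rows):
    """rows: iterable of (segment, y_true, y_pred) with 0/1 labels.

    Aggregates TP/FP/FN per segment so that local degradation is visible
    instead of being averaged away in a single global number.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for seg, y, p in rows:
        c = counts[seg]
        if y and p:
            c["tp"] += 1
        elif not y and p:
            c["fp"] += 1
        elif y and not p:
            c["fn"] += 1
    out = {}
    for seg, c in counts.items():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        out[seg] = {"precision": prec, "recall": rec}
    return out
```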

Common mistakes

Looking only at accuracy under strong class imbalance.

Comparing models by precision without controlling recall (and vice versa).

Fixing the threshold once and never revising it after the data changes.

Ignoring user reaction to false positives in production.
