Error-Budget Alerting: Multi-Window, Multi-Burn-Rate

A classic fixed-threshold alert is doomed to a compromise: a sensitive threshold is noisy on every spike, a calm one stays silent during a slow degradation that eats the whole error budget over a week. Either way the link between the alert and real user harm is lost.

Burn rate changes the frame of reference: the alert is tied not to a raw error ratio but to how fast the budget drains relative to your SLO. The threshold 'burn rate > 14.4x' means the same thing — the budget will run out too fast — at a 99% target and at a 99.99% target alike.

The multi-window, multi-burn-rate scheme from the SRE Workbook requires the threshold to be exceeded on the long AND the short window at once: the long one carries meaning, the short one carries freshness and a fast reset. This chapter covers the threshold pairs, the page-versus-ticket split, an implementation with Prometheus recording and alerting rules, and the common mistakes.

Practical value of this chapter

Design in practice

Turn the guidance on burn-rate alerting into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make the trade-offs of multi-window, multi-burn-rate alerting explicit: release speed, automation level, observability cost, and operational complexity.

Source

Google SRE Workbook

The “Alerting on SLOs” chapter: the path from a naive threshold to a working multi-window, multi-burn-rate scheme.

The neighbouring chapter SLI / SLO / SLA and Error Budgets introduces service level indicators and objectives, plus the error budget. This chapter is about how to alert on top of them: how burn rate ties an alert to real user harm and why a single threshold cannot do the job. We walk from the naive burn-rate alert to the multi-window, multi-burn-rate scheme from the SRE Workbook, then cover the page-versus-ticket split, a Prometheus implementation, and the common mistakes.

1. What is wrong with classic thresholds

Classic monitoring alerts on a fixed threshold over a raw metric, and the conflict is baked in: one threshold cannot be both sensitive to sharp spikes and calm about small background noise. Set it low and on-call gets paged for noise; set it high and a real degradation slips by unnoticed. Whichever way you move it, you lose one side.

The threshold is noisy

An alert like “error rate > 5% over 5 minutes” fires on any short spike, even if the monthly error budget is barely touched. On-call gets paged for an event that already self-healed.

The threshold is late

Make it more conservative (“> 5% over an hour”) and it stays silent during a slow 0.2% degradation that quietly eats the whole error budget over a week.

It loses the link to user harm

A fixed threshold knows nothing about your SLO: 5% errors at a 99% target and at a 99.99% target represent very different user harm, yet the threshold is identical.

Alert fatigue

Noisy thresholds breed alert fatigue: the team learns to wave signals away, and against that backdrop of false positives a real incident gets noticed too late.

2. Error budget and burn rate

The error budget is how many errors are allowed over a period: budget = 1 - SLO. At a 99.9% target the budget is 0.1% of requests over 30 days. The burn rate shows how many times faster than nominal that budget is being consumed:

burn rate = (observed error ratio) / (error budget)

1x: the budget is spent at exactly the allowed pace and runs out right at the end of the period (for a 30-day window, in 30 days).
14.4x: the pace that burns 2% of the monthly budget in one hour. At that pace the whole 30-day budget is gone in 1 h / 0.02 = 50 hours, and against the baseline of 720 hours per month that gives 720 / 50 = 14.4.

The key property of burn rate is that it is normalized to your SLO. The threshold “burn rate > 14.4x” means the same thing — “the budget will run out too fast” — whether your target is 99% or 99.99%.
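To make the normalization concrete, the definitions above can be written out as a small worked derivation. The symbols B, e, P and T_exhaust are introduced here only for illustration and do not come from the source; the sketch assumes a 30-day period and a constant error ratio, and uses amsmath's align environment.

```latex
\begin{align*}
B &= 1 - \mathrm{SLO} && \text{(error budget)} \\
\text{burn rate} &= \frac{e}{B} && \text{($e$ = observed error ratio)} \\
T_{\text{exhaust}} &= \frac{P}{\text{burn rate}} && \text{($P$ = SLO period, here 30 days = 720 h)}
\end{align*}
% The same 14.4x threshold corresponds to different raw error ratios:
\begin{align*}
\mathrm{SLO} = 99\%:    &\quad e = 14.4 \times 0.01   = 14.4\% \\
\mathrm{SLO} = 99.9\%:  &\quad e = 14.4 \times 0.001  = 1.44\% \\
\mathrm{SLO} = 99.99\%: &\quad e = 14.4 \times 0.0001 = 0.144\%
\end{align*}
% In all three cases T_exhaust = 720 h / 14.4 = 50 h.
```

The raw error ratio differs by two orders of magnitude across these targets, while the time until the budget is gone stays at 50 hours. That is the sense in which the threshold means the same thing at every SLO.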

Burn-rate calculator (worked example)

Inputs: service level objective 99.9%, observed error ratio over the window 1%.

Error budget: 0.100%
Burn rate: 10.0x
Budget runs out in: 3.0 days

At this pace the value lands in the medium page tier (6x / 6 hours): 10x is above 6x but below 14.4x. The comparison assumes a 99.9% target over 30 days. This is a simplification: in the real scheme a page only fires when the threshold is exceeded on the long AND the short window at once; the calculator only shows which speed tier the current value lands in.

3. The simple burn-rate alert and its problems

The first step is already better than a threshold on a raw metric: alert on “burn rate over a window > X”. But the simple single-window variant keeps two defects that the SRE Workbook spells out.

Sensitivity versus noise

A short window (5 minutes) reacts fast but is noisy: any brief blip drives the burn rate up. A long window (several hours) is calmer but slow to start — the reaction arrives with a large delay, and the first minutes of the incident are wasted.

Long reset time

A long window “remembers” the past: even after the problem is fixed, the averaged error ratio stays above the threshold for a while and the alert keeps firing. On-call sees a signal about an already-closed incident.
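For reference, a single-window burn-rate alert might look like the sketch below. It is shown only to illustrate the variant with the two defects described above, not as a recommended rule. The group and alert names are hypothetical, the 14.4x / 1 hour pairing is an assumption borrowed from the threshold table later in the chapter, and the recorded metric name is the one used in this chapter's Prometheus section.

```yaml
# Single-window burn-rate alert: illustrative only, see the defects above.
groups:
  - name: slo-single-window-sketch            # hypothetical group name
    rules:
      - alert: ErrorBudgetBurnSingleWindow    # hypothetical alert name
        # 14.4x burn at a 99.9% SLO: 14.4 * 0.001 = 0.0144, averaged over 1 hour.
        # With only the 1h window, the alert keeps firing until the hourly
        # average drops below the threshold, well after the errors stop.
        expr: job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
        labels:
          severity: page
```

Shrinking the window to 5 minutes would fix the slow reset but bring back the noise. No single window length avoids both problems.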

4. Multi-window, multi-burn-rate

The fix from the SRE Workbook: for each tier, require the threshold to be exceeded on the long AND the short window simultaneously. The long window carries the meaning (“the budget really is burning at this rate”), the short window carries freshness (“the burn is happening right now, not an hour ago”). The short window is made roughly 12 times shorter than the long one.

The short window clears the alert within minutes of the spike ending, even while the long window still “remembers” the incident. This sharply cuts the reset time.
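A rough estimate of the reset time shows why the short window matters. This is a sketch under stated assumptions, not a formula from the source: traffic is constant, the error ratio e has held for at least one full window W and then drops to zero at t = 0, and θ denotes the alert threshold (burn-rate threshold times error budget). All symbols are introduced here for illustration.

```latex
\begin{align*}
\bar{e}_W(t) &= e \cdot \frac{W - t}{W}, \qquad 0 \le t \le W
  && \text{(window average at time $t$ after the fix)} \\
t_{\text{reset}}(W) &= W \left(1 - \frac{\theta}{e}\right)
  && \text{(first moment with $\bar{e}_W(t) \le \theta$)}
\end{align*}
% Example: theta = 14.4 * 0.001 = 0.0144, e = 0.10, so theta / e = 0.144
\begin{align*}
t_{\text{reset}}(60\ \text{min}) &= 60 \times (1 - 0.144) \approx 51.4\ \text{min} \\
t_{\text{reset}}(5\ \text{min})  &= 5 \times (1 - 0.144) \approx 4.3\ \text{min}
\end{align*}
```

Because the alert needs both windows above the threshold, it clears when the first of them drops. In this example that is after roughly 4 minutes rather than 51.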

Recommended threshold pairs (99.9% SLO, 30 days)

Burn rate	Long window	Short window	Budget per window	Severity
14.4x	1 hour	5 minutes	2%	page
6x	6 hours	30 minutes	5%	page
1x	3 days	6 hours	10%	ticket

Parameter source: Google SRE Workbook, “Alerting on SLOs”. “Budget per window” is the share of the 30-day budget the burn will consume by the time the alert fires.
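The burn-rate column follows from the other two columns. As a worked check (the symbols f, P and W are introduced here for illustration; P = 720 h is the 30-day period), the burn rate that consumes a share f of the budget within a long window W is:

```latex
\begin{align*}
\text{burn rate} &= \frac{f \times P}{W} \\[4pt]
\text{fast page:}   &\quad \frac{0.02 \times 720\ \text{h}}{1\ \text{h}}  = 14.4 \\
\text{medium page:} &\quad \frac{0.05 \times 720\ \text{h}}{6\ \text{h}}  = 6 \\
\text{ticket:}      &\quad \frac{0.10 \times 720\ \text{h}}{72\ \text{h}} = 1
\end{align*}
% At a 99.9% SLO (error budget 0.001) the raw error-ratio thresholds are
% 14.4 * 0.001 = 0.0144, 6 * 0.001 = 0.006 and 1 * 0.001 = 0.001.
```

The same formula can be used to recompute the multipliers for a different period or a different budget share per tier.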

5. Severity: page or ticket

The burn rate directly sets the severity and the response you need: the faster the budget drains, the less time is left before it runs out — and the more urgent the reaction. That gives a natural split into two classes.

Fast burn → page

14.4x and 6x eat a noticeable share of the monthly budget (2% and 5%) within hours. This is an actionable signal: wake on-call immediately.

Slow burn → ticket

1x over 3 days is a gentle degradation. The budget is draining, but not catastrophically fast: a ticket to investigate during business hours is enough, with no night-time call.

A convenient second way to pick the threshold is via budget share: “page if this incident, at the current pace, will eat more than 2% of the monthly budget within an hour.” It is the same table, expressed as harm rather than rate.

6. Implementation in Prometheus

Precompute the error ratio for each window with recording rules — it is both faster and more readable. Then the alerting rule compares the ready-made metric against the threshold using the long-AND-short logic.

Recording rules: error ratio per window

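# Recording rules: ratio of "bad" events over a window as a single metric.
# Assumes http_requests_total carries a "code" label with the HTTP status.
groups:
  - name: slo:request_error_ratio
    rules:
      # Short window (5m): the freshness check for the fast-burn page.
      # Aggregated by job so the output labels match the "job:" name prefix.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
      # Long window (1h): the signal that the budget really is burning.
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
            / sum by (job) (rate(http_requests_total[1h]))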
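The two rules above cover only the fast tier. The other tiers in the threshold table also need the 30-minute, 6-hour and 3-day windows, which could be recorded along the same lines. This is a sketch, not part of the source: the group name is hypothetical, the rule names simply extend the naming pattern above, and the ratios are aggregated per job on the assumption that the `job:` prefix is meant to describe the output labels.

```yaml
groups:
  - name: slo:request_error_ratio_other_windows   # hypothetical group name
    rules:
      # Short window for the 6x tier.
      - record: job:slo_errors_per_request:ratio_rate30m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[30m]))
            / sum by (job) (rate(http_requests_total[30m]))
      # Long window for the 6x tier, short window for the 1x tier.
      - record: job:slo_errors_per_request:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
            / sum by (job) (rate(http_requests_total[6h]))
      # Long window for the 1x tier.
      - record: job:slo_errors_per_request:ratio_rate3d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[3d]))
            / sum by (job) (rate(http_requests_total[3d]))
```

Range queries over several days read many samples, so precomputing them as recording rules is likely to matter more for these windows than for the 5-minute one.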

Alerting rule: multi-window page

# Fast page: 14.4x over 1 hour AND over 5 minutes at the same time.
# 0.001 = error budget for a 99.9% SLO; 14.4 * 0.001 = 0.0144.
groups:
  - name: slo:burn_rate_alerts
    rules:
      - alert: ErrorBudgetBurnFast
        expr: |
          job:slo_errors_per_request:ratio_rate1h  > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m  > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at 14.4x"

The numbers in the expression: at a 99.9% SLO the error budget is 0.001, so the 14.4x threshold is 14.4 * 0.001 = 0.0144. The full scheme with all three tiers follows the same model, swapping the window pairs and multipliers from the table above.
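The remaining two tiers could be written as below. This is a sketch under assumptions: it presumes recording rules named `ratio_rate30m`, `ratio_rate6h` and `ratio_rate3d` exist, following the naming pattern of the 5m and 1h rules. The group and alert names are hypothetical, and the `severity: ticket` label only has an effect if Alertmanager is configured to route it to a ticket queue.

```yaml
groups:
  - name: slo:burn_rate_alerts_other_tiers    # hypothetical group name
    rules:
      # Medium page: 6x over 6 hours AND over 30 minutes.
      # 6 * 0.001 = 0.006 at a 99.9% SLO.
      - alert: ErrorBudgetBurnMedium          # hypothetical alert name
        expr: |
          job:slo_errors_per_request:ratio_rate6h > (6 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate30m > (6 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at 6x"
      # Slow burn: 1x over 3 days AND over 6 hours; a ticket, not a page.
      - alert: ErrorBudgetBurnSlow            # hypothetical alert name
        expr: |
          job:slo_errors_per_request:ratio_rate3d > (1 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate6h > (1 * 0.001)
        labels:
          severity: ticket
        annotations:
          summary: "Error budget burning at 1x"
```

For a different SLO only the 0.001 factor changes. For example, a 99.95% target would use 0.0005 in every expression.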

7. Trade-offs: the axes for choosing windows

The SRE Workbook suggests evaluating any alerting scheme along four axes. Choosing window lengths and thresholds is always a balance between them.

Precision

What share of fired alerts reflect a real problem. The second, short window exists exactly for precision: it filters out alerts on a spike that has already ended.

Recall

What share of real burns we catch at all. The slow-burn tier with its long window improves recall: it notices gentle degradations that the fast-burn thresholds would miss.

Detection time

How quickly the alert fires after a problem starts. For a given incident, a shorter window and a lower burn-rate threshold fire sooner, but get noisier.

Reset time

How long the alert keeps firing after the problem is fixed. A long window cools down slowly on its own; a short window clears the alert almost immediately.
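Detection time can be estimated with a simple model. The sketch assumes no errors before the incident, constant traffic, and a constant error ratio e during it that is high enough for the tier to fire at all. The symbols are introduced here for illustration. With the long-AND-short condition the long window W dominates, since it takes longest to fill.

```latex
\begin{align*}
t_{\text{detect}} &= \frac{\text{burn threshold} \times B \times W}{e},
  \qquad B = 1 - \mathrm{SLO}
\end{align*}
% Examples at a 99.9% SLO (B = 0.001):
\begin{align*}
\text{14.4x / 1 h},\ e = 100\%: &\quad \frac{14.4 \times 0.001 \times 60\ \text{min}}{1}    = 0.864\ \text{min} \approx 52\ \text{s} \\
\text{14.4x / 1 h},\ e = 10\%:  &\quad \frac{14.4 \times 0.001 \times 60\ \text{min}}{0.1}  = 8.64\ \text{min} \\
\text{6x / 6 h},\ e = 1\%:      &\quad \frac{6 \times 0.001 \times 360\ \text{min}}{0.01}   = 216\ \text{min} = 3.6\ \text{h}
\end{align*}
```

Detection time scales with the product of window length and burn-rate threshold, which is why shortening the window or lowering the threshold speeds up detection at the cost of precision. Any `for:` clause and the rule evaluation interval add to these figures.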

8. Common mistakes

A single window instead of a pair. Long-only means slow reset and late reaction; short-only means noise on spikes. The long-AND-short pair solves both at once.

Alerting on causes instead of symptoms. Paging on “CPU > 90%” or “queue is growing” catches things that may not hurt the user at all. Alert on SLIs — that is, on symptoms.

Paging on everything. A slow burn (1x) is a ticket to investigate during business hours, not a reason to wake someone at night.

Hard thresholds on raw error rate with no link to the SLO. The same “5% over 5 minutes” is meaningless to copy between services with different targets.

Recommendations

Run 2-3 tiers: fast (14.4x / 1 hour) and medium (6x / 6 hours) as pages, slow (1x / 3 days) as a ticket.

Make the short window about 1/12 of the long one, as the SRE Workbook advises. It acts as the AND condition and is responsible for filtering false positives on faded spikes.

Compute the error ratio with recording rules, then have the alerting rule compare the ready-made metric against a threshold. Expressions stay readable and fast.

Alert on the symptoms of user journeys, not on infrastructure causes. Causes are useful for dashboards and analysis, but not for a page.

References

Source map: the SRE Workbook supports the multi-window, multi-burn-rate parameters and page/ticket criteria; the SRE Book anchors SLO/error-budget basics; Prometheus docs support recording and alerting rules. The 14.4x/6x/1x numbers are tied to the workbook's 99.9%-over-30-days model; for another SLO, budget window, or on-call policy, recompute them.

Related chapters

SLI / SLO / SLA and Error Budgets - The neighbouring chapter: it introduces service level indicators and objectives, the error budget, and baseline burn rate — the foundation everything in this chapter builds on.
Observability & Monitoring Design - Designing the metrics, logs, and tracing that trustworthy SLI signals and burn-rate alerts are built on.
Prometheus Architecture - How metric collection, recording and alerting rules, and Alertmanager fit together — the engine that actually evaluates multi-window burn-rate alerts.
Incident Management Discipline - What happens after a page fires: routing, roles, review. It closes the loop from budget-based alert to response to learning.