System Design Space

Updated: March 3, 2026 at 10:30 PM

SLI / SLO / SLA and Error Budgets


Practical walkthrough of SLI/SLO/SLA: why they matter, how to read burn rate, and how to calculate an error budget, with two worked calculators.

Source

Google SRE Workbook

A practical guide to defining SLI/SLO and operating with error budgets.


SLI, SLO, and SLA form the shared language between business expectations and engineering decisions. This chapter covers how to formalize reliability and why the error budget directly drives release pace, prioritization, and cost. For wider SRE context, start with the section introduction.

What SLI, SLO, and SLA mean

SLI

Service Level Indicator

A measurable service-quality signal: availability, latency, error rate, or freshness.

SLO

Service Level Objective

A target value for an SLI over a period. Example: 99.9% of requests succeed over 30 days.

SLA

Service Level Agreement

An external contract with consequences (credits, penalties, support commitments).
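
The relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLI; the WindowCounts type, the function names, and the sample counts are hypothetical and not part of the original page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts for one measurement window (hypothetical data shape)."""
    good: int   # requests that met the success criterion
    total: int  # all valid requests in the window


def availability_sli(counts: WindowCounts) -> float:
    """SLI as the fraction of good requests.

    Assumption: a window with no traffic is treated as fully healthy.
    """
    if counts.total == 0:
        return 1.0
    return counts.good / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Illustrative numbers only: 760 failed requests out of one million.
window = WindowCounts(good=999_240, total=1_000_000)
sli = availability_sli(window)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is what you measure and the SLO is the threshold you compare it to. An SLA would add contractual consequences on top, usually with a looser target than the internal SLO.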

Why this matters

One language for product and engineering

An SLO turns "the service should be stable" into measurable decision criteria.

Release risk control

The error budget provides a formal gate: while budget remains, accelerate feature delivery; once it is spent, prioritize stability work.

Clear prioritization

You can justify reliability investment with numbers instead of intuition.

Predictable customer expectations

The SLA sets external commitments, while the internal SLO helps engineering stay inside those bounds.

Calculator 1: allowed downtime from SLO

Error budget: 0.100%
Allowed downtime: 43 min
In seconds: 2,592
Errors per 1M requests: 1,000

Formula: budget = (1 - SLO) * period. For example, a 99.9% SLO over 30 days gives 0.001 * 43,200 min = 43.2 minutes (2,592 seconds) of downtime budget.
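
The same formula as a small Python sketch, so the numbers can be reproduced for other targets. The function names are illustrative, and the errors-per-million figure assumes a request-based SLI.

```python
from datetime import timedelta


def error_budget_fraction(slo: float) -> float:
    """Share of the period (or of requests) that is allowed to fail."""
    return 1.0 - slo


def allowed_downtime(slo: float, period: timedelta) -> timedelta:
    """budget = (1 - SLO) * period, expressed as a duration."""
    return period * error_budget_fraction(slo)


def allowed_errors_per_million(slo: float) -> int:
    """Failed requests allowed per 1,000,000 requests."""
    return round(error_budget_fraction(slo) * 1_000_000)


slo = 0.999
period = timedelta(days=30)
downtime = allowed_downtime(slo, period)

print(f"Error budget: {error_budget_fraction(slo):.3%}")
print(f"Allowed downtime: {downtime.total_seconds() / 60:.1f} min")
print(f"Errors per 1M requests: {allowed_errors_per_million(slo)}")
```

Changing slo to 0.9999 shrinks the 30-day budget tenfold, to about 4.3 minutes, which is why each extra "nine" needs an explicit cost discussion.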

Calculator 2: burn rate and remaining budget

Observed error rate: 0.0240%
Burn rate: 0.24x
Spent in window: 0.03%
Remaining budget: 84.97%

At the current pace, budget exhaustion is expected in 106 d 5 h 0 min.

Budget is burning slowly: you have room for safe releases.
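
The burn-rate arithmetic can be sketched as follows. Burn rate is the observed error rate divided by the error budget (1 - SLO), so 1x means the budget lasts exactly one SLO period. The calculator's inputs are not shown in the text, so this sketch assumes a 99.9% SLO over 30 days, a 1-hour observation window, and 15% of the budget already consumed before that window. Those last two values are guesses chosen because they are consistent with the figures above; the function names are illustrative.

```python
from datetime import timedelta
from typing import Optional


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget is consumed relative to the sustainable pace (1x)."""
    return observed_error_rate / (1.0 - slo)


def budget_spent_in_window(rate: float, window: timedelta, period: timedelta) -> float:
    """Fraction of the total budget consumed during one observation window."""
    return rate * (window / period)


def time_to_exhaustion(remaining: float, rate: float,
                       period: timedelta) -> Optional[timedelta]:
    """Time until the remaining budget fraction is gone at the current burn rate."""
    if rate <= 0:
        return None  # nothing is burning, so the budget never runs out
    return period * (remaining / rate)


period = timedelta(days=30)
rate = burn_rate(observed_error_rate=0.00024, slo=0.999)        # about 0.24x
spent = budget_spent_in_window(rate, timedelta(hours=1), period)
remaining = 1.0 - 0.15 - spent   # 0.15 is an assumed earlier consumption

print(f"Burn rate: {rate:.2f}x")
print(f"Spent in window: {spent:.2%}")
print(f"Remaining budget: {remaining:.2%}")
print(f"Exhaustion in: {time_to_exhaustion(remaining, rate, period)}")
```

A burn rate below 1x means the budget outlasts the SLO period, which is why the calculator reports room for safe releases.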

How to use this in daily operations

  1. Select 1-3 critical user journeys and define an SLI for each path.
  2. Agree on an SLO for each SLI, with product priorities and failure cost in mind.
  3. Define a release policy for burn-rate tiers: < 1, 1-2, and > 2.
  4. Connect alerting and incident response to budget consumption, not only infra-level metrics.
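
Step 3 above can be encoded as a small policy function. The tier boundaries (< 1, 1-2, > 2) come from the list; the action attached to each tier is an assumption for illustration, and a real policy would be agreed between product and engineering.

```python
from enum import Enum


class ReleaseMode(Enum):
    # The action descriptions are hypothetical examples, not a prescribed policy.
    NORMAL = "ship at the normal pace"
    CAUTIOUS = "ship with smaller rollouts and extra review"
    STABILIZE = "pause feature releases and prioritize reliability work"


def release_mode(burn_rate: float) -> ReleaseMode:
    """Map a burn rate to a release tier: < 1, 1-2, > 2."""
    if burn_rate < 1.0:
        return ReleaseMode.NORMAL
    if burn_rate <= 2.0:
        return ReleaseMode.CAUTIOUS
    return ReleaseMode.STABILIZE


for rate in (0.24, 1.5, 3.0):
    mode = release_mode(rate)
    print(f"burn rate {rate:.2f}x -> {mode.name}: {mode.value}")
```

Keeping the policy in code, or in reviewed configuration, makes the release gate explicit, so it is not renegotiated during an incident.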

Common anti-patterns

Measuring SLI only at the infrastructure layer (CPU/RAM), not on real user journeys.

Setting a 99.999% SLO without linking it to business expectations, architecture limits, and cost.

Using SLA as an internal engineering metric instead of an external contractual commitment.

Ignoring burn rate and checking only monthly totals, by which time the budget is already gone.

Recommendations

Define 1-3 critical user journeys and build SLIs around those paths.

Tie releases to an error budget policy: budget available -> ship faster, budget exhausted -> stabilization mode.

Use both fast and slow burn-rate alerts to catch spikes and sustained degradation.
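
A minimal sketch of paired fast and slow burn-rate alerts, in the spirit of the multiwindow approach described in the Google SRE Workbook. The thresholds (14.4x over 1 hour, 6x over 6 hours, each confirmed by a shorter window) follow the workbook's example for a 30-day SLO and should be treated as starting points. The BurnAlert type and the input mapping of window length to burn rate are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window_min: int   # main evaluation window, in minutes
    short_window_min: int  # confirmation window, so the alert resets quickly
    threshold: float       # burn rate that must be exceeded in both windows


# Assumed starting values for a 30-day SLO; tune them for your own service.
DEFAULT_ALERTS = (
    BurnAlert("fast-burn", long_window_min=60, short_window_min=5, threshold=14.4),
    BurnAlert("slow-burn", long_window_min=360, short_window_min=30, threshold=6.0),
)


def firing_alerts(burn_by_window: Mapping[int, float],
                  alerts: Sequence[BurnAlert] = DEFAULT_ALERTS) -> List[str]:
    """Return the names of alerts whose long and short windows both exceed the threshold.

    burn_by_window maps window length in minutes to the burn rate measured
    over that window; missing windows are treated as a burn rate of 0.
    """
    fired = []
    for alert in alerts:
        long_burn = burn_by_window.get(alert.long_window_min, 0.0)
        short_burn = burn_by_window.get(alert.short_window_min, 0.0)
        if long_burn >= alert.threshold and short_burn >= alert.threshold:
            fired.append(alert.name)
    return fired


# Illustrative measurements: a sharp spike visible in the 5-minute and 1-hour windows.
spike = {5: 20.0, 60: 15.0, 30: 8.0, 360: 3.0}
print(firing_alerts(spike))
```

The fast alert catches spikes that would consume the budget within hours, and the slow alert catches sustained degradation that never crosses the fast threshold. Requiring the short window to agree keeps an alert from firing long after the problem has ended.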

Separate the internal SLO from the external SLA so customer expectations stay realistic.


© 2026 Alexander Polomodov