SLI / SLO / SLA and Error Budgets — System Design Space

SLIs, SLOs, SLAs, and error budgets matter because they turn reliability from opinion into an explicit agreement about risk.

The chapter shows how indicators, service targets, and error budgets connect product expectations with operations: teams use them to read burn rate, pause releases, and decide when fixing the system matters more than shipping the next feature.

In interviews, this material is especially useful because it lets you discuss measurement discipline, acceptable risk, and release policy instead of falling back to the vague idea that a system should simply be stable.

Practical value of this chapter

Design in practice

Turn reliability goals into measurable indicators, service targets, error budgets, and alerting rules.

Decision quality

Evaluate architecture through user journeys, burn rate, and failure cost rather than average availability alone.

Interview articulation

Show when a team can keep shipping changes and when it should switch to stabilization mode.

Trade-off framing

Make the trade-off explicit between release speed, customer expectations, reliability cost, and external commitments.

Source

Google SRE Workbook

A practical guide to defining SLI/SLO and operating with error budgets.

“The service should be reliable” means different things to product and to engineering until reliability is written as a number. SLI / SLO / SLA are that shared language: they turn a vague expectation into measurable rules you can argue and decide by. This chapter explains service level indicators, service level objectives, service level agreements, error budgets, and burn rate. Together they draw the line between “keep shipping” and “protect reliability first.” For wider SRE context, start with the section introduction.

How SLI, SLO, and SLA differ

SLI

Service Level Indicator

What you measure

A measurable service-quality signal on a user path: availability, latency, error rate, or freshness.

SLO

Service Level Objective

What you target

A target value for an SLI over a period. Example: 99.9% successful requests over 30 days.

SLA

Service Level Agreement

What you promise externally

An external commitment with a price for breaching it: credits, penalties, or support obligations. Because a breach costs money, an SLA is usually set deliberately looser than the internal objective, so the team misses its SLO and reacts before the contract is at risk.
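A minimal sketch of how the three terms relate: an availability SLI is computed from request counts and then compared with an internal SLO and a looser external SLA. The function name, the request totals, and the 99.5% SLA figure are illustrative assumptions, not values from the source.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of successful requests (the SLI)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


# Hypothetical 30-day totals for one user journey.
sli = availability_sli(good_requests=998_500, total_requests=1_000_000)

SLO = 0.999  # internal objective (the 99.9% example above)
SLA = 0.995  # assumed external commitment, deliberately looser

meets_slo = sli >= SLO
meets_sla = sli >= SLA
print(f"SLI={sli:.2%} meets_slo={meets_slo} meets_sla={meets_sla}")
```

With these numbers the internal objective is missed while the external commitment still holds, which is the early-warning margin the gap between SLO and SLA is meant to provide.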

Why this matters

One language for product and engineering

An SLO turns “the service should be stable” into measurable decision criteria.

Release risk control

An error budget creates a formal gate: keep shipping safely or switch to stabilization mode.

Clear prioritization

When the budget runs low, the “reliability vs. new feature” debate is settled by a number rather than the loudest voice: the team can show how much budget continued shipping keeps consuming.

Predictable customer expectations

An SLA sets external commitments, while SLOs help engineering stay inside those bounds.

Calculator 1: allowed downtime

Inputs: target SLO (%) and calculation period. Sample result for a 99.9% SLO over 30 days:

Error budget: 0.100%
Allowed downtime: 43 min (2,592 seconds)
Errors per 1M requests: 1,000

Formula: budget = (1 - SLO) * period. For example, with a 99.9% SLO over 30 days, the service has about 43 minutes of downtime budget.
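The formula can be sketched in a few lines of Python. The function names are hypothetical; the inputs are the 99.9% and 30-day example from the text.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta) -> timedelta:
    """Allowed downtime: budget = (1 - SLO) * period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period


def allowed_errors_per_million(slo: float) -> int:
    """Failed requests tolerated per 1,000,000 requests."""
    return round((1.0 - slo) * 1_000_000)


budget = error_budget(0.999, timedelta(days=30))
budget_seconds = budget.total_seconds()
budget_minutes = budget_seconds / 60
errors_per_million = allowed_errors_per_million(0.999)
print(f"{budget_minutes:.1f} min, {budget_seconds:.0f} s, "
      f"{errors_per_million} errors per 1M requests")
```

The same function shows how quickly the budget shrinks as the target tightens: each extra nine cuts the allowed downtime by a factor of ten.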

Calculator 2: budget burn rate

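Inputs: observation window (min), requests in window, errors in window, budget already spent (%).

Burn rate is the observed error rate divided by the error rate the SLO allows. At 1x the budget lasts exactly the whole period; above 1x it runs out early. The sample result below is consistent with a 99.9% SLO (0.0240% / 0.100% = 0.24x):

Observed error rate: 0.0240%
Burn rate: 0.24x
Spent in window: 0.03%
Remaining budget: 84.97%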

At the current pace, budget exhaustion is expected in 106 d 5 h 0 min.

Budget is burning slowly: you still have room for safe releases.
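A sketch of the arithmetic behind such a calculator. The page does not show the input values, so the ones used here (99.9% SLO, 30-day period, 60-minute window, 100,000 requests, 24 errors, 15% of budget already spent) are assumptions chosen because they reproduce the figures above. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BurnReport:
    observed_error_rate: float  # fraction of failed requests in the window
    burn_rate: float            # observed rate / rate allowed by the SLO
    spent_in_window: float      # fraction of the total budget used in window
    remaining_budget: float     # fraction of the total budget still left
    days_to_exhaustion: float   # at the current burn rate


def burn_report(slo: float, period_days: float, window_min: float,
                requests: int, errors: int, budget_spent: float) -> BurnReport:
    if requests <= 0:
        raise ValueError("requests must be positive")
    allowed_error_rate = 1.0 - slo
    observed = errors / requests
    burn_rate = observed / allowed_error_rate
    # Share of the SLO period covered by the observation window.
    window_fraction = window_min / (period_days * 24 * 60)
    spent_in_window = burn_rate * window_fraction
    remaining = 1.0 - budget_spent - spent_in_window
    if burn_rate > 0:
        days_left = max(remaining, 0.0) * period_days / burn_rate
    else:
        days_left = float("inf")  # no errors observed, budget is not burning
    return BurnReport(observed, burn_rate, spent_in_window, remaining, days_left)


report = burn_report(slo=0.999, period_days=30, window_min=60,
                     requests=100_000, errors=24, budget_spent=0.15)
print(f"burn rate {report.burn_rate:.2f}x, "
      f"remaining {report.remaining_budget:.2%}, "
      f"exhaustion in {report.days_to_exhaustion:.1f} days")
```

Under these assumed inputs the burn rate comes out at 0.24x and exhaustion at roughly 106 days, matching the calculator output.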

How to use this in daily operations

Select 1-3 critical user journeys and define SLIs for those paths.
Agree on SLOs with product priorities and failure cost in mind.
Define a release policy for each burn-rate tier: below 1x, between 1x and 2x, and above 2x.
Connect SLO-based alerting and incident response to error-budget consumption, not only infrastructure metrics.
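The burn-rate tiers from the list above can be encoded as a small policy function. The tier boundaries (1x and 2x) come from the text; the action attached to each tier is an assumption for illustration, since the right response depends on the team's own policy.

```python
from enum import Enum


class ReleaseMode(Enum):
    NORMAL = "ship as usual"
    CAUTIOUS = "ship with smaller batches and extra review"
    STABILIZE = "pause feature releases and fix reliability first"


def release_mode(burn_rate: float, remaining_budget: float) -> ReleaseMode:
    """Map the current burn rate and remaining budget to a release mode.

    remaining_budget is a fraction of the total error budget (0.0 to 1.0).
    """
    if remaining_budget <= 0.0 or burn_rate > 2.0:
        return ReleaseMode.STABILIZE
    if burn_rate >= 1.0:
        return ReleaseMode.CAUTIOUS
    return ReleaseMode.NORMAL


for rate in (0.24, 1.5, 3.0):
    print(f"{rate}x -> {release_mode(rate, remaining_budget=0.85).value}")
```

Writing the policy down as code (or as an equally explicit document) keeps the decision mechanical: the gate is agreed on before an incident, not negotiated during one.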

Common anti-patterns

Measuring SLIs only by CPU/RAM load: the graphs stay green while users hit errors and timeouts on their path.

Setting a 99.999% SLO without linking it to business expectations, architecture limits, and cost.

Using an SLA as an internal engineering metric instead of an external contractual commitment.

Watching only monthly totals and ignoring burn rate, so you learn about the problem once the budget is already gone and it is too late to react.

Recommendations

Define 1-3 critical user journeys and build SLIs around those paths.

Tie releases to an error-budget policy: if budget is available, accept risk; if it is exhausted, stabilize first.

Use fast and slow SLO-based alerts to catch both sharp spikes and slow degradation.
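One common way to implement fast and slow alerts is a multi-window burn-rate check: an alert fires only when both a long window and a short window exceed the same burn-rate threshold, so a spike that has already ended does not page anyone. The sketch below assumes the window pairs and thresholds often quoted for a 30-day SLO in the Google SRE Workbook (14.4x over 1h/5m, 6x over 6h/30m); treat them as starting points, not prescriptions. All names are hypothetical, and the burn rates are assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window: str
    short_window: str
    threshold: float  # burn-rate multiple that triggers the alert


ALERTS = [
    BurnAlert("fast", long_window="1h", short_window="5m", threshold=14.4),
    BurnAlert("slow", long_window="6h", short_window="30m", threshold=6.0),
]


def firing_alerts(burn_rates: dict[str, float]) -> list[str]:
    """Return names of alerts whose long AND short windows exceed the threshold.

    burn_rates maps a window label (e.g. "1h") to the burn rate measured
    over that window.
    """
    return [
        alert.name
        for alert in ALERTS
        if burn_rates[alert.long_window] >= alert.threshold
        and burn_rates[alert.short_window] >= alert.threshold
    ]


# Sharp spike: every window is burning far above budget.
spike = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 7.0}
# Slow degradation: too slow for the fast alert, enough for the slow one.
slow_leak = {"5m": 7.0, "30m": 6.5, "1h": 6.8, "6h": 6.2}

print(firing_alerts(spike))
print(firing_alerts(slow_leak))
```

The short window acts as a reset condition: once the error rate recovers, the short-window burn rate drops below the threshold and the alert clears even though the long window is still elevated.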

Separate internal SLOs from external SLAs so customer expectations stay realistic.

References

Related chapters

Site Reliability Engineering (short summary) - Provides the core SRE model around service objectives, error budgets, toil reduction, and the speed-versus-reliability balance.
The Site Reliability Workbook (short summary) - Adds practical rollout guidance for SLO adoption in production: alerting rules, burn rate, and team operating rhythm.
Observability & Monitoring Design - SLIs are only as good as the data beneath them; this covers the metrics, logs, and tracing design those signals stand on.
Performance Engineering - Deepens latency-focused SLI work with profiling, capacity planning, and performance budgets.
Release It! (short summary) - Extends reliability policy discussions with resilience patterns and safer release strategies.