SLIs, SLOs, SLAs, and error budgets matter because they turn reliability from opinion into an explicit agreement about risk.
The chapter shows how indicators, service targets, and error budgets connect product expectations with operations: teams use them to read burn rate, pause releases, and decide when fixing the system matters more than shipping the next feature.
In interviews, this material is especially useful because it lets you discuss measurement discipline, acceptable risk, and release governance instead of falling back to the vague idea that a system should simply be stable.
Practical value of this chapter
Design in practice
Turn SLIs, SLOs, and error-budget policy into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make the trade-offs of SLO and error-budget management explicit: release speed, automation level, observability cost, and operational complexity.
Source
Google SRE Workbook
A practical guide to defining SLI/SLO and operating with error budgets.
SLIs, SLOs, and SLAs are the shared language between business expectations and engineering decisions. In this chapter we break down how to formalize reliability and why the error budget directly drives release pace, prioritization, and cost. For wider SRE context, start with the section introduction.
What SLI, SLO, and SLA mean
SLI
Service Level Indicator
A measurable service-quality signal: availability, latency, error rate, or freshness.
SLO
Service Level Objective
A target value for an SLI over a period. Example: 99.9% of requests succeed over 30 days.
SLA
Service Level Agreement
An external contract with consequences (credits, penalties, support commitments).
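To make the SLI/SLO relationship concrete, here is a minimal sketch that computes an availability SLI from request counts and checks it against an SLO target. The names `RequestCounts`, `availability_sli`, and `meets_slo` are hypothetical, and treating a window with no traffic as compliant is an assumption rather than a rule from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical container for one measurement window."""
    good: int   # requests that met the success criterion
    total: int  # all valid requests in the window


def availability_sli(counts: RequestCounts) -> float:
    """Return the SLI as the fraction of good requests (0.0 to 1.0)."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return counts.good / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


if __name__ == "__main__":
    window = RequestCounts(good=999_200, total=1_000_000)
    sli = availability_sli(window)
    print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLA does not appear in the code on purpose: it is a contract layered on top of these measurements, not a separate metric.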
Why this matters
One language for product and engineering
An SLO turns "the service should be stable" into measurable decision criteria.
Release risk control
The error budget provides a formal gate: while budget remains, accelerate feature delivery; once it is spent, prioritize stability work.
Clear prioritization
You can justify reliability investment with numbers instead of intuition.
Predictable customer expectations
The SLA sets external commitments, while the SLO helps engineering stay inside those bounds.
Calculator 1: allowed downtime from SLO
- Error budget: 0.100%
- Allowed downtime: 43 min
- In seconds: 2,592
- Errors per 1M requests: 1,000
Formula: budget = (1 - SLO) * period. For example, a 99.9% SLO over 30 days leaves (1 - 0.999) * 30 days = 43.2 minutes (2,592 seconds) of downtime budget.
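The formula can be sketched in a few lines of Python. This is an illustrative calculation, not the calculator's actual implementation; the function names are hypothetical.

```python
from datetime import timedelta


def error_budget_fraction(slo_target: float) -> float:
    """Fraction of the period (or of requests) that is allowed to fail."""
    return 1.0 - slo_target


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """budget = (1 - SLO) * period, expressed as a duration."""
    return period * error_budget_fraction(slo_target)


def allowed_errors(slo_target: float, requests: int) -> int:
    """The same budget expressed as a number of failed requests."""
    return round(error_budget_fraction(slo_target) * requests)


if __name__ == "__main__":
    downtime = allowed_downtime(0.999, timedelta(days=30))
    print(f"Allowed downtime: {downtime.total_seconds() / 60:.1f} min")
    print(f"Errors per 1M requests: {allowed_errors(0.999, 1_000_000)}")
```

Expressing the budget both as time and as a request count is useful because request-based SLIs rarely fail as a clean block of downtime.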
Calculator 2: burn rate and remaining budget
- Observed error rate: 0.0240%
- Burn rate: 0.24x
- Spent in window: 0.03%
- Remaining budget: 84.97%
At the current pace, budget exhaustion is expected in 106 d 5 h 0 min.
Budget is burning slowly: you have room for safe releases.
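The sketch below shows one way these figures could be derived. Burn rate is the observed error rate divided by the error budget (0.024% / 0.1% = 0.24x). The calculator's other inputs are not shown in the text, so the 1-hour observation window and the 15% of budget already spent are assumptions, chosen because they reproduce the displayed numbers. All function names are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is consumed relative to the sustainable pace (1x)."""
    return observed_error_rate / (1.0 - slo_target)


def budget_spent_in_window(rate: float, window_hours: float,
                           period_hours: float) -> float:
    """Fraction of the total budget consumed during the window."""
    return rate * window_hours / period_hours


def hours_to_exhaustion(remaining_fraction: float, rate: float,
                        period_hours: float) -> float:
    """Time until the remaining budget reaches zero at the current pace."""
    if rate <= 0:
        return float("inf")  # nothing is burning, so the budget never runs out
    return remaining_fraction * period_hours / rate


if __name__ == "__main__":
    PERIOD_HOURS = 30 * 24    # 30-day SLO period
    WINDOW_HOURS = 1          # assumption: not stated in the text
    PREVIOUSLY_SPENT = 0.15   # assumption: not stated in the text

    rate = burn_rate(0.00024, 0.999)
    spent = budget_spent_in_window(rate, WINDOW_HOURS, PERIOD_HOURS)
    remaining = 1.0 - PREVIOUSLY_SPENT - spent
    hours = hours_to_exhaustion(remaining, rate, PERIOD_HOURS)

    print(f"Burn rate: {rate:.2f}x")
    print(f"Spent in window: {spent:.2%} of budget")
    print(f"Remaining budget: {remaining:.2%}")
    print(f"Exhaustion in about {hours / 24:.1f} days")
```

A burn rate below 1x means the budget lasts longer than the SLO period, which is why the calculator reports room for safe releases.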
How to use this in daily operations
- Select 1-3 critical user journeys and define an SLI for each path.
- Agree on an SLO for each SLI with product priorities and failure cost in mind.
- Define a release policy for each burn-rate tier: below 1x, 1x to 2x, and above 2x.
- Connect alerting and incident response to budget consumption, not only infra-level metrics.
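A burn-rate release policy like the one in the steps above can be written down as a small lookup. The tier boundaries (below 1x, 1x to 2x, above 2x) come from the text; the action attached to each tier is a hypothetical example, since the source does not prescribe one.

```python
from enum import Enum


class ReleasePolicy(Enum):
    # The actions are illustrative assumptions, not rules from the source.
    NORMAL = "ship at normal pace"
    CAUTIOUS = "ship with extra review and smaller rollouts"
    FREEZE = "pause feature releases and prioritize stability work"


def release_policy(rate: float) -> ReleasePolicy:
    """Map a burn rate to a release tier: < 1, 1-2, and > 2."""
    if rate < 1.0:
        return ReleasePolicy.NORMAL
    if rate <= 2.0:
        return ReleasePolicy.CAUTIOUS
    return ReleasePolicy.FREEZE


if __name__ == "__main__":
    for observed in (0.24, 1.5, 3.0):
        print(f"{observed:.2f}x -> {release_policy(observed).value}")
```

Writing the policy as code keeps the thresholds in one reviewable place instead of leaving them to judgment during an incident.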
Common anti-patterns
- Measuring SLIs only at the infrastructure layer (CPU/RAM), not on real user journeys.
- Setting a 99.999% SLO without linking it to business expectations, architecture limits, and cost.
- Using the SLA as an internal engineering metric instead of an external contractual commitment.
- Ignoring burn rate and checking only monthly totals, by which point the budget is already gone.
Recommendations
- Define 1-3 critical user journeys and build SLIs around those paths.
- Tie releases to an error budget policy: budget available -> ship faster, budget exhausted -> stabilization mode.
- Use both fast and slow burn-rate alerts to catch spikes and sustained degradation.
- Keep the internal SLO separate from, and stricter than, the external SLA so customer expectations stay realistic.
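The fast and slow burn-rate alerts mentioned above could look roughly like the following sketch. The thresholds (14.4x over 1 hour to page, 1x over 3 days to open a ticket) follow a commonly cited example for a 30-day SLO in the Google SRE Workbook and should be treated as starting points. The rule structure, names, and input format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    window_hours: int   # how far back the error rate is measured
    threshold: float    # burn rate at or above which the rule fires
    severity: str       # hypothetical routing label


# Starting-point values for a 30-day SLO; tune them per service.
RULES = (
    BurnRateRule("fast-burn", window_hours=1, threshold=14.4, severity="page"),
    BurnRateRule("slow-burn", window_hours=72, threshold=1.0, severity="ticket"),
)


def firing_alerts(error_rate_by_window: dict[int, float],
                  slo_target: float) -> list[str]:
    """Return 'name:severity' for every rule whose burn rate is exceeded.

    error_rate_by_window maps a window length in hours to the error rate
    observed over that window (for example {1: 0.02, 72: 0.0005}).
    """
    budget = 1.0 - slo_target
    fired = []
    for rule in RULES:
        observed = error_rate_by_window.get(rule.window_hours)
        if observed is None:
            continue  # no measurement for this window yet
        if observed / budget >= rule.threshold:
            fired.append(f"{rule.name}:{rule.severity}")
    return fired


if __name__ == "__main__":
    # Short spike: 2% errors in the last hour, quiet over three days.
    print(firing_alerts({1: 0.02, 72: 0.0005}, slo_target=0.999))
    # Sustained degradation: mild but persistent error rate.
    print(firing_alerts({1: 0.002, 72: 0.0015}, slo_target=0.999))
```

The fast rule catches outages that would drain the budget within days, while the slow rule catches degradation that is too mild to page on but still exhausts the budget before the period ends.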
References
Related chapters
- Site Reliability Engineering (short summary) - Provides the core SRE model around SLOs, error budgets, toil reduction and reliability governance.
- The Site Reliability Workbook (short summary) - Adds practical rollout guidance for SLOs, burn-rate alerting policies and operating rituals.
- Observability & Monitoring Design - Covers metrics, logs and tracing design patterns required to build trustworthy SLI signals.
- Performance Engineering - Deepens latency-focused SLI work with profiling, capacity planning and performance budgets.
- Release It! (short summary) - Extends reliability policy discussions with resilience patterns and safer release strategies.
