This Theme 11 chapter focuses on incident response process, escalation, and postmortem practice.
In real system design and operations, this material helps set measurable reliability goals, choose resilience mechanisms, and reduce incident cost at scale.
For system design interviews, the chapter builds a clear operational narrative: how reliability is validated, where degradation risks sit, and which guardrails are planned up front.
Practical value of this chapter
Design in practice
Turn guidance on incident response process, escalation, and postmortem practice into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make trade-offs explicit for incident response process, escalation, and postmortem practice: release speed, automation level, observability cost, and operational complexity.
Primary source
Google SRE: Emergency Response
Canonical guidance for building controlled production incident response.
Incident Management is an engineering discipline, not a set of ad-hoc actions during an outage. It defines how teams detect problems, coordinate on-call response, make escalation decisions, and turn incidents into systemic improvements.
Combined with Observability & Monitoring Design and SLI/SLO practices, it provides an operating model: who owns response, when escalation is mandatory, and how lessons are captured through postmortems.
Why this is a discipline
One operating model
Incident management connects detection, triage, mitigation, communications, and recovery into a single controllable process.
Clear roles and ownership
During an incident, teams should not debate authority: Incident Commander, service owner, and escalation channels are predefined.
Post-incident learning loop
A blameless postmortem turns stress into engineering improvement: action items, owners, deadlines, and execution tracking.
Incident lifecycle: from signal to learning
1. Detection and triage
Signals from monitoring/alerts are validated against user impact, then severity and response owner are assigned.
2. Stabilization
Priority is damage control: rollback, feature-flag disable, traffic shaping, dependency isolation, or temporary read-only mode.
3. Escalation
If recovery windows are missed or critical user journeys are affected, additional experts are engaged through predefined rules.
4. Recovery
After stabilization, teams restore normal SLA/SLO compliance, verify data consistency, and remove temporary workaround mechanisms.
5. Postmortem and follow-up
Teams document timeline, root cause, contributing factors, and engineering changes that reduce recurrence risk.
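The five stages above can be sketched as a minimal state machine. Stage names follow this chapter; the allowed transitions (for example, stabilization looping through escalation when a recovery window is missed) are illustrative assumptions, not a prescribed workflow:

```python
# Minimal incident lifecycle state machine.
# Stage names follow the chapter; transition rules are illustrative.
TRANSITIONS = {
    "detection": {"stabilization"},
    "stabilization": {"escalation", "recovery"},
    "escalation": {"stabilization", "recovery"},
    "recovery": {"postmortem"},
    "postmortem": set(),
}

class Incident:
    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.stage = "detection"
        self.history = ["detection"]

    def advance(self, next_stage: str) -> None:
        # Reject transitions the lifecycle does not allow.
        if next_stage not in TRANSITIONS[self.stage]:
            raise ValueError(f"illegal transition {self.stage} -> {next_stage}")
        self.stage = next_stage
        self.history.append(next_stage)

inc = Incident("INC-42")
inc.advance("stabilization")
inc.advance("escalation")   # recovery window missed, pull in experts
inc.advance("recovery")
inc.advance("postmortem")
```

Encoding the lifecycle explicitly makes it auditable: the `history` list doubles as the skeleton of the postmortem timeline.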
On-call model
Every paging alert must have an owner, a runbook, and an explicit expected response time.
On-call shifts must never be blind: handoff rituals and current risk context are mandatory.
Escalation policy should live next to operational docs and be tested regularly through drills/game days.
On-call must be sustainable: controlled load, shift replacement process, and transparent compensation.
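The first rule above (every paging alert needs an owner, a runbook, and an expected response time) can be enforced mechanically as a lint step over the alert catalog. This is a sketch under assumed field names (`owner`, `runbook_url`, `response_time_minutes`); adapt them to whatever schema your alerting config actually uses:

```python
# Lint an alert catalog: every paging alert must declare an owner,
# a runbook link, and an expected response time.
# Field names are illustrative assumptions.
def lint_alerts(alerts: list[dict]) -> list[str]:
    required = ("owner", "runbook_url", "response_time_minutes")
    problems = []
    for alert in alerts:
        missing = [f for f in required if not alert.get(f)]
        if missing:
            name = alert.get("name", "<unnamed>")
            problems.append(f"{name}: missing {', '.join(missing)}")
    return problems

catalog = [
    {"name": "checkout-error-rate", "owner": "payments-oncall",
     "runbook_url": "https://runbooks.example/checkout",
     "response_time_minutes": 5},
    {"name": "orphan-alert"},  # no owner, runbook, or response time
]
problems = lint_alerts(catalog)  # flags only "orphan-alert"
```

Running such a check in CI keeps ownerless alerts from reaching production paging rotations in the first place.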
Escalation matrix
| Severity | Trigger | First action | Escalate to | Max delay |
|---|---|---|---|---|
| SEV-1 | Critical user flow is down or major transaction loss is ongoing | Immediate war-room call and Incident Commander assignment | Platform/SRE lead, security, and business on-duty | 0-5 minutes |
| SEV-2 | Strong latency/error degradation without full service outage | On-call triage with service owner, then blast-radius containment | Dependency teams and release owner | 10-15 minutes |
| SEV-3 | Localized issue with low immediate business impact | Runbook-based handling within the active shift | Escalate to product team during business hours if needed | Up to 1 hour |
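A matrix like the one above is most useful when on-call tooling can resolve "who do I page, and how fast" directly from the severity label. A minimal sketch, with contact group names assumed for illustration:

```python
# Encode the escalation matrix so tooling can resolve contacts and
# maximum delay from a severity label. Values mirror the table above;
# contact group names are illustrative assumptions.
ESCALATION_MATRIX = {
    "SEV-1": {"escalate_to": ["platform-sre-lead", "security", "business-onduty"],
              "max_delay_minutes": 5},
    "SEV-2": {"escalate_to": ["dependency-teams", "release-owner"],
              "max_delay_minutes": 15},
    "SEV-3": {"escalate_to": ["product-team-business-hours"],
              "max_delay_minutes": 60},
}

def escalation_for(severity: str) -> dict:
    try:
        return ESCALATION_MATRIX[severity]
    except KeyError:
        # Fail loudly: an unknown severity must not silently skip escalation.
        raise ValueError(f"unknown severity: {severity}") from None
```

Keeping the matrix in code (or config checked into the same repo as the runbooks) means drills and game days exercise exactly the rules production will use.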
Postmortem: required incident outcome
Impact: who was affected, how impact was measured, and how long degradation lasted.
Timeline: chronology of events, actions, and decision points.
Root cause and contributing factors: technical and process causes without blame.
Action items: concrete changes with owner, due date, and expected risk reduction.
Verification: how the team will validate that recurrence probability has actually decreased.
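The five required sections above lend themselves to a completeness gate: a postmortem cannot be marked done while any section, owner, or due date is missing. The document structure below is an assumed sketch, not a prescribed schema:

```python
# Validate that a postmortem covers the five required sections
# before it can be closed. Section and field names are illustrative.
REQUIRED_SECTIONS = ("impact", "timeline", "root_cause",
                     "action_items", "verification")

def postmortem_gaps(doc: dict) -> list[str]:
    gaps = [s for s in REQUIRED_SECTIONS if not doc.get(s)]
    # Each action item also needs an owner and a due date,
    # matching the "Action items" requirement above.
    for i, item in enumerate(doc.get("action_items", [])):
        for field in ("owner", "due_date"):
            if not item.get(field):
                gaps.append(f"action_items[{i}].{field}")
    return gaps
```

Wiring this into the postmortem tooling turns the template from a ritual into an enforced contract.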
Incident management maturity metrics
MTTD
Time from degradation start to detection
MTTA
Time to acknowledge the incident and assign owner
MTTR
Time to restore affected user journey
Repeat Incident Rate
Share of repeated incidents tied to the same root cause
Escalation Quality
Share of incidents with correct severity and timely escalation
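The time-based metrics above fall out of four timestamps per incident: degradation start, detection, acknowledgement, and recovery. A minimal sketch (timestamp field names are assumptions; MTTD/MTTA/MTTR are then averages of these per-incident durations across a reporting window):

```python
from datetime import datetime

# Per-incident durations behind MTTD, MTTA, and MTTR.
# Timestamp field names are illustrative; values are ISO 8601 strings.
def incident_durations(inc: dict) -> dict:
    ts = {k: datetime.fromisoformat(v) for k, v in inc.items()}
    return {
        # Degradation start -> detection (feeds MTTD).
        "ttd_min": (ts["detected"] - ts["degradation_start"]).total_seconds() / 60,
        # Detection -> acknowledgement with an assigned owner (feeds MTTA).
        "tta_min": (ts["acknowledged"] - ts["detected"]).total_seconds() / 60,
        # Degradation start -> restored user journey (feeds MTTR).
        "ttr_min": (ts["recovered"] - ts["degradation_start"]).total_seconds() / 60,
    }

d = incident_durations({
    "degradation_start": "2024-05-01T10:00:00",
    "detected": "2024-05-01T10:04:00",
    "acknowledged": "2024-05-01T10:06:00",
    "recovered": "2024-05-01T10:40:00",
})
```

Note the choice of anchor for TTR: measuring from degradation start (not detection) keeps slow detection from flattering the recovery number.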
Common anti-patterns
Treating incident management as just a chat room without roles, response SLA, or formal ownership.
Escalating too late because teams avoid disturbing adjacent teams or management.
Running postmortems as a ritual without action items, owners, and deadlines.
Measuring on-call only by alert closure count while ignoring triage quality and recurrence reduction.
Recommendations
Define incident lifecycle as a team standard: detection -> triage -> mitigation -> recovery -> learning.
Make escalation policy part of the engineering contract: triggers, roles, channels, and response SLA.
Standardize postmortem template and tie each action item to risk or MTTR improvement.
Run regular incident drills so on-call and escalation workflows work under pressure, not only on paper.
References
Related chapters
- SLI / SLO / SLA and Error Budgets - defines the base signals used to set severity and assess incident impact.
- Observability & Monitoring Design - shows how to design alerting and runbooks for fast detection and triage.
- Troubleshooting Interview - trains practical diagnostics and decision-making during production failures.
- The Site Reliability Workbook (short summary) - complements this chapter with concrete incident response, on-call, and postmortem practices.
- Evolution of SRE: implementation of an AI assistant in T-Bank - shows how incident management scales through platformization and automation.
