Troubleshooting Interviews — System Design Space

Troubleshooting interviews do not test clean-sheet architecture work. They test how an engineer behaves once the system is already degraded and there is no time left for calm, idealized reasoning.

This chapter shows how the format evaluates symptom reading, hypothesis prioritization, check quality, and the ability to keep the conversation tied to real user impact instead of random guesses.

For companies, that makes operational maturity visible. For candidates, it turns expectations into something concrete: do not thrash through logs; run the investigation step by step, with explicit priorities and validation.

Practical value of this chapter

Incident decomposition

Break the problem into user impact, infrastructure path, data integrity, and recent release changes.

Hypothesis tree

Build prioritized hypotheses with explicit validation criteria and test cost.

Signal prioritization

Use metrics, logs, and traces to cut through noise and reach the primary cause faster.

Communication under pressure

Keep transparent status, mitigation plan, and risk updates for stakeholders during incidents.
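
The hypothesis tree item above (prioritized hypotheses with explicit validation criteria and test cost) can be sketched in Python. This is a minimal illustration, not a method taken from the source article: the Hypothesis class, the likelihood guesses, the cost estimates, and the ranking rule (likelihood divided by check cost) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    likelihood: float      # rough guess in [0, 1], illustrative only
    check: str             # concrete check that confirms or rejects the hypothesis
    check_cost_min: float  # estimated minutes needed to run the check


def prioritize(hypotheses):
    """Order hypotheses so that likely, cheap-to-check ones come first."""
    return sorted(
        hypotheses,
        key=lambda h: h.likelihood / h.check_cost_min,
        reverse=True,
    )


# Hypothetical candidates for a "site is slow" incident.
hypotheses = [
    Hypothesis("Database saturation", 0.3,
               "Inspect connection pool and slow-query dashboards", 10),
    Hypothesis("Data center network fault", 0.1,
               "Compare latency per data center", 5),
    Hypothesis("Recent release regression", 0.5,
               "Compare error rate before and after the last deploy", 5),
]

for h in prioritize(hypotheses):
    print(f"{h.name}: {h.check}")
```

The point of the ranking rule is only that every hypothesis carries a named check and a cost, so the order of investigation can be explained rather than guessed.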

Source

Troubleshooting Interview

Based on Alexander Polomodov's article and conference talk at DevOps & Techlead Conf 2022.

Read original

Troubleshooting interviews test something different from clean-sheet system design. The candidate does not start with a blank architecture task. They start with a system that is already degraded and must read symptoms, narrow hypotheses, stabilize user impact, and explain what is actually failing.

How this differs from a system design interview

In system design, the discussion starts from requirements and future architecture. Here the starting point is a live problem: users are already affected, the system is unstable, and the signal comes from the quality of the investigation rather than from the elegance of a proposed design.

Why companies use this format

For the company

• Reliability work is increasingly distributed across product teams, so engineers must be able to investigate their own production failures.
• Complex ecosystems require people who can reason about dependencies, user impact, and operating conditions rather than only application code.
• This format reveals how a candidate behaves under time pressure and incomplete information when the system is already in trouble.

For the candidate

• It is a chance to demonstrate maturity through real incident reasoning, not only through architecture diagrams.
• Weak spots become visible quickly: missing tools, shallow hypotheses, or loss of connection to actual user impact.
• The preparation transfers well to real work because it teaches structured investigations instead of random log diving.

Video

DevOps & Techlead Conf 2022

Conference recording that explains the format and shows how the interviewer steers the investigation.

Watch video

How the interview is structured

Scenario setup

The candidate and interviewer are imagined as members of the same SRE team. The candidate plays the Lead, while the interviewer plays the Junior. The Lead is away and the Junior is on call when an incident starts, so the candidate must guide the Junior through the investigation.

The key constraint is deliberate: Junior cannot interpret vague instructions. The signal comes from whether the candidate can ask precise questions and issue concrete, operationally useful directions.

Related chapter

System Design Interviews: A 7-Step Approach

Useful as a comparison point for how a design conversation changes once the task becomes an incident investigation.

Read chapter

Troubleshooting interview flow

9 stages from setup to post-incident review

Setup

Incident

Stabilization

Post-incident review

Setup and rules

The interviewer explains the format and operating constraints of the scenario

Architecture walkthrough

The interviewer presents the system diagram and the important components

Clarifying questions

The candidate closes important unknowns before the incident begins

Incident starts

The interviewer reports the symptoms and user-facing impact

Diagnosis

Forming hypotheses, choosing checks, and eliminating weak explanations

Temporary mitigation

A fast move that reduces user impact before the deeper fix is ready

Full resolution

Fixing the issue with a clear remediation path

Root cause

Explaining why the issue happened and how the symptoms connect

Prevention and follow-up

Changes that would catch the problem earlier or prevent it from returning


Foundation

Containerization

K8s and containers define the operating environment in which the incident unfolds.

Read overview

A representative scenario

System context

Several million users per day
Two data centers
React frontend and Python/Django backend
Both applications deployed on K8s
Postgres for primary data, Redis for caching

Symptom

Support reports a rise in complaints: the website has become noticeably slower and some pages fail to load intermittently.

A common first pass starts with RED metrics (request rate, errors, duration) to understand request flow, failures, and latency, then adds USE metrics (utilization, saturation, errors) when you need to see whether a particular node is running into resource pressure, saturation, or localized failures.
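
To make the RED first pass concrete, here is a minimal Python sketch that summarizes one time window of requests as rate, error ratio, and a tail duration. It is an illustration under stated assumptions: the Request record, the red_metrics function, the choice to count only 5xx responses as errors, and the sample window are hypothetical and do not come from any particular monitoring stack.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def red_metrics(requests, window_seconds):
    """Summarize one time window as Rate, Errors, and Duration."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile of the request durations.
    rank = math.ceil(95 * total / 100)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }


# Hypothetical 10-second window: eight fast successes and two slow 504s.
window = [Request(200, 100.0)] * 8 + [Request(504, 30000.0)] * 2
print(red_metrics(window, window_seconds=10))
```

The sketch reports a tail percentile rather than an average because a small share of timeouts can hide inside an average that still looks acceptable.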

Dialogue fragment

Lead: Do we have centralized logging, for example an ELK stack?

Junior: Yes, but I am not very confident with it yet. Where should I look first?

Lead: Let’s open the load balancer dashboard and inspect the RED metrics: Requests, Errors, and Duration.

Junior: Request volume is stable, but errors are up and the average Duration has increased too.

Lead: Which error type dominates?

Junior: There are several, but most responses are 504s.

Lead: Then let’s inspect the application logs and see what happens right before those timeouts.
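
The Lead's question about which error type dominates can be answered with a simple tally over response status codes. The Python sketch below is only an illustration: the dominant_error function and the sample list are hypothetical, and it assumes the status codes have already been extracted from load balancer or access logs.

```python
from collections import Counter


def dominant_error(status_codes):
    """Return (status, share of all 5xx errors) for the most frequent 5xx, or None."""
    errors = Counter(code for code in status_codes if code >= 500)
    if not errors:
        return None
    status, count = errors.most_common(1)[0]
    return status, count / sum(errors.values())


# Hypothetical sample of recent responses.
sample = [200, 200, 504, 504, 504, 502, 200, 500, 504, 200]
print(dominant_error(sample))
```

A dominant 504 points at timeouts between the proxy and its upstream, which is why the next step in the dialogue is to look at what the application logs show right before those timeouts.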

How candidates are evaluated

Diagnostic range

How broadly the candidate uses metrics, logs, traces, dashboards, and architectural context.

Methodical investigation

How consistently they form hypotheses, remove noise, and move from symptoms to focused checks.

Temporary stabilization

How quickly the candidate proposes a way to reduce user impact before the full fix is ready.

Full resolution

Whether the candidate can carry the conversation all the way to an actual fix instead of stopping at mitigation.

Root cause

Whether they can explain why the symptoms appeared and why the chosen fix should work.

Recurrence prevention

Whether they suggest changes in monitoring, architecture, or process that would catch or prevent the issue next time.

Example candidate profile

Imagine a candidate who uses tooling well and investigates in a structured way, but gets only as far as a mitigation step and never reaches the real fix. If they also fail to explain the root cause, the final assessment can easily drop to a Junior level even if the opening of the interview looked strong.

Scores across axes usually differ. Strong investigations rarely look perfectly even in every dimension.

How this differs from system design interviews

Incident troubleshooting

→ The system already exists and is already degraded.
→ The goal is to stabilize the situation and work back to the primary cause.
→ Interviewers evaluate investigation quality, action prioritization, and operational clarity.
→ It is especially relevant for SRE, platform, and reliability-heavy roles.

System design

→ The conversation starts from requirements and future behavior.
→ The goal is to assemble an architecture that meets scale and reliability constraints.
→ The strongest signal is design maturity: decomposition, boundaries, trade-offs, and risk handling.
→ It is more common in backend, platform, and product-engineering tracks.

The strongest engineers benefit from both skill sets: they can diagnose failure under pressure and also design systems that make incidents less frequent and less expensive.

How to build this skill

Incident investigation

Theory:

Practices from Site Reliability Engineering and The Site Reliability Workbook
Reliability material on protective mechanisms and failure analysis
Systematic familiarity with metrics, logs, traces, and dashboards

Practice:

Participating in on-call and real incident response
Reading public incident reports and reconstructing the investigation flow
Training on scenarios that require both fast mitigation and a durable fix

System design

Theory:

Distributed-systems design principles and constraint-driven architecture work
System classes, recurring trade-offs, and realistic limits of common solutions

Practice:

Reviewing real system architectures and the design changes that followed incidents
Connecting post-incident lessons to architectural improvements
Running regular mock interviews in both design and troubleshooting formats

References

SRE Interview Prep Guide (GitHub)

Related chapters

Troubleshooting Interview Example - walks through the same format in a concrete incident investigation.
System Design Interviews: A 7-Step Approach - helps compare a design conversation with an incident-driven investigation and see where the interview logic changes.
How system design interviews are evaluated and how difficulty is calibrated - explains how interviewers collect evidence and regulate the amount of guidance during the conversation.
Site Reliability Engineering - provides the operational foundations needed to reason about real production incidents.
The Site Reliability Workbook - adds concrete on-call, incident-response, and process-improvement patterns.
Building Secure and Reliable Systems - connects reliability and security decisions to the way incidents are prevented and handled.
Release It! - systematizes failure modes and resilience measures that often appear in these interviews.
Designing Data-Intensive Applications, 2nd Edition - deepens reasoning about failures, consistency, and root-cause analysis in distributed systems.