The troubleshooting format is easier to grasp from a real walkthrough than from an abstract checklist of steps.
This chapter shows how a live incident scenario unfolds: which signals appear during the conversation, how the candidate sets priorities, and which moves actually advance the investigation.
For preparation, it works well as a concrete reference point for pacing, question quality, and hypothesis depth, while companies can also reuse it to train and calibrate interviewers.
Practical value of this chapter
Investigation sequence
Practice the investigation order: symptom, hypothesis, test, confirmation, corrective action.
Root-cause isolation
Separate the primary cause from secondary effects so temporary measures do not hide a systemic problem.
Temporary stabilization
Choose actions by horizon: immediate damage reduction first, then the fix and longer-term guardrails.
Post-incident learning
Explain clearly what changes after the incident and which signals will show that the system improved.
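The investigation order named above (symptom, hypothesis, test, confirmation, corrective action) can be sketched as a small loop. This is only an illustration, not something shown in the interview: the `Hypothesis` type, the `investigate` function, and the example checks are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """One candidate explanation for the symptom (hypothetical structure)."""
    statement: str
    check: Callable[[], bool]  # the test: returns True if evidence confirms it
    corrective_action: str


def investigate(symptom: str, hypotheses: list[Hypothesis]) -> list[str]:
    """Walk hypotheses in priority order and stop at the first confirmed one."""
    log = [f"symptom: {symptom}"]
    for h in hypotheses:
        log.append(f"hypothesis: {h.statement}")
        if h.check():
            log.append("confirmed")
            log.append(f"action: {h.corrective_action}")
            return log
        log.append("rejected")
    log.append("no hypothesis confirmed; gather more signals")
    return log


if __name__ == "__main__":
    # Illustrative checks only; real checks would query dashboards or logs.
    steps = investigate(
        "fewer successful payments",
        [
            Hypothesis("recent deploy broke the payment form", lambda: False, "roll back"),
            Hypothesis("a downstream dependency is timing out", lambda: True, "fail over"),
        ],
    )
    print("\n".join(steps))
```

Ordering the list by likelihood and cost of the check is the part that takes judgment; the loop itself only makes sure each hypothesis gets an explicit verdict before the next one is tried.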
Source
Public interview at DevOops
Alexander Polomodov's article walking through a public troubleshooting interview.
A troubleshooting interview becomes much easier to understand once you see one unfold in real time. At DevOops 2023, the speakers ran a public session that shows the whole arc of the conversation: aligning on architecture, reacting to symptoms, narrowing hypotheses, and deciding what to do next.
Interview participants
- Interviewer: Alexander Polomodov
- Candidate: Salikh Fakhrutdinov, Senior SRE at Tinkoff Origination Platform
Interview setup
In the scenario, the candidate and interviewer are part of the same SRE team. The candidate plays the more experienced engineer, while the interviewer acts as the junior teammate on call. An incident starts while the senior engineer is away, and the rest of the conversation becomes a joint investigation session.
That setup feels close to real incident response and makes it easier to assess how the candidate guides a less experienced teammate, keeps the investigation structured, and stays focused on useful checks instead of noise.
Theory
Troubleshooting Interviews
A 9-step structure for incident investigation and interviewer guidance.
System architecture
Before the investigation starts, the participants align on the architecture of the fintech application Yellow.
Scale
About 1 million DAU
Functionality
Debit cards, credit cards, and payments
Architecture diagram (interactive in the original): it switches between the initialization path and the main transaction flow to show how the app boots and where the payment scenario runs. Labels in the diagram: App launch (user opens the web or mobile app), Initialization section, Incident, User journey, Product list (Card #1, debit ****4521; Card #2, credit ****8832), Payments (payment form, money transfer).
Once the candidate has clarified the architecture, the interview shifts into diagnosis. The junior teammate reports the initial symptom, an alert showing fewer successful payments, and the pair starts narrowing down possible causes.
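An alert of this kind can be thought of as a comparison between the current count of successful payments and an expected baseline. The sketch below is an assumption about how such a check might look, not the alert used in the session: the function name, the baseline value, and the 30% threshold are all hypothetical.

```python
def payments_alert(successful_now: int, baseline: float, drop_threshold: float = 0.3) -> bool:
    """Return True when successful payments fell by at least drop_threshold.

    successful_now: successful payments in the current window.
    baseline: expected successful payments for a comparable window (assumed known).
    drop_threshold: relative drop that triggers the alert (0.3 is an arbitrary example).
    """
    if baseline <= 0:
        # Without a meaningful baseline there is nothing to compare against.
        return False
    drop = (baseline - successful_now) / baseline
    return drop >= drop_threshold


if __name__ == "__main__":
    print(payments_alert(600, 1000.0))  # 40% drop against the baseline
    print(payments_alert(900, 1000.0))  # 10% drop against the baseline
```

Alerting on successful payments rather than on errors alone also catches cases where requests never reach the backend, for example when the client fails before submitting.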
What the interview reveals
- Whether the candidate runs a disciplined investigation instead of thrashing between guesses
- How clearly they formulate hypotheses and choose checks with a purpose
- Whether they use operational metrics and troubleshooting heuristics as a guide instead of reciting acronyms
- How well they guide a less experienced teammate through the investigation
- Whether they separate temporary mitigation from a full fix
In practice, candidates often start with RED metrics to get a quick read on request flow, errors, and latency, then cross-check that picture with USE metrics when they need to understand whether a specific node is running into resource pressure, saturation, or localized failures.
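To make the two views concrete, here is a minimal sketch of RED (rate, errors, duration) computed per service and USE (utilization, saturation, errors) checked per node. The data shapes, field names, and limits are assumptions chosen for illustration; a real setup would read these values from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


@dataclass
class NodeSample:
    utilization: float  # busy fraction of a resource, 0..1
    queue_depth: int    # work waiting for the resource (saturation)
    error_count: int    # resource-level errors


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, error ratio, and p95 duration for one service over a window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95, computed with integer arithmetic to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else 0.0
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


def use_flags(node: NodeSample, util_limit: float = 0.8, queue_limit: int = 10) -> list[str]:
    """Name the USE dimensions that exceed the (assumed) limits on one node."""
    flags = []
    if node.utilization > util_limit:
        flags.append("utilization")
    if node.queue_depth > queue_limit:
        flags.append("saturation")
    if node.error_count > 0:
        flags.append("errors")
    return flags


if __name__ == "__main__":
    sample = [Request(100, True)] * 18 + [Request(900, False)] * 2
    print(red_metrics(sample, window_s=10))
    print(use_flags(NodeSample(utilization=0.93, queue_depth=42, error_count=0)))
```

RED answers "is the service hurting users?", while USE answers "is this particular node the reason?", which is why the second check usually follows the first.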
Key takeaways
Realistic format
The senior-plus-junior setup makes the conversation feel like a real on-call handoff and exposes communication quality as clearly as technical depth.
Architectural context
Because the interview starts from system architecture, the candidate's hypotheses are grounded in real dependencies and user flows instead of abstract guesses.
Practice vs theory
Watching a real session complements the theory by showing pacing, question quality, and which moves genuinely push the investigation forward.
References
Related chapters
- Troubleshooting Interviews - provides the 9-step theory framework that this practical walkthrough applies.
- System Design Interviews: A 7-Step Approach - shows how a structured interview flow can be adapted to incident diagnosis.
- How system design interviews are evaluated and how difficulty is calibrated - explains how to assess candidate moves and adjust task complexity during the dialogue.
- Site Reliability Engineering - covers core monitoring, alerting, and incident response practices used in production.
- The Site Reliability Workbook - adds practical on-call scenarios, incident reviews, and operational improvement patterns.
- Release It! - systematizes common failure modes and protective patterns relevant to troubleshooting rounds.
