Troubleshooting Interview Example — System Design Space

The troubleshooting format is easier to grasp from a real walkthrough than from an abstract checklist of steps.

This chapter shows how a live incident scenario unfolds: which signals appear during the conversation, how the candidate sets priorities, and which moves actually advance the investigation.

As preparation material, it gives a concrete reference point for pacing, question quality, and hypothesis depth. Companies can also use it to train and calibrate interviewers.

Practical value of this chapter

Investigation sequence

Practice the investigation order: symptom, hypothesis, test, confirmation, and corrective action.

Root-cause isolation

Separate the primary cause from secondary effects so temporary measures do not hide a systemic problem.

Temporary stabilization

Choose actions by horizon: immediate damage reduction first, then the fix and longer-term guardrails.

Post-incident learning

Explain clearly what changes after the incident and which signals will show that the system improved.
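The investigation order above (symptom, hypothesis, test, confirmation, corrective action) can be sketched as a small record that refuses to move to a fix until some hypothesis has been confirmed. This is only an illustrative sketch: the class names, fields, and the `may_apply_fix` rule are assumptions made for this example, not something taken from the interview.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Hypothesis:
    statement: str                 # what we think is wrong
    test: str                      # the check that would confirm or reject it
    status: Status = Status.OPEN


@dataclass
class Investigation:
    symptom: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def add_hypothesis(self, statement: str, test: str) -> Hypothesis:
        hypothesis = Hypothesis(statement, test)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def root_cause(self) -> Optional[Hypothesis]:
        # The first confirmed hypothesis is treated as the working root cause.
        for hypothesis in self.hypotheses:
            if hypothesis.status is Status.CONFIRMED:
                return hypothesis
        return None

    def may_apply_fix(self) -> bool:
        # Temporary mitigation can happen earlier; the fix waits for confirmation.
        return self.root_cause() is not None
```

Keeping the test next to each hypothesis makes it harder to jump between guesses without stating how each one would be checked.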

Source

Public interview at DevOops

Alexander Polomodov's article walking through a public troubleshooting interview.

tellmeabout.tech

A troubleshooting interview becomes much easier to understand once you see one unfold in real time. At DevOops 2023, the speakers ran a public session that shows the whole arc of the conversation: aligning on architecture, reacting to symptoms, narrowing hypotheses, and deciding what to do next.

Interview participants

Interviewer: Alexander Polomodov
Candidate: Salikh Fakhrutdinov, Senior SRE at Tinkoff Origination Platform

Interview setup

In the scenario, the candidate and interviewer are part of the same SRE team. The candidate plays the more experienced engineer, while the interviewer acts as the junior teammate on call. An incident starts while the senior engineer is away, and the rest of the conversation becomes a joint investigation session.

That setup feels close to real incident response and makes it easier to assess how the candidate guides a less experienced teammate, keeps the investigation structured, and stays focused on useful checks instead of noise.

Theory

Troubleshooting Interviews

A 9-step structure for incident investigation and interviewer guidance.

System architecture

Before the investigation starts, the participants align on the architecture of the fintech application Yellow.

Scale

About 1 million DAU

Functionality

Debit cards, credit cards, and payments

Architecture diagram

The diagram covers two paths: the initialization path, which runs when a user opens the web or mobile app, and the main transaction flow, where the payment scenario runs.

Main path components: users (web client and mobile app), frontend load balancers, frontend app, backend load balancers, auth service with auth database, card service with card database, payment service with payment database.

Initialization section components: CDN, config load balancers, config service with config database.
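One way to turn the architecture into something you can reason about during an incident is a dependency map that answers "what sits behind this component?". The sketch below is an assumption-heavy illustration: the source names the components but does not spell out every connection, so the edges in `DEPENDS_ON` are guesses based on the component names, and the CDN is left out because its place in the flow is not stated.

```python
# Assumed edges: each key calls the components in its list.
# Only the component names come from the chapter; the wiring is hypothetical.
DEPENDS_ON: dict[str, list[str]] = {
    "frontend_lb": ["frontend_app"],
    "frontend_app": ["backend_lb"],
    "backend_lb": ["auth_service", "card_service", "payment_service"],
    "auth_service": ["auth_db"],
    "card_service": ["card_db"],
    "payment_service": ["payment_db"],
    "config_lb": ["config_service"],
    "config_service": ["config_db"],
}


def transitive_dependencies(start: str, graph: dict[str, list[str]]) -> list[str]:
    """Return every component reachable from `start`, in discovery order."""
    seen: list[str] = []
    stack = [start]
    while stack:
        node = stack.pop()
        for dependency in graph.get(node, []):
            if dependency not in seen:
                seen.append(dependency)
                stack.append(dependency)
    return seen
```

For example, `transitive_dependencies("payment_service", DEPENDS_ON)` returns `["payment_db"]`, while starting from `"frontend_lb"` lists every main-path component behind the frontend. Each entry in that list is a candidate place to look when a symptom shows up at the starting component.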

Incident

User journey

First screen: the Products tab with the product list (Card #1, debit, ****4521; Card #2, credit, ****8832).

Second screen: the Payments tab with a payment form for a money transfer.

Once the candidate has clarified the architecture, the interview shifts into diagnosis. The junior teammate reports the initial symptom, an alert showing fewer successful payments, and the pair starts narrowing down possible causes.
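An alert about fewer successful payments can mean two different things: fewer users are reaching the payment step, or the same number of attempts is failing more often. The sketch below shows one possible way to tell these apart from two time windows. The `PaymentWindow` shape, the category names, and the 20% tolerance are assumptions for illustration, not details from the interview.

```python
from dataclasses import dataclass


@dataclass
class PaymentWindow:
    """Payment counts for one time window (hypothetical shape)."""
    attempted: int
    succeeded: int

    @property
    def success_ratio(self) -> float:
        # Treat an empty window as healthy to avoid dividing by zero.
        return self.succeeded / self.attempted if self.attempted else 1.0


def classify_payment_drop(baseline: PaymentWindow, current: PaymentWindow,
                          tolerance: float = 0.2) -> str:
    """Separate 'fewer attempts' from 'more failures'.

    `tolerance` is an assumed relative drop (20%) that counts as significant.
    """
    ratio_dropped = current.success_ratio < baseline.success_ratio * (1 - tolerance)
    traffic_dropped = current.attempted < baseline.attempted * (1 - tolerance)
    if ratio_dropped and traffic_dropped:
        return "both"
    if ratio_dropped:
        return "payments_failing"
    if traffic_dropped:
        return "fewer_attempts"
    return "ok"
```

A result of `"fewer_attempts"` points the investigation toward whatever sits in front of the payment step, while `"payments_failing"` points toward the payment path itself.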

What the interview reveals

• Whether the candidate runs a disciplined investigation instead of thrashing between guesses
• How clearly they formulate hypotheses and choose checks with a purpose
• Whether they use operational metrics and troubleshooting heuristics as a guide instead of reciting acronyms
• How well they guide a less experienced teammate through the investigation
• Whether they separate temporary mitigation from a full fix

In practice, candidates often start with RED metrics (rate, errors, duration) to get a quick read on request flow, errors, and latency for each service. They then cross-check that picture with USE metrics (utilization, saturation, errors) when they need to understand whether a specific node is running into resource pressure, saturation, or localized failures.
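As a rough sketch of that two-step read, the code below computes a request-level summary (rate, error ratio, 95th percentile duration) for a service and then flags resource-level problems for a single node. The record shapes, field names, and the 0.8 utilization threshold are assumptions for illustration; real monitoring systems expose these numbers in their own formats.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, errors, and duration for one service over one window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = [r.duration_ms for r in requests]
    # quantiles() needs at least two points; fall back to the single value.
    p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": sum(not r.ok for r in requests) / len(requests),
        "p95_ms": p95,
    }


@dataclass
class NodeResources:
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    queue_depth: int    # saturation: work waiting for the resource
    error_count: int    # errors reported by the resource itself


def use_flags(node: NodeResources, max_utilization: float = 0.8) -> list[str]:
    """Return which of the three resource-level checks look unhealthy."""
    flags = []
    if node.utilization > max_utilization:
        flags.append("utilization")
    if node.queue_depth > 0:
        flags.append("saturation")
    if node.error_count > 0:
        flags.append("errors")
    return flags
```

The service-level summary tells you which service looks unhealthy from the caller's point of view; the node-level flags help explain why a particular instance behind it might be struggling.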

Key takeaways

Realistic format

The senior-plus-junior setup makes the conversation feel like a real on-call handoff and exposes communication quality as clearly as technical depth.

Architectural context

Because the interview starts from system architecture, the candidate's hypotheses are grounded in real dependencies and user flows instead of abstract guesses.

Practice vs theory

Watching a real session complements the theory by showing pacing, question quality, and which moves genuinely push the investigation forward.

References

Related chapters

Troubleshooting Interviews - provides the 9-step theory framework that this practical walkthrough applies.
System Design Interviews: A 7-Step Approach - shows how a structured interview flow can be adapted to incident diagnosis.
How system design interviews are evaluated and how difficulty is calibrated - explains how to assess candidate moves and adjust task complexity during the dialogue.
Site Reliability Engineering - covers core monitoring, alerting, and incident response practices used in production.
The Site Reliability Workbook - adds practical on-call scenarios, incident reviews, and operational improvement patterns.
Release It! - systematizes common failure modes and protective patterns relevant to troubleshooting rounds.