System Design Space

Updated: April 12, 2026 at 1:20 PM

Troubleshooting Interview Example

medium

A public troubleshooting interview from DevOops 2023: the Yellow architecture, a drop in payments, and a joint investigation between a senior and a junior engineer.

The troubleshooting format is easier to grasp from a real walkthrough than from an abstract checklist of steps.

This chapter shows how a live incident scenario unfolds: which signals appear during the conversation, how the candidate sets priorities, and which moves actually advance the investigation.

For preparation, it works well as a concrete reference point for pacing, question quality, and hypothesis depth, while companies can also reuse it to train and calibrate interviewers.

Practical value of this chapter

Investigation sequence

Practice the investigation order: symptom, hypothesis, test, confirmation, and corrective action.

Root-cause isolation

Separate the primary cause from secondary effects so temporary measures do not hide a systemic problem.

Temporary stabilization

Choose actions by horizon: immediate damage reduction first, then the fix and longer-term guardrails.

Post-incident learning

Explain clearly what changes after the incident and which signals will show that the system improved.
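
The investigation order above (symptom, hypothesis, test, confirmation, corrective action) can be kept honest with a small written log. The sketch below is one possible shape for such a log, not something taken from the interview: the class names, the two hypotheses, and their outcomes are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    statement: str                    # what we suspect is wrong
    check: str                        # the test that would confirm or refute it
    confirmed: Optional[bool] = None  # None until the check has been run


@dataclass
class Investigation:
    symptom: str
    hypotheses: List[Hypothesis] = field(default_factory=list)
    corrective_action: Optional[str] = None

    def add(self, statement: str, check: str) -> Hypothesis:
        """Record a hypothesis together with the check that tests it."""
        hypothesis = Hypothesis(statement, check)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def open_hypotheses(self) -> List[Hypothesis]:
        """Hypotheses whose check has not been run yet."""
        return [h for h in self.hypotheses if h.confirmed is None]

    def confirmed_cause(self) -> Optional[Hypothesis]:
        """First hypothesis whose check came back positive, if any."""
        for h in self.hypotheses:
            if h.confirmed:
                return h
        return None


# Illustrative data only: these hypotheses and outcomes are invented.
inv = Investigation(symptom="fewer successful payments")
h1 = inv.add("payment service returns more errors",
             "compare the error ratio before and after the alert")
h2 = inv.add("payment database is saturated",
             "check connection pool usage and disk utilization")
h1.confirmed = False
h2.confirmed = True
```

Writing the check next to each hypothesis makes it harder to jump to a corrective action before anything has been confirmed.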

Source

Public interview at DevOops

Alexander Polomodov's article walking through a public troubleshooting interview.

tellmeabout.tech

A troubleshooting interview becomes much easier to understand once you see one unfold in real time. At DevOops 2023, the speakers ran a public session that shows the whole arc of the conversation: aligning on architecture, reacting to symptoms, narrowing hypotheses, and deciding what to do next.

Interview participants

  • Interviewer: Alexander Polomodov
  • Candidate: Salikh Fakhrutdinov, Senior SRE at Tinkoff Origination Platform

Interview setup

In the scenario, the candidate and interviewer are part of the same SRE team. The candidate plays the more experienced engineer, while the interviewer acts as the junior teammate on call. An incident starts while the senior engineer is away, and the rest of the conversation becomes a joint investigation session.

That setup feels close to real incident response and makes it easier to assess how the candidate guides a less experienced teammate, keeps the investigation structured, and stays focused on useful checks instead of noise.

Theory

Troubleshooting Interviews

A 9-step structure for incident investigation and interviewer guidance.

System architecture

Before the investigation starts, the participants align on the architecture of the fintech application Yellow.

  • Scale: about 1 million DAU
  • Functionality: debit cards, credit cards, and payments

Architecture diagram (interactive in the original): it shows two paths, the initialization path used when the app boots and the main transaction flow where the payment scenario runs. The flow starts at app launch, when the user opens the web or mobile app.

  • Main path: users, web client, mobile app, frontend load balancers, frontend app, backend load balancers, auth service and auth database, card service and card database, payment service and payment database
  • Initialization path: CDN, config load balancers, config service, config database

Incident

User journey (interactive mock-up in the original):

  • First screen: the product list with two cards (Card #1, debit, ****4521; Card #2, credit, ****8832)
  • Second screen: Payments, with a payment form for a money transfer

Both screens carry the Products and Payments tabs.

Once the candidate has clarified the architecture, the interview shifts into diagnosis. The junior teammate reports the initial symptom, an alert showing fewer successful payments, and the pair starts narrowing down possible causes.
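
One way to narrow the search is to list every component a payment request passes through and check them in order. The sketch below assumes a set of call edges between the Yellow components. The component names come from the architecture, but the edges are an illustrative guess, and the `client` node is a stand-in for both the web client and the mobile app. Shared dependencies that sit off this direct chain, such as the auth service, may still deserve a check.

```python
from typing import Dict, List

# Assumed call edges (caller -> callees). These edges are hypothetical;
# the source names the components but does not list the connections.
DEPENDS_ON: Dict[str, List[str]] = {
    "client": ["frontend load balancers"],
    "frontend load balancers": ["frontend app"],
    "frontend app": ["backend load balancers"],
    "backend load balancers": ["auth service", "card service", "payment service"],
    "auth service": ["auth database"],
    "card service": ["card database"],
    "payment service": ["payment database"],
}


def paths_to(target: str, start: str = "client") -> List[List[str]]:
    """Return every path from start to target, found by depth-first search."""
    found: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for nxt in DEPENDS_ON.get(node, []):
            if nxt not in path:  # guard against cycles
                stack.append(path + [nxt])
    return found


# Components on the direct route of a payment request, in call order.
payment_path = paths_to("payment database")[0]
for component in payment_path:
    print(component)
```

Each component on the printed path is a candidate location for the drop in successful payments, which gives the pair an explicit list to work through instead of guessing.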

What the interview reveals

  • Whether the candidate runs a disciplined investigation instead of thrashing between guesses
  • How clearly they formulate hypotheses and choose checks with a purpose
  • Whether they use operational metrics and troubleshooting heuristics as a guide instead of reciting acronyms
  • How well they guide a less experienced teammate through the investigation
  • Whether they separate temporary mitigation from a full fix

In practice, candidates often start with RED metrics (rate, errors, duration) to get a quick read on request flow, failures, and latency, then cross-check that picture with USE metrics (utilization, saturation, errors) when they need to understand whether a specific node is running into resource pressure, saturation, or localized failures.
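
As a rough illustration of what a RED readout contains, the sketch below computes rate, error ratio, and a 95th-percentile duration for one service over one time window. The `Request` record, the field names, and the sample numbers are assumptions made for this example. A real setup would normally read these values from a monitoring system instead of raw request lists.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests: List[Request], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for one service over one window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the smallest sample with at least 95% of
    # all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(durations)))
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": sum(not r.ok for r in requests) / len(requests),
        "p95_ms": durations[rank - 1],
    }


# Made-up sample: 8 fast successful requests and 2 slow failures in 10 seconds.
sample = [Request(100.0, True)] * 8 + [Request(900.0, False)] * 2
summary = red_summary(sample, window_s=10.0)
print(summary)
```

Comparing such a summary for the payment service before and after the alert shows whether the drop comes with more errors, slower responses, or simply fewer incoming requests, and each of those points to a different hypothesis.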

Key takeaways

Realistic format

The senior-plus-junior setup makes the conversation feel like a real on-call handoff and exposes communication quality as clearly as technical depth.

Architectural context

Because the interview starts from system architecture, the candidate's hypotheses are grounded in real dependencies and user flows instead of abstract guesses.

Practice vs theory

Watching a real session complements the theory by showing pacing, question quality, and which moves genuinely push the investigation forward.
