System Design Space

Updated: February 20, 2026 at 7:47 AM

Troubleshooting Interview

SRE interview format: incident diagnosis, RED Method, workaround vs root cause, evaluation criteria and comparison with System Design.

Source

The material is based on an article and a talk by Alexander Polomodov at DevOps & Techlead Conf 2022.

The Troubleshooting Interview is a technical interview format that checks whether a candidate can diagnose and resolve production incidents. Unlike System Design, where you design a system from scratch, here you work with an existing system that is already failing.

Key difference from System Design

In System Design, you start with product requirements and design for reliability. In Troubleshooting, you start with a live production system and must stabilize it quickly.

Why do you need a Troubleshooting Interview?

For the company

  • Reliability ownership is decentralized: engineers must be able to operate their own systems
  • Rapid staff growth and transition to cross-functional teams
  • Multi-product ecosystem with non-trivial integrations

For the candidate

  • A safe way to gain experience handling an incident in near-real conditions
  • Feedback on what needs to be improved

Video recording

DevOps & Techlead Conf 2022

Recording of the talk demonstrating the interview format.

Interview format

Legend

In the scenario, the candidate and interviewer work in the same SRE team. The candidate plays Lead, while the interviewer plays Junior. The Lead is away, Junior is on call, an incident starts, and Junior calls Lead for guidance.

Important: questions and commands should be specific and operationally clear. Junior is intentionally limited in depth and needs precise guidance.

Related chapter

Approaches to design interviews

Similar 7-step framework for System Design Interview.

Troubleshooting Interview steps

9 stages from setup to postmortem

Phases: Preparation, Incident, Resolution, Postmortem

  0. Interview setup: interviewer explains the format and interaction rules
  1. Architecture walkthrough: interviewer presents the architecture diagram and system components
  2. System questions: candidate asks clarifying questions (this step is often skipped)
  3. Incident starts: interviewer reports symptoms, the external effects of the problem
  4. Diagnosis: formulating hypotheses and running experiments
  5. Workaround: fast mitigation to reduce user impact
  6. Complete fix: resolving the issue with a clear remediation algorithm
  7. Root cause: understanding why it happened, all puzzle pieces fit together
  8. System improvement: how to prevent recurrence or detect the issue earlier

Practice

Example Troubleshooting Interview

Public interview at DevOops 2023 with analysis of the Yellow fintech system.

Foundation

Containerization

K8s and containers define the environment where the incident unfolds.

Example of a typical task

System context

  • Several million customers daily
  • Two data centers
  • Frontend in React + backend in Python/Django
  • Both applications are deployed on K8s
  • Postgres for data, Redis for cache

Incident

Support reports an increase in customer complaints: the website is slow and pages intermittently fail to load.

Dialogue example

Lead: Do we have a system for collecting logs, for example, an ELK stack?
Junior: Yes, we have it, but I’m not very confident with it yet. Where should I look?
Lead: Can we look at dashboard visualizations built from the load balancer logs?
Junior: I opened a dashboard. What should I look for on it?
Lead: Let’s apply the RED method and check Requests, Errors, and Duration.
Junior: I see that the number of requests has not increased, but the number of errors has increased and the average Duration has grown.
Lead: Which type of error prevails?
Junior: There are several kinds of errors, but most responses are 504.
Lead: Hmm, 504 is Gateway Timeout. Let's look at the application logs...
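
The RED check the Lead asks for can also be reproduced directly from load balancer logs. The following is a minimal sketch, not the tooling from the talk: `LogRecord` and `red_summary` are hypothetical names, and it assumes each log entry carries an HTTP status code and a request duration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class LogRecord:
    """One load balancer access-log entry (hypothetical, simplified shape)."""
    status: int          # HTTP status code returned to the client
    duration_ms: float   # time taken to serve the request, in milliseconds


def red_summary(records: list[LogRecord], window_seconds: float) -> dict:
    """Compute request rate, error ratio, and average duration for one window."""
    total = len(records)
    # Treat 5xx responses as errors; 4xx are usually client-side problems.
    errors = [r for r in records if r.status >= 500]
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": len(errors) / total if total else 0.0,
        "avg_duration_ms": mean(r.duration_ms for r in records) if total else 0.0,
        # Most frequent error status with its count, e.g. [(504, 40)].
        "top_error": Counter(r.status for r in errors).most_common(1),
    }
```

Comparing the summary of a pre-incident window with the summary of the incident window reproduces the Junior's observation: a flat request rate, a higher error ratio and average duration, and 504 as the dominant error.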

Candidate Evaluation Criteria

Horizon

Wide range of diagnostic tools and approaches used

Methodical

Logical, systematic search for a solution, with deliberate elimination of incorrect hypotheses

Workaround

Speed of finding a temporary solution to mitigate user problems

Full fix

Finding a solution to the problem and formulating an algorithm for its application

Root Cause

Understanding the root cause that explains the symptoms and why the solution worked

Improvements

Suggestions for preventing recurrence or early detection
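
A common answer to the early-detection half of this criterion is an alert on the error ratio. Below is a minimal sketch under stated assumptions: the `ErrorRatioAlert` class, the 60-second window, the 5% threshold, and the minimum request count are all illustrative choices, not values from the source.

```python
from collections import deque


class ErrorRatioAlert:
    """Signals when the share of 5xx responses in a sliding window is too high."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05,
                 min_requests: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_error) pairs, oldest first
        self._errors = 0        # number of error events currently in the window

    def observe(self, timestamp: float, status: int) -> bool:
        """Record one response; return True if the alert condition holds."""
        is_error = status >= 500
        self._events.append((timestamp, is_error))
        self._errors += is_error
        # Drop events that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_is_error = self._events.popleft()
            self._errors -= old_is_error
        total = len(self._events)
        return total >= self.min_requests and self._errors / total > self.threshold
```

The minimum request count keeps a single failed request during a quiet period from triggering the alert.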

Candidate Profile Example

Example: candidate demonstrates solid tooling knowledge (Middle), follows a structured process (Middle+), quickly finds a workaround (Middle), but does not complete a full fix (Junior), misses root cause (Junior), and proposes only cosmetic improvements. Final assessment: Junior.

Scores typically vary across axes; a candidate rarely lands on the same level for every criterion.

Troubleshooting vs System Design

Troubleshooting

  • You start with an already running production system
  • Goal: stabilize an actively failing system
  • Evaluates structured incident diagnostics
  • Strong fit for engineers close to operations and infrastructure

System Design

  • You start from customer and product requirements
  • Goal: design a system that avoids incident-prone architecture
  • Evaluates structured architecture design
  • Strong fit for engineers focused on design and development

An ideal SRE has both skills: not only putting out fires, but also working systematically on reliability.

How to level up

Troubleshooting

Theory:

  • Practices from the SRE Book and the SRE Workbook
  • Building Secure & Reliable Systems
  • Learning diagnostic tools

Practice:

  • Working in an SRE team on real systems
  • Analyzing public postmortems
  • Practicing on exercises based on postmortems

System Design

Theory:

  • Design principles for distributed systems
  • Classes of systems and limits of applicability

Practice:

  • Working as an architect on real systems
  • Analysis of the architecture of large systems (Google, Meta)
  • Solving architectural katas

© 2026 Alexander Polomodov