Source
Troubleshooting Interview
The material is based on an article and a talk by Alexander Polomodov at DevOps & Techlead Conf 2022.
Troubleshooting Interview is a technical interview format that checks whether a candidate can diagnose and resolve production incidents. Unlike System Design, where you design a system from scratch, here you work with an existing system that is already failing.
Key difference from System Design
In System Design, you start with product requirements and design for reliability. In Troubleshooting, you start with a live production system and must stabilize it quickly.
Why do you need a Troubleshooting Interview?
For the company
- Reliability ownership is decentralized: engineers must be able to operate their own systems
- Rapid staff growth and transition to cross-functional teams
- Multi-product ecosystem with non-trivial integrations
For the candidate
- Opportunity to gain experience working on an incident in conditions close to reality
- Safe way to practice incident handling in near-real conditions
- Feedback on what needs to be improved
Video recording
DevOps & Techlead Conf 2022
Recording of the talk demonstrating the interview format.
Interview format
Legend
In the scenario, the candidate and interviewer work in the same SRE team. The candidate plays Lead, while the interviewer plays Junior. The Lead is away, Junior is on call, an incident starts, and Junior calls Lead for guidance.
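Important: the candidate's questions and commands should be specific and operationally clear, because Junior is intentionally limited in depth and needs precise guidance. For example, "show me the backend pod logs for the last 15 minutes" works better than "check the backend".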
Related chapter
Approaches to design interviews
A similar 7-step framework for the System Design Interview.
Troubleshooting Interview steps
9 stages from setup to postmortem
Interview setup
Interviewer explains the format and interaction rules
Architecture walkthrough
Interviewer presents the architecture diagram and system components
System questions
Candidate asks clarifying questions (this step is often skipped)
Incident starts
Interviewer reports symptoms, the external effects of the problem
Diagnosis
Formulating hypotheses and running experiments
Workaround
Fast mitigation to reduce user impact
Complete fix
Resolving the issue with a clear remediation algorithm
Root cause
Understanding why it happened, so that all the puzzle pieces fit together
System improvement
How to prevent recurrence or detect the issue earlier
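The Diagnosis step above (formulating hypotheses and running experiments) can be sketched as a small loop. This is only an illustration: the `Hypothesis` class, the `diagnose` function, the observation names, and the thresholds are all hypothetical and do not come from the original talk.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    statement: str
    # An experiment returns True when the evidence supports the hypothesis.
    experiment: Callable[[], bool]


def diagnose(hypotheses: list[Hypothesis]) -> tuple[Optional[Hypothesis], list[str]]:
    """Run experiments in order and stop at the first confirmed hypothesis."""
    log: list[str] = []
    for hypothesis in hypotheses:
        confirmed = hypothesis.experiment()
        verdict = "CONFIRMED" if confirmed else "rejected"
        log.append(f"{verdict}: {hypothesis.statement}")
        if confirmed:
            return hypothesis, log
    return None, log


# Hypothetical observations that Junior might report back to Lead.
observations = {
    "pod_restarts_last_hour": 0,
    "db_cpu_percent": 35,
    "redis_hit_ratio": 0.12,
}

# Hypotheses ordered from cheapest to check to most expensive (assumed order).
hypotheses = [
    Hypothesis("Backend pods are crash-looping",
               lambda: observations["pod_restarts_last_hour"] > 5),
    Hypothesis("Postgres is CPU-saturated",
               lambda: observations["db_cpu_percent"] > 90),
    Hypothesis("Redis cache is mostly missing",
               lambda: observations["redis_hit_ratio"] < 0.5),
]

confirmed, log = diagnose(hypotheses)
for line in log:
    print(line)
```

Keeping a log of rejected hypotheses mirrors what interviewers look for: each experiment should visibly narrow the search space, not just produce more data.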
Practice
Example Troubleshooting Interview
Public interview at DevOops 2023 with analysis of the Yellow fintech system.
Foundation
Containerization
K8s and containers define the environment where the incident unfolds.
Example of a typical task
System context
- Several million customers daily
- Two data centers
- Frontend in React + backend in Python/Django
- Both applications are deployed on K8s
- Postgres for data, Redis for cache
Incident
Support reports an increase in customer complaints: the website is slow and pages intermittently fail to load.
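With two data centers in the system context, one reasonable first move is to check whether the symptoms are localized to one of them. The sketch below groups request samples by data center and compares error rate and median latency. The sample data, field layout, and the `summarize` helper are hypothetical; in a real interview the numbers would come from whatever monitoring Junior has access to.

```python
from collections import defaultdict
from statistics import median

# Hypothetical request samples: (data_center, latency_ms, http_status)
samples = [
    ("dc1", 120, 200), ("dc1", 140, 200), ("dc1", 110, 200), ("dc1", 130, 200),
    ("dc2", 900, 200), ("dc2", 2500, 504), ("dc2", 1800, 200), ("dc2", 3000, 504),
]


def summarize(samples):
    """Return request count, 5xx error rate, and median latency per data center."""
    by_dc = defaultdict(list)
    for dc, latency_ms, status in samples:
        by_dc[dc].append((latency_ms, status))

    summary = {}
    for dc, rows in by_dc.items():
        latencies = [latency_ms for latency_ms, _ in rows]
        errors = sum(1 for _, status in rows if status >= 500)
        summary[dc] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "median_latency_ms": median(latencies),
        }
    return summary


summary = summarize(samples)
for dc, stats in sorted(summary.items()):
    print(dc, stats)
```

If one data center looks healthy and the other does not, shifting traffic away from the unhealthy one becomes a natural workaround candidate, while the root cause search continues.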
Candidate Evaluation Criteria
Breadth
Range of diagnostic tools and approaches the candidate uses
Methodical
Logical, systematic search for a solution, with deliberate elimination of incorrect hypotheses
Workaround
How quickly the candidate finds a temporary mitigation that reduces user impact
Full fix
Finding a permanent solution and laying out the steps to apply it
Root Cause
Understanding the root cause that explains the symptoms and why the solution worked
Improvements
Suggestions for preventing recurrence or detecting the issue earlier
Candidate Profile Example
Example: candidate demonstrates solid tooling knowledge (Middle), follows a structured process (Middle+), quickly finds a workaround (Middle), but does not complete a full fix (Junior), misses root cause (Junior), and proposes only cosmetic improvements. Final assessment: Junior.
Scores usually differ from axis to axis; a candidate rarely lands on the same level for every criterion.
Troubleshooting vs System Design
Troubleshooting
- You start with an already running production system
- Goal: stabilize an actively failing system
- Evaluates structured incident diagnostics
- Strong fit for engineers close to operations and infrastructure
System Design
- You start from customer and product requirements
- Goal: design a system that avoids incident-prone architecture
- Evaluates structured architecture design
- Strong fit for engineers focused on design and development
An ideal SRE needs both skills: not only putting out fires, but also working systematically on reliability.
How to level up
Troubleshooting
Theory:
- Practices from the SRE Book and SRE Workbook
- Building Secure & Reliable Systems
- Learning diagnostic tools
Practice:
- Working in an SRE team on real systems
- Analyzing public postmortems
- Practicing on tasks based on postmortems
System Design
Theory:
- Design Principles for Distributed Systems
- Classes of systems and limits of applicability
Practice:
- Working as an architect on real systems
- Analysis of the architecture of large systems (Google, Meta)
- Solving architectural katas
