Root Cause: Backend Bugs as SRE Training

An SRE often grows less from elegant diagrams and more from the moment an unclear symptom has to be traced to a concrete root cause.

Root Cause frames backend bugs as engineering investigations: users experience slowness or timeouts, the team gathers facts across UI, networking, databases, protocols, and OS limits, and the real fix often lands somewhere other than the first visible pain point.

For interviews and production work, it is a strong exercise in resisting the urge to scale away a symptom, forming hypotheses, checking telemetry, separating mitigation from the fix, and turning lessons into postmortems.

Practical value of this chapter

Investigation practice

Practice moving from user-visible symptoms to testable hypotheses across UI, API, database, network, load balancer, and operating-system limits.

Decision quality

Separate temporary mitigation from the real fix: adding CPU is easier than removing meaningless product-generated load.

Interview articulation

Structure the answer as an investigation: what users see, which signals you collect, where you narrow the search, and why this is the root cause.

Postmortem thinking

Capture not only the patch, but the lesson: which default, limit, protocol behavior, or product flow made the failure possible.

Source

Book Cube post

Personal notes on the first cases: COUNT(*), SSE, HTTP/1.1, HTTP/2, load balancer, and SRE takeaways.

Root Cause: Stories and Lessons from Two Decades of Backend Engineering Bugs

Author: Hussein Nasser
Publisher: Self-published, 2026 (1st Edition)
Length: 317 pages

Review of Hussein Nasser's book about real backend bugs: system-wide slowdowns, HTTP/1.1 and HTTP/2, load balancers, resource exhaustion, state corruption, and why investigations matter for SREs.

Root Cause is valuable not as an academic backend engineering textbook, but as a set of investigations. In this framing, engineering experience is measured less by years or features and more by how many bugs a person has reproduced, traced to root cause, and fixed in a way that made the system more understandable.

For SREs, that format is almost ideal: every story begins with a production symptom, moves through telemetry and hypotheses, and arrives at a patch that carries a lesson about protocols, databases, resource limits, product decisions, or state correctness.

Reading it as a list of ready-made answers misses the point: the book answers “how to think,” not “how to fix.” The test is simple: after each case, try to say out loud which signals you would collect first, which mitigation you would choose, where root-cause analysis would stop, and what would go into the postmortem.

Why this is an SRE book, even though it is about backend bugs

Symptom discipline

The starting point is what the user can observe: slowness, 504s, a hanging request, or incorrect state. The team then separates observation from explanation and builds a testable chain of hypotheses.

Full-stack investigation

The cause may live in the UI, API, SQL query, HTTP version, load balancer, connection pool, file descriptor limits, or TCP behavior. The book is useful because it keeps that whole-stack view.

Mitigation is not the fix

Adding CPU, raising a limit, or increasing a pool may be necessary to restore service, but an SRE still has to reach the cause that generated the load or exhausted the resource.

Postmortem material

A patch closes the incident, not the investigation. Seeing the analysis through pays off in a clear defect report: timeline, impact, contributing factors, root cause, action items, and evidence that the lesson was locked in.
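
One way to make that report checkable is to treat it as a record with named sections. The sketch below is an illustration, not a template from the book: the `Postmortem` class, its field names, and the `missing_sections` helper are assumptions that simply mirror the list above, and the sample values paraphrase the COUNT(*) case.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record shape; field names mirror the sections in the text."""

    impact: str
    timeline: List[str] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    root_cause: str = ""
    action_items: List[str] = field(default_factory=list)
    lesson_evidence: str = ""  # how we know the lesson was locked in

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# A draft written right after the patch: the incident is closed,
# but the investigation is visibly unfinished.
draft = Postmortem(
    impact="Whole product slow for all users",
    timeline=["Users report slowness", "Database found CPU-bound"],
    root_cause="UI repeatedly called an API running an exact COUNT(*)",
)
print(draft.missing_sections())
```

A draft that still lists `contributing_factors`, `action_items`, or `lesson_evidence` as missing is a patch note, not yet a postmortem.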

How the book is structured

In the public description, the author frames the book as 15 backend bug stories. So the honest way to cover it here is not to invent a 15-item table of contents, but to read it as a repeating investigative loop: observable effect -> investigation -> concept -> root cause.

Observable effect

Each story starts with an external effect: users complain about a slow product, a stuck live stream, or an intermittent timeout rather than about database CPU.

Investigation

The engineer narrows the search: traces, logs, metrics, client behavior, network limits, and infrastructure settings.

Engineering concept

The case becomes a reason to explain a concrete backend topic: connection limits, HTTP/2 overhead, HPACK, file descriptors, TCP windows, or state correctness.

Root cause

The final conclusion connects symptom to cause and shows why the first obvious fix would often treat the chart rather than the system.

Case: system-wide slowness caused by COUNT(*)

The most telling case from the post begins with a vague symptom: users feel that the whole product became slow. A symptom like that is hard to work with: it is not tied to a single endpoint, button, or workflow, and the investigation risks spreading across the whole stack.

The investigation eventually leads to a small UI element: the search box shows text like “Search 550M items”, while JavaScript repeatedly calls an API that runs SELECT COUNT(*) over a huge table. The trace shows more than 100,000 such requests in 30 minutes.

Symptom

The whole product feels slow, the database is CPU-bound, and users cannot work normally.

False fix

Vertically scale the database and conclude that the machine was too weak.

Root cause

Product UI created meaningless backend load where an approximate number would have been enough.

The SRE lesson is not primarily about SQL, but about the boundary between symptom and cause. Database CPU was a real signal, but not the root cause. The root cause lived in product behavior that made an expensive exact operation part of a constant user path.
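
To put the trace numbers in perspective: 100,000 requests in 30 minutes is roughly 55 exact counts per second against a huge table. The following is a minimal sketch of the approximate-number idea, assuming a short-lived cached value is acceptable for a search label. `CachedCount`, `simulate`, the 60-second TTL, and the evenly spaced replay are hypothetical and not taken from the book.

```python
import time
from typing import Callable, Optional


class CachedCount:
    """Serve a possibly stale count; hit the backend at most once per TTL."""

    def __init__(self, fetch: Callable[[], int], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # stand-in for the expensive COUNT(*) call
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[int] = None
        self._expires_at = 0.0
        self.fetches = 0             # how many times the backend was hit

    def get(self) -> int:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
            self.fetches += 1
        return self._value


def simulate(requests: int, window_seconds: float, ttl_seconds: float) -> int:
    """Replay evenly spaced requests against the cache; return backend fetches."""
    now = 0.0
    cache = CachedCount(fetch=lambda: 550_000_000, ttl_seconds=ttl_seconds,
                        clock=lambda: now)  # fake clock driven by the loop
    step = window_seconds / requests
    for i in range(requests):
        now = i * step
        cache.get()
    return cache.fetches


rate_per_second = 100_000 / (30 * 60)
backend_fetches = simulate(requests=100_000, window_seconds=30 * 60, ttl_seconds=60)
print(f"{rate_per_second:.1f} req/s, {backend_fetches} backend fetches")
```

Under these assumptions the replay reaches the backend 30 times instead of 100,000. Whether a cached or estimated number is acceptable for the label is a product decision, which is exactly where the case places the root cause.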

Case: SSE, HTTP/2, load balancer, and new bottlenecks

The second group of stories is especially useful for architects and SREs because it shows an uncomfortable reality: an architecture improvement is rarely just an improvement. It changes failure modes and moves pressure somewhere else.

HTTP/1.1 and SSE

Long-lived server-sent events consume browser connections per host. Once the limit is reached, the next request can sit in a client-side queue even though the backend has not done anything wrong yet.

HTTP/2

Multiplexing removes one limit, but changes backend cost: more small requests, TLS, frame parsing, stream state, and HPACK can start showing up as CPU overhead.

Load balancer

The load balancer relieves the backend and lets internal traffic move back to HTTP/1.1, but brings its own settings, caching behavior, connection reuse, and default values.

File descriptors and TCP window

An unlimited backend connection pool can run into file descriptor limits, TCP receive window behavior, or read-side pressure. The limit is visible low in the stack, but the cause may sit higher.

This is strong postmortem material: after every fix, ask which limit will be hit next, which metrics need to appear, and which default value is now dangerous for the new operating mode.
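
The HTTP/1.1 and SSE item above can be reduced to simple slot arithmetic. This sketch assumes a per-host limit of 6 connections, a common browser default for HTTP/1.1 that is not stated in the source; `free_slots` and `queued_requests` are hypothetical helper names.

```python
def free_slots(open_streams: int, per_host_limit: int = 6) -> int:
    """Connections still available to the host after long-lived SSE streams."""
    return max(per_host_limit - open_streams, 0)


def queued_requests(open_streams: int, pending_requests: int,
                    per_host_limit: int = 6) -> int:
    """Requests that wait on the client before the backend ever sees them."""
    return max(pending_requests - free_slots(open_streams, per_host_limit), 0)


# Six tabs, each holding one SSE stream to the same host:
# even a single ordinary request has to wait in the client-side queue.
print(queued_requests(open_streams=6, pending_requests=1))

# Four streams leave two slots, so three of five requests queue.
print(queued_requests(open_streams=4, pending_requests=5))
```

The point of the model is where the wait happens: none of the queued requests appears in backend latency metrics, which is why the symptom can look like a slow server while the backend has not done anything wrong yet.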

Failure map worth taking from the book

Product UI creates backend load

What it looks like

The whole product feels slow and the database is CPU-bound, but the hot spot turns out to be a small search label.

SRE lesson

Check which product flow generated the load. Sometimes the right fix is to remove an exact COUNT(*), not buy more hardware.

An architecture improvement changes failure modes

What it looks like

HTTP/2 fixes the client queue problem, but adds CPU cost and forces the backend path to be redesigned.

SRE lesson

Every improvement needs a new capacity model, new metrics, and a clear answer to which bottleneck comes next.

Resource limits masquerade as network errors

What it looks like

A 504 Gateway Timeout can look like a gateway issue while the real pressure is file descriptors or connection pools.

SRE lesson

Keep application, runtime, OS, and network metrics together; otherwise the investigation stops at the last visible component.

Subtle state bugs are worse than obvious crashes

What it looks like

Race conditions, state corruption, and distributed-system edge cases can appear rarely and quietly damage trust without a loud outage.

SRE lesson

A postmortem should cover not only uptime, but correctness: which invariants were broken and how they will be checked now.
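
For the resource-limit entry in the map, a small headroom check illustrates the kind of OS-level signal worth keeping next to application metrics. This is a sketch under assumptions: `fd_headroom` and `current_fd_usage` are hypothetical names, and the probe relies on `resource.getrlimit` and `/proc/self/fd`, so it works only on Linux-like systems.

```python
import os
from typing import Optional, Tuple


def fd_headroom(open_fds: int, soft_limit: Optional[int]) -> float:
    """Fraction of the descriptor budget still free, clamped to [0, 1].

    soft_limit=None means the process has no descriptor limit.
    """
    if soft_limit is None:
        return 1.0
    if soft_limit <= 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - open_fds / soft_limit))


def current_fd_usage() -> Tuple[int, Optional[int]]:
    """Return (open descriptors, soft limit) for this process. Linux-only."""
    import resource  # Unix-only module, imported lazily on purpose

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    limit = None if soft == resource.RLIM_INFINITY else soft
    return open_fds, limit
```

Exporting `fd_headroom(*current_fd_usage())` as a gauge next to pool size and request latency gives the investigation a chance to see descriptor pressure before it surfaces as a 504 at the gateway.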

How to read it with practical value

For incident triage

Use the stories as question practice: what changed, where the symptom appeared first, which dependencies are shared, and which data supports the hypothesis.

For observability

After each case, ask which metrics, logs, and traces would have shortened the path from symptom to root cause.

For postmortems

The author emphasizes bug reports. For SRE work, that is a reason to practice clear defect narratives: impact, timeline, root cause, fix, and prevention.

For SRE interviews

The book helps you answer with investigation logic rather than memorized patterns: stabilize, narrow the search, prove the cause, and propose prevention.

Main limitation

This is not a replacement for the SRE Book, a networking textbook, or a database internals book. It is closer to an engineering blog in book form: it gives the feel of a real investigation and shows how foundational topics surface in production.

So it works best alongside your own incidents, your postmortem template, and the chapters on observability, performance, and troubleshooting. In that company the stories stop being entertainment and turn into a working habit of reaching the cause.

Related chapters

Troubleshooting Interview - provides the investigation frame: symptoms, stabilization, hypotheses, telemetry, root cause, and follow-up.
Troubleshooting Interview Example - shows what this kind of conversation looks like in practice and how to separate observations from conclusions.
Incident Management as an Engineering Discipline - connects root-cause analysis with roles, escalation, service restoration, and postmortems.
Observability & Monitoring Design - provides the foundation for investigations that do not depend on luck or memory.
Distributed tracing in microservices - helps move from a user request to the service, dependency, or latency segment that matters.
Performance Engineering - adds the language of profiling, capacity planning, and bottleneck analysis.
Site Reliability Engineering (short summary) - provides the operating frame for SLOs, error budgets, monitoring, on-call, and postmortems.
Release It! (short summary) - covers resilience practices: timeouts, isolation boundaries, load shedding, and protection from cascades.

Related materials

Book Cube post - the review anchor: COUNT(*), SSE, HTTP/1.1, HTTP/2, load balancer, and SRE context.
Root Cause on Amazon - book page with the description, author, and edition format.
Announcement by Hussein Nasser - author context: 15 backend bug stories, investigation flow, diagrams, and a fundamental concept in each story.
Hussein Nasser on YouTube - the author's channel about backend engineering, networking, databases, and distributed systems.

Where to find the book

amazon.com

Root Cause: Stories and Lessons from Two Decades of Backend Engineering Bugs