Technoshow “Dropped”: episode 1

A strong incident review matters not because of the drama, but because it turns a painful failure into a better model for design decisions.

The analysis of a two-week T-Bank data-platform incident shows how metadata loss, Kafka-based recovery, and data contracts expose real platform fragility.

For retrospectives and design reviews, the chapter helps teams discuss blameless analysis, metadata survivability, contract quality, and the limits of automation through a real prolonged failure.

Practical value of this chapter

Design in practice

Turn guidance on data platform incident review and metadata recovery into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit for data platform incident review and metadata recovery: release speed, automation level, observability cost, and operational complexity.

Watch on YouTube

A two-week data-platform incident, taken apart live and without smoothing: how the system actually failed, what was missing during recovery, why the team had to turn the architecture around, and what management lesson was left at the end.

Format: Live tech show / blameless postmortem

Production: T-Bank

Subject: SRE, Data Platform, ETL/DWH

Source

Telegram: Book Cube

The original episode review with an emphasis on SRE practices and management implications.

Release context

The episode is framed as a blameless postmortem: the conversation stays on failure mechanics, recovery, and changes to architecture and process rather than on who pressed the wrong button. The price of that format is honesty: the team has to name where the system failed instead of pointing at a person. The episode therefore reads from two sides: an engineer sees what to fix in the pipeline, and a manager sees where process maturity fell short before the outage.

Guest and context

Alexander Krasheninnikov

Practitioner in the field of data platforms, ETL and DWH.
Guest of the first episode, talking about a real two-week incident.
Focus on engineering and organizational insights following service recovery.

Incident chain

Stage 1

Metadata loss and read degradation

Metadata loss led to Trino failures and cascading degradation for data consumers.

Stage 2

Recovery was not enough

There was no backup for the required point in time. The fallback path, CDC-based metadata recovery, technically worked but ran too slowly: business consumers kept waiting for data that did not arrive.

Stage 3

Critical incident and change of approach

The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel ingestion.

Stage 4

Reconciliation, discrepancy fixes, and recovery

Then came data reconciliation, discrepancy fixes, partial service recovery, and historical backfill from reserves.
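The reconciliation step can be illustrated with a minimal sketch. It is not the team's actual tooling: the function name `reconcile`, the row shape, and the `id` key are assumptions made for illustration. The idea is to compare a trusted source extract against the rebuilt target by key and report what is missing, unexpected, duplicated, or different.

```python
from collections import Counter


def reconcile(source_rows, target_rows, key):
    """Compare two lists of dict rows by `key` and report discrepancies.

    Hypothetical helper for illustration; real reconciliation usually runs
    on aggregates or control samples rather than full in-memory copies.
    """
    target_counts = Counter(row[key] for row in target_rows)
    source_by_key = {row[key]: row for row in source_rows}
    target_by_key = {row[key]: row for row in target_rows}

    shared = source_by_key.keys() & target_by_key.keys()
    return {
        # Keys present in the source but absent from the rebuilt target.
        "missing_in_target": sorted(source_by_key.keys() - target_by_key.keys()),
        # Keys that appear only in the target.
        "unexpected_in_target": sorted(target_by_key.keys() - source_by_key.keys()),
        # Keys loaded more than once, for example after a replay.
        "duplicates_in_target": sorted(k for k, n in target_counts.items() if n > 1),
        # Keys present on both sides whose row contents differ.
        "mismatched": sorted(k for k in shared if source_by_key[k] != target_by_key[k]),
    }


source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
target = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 2, "amount": 25},
    {"id": 4, "amount": 40},
]
report = reconcile(source, target, key="id")
```

Each bucket in the report maps to a different fix: missing keys call for backfill, duplicates for idempotent reloads, and mismatches for a closer look at the transformation logic.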

Related chapter

Data Pipeline / ETL / ELT Architecture

Data-pipeline design: orchestration, recovery, and data quality.

What matters to engineers

CDC pipelines are convenient while everything works; the recovery scenario and data-integrity checks have to be designed upfront, or they simply are not there during an incident.
The Outbox pattern lowers the risk of losing events between services, but it does not replace schema and contract-version discipline.
When a path switches to failover, data contracts and schema unification are what hold it together: without them the ETL/DWH paths drift apart within the first hour of the switch.
Pipeline reliability needs dedicated data observability: connectors, consumer lag, and replication failures must be visible before users complain.
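To make the point about data contracts concrete, here is a minimal sketch of a contract check applied to incoming records. The `Contract` class, the `violations` function, and the `orders` fields are hypothetical names chosen for illustration; a production setup would more likely rely on a schema registry and a serialization format with built-in compatibility rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """A hypothetical, deliberately simple data contract."""
    name: str
    version: int
    fields: dict  # field name -> expected Python type


def violations(contract, record):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected in contract.fields.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    # Fields the contract does not know about are reported as well,
    # so schema drift is visible instead of silently passing through.
    for field in sorted(record.keys() - contract.fields.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


orders_v1 = Contract("orders", 1, {"order_id": int, "amount": float})
ok = violations(orders_v1, {"order_id": 1, "amount": 9.5})
bad = violations(orders_v1, {"order_id": "1", "currency": "RUB"})
```

Running a check like this at the ingestion boundary means a failover path rejects or quarantines drifting records early, rather than letting two ETL/DWH paths diverge unnoticed.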

Primary source

Google SRE: Postmortem Culture

Basic principles of blameless culture and learning from failures.

Management perspective

A blameless postmortem removes internal friction: the team spends its energy on the cause of the failure instead of self-defense, and the conversation reaches systemic improvements faster.
A shared risk language, including incident severity, error budgets, and release freezes, helps align business and engineering expectations.
Phrase a data SLO in the consumer's terms first — “data is fresh by this deadline” — and only then decompose it into technical metrics and alerts; the reverse order produces numbers the business has no way to check.
A data-platform incident rarely stays a purely technical failure — it exposes process maturity: how fast it was escalated, who made the decisions, and whether a recovery plan was ready.
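The error-budget part of that shared risk language comes down to simple arithmetic, sketched below. The function names and the numbers (a 99.9% target over a 30-day window) are illustrative assumptions, not figures from the episode.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes within the window during which the SLO may be violated."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Budget left after `bad_minutes` of violation; negative means overspent."""
    return error_budget_minutes(slo_target, window_days) - bad_minutes


budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining(0.999, 30, bad_minutes=60)
```

With these illustrative numbers the budget is about 43 minutes per 30 days, so a single hour of stale data already overspends it. A negative remainder is the kind of signal that can justify a release freeze in terms both business and engineering understand.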

Primary source

Google SRE Workbook: Implementing SLOs

How to translate user expectations into measurable goals and alerts.

SLI/SLO for the data platform

Data speed

freshness, end-to-end latency, consumer lag

Integrity

duplicates, missing records, discrepancies in control samples

Connector stability

error rate, restart rate, time to recover

Example user-facing SLO: “data in the mart is no more than X minutes old 99.9% of the time,” after which the goal is decomposed into alerts, risk, and a recovery plan.
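A freshness SLO of this shape can be checked with a few lines of code. The sketch below assumes that data age is sampled at regular intervals and expressed in minutes; the function names, the 30-minute threshold standing in for X, and the sample values are hypothetical.

```python
def freshness_compliance(ages_minutes, max_age_minutes):
    """Fraction of samples in which the data was fresh enough."""
    if not ages_minutes:
        raise ValueError("at least one freshness sample is required")
    good = sum(1 for age in ages_minutes if age <= max_age_minutes)
    return good / len(ages_minutes)


def slo_met(ages_minutes, max_age_minutes, target=0.999):
    """True if the share of fresh samples reaches the SLO target."""
    return freshness_compliance(ages_minutes, max_age_minutes) >= target


# 1000 evenly spaced checks of data age in the mart, in minutes.
one_stale = [5] * 999 + [45]
two_stale = [5] * 998 + [45, 45]

within_slo = slo_met(one_stale, max_age_minutes=30)
breached = slo_met(two_stale, max_age_minutes=30)
```

The same compliance ratio can feed alerting: the further it drops below the target within the window, the faster the error budget is being spent.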

References

Related chapters

Observability & Monitoring Design - Covers metrics, alerts, and diagnostics that matter during prolonged data-platform incidents.
Site Reliability Engineering - Foundational SRE practices: blameless postmortems, error budgets, and reliability management.
The Site Reliability Workbook - Hands-on patterns for SLI/SLO design and operational rollout in product teams.
Data Pipeline / ETL / ELT Architecture - Data pipeline architecture, recovery patterns, and platform-level data quality controls.
Consistency and idempotency - Helps reduce duplicates, state divergence, and replay-related failures in recovery flows.