System Design Space

Updated: May 17, 2026 at 11:00 AM

Technoshow “Dropped”: episode 1

medium

Blameless analysis of a two-week T-Bank data-platform incident: metadata loss, recovery through Kafka and data contracts, data SLOs, and engineering/management takeaways.

A strong incident review matters not because of the drama, but because it turns a painful failure into a better model for design decisions.

The analysis of a two-week T-Bank data-platform incident shows how metadata loss, Kafka-based recovery, and data contracts expose real platform fragility.

For retrospectives and design reviews, the chapter helps teams discuss blameless analysis, metadata survivability, contract quality, and the limits of automation through a real prolonged failure.

Practical value of this chapter

Design in practice

Turn guidance on data platform incident review and metadata recovery into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture against SLOs, error budgets, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit for data platform incident review and metadata recovery: release speed, automation level, observability cost, and operational complexity.


Live broadcast with an honest blameless review of a two-week data-platform incident: from the failure mechanics to recovery, architecture changes, and organizational takeaways.

Format: Live tech show / blameless postmortem
Production: T-Bank
Subject: SRE, Data Platform, ETL/DWH

Source

Telegram: Book Cube

The original episode review with an emphasis on SRE practices and management implications.


Release context

The episode is framed as a blameless postmortem: it focuses on failure mechanics, recovery, architecture changes, and process changes rather than personal blame. That makes it useful for engineers and managers who need a shared language for risk and reliability.

Guest and context

Alexander Krasheninnikov

  • Practitioner in the field of data platforms, ETL and DWH.
  • Guest of the first episode, talking about a real two-week incident.
  • Focus on engineering and organizational insights following service recovery.

Incident chain

Stage 1

Metadata loss and read degradation

Metadata loss led to Trino failures and cascading degradation for data consumers.

Stage 2

Recovery was not enough

There was no backup for the required point in time, and CDC-based metadata recovery was too slow for business consumers.

Stage 3

Critical incident and change of approach

The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel ingestion.
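As an illustration of what a data contract check at ingestion might look like, here is a minimal sketch. The `Contract` class, the `orders` topic, the field names, and the `schema_version` envelope key are all hypothetical; the episode does not describe the team's actual contract format.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Contract:
    """Hypothetical data contract: topic, schema version, required fields."""
    topic: str
    version: int
    fields: dict[str, type]  # field name -> expected Python type


def violations(contract: Contract, event: dict[str, Any]) -> list[str]:
    """Return readable contract violations; an empty list means the event is valid."""
    problems = []
    if event.get("schema_version") != contract.version:
        problems.append(
            f"schema_version {event.get('schema_version')!r} != {contract.version}"
        )
    payload = event.get("payload", {})
    for name, expected in contract.fields.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(payload[name]).__name__}"
            )
    return problems


# Assumed example contract and events, for illustration only.
ORDERS_V2 = Contract("orders", 2, {"order_id": str, "amount_minor": int, "currency": str})

good = {"schema_version": 2,
        "payload": {"order_id": "o-1", "amount_minor": 1250, "currency": "RUB"}}
bad = {"schema_version": 1,
       "payload": {"order_id": "o-2", "amount_minor": "12.50"}}

print(violations(ORDERS_V2, good))  # []
for problem in violations(ORDERS_V2, bad):
    print(problem)
```

Rejecting or quarantining events at the producer boundary keeps a schema drift from becoming a downstream discrepancy that has to be found later by reconciliation.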

Stage 4

Reconciliation, discrepancy fixes, and recovery

Data reconciliation and discrepancy fixes followed, then partial service recovery and a historical backfill from reserve copies.
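A reconciliation pass of this kind can be sketched as a keyed comparison of row digests between the source and the rebuilt target. This is an assumed, simplified approach (in-memory rows, a hypothetical `id` key), not the procedure used in the incident.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Stable digest of a row; keys are sorted so column order does not matter."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: list[dict], target: list[dict], key: str) -> dict[str, list]:
    """Compare two datasets by primary key and report three kinds of discrepancy."""
    src = {r[key]: row_digest(r) for r in source}
    dst = {r[key]: row_digest(r) for r in target}
    return {
        "missing_in_target": sorted(src.keys() - dst.keys()),
        "unexpected_in_target": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


# Illustrative data only.
source_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},
    {"id": 3, "amount": 300},
]
target_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},  # value drifted
    {"id": 4, "amount": 400},  # row that the source does not have
]

report = reconcile(source_rows, target_rows, key="id")
print(report)
```

On real volumes the same idea is usually applied per partition or per time window, comparing aggregate digests first and drilling down only where they differ.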

Related chapter

Data Pipeline / ETL / ELT Architecture

Data-pipeline design: orchestration, recovery, and data quality.


What matters to engineers

  • CDC pipelines are convenient in normal operations, but they need recovery scenarios and data-integrity checks designed upfront.
  • The Outbox pattern lowers the risk of losing events between services, but it does not replace schema and contract-version discipline.
  • Data contracts and schema unification are critical to keeping ETL/DWH paths resilient during failover.
  • Pipeline reliability needs dedicated data observability: connectors, consumer lag, and replication failures must be visible before users complain.
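The Outbox point above can be made concrete with a small sketch using Python's built-in `sqlite3`. The table layout, the `orders` topic, and the `relay` function are assumptions for illustration; a production setup would typically read the outbox through CDC or a dedicated relay process and publish to Kafka.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        schema_version INTEGER NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: str, amount_minor: int) -> None:
    """Write the business row and its event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount_minor))
        conn.execute(
            "INSERT INTO outbox (topic, schema_version, payload) VALUES (?, ?, ?)",
            ("orders", 2, json.dumps({"order_id": order_id, "amount_minor": amount_minor})),
        )


def relay(publish) -> int:
    """Publish pending events in order, marking each one; returns how many were sent."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)  # at-least-once: a crash here re-sends on the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


place_order("o-1", 1250)
try:
    place_order("o-1", 999)  # duplicate key: no order row and no event are written
except sqlite3.IntegrityError:
    pass

sent = []
relay(lambda topic, payload: sent.append((topic, payload)))
print(sent)
```

The `schema_version` column is there to show where contract versioning would attach: the outbox guarantees that the event exists, while the version tells consumers how to read it.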

Primary source

Google SRE: Postmortem Culture

Basic principles of blameless culture and learning from failures.


Management perspective

  • A blameless postmortem reduces internal turbulence and moves the conversation toward systemic improvements instead of blame.
  • A shared risk language, including incident severity, error budgets, and release freezes, helps align business and engineering expectations.
  • Data SLOs should be phrased in user terms first, then decomposed into technical metrics and alerts.
  • A data-platform incident is not only a technical failure. It is also a management signal about process maturity.

Primary source

Google SRE Workbook: Implementing SLOs

How to translate user expectations into measurable goals and alerts.


SLI/SLO for the data platform

Data timeliness

freshness, end-to-end latency, consumer lag

Integrity

duplicates, missing records, discrepancies in control samples

Connector stability

error rate, restart rate, time to recover
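Two of these indicators, consumer lag and connector error rate, reduce to simple arithmetic once the raw counters are available. The sketch below assumes the offsets and counters have already been collected; the threshold value and the choice to treat a partition with no committed offset as starting from 0 are illustrative assumptions.

```python
def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition: latest log offset minus the consumer's committed offset.

    Assumption: a partition with no committed offset is treated as offset 0.
    """
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def error_rate(failed: int, total: int) -> float:
    """Share of failed operations; defined as 0.0 when nothing was attempted."""
    return failed / total if total else 0.0


LAG_THRESHOLD = 500  # hypothetical alert threshold, in messages

end_offsets = {0: 1_000, 1: 2_500, 2: 400}
committed = {0: 990, 1: 1_900}  # partition 2 has no committed offset yet

lag = partition_lag(end_offsets, committed)
breaching = [p for p, value in lag.items() if value > LAG_THRESHOLD]

print(lag)                # {0: 10, 1: 600, 2: 400}
print(sum(lag.values()))  # 1010
print(breaching)          # [1]
print(error_rate(3, 1_200))
```

Alerting on per-partition lag rather than only on the total helps catch a single stuck partition that a healthy aggregate would hide.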

Example user-facing SLO: “data in the mart is no more than X minutes old 99.9% of the time,” after which the goal is decomposed into alerts, risk, and a recovery plan.
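A freshness SLO of this shape can be checked with a few lines of arithmetic. In the sketch below, the 15-minute threshold stands in for X, and the staleness samples are made up; both are assumptions for illustration.

```python
from datetime import timedelta


def freshness_compliance(samples: list[timedelta], threshold: timedelta) -> float:
    """Fraction of staleness samples that are within the freshness threshold."""
    within = sum(1 for s in samples if s <= threshold)
    return within / len(samples)


def error_budget_remaining(compliance: float, target: float) -> float:
    """Share of the error budget still unspent (negative means the SLO is breached)."""
    allowed = 1 - target   # e.g. 0.1% of samples may be stale
    spent = 1 - compliance
    return 1 - spent / allowed


THRESHOLD = timedelta(minutes=15)  # assumed value of X
TARGET = 0.999                     # the 99.9% objective from the example

# Synthetic data: 10,000 periodic checks, 8 of which found stale data.
samples = [timedelta(minutes=5)] * 9_992 + [timedelta(minutes=45)] * 8

compliance = freshness_compliance(samples, THRESHOLD)
remaining = error_budget_remaining(compliance, TARGET)

print(f"compliance: {compliance:.4%}")
print(f"error budget remaining: {remaining:.0%}")
```

With these numbers the objective is met, but most of the budget is already spent, which is the kind of signal that can justify slowing releases before users notice stale data.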
