System Design Space

Updated: March 24, 2026 at 3:23 PM

Technoshow “Dropped”: episode 1


A blameless analysis of a two-week incident in the T-Bank data platform: metadata loss, recovery via Kafka and contracts, and practical SRE takeaways for data.

A strong incident review matters not because of the drama, but because it turns a painful failure into a new model for design decisions.

The analysis of a two-week incident in the T-Bank data platform shows how metadata loss, recovery through Kafka, and architectural contracts expose the real fragile points of data SRE.

For retrospectives and design reviews, the chapter is useful because it lets you discuss blame-free analysis, metadata survivability, contract quality, and the limits of automation on the basis of a real prolonged failure.

Practical value of this chapter

Design in practice

Turn guidance on practical incident review and data-platform recovery into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit for practical incident review and data-platform recovery: release speed, automation level, observability cost, and operational complexity.

Technoshow “Dropped”: episode 1

Live broadcast with an honest postmortem analysis of a two-week incident in a data platform: from the root cause to the architectural turnaround and organizational conclusions.

Format: Live tech show / blameless postmortem
Production: T-Bank
Subject: SRE, Data Platform, ETL/DWH

Source

Telegram: book_cube

An original review of the issue with an emphasis on SRE practices and management implications.


Release context

The episode is produced in the spirit of blameless postmortems: the focus is on causes, failure mechanics, and changes to architecture and process rather than on personal blame. This makes the release useful both for engineers and for managers who need a common language of reliability.

Guest and focus of examination

Alexander Krasheninnikov

  • Practitioner in the field of data platforms, ETL and DWH.
  • Guest of the first episode, talking about a real two-week incident.
  • Focus on engineering and organizational insights following service recovery.

Incident chain

Stage 1

Metadata loss and read degradation

The deletion/loss of metadata led to failures in Trino and cascading degradation of data consumers.

Stage 2

Recovery limitations

There was no backup at the required point in time, and restoration through CDC was slow and did not meet the business need.

Stage 3

Critical incident and change of approach

The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel loading.
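The data contracts mentioned here can be illustrated with a minimal sketch: a producer validates every event against an agreed schema before publishing it to Kafka. The contract fields and the `validate()` helper below are hypothetical, for illustration only; real platforms typically rely on a schema registry (Avro/Protobuf) rather than hand-rolled checks.

```python
# Hypothetical data-contract check applied before publishing to Kafka.
# Field names and types are invented for illustration.

CONTRACT = {
    "event_id": str,        # required unique identifier
    "entity": str,          # source table / entity name
    "payload": dict,        # the actual row data
    "schema_version": int,  # contract version the producer claims
}

def validate(event: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    for field, expected_type in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return errors

event = {"event_id": "e-1", "entity": "orders", "payload": {"id": 1}, "schema_version": 2}
print(validate(event))              # valid event: []
print(validate({"entity": "orders"}))  # violations are reported, not silently dropped
```

The key design point is that violations are surfaced explicitly instead of letting malformed events propagate downstream, which is exactly the failure mode contracts are meant to prevent.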

Stage 4

Validation, discrepancy repair and recovery

Then came data reconciliation, correction of inconsistencies, partial restoration of service, and backfilling of historical data from reserve copies.
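The reconciliation step above can be sketched as a per-partition comparison of row counts between the source and the restored target, flagging partitions that still need backfill. The table data here is fabricated for illustration; real reconciliation would also compare checksums, not just counts.

```python
# Hypothetical reconciliation sketch: compare per-day row counts between
# the source and the restored target, and report days that need backfill.

def find_discrepancies(source_counts: dict[str, int],
                       target_counts: dict[str, int]) -> dict[str, int]:
    """Return {partition: missing_rows} for partitions where target < source."""
    gaps = {}
    for day, expected in source_counts.items():
        actual = target_counts.get(day, 0)
        if actual < expected:
            gaps[day] = expected - actual
    return gaps

source = {"2024-01-01": 1000, "2024-01-02": 1200, "2024-01-03": 900}
target = {"2024-01-01": 1000, "2024-01-02": 1100}  # one short day, one missing day
print(find_discrepancies(source, target))  # {'2024-01-02': 100, '2024-01-03': 900}
```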

Related chapter

Data Pipeline / ETL / ELT Architecture

Pipeline design: orchestration, recovery and data quality.


What is important to engineers

  • CDC pipelines are easy to adopt, but they require recovery scenarios and integrity checks thought through in advance.
  • The Outbox pattern reduces the risk of losing events between services, but it does not replace discipline around contract schemas and versioning.
  • Data contracts and schema unification are critical to the resilience of ETL/DWH loops during failover.
  • Pipeline reliability requires dedicated observability for connectors, consumer lag, and replication failures.
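The last point, dedicated observability for consumer lag, can be sketched as a simple check that raises an alert when a consumer group's lag on any partition exceeds a threshold. The offsets here are hard-coded; in practice they would come from the Kafka admin API or exported metrics.

```python
# Sketch of a connector-lag check: alert when a partition's consumer lag
# exceeds a threshold. Offset values are invented for illustration.

def lag_alerts(end_offsets: dict[int, int],
               committed: dict[int, int],
               threshold: int) -> list[str]:
    """Return alert messages for partitions whose lag exceeds the threshold."""
    alerts = []
    for partition, end in end_offsets.items():
        lag = end - committed.get(partition, 0)
        if lag > threshold:
            alerts.append(f"partition {partition}: lag={lag} > {threshold}")
    return alerts

end_offsets = {0: 10_500, 1: 9_800}
committed = {0: 10_450, 1: 4_000}  # partition 1 has fallen far behind
print(lag_alerts(end_offsets, committed, threshold=1000))
# ['partition 1: lag=5800 > 1000']
```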

Primary source

Google SRE: Postmortem Culture

Basic principles of blameless culture and learning from failures.


Management optics

  • Blameless postmortem reduces internal turbulence and accelerates the transition to systemic improvements rather than finger-pointing.
  • A common risk language (severity, error budget, release freeze) helps synchronize business and engineering expectations.
  • Data SLOs need to be formulated in user terms and then grounded in technical metrics and alerts.
  • An incident in a data platform is not only a technological failure, but also a management signal about the maturity of processes.

Primary source

Google SRE Workbook: Implementing SLOs

How to translate user expectations into measurable goals and alerts.


SLI/SLO for data platform

  • Data speed: freshness, end-to-end latency, consumer lag
  • Integrity: duplicates, omissions, discrepancies in control samples
  • Connector stability: error rate, restart rate, recovery time

An example of a user-facing SLO: “data in the data mart is no older than X minutes 99.9% of the time,” after which the goal is decomposed into alerts and plan-versus-actual risk tracking.
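Computing the SLI behind such a freshness SLO can be sketched as the fraction of check intervals in which the data-mart data was no older than the target. The target, objective, and measurements below are fabricated for illustration.

```python
# Sketch of a freshness SLI: fraction of measurements where data age
# stayed within the target. All numbers are invented for illustration.

FRESHNESS_TARGET_MIN = 15  # the "X minutes" from the SLO
SLO_OBJECTIVE = 0.999      # "99.9% of the time"

def freshness_sli(ages_min: list[float], target: float) -> float:
    """Fraction of measurements where data age was within the target."""
    good = sum(1 for age in ages_min if age <= target)
    return good / len(ages_min)

ages = [3, 5, 4, 20, 6, 2, 4, 5, 3, 4]  # one bad measurement (20 min)
sli = freshness_sli(ages, FRESHNESS_TARGET_MIN)
print(f"SLI={sli:.3f}, objective met: {sli >= SLO_OBJECTIVE}")
# SLI=0.900, objective met: False
```

A real implementation would sample freshness continuously and feed the SLI into burn-rate alerts, but the decomposition from user wording to a measurable ratio is the same.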
