System Design Space

Updated: February 21, 2026 at 11:59 PM

Technoshow “Dropped”: episode 1


Blameless analysis of a two-week incident in the T-Bank data platform: loss of metadata, recovery via Kafka and data contracts, and practical SRE takeaways for data.


A live broadcast with an honest postmortem of a two-week incident in a data platform: from the root cause to the architectural turnaround and the organizational conclusions.

Format: Live tech show / blameless postmortem
Production: T-Bank
Subject: SRE, Data Platform, ETL/DWH

Source

Telegram: book_cube

An original review of the episode with an emphasis on SRE practices and management implications.


Release context

The episode is framed in the spirit of blameless postmortems: the focus is on causes, failure mechanics, and changes to architecture and process rather than on personal blame. That makes it useful both for engineers and for managers who need a shared language of reliability.

Guest and focus of the discussion

Alexander Krasheninnikov

  • Practitioner in data platforms, ETL, and DWH.
  • Guest of the first episode, recounting a real two-week incident.
  • Focuses on the engineering and organizational insights that followed service recovery.

Incident chain

Stage 1

Metadata loss and read degradation

The deletion/loss of metadata led to failures in Trino and cascading degradation for data consumers.

Stage 2

Recovery limitations

There was no backup at the required point in time, and restoration through CDC was slow and did not meet the business need.

Stage 3

Critical incident and change of approach

The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel loads.
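A data contract like the one Stage 3 introduces can be as simple as a typed field list checked before an event is produced. The sketch below is illustrative, not the show's actual implementation: the contract name, fields, and `validate_event` helper are assumptions, and real setups typically rely on a schema registry (Avro/Protobuf) rather than hand-rolled checks.

```python
# Hypothetical contract for a single event type; field names/types are
# illustrative, not taken from the episode.
CONTRACT = {
    "name": "payments.v1",
    "fields": {
        "payment_id": str,
        "amount_cents": int,
        "currency": str,
    },
}

def validate_event(event: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    for field, expected_type in contract["fields"].items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return errors
```

The point of running such a check at the producer boundary is that a malformed event is rejected before it reaches the topic, instead of silently breaking consumers downstream.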

Stage 4

Validation, discrepancy repair and recovery

Then came data reconciliation, correction of inconsistencies, partial restoration of service, and backfilling of historical data from reserves.
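The reconciliation step in Stage 4 can be sketched with control samples: aggregate each side into per-partition counts and checksums, then re-load only the partitions that disagree. This is a minimal illustration under assumed names (`control_sample`, `find_discrepancies`, date-keyed partitions), not the team's actual tooling.

```python
import hashlib
from collections import defaultdict

def control_sample(rows):
    """Aggregate (partition, key) rows into per-partition (count, checksum)."""
    summary = defaultdict(lambda: [0, 0])
    for partition, key in rows:  # e.g. (event_date, primary_key)
        digest = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
        summary[partition][0] += 1
        summary[partition][1] ^= digest  # XOR makes the checksum order-independent
    return {p: tuple(v) for p, v in summary.items()}

def find_discrepancies(source_rows, target_rows):
    """Partitions whose count or checksum differ and need a backfill."""
    src, dst = control_sample(source_rows), control_sample(target_rows)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

Comparing compact summaries instead of full rows keeps the reconciliation pass cheap enough to run repeatedly while the backfill is still in progress.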

Related chapter

Data Pipeline / ETL / ELT Architecture

Pipeline design: orchestration, recovery and data quality.


What is important to engineers

  • CDC pipelines are easy to adopt, but they require recovery scenarios and integrity checks thought through in advance.
  • The Outbox approach reduces the risk of losing events between services, but it does not replace discipline around contract schemas and versioning.
  • Data contracts and schema unification are critical to the resilience of ETL/DWH loops during failover.
  • Pipeline reliability calls for dedicated observability of connectors, lag, and replication failures.
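The Outbox point above rests on one mechanism: the business write and the event write share a single database transaction, so an event cannot be lost between the database and the broker. Below is a minimal sketch using SQLite as a stand-in for the service database; the table names, `place_order`, and `relay` are hypothetical, and a real relay would publish to Kafka.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, sent INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    # One transaction: either both rows commit or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_created", "order_id": order_id}),),
        )

def relay(publish) -> int:
    """Ship unsent outbox rows to the broker; returns how many were sent."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # e.g. a Kafka producer's send()
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)
```

Note the remaining discipline the bullet mentions: the relay guarantees at-least-once delivery, so consumers still need idempotency, and the payload still needs a versioned contract.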

Primary source

Google SRE: Postmortem Culture

Basic principles of blameless culture and learning from failures.


Management optics

  • Blameless postmortem reduces internal turbulence and accelerates the transition to systemic improvements rather than finger-pointing.
  • A common risk language (severity, error budget, release freeze) helps synchronize business and engineering expectations.
  • Data SLOs need to be formulated in user terms and then grounded in technical metrics and alerts.
  • An incident in a data platform is not only a technological failure, but also a management signal about the maturity of processes.
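The "common risk language" above can be made concrete with error-budget arithmetic: an SLO target directly implies how much unreliability per window is tolerable before, say, a release freeze. The function name and the 30-day window are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes per window for a given SLO target."""
    return (1.0 - slo) * window_days * 24 * 60
```

For example, a 99.9% target over 30 days leaves a budget of 43.2 minutes; burning it faster than planned is a common, pre-agreed trigger to slow releases, which is exactly the synchronization of business and engineering expectations the bullet describes.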

Primary source

Google SRE Workbook: Implementing SLOs

How to ground user expectations in measurable goals and alerts.


SLI/SLO for the data platform

  • Data speed: freshness, end-to-end latency, consumer lag
  • Integrity: duplicates, gaps, discrepancies in control samples
  • Connector stability: error rate, restart rate, recovery time

An example of a user-facing SLO: “data in the storefront is no older than X minutes 99.9% of the time,” after which the goal is decomposed into alerts and plan-versus-actual risk tracking.
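One way to measure such a freshness SLO, sketched under assumed names: probe the storefront's data age at regular intervals and take the fraction of probes within the threshold as the SLI. This is one of several valid measurement strategies (time-window-based SLIs are another), not the only one.

```python
def freshness_sli(age_samples_min: list[float], threshold_min: float) -> float:
    """Fraction of probes where data age was within the freshness target."""
    ok = sum(1 for age in age_samples_min if age <= threshold_min)
    return ok / len(age_samples_min)

def slo_met(age_samples_min: list[float], threshold_min: float,
            target: float = 0.999) -> bool:
    """Compare the measured SLI against the SLO target (99.9% by default)."""
    return freshness_sli(age_samples_min, threshold_min) >= target
```

Alerting would then fire on the budget-burn rate of this SLI rather than on each individual stale probe.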



© 2026 Alexander Polomodov