Technoshow “Dropped”: episode 1
Live broadcast with an honest postmortem analysis of a two-week incident in a data platform: from the root cause to the architectural turnaround and organizational conclusions.
Source
Telegram: book_cube
An original review of the issue with an emphasis on SRE practices and management implications.
Release context
The episode is built in the spirit of blameless postmortems: the focus is on causes, failure mechanics, and changes to architecture and process rather than on personal blame. This makes the release useful both for engineers and for managers who need a common language of reliability.
Guest and focus of examination
Alexander Krasheninnikov
- Practitioner in the field of data platforms, ETL and DWH.
- Guest of the first episode, talking about a real two-week incident.
- Focus on engineering and organizational insights following service recovery.
Incident chain
Metadata loss and read degradation
The deletion (loss) of metadata led to failures in Trino and cascading degradation for data consumers.
Recovery Limitations
There was no backup at the required point in time, and restoring through CDC was slow and did not meet the business need.
Critical incident and change of approach
The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel loads.
Validation, discrepancy repair and recovery
Next came data reconciliation, correcting inconsistencies, partial restoration of the service, and backfilling historical data from backup copies.
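The reconciliation step described above can be sketched as comparing per-day row counts and checksums between the source and its DWH copy. All names below are illustrative assumptions, not details from the episode:

```python
# Hypothetical reconciliation sketch: find the days on which a target
# (DWH) table diverges from its source by comparing row counts and a
# checksum over sorted row ids. Field names ("day", "id") are made up.
from collections import defaultdict
import hashlib

def summarize(rows):
    """Aggregate rows (dicts with 'day' and 'id') into (count, checksum) per day."""
    acc = defaultdict(lambda: {"count": 0, "digest": hashlib.sha256()})
    for row in sorted(rows, key=lambda r: (r["day"], r["id"])):
        s = acc[row["day"]]
        s["count"] += 1
        s["digest"].update(str(row["id"]).encode())
    return {day: (s["count"], s["digest"].hexdigest()) for day, s in acc.items()}

def reconcile(source_rows, target_rows):
    """Return the sorted list of days where the target diverges from the source."""
    src, dst = summarize(source_rows), summarize(target_rows)
    return sorted(day for day in src.keys() | dst.keys() if src.get(day) != dst.get(day))
```

Divergent days can then be repaired selectively instead of reloading the whole history.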
Related chapter
Data Pipeline / ETL / ELT Architecture
Pipeline design: orchestration, recovery and data quality.
What is important to engineers
- CDC pipelines are easy to adopt, but they require recovery scenarios and integrity checks designed in advance.
- The outbox approach reduces the risk of losing events between services, but it does not replace discipline around schema contracts and versioning.
- Data contracts and schema unification are critical to the resilience of ETL/DWH pipelines during failover.
- Pipeline reliability requires dedicated observability for connectors, lags, and replication failures.
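The outbox approach mentioned above can be illustrated with a minimal sketch: the business write and the event write commit in one transaction, and a separate relay ships unpublished events to the broker. This uses sqlite3 as a stand-in database; table and topic names are illustrative, not from the episode:

```python
# Minimal transactional-outbox sketch (sqlite3 stands in for the real DB,
# and `publish` stands in for a Kafka producer). Illustrative names only.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id, status):
    # The business row and the outbox event commit atomically in one
    # transaction, so an event cannot be lost between service and broker.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders.v1", json.dumps({"id": order_id, "status": status})))

def relay(publish):
    # A separate poller delivers unpublished events and marks them sent.
    with db:
        rows = db.execute(
            "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
        for ev_id, topic, payload in rows:
            publish(topic, json.loads(payload))
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (ev_id,))
```

Note that the relay gives at-least-once delivery (a crash between `publish` and the update re-sends the event), which is exactly why schema contracts and idempotent consumers still matter.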
Primary source
Google SRE: Postmortem Culture
Basic principles of blameless culture and learning from failures.
Management perspective
- A blameless postmortem reduces internal turbulence and accelerates the shift to systemic improvements instead of finger-pointing.
- A common risk language (severity, error budget, release freeze) helps synchronize business and engineering expectations.
- Data SLOs need to be formulated in user terms and then grounded in technical metrics and alerts.
- An incident in a data platform is not only a technological failure, but also a management signal about the maturity of processes.
Primary source
Google SRE Workbook: Implementing SLOs
How to ground user expectations into measurable goals and alerts.
SLI/SLO for data platform
Data timeliness
freshness, end-to-end latency, consumer lag
Integrity
duplicates, gaps, discrepancies in control samples
Connector stability
error rate, restart rate, recovery time
An example of a user-facing SLO: "data in the data mart is no older than X minutes 99.9% of the time"; the goal is then decomposed into alerts and plan-versus-actual risk tracking.
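A freshness SLO like the one above can be checked mechanically: count the measurement windows within the threshold and compare against the target. The numbers and function name below are made up for the example:

```python
# Illustrative check of a freshness SLO: "data no older than `threshold`
# minutes `target` of the time". Inputs are per-window freshness samples.
def slo_report(freshness_minutes, threshold=15, target=0.999):
    total = len(freshness_minutes)
    good = sum(1 for f in freshness_minutes if f <= threshold)
    compliance = good / total
    allowed_bad = (1 - target) * total   # error budget, in measurement windows
    return {
        "compliance": compliance,
        "budget_remaining": allowed_bad - (total - good),
        "met": compliance >= target,
    }
```

A negative `budget_remaining` is the kind of signal that, per the SRE workbook, can trigger a release freeze until the budget recovers.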

