A data pipeline rarely breaks because the diagram shows both batch and streaming. It breaks when nobody owns data quality, recovery, and the cost of moving data across layers.
In real engineering work, this chapter helps you choose between ETL and ELT based on platform maturity, where transforms run, compute cost, and how data quality is embedded into day-to-day operations.
In interviews and architecture reviews, it is especially useful when you need to discuss schema drift, data loss, freshness SLOs, and recovery as architectural properties rather than problems left for the analytics team to discover later.
Practical value of this chapter
Design in practice
Helps choose ETL vs ELT based on organizational constraints and platform maturity.
Decision quality
Provides a framework for transform-stage decisions across cost, speed, and quality control.
Interview articulation
Supports clear explanation of ingestion, validation, lineage, and serving layers.
Risk and trade-offs
Makes schema drift, data loss, and freshness-SLO risks explicit.
Base
Streaming Data
Builds on batch/stream thinking, delivery semantics, and processing tiers.
Data Pipeline / ETL / ELT architecture is about designing a loop that reliably and predictably turns raw events and records into useful data marts for analytics, ML, and product APIs. The key engineering challenge here is not only transformation, but also pipeline reliability: idempotency, replay, data quality, observability, cost governance, and failure recovery.
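To make the loop concrete, here is a toy Python sketch of the four stages; every name (raw_layer, serving, the event fields) is invented for illustration, and real pipelines split these stages across separate systems.

```python
# A toy end-to-end loop: ingest -> land raw -> transform -> serve.
# All names are illustrative; the data flow and ownership questions
# are the same in real systems.
raw_layer = []   # immutable landing zone (replay / audit)
serving = {}     # data mart keyed by user

def ingest(event: dict):
    raw_layer.append(event)   # land first, transform later

def transform_and_serve():
    # Recompute a simple aggregate mart from the raw layer.
    marts = {}
    for e in raw_layer:
        marts[e["user_id"]] = marts.get(e["user_id"], 0) + e["amount"]
    serving.clear()
    serving.update(marts)

ingest({"user_id": 1, "amount": 10.0})
ingest({"user_id": 1, "amount": 5.0})
transform_and_serve()
print(serving)   # {1: 15.0}
```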
ETL vs ELT: how to choose
ETL
Transform before loading into the target storage.
When it suits
- Strict requirements for data quality before entering the DWH.
- Limited target storage resources.
- We need a predictable, stable shape for incoming data.
Risks
- It is more difficult to reuse raw data for new use cases.
- Changing business logic often requires reprocessing the upstream layer.
ELT
Raw data is loaded into the storage/warehouse first and transformed later; the sketch after this subsection contrasts the two approaches.
When it suits
- We need high ingestion speed and analytical flexibility.
- The team is actively experimenting with models and data marts.
- There is a powerful compute layer in the DWH/Lakehouse.
Risks
- Without governance and cost control, it is easy to end up with expensive chaos.
- Strict quality policies and access control for the raw layer are required.
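As a hedged illustration of the difference, the sketch below contrasts both approaches using Python's sqlite3 module as a stand-in warehouse. Table and field names are invented, and the ELT branch assumes a SQLite build with the JSON functions available (the default in recent builds).

```python
import json
import sqlite3

raw_events = [
    '{"user_id": 1, "amount": "10.5", "ts": "2024-01-01T10:00:00Z"}',
    '{"user_id": 2, "amount": "3.0",  "ts": "2024-01-01T10:01:00Z"}',
]

conn = sqlite3.connect(":memory:")

# ETL: transform in pipeline code *before* loading; only clean rows land.
conn.execute("CREATE TABLE etl_orders (user_id INT, amount REAL, ts TEXT)")
clean = [
    (r["user_id"], float(r["amount"]), r["ts"])
    for r in map(json.loads, raw_events)
]
conn.executemany("INSERT INTO etl_orders VALUES (?, ?, ?)", clean)

# ELT: load raw payloads as-is, transform later inside the warehouse.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(e,) for e in raw_events])
conn.execute("""
    CREATE TABLE elt_orders AS
    SELECT json_extract(payload, '$.user_id') AS user_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           json_extract(payload, '$.ts') AS ts
    FROM raw_orders
""")
print(conn.execute("SELECT * FROM elt_orders").fetchall())
```

Note the asymmetry: in the ETL branch a transform bug blocks loading entirely, while in the ELT branch the raw payloads survive and the transform can simply be re-run.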
Reference data pipeline architecture
Ingestion
CDC, API pulls, events, file loads. Controlling schema drift and ensuring idempotency here is critical.
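A minimal sketch of one way to guard against schema drift at ingestion; the expected schema, field names, and quarantine handling are all assumptions for illustration.

```python
# Expected schema for incoming records (illustrative).
EXPECTED = {"user_id": int, "amount": float, "ts": str}

def check_drift(record: dict):
    extra = set(record) - set(EXPECTED)
    missing = set(EXPECTED) - set(record)
    wrong_type = {k for k, t in EXPECTED.items()
                  if k in record and not isinstance(record[k], t)}
    return extra, missing, wrong_type

def ingest(record: dict):
    extra, missing, wrong_type = check_drift(record)
    if extra or missing or wrong_type:
        # Route drifted records aside for inspection instead of
        # silently dropping them or poisoning the Bronze layer.
        print("quarantined:", record, extra, missing, wrong_type)
    else:
        print("landed:", record)

ingest({"user_id": 1, "amount": 9.5, "ts": "2024-01-01T00:00:00Z"})
ingest({"user_id": "1", "amount": 9.5})  # drift: wrong type + missing field
```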
Raw / Bronze
Immutable layer with raw data for replay and audit. Minimal business logic.
Transform / Silver
Cleaning, deduplication, timestamp standardization, enrichment, and key reconciliation.
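A minimal sketch of a Silver-layer transform under assumed field names: deduplication by a business key and timestamp normalization to UTC.

```python
from datetime import datetime, timezone

def to_silver(bronze_rows):
    seen = set()
    out = []
    for row in bronze_rows:
        key = (row["user_id"], row["order_id"])  # business key, assumed
        if key in seen:
            continue                             # deduplicate replays
        seen.add(key)
        # Normalize ISO timestamps (Z suffix included) to explicit UTC.
        ts = datetime.fromisoformat(row["ts"].replace("Z", "+00:00"))
        out.append({**row, "ts": ts.astimezone(timezone.utc).isoformat()})
    return out

bronze = [
    {"user_id": 1, "order_id": "A1", "ts": "2024-01-01T10:00:00Z"},
    {"user_id": 1, "order_id": "A1", "ts": "2024-01-01T10:00:00Z"},  # dup
]
print(to_silver(bronze))
```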
Serving / Gold
Domain data marts and aggregates for BI, ML, APIs, and operational workloads.
Orchestration + Quality
DAG scheduler, dependency graph, retries, SLA/SLO, data tests, lineage and alerting.
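The control plane can be reduced to a toy sketch: ordered dependencies, retries, and a data test that gates publishing. A real orchestrator such as Airflow or Dagster provides all of this; the task functions below are invented stubs.

```python
import time

def run_with_retries(task, retries=3, delay_s=1):
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise          # surfaces to alerting / the pipeline owner
            time.sleep(delay_s)

def extract():           # stub source read
    return [{"user_id": 1, "amount": 10.0}]

def transform(rows):     # stub cleaning step
    return [r for r in rows if r["amount"] >= 0]

def data_test(rows):     # a data quality gate, not just "job succeeded"
    if not rows:
        raise ValueError("data test failed: empty transform output")
    return rows

def publish(rows):       # stub load into the Gold layer
    print(f"published {len(rows)} rows")

# Dependency order: extract -> transform -> test -> publish.
rows = run_with_retries(extract)
rows = run_with_retries(lambda: transform(rows))
rows = data_test(rows)
run_with_retries(lambda: publish(rows))
```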
Hybrid Lakehouse
Hybrid mode: the stream keeps the serving layer fresh with low latency, while batch performs control recomputations and backfill for consistency (a minimal sketch follows below).
Pros
- Combines low latency and high accuracy.
- Works well for incremental updates plus periodic full recomputation.
- A single raw layer supports replay and both processing strategies.
Constraints
- The most complex operating model.
- Requires strict orchestration and cost-governance discipline.
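A hedged sketch of the hybrid pattern with in-memory stand-ins: the stream upserts into serving for low latency, while a periodic batch job rebuilds the same table from the raw layer; a last-write-wins merge is assumed.

```python
raw_log = []    # shared append-only raw layer used by both paths
serving = {}    # serving table keyed by entity id

def on_stream_event(event: dict):
    raw_log.append(event)                   # land in raw first
    serving[event["id"]] = event["value"]   # fast incremental upsert

def batch_recompute():
    # Authoritative recomputation from raw; replacing the table
    # wholesale corrects any drift left by late or duplicate events.
    rebuilt = {}
    for event in raw_log:                   # last-write-wins, assumed
        rebuilt[event["id"]] = event["value"]
    serving.clear()
    serving.update(rebuilt)

on_stream_event({"id": "a", "value": 1})
on_stream_event({"id": "a", "value": 2})    # serving is fresh immediately
batch_recompute()                           # nightly consistency pass
print(serving)
```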
Batch and stream operate together over a shared raw layer.
- Ingestion: CDC / API / events.
- Raw / Bronze: immutable landing zone.
- Transform: batch + stream transform.
- Serving / Gold: BI, ML, APIs.
- Control Plane: orchestration + quality + lineage + cost. This loop is always active and determines pipeline reliability regardless of the workload profile.
Watch for long-term divergence between ingest/transform/serve rates.
Data Contracts Checklist
- Schema fields and types are explicit and versioned.
- Every dataset has an owner and SLOs for freshness and completeness.
- Breaking changes follow an agreed deprecation process between producer and consumer.
- Quality checks run at ingestion, not only in the serving layer.
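As one possible shape for such a contract, the sketch below validates records against a versioned contract at the boundary; the contract format and field names are assumptions for illustration, not a standard.

```python
# An assumed contract format: name, version, and typed fields.
CONTRACT = {
    "name": "orders",
    "version": 2,
    "fields": {"user_id": int, "amount": float, "ts": str},
}

def validate(record: dict, contract: dict = CONTRACT) -> dict:
    if record.get("schema_version") != contract["version"]:
        raise ValueError("schema version mismatch")
    for field, ftype in contract["fields"].items():
        if not isinstance(record.get(field), ftype):
            raise TypeError(f"contract violation on field {field!r}")
    return record

validate({"schema_version": 2, "user_id": 1,
          "amount": 9.5, "ts": "2024-01-01T00:00:00Z"})
```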
Related
Observability & Monitoring Design
How to build metrics, alerts and runbooks for production pipelines.
Reliability and operation
- Exactly-once is not always realistic: prefer at-least-once delivery plus idempotent processing (see the sketch after this list).
- Run backfill through a separate lane so it does not break the online SLA.
- Each pipeline must have an owner, a runbook, and SLOs for freshness and completeness.
- Store checkpoint/offset state in a fault-tolerant backend.
- Define data contracts between producers and consumers, and version your schemas.
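The first bullet is worth a sketch: at-least-once delivery made safe by idempotent processing, deduplicating on an event id. The in-memory set stands in for the durable keyed store a real pipeline would use, as recommended above.

```python
processed_ids = set()   # stand-in for a durable, fault-tolerant store

def handle(event: dict):
    if event["event_id"] in processed_ids:
        return                    # duplicate redelivery: safe no-op
    apply_effect(event)           # must complete before marking done
    processed_ids.add(event["event_id"])

def apply_effect(event: dict):
    print("applied", event["event_id"])   # stub side effect

handle({"event_id": "e-1"})
handle({"event_id": "e-1"})   # redelivered by at-least-once: skipped
```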
Common mistakes
- One giant DAG for the entire company without domain boundaries.
- Hidden business logic in SQL scripts without tests or code review.
- Lack of observability: only a 'job failed' signal, with no data quality signals.
- Mixing batch and streaming without a strategy for late-arriving events.
- Opaque cost: no budget guardrails on compute and storage.
References
Related chapters
- Streaming Data - Stream processing, windows, delivery semantics, and architectural tiers.
- Kafka: The Definitive Guide - Log-based backbone for ingestion and stream processing.
- Kappa Architecture - Stream-first approach and replay/backfill via a single log.
- Big Data - The Lambda approach and trade-offs between batch and speed layers.
- Why understand storage systems? - Choosing a storage model and trade-offs for different workloads.
- Event-Driven Architecture - Asynchronous data flows, CQRS/Saga, and integration patterns.
- Observability & Monitoring Design - Pipeline monitoring, alerting, and the operational improvement cycle.
