A data pipeline does not break because the diagram has both batch and streaming paths. It breaks when nobody owns data quality, recovery, and the cost of moving data across layers.
In real engineering work, this chapter helps you choose between ETL and ELT based on platform maturity, transformation placement, and compute cost, and shows how quality checks become part of normal operations.
In interviews and architecture reviews, it is especially useful when schema drift, data loss, freshness SLOs, and recovery need to be explained as architectural properties rather than future analytics-team problems.
Practical value of this chapter
Design in practice
Helps choose ETL vs ELT by organizational constraints and data-platform maturity.
Decision quality
Provides a framework for transformation-placement decisions across cost, speed, and quality control.
Interview articulation
Supports clear explanation of ingestion, validation, lineage, and serving layers.
Risk and trade-offs
Makes schema drift, data loss, and freshness SLO risks explicit.
Data Pipeline / ETL / ELT Architecture is about designing a reliable path from sources to useful marts for analytics, ML, and product APIs. The hard part is not just transformation: mature pipelines need orchestration, data contracts, quality checks, freshness SLOs, lineage, replay, backfill, observability, cost governance, and recovery after failures.
ETL vs ELT: how to choose
ETL
Transform data before loading it into the target store.
When it suits
- Data quality must be checked before data reaches the warehouse.
- The target store has limited storage resources.
- The target system needs a predictable input schema.
Risks
- Raw data is harder to reuse for new use cases.
- Changing business logic often requires reprocessing the upstream layer.
ELT
Load raw data first, then transform it inside the warehouse or lakehouse.
When it suits
- The platform needs fast ingestion and analytical flexibility.
- The team actively experiments with models and marts.
- The warehouse or lakehouse has a strong compute layer.
Risks
- Without governance and cost controls, the raw layer can become expensive chaos.
- Raw data needs strict quality and access policies.
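The difference in transformation placement can be shown with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the warehouse; the `SOURCE_ROWS` data, the table names, and the `clean`, `run_etl`, and `run_elt` helpers are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical source rows: amounts arrive as strings, one row is malformed.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "10.50", "currency": "usd"},
    {"order_id": 2, "amount": "not-a-number", "currency": "eur"},
    {"order_id": 3, "amount": "7.25", "currency": "EUR"},
]


def clean(row):
    """Return a typed tuple, or None if the row fails validation."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return (row["order_id"], amount, row["currency"].upper())


def run_etl(conn):
    """ETL: validate and transform in the pipeline, load only clean rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, amount REAL, currency TEXT)"
    )
    cleaned = [r for r in map(clean, SOURCE_ROWS) if r is not None]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)
    return conn.execute(
        "SELECT order_id, amount, currency FROM orders_etl ORDER BY order_id"
    ).fetchall()


def run_elt(conn):
    """ELT: load raw rows untouched, then transform with SQL inside the store."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, amount TEXT, currency TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders_raw VALUES (:order_id, :amount, :currency)",
        SOURCE_ROWS,
    )
    # The transformation runs on the store's compute; the raw table stays intact.
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(currency) AS currency
        FROM orders_raw
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )
    return conn.execute(
        "SELECT order_id, amount, currency FROM orders_elt ORDER BY order_id"
    ).fetchall()


demo = sqlite3.connect(":memory:")
print(run_etl(demo))
print(run_elt(demo))
```

Both paths produce the same cleaned rows in this sketch. The difference is that the ELT path still holds the malformed row in `orders_raw`, so it can be reprocessed when the rules change, while the ETL path never stores it.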
Reference architecture for a data pipeline
Ingestion
CDC, API pulls, events, and file loads. The pipeline must control schema drift and idempotency.
Raw / Bronze
Immutable raw data for replay, backfill, and audit. Business logic stays minimal here.
Transform / Silver
Cleaning, deduplication, time normalization, enrichment, and key alignment.
Serving / Gold
Domain marts and aggregates for BI, ML, APIs, and operational workloads.
Orchestration + Quality
DAG scheduler, dependency graph, retries, SLA/SLO, data-quality checks, lineage, and alerting.
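The layer responsibilities above can be sketched in a few functions. This is an illustration under stated assumptions, not a reference implementation: the event fields (`event_id`, `user`, `amount`, `ts`) and the function names are hypothetical, and plain Python lists stand in for real storage.

```python
from collections import defaultdict
from datetime import datetime, timezone


def land_bronze(bronze, event):
    """Raw / Bronze: append the event as received, with no business logic."""
    bronze.append(dict(event))


def to_silver(bronze):
    """Transform / Silver: deduplicate by event_id and normalize time to UTC."""
    silver = {}
    for e in bronze:
        if e["event_id"] in silver:
            continue  # duplicate delivery: keep the first copy
        ts_utc = datetime.fromisoformat(e["ts"]).astimezone(timezone.utc)
        silver[e["event_id"]] = {
            "event_id": e["event_id"],
            "user": e["user"],
            "amount": float(e["amount"]),
            "ts_utc": ts_utc,
        }
    return list(silver.values())


def to_gold(silver):
    """Serving / Gold: total amount per user per UTC day."""
    totals = defaultdict(float)
    for row in silver:
        day = row["ts_utc"].date().isoformat()
        totals[(row["user"], day)] += row["amount"]
    return dict(totals)


bronze = []
for raw in [
    {"event_id": "e1", "user": "alice", "amount": "10.0", "ts": "2024-05-01T23:30:00-02:00"},
    {"event_id": "e1", "user": "alice", "amount": "10.0", "ts": "2024-05-01T23:30:00-02:00"},
    {"event_id": "e2", "user": "alice", "amount": "5.0", "ts": "2024-05-02T08:00:00+00:00"},
    {"event_id": "e3", "user": "bob", "amount": "3.0", "ts": "2024-05-01T12:00:00+00:00"},
]:
    land_bronze(bronze, raw)

gold = to_gold(to_silver(bronze))
print(gold)
```

Because the bronze list keeps the duplicate and the original local timestamp, the silver and gold layers can be rebuilt from it at any time. That is what makes replay and backfill possible.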
Hybrid Lakehouse
Hybrid mode: the streaming path updates the serving layer quickly, while the batch path runs periodic control recalculations and backfills to keep results consistent.
Pros
- Combines low latency and high accuracy.
- Works well when incremental updates are combined with periodic full recalculation.
- One raw layer serves replay and both processing strategies.
Constraints
- Most complex operating model.
- Requires strict orchestration and cost-governance discipline.
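A minimal sketch of the hybrid idea follows, assuming a page-view counter as the served metric. The `HybridCounter` class and its event fields are hypothetical. The streaming path here deliberately does no deduplication, so that the batch path has something to correct.

```python
class HybridCounter:
    """Stream path updates serving at once; batch path rebuilds it from raw."""

    def __init__(self):
        self.raw = []      # shared raw layer: every delivery, duplicates included
        self.serving = {}  # page -> view count exposed to readers

    def on_stream_event(self, event):
        """Speed path: land the event and increment the served count."""
        self.raw.append(event)
        page = event["page"]
        self.serving[page] = self.serving.get(page, 0) + 1

    def batch_recalculate(self):
        """Control path: recount from raw, counting each event_id once.

        Returns the non-zero drift (served minus recalculated) per page.
        """
        unique = {e["event_id"]: e for e in self.raw}
        rebuilt = {}
        for e in unique.values():
            rebuilt[e["page"]] = rebuilt.get(e["page"], 0) + 1
        pages = set(self.serving) | set(rebuilt)
        drift = {p: self.serving.get(p, 0) - rebuilt.get(p, 0) for p in pages}
        self.serving = rebuilt
        return {p: d for p, d in drift.items() if d != 0}


counter = HybridCounter()
for ev in [
    {"event_id": "e1", "page": "home"},
    {"event_id": "e1", "page": "home"},  # redelivered duplicate
    {"event_id": "e2", "page": "pricing"},
]:
    counter.on_stream_event(ev)

served_before = dict(counter.serving)
drift = counter.batch_recalculate()
print(served_before, drift, counter.serving)
```

The drift value returned by the batch run is a useful signal in its own right: a drift that keeps growing suggests the streaming path is losing or duplicating events.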
In the hybrid profile, batch and stream operate together over a shared raw layer across the ingestion, raw, transform, and serving stages. The control plane (orchestration, quality, lineage, and cost) is always active and determines pipeline reliability regardless of profile. Watch for long-term divergence between ingest, transform, and serve rates.
Data contracts checklist
Reliability and operation
- Exactly-once is not always realistic: design for at-least-once delivery plus idempotent processing.
- Run backfill through a separate path so the live-flow SLA is protected.
- Every pipeline needs an owner, a runbook, and SLOs for freshness and completeness.
- Store checkpoint/offset state in a fault-tolerant backend.
- Treat producer-consumer data contracts as versioned interfaces.
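The first and fourth points above can be illustrated together. This sketch assumes a list as the message log and a dict as a stand-in for a fault-tolerant checkpoint backend; `IdempotentSink`, `consume`, and the `crash_after` parameter are hypothetical names used to simulate a failure between the write and the offset commit.

```python
class IdempotentSink:
    """Upsert by key, so replaying a message leaves the same final state."""

    def __init__(self):
        self.rows = {}
        self.writes = 0  # counts every write, including replays

    def upsert(self, key, value):
        self.writes += 1
        self.rows[key] = value


def consume(log, sink, checkpoint, crash_after=None):
    """Process messages from the checkpointed offset onward.

    The offset is committed only after the write, which gives at-least-once
    delivery: a crash between write and commit causes a replay on restart.
    """
    offset = checkpoint.get("offset", 0)
    while offset < len(log):
        msg = log[offset]
        sink.upsert(msg["order_id"], msg["status"])
        if crash_after is not None and offset == crash_after:
            raise RuntimeError("simulated crash before checkpoint commit")
        offset += 1
        checkpoint["offset"] = offset


log = [
    {"order_id": 1, "status": "created"},
    {"order_id": 2, "status": "created"},
    {"order_id": 1, "status": "paid"},
]
sink = IdempotentSink()
checkpoint = {}  # stand-in for a durable checkpoint store

try:
    consume(log, sink, checkpoint, crash_after=1)
except RuntimeError:
    pass  # the message at offset 1 was written but not committed

consume(log, sink, checkpoint)  # restart: offset 1 is processed again
print(sink.rows, sink.writes, checkpoint)
```

The sink receives four writes for three messages, yet the final state is the same as it would be without the crash. The replay is harmless only because the write is keyed; an append-only sink would need an explicit deduplication step.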
Common mistakes
- One giant DAG for the entire company without domain boundaries.
- Hidden business logic in SQL scripts without tests or code review.
- Observability limited to 'job failed' without data-quality signals.
- Mixing batch and streaming without a late-event strategy.
- Opaque cost: no budget guardrails for compute and storage.
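To go beyond a 'job failed' signal, a run can also report freshness and completeness. The sketch below is one possible shape for such a check; the `quality_signals` function, its thresholds, and the `event_time` field are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def quality_signals(rows, expected_count, now, max_lag, min_completeness=0.99):
    """Return freshness and completeness signals for one pipeline run."""
    newest = max((r["event_time"] for r in rows), default=None)
    lag = None if newest is None else now - newest
    completeness = len(rows) / expected_count if expected_count else 0.0
    return {
        "freshness_lag": lag,
        "fresh": lag is not None and lag <= max_lag,
        "completeness": completeness,
        "complete": completeness >= min_completeness,
    }


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"event_time": datetime(2024, 5, 1, 11, 50, tzinfo=timezone.utc)},
    {"event_time": datetime(2024, 5, 1, 11, 55, tzinfo=timezone.utc)},
]
signals = quality_signals(
    rows, expected_count=4, now=now, max_lag=timedelta(minutes=10)
)
print(signals)
```

In this example the job would count as successful, and the data is fresh, but only half of the expected rows arrived. A check like this turns a silent data loss into an alertable signal.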
References
Related chapters
- Streaming Data - Stream processing, windows, delivery semantics, and streaming-system layers.
- Kafka: The Definitive Guide, 2nd Edition (short summary) - Partitioned logs as a foundation for ingestion and stream processing.
- Kappa Architecture: stream-first alternative to Lambda - A stream-first approach where historical replay and backfill run through one event log.
- Big Data: Principles and best practices of scalable realtime data systems (short summary) - Lambda Architecture and the trade-offs between batch and speed layers.
- Why understand storage systems? - Choosing a storage model and understanding workload-specific trade-offs.
- Event-Driven Architecture - Asynchronous data flows, CQRS/Saga and integration patterns.
- Observability & Monitoring Design - Pipeline monitoring, alerting, and the operational improvement loop.
