System Design Space

Updated: March 2, 2026 at 9:20 PM

Data Pipeline / ETL / ELT Architecture

Level: mid

How to design a data pipeline: batch and streaming, ETL vs ELT, orchestration, data quality, recovery, cost control, and operational reliability.

Prerequisite

Streaming Data

A foundation for batch/stream thinking, delivery semantics, and processing guarantees.

Data Pipeline / ETL / ELT architecture is about designing a system that reliably and predictably turns raw events and records into useful data marts for analytics, ML, and product APIs. The key engineering challenge is not transformation alone but pipeline reliability: idempotency, replay, data quality, observability, cost governance, and failure recovery.

ETL vs ELT: how to choose

ETL

Transform before loading into the target storage.

When it suits

  • Strict data-quality requirements before data enters the DWH.
  • Limited compute resources in the target storage.
  • Downstream consumers need a predictable input data shape.

Risks

  • It is more difficult to reuse raw data for new cases.
  • Changing business logic often requires reprocessing the upstream layer.

ELT

Raw data is first loaded into storage/warehouse, transformed later.

When it suits

  • High ingestion speed and analytical flexibility are needed.
  • The team is actively experimenting with models and data marts.
  • There is a powerful compute layer in the DWH/Lakehouse.

Risks

  • Without governance and cost control it is easy to end up with expensive chaos.
  • The raw layer needs strict quality policies and access control.
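The difference is mainly where the transform happens relative to the load. A toy sketch of the two orderings, with plain dicts standing in for the warehouse (names and record shapes are illustrative):

```python
# Toy contrast of ETL vs ELT ordering; dicts stand in for real storage.

raw = [{"user": " Alice ", "spend": "10"}, {"user": "Bob", "spend": "x"}]

def clean(rows):
    """Trim names, parse spend, drop unparseable rows."""
    out = []
    for r in rows:
        try:
            out.append({"user": r["user"].strip(), "spend": float(r["spend"])})
        except ValueError:
            pass  # in ETL, bad rows never reach the warehouse
    return out

# ETL: transform first, load only clean rows -> no raw history in the warehouse.
etl_warehouse = {"clean": clean(raw)}

# ELT: load raw first, transform inside the warehouse -> raw stays replayable.
elt_warehouse = {"raw": raw}
elt_warehouse["clean"] = clean(elt_warehouse["raw"])

print("raw" in etl_warehouse, "raw" in elt_warehouse)  # False True
```

The ELT side pays for that replayability with storage and governance overhead on the raw layer, which is exactly the risk listed above.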

Reference data pipeline architecture

1. Ingestion: CDC, API pulls, events, file drops. Control schema drift and idempotency at this boundary.

2. Raw / Bronze: an immutable layer of raw data for replay and audits; minimal business logic.

3. Transform / Silver: cleaning, deduplication, timestamp normalization, enrichment, and key conformance.

4. Serving / Gold: domain data marts and aggregates for BI, ML, APIs, and operational workloads.

5. Orchestration + Quality: DAG scheduler, dependency graph, retries, SLA/SLO, data tests, lineage, and alerting.
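The five steps above can be sketched end to end as plain functions over in-memory stores. Record shapes and names here are illustrative, not tied to any framework:

```python
# Minimal sketch of the bronze/silver/gold flow; dicts stand in for storage.

def ingest(events):
    """Ingestion: accept raw records, each carrying a stable id for idempotency."""
    return [{"id": e["id"], "payload": e} for e in events]

def to_bronze(bronze_store, records):
    """Raw/Bronze: append-only landing; never mutate, only add new ids (replayable)."""
    for r in records:
        bronze_store.setdefault(r["id"], r["payload"])  # idempotent: dedup by id
    return bronze_store

def to_silver(bronze_store):
    """Transform/Silver: normalize timestamps, drop rows with no amount."""
    return [
        {"id": k, "ts": v["ts"].upper(), "amount": v["amount"]}
        for k, v in bronze_store.items()
        if v.get("amount") is not None
    ]

def to_gold(silver_rows):
    """Serving/Gold: a domain aggregate for BI/ML consumers."""
    return {"orders": len(silver_rows),
            "revenue": sum(r["amount"] for r in silver_rows)}

events = [
    {"id": "e1", "ts": "2026-03-01t10:00", "amount": 10.0},
    {"id": "e2", "ts": "2026-03-01t10:05", "amount": None},  # invalid, filtered in silver
    {"id": "e1", "ts": "2026-03-01t10:00", "amount": 10.0},  # duplicate, deduped in bronze
]
bronze = to_bronze({}, ingest(events))
gold = to_gold(to_silver(bronze))
print(gold)  # {'orders': 1, 'revenue': 10.0}
```

Note how the bronze layer absorbs the duplicate while silver absorbs the invalid row; that separation is what makes replay and audits cheap.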

Hybrid Lakehouse

Hybrid mode: the stream updates serving views quickly, while batch performs control recalculations and backfills for consistency.

Pros

  • Combines low latency and high accuracy.
  • Works well for incremental plus periodic full recalculation.
  • One raw layer for replay and both processing strategies.

Constraints

  • Most complex operating model.
  • Requires strict orchestration and cost-governance discipline.

Best for: large data platforms with mixed workloads.
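The hybrid pattern can be sketched as a fast incremental path plus a periodic exact recomputation from the shared raw layer. Stores are in-memory stand-ins and names are illustrative:

```python
# Hybrid lakehouse sketch: stream keeps the view fresh, batch keeps it correct.

raw_layer = []          # shared immutable raw layer (both paths read it)
serving = {"total": 0}  # gold view read by consumers

def stream_update(event):
    """Low-latency path: append to raw and incrementally bump the view."""
    raw_layer.append(event)
    serving["total"] += event["amount"]

def batch_recompute():
    """Control path: full recalculation from raw, fixing drift and duplicates."""
    seen, total = set(), 0.0
    for e in raw_layer:
        if e["id"] not in seen:  # batch dedups what the stream double-counted
            seen.add(e["id"])
            total += e["amount"]
    serving["total"] = total

stream_update({"id": "e1", "amount": 5.0})
stream_update({"id": "e1", "amount": 5.0})  # duplicate delivery inflates the view
assert serving["total"] == 10.0             # fast but temporarily wrong
batch_recompute()
print(serving["total"])  # 5.0 (batch restores consistency)
```

This is the trade named above: low latency from the stream, accuracy from the batch, at the price of running and reconciling both.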

Incoming Jobs

  • JOB-201 (batch): Orders DB → orders_daily
  • JOB-202 (stream): Payments Kafka → payments_rt
  • JOB-203 (batch): CRM API → crm_sync
  • JOB-204 (stream): Mobile Events → product_events
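A hypothetical dispatcher for the jobs listed above might route purely by mode; the handler names are assumptions, only the job metadata comes from the list:

```python
# Illustrative routing: batch jobs go to a scheduled DAG, stream jobs to a
# long-running consumer. Both land in the same raw layer; only the trigger differs.

jobs = [
    {"id": "JOB-201", "mode": "batch",  "source": "Orders DB",      "target": "orders_daily"},
    {"id": "JOB-202", "mode": "stream", "source": "Payments Kafka", "target": "payments_rt"},
    {"id": "JOB-203", "mode": "batch",  "source": "CRM API",        "target": "crm_sync"},
    {"id": "JOB-204", "mode": "stream", "source": "Mobile Events",  "target": "product_events"},
]

def route(job):
    """Pick an execution path by job mode (names are hypothetical)."""
    return "scheduled-dag" if job["mode"] == "batch" else "long-running-consumer"

plan = {j["id"]: route(j) for j in jobs}
print(plan["JOB-201"], plan["JOB-202"])  # scheduled-dag long-running-consumer
```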

Pipeline Engine

Batch and stream operate together over a shared raw layer.


Control Plane

Orchestration + Quality + Lineage + Cost

This loop is always active and determines pipeline reliability regardless of the workload profile.


Watch for long-term divergence between ingest/transform/serve rates.
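One way to quantify that divergence is a lag ratio with an alert threshold. A minimal sketch; the 5% budget is illustrative, not a recommendation:

```python
# Freshness/completeness check: alert when served records lag ingested ones
# beyond a tolerated backlog ratio.

def lag_ratio(ingested: int, served: int) -> float:
    """Fraction of ingested records not yet served."""
    return 0.0 if ingested == 0 else (ingested - served) / ingested

def freshness_alert(ingested: int, served: int, max_lag: float = 0.05) -> bool:
    """True when the backlog exceeds the budget (default 5%, illustrative)."""
    return lag_ratio(ingested, served) > max_lag

print(freshness_alert(1000, 990))  # False: 1% lag is within budget
print(freshness_alert(1000, 800))  # True: 20% lag signals divergence
```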

Data Contracts Checklist

  • Schema versioning
  • Freshness / completeness
  • Idempotent replay
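The first two checklist items could be enforced with a minimal validator. The contract shape and the one-hour staleness bound below are assumptions, not a standard format:

```python
# Minimal data-contract check: schema version + required fields + freshness.
from datetime import datetime, timedelta, timezone

CONTRACT = {
    "version": 2,
    "required": {"order_id", "amount", "event_ts"},
    "max_staleness": timedelta(hours=1),
}

def validate(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations; empty list means the record passes."""
    errors = []
    if record.get("schema_version") != contract["version"]:
        errors.append("schema version mismatch")
    missing = contract["required"] - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    ts = record.get("event_ts")
    if ts and datetime.now(timezone.utc) - ts > contract["max_staleness"]:
        errors.append("record too stale")
    return errors

fresh = {"schema_version": 2, "order_id": "o1", "amount": 5.0,
         "event_ts": datetime.now(timezone.utc)}
print(validate(fresh))  # []
```

Running such checks at the producer boundary turns contract breaches into explicit failures instead of silent downstream corruption.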

Related

Observability & Monitoring Design

How to build metrics, alerts and runbooks for production pipelines.


Reliability and operation

  • Exactly-once is rarely realistic end to end: prefer at-least-once delivery plus idempotent processing.
  • Run backfills through a separate path so they do not break the online SLA.
  • Each pipeline must have an owner, a runbook, and SLOs for freshness and completeness.
  • Store checkpoint/offset state in a fault-tolerant backend.
  • Define data contracts between producers and consumers, and version schemas.
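The first and fourth points above combine naturally: at-least-once redelivery made safe by an idempotency-key set, with the checkpoint advanced only after the side effect. In-memory structures stand in for a fault-tolerant backend:

```python
# At-least-once + idempotent processing sketch. A redelivered message is
# filtered by its idempotency key, so the side effect applies exactly once
# even though delivery is only at-least-once.

processed_ids = set()        # idempotency keys already applied
checkpoint = {"offset": -1}  # stand-in for durable offset storage
sink = []                    # the downstream side effect

def handle(offset: int, message: dict):
    if message["id"] not in processed_ids:  # dedup on redelivery
        sink.append(message["value"])       # side effect, applied once
        processed_ids.add(message["id"])
    checkpoint["offset"] = offset           # advance only after processing

stream = [(0, {"id": "m1", "value": 10}),
          (1, {"id": "m2", "value": 20}),
          (1, {"id": "m2", "value": 20})]   # redelivery after a retry
for off, msg in stream:
    handle(off, msg)
print(sum(sink), checkpoint["offset"])  # 30 1
```

In production the key set and checkpoint would live in the same fault-tolerant store, updated transactionally, so a crash between the two cannot desynchronize them.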

Common mistakes

  • One giant DAG for the entire company without domain boundaries.
  • Hidden business logic in SQL scripts without tests or code review.
  • No observability: only a 'job failed' signal, no data-quality signals.
  • Mixing batch and streaming without a late-arriving-events strategy.
  • Opaque cost: no budget guardrails on compute and storage.



© 2026 Alexander Polomodov