System Design Space

Updated: May 1, 2026 at 7:30 PM

Data Pipeline / ETL / ELT Architecture

medium

How to design data pipelines: ETL and ELT, batch and stream processing, orchestration, data quality, failure recovery, cost governance, and operational reliability.

A data pipeline does not break because the diagram has both batch and streaming paths. It breaks when nobody owns data quality, recovery, and the cost of moving data across layers.

In real engineering work, this chapter helps you choose between ETL and ELT based on platform maturity, where transformations run, compute cost, and how quality checks become part of normal operations.

In interviews and architecture reviews, it is especially useful when schema drift, data loss, freshness SLOs, and recovery need to be explained as architectural properties rather than future analytics-team problems.

Practical value of this chapter

Design in practice

Helps choose between ETL and ELT based on organizational constraints and data-platform maturity.

Decision quality

Provides a framework for transformation-placement decisions across cost, speed, and quality control.

Interview articulation

Supports clear explanation of ingestion, validation, lineage, and serving layers.

Risk and trade-offs

Makes schema drift, data loss, and freshness SLO risks explicit.

Base

Streaming Data

Foundation for batch and stream processing, delivery semantics, and data-processing layers.


Data Pipeline / ETL / ELT Architecture is about designing a reliable path from sources to useful marts for analytics, ML, and product APIs. The hard part is not just transformation: mature pipelines need orchestration, data contracts, quality checks, freshness SLOs, lineage, replay, backfill, observability, cost governance, and recovery after failures.

ETL vs ELT: how to choose

ETL

Transform data before loading it into the target store.

When it suits

  • Data quality must be checked before data reaches the warehouse.
  • The target store has limited storage resources.
  • The target system needs a predictable input schema.

Risks

  • Raw data is harder to reuse for new use cases.
  • Changing business logic often requires reprocessing the upstream layer.

ELT

Load raw data first, then transform it inside the warehouse or lakehouse.

When it suits

  • The platform needs fast ingestion and analytical flexibility.
  • The team actively experiments with models and marts.
  • The warehouse or lakehouse has a strong compute layer.

Risks

  • Without governance and cost controls, the raw layer can become expensive chaos.
  • Raw data needs strict quality and access policies.
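The difference between the two approaches can be sketched with a small example. This is a minimal illustration, not a production pattern: it uses Python's built-in `sqlite3` as a stand-in for the target store, and the table names (`orders_clean`, `orders_raw`, `orders_mart`), the sample rows, and the validation rule are all hypothetical.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount as text, currency).
SOURCE_ROWS = [
    ("o-1", "10.50", "usd"),
    ("o-2", "not-a-number", "usd"),  # bad record
    ("o-3", "7.25", "eur"),
]


def is_valid(amount_text: str) -> bool:
    """Return True when the amount parses as a number."""
    try:
        float(amount_text)
        return True
    except ValueError:
        return False


def run_etl(conn: sqlite3.Connection) -> None:
    """ETL: validate and transform in the pipeline, load only clean rows."""
    conn.execute(
        "CREATE TABLE orders_clean "
        "(order_id TEXT PRIMARY KEY, amount REAL, currency TEXT)"
    )
    clean = [
        (order_id, float(amount), currency.upper())
        for order_id, amount, currency in SOURCE_ROWS
        if is_valid(amount)
    ]
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", clean)


def run_elt(conn: sqlite3.Connection) -> None:
    """ELT: load raw rows untouched, then transform with SQL in the store."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id TEXT, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", SOURCE_ROWS)
    # The GLOB filter is a deliberately crude numeric check for this sketch.
    conn.execute(
        """
        CREATE TABLE orders_mart AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(currency) AS currency
        FROM orders_raw
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )
```

In the ETL variant the bad record never reaches the store, so it cannot be reused later. In the ELT variant the raw table keeps all three rows, and the mart can be rebuilt with different logic without going back to the source.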

Reference architecture for a data pipeline

1. Ingestion

CDC, API pulls, events, and file loads. The pipeline must detect schema drift and keep loads idempotent, so that a repeated load does not create duplicates.

2. Raw / Bronze

Immutable raw data for replay, backfill, and audit. Business logic stays minimal here.

3. Transform / Silver

Cleaning, deduplication, time normalization, enrichment, and key alignment.

4. Serving / Gold

Domain marts and aggregates for BI, ML, APIs, and operational workloads.

5. Orchestration + Quality

DAG scheduler, dependency graph, retries, SLA/SLO, data-quality checks, lineage, and alerting.
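How the orchestration layer ties the stages together can be sketched as a tiny DAG runner. This is a simplified illustration under stated assumptions, not a real scheduler: `run_dag`, the task names, and the in-memory `state` are hypothetical, and only the standard-library `graphlib.TopologicalSorter` is used to order tasks. It shows dependency ordering, retries, and a data-quality check acting as a gate before serving.

```python
from graphlib import TopologicalSorter

state = {"raw": [], "silver": [], "gold": None}
flaky = {"calls": 0}


def ingest():
    # Hypothetical source that fails once with a transient error.
    flaky["calls"] += 1
    if flaky["calls"] == 1:
        raise ConnectionError("transient source error")
    state["raw"] = [
        {"id": 1, "amount": 5.0},
        {"id": 1, "amount": 5.0},   # duplicate delivery
        {"id": 2, "amount": None},  # incomplete record
    ]


def transform():
    # Deduplicate by key; the last record for each id wins.
    by_id = {row["id"]: row for row in state["raw"]}
    state["silver"] = list(by_id.values())


def quality_check():
    if any(row["amount"] is None for row in state["silver"]):
        raise ValueError("null amount in silver layer")


def serve():
    state["gold"] = sum(row["amount"] for row in state["silver"])


TASKS = {"ingest": ingest, "transform": transform,
         "quality_check": quality_check, "serve": serve}
# Each task maps to the set of tasks it depends on.
DEPS = {"transform": {"ingest"}, "quality_check": {"transform"},
        "serve": {"quality_check"}}


def run_dag(tasks, deps, max_retries=2):
    """Run tasks in dependency order, retrying failures up to max_retries."""
    status = {}
    for name in TopologicalSorter(deps).static_order():
        if any(status.get(up) != "ok" for up in deps.get(name, ())):
            status[name] = "skipped"  # an upstream task did not succeed
            continue
        for _attempt in range(1 + max_retries):
            try:
                tasks[name]()
                status[name] = "ok"
                break
            except Exception:
                status[name] = "failed"
    return status


status = run_dag(TASKS, DEPS)
print(status)
```

Under these assumptions the ingest task succeeds on its second attempt, the quality check fails on every attempt because of the null amount, and the serve task is skipped, so nothing incomplete is published to the gold layer.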

Hybrid Lakehouse

Hybrid mode: the stream path updates the serving layer quickly, while the batch path runs periodic reconciliation recalculations and backfills to keep the results consistent.

Pros

  • Combines low latency and high accuracy.
  • Works well for incremental updates combined with periodic full recalculation.
  • One raw layer serves replay and both processing strategies.

Constraints

  • Most complex operating model.
  • Requires strict orchestration and cost-governance discipline.
Best for: Large data platforms with mixed workloads.
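The hybrid idea can be illustrated with a short sketch. It assumes a hypothetical in-memory raw log and serving store, and hypothetical event fields (`event_id`, `customer`, `amount`). The stream path applies each event immediately, and the batch path periodically recomputes totals from the shared raw layer and corrects any drift.

```python
from collections import defaultdict

raw_log = []                  # shared append-only raw layer
serving = defaultdict(float)  # low-latency serving totals per customer


def stream_update(event):
    """Stream path: land the event in raw, then update serving at once."""
    raw_log.append(event)
    serving[event["customer"]] += event["amount"]


def batch_recalculate():
    """Batch path: recompute totals from raw, deduplicating by event_id."""
    seen = set()
    totals = defaultdict(float)
    for event in raw_log:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        totals[event["customer"]] += event["amount"]
    return dict(totals)


def reconcile():
    """Overwrite serving with the batch result and report the drift."""
    truth = batch_recalculate()
    drift = {
        customer: serving.get(customer, 0.0) - total
        for customer, total in truth.items()
        if serving.get(customer, 0.0) != total
    }
    serving.clear()
    serving.update(truth)
    return drift


for event in [
    {"event_id": "e1", "customer": "alice", "amount": 10.0},
    {"event_id": "e2", "customer": "bob", "amount": 5.0},
    {"event_id": "e1", "customer": "alice", "amount": 10.0},  # redelivered
]:
    stream_update(event)

drift = reconcile()
print(drift, dict(serving))
```

In this sketch the stream path double-counts the redelivered event, and the batch recalculation over the raw layer restores the correct total. The design choice is that the stream path optimizes for latency while the batch path is the source of truth.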

Incoming jobs

  • JOB-201 (batch): Orders DB → orders_daily
  • JOB-202 (stream): Payments Kafka → payments_rt
  • JOB-203 (batch): CRM API → crm_sync
  • JOB-204 (stream): Mobile events → product_events

Pipeline engine

Batch and stream operate together over a shared raw layer.


Control plane

Orchestration + quality + lineage + cost

The control plane is always active and determines pipeline reliability regardless of which pipeline profile is chosen.

Watch for long-term divergence between ingest, transform, and serve rates: a gap that keeps growing means a downstream stage is falling behind.

Data contracts checklist

Schema versioning
Freshness / completeness
Idempotent replay
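The first two checklist items can be expressed as an automated check that runs before a batch is accepted. This is a minimal sketch under stated assumptions: the `Contract` class, `check_batch`, the field names, and the thresholds are hypothetical and do not come from any specific data-contract tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Contract:
    schema_version: int
    required_fields: dict     # field name -> expected Python type
    max_staleness: timedelta  # freshness SLO
    min_completeness: float   # share of rows with all required fields


def check_batch(contract, batch_version, rows, newest_event_time, now):
    """Return a list of contract violations (empty when the batch passes)."""
    violations = []
    if batch_version != contract.schema_version:
        violations.append(
            f"schema version {batch_version} != {contract.schema_version}"
        )
    if now - newest_event_time > contract.max_staleness:
        violations.append("freshness SLO violated")
    complete = sum(
        1
        for row in rows
        if all(isinstance(row.get(name), expected)
               for name, expected in contract.required_fields.items())
    )
    ratio = complete / len(rows) if rows else 0.0
    if ratio < contract.min_completeness:
        violations.append(f"completeness {ratio:.2f} below threshold")
    return violations


contract = Contract(
    schema_version=2,
    required_fields={"order_id": str, "amount": float},
    max_staleness=timedelta(hours=1),
    min_completeness=0.9,
)
now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "o-1", "amount": 10.5},
    {"order_id": "o-2", "amount": 3.0},
    {"order_id": "o-3", "amount": None},  # incomplete row
]
problems = check_batch(contract, 1, rows, now - timedelta(hours=3), now)
print(problems)
```

A check like this would typically run as a gate in the orchestration layer, so that a contract violation blocks publication instead of surfacing later in a dashboard.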

Related

Observability & Monitoring Design

How to build metrics, alerts and runbooks for production pipelines.


Reliability and operation

  • Exactly-once is not always realistic: design for at-least-once delivery plus idempotent processing.
  • Run backfill through a separate path so the live-flow SLA is protected.
  • Every pipeline needs an owner, a runbook, and SLOs for freshness and completeness.
  • Store checkpoint/offset state in a fault-tolerant backend.
  • Treat producer-consumer data contracts as versioned interfaces.

Common mistakes

  • One giant DAG for the entire company without domain boundaries.
  • Hidden business logic in SQL scripts without tests or code review.
  • Observability limited to 'job failed' without data-quality signals.
  • Mixing batch and streaming without a late-event strategy.
  • Opaque cost: no budget guardrails for compute and storage.
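One possible late-event strategy is a watermark with an allowed-lateness window, sketched below. This is an illustrative simplification rather than the behavior of any particular stream processor: the `WindowedCounter` class, the 60-second tumbling window, the 30-second allowed lateness, and the rule that the watermark equals the largest event time seen are all assumptions.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 30  # how long a window stays open past its end (assumed)


class WindowedCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> event count
        self.closed = {}                      # finalized window counts
        self.late = []                        # events routed to batch repair
        self.watermark = 0                    # largest event time seen

    def _is_expired(self, window_start):
        return window_start + WINDOW + ALLOWED_LATENESS <= self.watermark

    def on_event(self, event_time):
        window_start = event_time - event_time % WINDOW
        if self._is_expired(window_start):
            # Too late for the stream path: keep it for batch correction.
            self.late.append(event_time)
            return
        self.open_windows[window_start] += 1
        self.watermark = max(self.watermark, event_time)
        self._close_expired()

    def _close_expired(self):
        # sorted() returns a new list, so popping inside the loop is safe.
        for start in sorted(self.open_windows):
            if self._is_expired(start):
                self.closed[start] = self.open_windows.pop(start)


counter = WindowedCounter()
for event_time in [10, 50, 70, 40, 95, 20, 130]:
    counter.on_event(event_time)

print(counter.closed, counter.late, dict(counter.open_windows))
```

In this run the event at time 40 arrives out of order but within the allowed lateness, so it is still counted in its window. The event at time 20 arrives after its window has been finalized and goes to the late list, where a batch recalculation can pick it up.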
