System Design Space

Updated: March 2, 2026 at 9:20 PM

Data Pipeline / ETL / ELT Architecture

Level: mid

How to design a data pipeline: batch and streaming, ETL vs ELT, orchestration, data quality, recovery, cost control, and operational reliability.

Prerequisite

Streaming Data

A foundation for batch/stream thinking, delivery semantics, and processing guarantees.

Data Pipeline / ETL / ELT architecture is about designing a system that reliably and predictably turns raw events and records into useful data marts for analytics, ML, and product APIs. The key engineering challenge is not transformation alone but pipeline reliability: idempotency, replay, data quality, observability, cost governance, and failure recovery.

ETL vs ELT: how to choose

ETL

Transform before loading into the target storage.

When it suits

  • Strict data-quality requirements before data enters the DWH.
  • Limited compute resources in the target storage.
  • Downstream consumers need a predictable input data shape.

Risks

  • It is more difficult to reuse raw data for new cases.
  • Changing business logic often requires reprocessing the upstream layer.

ELT

Raw data is first loaded into storage/warehouse, transformed later.

When it suits

  • High ingestion speed and analytical flexibility are needed.
  • The team is actively experimenting with models and data marts.
  • There is a powerful compute layer in the DWH/Lakehouse.

Risks

  • Without governance and cost control it is easy to end up with expensive chaos.
  • The raw layer needs strict quality policies and access control.
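The difference is mainly where the transform happens relative to the load. A toy sketch of the two orderings, with plain dicts standing in for the warehouse (names and record shapes are illustrative):

```python
# Toy contrast of ETL vs ELT ordering; dicts stand in for real storage.

raw = [{"user": " Alice ", "spend": "10"}, {"user": "Bob", "spend": "x"}]

def clean(rows):
    """Trim names, parse spend, drop unparseable rows."""
    out = []
    for r in rows:
        try:
            out.append({"user": r["user"].strip(), "spend": float(r["spend"])})
        except ValueError:
            pass  # in ETL, bad rows never reach the warehouse
    return out

# ETL: transform first, load only clean rows -> no raw history in the warehouse.
etl_warehouse = {"clean": clean(raw)}

# ELT: load raw first, transform inside the warehouse -> raw stays replayable.
elt_warehouse = {"raw": raw}
elt_warehouse["clean"] = clean(elt_warehouse["raw"])

print("raw" in etl_warehouse, "raw" in elt_warehouse)  # False True
```

The ELT side pays for that replayability with storage and governance overhead on the raw layer, which is exactly the risk listed above.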

Reference data pipeline architecture

1. Ingestion: CDC, API pulls, events, file drops. Control schema drift and idempotency at this boundary.

2. Raw / Bronze: an immutable layer of raw data for replay and audits; minimal business logic.

3. Transform / Silver: cleaning, deduplication, timestamp normalization, enrichment, and key conformance.

4. Serving / Gold: domain data marts and aggregates for BI, ML, APIs, and operational workloads.

5. Orchestration + Quality: DAG scheduler, dependency graph, retries, SLA/SLO, data tests, lineage, and alerting.
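The five steps above can be sketched end to end as plain functions over in-memory stores. Record shapes and names here are illustrative, not tied to any framework:

```python
# Minimal sketch of the bronze/silver/gold flow; dicts stand in for storage.

def ingest(events):
    """Ingestion: accept raw records, each carrying a stable id for idempotency."""
    return [{"id": e["id"], "payload": e} for e in events]

def to_bronze(bronze_store, records):
    """Raw/Bronze: append-only landing; never mutate, only add new ids (replayable)."""
    for r in records:
        bronze_store.setdefault(r["id"], r["payload"])  # idempotent: dedup by id
    return bronze_store

def to_silver(bronze_store):
    """Transform/Silver: normalize timestamps, drop rows with no amount."""
    return [
        {"id": k, "ts": v["ts"].upper(), "amount": v["amount"]}
        for k, v in bronze_store.items()
        if v.get("amount") is not None
    ]

def to_gold(silver_rows):
    """Serving/Gold: a domain aggregate for BI/ML consumers."""
    return {"orders": len(silver_rows),
            "revenue": sum(r["amount"] for r in silver_rows)}

events = [
    {"id": "e1", "ts": "2026-03-01t10:00", "amount": 10.0},
    {"id": "e2", "ts": "2026-03-01t10:05", "amount": None},  # invalid, filtered in silver
    {"id": "e1", "ts": "2026-03-01t10:00", "amount": 10.0},  # duplicate, deduped in bronze
]
bronze = to_bronze({}, ingest(events))
gold = to_gold(to_silver(bronze))
print(gold)  # {'orders': 1, 'revenue': 10.0}
```

Note how the bronze layer absorbs the duplicate while silver absorbs the invalid row; that separation is what makes replay and audits cheap.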

Hybrid Lakehouse

Hybrid mode: the stream updates serving views quickly, while batch performs control recalculations and backfills for consistency.

Pros

  • Combines low latency and high accuracy.
  • Works well for incremental plus periodic full recalculation.
  • One raw layer for replay and both processing strategies.

Constraints

  • Most complex operating model.
  • Requires strict orchestration and cost-governance discipline.

Best for: large data platforms with mixed workloads.
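The hybrid pattern can be sketched as a fast incremental path plus a periodic exact recomputation from the shared raw layer. Stores are in-memory stand-ins and names are illustrative:

```python
# Hybrid lakehouse sketch: stream keeps the view fresh, batch keeps it correct.

raw_layer = []          # shared immutable raw layer (both paths read it)
serving = {"total": 0}  # gold view read by consumers

def stream_update(event):
    """Low-latency path: append to raw and incrementally bump the view."""
    raw_layer.append(event)
    serving["total"] += event["amount"]

def batch_recompute():
    """Control path: full recalculation from raw, fixing drift and duplicates."""
    seen, total = set(), 0.0
    for e in raw_layer:
        if e["id"] not in seen:  # batch dedups what the stream double-counted
            seen.add(e["id"])
            total += e["amount"]
    serving["total"] = total

stream_update({"id": "e1", "amount": 5.0})
stream_update({"id": "e1", "amount": 5.0})  # duplicate delivery inflates the view
assert serving["total"] == 10.0             # fast but temporarily wrong
batch_recompute()
print(serving["total"])  # 5.0 (batch restores consistency)
```

This is the trade named above: low latency from the stream, accuracy from the batch, at the price of running and reconciling both.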

Incoming Jobs

  • JOB-201 (batch): Orders DB → orders_daily
  • JOB-202 (stream): Payments Kafka → payments_rt
  • JOB-203 (batch): CRM API → crm_sync
  • JOB-204 (stream): Mobile Events → product_events
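A hypothetical dispatcher for the jobs listed above might route purely by mode; the handler names are assumptions, only the job metadata comes from the list:

```python
# Illustrative routing: batch jobs go to a scheduled DAG, stream jobs to a
# long-running consumer. Both land in the same raw layer; only the trigger differs.

jobs = [
    {"id": "JOB-201", "mode": "batch",  "source": "Orders DB",      "target": "orders_daily"},
    {"id": "JOB-202", "mode": "stream", "source": "Payments Kafka", "target": "payments_rt"},
    {"id": "JOB-203", "mode": "batch",  "source": "CRM API",        "target": "crm_sync"},
    {"id": "JOB-204", "mode": "stream", "source": "Mobile Events",  "target": "product_events"},
]

def route(job):
    """Pick an execution path by job mode (names are hypothetical)."""
    return "scheduled-dag" if job["mode"] == "batch" else "long-running-consumer"

plan = {j["id"]: route(j) for j in jobs}
print(plan["JOB-201"], plan["JOB-202"])  # scheduled-dag long-running-consumer
```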

Pipeline Engine

Batch and stream operate together over a shared raw layer.


Control Plane

Orchestration + Quality + Lineage + Cost

This loop is always active and determines pipeline reliability regardless of the workload profile.


Watch for long-term divergence between ingest/transform/serve rates.
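One way to quantify that divergence is a lag ratio with an alert threshold. A minimal sketch; the 5% budget is illustrative, not a recommendation:

```python
# Freshness/completeness check: alert when served records lag ingested ones
# beyond a tolerated backlog ratio.

def lag_ratio(ingested: int, served: int) -> float:
    """Fraction of ingested records not yet served."""
    return 0.0 if ingested == 0 else (ingested - served) / ingested

def freshness_alert(ingested: int, served: int, max_lag: float = 0.05) -> bool:
    """True when the backlog exceeds the budget (default 5%, illustrative)."""
    return lag_ratio(ingested, served) > max_lag

print(freshness_alert(1000, 990))  # False: 1% lag is within budget
print(freshness_alert(1000, 800))  # True: 20% lag signals divergence
```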

Data Contracts Checklist

  • Schema versioning
  • Freshness / completeness
  • Idempotent replay
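The first two checklist items could be enforced with a minimal validator. The contract shape and the one-hour staleness bound below are assumptions, not a standard format:

```python
# Minimal data-contract check: schema version + required fields + freshness.
from datetime import datetime, timedelta, timezone

CONTRACT = {
    "version": 2,
    "required": {"order_id", "amount", "event_ts"},
    "max_staleness": timedelta(hours=1),
}

def validate(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations; empty list means the record passes."""
    errors = []
    if record.get("schema_version") != contract["version"]:
        errors.append("schema version mismatch")
    missing = contract["required"] - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    ts = record.get("event_ts")
    if ts and datetime.now(timezone.utc) - ts > contract["max_staleness"]:
        errors.append("record too stale")
    return errors

fresh = {"schema_version": 2, "order_id": "o1", "amount": 5.0,
         "event_ts": datetime.now(timezone.utc)}
print(validate(fresh))  # []
```

Running such checks at the producer boundary turns contract breaches into explicit failures instead of silent downstream corruption.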

Related

Observability & Monitoring Design

How to build metrics, alerts and runbooks for production pipelines.


Reliability and operation

  • Exactly-once is rarely realistic end to end: prefer at-least-once delivery plus idempotent processing.
  • Run backfills through a separate path so they do not break the online SLA.
  • Each pipeline must have an owner, a runbook, and SLOs for freshness and completeness.
  • Store checkpoint/offset state in a fault-tolerant backend.
  • Define data contracts between producers and consumers, and version schemas.
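The first and fourth points above combine naturally: at-least-once redelivery made safe by an idempotency-key set, with the checkpoint advanced only after the side effect. In-memory structures stand in for a fault-tolerant backend:

```python
# At-least-once + idempotent processing sketch. A redelivered message is
# filtered by its idempotency key, so the side effect applies exactly once
# even though delivery is only at-least-once.

processed_ids = set()        # idempotency keys already applied
checkpoint = {"offset": -1}  # stand-in for durable offset storage
sink = []                    # the downstream side effect

def handle(offset: int, message: dict):
    if message["id"] not in processed_ids:  # dedup on redelivery
        sink.append(message["value"])       # side effect, applied once
        processed_ids.add(message["id"])
    checkpoint["offset"] = offset           # advance only after processing

stream = [(0, {"id": "m1", "value": 10}),
          (1, {"id": "m2", "value": 20}),
          (1, {"id": "m2", "value": 20})]   # redelivery after a retry
for off, msg in stream:
    handle(off, msg)
print(sum(sink), checkpoint["offset"])  # 30 1
```

In production the key set and checkpoint would live in the same fault-tolerant store, updated transactionally, so a crash between the two cannot desynchronize them.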

Common mistakes

  • One giant DAG for the entire company without domain boundaries.
  • Hidden business logic in SQL scripts without tests or code review.
  • No observability: only a 'job failed' signal, no data-quality signals.
  • Mixing batch and streaming without a late-arriving-events strategy.
  • Opaque cost: no budget guardrails on compute and storage.



© 2026 Alexander Polomodov