Base
Streaming Data
Foundations: batch vs. stream thinking, delivery semantics and data-processing layers.
Data Pipeline / ETL / ELT Architecture is about designing a loop that reliably and predictably turns raw events and records into useful data marts for analytics, ML and product APIs. The key engineering challenge is not transformation alone, but pipeline reliability: idempotency, replay, data quality, observability, cost governance and failure recovery.
ETL vs ELT: how to choose
ETL
Transform before loading into the target storage.
When it suits
- Strict requirements for data quality before entering the DWH.
- Limited target storage resources.
- A predictable input-data shape is needed.
Risks
- It is more difficult to reuse raw data for new cases.
- Changing business logic often requires reprocessing the upstream layer.
ELT
Raw data is first loaded into storage/warehouse, transformed later.
When it suits
- We need high ingestion speed and analytical flexibility.
- The team is actively experimenting with models and data marts.
- There is a powerful compute layer in DWH/Lakehouse.
Risks
- Without governance and cost-control it is easy to get expensive chaos.
- We need strict quality policies and access control to the raw layer.
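The ordering difference between the two approaches can be shown in a toy sketch. The `transform` function and the in-memory "warehouse" lists are stand-ins, not a real engine:

```python
# Toy contrast of ETL vs ELT ordering over the same records.
def transform(rec):
    # Illustrative business transformation: cents -> dollars.
    return {**rec, "amount_usd": rec["amount_cents"] / 100}

raw = [{"id": 1, "amount_cents": 250}]

# ETL: transform first, load only the cleaned shape into the warehouse.
etl_warehouse = [transform(r) for r in raw]

# ELT: land raw rows as-is, transform later inside the warehouse.
elt_raw_layer = list(raw)
elt_warehouse = [transform(r) for r in elt_raw_layer]

# Same final marts, but ELT keeps the raw layer for replay and new use cases.
print(etl_warehouse == elt_warehouse, len(elt_raw_layer))  # → True 1
```

The point of the sketch: the marts end up identical, but only ELT retains the raw rows, which is exactly what makes later re-modeling cheap and what makes governance of the raw layer necessary.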
Reference architecture data pipeline
Ingestion
CDC, API pull, events, file downloads. It is important to control schema drift and idempotency.
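A minimal sketch of idempotent landing with basic schema-drift quarantine; the expected schema, record shape and dead-letter handling are illustrative assumptions:

```python
# Land each record at most once; quarantine records whose fields drift.
EXPECTED_SCHEMA = {"id", "ts", "amount"}  # assumed contract for this source

def ingest(records, landed, seen_ids):
    drifted = []
    for rec in records:
        if set(rec) != EXPECTED_SCHEMA:
            drifted.append(rec)       # route to a quarantine / dead-letter area
            continue
        if rec["id"] in seen_ids:     # idempotency: redelivered event, skip
            continue
        seen_ids.add(rec["id"])
        landed.append(rec)
    return drifted

landed, seen = [], set()
batch = [
    {"id": 1, "ts": "2024-01-01T00:00:00Z", "amount": 10},
    {"id": 1, "ts": "2024-01-01T00:00:00Z", "amount": 10},  # duplicate delivery
    {"id": 2, "ts": "2024-01-01T00:01:00Z"},                # schema drift: missing field
]
quarantined = ingest(batch, landed, seen)
print(len(landed), len(quarantined))  # → 1 1
```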
Raw / Bronze
Immutable layer with raw data for replay and audits. Minimum business logic.
Transform / Silver
Cleaning, deduplication, timestamp standardization, enrichment and key reconciliation across sources.
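A sketch of a Silver-layer transform: deduplicate by business key (latest version wins) and normalize timestamps to UTC ISO-8601. Field names are assumptions:

```python
from datetime import datetime, timezone

def to_silver(bronze_rows):
    latest = {}
    for row in bronze_rows:
        # Standardize time: epoch seconds -> UTC ISO-8601 string.
        ts = datetime.fromtimestamp(row["ts_epoch"], tz=timezone.utc)
        clean = {"order_id": row["order_id"], "ts": ts.isoformat(),
                 "amount": row["amount"]}
        prev = latest.get(row["order_id"])
        if prev is None or clean["ts"] > prev["ts"]:
            latest[row["order_id"]] = clean  # keep newest version per key
    return list(latest.values())

rows = [
    {"order_id": "A", "ts_epoch": 1700000000, "amount": 5},
    {"order_id": "A", "ts_epoch": 1700000100, "amount": 7},  # later correction wins
]
silver = to_silver(rows)
print(silver[0]["amount"])  # → 7
```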
Serving / Gold
Domain data marts and aggregates for BI, ML, API and operational workloads.
Orchestration + Quality
DAG scheduler, dependency graph, retries, SLA/SLO, data tests, lineage and alerting.
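The retry part of that control loop can be sketched as a simple wrapper an orchestrator applies per task; the flaky task and attempt count are illustrative, not a real scheduler API:

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.0):
    """Retry a task with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise                 # retries exhausted -> alert / SLA breach
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "ok"

result = run_with_retries(flaky_extract)
print(result, calls["n"])  # → ok 3
```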
Hybrid Lakehouse
Hybrid mode: the streaming path updates serving quickly, while batch runs control recalculations and backfills for consistency.
Pros
- Combines low latency and high accuracy.
- Works well for incremental plus periodic full recalculation.
- One raw layer for replay and both processing strategies.
Constraints
- Most complex operating model.
- Requires strict orchestration and cost-governance discipline.
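The hybrid pattern can be sketched over a shared raw layer: stream increments update serving fast but may over-count under at-least-once delivery, and a later batch recomputation deduplicates from raw and overwrites the result. The structures and event shape are illustrative:

```python
raw = []          # shared immutable raw layer (append-only)
serving = {}      # per-key aggregate served to consumers

def stream_update(event):
    raw.append(event)
    # Fast path: at-least-once delivery may double-count redelivered events.
    serving[event["key"]] = serving.get(event["key"], 0) + event["value"]

def batch_backfill():
    # Control recalculation: full recompute from raw, deduplicated by event id.
    seen, exact = set(), {}
    for e in raw:
        if e["id"] in seen:
            continue
        seen.add(e["id"])
        exact[e["key"]] = exact.get(e["key"], 0) + e["value"]
    serving.clear()
    serving.update(exact)             # batch result wins

stream_update({"id": 1, "key": "k", "value": 5})
stream_update({"id": 1, "key": "k", "value": 5})  # redelivered event
print(serving["k"])  # → 10 (fast but inflated)
batch_backfill()
print(serving["k"])  # → 5 (corrected)
```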
Ingestion (CDC / API / events) → Raw/Bronze (immutable landing zone) → Transform (batch + stream) → Serving/Gold (BI, ML, APIs), governed by a Control Plane (orchestration + quality + lineage + cost). This control loop is always active and determines pipeline reliability regardless of the processing profile.
Watch for long-term divergence between ingest/transform/serve rates.
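A divergence check of this kind can be sketched as a simple ratio test over per-stage counters; the stage names and threshold are assumptions:

```python
def divergence_alerts(counters, max_lag_ratio=0.1):
    """Flag any stage that lags its upstream stage by more than the threshold."""
    alerts = []
    stages = ["ingested", "landed", "transformed", "served"]
    for up, down in zip(stages, stages[1:]):
        if counters[up] == 0:
            continue
        lag = (counters[up] - counters[down]) / counters[up]
        if lag > max_lag_ratio:
            alerts.append(f"{down} lags {up} by {lag:.0%}")
    return alerts

alerts = divergence_alerts(
    {"ingested": 100, "landed": 98, "transformed": 60, "served": 58}
)
print(alerts)  # one alert: the transform stage is falling behind landing
```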
Data Contracts Checklist
Related
Observability & Monitoring Design
How to build metrics, alerts and runbooks for production pipelines.
Reliability and operation
- Exactly-once is not always realistic: use at-least-once + idempotent processing.
- Run backfills through a dedicated path and resource pool so they do not break the online SLA.
- Each pipeline must have an owner, runbook and SLO for freshness/completeness.
- Store checkpoint/offset state in a fault-tolerant backend.
- Define data contracts between producers and consumers, and version your schemas.
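The first and fourth points combine into one pattern: commit the offset only after the write, and make the write an idempotent upsert keyed on event id, so replays after a crash are harmless. Event shape and names are illustrative:

```python
# At-least-once consumption with an idempotent sink.
events = [("e1", 10), ("e1", 10), ("e2", 20)]  # e1 redelivered after a retry
sink, committed_offset = {}, 0

for offset, (event_id, value) in enumerate(events):
    # Idempotent upsert: a replayed event overwrites with the same value.
    sink[event_id] = value
    # Commit the offset only after the write; a crash before this line
    # replays the event, which the upsert absorbs.
    committed_offset = offset + 1

print(len(sink), committed_offset)  # → 2 3
```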
Common mistakes
- One giant DAG for the entire company, with no domain boundaries.
- Hidden business logic in SQL scripts without tests or code review.
- No observability: only a 'job failed' signal, no data-quality signals.
- Mixing batch and streaming without a strategy for late-arriving events.
- Opaque cost: no budget guardrails on compute and storage.
