Streaming Data (short summary) — System Design Space

Streaming becomes easier to reason about once you stop seeing it as Kafka plus a consumer and start thinking in ingestion, queue, analysis, event-time, and stream-state layers.

In engineering practice, this book helps design streaming pipelines with explicit ordering, state, late-data handling, and materialization boundaries so the architecture survives replay and backfill.

In interviews and architecture reviews, it is especially useful when you need to show the cost of stream processing: reprocessing work, the correctness impact of late events, and backfill pressure on service objectives.

Practical value of this chapter

Design in practice

Supports streaming-pipeline design with event time, ordering, and stateful processing.

Decision quality

Improves batch-versus-stream and materialization-boundary decisions.

Interview articulation

Enables clear discussion of offsets, windowing, and exactly-once limitations.

Risk and trade-offs

Focuses on late events, reprocessing work, and backfill impact on service objectives.

Source

Book review

Alexander Polomodov's original review on tellmeabout.tech, focused on the book's practical value.

Streaming Data: Understanding the Real-Time Pipeline

Author: Andrew Psaltis
Publisher: Manning Publications, 2017 (Russian edition: DMK Press, 2018)
Length: 216 pages

Andrew Psaltis on streaming architecture: message queues, delivery semantics, event time, windowing, stream state, and historical backfill.

The book is useful because it treats streaming as a full data path, not as a Kafka topic plus a consumer. It connects ingestion, buffering, processing, storage, serving interfaces, and downstream consumers into one system.

That framing keeps the important engineering questions visible: what guarantees the queue provides, where state lives, how late data is handled, how historical replay works, and what happens when consumers fall behind the incoming stream.

Streaming system architecture

Psaltis walks the data from the source to the final consumer through explicit layers. The split is not for a tidy diagram: once ingestion, messaging, processing, storage, and serving are pulled apart, a failure in one layer does not smear across the whole pipeline and is easier to localize.

Data ingestion

Author recommendation

Enterprise Integration Patterns

Classic integration-patterns book referenced by the author.

At the edge, the interaction model decides everything: synchronous request, acknowledged handoff, event publication, one-way delivery, or a continuous stream. That choice sets what you pay for downstream — response latency, the risk of a lost message, or the cost of handling duplicates.

Request-response — The client waits for a result and pays the latency of a synchronous call.

Request-acknowledge — The sender gets proof of receipt while processing may continue later.

Publish-subscribe — The producer publishes an event without knowing all downstream consumers.

One-way delivery — The message is sent without a reply, so reliability rules must be explicit.

Stream — Data is delivered continuously and processed as an event sequence.

Fault tolerance during ingestion

The author compares checkpoints and message logging. For streaming systems, message logging is often more practical: it lets the system recover after a failure and re-read the needed range of events.

RBML

Receiver-based message logging

SBML

Sender-based message logging

HML

Hybrid message logging
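
The recovery idea behind message logging can be sketched as an append-only log that is re-read from the last processed position. This is only an illustration, not the book's RBML, SBML, or HML design: the `MessageLog` class and its methods are hypothetical names, and an in-memory list stands in for durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Append-only log; a Python list stands in for durable storage (assumption)."""
    _entries: list = field(default_factory=list)

    def append(self, message):
        """Store the message and return the offset it was written at."""
        self._entries.append(message)
        return len(self._entries) - 1

    def replay(self, from_offset):
        """Yield (offset, message) pairs starting at from_offset."""
        for offset in range(from_offset, len(self._entries)):
            yield offset, self._entries[offset]


log = MessageLog()
for event in ["a", "b", "c", "d"]:
    log.append(event)

# Suppose the processor crashed after it finished offset 1.
last_processed = 1
recovered = list(log.replay(last_processed + 1))
```

After a restart, only the range past the last processed offset is re-read, which is what makes a log cheaper to recover from than recomputing everything.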

Message queue

The queue decouples data collection from analysis: producers write events at one rate, consumers read at another. The price of that buffer is consumer lag, which has to stay visible and bounded — otherwise a load spike quietly turns into stale data.
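
One way to keep lag visible is to compare, per partition, the newest offset in the log with the offset the consumer has committed. The sketch below assumes both are already available as plain dictionaries; the function names and the `max_lag` threshold are illustrative and do not come from the book or from any client library.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: newest log offset minus the consumer's committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag, max_lag):
    """Partitions whose lag exceeds the bound we are willing to tolerate."""
    return sorted(p for p, value in lag.items() if value > max_lag)


end_offsets = {0: 1500, 1: 900}        # hypothetical broker-side positions
committed_offsets = {0: 1480, 1: 200}  # hypothetical consumer-side positions

lag = consumer_lag(end_offsets, committed_offsets)  # {0: 20, 1: 700}
alerts = lagging_partitions(lag, max_lag=500)       # [1]
```

Lag measured in messages is easy to compute; translating it into time behind the head of the stream needs timestamps and is left out here.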

Message delivery semantics

At most once (relative cost: low)

A message is not duplicated, but it may be lost during failure.

At least once (relative cost: medium)

A message is not lost, but the handler must tolerate duplicates.

Exactly once (relative cost: high)

The side effect should happen once, which requires strict storage and idempotency guarantees.
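
A common way to approximate exactly-once effects on top of at-least-once delivery is an idempotent handler that remembers which message IDs it has already applied. The sketch below is a simplification under stated assumptions: `IdempotentHandler` is a hypothetical name, every message carries a unique ID, and the in-memory set stands in for a durable store that would have to be updated atomically with the side effect.

```python
class IdempotentHandler:
    """Applies each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()  # assumption: would be durable storage in practice
        self.balance = 0       # the side effect we want applied exactly once

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.balance += amount
        self.seen_ids.add(message_id)
        return True


handler = IdempotentHandler()
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]  # "m1" is redelivered
applied = [handler.handle(mid, amount) for mid, amount in deliveries]
```

The duplicate delivery of `m1` is detected and skipped, so the balance reflects each message once.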

Stream analysis

Related chapter

DDIA: Stream Processing

DDIA goes deeper into stream state, event time, materialized views, and consistency trade-offs.

The strongest part of the book is its take on data in motion: the event has not landed in its final store yet, but it already changes state, aggregates, alerts, and what the user sees. That is why a bug in stream processing shows up immediately and is far harder to roll back than a wrong batch export.

Processing technologies

Spark Streaming, Storm, Flink, Samza

Typical components

Application driver
Streaming manager
Stream processor
Data sources

What to check when choosing a system

Message delivery

Loss, duplicates, acknowledgements, and redelivery

State management

Local state, snapshots, and failure recovery

Fault tolerance

Restart, logging, and side-effect control

Limits of stream algorithms

• Single pass: the processor often has only one chance to make a decision for an event.
• Concept drift: data patterns change and the model gradually becomes stale.
• Resource limits: memory, CPU, and network capacity must keep up with the live stream, or the processor starts falling behind.
• Time: event time, processing time, and arrival order are different concepts.
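
The gap between event time and arrival order is often handled with a watermark: an estimate of how far event time has progressed, behind which events count as late. The sketch below assumes integer timestamps and a fixed `allowed_lateness`; the `WatermarkTracker` name and this particular heuristic (largest event time seen minus the allowed lateness) are illustrative choices, not the book's prescription.

```python
class WatermarkTracker:
    """Tracks a watermark as (largest event time seen) - (allowed lateness)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    def observe(self, event_time):
        """Return True if the event is on time, False if it is behind the watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness
        return event_time >= watermark


tracker = WatermarkTracker(allowed_lateness=10)
arrivals = [100, 105, 103, 120, 104, 118]  # event times in arrival order
on_time = [tracker.observe(t) for t in arrivals]
```

Here the event stamped 104 arrives after 120 has been seen, so it falls behind the watermark of 110 and is flagged as late; what to do with it (drop, side output, correction) is a separate design decision.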

Data windows and aggregation

Sliding window

Overlapping intervals that continuously move forward and provide a fresh aggregate.

Tumbling window

Non-overlapping fixed-size intervals that work well for regular reports and metrics.
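
The difference between the two window types can be sketched as window assignment by event time. The sketch assumes integer timestamps and window starts aligned to multiples of the window size (tumbling) or of the slide (sliding); the function names are illustrative.

```python
from collections import Counter


def tumbling_window(event_time, size):
    """The single [start, end) window of the given size that contains event_time."""
    start = (event_time // size) * size
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Every [start, end) window of the given size, stepping by slide, that contains event_time."""
    windows = []
    start = (event_time // slide) * slide  # latest aligned start <= event_time
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event_times = [1, 4, 11, 12, 25]
counts = Counter(tumbling_window(t, size=10) for t in event_times)
overlapping = sliding_windows(25, size=10, slide=5)  # [(20, 30), (25, 35)]
```

An event belongs to exactly one tumbling window but to `size / slide` sliding windows, which is why sliding aggregates cost more state for the same window size.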

How to summarize a stream without storing the full history

Random sampling

Representative subset of the stream for approximate analytics.

LogLog / MinCount

Approximate count of unique elements.

Count-Min Sketch

Frequency estimation with bounded memory usage.

Bloom filter

Fast check for possible membership in a set.
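
Two of the structures above can be sketched in a few lines. This is a minimal illustration with arbitrarily chosen sizes, assuming salted SHA-256 digests as the hash family; the class and helper names are hypothetical, and a production version would size the structures from a target error rate.

```python
import hashlib


def _bucket_indexes(item, count, modulus):
    """Derive `count` bucket indexes for an item from salted SHA-256 digests."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def add(self, item):
        for index in _bucket_indexes(item, self.num_hashes, self.size):
            self.bits[index] = 1

    def might_contain(self, item):
        return all(self.bits[i] for i in _bucket_indexes(item, self.num_hashes, self.size))


class CountMinSketch:
    """Frequency estimates that can overcount on collisions but never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, col in enumerate(_bucket_indexes(item, self.depth, self.width)):
            self.table[row][col] += count

    def estimate(self, item):
        return min(
            self.table[row][col]
            for row, col in enumerate(_bucket_indexes(item, self.depth, self.width))
        )


seen = BloomFilter()
seen.add("user-1")

frequencies = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    frequencies.add(word)
```

Both structures use fixed memory regardless of stream length; the price is a one-sided error, which is why `might_contain` is phrased as "possibly present" and `estimate` as an upper bound.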

Data storage

Long-term storage

• Direct write: the stream writes to the target store immediately, but may hit its throughput limit.
• Indirect write: data lands in an intermediate layer first and is loaded in batches later.
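
The indirect path can be sketched as a small buffer that hands records to the target store in batches. The `BatchingWriter` name, the size-only flush rule, and the list used as a stand-in sink are assumptions for illustration; a real intermediate layer would usually also flush on a timer and survive restarts.

```python
class BatchingWriter:
    """Buffers records and passes them to the sink in batches of batch_size."""

    def __init__(self, sink, batch_size):
        self.sink = sink  # callable that receives a list of records
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; a no-op when the buffer is empty."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []  # stands in for the target store
writer = BatchingWriter(batches.append, batch_size=3)
for record in range(7):
    writer.write(record)
writer.flush()  # push the final partial batch
```

Seven records reach the store as three writes instead of seven, at the cost of records sitting in the buffer until the batch fills or is flushed.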

In-memory storage

SQLite, RocksDB, LevelDB, Memcached, Redis, MemSQL, Aerospike, Apache Ignite

Caching strategies

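Read-through

On a miss, the cache loads the value from the backing store and returns it.

Refresh-ahead

The cache reloads frequently read entries before they expire.

Write-through

A write updates the cache and the backing store synchronously.

Write-around

Writes go directly to the backing store and bypass the cache; the cache is filled on a later read.

Write-behind

The cache accepts the write and persists it to the backing store later.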
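
Two of these strategies, read-through and write-behind, can be sketched as follows. The class names are hypothetical, a plain dictionary stands in for the backing store, and eviction, expiry, and failure handling are deliberately left out.

```python
class ReadThroughCache:
    """On a miss, loads the value via `load` and keeps it for later reads."""

    def __init__(self, load):
        self.load = load  # callable: key -> value from the backing store
        self.data = {}
        self.misses = 0

    def get(self, key):
        if key not in self.data:
            self.misses += 1
            self.data[key] = self.load(key)
        return self.data[key]


class WriteBehindCache:
    """Accepts writes immediately and persists them when flush() is called."""

    def __init__(self, store):
        self.store = store  # dict standing in for the backing store
        self.data = {}
        self.dirty = set()

    def put(self, key, value):
        self.data[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in sorted(self.dirty):
            self.store[key] = self.data[key]
        self.dirty.clear()


backing = {"a": 1}
reader = ReadThroughCache(backing.__getitem__)
first, second = reader.get("a"), reader.get("a")  # second read is a cache hit

writer = WriteBehindCache(backing)
writer.put("b", 2)
visible_before_flush = "b" in backing
writer.flush()
```

The write-behind path shows the trade-off directly: until `flush()` runs, the backing store does not contain the write, so a crash in that interval loses it.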

Access to processed data

Interaction patterns

Data synchronization
RPC / RMI
Simple messaging
Publish-subscribe

Delivery protocols

Webhooks, long polling, SSE, WebSocket

Protocol selection factors

Update frequency

Direction

Latency

Efficiency

Fault tolerance

Data consumers

Information applications

Dashboards, reports, visualization, and product analytics.

Third-party integrations

APIs, webhooks, synchronization, and event exchange.

Downstream processing

Additional computation paths that read the prepared stream.

Key questions for a streaming client

1. How does the consumer know it is falling behind the input stream?
2. What happens if that lag grows silently?
3. How can reads and processing scale without breaking ordering or delivery guarantees?

What to remember

“Brevity is the sister of talent” — A. P. Chekhov

The book is short, but still useful: it presents a streaming platform as a system with explicit layers, delivery guarantees, state, windows, storage, and consumers. The tools have changed since publication, but the engineering questions are the same: where state lives, how recovery works, how time is interpreted, and what happens when the stream outruns its processors.

Related chapters

Kafka: The Definitive Guide, 2nd Edition (short summary) - Hands-on focus on brokers, partitions, and delivery semantics as the foundation of streaming architecture.
Kappa Architecture: stream-first alternative to Lambda - A single streaming path for online processing and historical replay, extending the ideas from the book.
Data Pipeline / ETL / ELT Architecture - How to embed stream processing into an end-to-end data platform and team operating model.
Event-driven architecture: Event Sourcing, CQRS, Saga - Architectural context where event streams become the default integration mechanism across services.
Distributed message queue - System design case focused on throughput, ordering, durability, and behavior under peak load.
Designing Data-Intensive Applications, 2nd Edition (short summary) - Core foundation for stream processing, stateful computation, and consistency trade-offs in data-intensive systems.
Enterprise Integration Patterns (short summary) - Pattern language for reliable event and stream interactions between heterogeneous services.
Big Data: Principles and best practices of scalable realtime data systems (short summary) - Strategic perspective on realtime data-system architecture and the evolution of large-scale streaming platforms.
Data Mesh in Action (short summary) - Organizational layer for decomposing a streaming platform into data domains and federated governance.
Google Global Network: Evolution and Architectural Principles for the AI Era - Network foundation for high-throughput streams: latency budgets, cross-region links, and global-network resilience.

Where to find the book

Original

manning.com

Streaming Data: Understanding the Real-Time Pipeline

Translated

dmkpress.com

Потоковая обработка данных