Streaming becomes easier to reason about once you stop seeing it as Kafka plus a consumer and start thinking in terms of ingestion, queue, analysis, event-time, and stream-state layers.
In engineering practice, this book helps you design streaming pipelines with explicit ordering, state, late-data handling, and materialization boundaries so the architecture survives replay and backfill.
In interviews and architecture reviews, it is especially useful when you need to show the cost of stream processing: reprocessing work, the correctness impact of late events, and backfill pressure on service objectives.
Practical value of this chapter
Design in practice
Supports streaming-pipeline design with event time, ordering, and stateful processing.
Decision quality
Improves batch-versus-stream and materialization-boundary decisions.
Interview articulation
Enables clear discussion of offsets, windowing, and exactly-once limitations.
Risk and trade-offs
Focuses on late events, reprocessing work, and backfill impact on service objectives.
Source
Book review
Alexander Polomodov's original review on tellmeabout.tech, focused on the book's practical value.
Streaming Data: Understanding the Real-Time Pipeline
Author: Andrew Psaltis
Publisher: Manning Publications, 2017 (Russian edition: DMK Press, 2018)
Length: 216 pages
Andrew Psaltis on streaming architecture: message queues, delivery semantics, event time, windowing, stream state, and historical backfill.
The book is useful because it treats streaming as a full data path, not as a Kafka topic plus a consumer. It connects ingestion, buffering, processing, storage, serving interfaces, and downstream consumers into one system.
That framing keeps the important engineering questions visible: what guarantees the queue provides, where state lives, how late data is handled, how historical replay works, and what happens when consumers fall behind the incoming stream.
Streaming system architecture
Psaltis walks through the data path from the source to the final consumer. The model is valuable because it separates responsibility across ingestion, messaging, processing, storage, and serving layers.
Related topic
Kafka: The Definitive Guide
Deep dive into event logs, partitions, consumer groups, and Kafka operations.
- Data ingestion: collecting events from applications, devices, logs, and external systems.
- Message queue: buffering, routing, and decoupling writers from readers.
- Stream analysis: continuous computation, filtering, enrichment, and aggregation.
- Data storage: fast access to fresh state and intermediate results.
- Access to processed data: APIs, subscriptions, and delivery protocols for processed data.
- Data consumers: dashboards, services, integrations, and downstream processing paths.
Data ingestion
Author recommendation
Enterprise Integration Patterns
Classic integration-patterns book referenced by the author.
At the edge of a streaming system, the first decision is the interaction model: synchronous request, acknowledged handoff, event publication, one-way delivery, or a continuous stream.
Fault tolerance during ingestion
The author compares checkpoints and message logging. For streaming systems, message logging is often more practical: it lets the system recover after a failure and re-read the needed range of events.
Receiver-based message logging
Sender-based message logging
Hybrid message logging
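The recovery idea behind message logging can be sketched in a few lines. This is a minimal illustration of the receiver-based variant, not the book's implementation: `MessageLog`, `Receiver`, and `next_offset` are hypothetical names, and the log lives in memory, whereas a real system would persist it durably.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Append-only in-memory log (assumption: a real log would be durable)."""
    entries: list = field(default_factory=list)

    def append(self, event) -> int:
        """Store the event and return its offset (its position in the log)."""
        self.entries.append(event)
        return len(self.entries) - 1

    def read_from(self, offset: int) -> list:
        """Return every event at or after the given offset."""
        return self.entries[offset:]


@dataclass
class Receiver:
    """Hypothetical receiver that logs each event before processing it."""
    log: MessageLog
    processed: list = field(default_factory=list)
    next_offset: int = 0  # first offset that has not been processed yet

    def receive(self, event) -> None:
        self.log.append(event)

    def process_pending(self, fail_at=None) -> None:
        """Process logged events in order; optionally simulate a crash."""
        for event in self.log.read_from(self.next_offset):
            if event == fail_at:
                raise RuntimeError(f"simulated crash on {event!r}")
            self.processed.append(event)
            self.next_offset += 1


log = MessageLog()
receiver = Receiver(log)
for e in ["a", "b", "c", "d"]:
    receiver.receive(e)

try:
    receiver.process_pending(fail_at="c")  # crashes after "a" and "b"
except RuntimeError:
    pass

# The log still holds all four events, so recovery re-reads only the
# unprocessed range starting at next_offset.
receiver.process_pending()
```

Because the event is logged before it is processed, a crash loses no input: the receiver resumes from the last offset it finished.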
Message queue
The queue decouples data collection from analysis. Producers can write events at one rate while consumers read and process them at another.
Message delivery semantics
At-most-once: a message is not duplicated, but it may be lost during failure.
At-least-once: a message is not lost, but the handler must tolerate duplicates.
Exactly-once: the side effect should happen once, which requires strict storage and idempotency guarantees.
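One common way to live with redelivery is to make the handler idempotent, so a duplicate message does not change the result. The sketch below assumes every message carries a unique ID; `IdempotentCounter` and the sample message IDs are hypothetical. In a real system the set of seen IDs and the state update would have to be committed atomically, which this in-memory example does not show.

```python
class IdempotentCounter:
    """Applies each message at most once, keyed by its message ID."""

    def __init__(self):
        self.total = 0
        self.seen_ids = set()

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# Redelivery after a failure: "m1" and "m2" arrive twice.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10), ("m3", 1), ("m2", 5)]
counter = IdempotentCounter()
applied = [counter.handle(mid, amt) for mid, amt in deliveries]
# counter.total is 16: each distinct message counted once
```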
Stream analysis
Related chapter
DDIA: Stream Processing
DDIA goes deeper into stream state, event time, materialized views, and consistency trade-offs.
The strongest part of the book is its explanation of data in motion: the event has not yet landed in its final store, but it can already change state, aggregates, alerts, and user-facing interfaces.
Processing technologies
Typical components
- Application driver
- Streaming manager
- Stream processor
- Data sources
What to check when choosing a system
Message delivery
Loss, duplicates, acknowledgements, and redelivery
State management
Local state, snapshots, and failure recovery
Fault tolerance
Restart, logging, and side-effect control
Limits of stream algorithms
- Single pass: the processor often has only one chance to make a decision for an event.
- Concept drift: data patterns change and the model gradually becomes stale.
- Resource limits: memory, CPU, and network capacity must keep up with the live stream.
- Time: event time, processing time, and arrival order are different concepts.
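The gap between event time and arrival order shows up as soon as you decide which events count as late. The sketch below uses a simple bounded-lateness watermark, a technique common in stream processors, which this summary does not attribute to the book. `classify_late`, `allowed_lateness`, and the sample timestamps are hypothetical.

```python
def classify_late(events, allowed_lateness: int) -> list:
    """Label events as on time or late.

    events: iterable of (event_time, payload) given in arrival order.
    The watermark trails the largest event time seen so far by
    `allowed_lateness`; anything older than the watermark is late.
    """
    max_event_time = None
    result = []
    for event_time, payload in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        status = "late" if event_time < watermark else "on_time"
        result.append((payload, status))
    return result


# Arrival order differs from event-time order: "c" and "e" arrive out of order.
arrivals = [(100, "a"), (105, "b"), (103, "c"), (120, "d"), (104, "e")]
statuses = classify_late(arrivals, allowed_lateness=10)
# "c" is out of order but within the allowed lateness; "e" is late because
# the watermark has already advanced to 110.
```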
Data windows and aggregation
Sliding window
Overlapping intervals that continuously move forward and provide a fresh aggregate.
Tumbling window
Non-overlapping fixed-size intervals that work well for regular reports and metrics.
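The difference between the two window types is easiest to see in how an event is assigned to windows. This is a minimal sketch, assuming integer event-time timestamps in seconds and a sliding window that advances in fixed steps (often called a hopping window). The function names and sample events are illustrative.

```python
from collections import Counter


def tumbling_window(ts: int, size: int) -> tuple:
    """Return the single [start, end) window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list:
    """Return every [start, end) window of length `size`, starting at
    multiples of `slide`, that contains ts."""
    windows = []
    start = ts - ts % slide  # latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


events = [(12, "a"), (17, "b"), (21, "c"), (29, "d"), (30, "e")]

# Tumbling: each event lands in exactly one window.
tumbling_counts = Counter(tumbling_window(ts, size=10) for ts, _ in events)

# Sliding: with size=10 and slide=5, each event lands in two windows.
sliding_counts = Counter(
    w for ts, _ in events for w in sliding_windows(ts, size=10, slide=5)
)
```

With these inputs the tumbling counts are 2, 2, and 1 for the windows starting at 10, 20, and 30. The sliding counts cover six overlapping windows, and each event contributes to two of them.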
How to summarize a stream without storing the full history
Random sampling
Representative subset of the stream for approximate analytics.
LogLog / MinCount
Approximate count of unique elements.
Count-Min Sketch
Frequency estimation with bounded memory usage.
Bloom filter
Fast check for possible membership in a set.
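Two of these summaries fit in a short sketch. The code below is an illustrative Bloom filter and Count-Min Sketch with hash functions derived from SHA-256. The sizes, the `_hash` helper, and the class names are assumptions chosen for readability, not tuned parameters or the book's code.

```python
import hashlib


def _hash(item: str, seed: int, modulus: int) -> int:
    """Deterministic hash of item in [0, modulus), varied by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def add(self, item: str) -> None:
        for seed in range(self.hashes):
            self.bits[_hash(item, seed, self.size)] = True

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[_hash(item, seed, self.size)] for seed in range(self.hashes)
        )


class CountMinSketch:
    """Frequency estimates in fixed memory; may overcount, never undercounts."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][_hash(item, row, self.width)] += count

    def estimate(self, item: str) -> int:
        return min(
            self.table[row][_hash(item, row, self.width)]
            for row in range(self.depth)
        )


stream = ["a", "b", "a", "c", "a"]
bloom = BloomFilter()
sketch = CountMinSketch()
for item in stream:
    bloom.add(item)
    sketch.add(item)
```

Both structures trade exactness for bounded memory. The Bloom filter can answer "possibly present" for an item it never saw, and the sketch can report a count higher than the true one when hashes collide.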
Data storage
Long-term storage
- Direct write: the stream writes to the target store immediately, but may hit its throughput limit.
- Indirect write: data lands in an intermediate layer first and is loaded in batches later.
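The indirect path can be sketched as a small buffering writer. This is an assumption-laden illustration: `BatchingWriter`, `sink`, and `batch_size` are hypothetical names, and the buffer is held in memory, so a real intermediate layer would need to be durable to avoid losing unflushed records on a crash.

```python
class BatchingWriter:
    """Buffers records and hands them to the sink in batches."""

    def __init__(self, sink, batch_size: int):
        self.sink = sink              # callable that accepts a list of records
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, even if the batch is not full."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []  # stands in for the long-term store
writer = BatchingWriter(sink=batches.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()  # flush the final partial batch
```

Seven records reach the store in three calls instead of seven, which is the point of the indirect path: fewer, larger writes at the cost of extra latency before data becomes visible.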
In-memory storage
Caching strategies
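Read-through
On a miss, the cache loads the value from the backing store and returns it.
Refresh-ahead
The cache reloads entries before they expire, so frequently read keys avoid a miss.
Write-through
Each write goes to the cache and the backing store synchronously.
Write-around
Writes go straight to the backing store and bypass the cache.
Write-behind
The cache accepts the write and persists it to the backing store later.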
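The strategies differ mainly in when the backing store is touched. The sketch below models four of them with plain dictionaries. Refresh-ahead is left out because it needs expiry timers. `Cache`, `store_reads`, and `dirty` are hypothetical names, and a real cache would also handle eviction, expiry, and concurrent access.

```python
class Cache:
    """Toy cache in front of a dict that stands in for a slower store."""

    def __init__(self, store: dict):
        self.store = store
        self.data = {}        # in-memory cache entries
        self.dirty = set()    # keys written to the cache but not yet stored
        self.store_reads = 0  # how many times the store was read

    def read_through(self, key):
        """On a miss, load from the store and keep the value cached."""
        if key not in self.data:
            self.store_reads += 1
            self.data[key] = self.store[key]
        return self.data[key]

    def write_through(self, key, value) -> None:
        """Write to the store and the cache in the same call."""
        self.store[key] = value
        self.data[key] = value

    def write_around(self, key, value) -> None:
        """Write to the store only and drop any stale cached copy."""
        self.store[key] = value
        self.data.pop(key, None)

    def write_behind(self, key, value) -> None:
        """Write to the cache now; the store is updated on flush()."""
        self.data[key] = value
        self.dirty.add(key)

    def flush(self) -> None:
        for key in self.dirty:
            self.store[key] = self.data[key]
        self.dirty.clear()


store = {"a": 1}
cache = Cache(store)
cache.read_through("a")
cache.read_through("a")               # second read is served from the cache
reads_after_two_gets = cache.store_reads

cache.write_through("b", 2)
cache.write_around("a", 10)
value_after_write_around = cache.read_through("a")  # miss, reloads 10

cache.write_behind("c", 3)
stored_before_flush = "c" in store    # False: the store has not seen "c" yet
cache.flush()
```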
Access to processed data
Interaction patterns
- Data synchronization
- RPC / RMI
- Simple messaging
- Publish-subscribe
Delivery protocols
Protocol selection factors
Data consumers
Information applications
Dashboards, reports, visualization, and product analytics.
Third-party integrations
APIs, webhooks, synchronization, and event exchange.
Downstream processing
Additional computation paths that read the prepared stream.
Key questions for a streaming client
1. How does the consumer know it is falling behind the input stream?
2. What happens if that lag grows silently?
3. How can reads and processing scale without breaking ordering or delivery guarantees?
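The first two questions can be made concrete by measuring lag as the distance between the newest offset in each partition and the consumer's committed offset. This sketch assumes a partitioned log with integer offsets, as in Kafka-style systems. `partition_lag`, `lag_is_growing`, and the sample offsets are hypothetical and do not call any real client API.

```python
def partition_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: newest offset minus the consumer's committed offset.

    A partition with no committed offset is treated as starting from 0.
    """
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


def lag_is_growing(samples: list, min_samples: int = 3) -> bool:
    """True when lag rose strictly across the last `min_samples` samples."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(
        earlier < later for earlier, later in zip(recent, recent[1:])
    )


end_offsets = {0: 120, 1: 80}
committed = {0: 100, 1: 80}
lag = partition_lag(end_offsets, committed)  # partition 0 is 20 behind
total_lag = sum(lag.values())

# Alert on the trend rather than a single sample, so a short burst
# does not page anyone but silent, sustained growth does.
growing = lag_is_growing([5, 20, 45])
recovering = lag_is_growing([5, 20, 10])
```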
What to remember
"Brevity is the sister of talent" - A. P. Chekhov
The book is short, but still useful: it presents a streaming platform as a system with explicit layers, delivery guarantees, state, windows, storage, and consumers. Many tools have changed since publication, but the engineering questions remain: where state lives, how recovery works, how time is interpreted, and what happens when the stream outruns its processors.
Related chapters
- Kafka: The Definitive Guide, 2nd Edition (short summary) - Hands-on focus on brokers, partitions, and delivery semantics as the foundation of streaming architecture.
- Kappa Architecture: stream-first alternative to Lambda - A single streaming path for online processing and historical replay, extending the ideas from the book.
- Data Pipeline / ETL / ELT Architecture - How to embed stream processing into an end-to-end data platform and team operating model.
- Event-driven architecture: Event Sourcing, CQRS, Saga - Architectural context where event streams become the default integration mechanism across services.
- Distributed message queue - System design case focused on throughput, ordering, durability, and behavior under peak load.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - Core foundation for stream processing, stateful computation, and consistency trade-offs in data-intensive systems.
- Enterprise Integration Patterns (short summary) - Pattern language for reliable event and stream interactions between heterogeneous services.
- Big Data: Principles and best practices of scalable realtime data systems (short summary) - Strategic perspective on realtime data-system architecture and the evolution of large-scale streaming platforms.
- Data Mesh in Action (short summary) - Organizational layer for decomposing a streaming platform into data domains and federated governance.
- Google Global Network: Evolution and Architectural Principles for the AI Era - Network foundation for high-throughput streams: latency budgets, cross-region links, and global-network resilience.
