System Design Space

Updated: May 1, 2026 at 6:48 PM

Streaming Data (short summary)

hard

Streaming becomes easier to reason about once you stop seeing it as Kafka plus a consumer and start thinking in ingestion, queue, analysis, event-time, and stream-state layers.

In engineering practice, this book helps design streaming pipelines with explicit ordering, state, late-data handling, and materialization boundaries so the architecture survives replay and backfill.

In interviews and architecture reviews, it is especially useful when you need to show the cost of stream processing: reprocessing work, the correctness impact of late events, and backfill pressure on service objectives.

Practical value of this chapter

Design in practice

Supports streaming-pipeline design with event time, ordering, and stateful processing.

Decision quality

Improves batch-versus-stream and materialization-boundary decisions.

Interview articulation

Enables clear discussion of offsets, windowing, and exactly-once limitations.

Risk and trade-offs

Focuses on late events, reprocessing work, and backfill impact on service objectives.

Source

Book review

Alexander Polomodov's original review on tellmeabout.tech, focused on the book's practical value.

Go to the site

Streaming Data: Understanding the Real-Time Pipeline

Author: Andrew Psaltis
Publisher: Manning Publications, 2017 (Russian edition: DMK Press, 2018)
Length: 216 pages

Andrew Psaltis on streaming architecture: message queues, delivery semantics, event time, windowing, stream state, and historical backfill.


The book is useful because it treats streaming as a full data path, not as a Kafka topic plus a consumer. It connects ingestion, buffering, processing, storage, serving interfaces, and downstream consumers into one system.

That framing keeps the important engineering questions visible: what guarantees the queue provides, where state lives, how late data is handled, how historical replay works, and what happens when consumers fall behind the incoming stream.

Streaming system architecture

Psaltis walks through the data path from the source to the final consumer. The model is valuable because it separates responsibility across ingestion, messaging, processing, storage, and serving layers.

Related topic

Kafka: The Definitive Guide

Deep dive into event logs, partitions, consumer groups, and Kafka operations.

Read review
Data ingestion

Collecting events from applications, devices, logs, and external systems.

Message queue

Buffering, routing, and decoupling writers from readers.

Stream analysis

Continuous computation, filtering, enrichment, and aggregation.

In-memory store

Fast access to fresh state and intermediate results.

Data access

APIs, subscriptions, and delivery protocols for processed data.

Data consumers

Dashboards, services, integrations, and downstream processing paths.

Data ingestion

Author recommendation

Enterprise Integration Patterns

Classic integration-patterns book referenced by the author.

Read review

At the edge of a streaming system, the first decision is the interaction model: synchronous request, acknowledged handoff, event publication, one-way delivery, or a continuous stream.

Request-response - The client waits for a result and pays the latency of a synchronous call.
Request-acknowledge - The sender gets proof of receipt while processing may continue later.
Publish-subscribe - The producer publishes an event without knowing all downstream consumers.
One-way delivery - The message is sent without a reply, so reliability rules must be explicit.
Stream - Data is delivered continuously and processed as an event sequence.

Fault tolerance during ingestion

The author compares checkpoints and message logging. For streaming systems, message logging is often more practical: it lets the system recover after a failure and re-read the needed range of events.

RBML

Receiver-based message logging

SBML

Sender-based message logging

HML

Hybrid message logging
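
To make the recovery idea concrete, here is a minimal sketch, assuming an in-memory append-only log and a consumer that occasionally snapshots its state together with its position. The names `MessageLog`, `SummingConsumer`, `save_snapshot`, and `recover` are hypothetical and not taken from the book, and the sketch does not distinguish between RBML, SBML, and HML. It only shows why a log makes it possible to re-read the needed range of events after a failure.

```python
class MessageLog:
    """Append-only log; an event's offset is its index in the list."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset):
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self.events))[offset:]


class SummingConsumer:
    """Keeps a running sum; state is rebuilt from a snapshot plus replay."""

    def __init__(self, log):
        self.log = log
        self.offset = 0          # next offset to process
        self.total = 0           # in-memory state
        self.snapshot = (0, 0)   # (offset, total) as of the last snapshot

    def process(self, max_events):
        for offset, value in self.log.read_from(self.offset)[:max_events]:
            self.total += value
            self.offset = offset + 1

    def save_snapshot(self):
        self.snapshot = (self.offset, self.total)

    def recover(self):
        """Simulate a restart: restore the snapshot, then replay the log."""
        self.offset, self.total = self.snapshot
        self.process(max_events=len(self.log.events))
```

In this sketch, work done after the last snapshot is lost on a crash, but the log still holds those events, so `recover()` replays them and arrives at the same total.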

Message queue

The queue decouples data collection from analysis. Producers can write events at one rate while consumers read and process them at another.
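
The effect of mismatched rates can be illustrated with a small simulation. This is a sketch under simplifying assumptions (fixed rates, discrete ticks, an unbounded in-memory buffer); the function name `simulate_backlog` and its parameters are made up for illustration.

```python
from collections import deque


def simulate_backlog(produce_per_tick, consume_per_tick, ticks):
    """Return the queue depth after each tick for fixed producer/consumer rates."""
    queue = deque()
    depth = []
    next_id = 0
    for _ in range(ticks):
        # The producer writes at its own rate.
        for _ in range(produce_per_tick):
            queue.append(next_id)
            next_id += 1
        # The consumer reads at its own rate, limited by what is available.
        for _ in range(min(consume_per_tick, len(queue))):
            queue.popleft()
        depth.append(len(queue))
    return depth


print(simulate_backlog(5, 3, 4))  # [2, 4, 6, 8]
print(simulate_backlog(3, 5, 3))  # [0, 0, 0]
```

When the producer is faster, the buffer absorbs the difference but grows without limit, which is why a real queue needs retention limits or backpressure.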

Message delivery semantics

At most once (Low) - A message is not duplicated, but it may be lost during failure.
At least once (Medium) - A message is not lost, but the handler must tolerate duplicates.
Exactly once (High) - The side effect should happen once, which requires strict storage and idempotency guarantees.
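
One common way to live with at-least-once delivery is to make the handler idempotent. The sketch below assumes every message carries a unique ID; `IdempotentHandler`, `seen_ids`, and `balance` are hypothetical names for illustration. A production system would also need to persist the seen IDs and the side effect atomically, which this in-memory version does not attempt.

```python
class IdempotentHandler:
    """Applies each message at most once by remembering processed IDs."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # redelivery: skip the side effect
        self.balance += amount
        self.seen_ids.add(message_id)
        return True


handler = IdempotentHandler()
handler.handle("m1", 10)
handler.handle("m2", 5)
handler.handle("m1", 10)  # duplicate delivery of m1
print(handler.balance)    # 15
```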

Stream analysis

Related chapter

DDIA: Stream Processing

DDIA goes deeper into stream state, event time, materialized views, and consistency trade-offs.

Read review

The strongest part of the book is its explanation of data in motion: the event has not yet landed in its final store, but it can already change state, aggregates, alerts, and user-facing interfaces.

Processing technologies

Spark Streaming, Storm, Flink, Samza

Typical components

  • Application driver
  • Streaming manager
  • Stream processor
  • Data sources

What to check when choosing a system

📨

Message delivery

Loss, duplicates, acknowledgements, and redelivery

💾

State management

Local state, snapshots, and failure recovery

🛡️

Fault tolerance

Restart, logging, and side-effect control

Limits of stream algorithms

  • Single pass: the processor often has only one chance to make a decision for an event.
  • Concept drift: data patterns change and the model gradually becomes stale.
  • Resource limits: memory, CPU, and network capacity must keep up with the live stream.
  • Time: event time, processing time, and arrival order are different concepts.
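
The last point in the list above, the gap between event time and processing time, can be sketched as follows. This is an illustrative heuristic, not the book's or any engine's definition: the watermark is assumed to be the largest event time seen so far minus a fixed allowed lateness, and `Event`, `split_late`, and `allowed_lateness` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int       # when it happened at the source, in seconds
    processing_time: int  # when the processor received it, in seconds


def split_late(events, allowed_lateness):
    """Split events (given in arrival order) into on-time and late lists."""
    max_event_time = None
    on_time, late = [], []
    for e in events:
        if max_event_time is None or e.event_time > max_event_time:
            max_event_time = e.event_time
        watermark = max_event_time - allowed_lateness
        if e.event_time < watermark:
            late.append(e)      # arrived after the watermark passed it
        else:
            on_time.append(e)
    return on_time, late


arrivals = [Event(10, 11), Event(12, 13), Event(5, 14), Event(11, 15)]
on_time, late = split_late(arrivals, allowed_lateness=3)
print([e.event_time for e in on_time])  # [10, 12, 11]
print([e.event_time for e in late])     # [5]
```

The event with event time 11 arrives out of order but still within the allowed lateness, while the event with event time 5 falls behind the watermark and has to be dropped, corrected later, or routed to a side path.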

Data windows and aggregation

Sliding window

Overlapping intervals that continuously move forward and provide a fresh aggregate.

Tumbling window

Non-overlapping fixed-size intervals that work well for regular reports and metrics.
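
The two window types can be sketched as window-assignment functions over integer timestamps. The function names and the convention that windows are half-open intervals `[start, end)` aligned to multiples of the size or slide are assumptions made for this example.

```python
from collections import Counter


def tumbling_window(ts, size):
    """Return the single [start, end) window that contains ts."""
    start = (ts // size) * size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window that contains ts.

    Window starts are multiples of `slide`; with slide < size the windows
    overlap, so one event belongs to several of them.
    """
    windows = []
    start = (ts // slide) * slide  # latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_per_tumbling_window(timestamps, size):
    """Aggregate: number of events per tumbling window."""
    return Counter(tumbling_window(ts, size) for ts in timestamps)


print(tumbling_window(65, 60))      # (60, 120)
print(sliding_windows(65, 60, 30))  # [(30, 90), (60, 120)]
```

For timestamps smaller than the window size, `sliding_windows` also returns windows with negative start values; a real engine would define how such edge windows are handled.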

How to summarize a stream without storing the full history

Random sampling

Representative subset of the stream for approximate analytics.

LogLog / MinCount

Approximate count of unique elements.

Count-Min Sketch

Frequency estimation with bounded memory usage.

Bloom filter

Fast check for possible membership in a set.
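
Two of the structures above, the Bloom filter and the Count-Min Sketch, fit in a short sketch. The sizes, the number of hash functions, and the SHA-256-based helper `_hash` are arbitrary choices made for illustration, not tuned values and not the book's implementation.

```python
import hashlib


def _hash(item, seed, modulus):
    """Deterministic hash of a string item, varied by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.size)] = True

    def might_contain(self, item):
        return all(self.bits[_hash(item, seed, self.size)]
                   for seed in range(self.num_hashes))


class CountMinSketch:
    """Frequency estimate in fixed memory; it can overcount, never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][_hash(item, row, self.width)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][_hash(item, row, self.width)]
                   for row in range(self.depth))
```

Both structures trade exactness for bounded memory: a Bloom filter may report an item that was never added, and a Count-Min Sketch may report a count higher than the true one.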

Data storage

Long-term storage

  • Direct write: the stream writes to the target store immediately, but may hit its throughput limit.
  • Indirect write: data lands in an intermediate layer first and is loaded in batches later.
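
The indirect-write option can be sketched as a small buffer that hands events to the target store in batches. `BatchingWriter`, `sink`, and `batch_size` are hypothetical names, and the sink here is any callable that accepts a list of events; a real pipeline would also flush on a timer and handle sink failures.

```python
class BatchingWriter:
    """Buffers events and passes them to the sink in fixed-size batches."""

    def __init__(self, sink, batch_size):
        self.sink = sink            # callable that receives a list of events
        self.batch_size = batch_size
        self.buffer = []

    def write(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, even if the batch is not full."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []
writer = BatchingWriter(sink=batches.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The store sees three writes instead of seven, at the cost of extra latency and the risk of losing the buffered events if the writer crashes before a flush.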

In-memory storage

SQLite, RocksDB, LevelDB, Memcached, Redis, MemSQL, Aerospike, Apache Ignite

Caching strategies

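Read-through

On a miss, the cache loads the value from the backing store and returns it.

Refresh-ahead

The cache refreshes frequently read entries before they expire.

Write-through

Each write goes to the cache and the backing store synchronously.

Write-around

Writes go straight to the backing store and bypass the cache.

Write-behind

Writes land in the cache first and are persisted to the backing store later.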
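
Three of these strategies can be contrasted in a few lines. This is a simplified sketch in which a plain `dict` stands in for the backing store; the class names and the `store_reads` counter are made up for illustration, and eviction, expiry, and concurrency are left out.

```python
class ReadThroughCache:
    """On a miss, load the value from the store and keep it in the cache."""

    def __init__(self, store):
        self.store = store   # dict standing in for the backing store
        self.cache = {}
        self.store_reads = 0

    def get(self, key):
        if key not in self.cache:
            self.store_reads += 1
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteThroughCache(ReadThroughCache):
    """Every write updates the store and the cache in the same call."""

    def put(self, key, value):
        self.store[key] = value
        self.cache[key] = value


class WriteBehindCache(ReadThroughCache):
    """Writes hit the cache immediately and reach the store on flush()."""

    def __init__(self, store):
        super().__init__(store)
        self.pending = {}

    def put(self, key, value):
        self.cache[key] = value
        self.pending[key] = value

    def flush(self):
        self.store.update(self.pending)
        self.pending.clear()
```

Write-behind gives the fastest writes but leaves a window in which the store is stale and pending writes can be lost; write-through closes that window by paying the store's latency on every write.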

Access to processed data

Interaction patterns

  • Data synchronization
  • RPC / RMI
  • Simple messaging
  • Publish-subscribe

Delivery protocols

Webhooks, Long Poll, SSE, WebSocket

Protocol selection factors

Update frequency
Direction
Latency
Efficiency
Fault tolerance
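
As one concrete example of these factors, SSE is a one-directional, text-based protocol in which each message is a group of `field: value` lines terminated by a blank line. The helper below is a minimal sketch of that framing; the function name `format_sse` and its parameters are hypothetical. Including an `id` field is what lets a reconnecting client tell the server where it stopped, through the `Last-Event-ID` request header.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Frame one Server-Sent Events message as text."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads are sent as one "data:" line per payload line.
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


print(repr(format_sse("hello", event="tick", event_id="7")))
# 'id: 7\nevent: tick\ndata: hello\n\n'
```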

Data consumers

📊

Information applications

Dashboards, reports, visualization, and product analytics.

🔗

Third-party integrations

APIs, webhooks, synchronization, and event exchange.

Downstream processing

Additional computation paths that read the prepared stream.

Key questions for a streaming client

  1. How does the consumer know it is falling behind the input stream?
  2. What happens if that lag grows silently?
  3. How can reads and processing scale without breaking ordering or delivery guarantees?
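
These questions can be made concrete with a small sketch, assuming an offset-based log in which lag is the distance between the end of the log and the consumer's committed position. The function names, the "strictly increasing samples" rule, and the CRC32-based partitioner are illustrative assumptions rather than any particular system's API.

```python
import zlib


def consumer_lag(log_end_offset, committed_offset):
    """Number of events written but not yet processed by the consumer."""
    return max(0, log_end_offset - committed_offset)


def lag_is_growing(lag_samples, min_samples=3):
    """True when the last `min_samples` lag readings strictly increase."""
    recent = lag_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


def partition_for(key, num_partitions):
    """Map a key to a partition with a hash that is stable across runs.

    Events with the same key always land in the same partition, so
    per-key ordering is kept while partitions are read in parallel.
    """
    return zlib.crc32(key.encode()) % num_partitions


print(consumer_lag(120, 100))        # 20
print(lag_is_growing([5, 10, 20]))   # True
print(lag_is_growing([20, 10, 5]))   # False
```

Alerting on the trend rather than on a single reading is one way to catch lag that would otherwise grow silently. Partitioning by key scales reads but preserves ordering only within a key, not across the whole stream.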

What to remember

"Brevity is the sister of talent" - A. P. Chekhov

The book is short, but still useful: it presents a streaming platform as a system with explicit layers, delivery guarantees, state, windows, storage, and consumers. Many tools have changed since publication, but the engineering questions remain: where state lives, how recovery works, how time is interpreted, and what happens when the stream outruns its processors.

