Source
Book Review
Original review by Alexander Polomodov on tellmeabout.tech
Streaming Data: Understanding the Real-Time Pipeline
Author: Andrew Psaltis
Publisher: Manning Publications, 2017 (Russian edition: DMK Press, 2018)
Length: 216 pages
Andrew Psaltis on stream processing: the collection, queue, and analysis tiers, delivery semantics, data windows, and stream algorithms.
Streaming System Architecture
The book walks through the entire data pipeline, from the source to the final consumer. The reference architecture consists of the following tiers:
Related topic
Kafka: The Definitive Guide
Deep dive into one of the key stream processing technologies
Collecting data from sources
Buffering and routing
Stream processing and analysis
In-memory storage
Access to processed data
Data consumers
Collection Tier - Data Collection
Author's recommendation
Enterprise Integration Patterns
Classic book on integration patterns referenced by the author
The chapter looks at interaction patterns for data collection.
Fault tolerance
The author considers two approaches: checkpointing and logging. For streaming systems, logging is the better fit:
Receiver-based message logging
Sender-based message logging
Hybrid message logging
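As a rough illustration of the receiver-based variant, the sketch below appends every incoming message to a durable log before processing it and replays that log after a crash. The class name `LoggingReceiver`, the JSON-lines file format, and the `demo` helper are assumptions made for this example; they are not taken from the book.

```python
import json
import os
import tempfile


class LoggingReceiver:
    """Hypothetical receiver that logs each message before processing it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.processed = []  # stands in for downstream processing state

    def receive(self, message):
        # 1. Persist the message first, so it survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then hand it to the processing step.
        self.processed.append(message)

    def recover(self):
        """Rebuild in-memory state by replaying the log from the start."""
        self.processed = []
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                self.processed = [json.loads(line) for line in log if line.strip()]
        return self.processed


def demo():
    path = os.path.join(tempfile.mkdtemp(), "receiver.log")
    receiver = LoggingReceiver(path)
    for i in range(3):
        receiver.receive({"id": i, "payload": f"event-{i}"})
    # Simulate a crash: a fresh receiver starts with no in-memory state.
    restarted = LoggingReceiver(path)
    return restarted.recover()
```

In sender-based logging the same log would live on the producing side, and the hybrid approach keeps a log on both ends.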
Message Queuing Tier
The purpose of this tier is to decouple data collection from analysis. Key concepts: producer, broker, and consumer.
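A minimal in-process sketch of that decoupling, assuming a bounded `queue.Queue` from the Python standard library as a stand-in for the broker. The function names and the `None` sentinel are illustrative choices; a real broker would be a separate, durable service.

```python
import queue
import threading


def producer(broker, n):
    """Collection side: publishes events without knowing who reads them."""
    for i in range(n):
        broker.put(f"event-{i}")  # blocks while the bounded queue is full
    broker.put(None)  # sentinel: no more messages


def consumer(broker, results):
    """Analysis side: reads at its own pace until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        results.append(message.upper())


def run(n=5):
    broker = queue.Queue(maxsize=2)  # small buffer between the two tiers
    results = []
    threads = [
        threading.Thread(target=producer, args=(broker, n)),
        threading.Thread(target=consumer, args=(broker, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Neither side calls the other directly. The producer only slows down when the buffer is full, which is the simplest form of backpressure.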
Message delivery semantics
At-most-once: the message is delivered no more than once and may be lost
At-least-once: delivery is guaranteed, but duplicates are possible
Exactly-once: the message is delivered exactly one time; the hardest to implement
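To make the trade-off concrete, here is a small simulation, written as a sketch with hypothetical names. A producer retries until it sees an acknowledgement (at-least-once), so a lost acknowledgement causes a duplicate. A consumer that remembers message IDs can discard the duplicate, which gives exactly-once processing on top of at-least-once delivery.

```python
def deliver_at_least_once(messages, consumer, ack_lost_for):
    """Send each message until an ack arrives; retry when the ack is lost.

    `ack_lost_for` is a set of message IDs whose first ack is dropped
    (an assumption used to simulate a flaky network).
    """
    for msg_id, payload in messages:
        attempts = 0
        while True:
            attempts += 1
            consumer(msg_id, payload)  # the message itself always arrives
            ack_lost = msg_id in ack_lost_for and attempts == 1
            if not ack_lost:
                break  # ack received, move on to the next message


class NaiveConsumer:
    """Applies every delivery, so duplicates are counted twice."""

    def __init__(self):
        self.total = 0

    def __call__(self, msg_id, amount):
        self.total += amount


class IdempotentConsumer:
    """Remembers processed IDs and ignores redeliveries."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def __call__(self, msg_id, amount):
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.total += amount


messages = [(1, 10), (2, 20), (3, 30)]
naive, idempotent = NaiveConsumer(), IdempotentConsumer()
deliver_at_least_once(messages, naive, ack_lost_for={2})
deliver_at_least_once(messages, idempotent, ack_lost_for={2})
```

Here message 2 is delivered twice, so the naive consumer overcounts while the idempotent one does not. In a real system the set of seen IDs would itself need to be durable and bounded.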
Analysis Tier - Streaming Data Analysis
Related chapter
DDIA: Stream Processing
Chapter 11 of DDIA covers the topic of stream processing in depth.
The most substantial part of the book. It starts with the concept of in-flight data and the inversion of the traditional data management model.
Processing technologies
Common Components
- Application Driver
- Streaming Manager
- Stream Processor
- Data Sources
Key features when choosing a system
Message Delivery
Delivery semantics
State Management
Fault Tolerance
Constraints on stream algorithms
- Single pass: one chance to process each message
- Concept drift: model properties can change as new data arrives
- Limited resources: there is not always enough processing power
- Time: the difference between stream (processing) time and event time
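The single-pass constraint is easiest to see with a running statistic. The sketch below uses Welford's online algorithm, chosen here as a well-known example and not something the review attributes to the book, to maintain a mean and variance while seeing each value exactly once and keeping only constant state.

```python
class RunningStats:
    """Mean and population variance over a stream, in one pass."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is seen once and never stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Memory use stays fixed no matter how long the stream runs, which is the property the list above asks of stream algorithms.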
Data windows and summarization
Sliding Window
Sliding window - overlapping intervals for continuous analysis
Tumbling Window
Tumbling window - non-overlapping fixed-size intervals
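The difference between the two window types can be sketched by assigning timestamped events to windows. The example assumes integer timestamps, half-open windows `[start, start + size)`, and a finite list of events for readability; a real engine would do the same assignment incrementally. The function names are hypothetical.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Assign (timestamp, value) events to non-overlapping windows."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size  # each event falls in exactly one window
        windows[start].append(value)
    return dict(sorted(windows.items()))


def sliding_windows(events, size, slide):
    """Assign events to overlapping windows that start every `slide` units."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // slide) * slide  # latest window start at or before ts
        # Walk back through every earlier window that still covers ts.
        while start > ts - size and start >= 0:
            windows[start].append(value)
            start -= slide
    return dict(sorted(windows.items()))


events = [(1, "a"), (4, "b"), (6, "c"), (11, "d")]
tumbling = tumbling_windows(events, size=5)
sliding = sliding_windows(events, size=10, slide=5)
```

With a window of 10 and a slide of 5, each event lands in up to two windows, while the tumbling version places each event in exactly one.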
Methods for summarizing data on a stream
Random sampling
Representative part of the stream
LogLog / MinCount
Counting unique elements
Count-Min Sketch
Element occurrence frequency
Bloom filter
Testing whether an element has been seen
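Two of these structures are compact enough to sketch. The code below is a simplified illustration, assuming salted SHA-256 digests as the hash family and arbitrary default sizes. The helper `_hashes` and both class layouts are choices made for this example, not the book's implementation.

```python
import hashlib


def _hashes(item, count, width):
    """Derive `count` bucket indexes in [0, width) from salted SHA-256."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % width


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def add(self, item):
        for idx in _hashes(item, self.num_hashes, self.size):
            self.bits[idx] = True

    def might_contain(self, item):
        return all(self.bits[idx] for idx in _hashes(item, self.num_hashes, self.size))


class CountMinSketch:
    """Frequency estimates that can overcount but never undercount."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, idx in enumerate(_hashes(item, self.depth, self.width)):
            self.table[row][idx] += count

    def estimate(self, item):
        # The minimum across rows is the estimate least inflated by collisions.
        return min(
            self.table[row][idx]
            for row, idx in enumerate(_hashes(item, self.depth, self.width))
        )


bloom = BloomFilter()
sketch = CountMinSketch()
for word in ["apple", "pear", "apple", "apple"]:
    bloom.add(word)
    sketch.add(word)
```

Both structures use fixed memory regardless of stream length. The price is a one-sided error: the Bloom filter may report an unseen element as present, and the sketch may report a count that is too high.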
Data storage
Long-term Storage
- Direct writes: writing straight to long-term storage slows the stream down
- Indirect writes: ETL with batch loading
In-Memory Storage
Caching Strategies
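Read-through
On a miss, the cache loads the value from the backing store
Refresh-ahead
Frequently read entries are refreshed before they expire
Write-through
Writes go to the cache and the backing store synchronously
Write-around
Writes go straight to the backing store, bypassing the cache
Write-behind
Writes go to the cache first and reach the backing store later, asynchronously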
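Two of these strategies are sketched below, with a plain dictionary standing in for the slow backing store. The class names, the `store_reads` counter, and the explicit `flush` call are assumptions for illustration; a production write-behind cache would flush on a timer or in the background.

```python
class ReadThroughCache:
    """On a miss, the cache itself loads the value from the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.store_reads = 0  # counts how often the slow store is hit

    def get(self, key):
        if key not in self.cache:
            self.store_reads += 1
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBehindCache:
    """Writes land in the cache first and reach the store on flush."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()  # keys written but not yet persisted

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        flushed = len(self.dirty)
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
        return flushed


store = {"user:1": "Ada"}
reader = ReadThroughCache(store)
first, second = reader.get("user:1"), reader.get("user:1")

writer = WriteBehindCache(store)
writer.put("user:2", "Grace")
visible_before_flush = "user:2" in store
flushed = writer.flush()
```

Write-behind keeps the write path fast at the cost of a window in which a crash loses unflushed entries, which is why it is usually paired with some form of logging.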
Data Access Tier - Data Access
Interaction Patterns
- Data Sync
- RPC / RMI
- Simple Messaging
- Publish-Subscribe
Delivery protocols
Protocol Selection Factors
Consumer Tier - Data Consumers
Information applications
Dashboards, reports, visualization
Integration with third-party systems
API, webhooks, synchronization
Stream processing
Downstream processing
Key questions for a streaming client
- 1. How can a client tell that it is not reading fast enough?
- 2. What happens if it does not know?
- 3. How can the client be scaled so that it keeps up with the stream?
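One common way to approach these questions, offered as an assumption and not as the book's answer, is to track consumer lag: the gap between the newest offset in the log and the offset the client has processed. The function names, the threshold, and the rate figures below are hypothetical.

```python
import math


def consumer_lag(latest_offset, committed_offset):
    """Number of messages written but not yet processed by the client."""
    return max(0, latest_offset - committed_offset)


def is_falling_behind(lag_samples, threshold):
    """True if the latest lag exceeds the threshold and has grown over the sample period."""
    if len(lag_samples) < 2:
        return False
    return lag_samples[-1] > threshold and lag_samples[-1] > lag_samples[0]


def consumers_needed(produce_rate, per_consumer_rate):
    """Minimum number of parallel consumers needed to match the produce rate."""
    return math.ceil(produce_rate / per_consumer_rate)


lag = consumer_lag(latest_offset=1500, committed_offset=1200)
growing = is_falling_behind([100, 250, 400], threshold=200)
recovering = is_falling_behind([400, 250, 100], threshold=50)
needed = consumers_needed(produce_rate=10_000, per_consumer_rate=3_000)
```

A client that never measures lag has no signal that it is behind, and depending on the broker's retention policy it may end up silently missing data.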
Conclusion
“Brevity is the sister of talent” - A.P. Chekhov
The book is useful and short (about 200 pages), which makes it even better. Conceptually, it has not become outdated in the years since its release - the architectural patterns of stream processing remain relevant.
