Source
Book Review
Original review by Alexander Polomodov on tellmeabout.tech
Big Data: Principles and Best Practices of Scalable Realtime Data Systems
Authors: Nathan Marz, James Warren
Publisher: Manning Publications
Length: 328 pages
Nathan Marz on the Lambda Architecture: batch, serving, and speed layers, data immutability, HyperLogLog, and practical examples.
Related topic
DDIA: Batch & Stream Processing
Chapters 10-11 of DDIA cover batch and stream processing in detail.
The book is devoted to the Lambda Architecture, an architectural pattern for big data processing systems that consists of three layers:
Batch Layer
Stores the master dataset as immutable, append-only events and computes arbitrary views over the complete dataset.
Serving Layer
Serves fast queries over precomputed views. The views can stay immutable between batch layer recomputations.
Speed Layer
Stream processing that keeps results up to date between batch layer recomputations. Produces approximate aggregates in real time.
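To make the division of responsibilities concrete, here is a minimal in-memory sketch of the three layers. It is an illustration only, not code from the book: the `PageView` event, the `MasterDataset` class, and all function names are hypothetical, and plain Python counters stand in for real storage and processing systems.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """Immutable event; the field names are illustrative assumptions."""
    url: str
    timestamp: int  # seconds since epoch


class MasterDataset:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def all_events(self):
        return tuple(self._events)


def compute_batch_view(events):
    """Batch layer: recompute the view from the complete dataset."""
    return Counter(e.url for e in events)


def update_realtime_view(view, event):
    """Speed layer: incrementally cover events newer than the last batch run."""
    view[event.url] += 1


def query(url, batch_view, realtime_view):
    """Serving path: merge the precomputed batch view with the realtime delta."""
    return batch_view[url] + realtime_view[url]


master = MasterDataset()
for ts in (1, 2, 3):
    master.append(PageView("/home", ts))

batch_view = compute_batch_view(master.all_events())  # a batch run finishes here

realtime_view = Counter()
late_event = PageView("/home", 4)
master.append(late_event)                        # still lands in the master dataset
update_realtime_view(realtime_view, late_event)  # visible before the next batch run

total = query("/home", batch_view, realtime_view)
```

In this sketch, once the next batch run has absorbed the late event, the realtime view can be discarded and rebuilt, which is why the speed layer only has to be correct for a short window.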
Lambda Architecture Map
master dataset + batch views + realtime views
The Lambda Architecture combines the accuracy of batch recomputation with a low-latency streaming layer through a single serving path.
"The Lambda Architecture provides a general-purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency"
— Nathan Marz
Desired properties of a Big Data system
The authors identify the key properties that a big data processing system should have:
Horizontal scaling
Ability to add nodes to increase capacity
Fault tolerance
Resilience to hardware failures without data loss
Human-fault tolerance
Ability to correct human errors
Low Latency
Quick responses to user requests
Ad hoc queries
Support for arbitrary computations over the data
Minimal maintenance
Simple operational support of the system
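The ability to correct human errors follows from keeping raw data immutable: a view produced by buggy logic can be thrown away and recomputed from the untouched master dataset. The following is a small sketch of that idea under assumed sample data. The URLs and both view functions are hypothetical and not taken from the book.

```python
from collections import Counter

# Raw, immutable events kept in the master dataset (hypothetical sample data).
MASTER_DATASET = (
    "/home",
    "/home?ref=mail",
    "/about",
    "/home?ref=ad",
)


def buggy_view(events):
    """First deployment: forgets to strip query strings, so /home is split."""
    return Counter(events)


def fixed_view(events):
    """Corrected logic: normalise each URL before counting."""
    return Counter(url.split("?", 1)[0] for url in events)


# The bad view is discarded and rebuilt; the raw data was never modified.
view = buggy_view(MASTER_DATASET)
wrong_count = view["/home"]

view = fixed_view(MASTER_DATASET)
right_count = view["/home"]
```

Had the system stored only the aggregated counts and updated them in place, the original query strings would be gone and the mistake could not be undone.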
Book structure
We recommend
Streaming Data
A modern view of the architecture of streaming systems
The book is divided into parts corresponding to the layers of the Lambda Architecture:
Part 1: Batch Layer
The data model, storage of the master dataset, and computation of views over the complete dataset.
Part 2: Serving Layer
Indexing and serving precomputed views for fast queries.
Part 3: Speed Layer
Real-time data processing that compensates for the latency of the batch layer.
Practical examples
The authors go beyond theory and work through typical tasks for big data systems:
URL Page Views
Counting website URL views over time
Unique Visitors
Calculating the number of unique users with HyperLogLog
Bounce Rate
Calculating the share of visits to a domain that end after a single page view
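The three tasks above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the book's implementation: the input shapes, the function names, and the `HyperLogLog` class are assumptions made for the example, and the estimator omits the refinements found in production libraries.

```python
import hashlib
import math
from collections import Counter


def pageviews_per_hour(events):
    """Count views per (url, hour bucket); events are (url, unix_timestamp) pairs."""
    return Counter((url, ts // 3600) for url, ts in events)


def bounce_rate(visits):
    """Share of visits with exactly one page view; each visit is a list of pages."""
    visits = list(visits)
    if not visits:
        return 0.0
    bounces = sum(1 for pages in visits if len(pages) == 1)
    return bounces / len(visits)


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hourly = pageviews_per_hour([("/a", 10), ("/a", 3599), ("/a", 3600)])
rate = bounce_rate([["/a"], ["/a", "/b"], ["/c"], ["/a", "/b", "/c"]])

visitors = HyperLogLog()
for i in range(10000):
    visitors.add(f"user-{i}")
    visitors.add(f"user-{i}")   # repeat visits do not change the estimate
approx_unique = visitors.count()
```

With 1,024 registers the estimator has a typical relative error of roughly 3%, and it uses a fixed amount of memory no matter how many visitors are added. That trade-off is what makes it attractive for unique-visitor counts at scale.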
Technology stack examples
Storage
HDFS
Batch
Hadoop
Serving
ElephantDB
Speed
Storm
* These are the technologies used in the 2015 book. Modern alternatives include Spark, Flink, and Kafka Streams.
