Big Data (short summary) — System Design Space

Big Data still matters not because it once popularized Lambda Architecture, but because it remains a sharp way to see the cost of separating batch, serving, and speed layers.

In real engineering practice, the book helps show where immutable data, approximate algorithms, and separate compute paths are justified and where the architecture starts losing to its own complexity.

In interviews and architecture discussions, it is especially useful when you need to speak honestly about the point where latency, correctness, and complexity stop coexisting peacefully in one design.

Practical value of this chapter

Design in practice

Builds an end-to-end view of batch paths, stream processing, and serving layers for high-volume analytics.

Decision quality

Improves architecture-style choices around latency, recomputation cost, and result correctness.

Interview articulation

Adds concrete criteria for Lambda, Kappa, or hybrid decisions in interview answers.

Risk and trade-offs

Shows where data architecture starts degrading under complexity growth and changing input data.

Source

Book Review

Original review by Alexander Polomodov on tellmeabout.tech

Big Data: Principles and Best Practices of Scalable Realtime Data Systems

Authors: Nathan Marz, James Warren
Publisher: Manning Publications
Length: 328 pages

Nathan Marz on Lambda Architecture: batch, serving, and speed layers, immutable event history, batch and realtime views, HyperLogLog, and the cost of complexity.

Original

Lambda Architecture

Batch Layer

Stores the master dataset as immutable event history and recomputes accurate views over the full history.

Serving Layer

Indexes precomputed views and serves fast responses while the batch layer prepares the next full recomputation.

Speed Layer

Processes fresh events between batch recomputations and builds approximate aggregates for low-latency answers.

Lambda Architecture map

master dataset + batch views + realtime views

Raw event log (immutable append-only source) -> Batch layer (periodic recompute) -> Batch views (accurate aggregates over the full dataset)

Raw event log (same source) -> Speed layer (realtime stream) -> Realtime views (low-latency approximations between batch recomputations)

Batch views + Realtime views -> Serving layer (query merge) -> Query API (final response)

Lambda Architecture combines accurate batch recomputation, a fast speed layer, and a single serving layer for queries.
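
The query-time merge can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the book: the class name `PageviewQuery`, its methods, and the `(url, hour)` key are all hypothetical, and both views are plain in-memory counters rather than real stores.

```python
from collections import Counter


class PageviewQuery:
    """Toy serving path: answers = batch view + realtime delta (hypothetical names)."""

    def __init__(self):
        self.batch_view = Counter()     # accurate counts up to the last batch run
        self.realtime_view = Counter()  # counts for events seen since that run

    def record_realtime(self, url, hour):
        # Speed layer: incrementally update the realtime view for a fresh event.
        self.realtime_view[(url, hour)] += 1

    def load_batch_view(self, view):
        # A finished batch run replaces the old batch view. Simplifying
        # assumption: the new view covers everything the speed layer has seen,
        # so the realtime view can be discarded entirely.
        self.batch_view = Counter(view)
        self.realtime_view.clear()

    def pageviews(self, url, hour):
        # Query-time merge of the two views.
        key = (url, hour)
        return self.batch_view[key] + self.realtime_view[key]


q = PageviewQuery()
q.load_batch_view({("/home", 10): 120})
q.record_realtime("/home", 10)
print(q.pageviews("/home", 10))
```

The sketch also shows where the complexity comes from: the same metric is maintained by two code paths, and the handover between them (which realtime entries a new batch view makes obsolete) has to be defined precisely in a real system.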

"The Lambda Architecture provides a general-purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency"

— Nathan Marz

Desired properties of a big data processing system

Before choosing any tools, the authors pin down what a big data processing system has to survive in production. This is the checklist every later decision is measured against:

Horizontal scaling

Capacity grows by adding nodes, not by redesigning the whole system.

Fault tolerance

Hardware failures must not cost you the history or the derived views.

Human-error recovery

Bad code or bad data can be repaired by recomputing from the original history.

Low latency

User-facing queries can return fresh answers without waiting for a full batch run.

Flexible computation

New views and algorithms can be added over already accumulated data.

Controlled complexity

The team can reason about which path owns accuracy, freshness, and serving.
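
Human-error recovery is the property that follows most directly from keeping raw history immutable. A minimal sketch, assuming a hypothetical bug (URLs counted case-sensitively) and hypothetical names `MASTER_DATASET`, `buggy_view`, and `fixed_view`:

```python
from collections import Counter

# Immutable, append-only master dataset (a tuple here, so it cannot be edited in place).
MASTER_DATASET = (
    {"url": "/home", "user": "u1"},
    {"url": "/Home", "user": "u2"},
    {"url": "/docs", "user": "u1"},
)


def buggy_view(events):
    # Hypothetical bug: "/home" and "/Home" are counted as different pages.
    return Counter(e["url"] for e in events)


def fixed_view(events):
    # Corrected logic: normalize the URL before counting.
    return Counter(e["url"].lower() for e in events)


view = buggy_view(MASTER_DATASET)  # the wrong view that was deployed
view = fixed_view(MASTER_DATASET)  # recovery: recompute from the untouched history
print(view["/home"])
```

Because the view is a pure function of the history, the repair is a redeploy and a recompute. Nothing in the raw data had to be patched, which would not be true if the counters themselves were the only stored state.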

Book structure

The book is divided into parts that map to the layers of Lambda Architecture:

Part 1: Batch Layer

Data model, master dataset, and accurate view computation over the full history.

Data Model, Master Dataset, Batch Views, MapReduce

Part 2: Serving Layer

Indexing and serving precomputed views for fast queries.

Indexing, Batch Views Serving, ElephantDB

Part 3: Speed Layer

Fast processing of fresh events and compensation for the delay between batch recomputations.

Realtime Views, Stream Processing, Apache Storm, Micro-batching

Practical examples

The theory is tested on everyday analytics tasks (counting events, estimating unique visitors, and computing rates), where exact recomputation and an approximate answer have very different costs:

📊

URL Page Views

Counting page views by URL and time interval.

👥

Unique Visitors

Estimating unique visitors with HyperLogLog.

🚨

Bounce Rate

Calculating bounce rate across a site or domain.
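
To make the unique-visitors case concrete, here is a compact HyperLogLog sketch in Python. It is a simplified illustration, not the book's implementation: the class, the SHA-1-based 64-bit hash, and the default precision `p=12` are assumptions chosen for readability, and the bias corrections used by production libraries are reduced to the basic small-range (linear counting) fallback.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # constant for m >= 128

    def add(self, item):
        # Take a 64-bit hash of the item (first 8 bytes of SHA-1).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for visitor_id in ("u1", "u2", "u1", "u3"):
    hll.add(visitor_id)
print(hll.count())
```

Two properties matter for this architecture. Memory is fixed (4096 small registers at `p=12`) regardless of how many visitors are seen, with a typical relative error of about 1.04 / sqrt(m). And sketches merge by register-wise maximum, so per-interval sketches can be combined into longer ranges without rescanning raw events, which an exact set of visitor IDs can only do at much higher storage cost.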

Technology stack examples

Storage: HDFS
Batch layer: Hadoop
Serving layer: ElephantDB
Speed layer: Storm

* Technologies from the 2015 book. Modern alternatives include Spark, Flink, and Kafka Streams.

Related chapters

Designing Data-Intensive Applications, 2nd Edition (short summary) - Foundational distributed-data theory that complements the Lambda model and clarifies core trade-offs.
Streaming Data (short summary) - Hands-on stream processing and modern operational practice around Lambda's speed-layer ideas.
Kafka: The Definitive Guide, 2nd Edition (short summary) - Event-log platform foundations for ingestion and streaming backbones in large-scale data systems.
Kappa Architecture: stream-first alternative to Lambda - Evolution of Lambda ideas toward one stream-first processing path without a separate batch branch.
Data Pipeline / ETL / ELT Architecture - Operational perspective on data pipelines, orchestration strategy, and data quality controls.
Distributed message queue - Practical queueing case focused on ordering, durability, and throughput under real load.
Distributed file system (GFS/HDFS) - Storage-layer fundamentals behind the batch side of Lambda and distributed file-system architecture.
Data Mesh in Action (short summary) - Organizational evolution from centralized Lambda-era platforms to domain-oriented data ownership.
T-Bank data platform overview - Real platform case combining batch and stream processing, lakehouse patterns, and product thinking around data.
Google Global Network: Evolution and Architectural Principles for the AI Era - Network context for cross-region transfer and low-latency processing of large-scale data streams.

Where to find the book

Original

manning.com

Big Data: Principles and Best Practices of Scalable Realtime Data Systems