Kafka matters not because it is a famous broker, but because the append-only log changes how services integrate, how streaming is built, and how data can be replayed.
In real engineering work, this book helps design partitioning, retention policy, consumer groups, delivery guarantees, and lag control as parts of one event flow rather than a pile of unrelated settings.
In interviews, reviews, and architecture conversations, it is especially useful when you need to show how per-partition ordering, lag spikes, rebalancing, and storage growth affect whole-system reliability, not just the messaging layer.
Practical value of this chapter
Design in practice
Provides a practical framework for Kafka as an event-flow foundation at scale.
Decision quality
Improves partitioning, retention-policy, and consumer-group choices for the workload.
Interview articulation
Helps explain delivery guarantees, replay, and DLQ strategy in production terms.
Risk and trade-offs
Surfaces ordering, consumer-lag spike, and storage-growth risks.
Source
Post in Book Cube
Original review by Alexander Polomodov
Kafka: The Definitive Guide, 2nd Edition
Authors: Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty
Publisher: O'Reilly Media, Inc.
Length: 485 pages
Practical guide to Kafka as a broker and partitioned log: producers, consumer groups, replication, delivery guarantees, Kafka Connect, Kafka Streams, and cluster operations.
Kafka is useful not just as a message queue, but as a partitioned log: producers write records to topics, while consumer groups read partitions in parallel and keep track of their committed offsets.
The book is most valuable when it connects delivery semantics, replication, rebalancing, consumer lag, and stream processing into one operational model. That is where Kafka stops being just a broker and becomes a data platform.
Book editions
Fall 2017: 11 chapters covering Kafka fundamentals, producers, consumers, administration, and stream processing.
Late 2021: expanded edition with dedicated chapters on programmatic cluster management, transactions, security, and cross-cluster replication.
Core Kafka concepts
Messages and record batches
A record carries a key, value, headers, and timestamp. Records are grouped into batches to reduce network and disk overhead.
Topics and partitions
A topic defines a logical stream of records, while partitions split that stream into independent ordered logs that can scale horizontally.
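The link between a record key, its partition, and its offset can be sketched with a toy in-memory log. This is a minimal illustration, not Kafka client code: `choose_partition` and `append` are hypothetical helpers, and `zlib.crc32` is only a stdlib stand-in for whatever hash a real partitioner uses.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (toy stand-in for a partitioner)."""
    # Assumption: crc32 is used only to keep the sketch stdlib-only.
    return zlib.crc32(key) % num_partitions

def append(log: dict, key: bytes, value: str, num_partitions: int) -> tuple:
    """Append a record to the partition chosen by its key; return (partition, offset)."""
    partition = choose_partition(key, num_partitions)
    records = log.setdefault(partition, [])
    records.append((key, value))
    return partition, len(records) - 1

log = {}
# Three events for the same key: same partition, increasing offsets.
placements = [append(log, b"order-42", f"event-{i}", 6) for i in range(3)]
```

Records that share a key land in one partition and keep their relative order there; records with different keys may land in different partitions, where no cross-partition order is defined.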
Producers
Clients that publish records to Kafka, choose topics and partition keys, and configure write acknowledgements.
Consumers
Clients that read records from partitions. Consumer groups divide partitions across members and scale processing.
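How a group divides partitions can be sketched with a simple round-robin assignment. This is an assumption-laden toy: `assign` is a hypothetical function, and Kafka's actual assignment strategies are more involved, but it shows the invariant that each partition has exactly one owner inside a group.

```python
def assign(partitions, members):
    """Spread partitions over group members round-robin; each partition gets one owner."""
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment

# Four partitions, two consumers: the work is split evenly.
balanced = assign([0, 1, 2, 3], ["c1", "c2"])

# Two partitions, three consumers: one member owns nothing.
oversized = assign([0, 1], ["c1", "c2", "c3"])
idle = [member for member, owned in oversized.items() if not owned]
```

In the second case the third consumer is idle, which is why adding consumers beyond the partition count does not add throughput.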
Related topic
Designing Data-Intensive Applications, 2nd Edition
Chapter 11 takes a deep dive into stream processing.
We recommend
Streaming Data
Architecture of streaming systems: from data collection to data consumption
Book structure (2nd Edition, 14 chapters)
Meet Kafka
Introduction to publish/subscribe messaging, Kafka's origins at LinkedIn, and the core vocabulary: messages, batches, schemas, topics, partitions, producers, consumers, and brokers.
Managing Apache Kafka Programmatically (new)
AdminClient API as an asynchronous interface for managing topics, configurations, consumer groups, and cluster metadata, plus leader election and replica reassignment.
Installing Kafka
Broker installation and configuration, server sizing, and ZooKeeper or KRaft setup. The 2nd edition adds more emphasis on cloud deployments.
Kafka Producers
Producer configuration, serialization with Avro or JSON, partitioners, headers, interceptors, quotas, and write-throughput control.
Kafka Consumers
Consumer groups, partition assignment, offset management (auto-commit, sync, async), rebalance listeners, and standalone consumers.
Transactions (new)
Exactly-once guarantees, the transactional producer API, read_committed isolation, idempotency, and atomic writes across multiple partitions.
Kafka Internals (under the hood)
Cluster membership, the controller role, replication, ISR, request processing, physical storage, log segments, and indexes.
Reliable Data Delivery
Delivery guarantees: at-most-once, at-least-once, and exactly-once. Producer acknowledgements, retries, consumer behavior, and broker settings that determine reliability in practice.
Securing Kafka (new)
SSL/TLS encryption, SASL authentication (GSSAPI, PLAIN, SCRAM, OAUTHBEARER), ACL-based authorization, auditing, and operational security.
Building Data Pipelines
Kafka Connect source and sink connectors, standalone and distributed modes, transformations, converters, and dead letter queues.
Cross-Cluster Data Mirroring
MirrorMaker 2.0, multi-datacenter architectures (Active-Active, Active-Passive), and replication of topics and consumer offsets between clusters.
Administering Kafka
Topic operations, consumer-group management, partition reassignment, production configuration, and day-to-day cluster operations.
Monitoring Kafka
JMX metrics and the key broker, producer, and consumer signals: under-replicated partitions, consumer lag, and monitoring tools.
Stream Processing
Kafka Streams API: stateless and stateful operations, windowing, stream joins, KTables and KStreams, exactly-once processing, and testing.
New in 2nd edition
- AdminClient API - programmatic cluster management
- Transactions - exactly-once guarantees and atomic writes
- Securing Kafka - SSL/TLS, SASL, ACLs, and operational security
- MirrorMaker 2.0 - improved cross-cluster replication
- KRaft - coverage of the ZooKeeper-free control-plane mode
Message delivery semantics
At-most-once
The consumer commits its offset before processing, or the producer sends without retrying failed writes. Records can be lost, but latency stays low; this can be acceptable for some metrics and technical logs.
At-least-once
The consumer commits its offset only after processing. If it fails before the commit, the same records are read again after a restart or rebalance. Duplicates are possible, so consumers must make their side effects idempotent.
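The difference between the first two guarantees comes down to whether the offset commit happens before or after processing. The sketch below simulates one crash under each ordering; `run` and its parameters are hypothetical names for a toy model, not a consumer API.

```python
def run(records, commit_first, crash_at):
    """Process records with one simulated crash; return (offset, value) pairs processed."""
    processed = []
    committed = 0        # next offset to read after a restart
    crashed = False
    while committed < len(records):
        offset = committed
        if commit_first:
            committed = offset + 1                      # commit, then process
            if offset == crash_at and not crashed:
                crashed = True
                continue                                # crash before processing: record lost
            processed.append((offset, records[offset]))
        else:
            processed.append((offset, records[offset]))  # process, then commit
            if offset == crash_at and not crashed:
                crashed = True
                continue                                # crash before commit: record re-read
            committed = offset + 1
    return processed

at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)

# An idempotent sink keyed by offset absorbs the duplicate.
sink = {}
for offset, value in at_least_once:
    sink[offset] = value
```

Committing first drops record `b`; committing last processes it twice, and the offset-keyed sink turns the duplicate into a harmless overwrite.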
Exactly-once
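Kafka combines the idempotent producer with the transactional API so that producer retries do not create duplicate records and a read-process-write cycle commits its output and its offsets atomically. Retries still happen; the guarantee is about the effects of processing within Kafka, not the removal of every retry.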
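The idea behind an idempotent producer can be sketched as per-producer sequence numbers checked on the partition side. This is a deliberately simplified model: `PartitionLog` is a hypothetical class, and the real protocol has more to it (for example, handling of sequence gaps), which the sketch ignores.

```python
class PartitionLog:
    """Toy partition that drops retried writes it has already appended."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        """Return True if the record was appended, False if it was a duplicate retry."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # already in the log: acknowledge without appending again
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True

partition = PartitionLog()
results = [
    partition.append("p1", 0, "created"),
    partition.append("p1", 1, "paid"),
    partition.append("p1", 1, "paid"),     # retry after a lost acknowledgement
    partition.append("p1", 2, "shipped"),
]
```

The retry is still sent, but it leaves no second copy in the log, which matches the point that the guarantee concerns effects rather than the absence of retries.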
Kafka cluster architecture
Partition replication
Leader accepts writes, followers replicate
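A rough way to picture the in-sync replica set: the leader is always in it, and a follower stays in it while it has caught up recently enough. The sketch below is a toy model under that assumption; `in_sync_replicas`, `can_ack_write`, `max_lag_ms`, and `min_insync_replicas` are illustrative names loosely modeled on broker settings, not a Kafka API.

```python
def in_sync_replicas(leader, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the leader plus every follower that caught up within max_lag_ms."""
    isr = {leader}
    for replica, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(replica)
    return isr

def can_ack_write(isr, min_insync_replicas):
    """A fully acknowledged write needs at least this many in-sync replicas."""
    return len(isr) >= min_insync_replicas

# Follower b2 caught up 1 second ago, b3 fell behind 9 seconds ago.
isr = in_sync_replicas("b1", {"b2": 9_000, "b3": 1_000}, now_ms=10_000, max_lag_ms=5_000)
accepts_with_two = can_ack_write(isr, min_insync_replicas=2)
accepts_with_three = can_ack_write(isr, min_insync_replicas=3)
```

With one slow follower the set shrinks to two replicas, so a requirement of two still accepts writes while a requirement of three rejects them. That is the durability versus availability trade-off in miniature.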
Key takeaways for system design
- Partitioning is the key to horizontal scaling. The partition key determines both load distribution and ordering boundaries.
- Replication provides fault tolerance. The in-sync replica set (ISR) lists the replicas that are caught up enough to count toward write acknowledgement.
- Consumer groups scale processing. Within one group, each partition is read by at most one consumer, so members beyond the partition count sit idle.
- Retention policy determines how long Kafka keeps records and therefore bounds replay and consumer recovery.
- Kafka Connect simplifies integration with external systems through source and sink connectors without bespoke application code.
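The interaction between consumer lag and retention can be made concrete with offset arithmetic. This is a minimal sketch with hypothetical helper names and made-up offsets; in a real cluster, what a consumer does when its offset has been deleted depends on its configuration.

```python
def consumer_lag(log_end_offset, committed_offset):
    """Records written to the partition but not yet committed by the group."""
    return max(0, log_end_offset - committed_offset)

def replay_start(log_start_offset, requested_offset):
    """Return (offset reading can actually start from, records no longer available)."""
    lost = max(0, log_start_offset - requested_offset)
    return max(log_start_offset, requested_offset), lost

# Assumed state: retention has deleted everything below offset 1000,
# the log ends at 5000, and the group last committed offset 400.
lag = consumer_lag(log_end_offset=5000, committed_offset=400)
start, lost = replay_start(log_start_offset=1000, requested_offset=400)
```

Here the group is 4600 records behind, but only 4000 of them still exist; the 600 records below the log start offset cannot be replayed. Lag alerts are therefore best set relative to the retention window, not only as an absolute number.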
Related chapters
- Streaming Data (short summary) - End-to-end streaming architecture perspective, from event ingestion to consumers and windowed processing.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - Foundational model of replication, consistency, and stream processing behind Kafka's design trade-offs.
- Distributed message queue - Practical case study on ordering, throughput, durability, and behavior under failure conditions.
- Event-driven architecture: Event Sourcing, CQRS, Saga - Architectural context where Kafka is often used as the transport backbone for event-driven workflows.
- Kappa Architecture: stream-first alternative to Lambda - Single processing path model where the Kafka log serves as the source of truth for live processing and historical replay.
- Data Pipeline / ETL / ELT Architecture - How Kafka fits into production data platforms across ingestion, orchestration, data quality, and operations.
- Enterprise Integration Patterns (short summary) - Integration pattern language for designing robust producer/consumer and routing interactions.
- Big Data: Principles and best practices of scalable realtime data systems (short summary) - Strategic context for real-time data systems where Kafka frequently becomes a central platform component.
- Google Global Network: Evolution and Architectural Principles for the AI Era - Network context for cross-region replication and high-throughput stream transport at global scale.
- Google TPU: architecture evolution and impact on ML systems - AI workload context where Kafka-style logs and streams feed data and ML pipelines.
