Source
Post in Book Cube
Original book review by Alexander Polomodov
Kafka: The Definitive Guide, 2nd Edition
Authors: Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty
Publisher: O'Reilly Media, Inc.
Length: 485 pages
Distributed stream processing platform: producers, consumers, partitions, replication, delivery semantics and Kafka Streams.
Book editions
Fall 2017 - 11 chapters covering the basics of Kafka, producers, consumers, administration and stream processing.
Late 2021 - expanded edition with an emphasis on cloud deployments and new platform capabilities.
Key Concepts of Kafka
Messages and batches
Basic data units in Kafka. Messages are grouped into batches for efficient transmission over the network.
Topics and partitions
Topics are logical message channels divided into partitions for parallel processing and scaling.
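The choice of partition for a keyed message can be sketched as follows. This is a minimal illustration, not Kafka's actual partitioner: the default partitioner uses murmur2 hashing, and `crc32` here is only a deterministic stand-in.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Sketch of key-based partitioning: the same key always maps to
    the same partition, which preserves per-key ordering. Kafka's
    default partitioner uses murmur2; crc32 is a stand-in here."""
    return zlib.crc32(key) % num_partitions
```

Because the mapping is deterministic, all messages for one key (e.g. one user ID) land on one partition, which is also why a skewed key distribution leads to uneven partition load.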
Producers
Clients that write messages to Kafka. Kafka is optimized for writes, sustaining high write throughput.
Consumers
Clients reading messages from Kafka. Consumer groups provide parallel processing and fault tolerance.
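How a group distributes partitions among its members can be sketched with the range-style assignment strategy; this is a simplified model of Kafka's per-topic RangeAssignor, not the client library's actual code.

```python
def range_assign(consumers: list, partitions: list) -> dict:
    """Simplified range-style assignment: partitions are split into
    contiguous chunks, and consumers earlier in sorted order receive
    one extra partition when the division is uneven."""
    members = sorted(consumers)
    per_member, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = per_member + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment
```

The sketch also shows why consumers beyond the partition count sit idle: once `partitions` is exhausted, later members receive empty lists.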
Related topic
Designing Data-Intensive Applications
Chapter 11 takes a deep dive into stream processing.
We recommend
Streaming Data
Architecture of streaming systems: from data collection to data consumption
Book structure (2nd Edition - 14 chapters)
Meet Kafka
Introduction to publish/subscribe messaging, history of creation on LinkedIn, basic concepts: messages, batches, schemas, topics, partitions, producers, consumers, brokers.
Managing Apache Kafka Programmatically (NEW)
AdminClient API: asynchronous interface for managing topics, configurations, consumer groups, cluster metadata. Leader election and reassigning replicas.
Installing Kafka
Installation and configuration of brokers, hardware selection, ZooKeeper/KRaft configuration. The 2nd edition places more emphasis on cloud deployments.
Kafka Producers
Configuration of producers, serialization (Avro, JSON), partitioners, headers, interceptors, quotas and bandwidth management.
Kafka Consumers
Consumer groups, partition assignment, offset management (auto-commit, sync, async), rebalance listeners, standalone consumers.
Transactions (NEW)
Exactly-once semantics, transactional producer API, read_committed isolation, idempotency and atomic writes.
Kafka Internals (under the hood)
Cluster membership, controller, replication, ISR, request processing, physical storage, log segments and indexes.
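The log-segment storage described in this chapter can be modeled in a few lines. This toy class mirrors the idea of Kafka's `.log`/`.index` file pair: a sparse index holds an entry for every Nth record, and a read binary-searches the index, then scans forward. The class and parameter names are illustrative, not Kafka's.

```python
import bisect

class LogSegment:
    """Toy model of a log segment with a sparse offset index."""

    def __init__(self, base_offset: int, interval: int = 4):
        self.base = base_offset
        self.records = []
        self.index = []          # sparse list of relative offsets
        self.interval = interval

    def append(self, record) -> int:
        rel = len(self.records)
        if rel % self.interval == 0:
            self.index.append(rel)   # index only every Nth record
        self.records.append(record)
        return self.base + rel       # absolute offset of the record

    def read(self, offset):
        rel = offset - self.base
        # binary-search the sparse index, then scan forward to the
        # exact record, as Kafka does within a segment
        start = self.index[bisect.bisect_right(self.index, rel) - 1]
        for pos in range(start, len(self.records)):
            if pos == rel:
                return self.records[pos]
        raise KeyError(offset)
```

Keeping the index sparse is the design point: it stays small enough to memory-map while still bounding the forward scan to at most `interval` records.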
Reliable Data Delivery
Delivery guarantees: at-least-once, at-most-once, exactly-once. Configuration of producer (acks, retries), consumer and broker for reliability.
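A producer configuration along the lines this chapter recommends for reliability might look like the dict below. The keys are real Apache Kafka producer configuration names; the exact values shown are a hedged example, not a one-size-fits-all prescription.

```python
# Example producer settings for reliable (at-least-once) delivery.
# Key names follow the Apache Kafka producer configuration; values
# here are illustrative defaults, to be tuned per workload.
reliable_producer_config = {
    "acks": "all",                    # wait for all in-sync replicas
    "enable.idempotence": True,       # broker de-duplicates retries
    "retries": 2147483647,            # retry transient errors
    "delivery.timeout.ms": 120000,    # overall bound on send + retries
    "max.in.flight.requests.per.connection": 5,  # safe with idempotence
}
```

With `acks="all"` a write is acknowledged only once every in-sync replica has it, so an acknowledged message survives the loss of the partition leader.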
Securing Kafka (NEW)
SSL/TLS encryption, SASL authentication (GSSAPI, PLAIN, SCRAM, OAUTHBEARER), authorization with ACLs, audit and security in production.
Building Data Pipelines
Kafka Connect: source and sink connectors, standalone and distributed mode, transformations, converters, dead letter queues.
Cross-Cluster Data Mirroring
MirrorMaker 2.0, multi-datacenter architecture (Active-Active, Active-Passive), replication of topics and consumer offsets between clusters.
Administering Kafka
Topic operations, consumer group management, partition reassignment, configuration for production, cluster operations.
Monitoring Kafka
JMX metrics, critical metrics for brokers, producers and consumers. Under-replicated partitions, lag monitoring, monitoring tools.
Stream Processing
Kafka Streams API: stateless and stateful operations, windowing, joins, KTables vs KStreams, exactly-once processing, testing.
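The classic Kafka Streams word-count topology (flatMap the lines into words, group by word, count) can be simulated in plain Python; the `Counter` below stands in for the changelog-backed state store behind `count()`. This is a conceptual sketch, not the Streams API.

```python
from collections import Counter

def word_count(lines: list) -> dict:
    """Simulation of the word-count topology:
    flatMapValues(split) -> groupBy(word) -> count()."""
    counts = Counter()
    for line in lines:                    # each record in the stream
        for word in line.lower().split(): # flatMapValues
            counts[word] += 1             # stateful count() update
    return dict(counts)
```

In real Kafka Streams the resulting counts form a KTable: a continuously updated view keyed by word, in contrast to the KStream of raw lines feeding it.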
New in 2nd edition
- AdminClient API: programmatic cluster management
- Transactions: exactly-once semantics and atomic operations
- Security: SSL/TLS, SASL, ACLs for production
- MirrorMaker 2.0: improved cross-cluster replication
- KRaft: mention of the new ZooKeeper-free mode
Message delivery semantics
At-most-once
The message is delivered at most once; data loss is possible. Suitable for metrics and logs.
At-least-once
The message is delivered at least once; duplicates are possible. Kafka's default mode.
Exactly-once
The message is delivered exactly once. Requires idempotent producer and transactional API.
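The broker-side half of the idempotent producer can be modeled as sequence-number de-duplication: each producer stamps its batches with an increasing sequence, and the broker drops anything it has already seen. This toy class illustrates the mechanism only; real Kafka tracks sequences per producer *and* partition, with epochs.

```python
class IdempotentBroker:
    """Toy model of broker-side de-duplication for an idempotent
    producer: each (producer_id, sequence) is appended at most once."""

    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence appended

    def append(self, producer_id, seq: int, record) -> bool:
        if self.last_seq.get(producer_id, -1) >= seq:
            return False     # duplicate retry, silently dropped
        self.last_seq[producer_id] = seq
        self.log.append(record)
        return True
```

A network timeout followed by a producer retry therefore leaves exactly one copy in the log, which is the "idempotent producer" building block that the transactional API extends to atomic multi-partition writes.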
Kafka cluster architecture
Partition replication
Leader accepts writes, followers replicate
Key Takeaways for System Design
- Partitioning is the key to horizontal scaling; the choice of partition key determines the load distribution.
- Replication provides fault tolerance; ISR (In-Sync Replicas) guarantees consistency.
- Consumer groups allow processing to scale: the number of active consumers ≤ the number of partitions.
- The retention policy determines how long data is stored; Kafka can act as a log store.
- Kafka Connect simplifies integration with external systems without writing code (source and sink connectors).
