Distributed Message Queue — System Design Space

A distributed message queue is not just a buffer between services. It defines ordering boundaries, delivery rules, retry behavior, and what happens when consumers fall behind.

The chapter ties together event publication, partitioning, offset tracking, consumer groups, redelivery, and quarantine for problematic records into one architecture.

For interviews and engineering discussions, this case is useful because it quickly shows whether you can distinguish plain high throughput from truly reliable asynchronous integration.

Delivery Semantics

The key choice is not the broker brand. It is the delivery model the business can tolerate and how the system handles duplicates, loss, and retries.

Consumer Groups

Parallelism does not appear automatically: you need to reason about partition ownership, rebalance moments, and where ordering is lost.

Redelivery

The retry path should isolate temporary failures instead of turning into a cluster-wide retry storm.

Consumer Lag

Backlog growth becomes actionable only when you can connect lag to business delay, overloaded handlers, and degraded modes.

Acing SDI

Practice task from chapter 9

A practical case about distributed message queues as a base layer for asynchronous service integration.

Distributed message queues are not just buffers between services. They set the rules: which delivery semantics are allowed, where ordering holds, and what happens when consumers fall behind the incoming stream. Those rules decide whether a handler crash becomes a lost message or stays recoverable.

Representative systems

Apache Kafka: Partitioned log, consumer groups, replay, and streaming-heavy workloads.
RabbitMQ: Flexible routing, explicit queues, and fine-grained acknowledgement behavior.
Apache Pulsar: Separated storage and compute, multi-tenant topics, and isolation between workloads.
NATS JetStream: Lightweight event bus with persisted streams and simple operational shape.
AWS SQS/SNS: Managed async messaging for cloud-native integration patterns.

Functional requirements

Publishing and reading a message is only half the job; the second half starts with acknowledgements. The system also needs offset tracking, replay, and parallel processing through consumer groups.

Core API

POST /topics/:name/messages publishes a record to a topic
GET /topics/:name/poll lets consumers fetch records
POST /offsets/commit confirms processed offsets
POST /topics/:name/replay re-reads messages from a chosen offset
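
The four endpoints can be illustrated with a toy, single-partition, in-memory model. This is a minimal sketch rather than a broker: the class name `InMemoryTopic`, the group name `billing`, and the convention that a committed offset means "next offset to read" are assumptions made for the example.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy single-partition topic: an append-only list plus per-group offsets."""

    def __init__(self):
        self.log = []                      # append-only records
        self.committed = defaultdict(int)  # group -> next offset to read

    def publish(self, record):
        """Mirrors POST /topics/:name/messages: append and return the offset."""
        self.log.append(record)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """Mirrors GET /topics/:name/poll: read from the committed position."""
        start = self.committed[group]
        end = min(start + max_records, len(self.log))
        return [(offset, self.log[offset]) for offset in range(start, end)]

    def commit(self, group, offset):
        """Mirrors POST /offsets/commit: everything up to `offset` is done."""
        self.committed[group] = max(self.committed[group], offset + 1)

    def replay(self, group, offset):
        """Mirrors POST /topics/:name/replay: rewind the group to `offset`."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.committed[group] = offset


topic = InMemoryTopic()
for status in ("created", "paid", "shipped"):
    topic.publish({"key": "order:1234", "status": status})

batch = topic.poll("billing")           # offsets 0..2
topic.commit("billing", batch[-1][0])   # confirm the whole batch
after_commit = topic.poll("billing")    # nothing new to read
topic.replay("billing", 1)              # rewind to offset 1
replayed = topic.poll("billing")        # offsets 1 and 2 are delivered again
```

Note that polling without committing returns the same records again, which is the behavior at-least-once delivery builds on.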

Processing reliability

Parallel consumers organized into groups, with each consumer owning a subset of partitions
Retry flow plus DLQ boundary for irrecoverable messages
At-least-once delivery as the practical baseline
Overload protection through throttling and bounded retries
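
The retry flow and DLQ boundary from the list above can be sketched as a small routing function. The exception classes, the `route` helper, the `delay_s` field, and the default of three retries are hypothetical choices for illustration, not the behavior of any particular broker.

```python
class TransientError(Exception):
    """Failure that may succeed later, such as a downstream timeout."""


class PermanentError(Exception):
    """Failure that will never succeed, such as a malformed payload."""


def backoff_delay(retry_count, base_s=1.0, cap_s=60.0):
    """Exponential backoff in seconds for a 1-based retry count, capped."""
    return min(cap_s, base_s * 2 ** (retry_count - 1))


def route(record, handler, retry_topic, dlq, max_retries=3):
    """Run the handler once and decide where the record goes next."""
    try:
        handler(record["payload"])
        return "done"
    except PermanentError:
        dlq.append(record)              # poison message: quarantine at once
        return "dlq"
    except TransientError:
        if record["retry_count"] >= max_retries:
            dlq.append(record)          # retry budget exhausted
            return "dlq"
        retried = {**record, "retry_count": record["retry_count"] + 1}
        retried["delay_s"] = backoff_delay(retried["retry_count"])
        retry_topic.append(retried)     # redeliver later, outside the main flow
        return "retry"


def always_times_out(payload):
    raise TransientError("downstream timeout")


retry_topic, dlq, delays = [], [], []
record = {"message_id": "m-1", "retry_count": 0, "payload": {"amount": 9900}}
outcome = route(record, always_times_out, retry_topic, dlq)
while retry_topic:
    pending = retry_topic.pop(0)
    delays.append(pending["delay_s"])   # a real consumer would wait this long
    outcome = route(pending, always_times_out, retry_topic, dlq)
```

Because the retry count travels in the record itself, the bound holds even when different workers pick up each redelivery.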

Non-functional requirements

Peak QPS numbers say little on their own if it is unclear how the system holds under load. What matters just as much is predictable delivery lag, stable behavior during bursts, and graceful degradation once the backlog starts growing.

Requirement	Target	Reason
Throughput	High even during short bursts	Producers should not stall just because consumers temporarily fall behind
Delivery lag	Controlled end-to-end delay	Business flow depends on time-to-processing, not only on broker append latency
Scalability	Growth through partitions and consumer groups	Capacity should grow without a full redesign of the queue layer
Durability	Confirmed writes survive node loss	A single broker failure should not erase already acknowledged records
Predictable degradation	Bounded retries and isolated failures	Retry storms should not collapse the whole asynchronous pipeline

Deep dive

Kafka (book summary)

Partitioned logs, consumer groups, replication, and practical operational trade-offs.

Architecture overview

The baseline queue shape combines broker ingress, a partitioned replicated log, explicit acknowledgement policy, and separate retry and quarantine paths for problematic records. That acknowledgement policy is what defines which records count as written at the moment of failure.

Diagram: partitioned log, consumer groups, and retry control. It covers the publish flow, the consume flow, and the retry/DLQ control loop.

Ingress Plane: producers, broker, and routing (ingress and partition choice)
Producer Services: publish events
Broker Frontend: ingress API
Topic Router: partition key

Log Plane: partitions and replicated log (durable storage)
Topic Partitions: P0, P1, P2...
Replicated Log: leader and replicas

Delivery Plane: consumers, offsets, and DLQ (processing control)
Consumer Group: parallel workers
Offset Store: commit state
Retry Topic: backoff queue
DLQ: poison isolation

Data Model

Queue event structure and placement model inside a partitioned log.

Event Envelope

key: order:1234
payload: { status: "created", amount: 9900 }
headers: message_id, trace_id, retry_count

Log Placement

partitioning: hash(key) -> topic: orders / partition: 7
offsets: offset: 912334 (append-only)
retention: 7d / 100GB per partition / compaction

Ordering: Guaranteed within a partition, but not across partitions.
Replay: Offset lets consumers resume processing after crashes.
Idempotency: `message_id` helps deduplicate repeated deliveries.
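
The envelope and the `hash(key)` placement rule can be sketched as follows. The `Envelope` dataclass, the partition count of 12, and the use of SHA-256 as the hash are assumptions for illustration; real brokers ship their own partitioners. Python's built-in `hash()` is avoided on purpose because it is salted per process for strings, so it would not give stable placement across restarts.

```python
import hashlib
from dataclasses import dataclass, field

NUM_PARTITIONS = 12  # hypothetical partition count for the "orders" topic


@dataclass
class Envelope:
    key: str
    payload: dict
    headers: dict = field(default_factory=dict)


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash of the key, reduced modulo the partition count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


partitions = {p: [] for p in range(NUM_PARTITIONS)}


def append(envelope: Envelope):
    """Append to the partition chosen by the key; return (partition, offset)."""
    p = partition_for(envelope.key, NUM_PARTITIONS)
    partitions[p].append(envelope)
    return p, len(partitions[p]) - 1


created = append(Envelope("order:1234", {"status": "created", "amount": 9900},
                          {"message_id": "m-1", "retry_count": 0}))
paid = append(Envelope("order:1234", {"status": "paid", "amount": 9900},
                       {"message_id": "m-2", "retry_count": 0}))
```

Both events share a key, so they land in the same partition at consecutive offsets, which is what makes per-key ordering possible.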

Publish and consume path through components

The append path is the most visible part, but the more interesting question is what happens after fetch: business processing, offset commit, retry routing, DLQ isolation, and rebalance behavior when workers change.
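
To make the rebalance point concrete, here is a minimal sketch of partition ownership inside one consumer group, using a naive round-robin assignment. The `assign` function and the consumer names are hypothetical; production brokers offer several assignment strategies.

```python
def assign(partitions, consumers):
    """Round-robin: every partition gets exactly one owner in the group."""
    members = sorted(consumers)
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ownership = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        ownership[members[index % len(members)]].append(partition)
    return ownership


before = assign(range(6), ["c1", "c2", "c3"])
after = assign(range(6), ["c1", "c2"])  # c3 left the group

# Partitions whose owner changed during the rebalance.
moved = [p for member, owned in after.items()
         for p in owned if p not in before.get(member, [])]
```

With this naive strategy, partitions 3 and 4 change owners even though their previous owners are still alive. Every moved partition pauses processing and may replay uncommitted records, which is why strategies that keep existing assignments where possible are attractive.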

Publish and consume path explorer

Walkthrough of how a record moves from producer ingress to consumer processing and offset commit.

Producer Service: publish batch
Broker Frontend: auth + quota
Topic Router: hash(key)
Leader Append: append log
Replicas + Ack: ISR/quorum

Publish path: the producer sends a record through broker ingress, the record is appended on the partition leader, and the ack returns once the chosen replication policy is satisfied.

Publish path

Partition key defines ordering scope and load distribution across partitions.
Ack policy (leader vs quorum) controls latency vs durability trade-off.
Producer batching and compression are often essential for burst-heavy traffic.
Replication lag should be monitored separately from end-to-end consumer lag.
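
The latency versus durability trade-off in the ack policy can be shown with a toy timing model. The function name, the policy labels, and the millisecond figures are assumptions; the model only says that the producer waits for as many replicas as the policy demands.

```python
def ack_latency_ms(leader_ms, follower_ms, policy):
    """Time until the producer sees an ack.

    leader_ms: time for the leader's local append.
    follower_ms: time until each follower has the record, from the same start.
    """
    if policy == "leader":
        return leader_ms                  # fastest, lost if the leader dies now
    all_times = sorted([leader_ms] + list(follower_ms))
    if policy == "quorum":
        majority = len(all_times) // 2 + 1
        # The leader's own append always counts toward the majority.
        return max(leader_ms, all_times[majority - 1])
    if policy == "all":
        return all_times[-1]              # slowest replica sets the pace
    raise ValueError(f"unknown policy: {policy}")


# One fast follower and one slow follower behind a 2 ms leader append.
results = {policy: ack_latency_ms(2, [5, 40], policy)
           for policy in ("leader", "quorum", "all")}
```

In this example the quorum policy tolerates the slow follower, while waiting for all replicas ties producer latency to the slowest one.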

Delivery semantics and operational control

Delivery choice is always a trade-off between loss risk, duplicate risk, and implementation complexity. Separately, you need to be explicit about consumer lag, backlog growth, and when the queue starts degrading the whole downstream system.

Delivery semantics

At-most-once: no redelivery and therefore no duplicates, but a higher risk of loss.
At-least-once: practical baseline, but consumers must stay idempotent.
Effectively-once: comes from dedupe and side-effect control, not from a magic broker flag.
Ordering is usually guaranteed per partition, not across the entire topic.
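
Under at-least-once delivery the same record can arrive twice, so the handler has to make the second arrival a no-op. A minimal sketch of deduplication by `message_id` follows; the `IdempotentHandler` class and the in-memory `seen` set are assumptions, and a real system would need to persist the processed IDs together with the side effect, for example in the same database transaction.

```python
class IdempotentHandler:
    """Applies each message_id at most once, even if it is delivered twice."""

    def __init__(self):
        self.seen = set()   # processed message IDs (in memory for the sketch)
        self.balance = 0    # the side effect we must not apply twice

    def handle(self, message):
        if message["message_id"] in self.seen:
            return False                 # duplicate delivery: skip the effect
        self.balance += message["amount"]
        self.seen.add(message["message_id"])
        return True


handler = IdempotentHandler()
deliveries = [
    {"message_id": "m-1", "amount": 9900},
    {"message_id": "m-2", "amount": 100},
    {"message_id": "m-1", "amount": 9900},  # redelivered after a lost ack
]
applied = [handler.handle(message) for message in deliveries]
```

The broker still delivered three times; the effectively-once result comes from the handler, which matches the point that it is not a broker flag.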

Operational controls

Track backlog depth, acknowledgement time, and consumer lag separately.
Bound retries and use backoff so redelivery does not turn into a cluster-wide storm.
Make retention and replay policy explicit instead of treating them as defaults.
Keep a runbook for quarantine, manual inspection, and safe replay into the main flow.
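
Consumer lag becomes easier to act on once it is translated into time. The sketch below assumes that a log end offset means "next offset to be written" and a committed offset means "next offset to read", and that rates are measured in records per second; the function names and the numbers are illustrative.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Records not yet processed, per partition."""
    return {p: end - committed_offsets.get(p, 0)
            for p, end in log_end_offsets.items()}


def estimated_drain_seconds(total_lag, consume_rate, produce_rate):
    """Time to clear the backlog if both rates stay constant."""
    if total_lag == 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return float("inf")   # consumers never catch up at these rates
    return total_lag / net_rate


lag = partition_lag({0: 1000, 1: 1500, 2: 900}, {0: 950, 1: 900, 2: 900})
total = sum(lag.values())
drain = estimated_drain_seconds(total, consume_rate=500, produce_rate=370)
stuck = estimated_drain_seconds(total, consume_rate=300, produce_rate=370)
```

Per-partition numbers matter as much as the total: here almost all of the backlog sits in partition 1, which points at a hot key or a slow worker rather than a capacity problem across the group.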

Common mistakes

Promising global ordering without explaining cost, coordination, and lost parallelism.
Relying on broker features alone and forgetting idempotent business handlers.
No clear quarantine path for poison messages and no bounded retry policy.
Using broker throughput as a proxy for real business completion latency.

What to make explicit in interviews

Where ordering is guaranteed: per partition, per key, or nowhere globally.
Which delivery semantics are chosen and why the business can tolerate their failure mode.
When offsets are committed and what happens if the consumer crashes between side effect and commit.
How retries, quarantine, manual remediation, and safe replay work together.
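
The question about crashing between side effect and commit can be answered with a small simulation. The `run` function, the two order labels, and the single injected crash are assumptions made for the sketch; it only models one consumer reading one partition.

```python
def run(log_length, order, crash_at):
    """Return the offsets whose side effect ran, with one crash injected.

    order: "process_then_commit" or "commit_then_process".
    crash_at: offset at which the consumer dies between the two steps.
    """
    effects = []
    committed = 0      # next offset to read after a restart
    crashed = False
    while committed < log_length:
        offset = committed
        if order == "process_then_commit":
            effects.append(offset)            # side effect first
            if offset == crash_at and not crashed:
                crashed = True
                continue                      # restart: offset is fetched again
            committed = offset + 1
        else:
            committed = offset + 1            # commit first
            if offset == crash_at and not crashed:
                crashed = True
                continue                      # restart: offset is skipped
            effects.append(offset)
    return effects


at_least_once = run(3, "process_then_commit", crash_at=1)
at_most_once = run(3, "commit_then_process", crash_at=1)
```

Committing after processing produces a duplicate for the crashed offset, while committing before processing loses it. That is the concrete failure mode behind the choice of delivery semantics.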

References

Apache Kafka. Official documentation: log, partitions, replication, delivery semantics (Apache Software Foundation)
Apache Pulsar. Messaging Concepts: subscriptions, acknowledgements, retry and dead-letter topics (Apache Software Foundation)
Alex Xu. How to Choose a Message Queue: Kafka vs RabbitMQ (ByteByteGo, 2023)
Martin Kleppmann. Designing Data-Intensive Applications: message brokers and stream processing (O'Reilly)

Related chapters

Event-Driven Architecture - Queue-centric event routing patterns, saga choreography, and async domain workflows.
Kafka (book summary) - Detailed treatment of partitioned logs, consumer groups, and messaging trade-offs.
System Design for Interviews and Beyond (short summary) - Interview framing techniques for high-throughput asynchronous integration systems.
Consistency and idempotency patterns - Idempotent consumer design and duplicate-effect control under at-least-once delivery.
Chat System - Applied real-time scenario where queues drive fan-out, delivery guarantees, and retries.