A distributed message queue is not just a buffer between services. It defines ordering boundaries, delivery rules, retry behavior, and what happens when consumers fall behind.
The chapter ties together event publication, partitioning, offset tracking, consumer groups, redelivery, and quarantine for problematic records into one architecture.
For interviews and engineering discussions, this case is useful because it quickly shows whether you can distinguish plain high throughput from truly reliable asynchronous integration.
Delivery Semantics
The key choice is not the broker brand. It is the delivery model the business can tolerate and how the system handles duplicates, loss, and retries.
Consumer Groups
Parallelism does not appear automatically: you need to reason about partition ownership, rebalance moments, and where ordering is lost.
Redelivery
The retry path should isolate temporary failures instead of turning into a cluster-wide retry storm.
Consumer Lag
Backlog growth becomes actionable only when you can tie lag to business delay, overloaded handlers, and degraded modes.
Acing SDI
Practice task from chapter 9
A practical case about distributed message queues as a base layer for asynchronous service integration.
Representative systems
- Apache Kafka: Partitioned log, consumer groups, replay, and streaming-heavy workloads.
- RabbitMQ: Flexible routing, explicit queues, and fine-grained acknowledgement behavior.
- Apache Pulsar: Separated storage and compute, multi-tenant topics, and isolation between workloads.
- NATS JetStream: Lightweight event bus with persisted streams and simple operational shape.
- AWS SQS/SNS: Managed async messaging for cloud-native integration patterns.
Functional requirements
The system has to do more than accept and return messages. It also needs explicit acknowledgement flow, offset tracking, replay, and parallel processing through consumer groups.
Core API
- `POST /topics/:name/messages` publishes a record to a topic
- `GET /topics/:name/poll` lets consumers fetch records
- `POST /offsets/commit` confirms processed offsets
- `POST /topics/:name/replay` re-reads messages from a chosen offset
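To make the four endpoints concrete, here is a minimal in-memory sketch of the same operations. It assumes a single partition and a single process; the class name `InMemoryTopic`, the group name `billing`, and the method signatures are illustrative and not part of any real broker API.

```python
from collections import defaultdict


class InMemoryTopic:
    """Single-partition, in-memory stand-in for a topic (illustration only)."""

    def __init__(self) -> None:
        self.log: list[dict] = []  # append-only record log
        # consumer group -> next offset to read
        self.committed: dict[str, int] = defaultdict(int)

    def publish(self, key: str, payload: dict) -> int:
        """POST /topics/:name/messages: append a record, return its offset."""
        self.log.append({"key": key, "payload": payload})
        return len(self.log) - 1

    def poll(self, group: str, max_records: int = 10) -> list[tuple[int, dict]]:
        """GET /topics/:name/poll: fetch records from the committed position."""
        start = self.committed[group]
        return list(enumerate(self.log[start:start + max_records], start=start))

    def commit(self, group: str, offset: int) -> None:
        """POST /offsets/commit: mark everything up to `offset` as processed."""
        self.committed[group] = max(self.committed[group], offset + 1)

    def replay(self, group: str, from_offset: int) -> None:
        """POST /topics/:name/replay: rewind the group to a chosen offset."""
        self.committed[group] = from_offset


topic = InMemoryTopic()
for status in ("created", "paid", "shipped"):
    topic.publish("order:1234", {"status": status})

batch = topic.poll("billing", max_records=2)  # offsets 0 and 1
topic.commit("billing", batch[-1][0])         # commit through offset 1
remaining = topic.poll("billing")             # only offset 2 is left
topic.replay("billing", 0)                    # rewind to the start
replayed = topic.poll("billing")              # all three records again
```

Note that `poll` does not advance the position by itself. Only `commit` does, which is what makes redelivery after a crash possible.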
Processing reliability
- Parallel consumers organized into groups, with each partition owned by one consumer in the group at a time
- Retry flow plus DLQ boundary for irrecoverable messages
- At-least-once delivery as the practical baseline
- Overload protection through throttling and bounded retries
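The retry and DLQ items above can be sketched as a small wrapper around the business handler. This is a sketch under stated assumptions: `TransientError`, `process_with_retry`, and the list used as a DLQ are hypothetical names, and the backoff has no jitter so that the example stays deterministic (production systems commonly add jitter).

```python
import time


class TransientError(Exception):
    """Failure that may succeed on retry (timeout, throttling)."""


def process_with_retry(record, handler, dlq, max_attempts=3,
                       base_delay=0.01, sleep=time.sleep):
    """Run handler with bounded retries; quarantine the record on failure.

    Returns True if the record was processed, False if it went to the DLQ.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return True
        except TransientError as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Exponential backoff: 1x, 2x, 4x ... of base_delay.
                sleep(base_delay * 2 ** (attempt - 1))
        except Exception as exc:
            # Treated as non-retryable: another attempt cannot help.
            last_error = str(exc)
            break
    dlq.append({"record": record, "error": last_error, "attempts": attempt})
    return False


calls = {"n": 0}


def flaky(record):
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")


def poison(record):
    """Always fails with a non-retryable error."""
    raise ValueError("malformed payload")


dlq = []
delays = []  # recorded instead of sleeping, to keep the demo fast
ok = process_with_retry({"id": 1}, flaky, dlq, sleep=delays.append)
bad = process_with_retry({"id": 2}, poison, dlq, sleep=delays.append)
```

The transient failure is retried with growing delays and eventually succeeds, while the poison record goes to the DLQ after one attempt instead of consuming the retry budget.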
Non-functional requirements
Queue design is not only about peak QPS. You also need predictable delivery lag, stable behavior during bursts, and graceful degradation when backlog starts growing.
| Requirement | Target | Reason |
|---|---|---|
| Throughput | High even during short bursts | Producers should not stall just because consumers temporarily fall behind |
| Delivery lag | Controlled end-to-end delay | Business flow depends on time-to-processing, not only on broker append latency |
| Scalability | Growth through partitions and consumer groups | Capacity should grow without a full redesign of the queue layer |
| Durability | Confirmed writes survive node loss | A single broker failure should not erase already acknowledged records |
| Predictable degradation | Bounded retries and isolated failures | Retry storms should not collapse the whole asynchronous pipeline |
Deep dive
Kafka (book summary)
Partitioned logs, consumer groups, replication, and practical operational trade-offs.
Architecture overview
The baseline queue shape combines broker ingress, a partitioned replicated log, explicit acknowledgement policy, and separate retry and quarantine paths for problematic records.
Diagram: partitioned log, consumer groups, and retry control. It covers the publish flow, the consume flow, and the retry/DLQ control loop.
Data Model
Queue event structure and placement model inside a partitioned log.
Event Envelope
- key: `order:1234`
- payload: `{ status: "created", amount: 9900 }`
- headers
Log Placement
- partitioning: `hash(key) -> topic: orders / partition: 7`
- offsets: `offset: 912334` (append-only)
- retention: 7d / 100GB per partition / compaction
Ordering
Guaranteed within a partition, but not across partitions.
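The `hash(key) -> partition` placement can be sketched as follows. This assumes a fixed partition count and uses `zlib.crc32` only as a stand-in for a stable hash; it is not the partitioner of any particular broker, and `partition_for` and `NUM_PARTITIONS` are illustrative names.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a deterministic checksum is used instead.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 12  # assumed partition count for the example

p1 = partition_for("order:1234", NUM_PARTITIONS)
p2 = partition_for("order:1234", NUM_PARTITIONS)
# Same key, same partition: all events for order:1234 stay in one ordered log.
```

Because the result depends on `num_partitions`, changing the partition count remaps keys with this scheme. Per-key ordering holds only while the key-to-partition mapping stays stable.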
Replay
Committed offsets let consumers resume after a crash, and rewinding to an earlier offset lets them re-read records that are still within retention.
Idempotency
`message_id` helps deduplicate repeated deliveries.
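A minimal sketch of `message_id`-based deduplication, assuming the ID is stable across redeliveries. The `IdempotentHandler` class and its in-memory `seen` set are hypothetical; a real handler would likely keep seen IDs in durable storage and update them together with the side effect.

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by message_id."""

    def __init__(self) -> None:
        self.seen: set[str] = set()  # IDs of already applied messages
        self.balance = 0             # stand-in for a business side effect

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message["message_id"] in self.seen:
            return False
        self.balance += message["amount"]
        self.seen.add(message["message_id"])
        return True


handler = IdempotentHandler()
deliveries = [
    {"message_id": "m-1", "amount": 9900},
    {"message_id": "m-1", "amount": 9900},  # redelivery of the same record
    {"message_id": "m-2", "amount": 100},
]
applied = [handler.handle(m) for m in deliveries]
```

The redelivered `m-1` is ignored, so the balance reflects each message once even though the broker delivered three records.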
Publish and consume path through components
The important part is not only the append path. You also need to show what happens after fetch: business processing, offset commit, retry routing, DLQ isolation, and rebalance behavior when workers change.
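Rebalance behavior is easier to reason about with a concrete assignment function. The sketch below assumes simple round-robin ownership; real brokers offer several assignment strategies, and `assign_partitions` and the worker names are illustrative.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over consumers.

    Each partition gets exactly one owner within the group.
    """
    members = sorted(consumers)
    if not members:
        raise ValueError("a consumer group needs at least one member")
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


# Three workers share six partitions.
before = assign_partitions(range(6), ["worker-a", "worker-b", "worker-c"])
# worker-c leaves: its partitions move and others are reshuffled.
after = assign_partitions(range(6), ["worker-a", "worker-b"])
# More workers than partitions: the extra worker owns nothing.
idle = assign_partitions(range(2), ["worker-a", "worker-b", "worker-c"])
```

Two points follow from this shape: partition count caps useful parallelism, and a naive reassignment moves partitions between surviving workers, which is why in-flight records can be processed twice around a rebalance.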
Publish path
- Partition key defines ordering scope and load distribution across partitions.
- The ack policy (leader-only vs quorum) controls the trade-off between latency and durability.
- Producer batching and compression are often essential for burst-heavy traffic.
- Replication lag should be monitored separately from end-to-end consumer lag.
Delivery semantics and operational control
Delivery choice is always a trade-off between loss risk, duplicate risk, and implementation complexity. Separately, you need to be explicit about consumer lag, backlog growth, and when the queue starts degrading the whole downstream system.
Delivery semantics
- At-most-once: no redelivery, so no duplicates from retries, but records can be lost.
- At-least-once: practical baseline, but consumers must stay idempotent.
- Effectively-once: comes from dedupe and side-effect control, not from a magic broker flag.
- Ordering is usually guaranteed per partition, not across the entire topic.
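The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after the side effect. The simulation below is a sketch: `run_consumer` and its `crash_at` parameter are hypothetical and only model a crash between the two steps.

```python
def run_consumer(log, start_offset, commit_first, crash_at):
    """Process log from start_offset and simulate a crash at offset crash_at.

    The crash happens between the commit and the side effect, in whichever
    order they run. Returns (side_effects, committed_offset).
    """
    effects = []
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if commit_first:
            committed = offset + 1         # at-most-once: commit, then process
            if offset == crash_at:
                return effects, committed  # crashed before the side effect
            effects.append(log[offset])
        else:
            effects.append(log[offset])    # at-least-once: process, then commit
            if offset == crash_at:
                return effects, committed  # crashed before the commit
            committed = offset + 1
    return effects, committed


log = ["a", "b", "c"]

# Commit first: the restarted consumer skips the record it never processed.
first, committed = run_consumer(log, 0, commit_first=True, crash_at=1)
second, _ = run_consumer(log, committed, commit_first=True, crash_at=None)
at_most_once = first + second

# Process first: the restarted consumer repeats the uncommitted record.
first, committed = run_consumer(log, 0, commit_first=False, crash_at=1)
second, _ = run_consumer(log, committed, commit_first=False, crash_at=None)
at_least_once = first + second
```

With commit-first, record `b` is lost. With process-first, `b` is applied twice, which is why at-least-once delivery needs idempotent handlers.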
Operational controls
- Track backlog depth, acknowledgement time, and consumer lag separately.
- Bound retries and use backoff so redelivery does not turn into a cluster-wide storm.
- Make retention and replay policy explicit instead of treating them as defaults.
- Keep a runbook for quarantine, manual inspection, and safe replay into the main flow.
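Consumer lag can be derived from two offsets per partition. The sketch below assumes the log end offsets and committed offsets are already available from broker metrics; the function names and sample numbers are illustrative, and the drain estimate ignores records that keep arriving while the consumer catches up.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def drain_time_seconds(lag, consume_rate_per_sec):
    """Rough time to work through an existing backlog at a given rate."""
    if consume_rate_per_sec <= 0:
        return float("inf")
    return lag / consume_rate_per_sec


log_end = {0: 912_334, 1: 880_010, 2: 901_500}
committed = {0: 912_300, 1: 870_010, 2: 901_500}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
# Partition 1 holds almost all of the backlog; estimate its drain time
# at an assumed 500 records per second.
delay = drain_time_seconds(lag[1], 500)
```

Looking at lag per partition rather than only in total shows whether the backlog is spread evenly or concentrated behind one hot key or one slow worker.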
Common mistakes
- Promising global ordering without explaining cost, coordination, and lost parallelism.
- Relying on broker features alone and forgetting idempotent business handlers.
- No clear quarantine path for poison messages and no bounded retry policy.
- Using broker throughput as a proxy for real business completion latency.
What to make explicit in interviews
- Where ordering is guaranteed: per partition, per key, or nowhere globally.
- Which delivery semantics are chosen and why the business can tolerate their failure mode.
- When offsets are committed and what happens if the consumer crashes between side effect and commit.
- How retries, quarantine, manual remediation, and safe replay work together.
Related chapters
- Event-Driven Architecture - Queue-centric event routing patterns, saga choreography, and async domain workflows.
- Kafka (book summary) - Detailed treatment of partitioned logs, consumer groups, and messaging trade-offs.
- System Design for Interviews and Beyond (short summary) - Interview framing techniques for high-throughput asynchronous integration systems.
- Consistency and idempotency patterns - Idempotent consumer design and duplicate-effect control under at-least-once delivery.
- Chat System - Applied real-time scenario where queues drive fan-out, delivery guarantees, and retries.
