System Design Space

Updated: April 11, 2026 at 2:45 PM

Distributed Message Queue

Difficulty: medium

Classic task: event delivery, offset tracking, retries, consumer lag, and load control in asynchronous messaging.

A distributed message queue is not just a buffer between services. It defines ordering boundaries, delivery rules, retry behavior, and what happens when consumers fall behind.

The chapter ties together event publication, partitioning, offset tracking, consumer groups, redelivery, and quarantine for problematic records into one architecture.

For interviews and engineering discussions, this case is useful because it quickly shows whether you can distinguish plain high throughput from truly reliable asynchronous integration.

Delivery Semantics

The key choice is not the broker brand. It is the delivery model the business can tolerate and how the system handles duplicates, loss, and retries.

Consumer Groups

Parallelism does not appear automatically: you need to reason about partition ownership, rebalance moments, and where ordering is lost.

Redelivery

The retry path should isolate temporary failures instead of turning into a cluster-wide retry storm.

Consumer Lag

Backlog growth becomes a useful signal only when you can connect lag to business delay, overloaded handlers, and degraded modes.

Acing SDI

Practice task from chapter 9

A practical case about distributed message queues as a base layer for asynchronous service integration.

Read the review

Distributed message queues are not just buffers between services. They define delivery semantics, ordering boundaries, retry behavior, retention rules, and what happens when consumers fall behind the incoming stream.

Representative systems

  • Apache Kafka: Partitioned log, consumer groups, replay, and streaming-heavy workloads.
  • RabbitMQ: Flexible routing, explicit queues, and fine-grained acknowledgement behavior.
  • Apache Pulsar: Separated storage and compute, multi-tenant topics, and isolation between workloads.
  • NATS JetStream: Lightweight event bus with persisted streams and a simple operational footprint.
  • AWS SQS/SNS: Managed async messaging for cloud-native integration patterns.

Functional requirements

The system has to do more than accept and return messages. It also needs explicit acknowledgement flow, offset tracking, replay, and parallel processing through consumer groups.

Core API

  • POST /topics/:name/messages publishes a record to a topic
  • GET /topics/:name/poll lets consumers fetch records
  • POST /offsets/commit confirms processed offsets
  • POST /topics/:name/replay re-reads messages from a chosen offset
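
To make the contract between these four endpoints concrete, here is a minimal in-memory sketch. It assumes a single partition and one process; the class name `InMemoryTopic` and its method names are hypothetical stand-ins for the HTTP calls, not a real broker API.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy single-partition topic mirroring the four endpoints above."""

    def __init__(self):
        self.log = []                       # append-only list of records
        self.committed = defaultdict(int)   # group -> next offset to read

    def publish(self, record):
        """POST /topics/:name/messages: append and return the assigned offset."""
        self.log.append(record)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """GET /topics/:name/poll: read from the group's committed position."""
        start = self.committed[group]
        end = min(start + max_records, len(self.log))
        return [(offset, self.log[offset]) for offset in range(start, end)]

    def commit(self, group, offset):
        """POST /offsets/commit: mark everything up to `offset` as processed."""
        self.committed[group] = max(self.committed[group], offset + 1)

    def replay(self, group, offset):
        """POST /topics/:name/replay: rewind the group to a chosen offset."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset outside the retained log")
        self.committed[group] = offset


topic = InMemoryTopic()
for status in ("created", "paid", "shipped"):
    topic.publish({"key": "order:1234", "status": status})

batch = topic.poll("billing")            # offsets 0, 1, 2
topic.commit("billing", batch[-1][0])    # confirm the whole batch
topic.replay("billing", 1)               # rewind to offset 1
replayed = topic.poll("billing")         # offsets 1, 2 are delivered again
```

Note that `poll` does not move the read position in this sketch; only `commit` does. A consumer that polls, crashes, and polls again sees the same records, which is the behavior an at-least-once design relies on.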

Processing reliability

  • Parallel consumers organized into consumer groups, with each partition owned by one group member at a time
  • Retry flow plus DLQ boundary for irrecoverable messages
  • At-least-once delivery as the practical baseline
  • Overload protection through throttling and bounded retries
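
The retry and DLQ boundary can be sketched as a small wrapper around the business handler. This is an illustration under assumptions: `TransientError`, `process_with_retries`, and the in-memory `dead_letters` list are hypothetical names, and a real consumer would also wait between attempts instead of retrying immediately.

```python
class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def process_with_retries(record, handler, dead_letters, max_attempts=3):
    """Return True if handled; otherwise park the record in the DLQ."""
    last_error = None
    attempt = 0
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return True
        except TransientError as exc:
            last_error = f"transient: {exc}"   # worth another attempt
        except Exception as exc:
            last_error = f"permanent: {exc}"   # retrying will not help
            break
    dead_letters.append(
        {"record": record, "error": last_error, "attempts": attempt}
    )
    return False


calls = {"n": 0}


def flaky(record):
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("downstream timeout")


def poison(record):
    """Always fails: the payload itself is the problem."""
    raise ValueError("malformed payload")


dlq = []
ok = process_with_retries({"message_id": "m-1"}, flaky, dlq)    # True on attempt 3
bad = process_with_retries({"message_id": "m-2"}, poison, dlq)  # False after 1 attempt
```

The design choice worth stating is the split between the two `except` branches: transient failures consume the bounded retry budget, while a malformed record goes straight to quarantine so it cannot block the partition or amplify load.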

Non-functional requirements

Queue design is not only about peak QPS. You also need predictable delivery lag, stable behavior during bursts, and graceful degradation when backlog starts growing.

| Requirement | Target | Reason |
|---|---|---|
| Throughput | High even during short bursts | Producers should not stall just because consumers temporarily fall behind |
| Delivery lag | Controlled end-to-end delay | Business flow depends on time-to-processing, not only on broker append latency |
| Scalability | Growth through partitions and consumer groups | Capacity should grow without a full redesign of the queue layer |
| Durability | Confirmed writes survive node loss | A single broker failure should not erase already acknowledged records |
| Predictable degradation | Bounded retries and isolated failures | Retry storms should not collapse the whole asynchronous pipeline |

Deep dive

Kafka (book summary)

Partitioned logs, consumer groups, replication, and practical operational trade-offs.

Read the review

Architecture overview

The baseline queue shape combines broker ingress, a partitioned replicated log, explicit acknowledgement policy, and separate retry and quarantine paths for problematic records.

Diagram: partitioned log, consumer groups, and retry control. The diagram covers publish flow, consume flow, and the retry/DLQ control loop.

  • Producers, broker, and routing: ingress and partition choice
  • Partitions and replicated log: durable storage
  • Consumers, offsets, and DLQ: processing control

Data Model

Queue event structure and placement model inside a partitioned log.

Event Envelope

  • key: order:1234
  • payload: { status: "created", amount: 9900 }
  • headers: message_id, trace_id, retry_count

Log Placement

  • partitioning: hash(key) -> topic: orders / partition: 7
  • offsets: offset: 912334 (append-only)
  • retention: 7d / 100GB per partition / compaction

  • Ordering: guaranteed within a partition, but not across partitions.
  • Replay: committed offsets let consumers resume processing after a crash.
  • Idempotency: `message_id` helps deduplicate repeated deliveries.
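
The envelope and the key-to-partition mapping can be sketched in a few lines. The `Envelope` class, the `partition_for` helper, and the choice of SHA-256 are assumptions for illustration; real brokers use their own hash functions, so the partition number produced here will not match any particular system.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """Record shape from the data model: key, payload, headers."""
    key: str
    payload: dict
    headers: dict = field(default_factory=dict)


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash so the same key always maps to the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


event = Envelope(
    key="order:1234",
    payload={"status": "created", "amount": 9900},
    headers={"message_id": "m-1", "trace_id": "t-1", "retry_count": 0},
)

first = partition_for(event.key, 12)
second = partition_for("order:1234", 12)   # same key, same partition
```

Python's built-in `hash()` is deliberately avoided because string hashing is randomized per process, which would break the "same key, same partition" property across producers. Also note that changing `num_partitions` remaps existing keys, so per-key ordering only holds while the partition count stays fixed.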

Publish and consume path through components

The important part is not only the append path. You also need to show what happens after fetch: business processing, offset commit, retry routing, DLQ isolation, and rebalance behavior when workers change.
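
Rebalance behavior is easier to reason about with a concrete assignment function. The sketch below assumes a naive round-robin strategy; `assign_partitions` and `owner_of` are hypothetical helpers, not the algorithm of any specific broker.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition has exactly one owner in the group."""
    if not consumers:
        return {}
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


def owner_of(assignment, partition):
    """Find which group member currently owns a partition."""
    return next(m for m, owned in assignment.items() if partition in owned)


before = assign_partitions(range(6), ["worker-a", "worker-b", "worker-c"])
after = assign_partitions(range(6), ["worker-a", "worker-b"])   # worker-c left

# Partitions whose owner changed during the rebalance.
moved = [p for p in range(6) if owner_of(before, p) != owner_of(after, p)]
```

With six partitions, `worker-c` leaving moves partitions 2, 3, 4, and 5, even though only 2 and 5 lost their owner. Each move means in-flight work may be reprocessed by the new owner, which is why assignment strategies that keep surviving owners in place are often preferred.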

Publish and consume path explorer

Walkthrough of how a record moves from producer ingress to consumer processing and offset commit.

  1. Producer Service: publish batch
  2. Broker Frontend: auth + quota
  3. Topic Router: hash(key)
  4. Leader Append: append log
  5. Replicas + Ack: ISR/quorum
Publish path: the producer sends a record through broker ingress, the record lands on the leader partition, and the ack returns once the chosen replication policy is satisfied.

Publish path

  1. Partition key defines ordering scope and load distribution across partitions.
  2. Ack policy (leader vs quorum) controls latency vs durability trade-off.
  3. Producer batching and compression are often essential for burst-heavy traffic.
  4. Replication lag should be monitored separately from end-to-end consumer lag.
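
The leader-versus-quorum trade-off in point 2 can be written as a small decision function. This is a simplified sketch: the policy names `"leader"` and `"quorum"` and the function `is_acknowledged` are hypothetical, and the quorum rule here is a plain majority of the replication factor. Systems that track an in-sync replica set apply a different rule.

```python
def is_acknowledged(policy, leader_written, follower_acks, replication_factor):
    """Decide whether the producer may receive an ack under a given policy."""
    if not leader_written:
        return False
    if policy == "leader":
        return True                      # fast, but lost if the leader dies now
    if policy == "quorum":
        copies = 1 + follower_acks       # leader plus followers that confirmed
        return copies >= replication_factor // 2 + 1
    raise ValueError(f"unknown policy: {policy}")


# Replication factor 3: the record is on the leader, no follower has confirmed yet.
fast_ack = is_acknowledged("leader", True, follower_acks=0, replication_factor=3)
early_quorum = is_acknowledged("quorum", True, follower_acks=0, replication_factor=3)
safe_ack = is_acknowledged("quorum", True, follower_acks=1, replication_factor=3)
```

Under the leader policy the producer is unblocked after one local write, so latency is low but an acknowledged record can still disappear if the leader fails before replication. Under the quorum policy the ack waits for a second copy, which adds a network round trip in exchange for surviving a single node loss.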

Delivery semantics and operational control

Delivery choice is always a trade-off between loss risk, duplicate risk, and implementation complexity. Separately, you need to be explicit about consumer lag, backlog growth, and when the queue starts degrading the whole downstream system.

Delivery semantics

  • At-most-once: no duplicates, but records are lost if a consumer fails after the offset is committed and before processing completes.
  • At-least-once: practical baseline, but consumers must stay idempotent.
  • Effectively-once: comes from dedupe and side-effect control, not from a magic broker flag.
  • Ordering is usually guaranteed per partition, not across the entire topic.
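
Effectively-once processing on top of at-least-once delivery usually comes down to an idempotent handler. The sketch below assumes each record carries a `message_id` header, as in the data model; `IdempotentHandler` is a hypothetical name, and the in-memory set stands in for what would need to be a durable store.

```python
class IdempotentHandler:
    """Skips records whose message_id has already been applied."""

    def __init__(self, apply_side_effect):
        self.apply_side_effect = apply_side_effect
        self.seen = set()   # assumption: a durable store in a real system

    def handle(self, record):
        message_id = record["headers"]["message_id"]
        if message_id in self.seen:
            return False                 # duplicate delivery: skip the side effect
        self.apply_side_effect(record)
        self.seen.add(message_id)
        return True


charged = []
handler = IdempotentHandler(lambda r: charged.append(r["payload"]["amount"]))

record = {"headers": {"message_id": "m-42"}, "payload": {"amount": 9900}}
results = [handler.handle(record), handler.handle(record)]   # delivered twice
```

The second delivery returns `False` and the amount is charged once. In a real system the side effect and the `seen` update need to be atomic, for example in one database transaction, otherwise a crash between the two steps reintroduces the duplicate.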

Operational controls

  • Track backlog depth, acknowledgement time, and consumer lag separately.
  • Bound retries and use backoff so redelivery does not turn into a cluster-wide storm.
  • Make retention and replay policy explicit instead of treating them as defaults.
  • Keep a runbook for quarantine, manual inspection, and safe replay into the main flow.
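
Two of these controls reduce to short formulas: per-partition consumer lag and a capped backoff delay. The sketch assumes that both offsets mean "next position" (next to write for the log end, next to read for the group); the function names and the default `base` and `cap` values are illustrative.

```python
import random


def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: records appended but not yet committed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Exponential backoff with full jitter; attempt numbering starts at 1."""
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)


lag = consumer_lag(
    log_end_offsets={0: 912334, 1: 500},
    committed_offsets={0: 912300, 1: 500},
)   # partition 0 is 34 records behind, partition 1 is caught up

rng = random.Random(7)   # seeded only to make the sketch repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(1, 6)]
```

Lag in records is only half of the signal; dividing it by the observed processing rate gives an estimated delay in seconds, which is closer to what the business cares about. The jitter spreads retries from many consumers over time so they do not hit a recovering dependency in synchronized waves.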

Common mistakes

  • Promising global ordering without explaining cost, coordination, and lost parallelism.
  • Relying on broker features alone and forgetting idempotent business handlers.
  • No clear quarantine path for poison messages and no bounded retry policy.
  • Using broker throughput as a proxy for real business completion latency.

What to make explicit in interviews

  • Where ordering is guaranteed: per partition, per key, or nowhere globally.
  • Which delivery semantics are chosen and why the business can tolerate their failure mode.
  • When offsets are committed and what happens if the consumer crashes between side effect and commit.
  • How retries, quarantine, manual remediation, and safe replay work together.
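
The crash-between-side-effect-and-commit question can be answered with a small simulation. It is a toy model under assumptions: `state` stands for everything that survives a restart (the committed offset and the external side effects), and `Crash`, `consume`, and `run_with_one_crash` are hypothetical names.

```python
class Crash(Exception):
    """Simulated consumer failure between the two steps of one record."""


def consume(log, state, commit_first, crash_on=None):
    """Resume from state['committed'] and process the rest of the log."""
    while state["committed"] < len(log):
        offset = state["committed"]
        if commit_first:
            state["committed"] = offset + 1     # commit, then act
            if offset == crash_on:
                raise Crash
            state["effects"].append(log[offset])
        else:
            state["effects"].append(log[offset])  # act, then commit
            if offset == crash_on:
                raise Crash
            state["committed"] = offset + 1


def run_with_one_crash(log, commit_first, crash_on):
    """Run a consumer that crashes once and is then restarted."""
    state = {"committed": 0, "effects": []}
    try:
        consume(log, state, commit_first, crash_on)
    except Crash:
        consume(log, state, commit_first)   # restart with no further crashes
    return state["effects"]


log = ["a", "b", "c"]
at_most_once = run_with_one_crash(log, commit_first=True, crash_on=1)
at_least_once = run_with_one_crash(log, commit_first=False, crash_on=1)
```

Committing first yields `["a", "c"]`: record `b` is lost. Acting first yields `["a", "b", "b", "c"]`: record `b` is applied twice. Neither order removes the failure window, which is why the at-least-once variant has to be paired with an idempotent handler.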
