System Design Space

Updated: April 21, 2026 at 4:55 PM

Notification System

Difficulty: medium

Classic notification system case: push, email, and SMS delivery, device-token management, delivery guarantees, and retry behavior under failure.

A notification system gets hard not when it sends one message, but when one event must be delivered reliably to many users, through several channels, across multiple devices.

The chapter connects queues, templates, user preferences, external-provider protection, and idempotency into one delivery path that can survive traffic spikes.

For interviews and engineering discussions, this case is valuable because it forces explicit choices about delivery guarantees, retry behavior, and acceptable degradation.

Latency budget

Each delivery channel needs an explicit latency target and a defined fallback path when it cannot meet that target.

Channel fan-out

The way one event expands across recipients and channels directly shapes the scalability and cost of the whole system.

User state

Model connected devices, quiet hours, delivery order, and the latest user preferences explicitly.

Graceful degradation

When a provider fails or saturates, the system should lose non-critical quality before it loses the entire delivery path.

A notification system stops being a simple channel wrapper once one event must reach many users, through several channels, across multiple devices, with preferences, priorities, and partial failures in play. A strong design answer has to go beyond listing push, email, and SMS: it needs explicit delivery guarantees, retry behavior, provider protection, and a clear degradation path.

Chapter 7

Acing SDI: Notification

A walkthrough of the case in Zhiyong Tan's book

Read the review

Notification Types

Mobile push

APNs for iOS and FCM for Android. Delivery can still happen when the app is closed.

Email

Transactional and marketing email. SendGrid, SES, Mailgun.

SMS

Twilio, Vonage. Useful for 2FA and critical alerts.

In-app delivery

In-app notifications and WebSocket delivery while the user is online.

Requirements

Functional

  • FR1
    Send push, email, SMS, and in-app notifications
  • FR2
    Support templated messages with personalization
  • FR3
    Respect user notification preferences
  • FR4
    Support scheduled delivery
  • FR5
    Enforce rate limits and batch similar notifications

Non-functional

  • NFR1
    10M+ notifications per day
  • NFR2
    Seconds-level delivery
  • NFR3
    At-least-once delivery with deduplication
  • NFR4
    99.9% availability
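
As a rough sizing check, the daily volume in NFR1 can be converted into a per-second rate. The sketch below does that arithmetic; the peak factor of 10 is an illustrative assumption, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(notifications_per_day: int) -> float:
    """Average send rate if traffic were spread evenly over the day."""
    return notifications_per_day / SECONDS_PER_DAY


def peak_rate_per_second(notifications_per_day: int, peak_factor: float) -> float:
    """Rate to provision for, given an assumed peak-to-average ratio."""
    return average_rate_per_second(notifications_per_day) * peak_factor


avg = average_rate_per_second(10_000_000)
peak = peak_rate_per_second(10_000_000, peak_factor=10.0)  # assumed ratio
print(f"average ~{avg:.0f}/s, assumed peak ~{peak:.0f}/s")
```

The average works out to about 116 notifications per second, so capacity planning is driven by bursts and broadcasts rather than by the daily total.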

High-level architecture


Diagram: delivery channels are color-coded as Push (blue), Email (green), and SMS (amber); the remaining components form the shared processing path.

Business services (event sources)
  → Notification service (routing and templates)
  → Message queue (Kafka / SQS)
  → Fan-out by delivery channel:
      • Push workers → APNs (Apple) and FCM (Google)
      • Email workers → SendGrid / SES
      • SMS workers → Twilio / Vonage

Design properties: seconds-level delivery, at-least-once with deduplication, horizontal scaling.

Apple Push Notification Service (APNs)

APNs architecture

APNs runs over a persistent HTTP/2 connection and supports certificate-based or token-based authentication.

Provider → APNs → device

POST /3/device/<device_token>

Key details

  • Device Token — the unique device token, which can change over time
  • Payload — a JSON payload up to 4 KB with alert, badge, and sound fields
  • Priority — 10 for immediate delivery or 5 for delivery timed around the device's power considerations
  • Expiration — how long APNs keeps retrying delivery to a device that is currently offline
  • Collapse ID — lets a newer notification replace an older one in the same group
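
The delivery options listed above travel as HTTP/2 request headers next to the JSON body. The sketch below assembles the path, headers, and body for one request without sending it. The function name `build_apns_request` and the bundle ID used as `apns-topic` are hypothetical; the `apns-*` header names are the ones APNs defines.

```python
import json
import time
from typing import Optional

APNS_MAX_PAYLOAD_BYTES = 4096  # the 4 KB payload limit listed above
APNS_MAX_COLLAPSE_ID_BYTES = 64


def build_apns_request(device_token: str, payload: dict, *, topic: str,
                       priority: int = 10, ttl_seconds: int = 3600,
                       collapse_id: Optional[str] = None,
                       now: Optional[float] = None):
    """Return (path, headers, body) for a single APNs alert request."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(body) > APNS_MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds 4 KB")
    if priority not in (5, 10):
        raise ValueError("this sketch only supports priority 5 or 10")

    now = time.time() if now is None else now
    headers = {
        "apns-topic": topic,            # usually the app's bundle ID
        "apns-push-type": "alert",
        "apns-priority": str(priority),
        # Expiration is an absolute UNIX timestamp in seconds, not a duration.
        "apns-expiration": str(int(now) + ttl_seconds),
    }
    if collapse_id is not None:
        if len(collapse_id.encode("utf-8")) > APNS_MAX_COLLAPSE_ID_BYTES:
            raise ValueError("collapse ID exceeds 64 bytes")
        headers["apns-collapse-id"] = collapse_id
    return f"/3/device/{device_token}", headers, body
```

Authentication (a provider certificate or a bearer token in the `authorization` header) is left out here and would be added by the HTTP/2 client.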

Sample APNs payload ("mutable-content": 1 lets a Notification Service Extension modify the notification before it is shown)

{
  "aps": {
    "alert": {
      "title": "New Message",
      "body": "You have a new message from John"
    },
    "badge": 5,
    "sound": "default",
    "mutable-content": 1
  },
  "custom_data": {
    "conversation_id": "abc123"
  }
}

Firebase Cloud Messaging (FCM)

FCM architecture

FCM supports Android, iOS, and web clients. It exposes an HTTP v1 API with OAuth 2.0 authentication.

POST /v1/projects/<project_id>/messages:send

Message types

  • Notification — rendered automatically by FCM
  • Data — processed explicitly by the application
  • Hybrid — combines both styles in one message
💡 Data messages give you full control over rendering behavior

Sample FCM HTTP v1 request

{
  "message": {
    "token": "device_registration_token",
    "notification": {
      "title": "New Order",
      "body": "Your order #1234 has been shipped"
    },
    "data": {
      "order_id": "1234",
      "click_action": "OPEN_ORDER_DETAIL"
    },
    "android": {
      "priority": "high",
      "ttl": "86400s"
    },
    "apns": {
      "headers": {
        "apns-priority": "10"
      }
    }
  }
}
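
The three message types differ only in which top-level fields are present. A minimal sketch of a builder for the request body, assuming a hypothetical helper named `build_fcm_message`; it produces the JSON structure shown above and does not call FCM.

```python
from typing import Optional


def build_fcm_message(token: str, *, title: Optional[str] = None,
                      body: Optional[str] = None,
                      data: Optional[dict] = None) -> dict:
    """Build an FCM HTTP v1 request body.

    notification only -> rendered automatically by FCM
    data only         -> handled by application code
    both              -> hybrid message
    """
    message: dict = {"token": token}
    if title is not None or body is not None:
        fields = (("title", title), ("body", body))
        message["notification"] = {k: v for k, v in fields if v is not None}
    if data:
        # FCM data payloads are string-to-string maps, so coerce the values.
        message["data"] = {str(k): str(v) for k, v in data.items()}
    if len(message) == 1:
        raise ValueError("message needs a notification, data, or both")
    return {"message": message}


hybrid = build_fcm_message("device_registration_token",
                           title="New Order",
                           body="Your order #1234 has been shipped",
                           data={"order_id": 1234})
```

Coercing data values to strings up front avoids a rejected request when a numeric ID slips into the payload.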

WebSocket

Chat System

Real-time delivery over WebSocket

Read the review

Device token management

Device-token management is a critical part of the system. Tokens can change, devices go offline, and one user often owns several devices and client applications.

Device registry table

Column          Type           Description
user_id         BIGINT         Reference to the user
device_token    VARCHAR(255)   APNs or FCM token
platform        ENUM           ios, android, web
app_version     VARCHAR(20)    Used for compatibility checks and rollout policy
last_active_at  TIMESTAMP      Helps remove stale tokens
created_at      TIMESTAMP      Registration time

Important: process invalid-token feedback from APNs and FCM. Regularly remove stale tokens for devices that have not shown up for a long time.
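
Both cleanup paths can be sketched with an in-memory stand-in for the registry table. The class and method names are hypothetical. The error reasons are the ones the providers return for dead tokens (`Unregistered` and `BadDeviceToken` from APNs, `UNREGISTERED` from FCM), and the 90-day idle threshold in the usage lines is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Device:
    user_id: int
    device_token: str
    platform: str          # "ios", "android", or "web"
    last_active_at: datetime


# Provider error reasons that mean the token will never work again.
INVALID_TOKEN_REASONS = {
    "ios": {"Unregistered", "BadDeviceToken"},
    "android": {"UNREGISTERED"},
}


class DeviceRegistry:
    def __init__(self):
        self._devices = {}  # device_token -> Device

    def register(self, device: Device) -> None:
        self._devices[device.device_token] = device

    def tokens_for(self, user_id: int) -> list:
        return [t for t, d in self._devices.items() if d.user_id == user_id]

    def handle_provider_error(self, device_token: str, reason: str) -> bool:
        """Delete the token if the provider says it is invalid."""
        device = self._devices.get(device_token)
        if device is None:
            return False
        if reason in INVALID_TOKEN_REASONS.get(device.platform, set()):
            del self._devices[device_token]
            return True
        return False  # transient error: keep the token

    def prune_stale(self, now: datetime, max_idle: timedelta) -> list:
        """Remove tokens whose device has been idle longer than max_idle."""
        stale = [t for t, d in self._devices.items()
                 if now - d.last_active_at > max_idle]
        for token in stale:
            del self._devices[token]
        return stale


registry = DeviceRegistry()
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
registry.register(Device(1, "old-token", "android", now - timedelta(days=120)))
removed = registry.prune_stale(now, max_idle=timedelta(days=90))
```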

User preferences

Preference levels

  • Global — turn all notifications on or off
  • Channel — control push, email, and SMS separately
  • Category — split marketing, transactional, and social notifications
  • Quiet Hours — define silent periods and time-based restrictions
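
The four levels can be evaluated as a short chain of checks, from the broadest switch to the narrowest. A minimal sketch with hypothetical names; letting critical notifications through during quiet hours is an assumption about policy, and the quiet-hours window is allowed to wrap past midnight.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional, Tuple


@dataclass
class Preferences:
    enabled: bool = True                                   # global switch
    channels: dict = field(default_factory=lambda: {
        "push": True, "email": True, "sms": True})
    categories: dict = field(default_factory=dict)         # missing = allowed
    quiet_hours: Optional[Tuple[time, time]] = None        # (start, end), local time


def in_quiet_hours(local_time: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= local_time < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def should_send(prefs: Preferences, channel: str, category: str,
                local_time: time, critical: bool = False) -> bool:
    if not prefs.enabled:
        return False
    if not prefs.channels.get(channel, False):
        return False
    if not prefs.categories.get(category, True):
        return False
    if prefs.quiet_hours and not critical:   # assumption: critical bypasses
        start, end = prefs.quiet_hours
        if in_quiet_hours(local_time, start, end):
            return False
    return True
```

A notification blocked only by quiet hours is a candidate for delayed delivery rather than a drop, which ties into scheduled delivery (FR4).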

Frequency caps

  • Cap notifications per hour or per day
  • Batch similar notifications together
  • Reserve a priority queue for critical alerts
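
A per-user cap can be sketched as a sliding-window counter. The class name `FrequencyCap` is hypothetical, the state lives in process memory for illustration (a shared store would be needed across workers), and letting critical alerts bypass the cap is an assumed policy.

```python
from collections import defaultdict, deque


class FrequencyCap:
    """Allow at most `limit` notifications per user in any sliding window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id: int, now: float, critical: bool = False) -> bool:
        if critical:
            return True  # assumption: critical alerts are never capped
        sent = self._sent[user_id]
        # Drop timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.limit:
            return False
        sent.append(now)
        return True


hourly_cap = FrequencyCap(limit=2, window_seconds=3600)
```

Notifications rejected by the cap are natural inputs for batching: they can be held and merged into one digest when the window reopens.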

Message queue architecture

Kafka or SQS decouples services in time and helps preserve clear delivery guarantees. Each delivery channel usually gets its own topic and consumer group.

Topics:
├── notifications.push.ios      → Push workers (APNs)
├── notifications.push.android  → Push workers (FCM)
├── notifications.email         → Email workers
├── notifications.sms           → SMS workers
├── notifications.websocket     → WebSocket workers
└── notifications.dlq           → Dead Letter Queue

Partitioning: by user_id to preserve per-user ordering
Retention: 7 days (for retries and incident debugging)
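
The routing and partitioning rules above can be sketched as two small functions. The names are hypothetical, and the SHA-256 hash is only an illustration of a stable key-to-partition mapping; a Kafka client would apply its own partitioner to the message key.

```python
import hashlib
from typing import Optional

PUSH_PLATFORMS = {"ios", "android"}
OTHER_CHANNELS = {"email", "sms", "websocket"}


def topic_for(channel: str, platform: Optional[str] = None) -> str:
    """Map a delivery channel (and platform, for push) to its topic name."""
    if channel == "push":
        if platform not in PUSH_PLATFORMS:
            raise ValueError(f"unsupported push platform: {platform!r}")
        return f"notifications.push.{platform}"
    if channel in OTHER_CHANNELS:
        return f"notifications.{channel}"
    raise ValueError(f"unknown channel: {channel!r}")


def partition_for(user_id: int, num_partitions: int) -> int:
    """Stable partition choice so one user's messages stay in order."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Ordering holds only within one topic: a user's push and email notifications travel through different topics and can arrive in either order.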

Rate Limiting

Rate Limiter

Protection against overloading external providers

Read the review

Delivery guarantees

At-Most-Once

Fire and forget. The simplest possible approach.

⚠️ Messages may be lost

At-Least-Once

Retries with acknowledgment. Requires deduplication.

✓ A practical default for most systems

Exactly-Once

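Idempotency keys plus the transactional outbox pattern. With external providers this is effectively at-least-once delivery with idempotent processing, because the final hop to the device cannot be made transactional.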

Complex, expensive, and rarely necessary

Deduplication strategy

// Idempotency key = hash(user_id + notification_type + content_hash + date)

Redis SET with NX and a TTL (SETNX itself does not accept an expiry):
  SET notification:dedupe:{idempotency_key} 1 NX EX 86400

If the command returns nil, the key already exists → skip sending
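
The check above can be sketched end to end with an in-memory stand-in for Redis. `InMemoryDedupeStore` is hypothetical and only mimics "set if absent, with expiry"; in production the same call would be a single atomic Redis command. Field separators in the key are an implementation choice.

```python
import hashlib
from datetime import date


def idempotency_key(user_id: int, notification_type: str,
                    content: str, day: date) -> str:
    """hash(user_id + notification_type + content_hash + date)."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    raw = f"{user_id}|{notification_type}|{content_hash}|{day.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupeStore:
    """Stand-in for Redis: set a key only if it is absent or expired."""

    def __init__(self):
        self._expires_at = {}

    def set_if_absent(self, key: str, ttl_seconds: float, now: float) -> bool:
        expires = self._expires_at.get(key)
        if expires is not None and expires > now:
            return False  # key still live: this is a duplicate
        self._expires_at[key] = now + ttl_seconds
        return True


def should_send(store: InMemoryDedupeStore, key: str, now: float,
                ttl_seconds: float = 86400) -> bool:
    return store.set_if_absent(f"notification:dedupe:{key}", ttl_seconds, now)
```

Because the date is part of the key, the same content sent on two different days is treated as two notifications; dropping the date would widen deduplication to the full TTL.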

Retries and error handling

Retry strategy

  • Exponential Backoff: 1s → 2s → 4s → 8s...
  • Max Retries: 3-5 attempts
  • Jitter: add randomness so retries do not synchronize
  • Circuit Breaker: trip when the provider is failing broadly
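
A minimal sketch of the two mechanisms in this list: exponential backoff with full jitter, and a small circuit breaker. All names, the 60-second delay cap, and the breaker thresholds are assumptions for illustration.

```python
import random
from typing import Optional


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> list:
    """Delay before each retry: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))   # 1s, 2s, 4s, 8s, ...
        delays.append(rng.uniform(0, ceiling))      # full jitter
    return delays


class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after: float):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._failures = 0
        self._opened_at = None

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: after the cool-down, let a request through as a probe.
        return now - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now
```

Backoff protects the provider from one message retrying too fast; the breaker protects it from every worker retrying at once during a broad outage.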

Error classification

  • Retryable: 429, 503, network timeout
  • Non-retryable: 400, 401, invalid token
  • DLQ: after the retry budget is exhausted
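
The classification above can be sketched as one decision function. The names are hypothetical, and routing unlisted status codes straight to the dead-letter queue for inspection is an assumption rather than part of the list.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    FAIL = "fail"                # non-retryable: record the failure and stop
    DEAD_LETTER = "dead_letter"


RETRYABLE_STATUS = {429, 503}
NON_RETRYABLE_STATUS = {400, 401}


def next_action(status: Optional[int], attempt: int, max_retries: int, *,
                timed_out: bool = False, invalid_token: bool = False) -> Action:
    """Decide what to do after a failed send.

    `attempt` is the number of retries already made for this message.
    """
    if invalid_token or status in NON_RETRYABLE_STATUS:
        return Action.FAIL
    retryable = timed_out or status in RETRYABLE_STATUS
    if retryable and attempt < max_retries:
        return Action.RETRY
    # Retry budget exhausted, or a status this sketch does not recognize.
    return Action.DEAD_LETTER
```

An invalid-token result should also trigger removal of the token from the device registry, so the same dead device is not targeted again.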

Patterns

Release It!

Circuit Breaker and other stability patterns

Read the review

Scaling

Horizontal scaling

• Stateless workers that autoscale with queue depth

• Partition by user_id to preserve per-user ordering

• Separate worker pools for each channel
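
Autoscaling on queue depth reduces to a small calculation: how many workers are needed to drain the current backlog within a target time. A sketch with a hypothetical function name; the per-worker rate, drain target, and bounds are assumed inputs that would come from measurement.

```python
import math


def desired_workers(queue_depth: int, per_worker_rate: float,
                    target_drain_seconds: float,
                    min_workers: int = 1, max_workers: int = 100) -> int:
    """Workers needed to drain `queue_depth` messages in the target time."""
    if per_worker_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain target must be positive")
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))


# 60,000 queued messages, 100 msg/s per worker, drain within 60 s.
workers = desired_workers(60_000, per_worker_rate=100, target_drain_seconds=60)
```

With Kafka, the partition count of a topic caps the useful number of consumers in one group, so `max_workers` should not exceed it.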

Provider limits

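• APNs: HTTP/2 multiplexing over long-lived connections. Apple does not publish a fixed per-connection rate, so treat figures such as roughly 4000 requests per second as estimates to confirm with load tests

• FCM: a default quota of 600K messages per minute per project. FCM has no paid tier; a higher quota can be requested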

• Use connection pooling

Broadcast notifications: for large broadcasts to millions of users, use FCM Topics or expand the send through Kafka with batch processing.
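
For the Kafka path, batch processing means splitting the recipient list into fixed-size chunks so that each queue message carries one batch instead of one user. A minimal sketch with a hypothetical helper name; the batch size of 1000 in the usage line is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(recipients: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive batches without loading all recipients into memory."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(recipients)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 10M recipients in batches of 1000 would produce 10,000 queue messages.
batches = list(chunked(range(2500), batch_size=1000))
```

Because the function consumes an iterator, recipient IDs can be streamed from a database cursor rather than materialized up front.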

Observability

Metrics

  • Sent and failed notifications per channel
  • Delivery latency (p50, p99)
  • Queue depth and consumer lag
  • Provider-side error rates
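
For the latency metric, p50 and p99 are percentiles over the observed delivery times. The sketch below uses the nearest-rank definition on raw samples; it is an illustration, since a production metrics system would typically use histograms instead of sorting every sample.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))   # synthetic samples: 1 ms .. 100 ms
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Tracking p99 alongside p50 matters here because a slow provider or a retry storm shows up in the tail long before it moves the median.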

Logging

  • Structured logs with a correlation ID
  • Provider response codes
  • Applied user preferences

Alerting

  • Spikes in error rates
  • Growing queue backlog
  • Provider degradation

What to emphasize in an interview

What to demonstrate

• The difference between APNs and FCM

• How the device-token lifecycle works

• When at-least-once is enough and why deduplication matters

• How to protect external providers with rate limits

• Where user preferences and quiet hours are enforced

Common follow-up questions

• How do you preserve notification order for one user?

• How do you handle a broadcast to 10M users?

• How do you avoid duplicate notifications?

• How do you prioritize critical alerts?
