System Design Space

Updated: March 2, 2026 at 8:49 AM

Notification System

mid

Push notifications (APNs, FCM), email, SMS: device token management, delivery guarantees, retry patterns.

A notification system is a core component of modern applications. It combines push notifications (iOS/Android), email, SMS, in-app messages, and real-time WebSocket delivery. This problem often comes up in system design interviews, especially at companies with a large user base.

Chapter 7

Acing SDI: Notification

Analysis of the problem in the book by Zhiyong Tan

Read the review

Notification Types

Push (Mobile)

APNs (iOS), FCM (Android). Delivery even when the application is closed.

Email

Transactional and marketing. SendGrid, SES, Mailgun.

SMS

Twilio, Vonage. 2FA, critical alerts.

In-App / WebSocket

Real-time delivery when the application is open.

Requirements

Functional

  • FR1
    Sending push, email, SMS, in-app notifications
  • FR2
    Template support with personalization
  • FR3
    User Preference Settings
  • FR4
    Delayed sending (scheduled)
  • FR5
    Rate limiting and batching

Non-functional

  • NFR1
    10M+ notifications per day
  • NFR2
    Soft real-time: delivery in seconds
  • NFR3
    At-least-once delivery (with deduplication)
  • NFR4
    99.9% availability

High-Level Architecture

Data flow (one core path shared by all channels, then one branch per channel):

Services (triggers)
└── Notification Service (routing + templates)
    └── Message Queue (Kafka / SQS), fan-out by delivery channel
        ├── Push Worker → APNs (Apple), FCM (Google)
        ├── Email Worker → SendGrid / SES
        └── SMS Worker → Twilio / Vonage

Properties: soft real-time delivery, at-least-once delivery with deduplication, horizontal scaling.
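
The routing and fan-out step of the Notification Service can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the `fan_out` function, the event shape, the `(type, channel)` template lookup, and the topic names are all assumptions made for the example, and `publish` stands in for a real Kafka or SQS producer.

```python
from string import Template

# Hypothetical topic names: one topic per delivery channel.
CHANNEL_TOPICS = {
    "push": "notifications.push",
    "email": "notifications.email",
    "sms": "notifications.sms",
}


def fan_out(event, templates, publish):
    """Render one message per requested channel and publish it to that channel's topic.

    `templates` maps (event type, channel) to a string.Template pattern.
    `publish(topic, message)` is injected so the sketch needs no broker.
    Returns the list of topics that received a message.
    """
    published = []
    for channel in event["channels"]:
        topic = CHANNEL_TOPICS.get(channel)
        pattern = templates.get((event["type"], channel))
        if topic is None or pattern is None:
            continue  # unknown channel or no template for it: skip
        body = Template(pattern).substitute(event["params"])
        publish(topic, {"user_id": event["user_id"], "channel": channel, "body": body})
        published.append(topic)
    return published
```

Injecting `publish` keeps the routing logic testable without a running queue; a production service would also consult user preferences before publishing.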

Apple Push Notification Service (APNs)

Architecture of APNs

APNs uses persistent HTTP/2 connections with either certificate-based or token-based (JWT) authentication.

Provider → APNs → Device

POST /3/device/<device_token>

Key Features

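  • Device Token: an opaque identifier for an app installation on a device; it can change, so it must be refreshed
  • Payload: up to 4 KB of JSON with alert, badge, sound
  • Priority: 10 (immediate) or 5 (power-saving)
  • Expiration: TTL during which APNs keeps retrying delivery to an offline device
  • Collapse ID: a notification with the same collapse ID replaces the previous one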

APNs Payload Example

{
  "aps": {
    "alert": {
      "title": "New Message",
      "body": "You have a new message from John"
    },
    "badge": 5,
    "sound": "default",
    "mutable-content": 1
  },
  "custom_data": {
    "conversation_id": "abc123"
  }
}

Note: "mutable-content": 1 lets a Notification Service Extension modify the notification before it is displayed.
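
To show how the key features map onto an actual request, here is a small sketch that assembles the HTTP/2 request parts without sending anything. The header names (`apns-topic`, `apns-push-type`, `apns-priority`, `apns-expiration`, `apns-collapse-id`) are the ones APNs defines; the `build_apns_request` helper, its return shape, and the example bundle ID are assumptions for illustration.

```python
import json

APNS_MAX_PAYLOAD_BYTES = 4096  # 4 KB limit for regular notifications


def build_apns_request(device_token, payload, topic, priority=10,
                       expiration=0, collapse_id=None):
    """Return the method, path, headers, and body for one APNs request.

    `expiration` is a UNIX timestamp; 0 tells APNs not to store the
    notification for later delivery. Authentication headers are omitted.
    """
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(body) > APNS_MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(body)} bytes, limit is {APNS_MAX_PAYLOAD_BYTES}")
    headers = {
        "apns-topic": topic,          # usually the app's bundle ID
        "apns-push-type": "alert",
        "apns-priority": str(priority),
        "apns-expiration": str(expiration),
    }
    if collapse_id is not None:
        headers["apns-collapse-id"] = collapse_id
    return {
        "method": "POST",
        "path": f"/3/device/{device_token}",
        "headers": headers,
        "body": body,
    }
```

Checking the payload size before sending avoids a round trip that APNs would reject anyway.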

Firebase Cloud Messaging (FCM)

FCM Architecture

FCM supports Android, iOS, and web clients. It exposes the HTTP v1 API, authenticated with OAuth 2.0 access tokens.

POST /v1/projects/<project_id>/messages:send

Message Types

  • Notification: the FCM SDK displays it automatically
  • Data: the application processes the payload itself
  • Hybrid: both payload types in one message
💡 Data messages give full control over display

FCM HTTP v1 Request

{
  "message": {
    "token": "device_registration_token",
    "notification": {
      "title": "New Order",
      "body": "Your order #1234 has been shipped"
    },
    "data": {
      "order_id": "1234",
      "click_action": "OPEN_ORDER_DETAIL"
    },
    "android": {
      "priority": "high",
      "ttl": "86400s"
    },
    "apns": {
      "headers": {
        "apns-priority": "10"
      }
    }
  }
}
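
A request body like the one above could be assembled by a small helper such as the following sketch. The `build_fcm_message` name and its parameters are hypothetical. The one detail worth encoding is that values in the `data` map must be strings, so the helper converts them.

```python
def build_fcm_message(token, title, body, data=None, ttl_seconds=86400):
    """Build the JSON body for POST /v1/projects/<project_id>/messages:send."""
    message = {
        "token": token,
        "notification": {"title": title, "body": body},
        "android": {"priority": "high", "ttl": f"{ttl_seconds}s"},
    }
    if data:
        # FCM accepts only string keys and string values in the data map.
        message["data"] = {str(key): str(value) for key, value in data.items()}
    return {"message": message}
```

For example, `build_fcm_message("tok", "New Order", "Shipped", {"order_id": 1234})` produces a `data` map of `{"order_id": "1234"}`.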

WebSocket

Chat System

Real-time delivery via WebSocket

Read the review

Device Token Management

Managing device tokens is a critical part of the system: tokens can change, devices can go offline, and users can have multiple devices.

Device Registry Table

Column          Type           Description
user_id         BIGINT         FK to users
device_token    VARCHAR(255)   APNs/FCM token
platform        ENUM           ios, android, web
app_version     VARCHAR(20)    For feature flags
last_active_at  TIMESTAMP      To clear stale tokens
created_at      TIMESTAMP      Registration date

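Important: Process feedback from APNs/FCM about invalid tokens and delete them (APNs responds with HTTP 410 and reason Unregistered; the FCM HTTP v1 API returns the UNREGISTERED error code). Regularly purge stale tokens, for example those of devices inactive for 30+ days.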
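
The token lifecycle (register or refresh, remove on provider feedback, purge when stale) can be sketched with an in-memory SQLite table. This is an illustrative stand-in for a real database: the function names are hypothetical, and timestamps are stored as integer epoch seconds rather than TIMESTAMP to keep comparisons simple. The upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE devices (
    user_id        INTEGER NOT NULL,
    device_token   TEXT    NOT NULL UNIQUE,
    platform       TEXT    NOT NULL CHECK (platform IN ('ios', 'android', 'web')),
    app_version    TEXT,
    last_active_at INTEGER NOT NULL,
    created_at     INTEGER NOT NULL
)
"""


def open_registry():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def register_device(conn, user_id, token, platform, app_version, now):
    """Insert a token, or refresh its owner, version, and activity time if it exists."""
    ts = int(now.timestamp())
    conn.execute(
        """INSERT INTO devices VALUES (?, ?, ?, ?, ?, ?)
           ON CONFLICT(device_token) DO UPDATE SET
               user_id = excluded.user_id,
               app_version = excluded.app_version,
               last_active_at = excluded.last_active_at""",
        (user_id, token, platform, app_version, ts, ts),
    )


def remove_token(conn, token):
    """Call when a provider reports the token as invalid or unregistered."""
    conn.execute("DELETE FROM devices WHERE device_token = ?", (token,))


def purge_stale(conn, now, max_idle_days=30):
    """Delete tokens not seen for `max_idle_days`; return how many were removed."""
    cutoff = int((now - timedelta(days=max_idle_days)).timestamp())
    return conn.execute("DELETE FROM devices WHERE last_active_at < ?", (cutoff,)).rowcount


def tokens_for_user(conn, user_id):
    rows = conn.execute(
        "SELECT device_token FROM devices WHERE user_id = ? ORDER BY device_token",
        (user_id,),
    )
    return [row[0] for row in rows]
```

Keying the upsert on `device_token` handles the case where a device changes hands or a user logs in on a phone that already has a registered token.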

User Preferences

Setting levels

  • Global — on/off all notifications
  • Channel — push, email, SMS separately
  • Category — marketing, transactional, social
  • Quiet Hours — time restrictions

Frequency Capping

  • Max N notifications per hour/day
  • Aggregation (bundling) of similar notifications
  • Priority queue for critical alerts
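
The preference levels and the frequency cap above could be combined into a single gate that runs before a notification is enqueued. The sketch below is one possible shape: the `Preferences` fields, the default quiet hours, and the rule that critical alerts bypass quiet hours and the cap are assumptions, not requirements stated in the text.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Preferences:
    enabled: bool = True                                   # global switch
    channels: set = field(default_factory=lambda: {"push", "email", "sms"})
    muted_categories: set = field(default_factory=set)     # e.g. {"marketing"}
    quiet_start: time = time(22, 0)
    quiet_end: time = time(7, 0)
    max_per_hour: int = 5


def in_quiet_hours(prefs, local_time):
    start, end = prefs.quiet_start, prefs.quiet_end
    if start <= end:
        return start <= local_time < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def should_send(prefs, channel, category, local_time, sent_last_hour, critical=False):
    """Apply global, channel, category, quiet-hours, and frequency-cap checks in order."""
    if not prefs.enabled or channel not in prefs.channels:
        return False
    if category in prefs.muted_categories:
        return False
    if critical:
        return True  # assumption: critical alerts skip quiet hours and the cap
    if in_quiet_hours(prefs, local_time):
        return False
    return sent_last_hour < prefs.max_per_hour
```

`local_time` is expected to be in the user's time zone; `sent_last_hour` would typically come from a counter in a fast store such as Redis.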

Message Queue Architecture

Kafka or SQS decouples the services that produce notifications from the workers that deliver them, and persists messages so that they survive worker failures. Each channel has its own topic and consumer group.

Topics:
├── notifications.push.ios → Push Workers (APNs)
├── notifications.push.android → Push Workers (FCM)
├── notifications.email → Email Workers
├── notifications.sms → SMS Workers
├── notifications.websocket → WebSocket Workers
└── notifications.dlq → Dead Letter Queue

Partitioning: by user_id for ordering per user
Retention: 7 days (for retry and debugging)
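
Topic selection and per-user partitioning can be illustrated with two small functions. This is a sketch under assumptions: `topic_for` and `partition_for` are hypothetical helpers, and CRC32 is used only as a simple stable hash (Kafka's default partitioner hashes the message key with a different function, so in practice you would pass `user_id` as the key and let the client choose the partition).

```python
import zlib


def topic_for(channel, platform=None):
    """Map a channel (and platform for push) to a topic name from the layout above."""
    if channel == "push":
        if platform not in {"ios", "android"}:
            raise ValueError(f"push requires platform ios or android, got {platform!r}")
        return f"notifications.push.{platform}"
    if channel in {"email", "sms", "websocket"}:
        return f"notifications.{channel}"
    raise ValueError(f"unknown channel: {channel!r}")


def partition_for(user_id, num_partitions):
    """Stable hash of user_id, so all of one user's messages share a partition."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions
```

Because one partition is consumed in order by one consumer in a group, routing by `user_id` preserves per-user ordering without requiring global ordering.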

Rate Limiting

Rate Limiter

Protection against overload of external providers

Read the review

Delivery Guarantees

At-Most-Once

Fire and forget. The simplest approach.

⚠️ Messages may be lost

At-Least-Once

Retry with acknowledgment. Requires deduplication.

✓ Recommended for most cases

Exactly-Once

Idempotency key + transactional outbox.

Difficult, expensive, rarely needed

Deduplication Strategy

// Idempotency key = hash(user_id + notification_type + content_hash + date)

Redis SET with NX and a TTL (SETNX itself accepts no expiry argument):
  SET notification:dedupe:{idempotency_key} 1 NX EX 86400

If the reply is nil, the key already exists → skip sending
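
The deduplication check can be sketched in Python as follows. To keep the example runnable without a Redis server, `InMemoryDedupe` is a hypothetical in-process stand-in that mimics the set-if-absent-with-expiry behavior; in production the `claim` call would be a single atomic Redis `SET ... NX EX` command. The exact key layout is also an assumption.

```python
import hashlib
import time


def idempotency_key(user_id, notification_type, content, date):
    """Hash of user, type, content hash, and date, as described above."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    raw = f"{user_id}|{notification_type}|{content_hash}|{date}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupe:
    """Stand-in for Redis: claim a key once per TTL window. Not thread-safe."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key, ttl_seconds=86400):
        """Return True if the key was newly claimed, False if it is a duplicate."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # already claimed and not yet expired: skip sending
        self._expires_at[key] = now + ttl_seconds
        return True
```

A worker would call `claim` before sending and skip the message when it returns False. Including the date in the key means the same notification can be sent again on a later day.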

Retry & Error Handling

Retry Strategy

  • Exponential Backoff: 1s → 2s → 4s → 8s...
  • Max Retries: 3-5 attempts
  • Jitter: a random amount added to each delay so that retries from many workers do not arrive in synchronized bursts
  • Circuit Breaker: in case of massive provider errors
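
The backoff schedule with jitter can be expressed as a short generator. This is a sketch with assumed defaults: the 60-second cap and the 50% jitter ratio are illustrative choices, not values from the text, and the circuit breaker is left out.

```python
import random


def backoff_delays(base=1.0, factor=2.0, max_retries=5, cap=60.0,
                   jitter_ratio=0.5, rng=random.random):
    """Yield one delay in seconds per retry attempt.

    Each delay is the capped exponential value plus a random extra of up to
    `jitter_ratio` times that value. `rng` returns a float in [0, 1) and is
    injectable for deterministic tests.
    """
    for attempt in range(max_retries):
        exponential = min(cap, base * factor ** attempt)
        yield exponential + rng() * exponential * jitter_ratio
```

With jitter disabled (`jitter_ratio=0`) the defaults produce 1s, 2s, 4s, 8s, 16s, matching the progression listed above.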

Error Classification

  • Retryable: 429, 503, network timeout
  • Non-retryable: 400, 401, invalid token
  • DLQ: after exhaustion of retries
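
The classification above can drive a retry loop like the following sketch. The `ProviderError` type, the `send_with_retry` function, and its three result strings are hypothetical. The list-based dead letter queue and the injected `sleep` exist only to keep the example self-contained and testable.

```python
import time

RETRYABLE_STATUSES = {429, 503}


class ProviderError(Exception):
    """Raised by a sender when the provider responds with an error status."""

    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status


def is_retryable(error):
    if isinstance(error, TimeoutError):
        return True  # network timeout
    return isinstance(error, ProviderError) and error.status in RETRYABLE_STATUSES


def send_with_retry(send, message, delays, dead_letters, sleep=time.sleep):
    """Try once, then once more per entry in `delays`.

    Returns "sent", "rejected" (non-retryable error, not retried), or
    "dead_lettered" (retries exhausted; message appended to `dead_letters`).
    """
    schedule = iter(delays)
    while True:
        try:
            send(message)
            return "sent"
        except (ProviderError, TimeoutError) as error:
            if not is_retryable(error):
                return "rejected"
            delay = next(schedule, None)
            if delay is None:
                dead_letters.append(message)
                return "dead_lettered"
            sleep(delay)
```

A "rejected" result is where a worker would react to the specific cause, for example by deleting an invalid device token rather than retrying.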

Patterns

Release It!

Circuit Breaker and other stability patterns

Read the review

Scaling Considerations

Horizontal Scaling

• Stateless workers — auto-scale by queue depth

• Partition by user_id for ordering guarantees

• Separate worker pools per channel

Provider Rate Limits

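• APNs: HTTP/2 multiplexing; Apple publishes no fixed rate limit, so treat ~4000 req/s per connection as a rough estimate to verify under load

• FCM: 600K messages/min default quota per project (FCM is free; there are no paid plans)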

• Use connection pooling
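
One common way to stay under a provider's limit is a token bucket in front of each worker pool. The sketch below is a single-process illustration with hypothetical names; a fleet of workers would need a shared limiter (for example in Redis) or a per-worker share of the quota.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`. Not thread-safe."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, n=1):
        """Take `n` tokens if available; return False without blocking otherwise."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

When `try_acquire` returns False, a worker can pause briefly or leave the message in the queue instead of sending a request the provider would answer with 429.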

Broadcast notifications: for sends to millions of users, use FCM topics or fan out via Kafka with batch processing.
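
The batch fan-out option can be sketched as splitting the audience into fixed-size chunks and publishing one queue message per chunk. The `enqueue_broadcast` name, the message shape, and the batch size of 1000 are assumptions; `publish` stands in for a real queue producer.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without loading the whole input in memory."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def enqueue_broadcast(user_ids, notification, publish, batch_size=1000):
    """Publish one message per batch of recipients; return the number of batches."""
    count = 0
    for batch in batched(user_ids, batch_size):
        publish({"notification": notification, "user_ids": batch})
        count += 1
    return count
```

Workers then expand each batch into per-device sends, so a broadcast to 10M users becomes 10,000 queue messages at the default batch size rather than one enormous job.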

Observability

Metrics

  • Notifications sent/failed per channel
  • Delivery latency (p50, p99)
  • Queue depth and consumer lag
  • Provider error rates

Logging

  • Structured logs with correlation ID
  • Provider response codes
  • User preferences applied
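
A structured log line carrying the fields listed above might look like the following sketch. The field names and the `format_delivery_log` helper are illustrative assumptions; the point is that every log line for one notification shares a correlation ID, so it can be traced from trigger to provider response.

```python
import json
import uuid


def new_correlation_id():
    """Generate an ID once at the trigger and pass it through every stage."""
    return uuid.uuid4().hex


def format_delivery_log(correlation_id, channel, provider, status_code, preferences_applied):
    """Return one JSON log line describing a delivery attempt."""
    record = {
        "event": "notification_delivery",
        "correlation_id": correlation_id,
        "channel": channel,
        "provider": provider,
        "provider_status": status_code,
        "preferences_applied": sorted(preferences_applied),
    }
    return json.dumps(record, sort_keys=True)
```

Emitting JSON with sorted keys keeps lines easy to parse and to compare in a log aggregation tool.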

Alerting

  • Spike in error rates
  • Queue backlog growing
  • Provider degradation

Key takeaways for the interview

Show understanding

• Difference between APNs and FCM

• Device token lifecycle

• At-least-once vs exactly-once delivery

• Rate limiting of providers

• User preferences and quiet hours

Frequent follow-up questions

• How to guarantee notification ordering?

• How to process a broadcast for 10M users?

• How to avoid duplicate notifications?

• How to prioritize critical alerts?


© 2026 Alexander Polomodov