A notification system gets hard not when it sends one message, but when one event must be delivered reliably to many users, through several channels, across multiple devices.
The chapter connects queues, templates, user preferences, external-provider protection, and idempotency into one delivery path that can survive traffic spikes.
For interviews and engineering discussions, this case is valuable because it forces explicit choices about delivery guarantees, retry behavior, and acceptable degradation.
Latency budget
Each delivery channel needs an explicit latency target and a defined fallback path when it cannot meet that target.
Channel fan-out
The way one event expands across recipients and channels directly shapes the scalability and cost of the whole system.
User state
Model connected devices, quiet hours, delivery order, and the latest user preferences explicitly.
Graceful degradation
When a provider fails or saturates, the system should lose non-critical quality before it loses the entire delivery path.
A notification system stops being a simple channel wrapper once one event must reach many users, through several channels, across multiple devices, with preferences, priorities, and partial failures in play. A strong design answer has to go beyond listing push, email, and SMS: it needs explicit delivery guarantees, retry behavior, provider protection, and a clear degradation path.
Chapter 7
Acing SDI: Notification
A walkthrough of the case in Zhiyong Tan's book
Notification Types
Mobile push
APNs for iOS and FCM for Android. Delivery can still happen when the app is closed.
Email
Transactional and marketing email. SendGrid, SES, Mailgun.
SMS
Twilio, Vonage. Useful for 2FA and critical alerts.
In-app delivery
In-app notifications and WebSocket delivery while the user is online.
Requirements
Functional
- FR1: Send push, email, SMS, and in-app notifications
- FR2: Support templated messages with personalization
- FR3: Respect user notification preferences
- FR4: Support scheduled delivery
- FR5: Enforce rate limits and batch similar notifications
Non-functional
- NFR1: 10M+ notifications per day
- NFR2: Seconds-level delivery
- NFR3: At-least-once delivery with deduplication
- NFR4: 99.9% availability
High-level architecture
Architecture diagram (delivery channels are color-coded: Push in blue, Email in green, SMS in amber):
Business services (event sources) → Notification service (routing and templates) → Message queue (Kafka / SQS)
Message queue → Push workers → APNs (Apple) and FCM
Message queue → Email workers → SendGrid / SES (email providers)
Message queue → SMS workers → Twilio / Vonage (SMS providers)
Apple Push Notification Service (APNs)
APNs architecture
APNs runs over a persistent HTTP/2 connection and supports certificate-based or token-based authentication.
Provider → APNs → device
POST /3/device/<device_token>
Key details
- Device Token — the unique device token, which can change over time
- Payload — a JSON payload up to 4 KB with alert, badge, and sound fields
- Priority — 10 for immediate delivery or 5 for power-saving mode
- Expiration — TTL for devices that are currently offline
- Collapse ID — lets a newer notification replace an older one in the same group
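The key details above map onto HTTP/2 request headers. The following is a minimal sketch of how a push worker might assemble one request, not a complete client: the function name `build_apns_request` and its parameters are hypothetical, the push type is assumed to be `alert`, and the actual HTTP/2 transport and authentication are left out.

```python
import json
import time


def build_apns_request(device_token, payload, topic, *, immediate=True,
                       ttl_seconds=3600, collapse_id=None):
    """Return (path, headers, body) for one APNs request (hypothetical helper)."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(body) > 4096:  # 4 KB payload limit mentioned above
        raise ValueError("APNs payload exceeds 4 KB")
    headers = {
        "apns-topic": topic,            # normally the app's bundle ID
        "apns-push-type": "alert",      # assumption: a user-visible alert
        "apns-priority": "10" if immediate else "5",
        # Expiration is an absolute UNIX timestamp in seconds.
        "apns-expiration": str(int(time.time()) + ttl_seconds),
    }
    if collapse_id is not None:
        # A newer notification with the same ID replaces the older one.
        headers["apns-collapse-id"] = collapse_id
    return f"/3/device/{device_token}", headers, body
```

Checking the payload size before sending keeps an avoidable rejection out of the retry path.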
Sample APNs payload
{
  "aps": {
    "alert": {
      "title": "New Message",
      "body": "You have a new message from John"
    },
    "badge": 5,
    "sound": "default",
    "mutable-content": 1
  },
  "custom_data": {
    "conversation_id": "abc123"
  }
}
Setting "mutable-content" to 1 lets a Notification Service Extension modify the notification before it is displayed.
Firebase Cloud Messaging (FCM)
FCM architecture
FCM supports Android, iOS, and web clients. It exposes an HTTP v1 API with OAuth 2.0 authentication.
POST /v1/projects/<project_id>/messages:send
Message types
- Notification — rendered automatically by FCM
- Data — processed explicitly by the application
- Hybrid — combines both styles in one message
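The three message types differ only in which blocks the request body carries. As a rough sketch, a worker could build the body like this; `build_fcm_message` is a hypothetical helper, and it covers only the `token`, `notification`, and `data` fields.

```python
def build_fcm_message(token, *, title=None, body=None, data=None):
    """Build the JSON body for an FCM HTTP v1 send call (hypothetical helper).

    notification only -> rendered by FCM
    data only         -> handled by the application
    both              -> hybrid message
    """
    if title is None and body is None and not data:
        raise ValueError("message needs a notification block, a data block, or both")
    message = {"token": token}
    if title is not None or body is not None:
        notification = {}
        if title is not None:
            notification["title"] = title
        if body is not None:
            notification["body"] = body
        message["notification"] = notification
    if data:
        # FCM data payloads are string-to-string maps, so coerce values.
        message["data"] = {str(k): str(v) for k, v in data.items()}
    return {"message": message}
```

For example, `build_fcm_message("tok", data={"order_id": 1234})` produces a data-only message whose `order_id` value is the string `"1234"`.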
Sample FCM HTTP v1 request
{
  "message": {
    "token": "device_registration_token",
    "notification": {
      "title": "New Order",
      "body": "Your order #1234 has been shipped"
    },
    "data": {
      "order_id": "1234",
      "click_action": "OPEN_ORDER_DETAIL"
    },
    "android": {
      "priority": "high",
      "ttl": "86400s"
    },
    "apns": {
      "headers": {
        "apns-priority": "10"
      }
    }
  }
}
WebSocket
Chat System
Real-time delivery over WebSocket
Device token management
Device-token management is a critical part of the system. Tokens can change, devices go offline, and one user often owns several devices and client applications.
Device registry table
| Column | Type | Description |
|---|---|---|
| user_id | BIGINT | Reference to the user |
| device_token | VARCHAR(255) | APNs or FCM token |
| platform | ENUM | ios, android, web |
| app_version | VARCHAR(20) | Used for compatibility checks and rollout policy |
| last_active_at | TIMESTAMP | Helps remove stale tokens |
| created_at | TIMESTAMP | Registration time |
Important: process invalid-token feedback from APNs and FCM. Regularly remove stale tokens for devices that have not shown up for a long time.
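The two cleanup paths described above, provider feedback and age-based pruning, can be sketched with an in-memory registry. This is an illustration under stated assumptions: `DeviceRegistry`, `token_is_dead`, and the 90-day idle threshold are hypothetical, and a real system would back this with the device registry table. The provider signals used are APNs status 410 (or 400 with reason `BadDeviceToken`) and the FCM error code `UNREGISTERED`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Device:
    user_id: int
    device_token: str
    platform: str  # "ios", "android", or "web"
    last_active_at: datetime


def token_is_dead(platform, status, reason):
    """Return True when the provider response means the token should be deleted."""
    if platform == "ios":
        return status == 410 or (status == 400 and reason == "BadDeviceToken")
    # Android and web tokens are assumed to go through FCM.
    return reason == "UNREGISTERED"


class DeviceRegistry:
    """In-memory stand-in for the device registry table."""

    def __init__(self):
        self._by_token = {}

    def upsert(self, device):
        self._by_token[device.device_token] = device

    def remove(self, device_token):
        self._by_token.pop(device_token, None)

    def tokens_for(self, user_id):
        return sorted(d.device_token for d in self._by_token.values()
                      if d.user_id == user_id)

    def prune_stale(self, now, max_idle=timedelta(days=90)):
        """Delete devices idle for longer than max_idle; return how many were removed."""
        stale = [t for t, d in self._by_token.items()
                 if now - d.last_active_at > max_idle]
        for token in stale:
            self.remove(token)
        return len(stale)
```

A push worker would call `token_is_dead` on each failed send and `remove` the token immediately, while `prune_stale` would run as a periodic job.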
User preferences
Preference levels
- Global — turn all notifications on or off
- Channel — control push, email, and SMS separately
- Category — split marketing, transactional, and social notifications
- Quiet Hours — define silent periods and time-based restrictions
Frequency caps
- Cap notifications per hour or per day
- Batch similar notifications together
- Reserve a priority queue for critical alerts
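The preference levels can be evaluated as a short chain of checks before anything is enqueued. The sketch below is one possible ordering, not a prescribed one: the `Preferences` fields and the `decide` function are hypothetical, and it assumes that critical alerts skip quiet hours but still respect the global, channel, and category switches.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class Preferences:
    enabled: bool = True  # global switch
    channels: set = field(default_factory=lambda: {"push", "email", "sms"})
    muted_categories: set = field(default_factory=set)  # e.g. {"marketing"}
    quiet_start: Optional[time] = None
    quiet_end: Optional[time] = None


def in_quiet_hours(prefs, local_time):
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    start, end = prefs.quiet_start, prefs.quiet_end
    if start <= end:
        return start <= local_time < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def decide(prefs, channel, category, local_time, critical=False):
    """Return 'send', 'defer', or 'drop' for one notification."""
    if not prefs.enabled or channel not in prefs.channels:
        return "drop"
    if category in prefs.muted_categories:
        return "drop"
    if not critical and in_quiet_hours(prefs, local_time):
        return "defer"  # assumption: hold until quiet hours end
    return "send"
```

The time passed in must be the user's local time, so the caller needs the user's time zone alongside the preferences.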
Message queue architecture
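Kafka or SQS decouples the services that emit events from the workers that deliver them, so a traffic burst or a slow provider is absorbed by the queue instead of pushing back on producers, and delivery guarantees stay explicit. Each delivery channel usually gets its own topic and consumer group.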
Topics:
├── notifications.push.ios → Push workers (APNs)
├── notifications.push.android → Push workers (FCM)
├── notifications.email → Email workers
├── notifications.sms → SMS workers
├── notifications.websocket → WebSocket workers
└── notifications.dlq → Dead Letter Queue
Partitioning: by user_id to preserve per-user ordering
Retention: 7 days (for retries and incident debugging)
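Routing to a topic and choosing a partition are both small pure functions. The sketch below illustrates the idea with the topic names listed above; `topic_for` and `partition_for` are hypothetical helpers, and the SHA-256 hash is only a stand-in. With Kafka, setting the record key to `user_id` achieves the same effect through the client's own partitioner.

```python
import hashlib

TOPICS = {
    ("push", "ios"): "notifications.push.ios",
    ("push", "android"): "notifications.push.android",
    ("email", None): "notifications.email",
    ("sms", None): "notifications.sms",
    ("websocket", None): "notifications.websocket",
}


def topic_for(channel, platform=None):
    """Map a channel (and, for push, a platform) to its topic; raises KeyError if unknown."""
    key = (channel, platform if channel == "push" else None)
    return TOPICS[key]


def partition_for(user_id, num_partitions):
    """Stable partition for a user, so that user's messages stay in order."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Ordering holds only within one topic: a push and an email for the same user travel through different topics and can arrive in either order.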
Rate Limiting
Rate Limiter
Protection against overloading external providers
Delivery guarantees
At-Most-Once
Fire and forget. The simplest possible approach.
⚠️ Messages may be lost
At-Least-Once
Retries with acknowledgment. Requires deduplication.
✓ A practical default for most systems
Exactly-Once
Idempotency keys plus the transactional outbox pattern.
Complex, expensive, and rarely necessary
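At-least-once delivery means a worker can see the same message twice, for example when it sends successfully but crashes before acknowledging. A minimal sketch of a redelivery-safe handler follows. `DedupeStore` is a hypothetical in-memory stand-in for a shared store with set-if-absent and TTL semantics, and `handle`, `idempotency_key`, and the message fields are likewise illustrative.

```python
import hashlib
import time


def idempotency_key(user_id, notification_type, content, day):
    """Hash the fields that define 'the same notification' into one key."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    raw = f"{user_id}|{notification_type}|{content_hash}|{day}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeStore:
    """In-memory stand-in for a shared set-if-absent store with TTL."""

    def __init__(self, clock=time.monotonic):
        self._expires = {}
        self._clock = clock

    def claim(self, key, ttl_seconds):
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires[key] = now + ttl_seconds
        return True

    def release(self, key):
        self._expires.pop(key, None)


def handle(message, store, send):
    """Process one possibly redelivered message; return True if it can be acked."""
    key = idempotency_key(message["user_id"], message["type"],
                          message["body"], message["day"])
    if not store.claim(key, ttl_seconds=86400):
        return True  # duplicate: ack without sending again
    try:
        send(message)
    except Exception:
        store.release(key)  # allow the redelivered message to try again
        return False
    return True
```

Claiming the key before sending trades one failure mode for another: a worker that crashes between `claim` and `send` suppresses the notification until the key expires. Claiming after a successful send has the opposite risk, an occasional duplicate.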
Deduplication strategy
// Idempotency key = hash(user_id + notification_type + content_hash + date)
Redis SET with NX and a TTL (SETNX itself does not accept an expiry):
SET notification:dedupe:{idempotency_key} 1 NX EX 86400
If the key already exists (the command returns nil) → skip sending
Retries and error handling
Retry strategy
- Exponential Backoff: 1s → 2s → 4s → 8s...
- Max Retries: 3-5 attempts
- Jitter: add randomness so retries do not synchronize
- Circuit Breaker: trip when the provider is failing broadly
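The retry strategy above can be sketched in a few lines. This is an illustrative version under stated assumptions: the "full jitter" variant (a random delay between zero and the exponential ceiling) is one common choice among several, and the 30-second cap, the failure threshold, and the cool-down period are hypothetical defaults. The breaker is deliberately simplified: after the cool-down it lets requests through again until the next failure.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Delay before retry number `attempt` (0-based), using full jitter."""
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, 8s, ... up to cap
    return rng() * ceiling


class CircuitBreaker:
    """Opens after consecutive failures; allows trial requests after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._failures = 0
        self._opened_at = None

    def allow(self, now):
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has passed, then allow a trial.
        return now - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now  # open, or re-open after a failed trial
```

A worker would check `allow(now)` before each provider call and leave the message in the queue while the breaker is open, so retries do not pile onto a provider that is already failing.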
Error classification
- Retryable: 429, 503, network timeout
- Non-retryable: 400, 401, invalid token
- DLQ: after the retry budget is exhausted
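The classification above reduces to a single decision per delivery attempt. A minimal sketch follows, using the status codes from the list; the function name `next_action`, the returned labels, and the choice to treat unknown statuses as permanent failures are assumptions made for illustration.

```python
RETRYABLE_STATUS = {429, 503}
NON_RETRYABLE_STATUS = {400, 401}


def next_action(status, attempt, max_attempts=5, timed_out=False,
                invalid_token=False):
    """Return 'done', 'retry', 'dlq', or 'fail' after one delivery attempt.

    `attempt` is the 0-based index of the attempt that just finished.
    """
    if invalid_token or status in NON_RETRYABLE_STATUS:
        return "fail"  # retrying cannot help; log it and move on
    if timed_out or status in RETRYABLE_STATUS:
        # Retry until the budget is exhausted, then park the message.
        return "retry" if attempt + 1 < max_attempts else "dlq"
    if status is not None and 200 <= status < 300:
        return "done"
    return "fail"  # assumption: unknown statuses are not retried
```

Keeping this logic in one function makes the retry budget easy to test and easy to tune per channel.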
Patterns
Release It!
Circuit Breaker and other stability patterns
Scaling
Horizontal scaling
• Stateless workers that autoscale with queue depth
• Partition by user_id to preserve per-user ordering
• Separate worker pools for each channel
Provider limits
• APNs: HTTP/2 multiplexing, roughly 4000 requests per second per connection
• FCM: default quota of 600K messages per minute per project
• Use connection pooling
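Staying under a provider limit is usually done with a client-side limiter in front of each worker pool. The following token-bucket sketch is one way to do that; the `TokenBucket` class is hypothetical, takes the current time as an argument so it can be tested deterministically, and would need a shared store to enforce a limit across several worker processes.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._tokens = float(capacity)
        self._last = now

    def try_acquire(self, now, tokens=1):
        """Refill based on elapsed time, then take tokens if enough are available."""
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

When `try_acquire` returns False, the worker can pause briefly or leave the message in the queue, which turns a provider limit into queue backlog rather than provider-side 429 responses.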
Broadcast notifications: for large broadcasts to millions of users, use FCM Topics or expand the send through Kafka with batch processing.
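Expanding a broadcast through the queue usually means publishing batches of recipients rather than one message per user up front. A minimal sketch of that chunking step follows; `chunked`, `fan_out`, the batch size of 1000, and the message shape are all hypothetical.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items without loading everything."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fan_out(event_id, recipient_ids, batch_size=1000):
    """Turn one broadcast event into batch messages for downstream workers."""
    for index, batch in enumerate(chunked(recipient_ids, batch_size)):
        yield {"event_id": event_id, "batch": index, "user_ids": batch}
```

Because `recipient_ids` can be any iterable, the recipient list can be streamed from storage instead of being held in memory, and each batch message can be retried independently.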
Observability
Metrics
- Sent and failed notifications per channel
- Delivery latency (p50, p99)
- Queue depth and consumer lag
- Provider-side error rates
Logging
- Structured logs with a correlation ID
- Provider response codes
- Applied user preferences
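A structured log record for one delivery attempt might look like the sketch below. The function name and every field name are assumptions chosen to mirror the list above, not a fixed schema.

```python
import json


def delivery_log(correlation_id, channel, provider, status_code,
                 preference_decision, latency_ms):
    """Serialize one delivery attempt as a single JSON log line."""
    record = {
        "correlation_id": correlation_id,  # ties the attempt to the source event
        "channel": channel,
        "provider": provider,
        "provider_status": status_code,
        "preference_decision": preference_decision,  # e.g. "send" or "defer"
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)
```

Emitting the same correlation ID from the notification service and from each channel worker makes it possible to trace one event across its whole fan-out.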
Alerting
- Spikes in error rates
- Growing queue backlog
- Provider degradation
What to emphasize in an interview
What to demonstrate
• The difference between APNs and FCM
• How the device-token lifecycle works
• When at-least-once is enough and why deduplication matters
• How to protect external providers with rate limits
• Where user preferences and quiet hours are enforced
Common follow-up questions
• How do you preserve notification order for one user?
• How do you handle a broadcast to 10M users?
• How do you avoid duplicate notifications?
• How do you prioritize critical alerts?
Related chapters
- Chat System - provides a practical real-time scenario where in-app notifications and WebSocket delivery sit directly on the main user path.
- Event-Driven Architecture: Event Sourcing, CQRS, Saga - explains the asynchronous event-processing model that usually becomes the backbone of a notification pipeline.
- Distributed Message Queue - goes deeper into queue design, partitioning, and delivery guarantees for large-scale notification fan-out.
- Rate Limiter - helps protect external providers such as APNs, FCM, SMS gateways, and email vendors from burst traffic and abusive patterns.
- Fault tolerance patterns: Circuit Breaker, Bulkhead, Retry - extends the conversation around retries, DLQ handling, and graceful degradation when a delivery channel fails.
- WebSocket protocol - covers the in-app delivery channel built on persistent connections and low-latency updates.
- Release It! (short summary) - adds practical stability-engineering patterns for reliable notification delivery in production systems.
