A notification system is a critical component of modern applications. It combines push notifications (iOS/Android), email, SMS, in-app messages, and real-time WebSocket delivery. The problem comes up often in system design interviews, especially at companies with a large user base.
Chapter 7
Acing SDI: Notification
Analysis of the problem in the book by Zhiyong Tan
Notification Types
Push (Mobile)
APNs (iOS), FCM (Android). Delivery even when the application is closed.
Email
Transactional and marketing. SendGrid, SES, Mailgun.
SMS
Twilio, Vonage. 2FA, critical alerts.
In-App / WebSocket
Real-time delivery when the application is open.
Requirements
Functional
- FR1: Sending push, email, SMS, and in-app notifications
- FR2: Template support with personalization
- FR3: User preference settings
- FR4: Delayed (scheduled) sending
- FR5: Rate limiting and batching
Non-functional
- NFR1: 10M+ notifications per day
- NFR2: Soft real-time: delivery within seconds
- NFR3: At-least-once delivery (with deduplication)
- NFR4: 99.9% availability
High-Level Architecture
Diagram (channels are color-coded: Push blue, Email green, SMS amber):
Services (triggers) → Notification Service (routing + templates) → Message Queue (Kafka / SQS)
Push Worker → APNs (Apple), FCM
Email Worker → SendGrid / SES (email providers)
SMS Worker → Twilio / Vonage (SMS provider)
Apple Push Notification Service (APNs)
Architecture of APNs
APNs uses a persistent HTTP/2 connection with certificate-based or token-based authentication.
Provider → APNs → Device
POST /3/device/<device_token>
Key Features
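- Device Token: unique identifier of the app on a device; it can change over time, so treat it as mutable
- Payload: JSON of up to 4 KB with alert, badge, sound
- Priority: 10 (immediate) or 5 (power-saving)
- Expiration: TTL that controls how long the notification is kept for an offline device
- Collapse ID: a newer notification replaces an earlier one with the same ID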
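The features above map onto the request path, a handful of `apns-*` request headers, and the JSON body. The following is a minimal sketch of assembling one request without sending it; the function name `build_apns_request`, its parameters, and the example bundle ID are illustrative assumptions, and a real sender would also need an HTTP/2 client and authentication.

```python
import json
import time

APNS_MAX_PAYLOAD_BYTES = 4096  # 4 KB limit for regular notifications


def build_apns_request(device_token, title, body, *, topic, priority=10,
                       ttl_seconds=3600, collapse_id=None, now=None):
    """Return (path, headers, payload_bytes) for one APNs request.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    if priority not in (5, 10):
        raise ValueError("priority must be 10 (immediate) or 5 (power-saving)")
    now = int(time.time()) if now is None else int(now)

    payload = {"aps": {"alert": {"title": title, "body": body}, "sound": "default"}}
    payload_bytes = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(payload_bytes) > APNS_MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the 4 KB limit")

    headers = {
        "apns-topic": topic,                        # usually the app's bundle ID
        "apns-push-type": "alert",
        "apns-priority": str(priority),
        "apns-expiration": str(now + ttl_seconds),  # UNIX timestamp, not a duration
    }
    if collapse_id is not None:
        headers["apns-collapse-id"] = collapse_id
    return f"/3/device/{device_token}", headers, payload_bytes
```

Reusing the same `collapse_id` for, say, every message in one conversation lets a newer notification replace the older one on the device instead of stacking up.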
APNs Payload Example
{
  "aps": {
    "alert": {
      "title": "New Message",
      "body": "You have a new message from John"
    },
    "badge": 5,
    "sound": "default",
    "mutable-content": 1
  },
  "custom_data": {
    "conversation_id": "abc123"
  }
}
Setting "mutable-content" to 1 lets a Notification Service Extension modify the notification before it is displayed.
Firebase Cloud Messaging (FCM)
FCM Architecture
FCM supports Android, iOS, and web. It uses the HTTP v1 API with OAuth 2.0 authentication.
POST /v1/projects/<project_id>/messages:send
Message Types
- Notification: FCM displays it automatically
- Data: the application processes the payload itself
- Hybrid: both types in one message
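The three message types differ only in which top-level blocks the message carries. The following is a small sketch of a builder that makes this explicit; `build_fcm_message` and `message_kind` are hypothetical helper names, not part of any FCM SDK.

```python
def build_fcm_message(token, *, notification=None, data=None):
    """Build the JSON body for an FCM HTTP v1 send request (illustrative helper)."""
    if notification is None and data is None:
        raise ValueError("provide a notification block, a data block, or both")
    message = {"token": token}
    if notification is not None:
        message["notification"] = {
            "title": notification["title"],
            "body": notification["body"],
        }
    if data is not None:
        # FCM data payloads are string-to-string maps, so coerce values.
        message["data"] = {str(k): str(v) for k, v in data.items()}
    return {"message": message}


def message_kind(request):
    """Classify a request body as 'notification', 'data', or 'hybrid'."""
    message = request["message"]
    has_notification = "notification" in message
    has_data = "data" in message
    if has_notification and has_data:
        return "hybrid"
    return "notification" if has_notification else "data"
```

A data-only message is the usual choice when the app must decide for itself whether and how to show anything.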
FCM HTTP v1 Request
{
  "message": {
    "token": "device_registration_token",
    "notification": {
      "title": "New Order",
      "body": "Your order #1234 has been shipped"
    },
    "data": {
      "order_id": "1234",
      "click_action": "OPEN_ORDER_DETAIL"
    },
    "android": {
      "priority": "high",
      "ttl": "86400s"
    },
    "apns": {
      "headers": {
        "apns-priority": "10"
      }
    }
  }
}
WebSocket
Chat System
Real-time delivery via WebSocket
Device Token Management
Managing device tokens is a critical part of the system: tokens can change, devices can go offline, and a user can have multiple devices.
Device Registry Table
| Column | Type | Description |
|---|---|---|
| user_id | BIGINT | FK to users |
| device_token | VARCHAR(255) | APNs/FCM token |
| platform | ENUM | ios, android, web |
| app_version | VARCHAR(20) | For feature flags |
| last_active_at | TIMESTAMP | To clear stale tokens |
| created_at | TIMESTAMP | Registration date |
Important: Process feedback from APNs/FCM about invalid tokens. Regularly clear stale tokens (devices inactive for 30+ days).
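The token lifecycle described here (register or refresh, drop tokens the provider rejects, purge stale ones) can be sketched against an in-memory SQLite table. This is an illustration under assumptions: the `DeviceRegistry` class and its method names are hypothetical, only a subset of the columns is modeled, and timestamps are plain epoch seconds.

```python
import sqlite3

STALE_AFTER_SECONDS = 30 * 24 * 3600  # 30 days of inactivity


class DeviceRegistry:
    """Minimal device-token registry (hypothetical API, for illustration)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE devices ("
            " device_token TEXT PRIMARY KEY,"
            " user_id INTEGER NOT NULL,"
            " platform TEXT NOT NULL,"
            " last_active_at INTEGER NOT NULL)"
        )

    def register(self, user_id, device_token, platform, now):
        # Upsert: a token seen again refreshes its owner and activity time.
        self.db.execute(
            "INSERT OR REPLACE INTO devices"
            " (device_token, user_id, platform, last_active_at) VALUES (?, ?, ?, ?)",
            (device_token, user_id, platform, now),
        )

    def tokens_for(self, user_id):
        rows = self.db.execute(
            "SELECT device_token FROM devices WHERE user_id = ? ORDER BY device_token",
            (user_id,),
        )
        return [row[0] for row in rows]

    def remove_invalid(self, device_token):
        # Call this when APNs/FCM reports the token as invalid or unregistered.
        self.db.execute("DELETE FROM devices WHERE device_token = ?", (device_token,))

    def purge_stale(self, now):
        cursor = self.db.execute(
            "DELETE FROM devices WHERE last_active_at < ?",
            (now - STALE_AFTER_SECONDS,),
        )
        return cursor.rowcount  # number of tokens removed
```

Keying the table on the token rather than on the user means a device that changes hands is reassigned instead of duplicated.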
User Preferences
Setting levels
- Global — on/off all notifications
- Channel — push, email, SMS separately
- Category — marketing, transactional, social
- Quiet Hours — time restrictions
Frequency Capping
- Max N notifications per hour/day
- Aggregation (bundling) of similar notifications
- Priority queue for critical alerts
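The preference levels and the frequency cap can be evaluated as a chain of checks before a notification is enqueued. The following is a sketch only: the `Preferences` fields, the per-hour cap default, and the rule that critical alerts bypass quiet hours and caps are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class Preferences:
    enabled: bool = True  # global on/off
    channels: dict = field(
        default_factory=lambda: {"push": True, "email": True, "sms": True}
    )
    muted_categories: set = field(default_factory=set)
    quiet_start: Optional[time] = None
    quiet_end: Optional[time] = None
    max_per_hour: int = 10  # assumed frequency cap


def in_quiet_hours(prefs, local_time):
    start, end = prefs.quiet_start, prefs.quiet_end
    if start is None or end is None:
        return False
    if start <= end:
        return start <= local_time < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def should_send(prefs, channel, category, local_time,
                sent_last_hour=0, critical=False):
    if not prefs.enabled:
        return False
    if not prefs.channels.get(channel, False):
        return False
    if category in prefs.muted_categories:
        return False
    if critical:
        return True  # assumption: critical alerts skip quiet hours and caps
    if in_quiet_hours(prefs, local_time):
        return False
    return sent_last_hour < prefs.max_per_hour
```

Quiet hours have to be checked in the user's local time, so the caller is expected to convert before calling `should_send`.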
Message Queue Architecture
Kafka or SQS decouples the notification service from the delivery workers and persists messages so that failed deliveries can be retried. Each channel has its own topic and consumer group.
Topics:
├── notifications.push.ios → Push Workers (APNs)
├── notifications.push.android → Push Workers (FCM)
├── notifications.email → Email Workers
├── notifications.sms → SMS Workers
├── notifications.websocket → WebSocket Workers
└── notifications.dlq → Dead Letter Queue
Partitioning: by user_id for ordering per user
Retention: 7 days (for retry and debugging)
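Routing a notification then comes down to two decisions: which topic and which partition. The sketch below shows both; `topic_for` and `partition_for` are hypothetical helpers, and the SHA-256-based partitioning only illustrates the idea of a stable hash of `user_id` (a Kafka client applies its own partitioner to the message key).

```python
import hashlib

TOPICS = {
    ("push", "ios"): "notifications.push.ios",
    ("push", "android"): "notifications.push.android",
    ("email", None): "notifications.email",
    ("sms", None): "notifications.sms",
    ("websocket", None): "notifications.websocket",
}


def topic_for(channel, platform=None):
    """Pick the topic for a channel; only push is split by platform."""
    key = (channel, platform if channel == "push" else None)
    try:
        return TOPICS[key]
    except KeyError:
        raise ValueError(
            f"no topic for channel={channel!r} platform={platform!r}"
        ) from None


def partition_for(user_id, num_partitions):
    """Map a user to a fixed partition so their messages stay ordered."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the hash depends only on `user_id`, all notifications for one user land in the same partition and are consumed in order, as long as the partition count does not change.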
Rate Limiting
Rate Limiter
Protection against overload of external providers
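One common way to protect an external provider from overload is a token bucket in front of each worker pool. The following is a minimal single-process sketch; the class name, the parameters, and the explicit `now` argument (used instead of a real clock to keep it testable) are assumptions, and a fleet of workers would typically need a shared limiter instead.

```python
class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`,
    then a sustained rate of `rate_per_sec` requests."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full
        self.updated = float(now)

    def try_acquire(self, now, cost=1.0):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When `try_acquire` returns `False`, the worker can delay the message or leave it in the queue rather than sending a request that the provider is likely to reject.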
Delivery Guarantees
At-Most-Once
Fire and forget. The simplest approach.
⚠️ Messages may be lost
At-Least-Once
Retry with acknowledgment. Requires deduplication.
✓ Recommended for most cases
Exactly-Once
Idempotency key + transactional outbox.
Difficult, expensive, rarely needed
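With at-least-once delivery the same message can reach a worker more than once, so the worker needs an idempotency check before it sends. The sketch below uses an in-memory stand-in for a shared set-if-absent store with a TTL; `DedupeStore`, `handle`, and the message field names are hypothetical.

```python
import hashlib


def idempotency_key(user_id, notification_type, content, date):
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    raw = f"{user_id}|{notification_type}|{content_hash}|{date}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeStore:
    """In-memory stand-in for a shared store with set-if-absent and TTL."""

    def __init__(self):
        self._expires_at = {}

    def claim(self, key, now, ttl=86400):
        expires = self._expires_at.get(key)
        if expires is not None and expires > now:
            return False  # already claimed and not yet expired
        self._expires_at[key] = now + ttl
        return True

    def release(self, key):
        self._expires_at.pop(key, None)


def handle(message, store, send, now):
    key = idempotency_key(
        message["user_id"], message["type"], message["content"], message["date"]
    )
    if not store.claim(key, now):
        return "skipped"
    try:
        send(message)
    except Exception:
        # Release the claim so that a redelivery can try again;
        # otherwise a failed send would silently become at-most-once.
        store.release(key)
        raise
    return "sent"
```

Claiming the key before sending and releasing it on failure keeps retries possible while still suppressing duplicates of messages that were already sent.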
Deduplication Strategy
// Idempotency key = hash(user_id + notification_type + content_hash + date)
Redis SET with NX and a TTL:
SET notification:dedupe:{idempotency_key} 1 NX EX 86400
If the key already exists → skip sending
Retry & Error Handling
Retry Strategy
- Exponential Backoff: 1s → 2s → 4s → 8s...
- Max Retries: 3-5 attempts
- Jitter: random addition to distribute the load
- Circuit Breaker: in case of massive provider errors
Error Classification
- Retryable: 429, 503, network timeout
- Non-retryable: 400, 401, invalid token
- DLQ: after exhaustion of retries
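The retry strategy and the error classification combine into one delivery loop. The following is a minimal sketch: `ProviderError`, `deliver_with_retry`, and the list used as a dead letter queue are hypothetical names, and re-raising non-retryable errors to the caller is one possible policy, not the only one.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}


class ProviderError(Exception):
    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status


def is_retryable(exc):
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, ProviderError) and exc.status in RETRYABLE_STATUS


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff (base * 2**attempt, capped) plus random jitter."""
    return min(cap, base * 2 ** attempt) + rng() * base


def deliver_with_retry(send, message, max_attempts=4, sleep=time.sleep,
                       dlq=None, rng=random.random):
    for attempt in range(max_attempts):
        try:
            return send(message)
        except Exception as exc:
            if not is_retryable(exc):
                raise  # e.g. 400, 401, invalid token: retrying cannot help
            if attempt == max_attempts - 1:
                break  # retries exhausted
            sleep(backoff_delay(attempt, rng=rng))
    if dlq is not None:
        dlq.append(message)  # park the message for inspection or replay
    return None
```

Passing `sleep` and `rng` as parameters keeps the loop testable without real waiting; in production the defaults apply.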
Patterns
Release It!
Circuit Breaker and other stability patterns
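A circuit breaker stops calling a provider after repeated failures and lets a probe request through after a cool-down. The sketch below is a simple count-based variant; the class name, thresholds, and explicit `now` argument are assumptions, and production implementations usually add failure-rate windows and a limit on concurrent probes.

```python
class CircuitBreaker:
    """Count-based circuit breaker with closed, open, and half-open states."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let requests probe the provider.
        return now - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed probe
```

A worker would call `allow` before each provider request and report the outcome with `record_success` or `record_failure`; while the circuit is open, messages stay in the queue.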
Scaling Considerations
Horizontal Scaling
• Stateless workers — auto-scale by queue depth
• Partition by user_id for ordering guarantees
• Separate worker pools per channel
Provider Rate Limits
• APNs: HTTP/2 multiplexing, ~4000 req/s per connection
• FCM: 600K messages/min per project (default quota)
• Use connection pooling
Broadcast Notifications: For mass sends (millions of users), use FCM topics or fan out via Kafka with batch processing.
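The fan-out variant can be sketched as splitting the audience into fixed-size batches and publishing one queue message per batch. The names `batched` and `fan_out`, the batch size, and the `publish` callback are illustrative assumptions.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without loading everything in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fan_out(user_ids, publish, batch_size=1000):
    """Publish one message per batch of recipients; return the batch count."""
    count = 0
    for chunk in batched(user_ids, batch_size):
        publish({"user_ids": chunk})
        count += 1
    return count
```

Workers then expand each batch into per-device sends, so the broadcast is spread across the worker pool instead of being processed by a single producer.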
Observability
Metrics
- Notifications sent/failed per channel
- Delivery latency (p50, p99)
- Queue depth and consumer lag
- Provider error rates
Logging
- Structured logs with correlation ID
- Provider response codes
- User preferences applied
Alerting
- Spike in error rates
- Queue backlog growing
- Provider degradation
Key takeaways for the interview
Show understanding
• Difference between APNs and FCM
• Device token lifecycle
• At-least-once vs exactly-once delivery
• Rate limiting of providers
• User preferences and quiet hours
Frequent follow-up questions
• How to guarantee notification ordering?
• How to process a broadcast for 10M users?
• How to avoid duplicate notifications?
• How to prioritize critical alerts?
