Designing an instant messaging system is one of the classic tasks in System Design interviews. It is a complex problem that touches on real-time communication, scaling stateful connections, message consistency, and offline delivery.
Related chapter
Book review by Alex Xu
A detailed analysis of chat systems is in Chapter 12 of the book by Alex Xu.
1. Functional Requirements
1-on-1 chats between users.
Group chats (up to N participants).
Text and media message sending.
Online status indicator and typing indicators.
Read receipts and message history synchronization across devices.
Push notifications for offline users.
2. Non-functional requirements
Latency: < 100ms
Real-time message delivery for online users.
Availability: 99.99% uptime
Messenger should remain available under partial failures.
Consistency: delivery and ordering guarantees
Messages must not be lost or delivered out of order.
Scalability: 50M concurrent connections
System should scale horizontally as audience grows.
3. Selecting a communication protocol
Related chapter
WebSocket Protocol
Detailed analysis of WebSocket: handshake, keepalive, reconnect and best practices.
Comparison of approaches
| Approach | Latency | Server Load | Use Case |
|---|---|---|---|
| HTTP Polling | High (bounded by the polling interval) | Very high | Legacy fallback |
| Long Polling | Medium | High | Simple notifications |
| WebSocket ✓ | Minimal | Low | Real-time chats |
| Server-Sent Events | Minimal | Medium | Unidirectional flow |
Why WebSocket?
- ✓ Bidirectional: the client and server can send messages at any time
- ✓ Persistent connection: one connection per session, minimal overhead
- ✓ Low latency: no HTTP handshake on every message
- ✓ Efficient: less traffic and lower server load
4. High-level architecture
Figure: Chat System high-level map (realtime routing + durable storage + offline push delivery), split into a Realtime Plane and a Durable + Offline Plane.
Core chat-system topology: realtime WebSocket path, durable storage, and a dedicated offline push pipeline.
Online delivery path
- User A sends a message over a WebSocket connection.
- The WS Gateway forwards the payload to the Chat Router.
- The Session Registry looks up the chat server that holds User B's connection.
- If User B's session is active, the message is delivered to the recipient immediately.
Offline delivery path
- The message is first acknowledged and stored in the Message Store.
- Delivery Queue creates a push notification task with a retry policy.
- The Push Service sends the event to APNs/FCM.
- On reconnect, the client catches up using its last-seen message ID.
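The two delivery paths above can be sketched as a single routing decision. This is a minimal in-memory sketch under stated assumptions, not a production design: `ChatRouter`, `Message`, and all field names are hypothetical, the Session Registry, Message Store, and Delivery Queue are replaced by plain Python containers, and storing the message before either path is a simplifying assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: int
    chat_id: str
    sender_id: str
    recipient_id: str
    content: str


@dataclass
class ChatRouter:
    """Hypothetical in-memory stand-in for the Chat Router."""

    # user_id -> chat server id (stand-in for the Session Registry)
    sessions: dict = field(default_factory=dict)
    # durable log of accepted messages (stand-in for the Message Store)
    store: list = field(default_factory=list)
    # pending push tasks (stand-in for the Delivery Queue)
    push_queue: list = field(default_factory=list)
    # (server_id, message) pairs handed to chat servers for live delivery
    delivered: list = field(default_factory=list)

    def route(self, msg: Message) -> str:
        # Assumption: persist first, so the message survives a crash on either path.
        self.store.append(msg)
        server_id = self.sessions.get(msg.recipient_id)
        if server_id is not None:
            # Online path: hand the message to the server holding the session.
            self.delivered.append((server_id, msg))
            return "online"
        # Offline path: enqueue a push task that a worker can retry.
        self.push_queue.append(
            {"user_id": msg.recipient_id, "message_id": msg.message_id, "attempts": 0}
        )
        return "offline"
```

For example, with `ChatRouter(sessions={"user_b": "server_5"})`, a message to `user_b` takes the online path, while a message to any user missing from `sessions` ends up in `push_queue`.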
5. Message storage
Data storage
Database Internals
The choice between SQL and NoSQL depends on access patterns.
Comparison of storage approaches
| Database | Pros | Cons | Use Case |
|---|---|---|---|
| PostgreSQL | ACID, familiar | Sharding is complicated | Small scale |
| Cassandra ✓ | Horizontal scaling, high write throughput | Eventual consistency | Facebook/Discord chats |
| HBase | Wide-column, Hadoop integration | Complexity of operations | Facebook Messenger |
Data Schema (Cassandra)
-- Message table (partitioned by chat_id)
CREATE TABLE messages (
    chat_id    UUID,
    message_id TIMEUUID,  -- TIMEUUID here; a Snowflake ID stored as BIGINT also works
    sender_id  UUID,
    content    TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY ((chat_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Quick access to recent chat messages
SELECT * FROM messages
WHERE chat_id = ?
ORDER BY message_id DESC
LIMIT 50;
Why is message_id important?
- Ordering: TIMEUUID and Snowflake IDs are time-sortable, so messages in a chat sort chronologically by ID
- Pagination: "Load messages up to message_id X"
- Deduplication: retries with the same message_id are idempotent
- Sync: "Give all messages after message_id Y"
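To make the ordering property concrete, here is a minimal sketch of a Snowflake-style ID generator. The bit layout (timestamp, 10-bit worker, 12-bit sequence), the custom epoch, and the names `SnowflakeGenerator` and `next_id` are assumptions chosen for illustration; the source does not prescribe a specific layout.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary custom epoch
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical generator: IDs from one instance are strictly increasing."""

    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock  # injectable so the sketch can be tested
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

IDs from a single generator are strictly increasing. IDs from different workers are only roughly time-ordered, because the workers' clocks can differ, which is one reason to assign IDs per chat on the server side rather than on clients.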
6. Presence Service (Online status)
Heartbeat mechanism
The client sends a heartbeat periodically (every 5-30 seconds). If no heartbeat arrives within the timeout, the user is considered offline.
// Redis status storage: write the value and a 30-second TTL in one atomic command
// (the TTL should be longer than the heartbeat interval)
SET user:{user_id}:last_active {timestamp} EX 30
// Check online status
GET user:{user_id}:last_active
// If the key exists, the user is online
Fanout problem
If a user has 500 friends, each status change requires 500 notifications. Solutions:
- Lazy loading: status is requested when opening a chat
- Batch updates: sending changes every N seconds
- Selective push: only for active chats
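The heartbeat TTL and two of the fanout mitigations (lazy loading and batch updates) can be combined in one small sketch. It is an in-memory illustration under assumptions: `PresenceTracker`, `flush_changes`, and the injectable `clock` are hypothetical names, and a real deployment would keep this state in Redis as shown earlier.

```python
import time


class PresenceTracker:
    """Hypothetical presence store with a heartbeat TTL and batched change reports."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_active = {}  # user_id -> time of the last heartbeat
        self._published = {}    # user_id -> status last reported to subscribers

    def heartbeat(self, user_id: str) -> None:
        self._last_active[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        # Lazy loading: called on demand, for example when a chat is opened.
        last = self._last_active.get(user_id)
        return last is not None and self._clock() - last < self._ttl

    def flush_changes(self) -> dict:
        """Batch update: return only statuses that changed since the last flush.

        Intended to be called by a timer every N seconds (the timer is not shown).
        """
        changes = {}
        for user_id in self._last_active:
            online = self.is_online(user_id)
            if self._published.get(user_id, False) != online:
                changes[user_id] = online
                self._published[user_id] = online
        return changes
```

Calling `flush_changes()` on a timer means a user who flaps between online and offline within one interval produces at most one notification per interval instead of one per change.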
7. Group chats
Scaling groups
- Small groups: direct fanout via WebSocket. Every member receives the message directly.
- Medium groups: message queue + async workers, with batch delivery.
- Large groups: Pub/Sub model. Members subscribe to the channel, not to individual messages.
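Direct fanout, the first approach above, can be sketched as a planning step that groups recipients by the chat server holding their connection. This is an illustrative assumption rather than something the text prescribes: `plan_fanout` is a hypothetical helper, and `sessions` is a plain dict standing in for the user-to-server mapping.

```python
from collections import defaultdict


def plan_fanout(member_ids, sessions, sender_id):
    """Split group members into per-server batches and an offline list.

    member_ids: iterable of user IDs in the group
    sessions:   dict mapping user_id -> server_id for connected users
    sender_id:  the author, who does not need a copy of their own message
    """
    per_server = defaultdict(list)
    offline = []
    for user_id in member_ids:
        if user_id == sender_id:
            continue
        server_id = sessions.get(user_id)
        if server_id is None:
            offline.append(user_id)  # candidates for push notifications
        else:
            per_server[server_id].append(user_id)
    return dict(per_server), offline
```

Grouping by server lets the sender's server forward one copy of the message per target server instead of one per member, which matters as groups grow.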
Data schema for groups
-- Groups
CREATE TABLE groups (
    group_id   UUID PRIMARY KEY,
    name       TEXT,
    created_by UUID,
    created_at TIMESTAMP
);
-- Group members (for quick lookup)
CREATE TABLE group_members (
    group_id  UUID,
    user_id   UUID,
    joined_at TIMESTAMP,
    role      TEXT,  -- admin, member
    PRIMARY KEY ((group_id), user_id)
);
-- User groups (reverse index)
CREATE TABLE user_groups (
    user_id   UUID,
    group_id  UUID,
    last_read TIMEUUID,  -- for unread count
    PRIMARY KEY ((user_id), group_id)
);
8. Synchronization and offline delivery
Synchronization
Last-seen message ID
The last-seen message ID pattern is the key to effective synchronization.
Sync Protocol
Each device stores last_synced_message_id. When connected:
1. The client sends its last_synced_message_id
2. The server returns all messages after this ID
3. The client applies the changes and updates last_synced_message_id
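The three steps above can be sketched with an in-memory message log. This is a minimal sketch under assumptions: `ChatLog`, `Device`, `messages_after`, and the page size `limit` are hypothetical names, and message IDs are plain sortable integers standing in for TIMEUUID or Snowflake IDs.

```python
import bisect


class ChatLog:
    """Hypothetical server-side log holding messages sorted by message ID."""

    def __init__(self):
        self._ids = []       # sorted message IDs
        self._messages = {}  # message_id -> content

    def append(self, message_id: int, content: str) -> bool:
        if message_id in self._messages:
            return False  # duplicate from a retry: ignore (idempotent write)
        bisect.insort(self._ids, message_id)
        self._messages[message_id] = content
        return True

    def messages_after(self, last_synced_id: int, limit: int = 100):
        # Step 2: return messages with an ID strictly greater than the cursor.
        start = bisect.bisect_right(self._ids, last_synced_id)
        return [(i, self._messages[i]) for i in self._ids[start:start + limit]]


class Device:
    """Hypothetical client that tracks its own sync cursor."""

    def __init__(self):
        self.last_synced_message_id = 0
        self.inbox = []

    def sync(self, log: ChatLog) -> int:
        fetched = 0
        while True:
            # Step 1: send the cursor and fetch the next page.
            batch = log.messages_after(self.last_synced_message_id)
            if not batch:
                return fetched
            # Step 3: apply the page and advance the cursor.
            self.inbox.extend(batch)
            self.last_synced_message_id = batch[-1][0]
            fetched += len(batch)
```

Because each device keeps its own cursor, a phone and a laptop can sync independently, and a sync interrupted mid-way resumes from the last applied page.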
Offline Message Queue
For offline users, messages accumulate:
-- Queue of undelivered messages for offline users
CREATE TABLE offline_messages (
    user_id    UUID,
    message_id TIMEUUID,
    chat_id    UUID,
    sender_id  UUID,
    content    TEXT,
    PRIMARY KEY ((user_id), message_id)
) WITH default_time_to_live = 2592000; -- 30 days TTL
-- When a user connects
SELECT * FROM offline_messages WHERE user_id = ?;
-- After synchronization, delete the delivered messages
9. Scaling WebSocket servers
⚠️ Main difficulty
WebSocket connections are stateful. You can't simply add servers behind the load balancer; you need to know which server holds a specific user's connection.
Session Registry
The centralized registry (Redis) stores the user → server mapping:
// When a user connects
HSET user_sessions user_123 server_5
// When sending a message
target_server = HGET user_sessions user_456
// When disconnected
HDEL user_sessions user_123
Pub/Sub between servers
Servers communicate via Redis Pub/Sub or Kafka:
// Server 1 publishes a message
PUBLISH chat_server_5 {
    "type": "message",
    "to": "user_456",
    "content": "Hello!"
}
// Server 5 receives and delivers
// via local WebSocket connection
Sticky Sessions as an alternative
The load balancer can use sticky sessions (binding a user to a server by IP or cookie), but this makes failover and rebalancing harder. A Session Registry is the recommended approach.
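The Session Registry and per-server channels described in this section can be sketched end to end in memory. Everything here is a stand-in chosen for illustration: `InMemoryBus` replaces Redis Pub/Sub, a plain dict replaces the `user_sessions` hash, and `ChatServer` with its methods is a hypothetical class, not a real library API.

```python
from collections import defaultdict


class InMemoryBus:
    """Stand-in for Redis Pub/Sub: channel name -> list of callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, payload) -> int:
        for callback in self._subscribers[channel]:
            callback(payload)
        return len(self._subscribers[channel])


class ChatServer:
    """Hypothetical chat server holding local connections."""

    def __init__(self, server_id, registry, bus):
        self.server_id = server_id
        self.registry = registry  # shared dict: user_id -> server_id
        self.bus = bus
        self.local_inboxes = {}   # user_id -> list standing in for a WebSocket
        bus.subscribe(f"chat_{server_id}", self._on_remote_message)

    def connect(self, user_id):
        self.local_inboxes[user_id] = []
        self.registry[user_id] = self.server_id  # HSET user_sessions ...

    def disconnect(self, user_id):
        self.local_inboxes.pop(user_id, None)
        # Only remove the entry if it still points at this server.
        if self.registry.get(user_id) == self.server_id:
            del self.registry[user_id]  # HDEL user_sessions ...

    def send(self, to_user, content) -> bool:
        target = self.registry.get(to_user)  # HGET user_sessions ...
        if target is None:
            return False  # recipient offline: caller falls back to push
        payload = {"type": "message", "to": to_user, "content": content}
        self.bus.publish(f"chat_{target}", payload)
        return True

    def _on_remote_message(self, payload):
        inbox = self.local_inboxes.get(payload["to"])
        if inbox is not None:
            inbox.append(payload["content"])
```

With two servers sharing one registry and one bus, `s1.send("user_456", "Hello!")` lands in the inbox held by the server where `user_456` connected, and `send` returns `False` once that user disconnects.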
10. Key interview points
✓ Be sure to discuss
- Selecting a protocol (WebSocket vs alternatives)
- How to route messages between servers
- Message ordering and delivery guarantees
- Offline sync and push notifications
- Scaling stateful connections
💡 Additional themes
- End-to-end encryption (Signal Protocol)
- Read receipts and typing indicators
- Media storage (S3 + CDN)
- Rate limiting for spam prevention
- Multi-device sync
Common interview mistakes
- ✗ Forgetting that WebSocket connections are stateful and hard to scale
- ✗ Not thinking through the offline scenario and push notifications
- ✗ Ignoring message ordering under distributed delivery
- ✗ Not discussing the fanout problem for group chats
