System Design Space

Updated: March 2, 2026 at 9:29 AM

Chat System

mid

Classic task: WebSocket, real-time messaging, presence, groups, synchronization and scaling.

Designing an instant messaging system is one of the classic tasks in System Design interviews. It is a complex problem that touches on real-time communication, scaling stateful connections, message consistency, and offline delivery.

Related chapter

Book review by Alex Xu

Chapter 12 of Alex Xu's book covers chat systems in detail.

Read the review

Examples of real systems

WhatsApp
Telegram
Slack
Discord

1. Functional requirements

1-on-1 chats between users.

Group chats (up to N participants).

Text and media message sending.

Online status indicator and typing indicators.

Read receipts and message history synchronization across devices.

Push notifications for offline users.

2. Non-functional requirements

Latency: < 100ms

Real-time message delivery for online users.

Availability: 99.99% uptime

Messenger should remain available under partial failures.

Consistency: delivery and ordering guarantees

Messages must not be lost or delivered out of order.

Scalability: 50M concurrent connections

System should scale horizontally as audience grows.

System scale (example)

DAU: 500M
Messages/day: 100B
Concurrent connections: 50M
Average message size: 100 bytes
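
The example figures above translate into rough throughput and storage numbers. The sketch below only does the arithmetic; the variable names are illustrative, and the results are averages that ignore peak traffic, replication, and metadata overhead.

```python
# Back-of-envelope numbers derived from the example scale above.
MESSAGES_PER_DAY = 100_000_000_000   # 100B messages/day
DAU = 500_000_000                    # 500M daily active users
AVG_MESSAGE_BYTES = 100              # average message size
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
messages_per_user_per_day = MESSAGES_PER_DAY / DAU
raw_terabytes_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 10**12

print(f"average write rate: {avg_messages_per_second:,.0f} msg/s")  # about 1.16M msg/s
print(f"per user: {messages_per_user_per_day:.0f} msg/day")         # 200 msg/day
print(f"raw payload: {raw_terabytes_per_day:.0f} TB/day")           # 10 TB/day
```

Peak load is usually a multiple of the average, so these numbers are a lower bound for capacity planning.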

3. Selecting a communication protocol

Related chapter

WebSocket Protocol

Detailed analysis of WebSocket: handshake, keepalive, reconnect and best practices.

Read the review

Comparison of approaches

Approach           | Latency         | Server load | Use case
HTTP Polling       | High (interval) | Very high   | Legacy fallback
Long Polling       | Medium          | High        | Simple notifications
WebSocket ✓        | Minimal         | Optimal     | Real-time chats
Server-Sent Events | Minimal         | Medium      | Unidirectional flow

Why WebSocket?

  • Bidirectional: The client and server can send messages at any time
  • Persistent connection: One connection per session, minimum overhead
  • Low latency: No HTTP handshake on every message
  • Efficient: Less traffic and server load

4. High-level architecture

Chat System: High-Level Map

realtime routing + durable storage + offline push delivery

Realtime Plane

Client -> WS Gateway -> Chat Router
realtime message path
Session Registry + Presence
routing + online status

Durable + Offline Plane

Message Store -> Delivery Queue
durable persistence + async fanout
Push Service -> APNS/FCM
offline notifications

Core chat-system topology: realtime WebSocket path, durable storage, and a dedicated offline push pipeline.

Online delivery path

  1. User A sends a message via a WebSocket connection.
  2. The WS Gateway forwards the payload to the Chat Router.
  3. The Session Registry finds the target chat server for User B.
  4. If the recipient's session is active, the message is delivered immediately.

Offline delivery path

  1. The message is first stored in the Message Store and acknowledged to the sender.
  2. Delivery Queue creates a push notification task with a retry policy.
  3. Push Service sends the event to APNS/FCM.
  4. When reconnecting, the client is synchronized using the last-seen message ID.
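
The online and offline paths above can be sketched as a single routing decision. This is a minimal in-memory illustration, not a production design: `ChatRouter`, its fields, and the retry budget of 3 are hypothetical stand-ins for the Session Registry, Message Store, and Delivery Queue from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: int
    chat_id: str
    sender_id: str
    recipient_id: str
    content: str


@dataclass
class ChatRouter:
    # Hypothetical in-memory stand-ins for the components in the diagram.
    sessions: dict = field(default_factory=dict)       # Session Registry: user_id -> server_id
    message_store: list = field(default_factory=list)  # Message Store (durable in a real system)
    push_queue: list = field(default_factory=list)     # Delivery Queue feeding the Push Service
    delivered: list = field(default_factory=list)      # (server_id, message) pairs sent over WebSocket

    def route(self, msg: Message) -> str:
        # Persist first so the message survives a failed delivery attempt.
        self.message_store.append(msg)
        server_id = self.sessions.get(msg.recipient_id)
        if server_id is not None:
            # Online path: hand the message to the server holding the session.
            self.delivered.append((server_id, msg))
            return "online"
        # Offline path: enqueue a push task with a retry budget (assumed value).
        self.push_queue.append(
            {"message_id": msg.message_id, "user_id": msg.recipient_id, "retries_left": 3}
        )
        return "offline"


router = ChatRouter()
router.sessions["user_b"] = "server_5"
print(router.route(Message(1, "chat_1", "user_a", "user_b", "hi")))   # online
print(router.route(Message(2, "chat_1", "user_a", "user_c", "hey")))  # offline
```

Both branches store the message before attempting delivery, so a reconnecting client can always catch up from the Message Store.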

5. Message storage

Data storage

Database Internals

The choice between SQL and NoSQL depends on access patterns.

Read the review

Comparison of storage approaches

Database    | Pros                                      | Cons                    | Use case
PostgreSQL  | ACID, familiar                            | Sharding is complicated | Small scale
Cassandra ✓ | Horizontal scaling, high write throughput | Eventual consistency    | Facebook/Discord chats
HBase       | Wide-column, Hadoop integration           | Operational complexity  | Facebook Messenger

Data Schema (Cassandra)

-- Message table (partitioned by chat_id)
CREATE TABLE messages (
    chat_id UUID,
    message_id TIMEUUID, -- Snowflake ID or TIMEUUID
    sender_id UUID,
    content TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY ((chat_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Quick access to recent chat messages
SELECT * FROM messages 
WHERE chat_id = ? 
ORDER BY message_id DESC 
LIMIT 50;

Why is message_id important?

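  • Ordering: TIMEUUID and Snowflake IDs are time-sortable, so messages in a chat sort by creation time (IDs from different generators are ordered only up to clock skew)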
  • Pagination: "Load messages up to message_id X"
  • Deduplication: Idempotent operations during retry
  • Sync: "Give all messages after message_id Y"
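
To make the ordering property concrete, here is a minimal sketch of a Snowflake-style generator. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly described Snowflake scheme; the class name, the custom epoch, and the injectable `clock` parameter are assumptions made for illustration.

```python
import threading
import time


class SnowflakeGenerator:
    """Snowflake-style 64-bit IDs: ms timestamp | 10-bit worker | 12-bit sequence."""

    EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (
                ((now - self.EPOCH_MS) << shift)
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(worker_id=7)
first, second = generator.next_id(), generator.next_id()
print(first < second)  # IDs from one generator are strictly increasing
```

A single generator yields strictly increasing IDs. IDs from different workers are comparable only to the precision of their clocks, which is why strict per-chat ordering usually needs one sequencer per chat or a tolerance for small reorderings.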

6. Presence Service (Online status)

Heartbeat mechanism

The client periodically sends a heartbeat (every 5-30 seconds). If no heartbeat arrives before the key's TTL expires, the user is considered offline.

// Redis status storage: set the value and its TTL in one atomic command.
// The TTL should be longer than the heartbeat interval to avoid flapping.
SET user:{user_id}:last_active {timestamp} EX 30

// Check online status
GET user:{user_id}:last_active
// If the key exists, the user is online
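
The same heartbeat-with-expiry logic can be sketched without Redis. The `PresenceTracker` class below is a hypothetical in-memory model of the key-with-TTL pattern; the injectable `clock` exists only so the behaviour can be demonstrated deterministically.

```python
import time


class PresenceTracker:
    """In-memory model of 'set a key with a TTL on every heartbeat'."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the user counts as offline

    def heartbeat(self, user_id: str) -> None:
        # Each heartbeat pushes the expiry forward, like SET ... EX in Redis.
        self._expires_at[user_id] = self._clock() + self.ttl_seconds

    def is_online(self, user_id: str) -> bool:
        expires_at = self._expires_at.get(user_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Expired: drop the entry, mirroring key expiry.
            del self._expires_at[user_id]
            return False
        return True


fake_now = [0.0]
tracker = PresenceTracker(ttl_seconds=30, clock=lambda: fake_now[0])
tracker.heartbeat("user_123")
fake_now[0] = 29.0
print(tracker.is_online("user_123"))  # True: still inside the TTL window
fake_now[0] = 31.0
print(tracker.is_online("user_123"))  # False: no heartbeat for more than 30 s
```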

Fanout problem

If a user has 500 friends, each status change requires 500 notifications. Solutions:

  • Lazy loading: status is requested when opening a chat
  • Batch updates: sending changes every N seconds
  • Selective push: only for active chats
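
As an illustration of the batch-updates option, the sketch below coalesces status changes so that only the latest state per user is sent on each flush. `PresenceBatcher` is a hypothetical helper; a real service would call `flush()` from a timer every N seconds.

```python
class PresenceBatcher:
    """Coalesces presence changes between flushes (latest status wins)."""

    def __init__(self):
        self._pending = {}  # user_id -> most recent status

    def record(self, user_id: str, status: str) -> None:
        # Overwrites any earlier unsent change for the same user.
        self._pending[user_id] = status

    def flush(self) -> dict:
        # Hand back the accumulated batch and start a new one.
        batch, self._pending = self._pending, {}
        return batch


batcher = PresenceBatcher()
batcher.record("user_1", "online")
batcher.record("user_1", "offline")
batcher.record("user_1", "online")
batcher.record("user_2", "offline")
print(batcher.flush())  # {'user_1': 'online', 'user_2': 'offline'}
```

Three changes for `user_1` collapse into one update, so a flapping connection costs one notification per interval instead of one per change.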

7. Group chats

Scaling groups

Small
< 100 participants

Direct fanout via WebSocket. Everyone receives the message directly.

Medium
100-10K participants

Message Queue + async workers. Batch delivery.

Large
>10K (channels)

Pub/Sub model. Subscribe to the channel, not to individual messages.
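
The three tiers above amount to choosing a delivery strategy by group size. The sketch below encodes the thresholds from this section; the enum, function names, and the batch size in the usage lines are illustrative assumptions.

```python
from enum import Enum


class FanoutStrategy(Enum):
    DIRECT = "direct fanout over WebSocket"
    QUEUE = "message queue + async workers"
    PUBSUB = "pub/sub channel"


def choose_fanout(member_count: int) -> FanoutStrategy:
    # Thresholds follow the tiers above: < 100, 100-10K, > 10K.
    if member_count < 100:
        return FanoutStrategy.DIRECT
    if member_count <= 10_000:
        return FanoutStrategy.QUEUE
    return FanoutStrategy.PUBSUB


def chunked(member_ids: list, batch_size: int):
    # Splits recipients into batches for the medium tier's async workers.
    for start in range(0, len(member_ids), batch_size):
        yield member_ids[start:start + batch_size]


print(choose_fanout(50).name)      # DIRECT
print(choose_fanout(2_500).name)   # QUEUE
print(choose_fanout(50_000).name)  # PUBSUB
print([len(batch) for batch in chunked(list(range(250)), 100)])  # [100, 100, 50]
```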

Data schema for groups

-- Groups
CREATE TABLE groups (
    group_id UUID PRIMARY KEY,
    name TEXT,
    created_by UUID,
    created_at TIMESTAMP
);

-- Group members (for quick lookup)
CREATE TABLE group_members (
    group_id UUID,
    user_id UUID,
    joined_at TIMESTAMP,
    role TEXT, -- admin, member
    PRIMARY KEY ((group_id), user_id)
);

-- User groups (reverse index)
CREATE TABLE user_groups (
    user_id UUID,
    group_id UUID,
    last_read TIMEUUID, -- for unread count
    PRIMARY KEY ((user_id), group_id)
);

8. Synchronization and offline delivery

Synchronization

Last-seen message ID

The last-seen message ID pattern is the key to effective synchronization.

Read the review

Sync Protocol

Each device stores last_synced_message_id. When connected:

  1. The client sends its last_synced_message_id.
  2. The server returns all messages after this ID.
  3. The client applies the changes and updates last_synced_message_id.
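
The three steps above can be sketched as a paged catch-up loop. `ChatLog` and `Device` are hypothetical in-memory stand-ins for the server-side message store and a client; the page size is an assumption.

```python
from bisect import bisect_right, insort


class ChatLog:
    """Server side: messages indexed by sortable message_id."""

    def __init__(self):
        self._ids = []       # message IDs kept in sorted order
        self._messages = {}  # message_id -> content

    def append(self, message_id: int, content: str) -> None:
        insort(self._ids, message_id)
        self._messages[message_id] = content

    def messages_after(self, last_synced_id: int, limit: int = 100) -> list:
        # Step 2: everything strictly after the client's cursor, one page at a time.
        start = bisect_right(self._ids, last_synced_id)
        return [(mid, self._messages[mid]) for mid in self._ids[start:start + limit]]


class Device:
    """Client side: keeps its own cursor."""

    def __init__(self):
        self.last_synced_message_id = 0
        self.inbox = []

    def sync(self, log: ChatLog, page_size: int = 100) -> int:
        received = 0
        while True:
            # Step 1: send the cursor. Step 3: apply the page and advance the cursor.
            page = log.messages_after(self.last_synced_message_id, page_size)
            if not page:
                return received
            self.inbox.extend(page)
            self.last_synced_message_id = page[-1][0]
            received += len(page)


log = ChatLog()
for message_id in range(1, 6):
    log.append(message_id, f"message {message_id}")

phone = Device()
print(phone.sync(log, page_size=2))  # 5
log.append(6, "message 6")
print(phone.sync(log, page_size=2))  # 1 (only the new message is fetched)
```

Because each device holds its own cursor, a phone and a laptop can sync independently without the server tracking per-device delivery state.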

Offline Message Queue

For offline users, messages accumulate:

-- Queue of undelivered messages
CREATE TABLE offline_messages (
    user_id UUID,
    message_id TIMEUUID,
    chat_id UUID,
    sender_id UUID,
    content TEXT,
    PRIMARY KEY ((user_id), message_id)
) WITH default_time_to_live = 2592000; -- 30 days TTL

-- When a user connects
SELECT * FROM offline_messages WHERE user_id = ?;
-- After synchronization, delete the delivered messages

9. Scaling WebSocket servers

⚠️ Main difficulty

WebSocket connections are stateful. You can't simply add servers behind the load balancer: to deliver a message, you need to know which server holds a specific user's connection.

Session Registry

The centralized registry (Redis) stores the user → server mapping:

// When a user connects
HSET user_sessions user_123 server_5

// When sending a message
target_server = HGET user_sessions user_456

// When disconnected
HDEL user_sessions user_123

Pub/Sub between servers

Servers communicate via Redis Pub/Sub or Kafka:

// Server 1 publishes a message
PUBLISH chat_server_5 {
  "type": "message",
  "to": "user_456",
  "content": "Hello!"
}

// Server 5 receives and delivers
// via local WebSocket connection
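
The registry lookup and per-server channel can be sketched together in memory. This is a simplified model, not Redis or Kafka: `InMemoryBus` stands in for the pub/sub layer, a plain dict stands in for the Session Registry, and the `chat_server_{id}` channel naming follows the example above.

```python
from collections import defaultdict


class InMemoryBus:
    """Stand-in for Redis Pub/Sub or Kafka: channel name -> subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str, handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, event: dict) -> int:
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class ChatServer:
    def __init__(self, server_id: int, registry: dict, bus: InMemoryBus):
        self.server_id = server_id
        self.registry = registry     # shared Session Registry: user_id -> server_id
        self.bus = bus
        self.local_connections = {}  # user_id -> events delivered over the local socket
        bus.subscribe(f"chat_server_{server_id}", self._on_event)

    def connect(self, user_id: str) -> None:
        self.registry[user_id] = self.server_id
        self.local_connections[user_id] = []

    def disconnect(self, user_id: str) -> None:
        # Remove the mapping only if it still points at this server.
        if self.registry.get(user_id) == self.server_id:
            del self.registry[user_id]
        self.local_connections.pop(user_id, None)

    def send(self, to_user: str, content: str) -> bool:
        target = self.registry.get(to_user)
        if target is None:
            return False  # recipient offline: fall back to the offline path
        event = {"type": "message", "to": to_user, "content": content}
        self.bus.publish(f"chat_server_{target}", event)
        return True

    def _on_event(self, event: dict) -> None:
        outbox = self.local_connections.get(event["to"])
        if outbox is not None:
            outbox.append(event)


registry, bus = {}, InMemoryBus()
server_1, server_5 = ChatServer(1, registry, bus), ChatServer(5, registry, bus)
server_1.connect("user_123")
server_5.connect("user_456")
print(server_1.send("user_456", "Hello!"))                      # True
print(server_5.local_connections["user_456"][0]["content"])     # Hello!
```

The guard in `disconnect` matters when a user reconnects to a different server before the old connection is cleaned up: without it, the stale server would erase the fresh mapping.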

Sticky Sessions as an alternative

A load balancer can use sticky sessions (binding a user to a server by IP or cookie), but this complicates failover and rebalancing. A Session Registry is the recommended approach.

10. Key interview points

✓ Be sure to discuss

  • Selecting a protocol (WebSocket vs alternatives)
  • How to route messages between servers
  • Message ordering and delivery guarantees
  • Offline sync and push notifications
  • Scaling stateful connections

💡 Additional themes

  • End-to-end encryption (Signal Protocol)
  • Read receipts and typing indicators
  • Media storage (S3 + CDN)
  • Rate limiting for spam prevention
  • Multi-device sync

Common interview mistakes

  • Forgetting that WebSocket connections are stateful and hard to scale
  • Not thinking through the offline scenario and push notifications
  • Ignoring message ordering under distributed delivery
  • Not discussing the fanout problem for group chats


© 2026 Alexander Polomodov