
Updated: April 21, 2026 at 4:55 PM

Chat System

medium

Classic case: long-lived WebSocket connections, real-time delivery, presence, cross-device synchronization, and offline notifications.

Chat gets hard not when sending one message, but when the system must hold millions of long-lived connections, preserve message ordering, and synchronize multiple devices after reconnects.

The case ties together WebSocket gateways, server-to-server routing, durable history, presence, offline delivery, and push notifications into one working architecture.

For interviews and architecture reviews, it is useful because it forces explicit decisions about what must arrive instantly, what must preserve order, and what can be recovered later.

Latency Budget

Every hop from the WebSocket edge to the recipient needs a clear budget, or the chat stops feeling instant.

Session State

You need to know which server owns the active connection and when the system should treat a user as offline.

Offline Delivery

History storage, push notifications, and reconnect sync deserve their own design rather than being collapsed into one path.

Group Fan-out

As groups grow, message writes and message delivery should separate so the cost of fan-out stays under control.

A chat system becomes difficult not when sending one message, but when it has to hold millions of long-lived connections, stay within a tight latency budget, synchronize multiple devices, and recover message history after reconnects. That is why this case is a classic system-design interview question: it tests not only WebSocket knowledge, but also whether you can connect real-time delivery, durable storage, offline handling, and push channels into one coherent architecture.

Related chapter

Alex Xu book review

A detailed chat-system breakdown appears in chapter 12 of Alex Xu's book.

Read the review

Examples of real systems

WhatsApp
Telegram
Slack
Discord

1. Functional Requirements

1-on-1 chats between users.

Group chats with an upper bound on participants.

Text and media message delivery.

Online status and typing indicators.

Read receipts and message-history sync across devices.

Push notifications for users outside an active session.

2. Non-Functional Requirements

Latency: < 100 ms

Online users should receive messages almost instantly.

Availability: 99.99% uptime

The messenger must stay available even under partial failures.

Consistency: delivery and ordering

Messages should not be lost or reordered within the same chat.

Scalability: 50M concurrent connections

The architecture must grow horizontally with the audience.

Example system scale

DAU: 500M
Messages per day: 100B
Concurrent connections: 50M
Average message size: 100 bytes
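
A rough back-of-envelope pass over these figures can be sketched as follows. The peak-to-average factor of 3 is an assumption chosen for illustration, not a number from the example scale, and the payload volume ignores replication, indexes, and per-message metadata.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the example system scale.
DAU = 500_000_000
MESSAGES_PER_DAY = 100_000_000_000
CONCURRENT_CONNECTIONS = 50_000_000
AVG_MESSAGE_BYTES = 100

# Assumption (not from the source): peak traffic is about 3x the daily average.
PEAK_FACTOR = 3

avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_FACTOR
raw_bytes_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES
messages_per_user_per_day = MESSAGES_PER_DAY / DAU
share_of_dau_connected = CONCURRENT_CONNECTIONS / DAU

print(f"average write rate:   {avg_messages_per_second:,.0f} msg/s")
print(f"assumed peak rate:    {peak_messages_per_second:,.0f} msg/s")
print(f"raw payload volume:   {raw_bytes_per_day / 1e12:.0f} TB/day")
print(f"messages per user:    {messages_per_user_per_day:.0f} per day")
print(f"DAU connected at once: {share_of_dau_connected:.0%}")
```

The takeaway is an average write rate of a little over one million messages per second, which is the main reason the storage section leans toward a write-optimized, horizontally scalable store.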

3. Choosing the Communication Protocol

Related chapter

WebSocket Protocol

A deeper look at WebSocket handshake, keepalive, reconnect behavior, and production guidance.

Read the review

Comparing the options

| Approach | Latency | Server load | Best fit |
|---|---|---|---|
| HTTP Polling | High | Very high | Legacy fallback |
| Long Polling | Medium | High | Simple notifications |
| WebSocket ✓ | Minimal | Optimal | Real-time chat and collaboration |
| Server-Sent Events | Minimal | Medium | One-way notifications |

Why WebSocket usually wins

  • Bidirectional channel: both client and server can send events whenever they need to.
  • One connection per session: the system does not pay for a new HTTP handshake on every message.
  • Lower protocol overhead: less extra traffic and less pressure on the server tier.
  • Cleaner delivery model: online delivery, acknowledgements, and reconnect handling can all sit behind the same transport.

4. High-Level Architecture

In a production design, you need to explain how the session registry maps each user to a specific WebSocket server and how presence tells the system whether it can deliver immediately to an active session or should switch to the offline path.

Chat System: Architecture Map

connection routing, message storage, and offline delivery

Realtime Connection Layer

  • Clients -> Gateway -> Router: primary online-delivery path
  • Registry + Presence Service: routing and online status

Storage and Offline Delivery

  • Store -> Delivery Queue: durable history and async processing
  • Push Service -> APNS / FCM: notifications for offline users

Reference chat-system layout: long-lived connections, message routing, durable history, and a separate offline delivery path.

Online delivery path

  1. User A sends a message through a WebSocket connection.
  2. The WebSocket gateway forwards the event to the chat-routing layer.
  3. The session registry tells the system which server currently owns user B's session.
  4. If that session is active, the message is pushed directly to the recipient.

Offline delivery path

  1. The message is persisted in durable storage and then acknowledged to the sender.
  2. The delivery queue creates a push task and retry schedule.
  3. The push service hands the event to APNS or FCM.
  4. After reconnect, the client pulls everything after its last confirmed sync point.
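
The two delivery paths can be sketched as a single routing decision. This is a minimal in-memory illustration, not a production design: `ChatRouter`, `Message`, and the list-based stand-ins for the durable store and the delivery queue are hypothetical names introduced here. The sketch also assumes the message is persisted on both paths, so that history is available for later sync.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: int
    chat_id: str
    sender_id: str
    recipient_id: str
    content: str


@dataclass
class ChatRouter:
    """Hypothetical in-memory stand-in for the chat-routing layer."""

    sessions: dict[str, str] = field(default_factory=dict)  # user_id -> server_id
    history: list[Message] = field(default_factory=list)    # durable store stand-in
    push_tasks: list[int] = field(default_factory=list)     # delivery queue stand-in
    delivered: list[tuple[str, Message]] = field(default_factory=list)

    def route(self, message: Message) -> str:
        # Persist first so the message survives whichever path is taken.
        self.history.append(message)
        server_id = self.sessions.get(message.recipient_id)
        if server_id is not None:
            # Online path: hand the message to the server that owns the session.
            self.delivered.append((server_id, message))
            return "online"
        # Offline path: schedule a push; the client pulls history on reconnect.
        self.push_tasks.append(message.message_id)
        return "offline"


router = ChatRouter(sessions={"user_b": "server_5"})
print(router.route(Message(1, "chat_1", "user_a", "user_b", "hi")))  # online
print(router.route(Message(2, "chat_2", "user_a", "user_c", "hi")))  # offline
```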

5. Message Storage

Data storage

Database Internals

The choice between SQL and NoSQL depends on write patterns, reads, and pagination.

Read the review

Comparing storage options

| Database | Pros | Cons | Best fit |
|---|---|---|---|
| PostgreSQL | Transactions and a familiar stack | Harder to scale out cleanly | Smaller deployments |
| Cassandra ✓ | Scales horizontally and handles writes well | Requires careful work around eventual consistency | Large chat platforms and messengers |
| HBase | Wide-column model and Hadoop integration | Higher operational complexity | Very large analytical platforms |

Data schema in Cassandra

-- Message table (partitioned by chat_id)
CREATE TABLE messages (
    chat_id       UUID,
    message_id    TIMEUUID,  -- Snowflake ID or TIMEUUID
    sender_id     UUID,
    content       TEXT,
    created_at    TIMESTAMP,
    PRIMARY KEY ((chat_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Fast access to recent chat messages
SELECT * FROM messages
WHERE chat_id = ?
ORDER BY message_id DESC
LIMIT 50;

Why `message_id` matters so much

  • Ordering: a TIMEUUID or Snowflake ID is time-sortable, so it helps reconstruct one clear sequence of events within a chat.
  • Pagination: it becomes easy to ask for messages before or after a known point.
  • Idempotency: when the ID is assigned once and reused on every retry, duplicate delivery attempts overwrite the same row instead of creating extra ones.
  • Sync: devices can request everything after the last known message.
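
To make the ordering property concrete, here is a minimal sketch of a Snowflake-style generator. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly described Snowflake scheme. The epoch value, the class name, and the choice to advance a logical timestamp instead of blocking when the sequence overflows are assumptions made for this illustration.

```python
import threading
import time

# Assumed layout: 41-bit millisecond timestamp, 10-bit worker ID,
# 12-bit per-millisecond sequence.
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
CUSTOM_EPOCH_MS = 1_600_000_000_000  # hypothetical epoch, for illustration only


class SnowflakeGenerator:
    def __init__(self, worker_id, clock_ms=lambda: time.time_ns() // 1_000_000):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                # Never go backwards: reuse the last timestamp if the clock regressed.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: advance the logical timestamp
                    # instead of blocking until the next millisecond.
                    now = self._last_ms + 1
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(worker_id=7)
first, second = generator.next_id(), generator.next_id()
print(first < second)
```

IDs from a single generator are strictly increasing. IDs from different workers are only roughly time-ordered, because clocks differ between machines, so strict per-chat ordering usually still needs one sequencing point per chat.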

6. Presence and Online Status

Activity checks

A client typically sends a heartbeat every 5-30 seconds. If the system stops receiving it, the user is treated as offline and the design no longer relies on immediate delivery to an open session.

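// Redis stores the last activity timestamp. EX sets the value and its
// 30-second expiry in one atomic command. The TTL should be longer than
// the heartbeat interval, or a slow heartbeat makes the user flap offline.
SET user:{user_id}:last_active {timestamp} EX 30

// Check online status
GET user:{user_id}:last_active
// If the key exists, the user is considered online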
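
The same heartbeat-with-expiry idea can be sketched without Redis, which makes the timing rules easy to test. `PresenceTracker` is a hypothetical in-memory stand-in; the injectable `clock` parameter exists only so the example can be driven deterministically.

```python
import time


class PresenceTracker:
    """In-memory stand-in for the key-with-TTL presence pattern."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_active: dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        # Equivalent of refreshing the key and its expiry on every heartbeat.
        self._last_active[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        last = self._last_active.get(user_id)
        if last is None:
            return False
        if self._clock() - last >= self._ttl:
            # Lazily drop the expired entry, mimicking key expiry.
            del self._last_active[user_id]
            return False
        return True


fake_now = [0.0]
tracker = PresenceTracker(ttl_seconds=30, clock=lambda: fake_now[0])
tracker.heartbeat("user_123")
fake_now[0] = 10.0
print(tracker.is_online("user_123"))  # within the TTL window
fake_now[0] = 45.0
print(tracker.is_online("user_123"))  # TTL elapsed without a heartbeat
```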

Group fan-out pressure

Even a simple status change can explode into a large fan-out problem when a user has hundreds of contacts or participates in many groups.

  • Load presence lazily when a relevant conversation is actually opened.
  • Batch updates instead of pushing every tiny change immediately.
  • Push only to active or high-priority conversations.
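
The batching and filtering ideas in the list above can be sketched together. `PresenceBatcher` is a hypothetical helper: it assumes `flush` is called on a timer, and that the caller knows which users are currently visible in an open conversation.

```python
class PresenceBatcher:
    """Coalesces presence changes so each flush carries one update per user."""

    def __init__(self) -> None:
        self._pending: dict[str, bool] = {}

    def record(self, user_id: str, online: bool) -> None:
        # A newer status replaces any unsent older one for the same user.
        self._pending[user_id] = online

    def flush(self, watched_users: set[str]) -> dict[str, bool]:
        # Only users visible in an open conversation get a pushed update.
        # The rest are dropped and loaded lazily when a conversation opens.
        batch = {
            user_id: online
            for user_id, online in self._pending.items()
            if user_id in watched_users
        }
        self._pending.clear()
        return batch


batcher = PresenceBatcher()
batcher.record("alice", True)
batcher.record("alice", False)  # coalesced with the earlier update
batcher.record("bob", True)     # nobody is watching bob right now
print(batcher.flush(watched_users={"alice"}))
```

Three recorded changes collapse into a single pushed update, which is the point of batching: the fan-out cost follows the number of watched users per interval rather than the number of raw status flips.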

7. Group Chats

Groups break the naive “one message, one recipient” model. As membership grows, the system increasingly needs to separate the write step from delivery and move toward a more event-driven distribution model.

How the architecture changes as groups grow

  • Small (up to 100 members): direct delivery over WebSocket stays simple and manageable.
  • Medium (100-10K members): it is safer to separate message writes from delivery and hand distribution to background workers.
  • Very large (channels and communities with 10K+ members): you want a channel-subscription model rather than a personal push to every member.
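
The size tiers above amount to a strategy switch. A minimal sketch, assuming the boundaries are inclusive on the lower tier (the text leaves the exact handling of 100 and 10K open) and using hypothetical strategy names:

```python
from enum import Enum


class FanoutStrategy(Enum):
    DIRECT = "direct"                              # push to each member's socket inline
    ASYNC_WORKERS = "async_workers"                # write once, fan out in the background
    CHANNEL_SUBSCRIPTION = "channel_subscription"  # members subscribe to the channel


# Thresholds from the tiers above; boundary handling is an assumption.
SMALL_GROUP_LIMIT = 100
MEDIUM_GROUP_LIMIT = 10_000


def choose_fanout(member_count: int) -> FanoutStrategy:
    if member_count < 0:
        raise ValueError("member_count must be non-negative")
    if member_count <= SMALL_GROUP_LIMIT:
        return FanoutStrategy.DIRECT
    if member_count <= MEDIUM_GROUP_LIMIT:
        return FanoutStrategy.ASYNC_WORKERS
    return FanoutStrategy.CHANNEL_SUBSCRIPTION


for size in (12, 2_500, 250_000):
    print(size, choose_fanout(size).value)
```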

Data schema for groups

-- Groups
CREATE TABLE groups (
    group_id UUID PRIMARY KEY,
    name TEXT,
    created_by UUID,
    created_at TIMESTAMP
);

-- Group members (for fast lookup)
CREATE TABLE group_members (
    group_id UUID,
    user_id UUID,
    joined_at TIMESTAMP,
    role TEXT, -- admin, member
    PRIMARY KEY ((group_id), user_id)
);

-- User groups (reverse index)
CREATE TABLE user_groups (
    user_id UUID,
    group_id UUID,
    last_read TIMEUUID, -- for unread counts
    PRIMARY KEY ((user_id), group_id)
);

8. Synchronization and Offline Delivery

Synchronization

Last-seen message ID

The last-seen message ID pattern makes cross-device history sync much cheaper.

Read the review

Sync protocol

Each device stores `last_synced_message_id`. On reconnect, the client sends that checkpoint and the server returns everything that came after it.

  1. The client sends its `last_synced_message_id`.
  2. The server returns all messages after that identifier.
  3. The client applies the delta and updates its sync point.
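
The three steps above can be sketched as a checkpoint-based pull loop. This assumes message IDs are sortable integers kept in ascending order on the server. Returning results in pages is an assumption added here so that a long offline period does not produce one huge response, and `messages_after` and `sync_device` are hypothetical helper names.

```python
from bisect import bisect_right


def messages_after(message_ids, last_synced_message_id, limit=100):
    """Server side: return up to `limit` IDs strictly after the checkpoint."""
    if last_synced_message_id is None:
        start = 0  # a fresh device has no checkpoint yet
    else:
        start = bisect_right(message_ids, last_synced_message_id)
    return message_ids[start:start + limit]


def sync_device(server_ids, checkpoint, page_size=100):
    """Client side: pull pages until caught up, then return the new checkpoint."""
    received = []
    while True:
        page = messages_after(server_ids, checkpoint, page_size)
        if not page:
            return received, checkpoint
        received.extend(page)
        checkpoint = page[-1]  # advance the sync point after applying the page


server_ids = [10, 20, 30, 40, 50]
new_ids, new_checkpoint = sync_device(server_ids, checkpoint=20, page_size=2)
print(new_ids, new_checkpoint)
```

Because the checkpoint only advances after a page is applied, a device that disconnects mid-sync simply resumes from its last stored checkpoint.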

Offline queue

For offline users, it helps to separate durable history storage from a dedicated queue that handles delayed delivery and retries.

-- Queue of unread messages
CREATE TABLE offline_messages (
    user_id UUID,
    message_id TIMEUUID,
    chat_id UUID,
    sender_id UUID,
    content TEXT,
    PRIMARY KEY ((user_id), message_id)
) WITH default_time_to_live = 2592000; -- 30 days TTL

-- When a user reconnects
SELECT * FROM offline_messages WHERE user_id = ?;
-- Delivered rows are deleted after synchronization
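
The enqueue, read, and delete-after-sync cycle can be sketched in memory. `OfflineQueue` is a hypothetical stand-in keyed the same way as the table above, by user and then by message ID. It does not model the 30-day TTL, and the acknowledge-up-to-an-ID deletion is one possible policy, not the only one.

```python
from collections import defaultdict


class OfflineQueue:
    """In-memory stand-in for a per-user queue of undelivered messages."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[int, str]] = defaultdict(dict)

    def enqueue(self, user_id: str, message_id: int, content: str) -> None:
        # Keyed by (user_id, message_id), so a retried enqueue
        # overwrites the same row instead of adding a duplicate.
        self._rows[user_id][message_id] = content

    def pending(self, user_id: str) -> list[tuple[int, str]]:
        # Reading does not delete: rows stay until the client confirms them.
        return sorted(self._rows.get(user_id, {}).items())

    def ack(self, user_id: str, up_to_message_id: int) -> None:
        # Delete everything the client has confirmed as synchronized.
        rows = self._rows.get(user_id, {})
        for message_id in [m for m in rows if m <= up_to_message_id]:
            del rows[message_id]


queue = OfflineQueue()
queue.enqueue("user_456", 101, "Hello!")
queue.enqueue("user_456", 101, "Hello!")  # retry, no duplicate row
queue.enqueue("user_456", 102, "Are you there?")
print(queue.pending("user_456"))
queue.ack("user_456", up_to_message_id=101)
print(queue.pending("user_456"))
```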

9. Scaling WebSocket Servers

Main challenge

WebSocket connections are stateful. You cannot simply add more servers behind a load balancer without knowing which node currently owns a user's active connection.

Session registry

A centralized Redis hash stores the user → server mapping, so routing does not depend on whichever node the balancer happened to pick.

// When a user connects
HSET user_sessions user_123 server_5

// When sending a message
target_server = HGET user_sessions user_456

// When a user disconnects
HDEL user_sessions user_123
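
One subtle case the commands above do not show is a late disconnect: the user reconnects to a new server before the old server notices that its socket closed. A minimal in-memory sketch of a registry that guards against this follows. `SessionRegistry` and the owner check on disconnect are assumptions added for illustration; in Redis, the equivalent check-and-delete would need to run atomically, for example inside a Lua script.

```python
from typing import Optional


class SessionRegistry:
    """In-memory stand-in for the user -> server hash."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}

    def connect(self, user_id: str, server_id: str) -> None:
        # Mirrors HSET: the latest connection wins.
        self._owner[user_id] = server_id

    def lookup(self, user_id: str) -> Optional[str]:
        # Mirrors HGET: None means no active session is known.
        return self._owner.get(user_id)

    def disconnect(self, user_id: str, server_id: str) -> bool:
        # Guarded delete: only the current owner may remove the mapping,
        # so a stale disconnect cannot erase a newer session.
        if self._owner.get(user_id) != server_id:
            return False
        del self._owner[user_id]
        return True


registry = SessionRegistry()
registry.connect("user_123", "server_5")
registry.connect("user_123", "server_7")            # user reconnected elsewhere
print(registry.disconnect("user_123", "server_5"))  # stale disconnect is ignored
print(registry.lookup("user_123"))
```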

Server-to-server delivery

Redis Pub/Sub or Kafka can carry inter-server events so the target server consumes the event from its own channel and delivers it locally.

// Server 1 publishes a message
PUBLISH chat_server_5 {
  "type": "message",
  "to": "user_456",
  "content": "Hello!"
}

// Server 5 receives the event and delivers it
// over its local WebSocket connection
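
The publish-then-deliver-locally flow can be simulated in a single process. `InProcessBus` and `GatewayServer` are hypothetical stand-ins for the message bus and a WebSocket server, and the per-user list stands in for a real socket. The channel name follows the `chat_server_5` convention from the example above.

```python
import json
from collections import defaultdict


class InProcessBus:
    """Stand-in for a pub/sub bus with one channel per gateway server."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel, payload: str) -> int:
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(payload)
        # Report how many subscribers received the event.
        return len(handlers)


class GatewayServer:
    def __init__(self, server_id: str, bus: InProcessBus) -> None:
        self.server_id = server_id
        # user_id -> frames "sent" over that user's local connection
        self.local_sockets: dict[str, list[str]] = {}
        bus.subscribe(f"chat_{server_id}", self._on_event)

    def attach(self, user_id: str) -> None:
        self.local_sockets[user_id] = []

    def _on_event(self, payload: str) -> None:
        event = json.loads(payload)
        outbox = self.local_sockets.get(event["to"])
        if outbox is not None:
            # Deliver over the local connection owned by this server.
            outbox.append(event["content"])


bus = InProcessBus()
server_5 = GatewayServer("server_5", bus)
server_5.attach("user_456")
event = {"type": "message", "to": "user_456", "content": "Hello!"}
receivers = bus.publish("chat_server_5", json.dumps(event))
print(receivers, server_5.local_sockets["user_456"])
```

A receiver count of zero is a useful signal in this sketch: nobody is listening on the target channel, so the sender can fall back to the offline path.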

Sticky sessions as an alternative

A load balancer can try to keep a user on the same server through sticky sessions based on IP or cookies. That can work, but it makes failover and rebalancing harder, so a dedicated session registry is usually easier to reason about.

10. Key Interview Points

What you should always cover

  • Why WebSocket is needed here and where fallback options remain relevant.
  • How messages are routed between servers and how the recipient's active session is found.
  • How ordering and delivery guarantees are enforced.
  • How offline sync and push notifications are handled.
  • How the long-lived connection layer scales.

Good follow-up topics

  • End-to-end encryption and Signal Protocol.
  • Read receipts and typing indicators.
  • Separating media storage from text storage, for example with S3 and a CDN.
  • Rate limits and spam prevention.
  • Synchronization across multiple devices.

Common interview mistakes

  • Forgetting that WebSocket connections are stateful and therefore do not scale “just like HTTP.”
  • Skipping the offline path, reconnect flow, and push notifications.
  • Ignoring message ordering once delivery becomes distributed across servers.
  • Not discussing how fan-out changes in large group chats or channel-like workloads.
