Design a Chat System Like WhatsApp: System Design Deep Dive
System design walkthrough for a real-time chat system — WebSocket connections, message delivery, group chat fan-out, offline handling, and end-to-end encryption.
Designing a chat system is one of the most common and most revealing system design interview questions. It touches on real-time communication, delivery guarantees, presence tracking, group messaging, and scaling persistent connections — all in one problem. It's the interview question that separates people who've thought about distributed systems from people who've just read about them.
Let's walk through it like a real interview conversation.
Requirements
Functional:
- 1-on-1 messaging
- Group chat (up to 500 members)
- Online/offline/last-seen status
- Message delivery status (sent, delivered, read)
- Push notifications for offline users
- Media messages (images, videos, files)
- Message history and search
Non-functional:
- Real-time delivery (sub-second latency for online users)
- Message ordering within a conversation
- At-least-once delivery (no lost messages)
- High availability (chat should work even during partial outages)
Back-of-envelope estimates:
- 500M daily active users, average 40 messages per user per day = 20B messages/day
- Messages per second: 20B / 86,400 ≈ 230K messages/sec
- Average message size: 200 bytes (text) → ~4 TB/day for text
- Media: assume 10% of messages have media, average 200KB → ~400 TB/day
- Concurrent WebSocket connections: ~100M at peak
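A quick sanity check of these numbers, using the constants stated above:

```python
# Back-of-envelope capacity math from the estimates above.
DAU = 500_000_000
MSGS_PER_USER = 40
SECONDS_PER_DAY = 86_400
TEXT_BYTES = 200          # average text message size
MEDIA_BYTES = 200_000     # average media attachment (~200 KB)

msgs_per_day = DAU * MSGS_PER_USER                 # 20B messages/day
msgs_per_sec = msgs_per_day / SECONDS_PER_DAY      # ~231K/sec
text_bytes_per_day = msgs_per_day * TEXT_BYTES     # 4 TB/day of text
media_msgs = msgs_per_day // 10                    # 10% of messages carry media
media_bytes_per_day = media_msgs * MEDIA_BYTES     # 400 TB/day of media

print(f"{msgs_per_sec:,.0f} msgs/sec")             # 231,481 msgs/sec
print(f"{text_bytes_per_day / 10**12:.0f} TB/day text, "
      f"{media_bytes_per_day / 10**12:.0f} TB/day media")
```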
Connection Protocol
Here's the thing about chat: HTTP was designed for request-response. Client asks, server answers. But chat needs the server to push messages to clients without being asked. You have three options:
Polling — client asks "any new messages?" every few seconds. Simple but wasteful: thousands of empty responses burning bandwidth and server resources.

Long polling — client sends a request, and the server holds it open until there's a new message (or a timeout). Better than polling, but it still has overhead from constantly re-establishing connections.

WebSocket — persistent, bidirectional TCP connection. Client and server can send messages at any time. This is the right answer for chat. After an initial HTTP handshake, the connection upgrades to WebSocket and stays open.

Client ←→ WebSocket Connection ←→ Chat Server
          (persistent, bidirectional)
Use WebSocket as the primary protocol, with long polling as a fallback for environments where WebSocket doesn't work (some corporate proxies block it).
High-Level Architecture
┌──────────┐ WebSocket ┌─────────────────┐
│ Client A ├────────────────────┤ Chat Server 1 │
└──────────┘ └────────┬────────┘
│
┌──────────┐ WebSocket ┌────────▼────────┐
│ Client B ├────────────────────┤ Chat Server 2 │
└──────────┘ └────────┬────────┘
│
┌────────▼────────┐
│ Message Service │
└──┬──────┬───┬───┘
│ │ │
┌────────▼┐ ┌──▼───▼────┐
│ Kafka │ │ Presence │
│ (queue) │ │ Service │
└────┬────┘ │ (Redis) │
│ └────────────┘
┌────▼────┐
│Database │
│(Cassandra)│
└─────────┘
Core components:
- Chat servers — maintain WebSocket connections with clients. Each server handles tens of thousands of concurrent connections.
- Message service — routes messages between users, handles group fan-out, ensures delivery.
- Presence service — tracks who's online and which chat server they're connected to. Backed by Redis for fast lookups.
- Message queue (Kafka) — decouples message production from storage and delivery. Guarantees messages aren't lost.
- Database (Cassandra) — stores message history. Optimized for the access pattern we need.
Message Flow: 1-on-1 Chat
Let's trace what happens when Alice sends "hey" to Bob:
- Alice's client sends the message over WebSocket to Chat Server 1
- Chat Server 1 forwards the message to the Message Service
- Message Service does three things: publishes the message to Kafka for persistence, looks up Bob in the Presence Service, and routes the message toward him
- Presence Service responds: "Bob is connected to Chat Server 2"
- Message Service forwards the message to Chat Server 2
- Chat Server 2 pushes the message to Bob over WebSocket
- Bob's client sends a "delivered" acknowledgment back
- Meanwhile, a Kafka consumer persists the message to Cassandra
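The routing logic above can be sketched with in-memory stand-ins for Kafka, Redis, and the chat servers — all names here are illustrative, not a real API:

```python
# Minimal sketch of the 1-on-1 send path. Plain dicts and lists stand in
# for Kafka, the Redis presence store, and per-server push queues.

kafka_log = []                        # stands in for the Kafka topic
presence = {"bob": "chat-server-2"}   # user_id -> chat server (Redis in reality)
server_queues = {"chat-server-1": [], "chat-server-2": []}

def send_message(sender, recipient, content):
    msg = {"from": sender, "to": recipient, "content": content}
    kafka_log.append(msg)              # persisted to Cassandra asynchronously
    server = presence.get(recipient)   # presence lookup
    if server is None:
        return "offline"               # fall back to push notification + sync
    server_queues[server].append(msg)  # route to the recipient's chat server
    return "routed"

print(send_message("alice", "bob", "hey"))     # routed
print(send_message("alice", "carol", "hey"))   # offline (no presence entry)
```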
Message Storage
The access pattern for chat messages is very specific: "Give me the messages for conversation X, ordered by time, paginated." This is a sequential read pattern.
Why Cassandra works well here:

Partition key: conversation_id
Clustering key: message_id (a TIMEUUID, so rows sort by time, descending)
CREATE TABLE messages (
conversation_id UUID,
message_id TIMEUUID,
sender_id UUID,
content TEXT,
media_url TEXT,
message_type TEXT, -- 'text', 'image', 'video'
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
All messages for a conversation live on the same partition, sorted by time. Fetching the latest 50 messages is a single sequential read — extremely fast.
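The read path against this schema is a single-partition query — for example, fetching the latest page of a conversation (CQL, matching the table definition above):

```sql
SELECT message_id, sender_id, content, media_url, created_at
FROM messages
WHERE conversation_id = ?
LIMIT 50;
```

Because the clustering order is already `DESC`, no sort is needed — the newest 50 rows come straight off the partition.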
Why not SQL? It would work at smaller scale. But at 230K messages/sec, the write throughput requirements push us toward a database designed for high-volume writes. Cassandra handles this natively with its log-structured storage engine.

Group Chat: The Fan-Out Decision
Group messaging introduces a key design decision: when Alice sends a message to a group of 200 people, how do you deliver it?
Fan-out-on-write: When a message arrives, create a copy for each group member. Each member has their own message queue/inbox.

Alice sends "hello" to Group (200 members)
→ Write 200 copies, one per member's inbox
→ Each member reads from their own inbox
- Pros: Reading is fast — each user just reads their own inbox. Simple client logic.
- Cons: 1 message becomes 200 writes. For large groups, this is expensive. If a group has 500 members and gets 100 messages/hour, that's 50,000 writes/hour just for one group.
Fan-out-on-read: When a message arrives, store one copy in the group's message store; members pull it when they read the chat.

Alice sends "hello" to Group
→ Write 1 copy to group's message store
→ Each member queries the group's messages when they open the chat
- Pros: 1 write regardless of group size. Storage efficient.
- Cons: Reading requires querying the group store every time. For active groups this is fine, but for a user in 50 groups, opening the app means 50 queries.

In practice, a hybrid wins: fan-out-on-write for small groups (the common case), fan-out-on-read for very large groups and broadcast channels where write amplification dominates.
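The write-amplification tradeoff is easy to quantify (illustrative numbers matching the example above):

```python
def writes_per_hour(members, msgs_per_hour, fan_out_on_write):
    """Storage writes generated by one group under each strategy."""
    if fan_out_on_write:
        return members * msgs_per_hour  # one copy per member's inbox
    return msgs_per_hour                # one copy in the group store

# 500-member group receiving 100 messages/hour:
print(writes_per_hour(500, 100, fan_out_on_write=True))   # 50000
print(writes_per_hour(500, 100, fan_out_on_write=False))  # 100
```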
Delivery Semantics: Sent, Delivered, Read
Three levels of message acknowledgment:
- Sent (single check ✓) — server received the message from the sender
- Delivered (double check ✓✓) — recipient's device received the message
- Read (double blue checks ✓✓) — recipient opened the conversation
Sent: Server → sender's client: "message_id stored"
Delivered: Recipient's client → server: "message_id received"
Server → sender's client: "message_id delivered"
Read: Recipient opens chat → server: "messages up to X read"
Server → sender's client: "message_id read"
For groups, track delivery and read status per member. This creates a lot of status updates, so batch them — don't send individual acks for every message in a group of 200.
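The receipt lifecycle is a small forward-only state machine — a sketch (names illustrative):

```python
# Delivery-status state machine: sent -> delivered -> read.
# Transitions only move forward; a late "delivered" ack arriving after
# "read" (acks can be reordered in transit) must not regress the status.
ORDER = {"sent": 0, "delivered": 1, "read": 2}

def advance(current, event):
    """Return the new status, never moving backward."""
    if ORDER[event] > ORDER[current]:
        return event
    return current

status = "sent"
status = advance(status, "delivered")  # -> delivered
status = advance(status, "read")       # -> read
status = advance(status, "delivered")  # out-of-order ack: stays read
print(status)                          # read
```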
Handling Offline Users
When a user is offline:
- Messages are stored in the database (they would be anyway)
- A push notification is sent via FCM/APNs with a preview
- When the user reconnects, the client sends its last received message ID
- Server returns all messages after that ID
Client reconnects:
→ "My last message_id is abc123"
← Server sends all messages after abc123, per conversation
This is a pull model on reconnect. The client drives synchronization, which is simpler and more reliable than the server trying to push a backlog.
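The reconnect sync can be sketched as a pure function — message ids here are monotonically increasing ints for illustration (in production a TIMEUUID plays this role):

```python
# Pull-based sync on reconnect: the client presents its last-seen message
# id per conversation; the server returns everything newer.

def sync(store, last_seen):
    """store: conversation_id -> time-ordered list of (msg_id, content).
    last_seen: conversation_id -> last msg_id the client already has."""
    backlog = {}
    for conv, messages in store.items():
        cursor = last_seen.get(conv, 0)
        missed = [m for m in messages if m[0] > cursor]
        if missed:
            backlog[conv] = missed
    return backlog

store = {"conv1": [(1, "hi"), (2, "hey"), (3, "you there?")]}
print(sync(store, {"conv1": 1}))   # {'conv1': [(2, 'hey'), (3, 'you there?')]}
```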
Media Messages
You never send actual media through the chat pipeline. That would choke everything.
1. Client uploads the image to S3/CDN → gets a URL
2. Client sends a message with type: "image", media_url: "https://cdn.../img.jpg"
3. Recipient receives the message, and the client downloads the image from the CDN
The chat message itself is just a small JSON payload with a URL. The heavy lifting of media transfer is offloaded to object storage and CDN infrastructure.
For thumbnails, generate them server-side on upload and include a thumbnail URL in the message. The recipient sees the thumbnail immediately while the full image loads.
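A hypothetical payload shape (field names illustrative, not WhatsApp's actual wire format) makes the point: the chat pipeline carries a few hundred bytes of JSON, while the CDN carries the actual bytes.

```python
import json

# Hypothetical media-message payload: URL references only, no media bytes.
message = {
    "type": "image",
    "media_url": "https://cdn.example.com/img/abc123.jpg",
    "thumbnail_url": "https://cdn.example.com/thumb/abc123.jpg",
    "width": 1920,
    "height": 1080,
    "size_bytes": 204_800,
}
payload = json.dumps(message)
print(len(payload), "bytes through the chat pipeline")
```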
End-to-End Encryption
Let's be honest — you probably won't need to implement E2EE in a 45-minute interview. But mentioning it shows awareness.
The concept (Signal Protocol, used by WhatsApp): each user has a public/private key pair. Messages are encrypted with the recipient's public key on the sender's device and decrypted with the recipient's private key on their device. The server never sees plaintext.
The implication for system design: the server can't index or search encrypted message content. Features like server-side search don't work with E2EE. This is a real tradeoff that WhatsApp accepts (and Telegram doesn't, since Telegram's regular chats aren't E2E encrypted).
Scaling Persistent Connections
With 100M concurrent WebSocket connections, you need a lot of chat servers. If each server handles 50K concurrent connections, you need 2,000 chat servers.
User-to-server mapping: Use consistent hashing to assign users to chat servers. The Presence Service (Redis) stores the mapping: user_id → chat_server_id.
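A minimal consistent-hash ring for the user→server assignment — a sketch; production rings add virtual nodes so load stays even when servers join or leave:

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash (unlike Python's salted hash()) so placement is repeatable.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring mapping user_id -> chat server."""
    def __init__(self, servers):
        self.ring = sorted((_hash(s), s) for s in servers)
        self.keys = [h for h, _ in self.ring]

    def server_for(self, user_id):
        # First server clockwise from the user's hash; wrap around at the end.
        i = bisect.bisect(self.keys, _hash(user_id)) % len(self.ring)
        return self.ring[i][1]

ring = Ring([f"chat-server-{n}" for n in range(4)])
print(ring.server_for("alice"))  # stable assignment until the ring changes
```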
Server failure: If a chat server goes down, all connected users reconnect. The load balancer routes them to a new server, and the Presence Service updates. Clients should have reconnection logic with exponential backoff.
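Reconnection backoff is typically exponential with a cap and jitter, so 50K clients from a dead server don't all hammer the load balancer at once — a sketch (parameters illustrative):

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: delay n is uniform in
    [0, min(cap, base * 2**n)] seconds."""
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(random.uniform(0, ceiling))
    return delays

# Ceilings grow 1, 2, 4, 8, ... seconds, clamped at 60.
print(backoff_delays(6))
```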
Cross-server message delivery: When Alice (on Server 1) messages Bob (on Server 2), the Message Service looks up Bob's server in the Presence Service and routes accordingly. Chat servers never talk to each other directly — all cross-server traffic flows through the message routing layer.
What Makes This Answer Strong
A strong answer to "design a chat system" covers:
- The WebSocket decision (and why, not just what)
- Message flow traced through the system
- Storage model (Cassandra partition design for sequential reads)
- Group chat fan-out tradeoff (and the hybrid approach)
- Offline message handling
- Media as URL references, not inline content
- Delivery receipts as a status machine
Practice building out these distributed system designs interactively at CodeUp — working through the message flow step by step builds the intuition that reading alone can't.