March 27, 2026 · 10 min read

Design a Rate Limiter: System Design Interview Problem

Complete system design walkthrough for a rate limiter — token bucket, sliding window, leaky bucket algorithms, distributed rate limiting with Redis, and HTTP headers.

system-design rate-limiter interviews backend algorithms

Rate limiting is one of those interview questions that seems simple but has surprising depth. The basic concept takes 30 seconds to explain — don't let clients make too many requests. But the implementation choices, distributed coordination, and algorithm tradeoffs fill a full 45-minute interview. Plus, unlike "design Twitter," a rate limiter is a component you might actually build yourself rather than stitching together managed services.

Why Rate Limiting Exists

Before jumping into algorithms, make sure you can explain why rate limiting matters:

  • Prevent abuse — stop DDoS attacks, brute force login attempts, API scraping
  • Ensure fair usage — one heavy user shouldn't degrade service for everyone
  • Protect downstream services — your API might handle 10K QPS but the database behind it can only do 2K
  • Cost control — especially for paid APIs, prevent accidental (or intentional) over-usage
  • Compliance — some services contractually guarantee rate limits to third parties

The Five Algorithms

There are five rate limiting algorithms worth knowing. Each has different characteristics, and interviewers love asking you to compare them.

1. Token Bucket

The most commonly used algorithm. Imagine a bucket that holds tokens. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.

Parameters:
  • capacity — maximum number of tokens (burst capacity)
  • refill_rate — tokens added per second

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Allow 10 requests/second with burst up to 20
limiter = TokenBucket(capacity=20, refill_rate=10)
Pros: Simple, memory-efficient (just two numbers per user), allows bursts up to the bucket capacity.
Cons: Tuning capacity and refill_rate requires thought.
Used by: Amazon, Stripe, most API gateways.

2. Leaky Bucket

Requests enter a queue (bucket) and are processed at a fixed rate, like water leaking from a bucket. If the queue is full, new requests are dropped.

Incoming requests → [Queue (fixed size)] → Process at constant rate
                    ↓ (overflow dropped)
Pros: Smooths out traffic perfectly — the output rate is constant regardless of input bursts. Good for systems that need a steady processing rate.
Cons: Doesn't handle legitimate bursts well. A burst of requests that should all succeed will be queued and delayed, and older requests can sit in the queue while newer, more relevant ones are dropped.
Used by: NGINX's limit_req with the burst parameter.
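To make the leaky bucket concrete, here's a minimal single-process sketch of the queue-based variant described above. The class and parameter names are my own, not from any particular library: requests join a fixed-size queue, the queue drains at a constant rate, and overflow is dropped.

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, queue_size, leak_rate):
        self.queue_size = queue_size  # max requests waiting in the bucket
        self.leak_rate = leak_rate    # requests processed per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow_request(self):
        now = time.time()
        # Drain the requests that would have been processed since the last check
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) < self.queue_size:
            self.queue.append(now)
            return True
        return False  # bucket full: drop the request
```

A production version would also need to actually dispatch the queued requests at the leak rate; this sketch only models the admit/drop decision.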

3. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. If the count exceeds the limit, reject.

Window: 10:00:00 - 10:00:59 → count: 87 / limit: 100 → ALLOW
Window: 10:01:00 - 10:01:59 → count: 100 / limit: 100 → REJECT new
Pros: Simple, memory-efficient (one counter per window per user).
Cons: The boundary burst problem. If the limit is 100/minute and a user sends 100 requests at 10:00:50 and 100 more at 10:01:10, they've sent 200 requests in 20 seconds, but each window only sees 100. This edge case is the reason the next two algorithms exist.
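A minimal single-process sketch of the fixed window counter (class and parameter names are mine): the window is identified by flooring the timestamp, and the counter resets whenever a new window begins.

```python
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow_request(self):
        # Identify the window by integer-dividing the timestamp
        window = int(time.time() // self.window)
        if window != self.current_window:
            self.current_window = window  # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```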

4. Sliding Window Log

Keep a log of timestamps for each request. When a new request arrives, remove all timestamps older than the window, then check if the count exceeds the limit.

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self):
        now = time.time()
        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

Pros: Precise — no boundary burst problem. Exactly limit requests are allowed in any rolling window.
Cons: Memory-heavy. Storing a timestamp for every request means a user making 1000 requests/minute requires 1000 stored timestamps. At scale, with millions of users, this adds up.

5. Sliding Window Counter

The practical compromise. Combine the current window count with a weighted portion of the previous window count.

Previous window (10:00 - 10:01): 84 requests
Current window  (10:01 - 10:02): 36 requests
Current time: 10:01:15 (25% into current window)

Weighted count = 84 * 0.75 (75% of prev window remaining) + 36 = 99
Limit: 100 → ALLOW

Pros: Low memory (two counters per user), no boundary burst problem (approximately), good enough for most applications.
Cons: Approximate — not perfectly precise in edge cases. In practice, the approximation is close enough.
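The weighted-count arithmetic above can be sketched as a single-process class (names are mine, not from any library). The previous window's count is weighted by the fraction of it that still falls inside the sliding window:

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window = int(time.time() // window_seconds)

    def allow_request(self):
        now = time.time()
        window = int(now // self.window)
        if window != self.curr_window:
            # Shift windows; if more than one full window passed, the previous count is 0
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        # Fraction of the previous window still inside the sliding window
        prev_weight = 1.0 - (now % self.window) / self.window
        weighted = self.prev_count * prev_weight + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False
```

With prev_count = 84, curr_count = 36, and the clock 25% into the current window, this reproduces the 84 * 0.75 + 36 = 99 calculation from the example above.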

Algorithm Comparison

Algorithm              | Memory                | Precision      | Burst Handling           | Complexity
-----------------------|-----------------------|----------------|--------------------------|-----------
Token Bucket           | Low (2 values)        | Good           | Allows controlled bursts | Low
Leaky Bucket           | Low (queue)           | Good           | Smooths all bursts       | Low
Fixed Window           | Low (1 counter)       | Boundary issue | Poor at edges            | Very Low
Sliding Window Log     | High (all timestamps) | Exact          | Perfect                  | Medium
Sliding Window Counter | Low (2 counters)      | Approximate    | Good                     | Low

Which to recommend in an interview? Token bucket for most use cases. Sliding window counter if you need rate-per-window semantics without the memory cost of the log approach. Mention the tradeoffs and let the interviewer steer.

Where to Place the Rate Limiter

Three options, each with tradeoffs:

API Gateway level — rate limit before requests reach your services. Best for global rate limiting and simple rules. API gateways like Kong, AWS API Gateway, and NGINX have built-in rate limiting. Downside: limited context about the request (you can't rate limit based on user plan without decoding the auth token first).

Middleware — a rate limiting middleware in your application. More context is available (user ID, plan, endpoint), and you can apply different limits per endpoint. Slightly more latency than gateway-level limiting.

Application level — rate limiting logic embedded in your service code. Maximum flexibility, but tightly coupled and hard to maintain consistently across services.

The practical answer: use the API gateway for coarse-grained limits (IP-based, global QPS caps) and middleware for fine-grained limits (per-user, per-endpoint, plan-based).

Distributed Rate Limiting

Here's where it gets interesting. On a single server, rate limiting is trivial — just an in-memory data structure. But with multiple servers behind a load balancer, you need shared state.

The problem:
User makes 100 requests
→ Load balancer distributes across 5 servers
→ Each server sees ~20 requests
→ Each server's local rate limiter thinks everything is fine
→ User actually exceeded the 50/minute limit

The solution: Redis.

Redis is the standard choice for distributed rate limiting because:


  • Sub-millisecond latency
  • Atomic operations (INCR, EXPIRE)
  • Shared state across all app servers

Token bucket in Redis (Lua script for atomicity):

-- KEYS[1]: rate limit key (e.g., "ratelimit:user:123")
-- ARGV[1]: bucket capacity
-- ARGV[2]: refill rate (tokens/sec)
-- ARGV[3]: current timestamp

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 60)
  return 1 -- allowed
else
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 60)
  return 0 -- rejected
end

Why Lua? Redis executes Lua scripts atomically — no other command can interleave between the read and the write. Without atomicity, two concurrent requests could both read 1 remaining token and both succeed. The Lua script prevents this race condition.

What about Redis failure? If Redis goes down, you have a choice: fail open (allow all requests and risk abuse) or fail closed (reject all requests and risk blocking legitimate users). Most systems fail open, because a brief period without rate limiting is less damaging than blocking all traffic.

Rate Limit Headers

Good rate limiters communicate their state to clients through HTTP headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 57
X-RateLimit-Reset: 1711540800

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711540800
Retry-After: 30

  • X-RateLimit-Limit — total requests allowed in the window
  • X-RateLimit-Remaining — requests remaining in current window
  • X-RateLimit-Reset — Unix timestamp when the limit resets
  • Retry-After — seconds to wait before retrying (on 429 responses)
This is a small detail, but mentioning it in an interview shows you think about the client experience, not just the server implementation.
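As an illustration, a tiny hypothetical helper that assembles these headers for a response (the function and its signature are mine, not a standard API):

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build rate limit headers for an HTTP response dict."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_ts)),  # Unix timestamp of reset
    }
    if retry_after is not None:  # only set on 429 responses
        headers["Retry-After"] = str(int(retry_after))
    return headers
```

Middleware would call this after each limiter check, passing the limiter's current state.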

Client Identification

How do you identify who to rate limit?

  • API key — most common for public APIs. Each key has its own limits.
  • User ID — for authenticated users. Different plans get different limits.
  • IP address — for unauthenticated traffic. Problematic: NAT means thousands of users behind one IP. Proxies and VPNs complicate things.
  • Combination — IP + endpoint for unauthenticated, user ID + endpoint for authenticated.

Per-endpoint vs per-user vs global:

Global:       Total API: 10,000 req/min (protect the system)
Per-user:     Each user: 100 req/min (fairness)
Per-endpoint: POST /upload: 10 req/min per user (protect expensive operations)

Layer these — a request must pass all applicable limits.
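One way to sketch the layering (a simplified illustration; names are mine). Note a real subtlety: a naive sequential check consumes capacity on earlier layers even when a later layer rejects, so production systems often check all layers before committing.

```python
class LayeredLimiter:
    """Run a request through global, per-user, and per-endpoint limiters.

    Each layer is any object with an allow_request() method, e.g. the
    TokenBucket class shown earlier. The request passes only if every
    layer allows it.
    """
    def __init__(self, layers):
        self.layers = layers

    def allow_request(self):
        # all() short-circuits: later layers aren't charged after a rejection,
        # but earlier layers are charged even if a later layer rejects
        return all(layer.allow_request() for layer in self.layers)
```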

Putting It All Together

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Client  ├────►│ API Gateway  ├────►│ Rate Limiter ├────►│ Application  │
│          │◄────┤ (IP limits)  │◄────┤ (User limits)│◄────┤ Service      │
└──────────┘     └──────────────┘     └──────┬───────┘     └──────────────┘
                                             │
                                      ┌──────▼───────┐
                                      │    Redis     │
                                      │ (shared      │
                                      │  counters)   │
                                      └──────────────┘
  1. Request arrives at API gateway — IP-based rate check
  2. If allowed, request reaches rate limiter middleware — user/endpoint-based check against Redis
  3. If allowed, request proceeds to application
  4. Response includes rate limit headers
  5. If rejected at any stage, return 429 with Retry-After

Common Follow-Up Questions

"How do you handle rate limiting across multiple data centers?" Each data center has its own Redis cluster. Options: local rate limiting (each DC has independent limits — simple but a user could get 2x the global limit) or synchronized (replicate counters across DCs — adds latency). Most systems use local limiting with a slightly lower per-DC limit. "What about rate limiting WebSocket connections?" Rate limit messages per connection per time window. Also limit the total number of concurrent connections per user. "How do you avoid rate limiting your own internal services?" Whitelist internal service IPs or use separate authentication (service tokens) with higher or no limits.

The rate limiter question is a favorite because it's practical, has clear algorithmic depth, and naturally leads to distributed systems discussions. Nail the algorithm comparison, show the Redis implementation, and discuss the client experience — that covers everything the interviewer is looking for.

Practice system design problems with structured feedback at CodeUp — the rate limiter is a great starting point because it's self-contained enough to actually build, not just whiteboard.
