Design a Rate Limiter: System Design Interview Problem
Complete system design walkthrough for a rate limiter — token bucket, sliding window, leaky bucket algorithms, distributed rate limiting with Redis, and HTTP headers.
Rate limiting is one of those interview questions that seems simple but has surprising depth. The basic concept takes 30 seconds to explain — don't let clients make too many requests. But the implementation choices, distributed coordination, and algorithm tradeoffs fill a full 45-minute interview. Plus, unlike "design Twitter," a rate limiter is a component you might actually build yourself rather than stitching together managed services.
Why Rate Limiting Exists
Before jumping into algorithms, make sure you can explain why rate limiting matters:
- Prevent abuse — stop DDoS attacks, brute force login attempts, API scraping
- Ensure fair usage — one heavy user shouldn't degrade service for everyone
- Protect downstream services — your API might handle 10K QPS but the database behind it can only do 2K
- Cost control — especially for paid APIs, prevent accidental (or intentional) over-usage
- Compliance — some services contractually guarantee rate limits to third parties
The Five Algorithms
There are five rate limiting algorithms worth knowing. Each has different characteristics, and interviewers love asking you to compare them.
1. Token Bucket
The most commonly used algorithm. Imagine a bucket that holds tokens. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.
Parameters:
- bucket_size — maximum number of tokens (burst capacity)
- refill_rate — tokens added per second
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Allow 10 requests/second with burst up to 20
limiter = TokenBucket(capacity=20, refill_rate=10)
```
Pros: Simple, memory-efficient (just two numbers per user), allows bursts up to bucket size.
Cons: Tuning bucket_size and refill_rate requires thought.
Used by: Amazon, Stripe, most API gateways.
2. Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate, like water leaking from a bucket. If the queue is full, new requests are dropped.
```
Incoming requests → [Queue (fixed size)] → Process at constant rate
                         ↓ (overflow dropped)
```
Pros: Smooths out traffic perfectly — output rate is constant regardless of input bursts. Good for systems that need a steady processing rate.
Cons: Doesn't handle legitimate bursts well. A burst of requests that should all succeed will be queued and delayed. Older requests might be stuck in the queue while newer, more relevant ones are dropped.
Used by: NGINX's limit_req with burst parameter.
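A minimal single-process sketch of the queue-based variant described above — the class and parameter names (`queue_size`, `leak_rate`) are illustrative, not a standard API:

```python
import time
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: requests wait in a fixed-size queue
    that drains at a constant rate; overflow is dropped."""

    def __init__(self, queue_size, leak_rate):
        self.queue_size = queue_size
        self.leak_rate = leak_rate      # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        # Drain the requests that would have been processed since last check
        now = time.time()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def allow_request(self):
        self._leak()
        if len(self.queue) < self.queue_size:
            self.queue.append(time.time())
            return True
        return False    # bucket full: request dropped
```

Note this sketch only decides admission; a real leaky bucket also needs a worker that actually processes the queued requests at the leak rate.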
3. Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. If the count exceeds the limit, reject.
```
Window: 10:00:00 - 10:00:59 → count: 87  / limit: 100 → ALLOW
Window: 10:01:00 - 10:01:59 → count: 100 / limit: 100 → REJECT new
```
Pros: Simple, memory-efficient (one counter per window per user).
Cons: Boundary burst problem. If the limit is 100/minute and a user sends 100 requests at 10:00:50 and 100 more at 10:01:10, they've sent 200 requests in 20 seconds but each window only sees 100. This edge case is the reason the next two algorithms exist.
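A minimal sketch (names are illustrative; the `now` parameter exists so the boundary behavior is easy to demonstrate without waiting for real time to pass):

```python
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow_request(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)   # which fixed window we're in
        if window != self.current_window:  # new window: reset the counter
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Feeding it 100 requests at t=50s and 100 more at t=70s (limit 100 per 60-second window) reproduces the boundary burst: all 200 succeed, because the two bursts land in different fixed windows.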
4. Sliding Window Log
Keep a log of timestamps for each request. When a new request arrives, remove all timestamps older than the window, then check if the count exceeds the limit.
```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self):
        now = time.time()
        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```
Pros: Precise — no boundary burst problem. Exactly limit requests allowed in any rolling window.
Cons: Memory-heavy. Storing a timestamp for every request means a user making 1000 requests/minute requires storing 1000 timestamps. At scale with millions of users, this adds up.
5. Sliding Window Counter
The practical compromise. Combine the current window count with a weighted portion of the previous window count.
```
Previous window (10:00 - 10:01): 84 requests
Current window  (10:01 - 10:02): 36 requests
Current time: 10:01:15 (25% into current window)

Weighted count = 84 * 0.75 (75% of prev window remaining) + 36 = 99
Limit: 100 → ALLOW
```
Pros: Low memory (two counters per user), smooth — no boundary burst problem (approximately), good enough for most applications.
Cons: Approximate — not perfectly precise in edge cases. But in practice, the approximation is close enough.
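The weighting scheme above can be sketched in code like this (single-process, illustrative names; the `now` parameter is there only to make the rolling behavior testable):

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.current_count = 0
        self.previous_count = 0

    def allow_request(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)
        if self.current_window is None:
            self.current_window = window
        if window > self.current_window:
            # Roll over: the old current window becomes "previous";
            # if a whole window was skipped, the old counts are stale
            self.previous_count = (
                self.current_count if window == self.current_window + 1 else 0
            )
            self.current_count = 0
            self.current_window = window
        # Fraction of the previous window still inside the rolling window
        prev_weight = 1.0 - (now % self.window) / self.window
        weighted = self.previous_count * prev_weight + self.current_count
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```

With 84 requests in the previous minute and the clock 25% into the current one, the weighted count starts at 84 × 0.75 = 63, matching the worked example above.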
Algorithm Comparison
| Algorithm | Memory | Precision | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | Low (2 values) | Good | Allows controlled bursts | Low |
| Leaky Bucket | Low (queue) | Good | Smooths all bursts | Low |
| Fixed Window | Low (1 counter) | Boundary issue | Poor at edges | Very Low |
| Sliding Window Log | High (all timestamps) | Exact | Perfect | Medium |
| Sliding Window Counter | Low (2 counters) | Approximate | Good | Low |
Where to Place the Rate Limiter
Three options, each with tradeoffs:
- API Gateway level — rate limit before requests reach your services. Best for global rate limiting and simple rules; gateways like Kong, AWS API Gateway, and NGINX have rate limiting built in. Downside: limited context about the request (you can't rate limit by user plan without decoding the auth token first).
- Middleware — a rate limiting middleware in your application. More context is available (user ID, plan, endpoint), and you can apply different limits per endpoint. Slightly more latency than gateway-level.
- Application level — rate limiting logic embedded in your service code. Maximum flexibility, but tightly coupled and hard to maintain consistently across services.

The practical answer: use the API gateway for coarse-grained limits (IP-based, global QPS caps) and middleware for fine-grained limits (per-user, per-endpoint, plan-based).

Distributed Rate Limiting
Here's where it gets interesting. On a single server, rate limiting is trivial — just an in-memory data structure. But with multiple servers behind a load balancer, you need shared state.
The problem:

```
User makes 100 requests
  → Load balancer distributes across 5 servers
  → Each server sees ~20 requests
  → Each server's local rate limiter thinks everything is fine
  → User actually exceeded the 50/minute limit
```
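A toy simulation of this failure mode, using the same numbers as the example above (5 servers, a 50/minute global limit, naive per-server counters):

```python
# Global limit: 50 requests/minute per user.
# Five servers each keep an independent local counter for that user.
LIMIT = 50
servers = [0] * 5

allowed = 0
for i in range(100):          # the user sends 100 requests in one window
    s = i % len(servers)      # load balancer round-robins across servers
    if servers[s] < LIMIT:    # each server checks only its own count
        servers[s] += 1
        allowed += 1

# Each server saw only 20 requests, so all 100 were allowed:
# double the intended global limit.
print(allowed)  # 100
```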
The solution: Redis.
Redis is the standard choice for distributed rate limiting because:
- Sub-millisecond latency
- Atomic operations (INCR, EXPIRE)
- Shared state across all app servers
Token bucket in Redis (Lua script for atomicity):
```lua
-- KEYS[1]: rate limit key (e.g., "ratelimit:user:123")
-- ARGV[1]: bucket capacity
-- ARGV[2]: refill rate (tokens/sec)
-- ARGV[3]: current timestamp

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 60)
    return 1 -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 60)
    return 0 -- rejected
end
```
Why Lua? Redis executes Lua scripts atomically — no other command can interleave between the read and write. Without atomicity, two concurrent requests could both read 1 remaining token and both succeed. The Lua script prevents this race condition.
What about Redis failure? If Redis goes down, you have a choice: fail open (allow all requests — risk abuse) or fail closed (reject all requests — risk blocking legitimate users). Most systems fail open because a brief period without rate limiting is less damaging than blocking all traffic.
Rate Limit Headers
Good rate limiters communicate their state to clients through HTTP headers:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 57
X-RateLimit-Reset: 1711540800
```

```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711540800
Retry-After: 30
```
- X-RateLimit-Limit — total requests allowed in the window
- X-RateLimit-Remaining — requests remaining in current window
- X-RateLimit-Reset — Unix timestamp when the limit resets
- Retry-After — seconds to wait before retrying (on 429 responses)
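A small helper for assembling these headers might look like the sketch below. The function and its parameters are hypothetical; note too that the X-RateLimit-* names are de facto conventions that vary across providers, not a formal standard:

```python
def rate_limit_headers(limit, remaining, reset_at, retry_after=None):
    """Build conventional rate limit response headers.

    limit       -- total requests allowed in the window
    remaining   -- requests left in the current window
    reset_at    -- Unix timestamp when the limit resets
    retry_after -- seconds to wait; set only on 429 responses
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),  # never negative
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if retry_after is not None:
        headers["Retry-After"] = str(int(retry_after))
    return headers
```

The 200 and 429 examples above correspond to `rate_limit_headers(100, 57, 1711540800)` and `rate_limit_headers(100, 0, 1711540800, retry_after=30)` respectively.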
Client Identification
How do you identify who to rate limit?
- API key — most common for public APIs. Each key has its own limits.
- User ID — for authenticated users. Different plans get different limits.
- IP address — for unauthenticated traffic. Problematic: NAT means thousands of users behind one IP. Proxies and VPNs complicate things.
- Combination — IP + endpoint for unauthenticated, user ID + endpoint for authenticated.
Limits are typically applied at several scopes at once:

```
Global:        Total API: 10,000 req/min        (protect the system)
Per-user:      Each user: 100 req/min           (fairness)
Per-endpoint:  POST /upload: 10 req/min/user    (protect expensive operations)
```

Layer these — a request must pass all applicable limits.
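Layering can be sketched as below. Everything here is illustrative: the "limiter" is a bare counter with no time window, just enough to show the pass-all-layers check. One real-world design point this surfaces is that when an inner limit rejects, the outer limits have already consumed a slot, so production implementations must decide whether to refund those counters or check all limits before consuming any:

```python
def make_counter_limit(limit):
    """Simplest possible limiter: a plain counter, no time window."""
    state = {"count": 0}
    def allow():
        if state["count"] < limit:
            state["count"] += 1
            return True
        return False
    return allow

def allow_request(limits):
    """A request passes only if every applicable limit allows it."""
    for name, allow in limits:
        if not allow():
            return False, name    # report which layer rejected
    return True, None

limits = [
    ("global", make_counter_limit(10_000)),
    ("per-user", make_counter_limit(100)),
    ("per-endpoint", make_counter_limit(10)),
]
```

With these numbers, the eleventh request to the endpoint is rejected by the per-endpoint layer even though the global and per-user limits still have headroom.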
Putting It All Together
```
┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Client  ├────►│ API Gateway  ├────►│ Rate Limiter ├────►│ Application  │
│          │◄────┤ (IP limits)  │◄────┤ (User limits)│◄────┤   Service    │
└──────────┘     └──────────────┘     └──────┬───────┘     └──────────────┘
                                             │
                                      ┌──────▼───────┐
                                      │    Redis     │
                                      │   (shared    │
                                      │   counters)  │
                                      └──────────────┘
```
- Request arrives at API gateway — IP-based rate check
- If allowed, request reaches rate limiter middleware — user/endpoint-based check against Redis
- If allowed, request proceeds to application
- Response includes rate limit headers
- If rejected at any stage, return 429 with Retry-After
Common Follow-Up Questions
"How do you handle rate limiting across multiple data centers?" Each data center has its own Redis cluster. Options: local rate limiting (each DC has independent limits — simple, but a user spreading requests across N data centers could get up to N times the global limit) or synchronized (replicate counters across DCs — adds latency). Most systems use local limiting with a slightly lower per-DC limit.

"What about rate limiting WebSocket connections?" Rate limit messages per connection per time window. Also limit the total number of concurrent connections per user.

"How do you avoid rate limiting your own internal services?" Whitelist internal service IPs, or use separate authentication (service tokens) with higher or no limits.

The rate limiter question is a favorite because it's practical, has clear algorithmic depth, and naturally leads to distributed systems discussions. Nail the algorithm comparison, show the Redis implementation, and discuss the client experience — that covers everything the interviewer is looking for.
Practice system design problems with structured feedback at CodeUp — the rate limiter is a great starting point because it's self-contained enough to actually build, not just whiteboard.