Rate Limiting and Caching: The Two Things That Keep APIs Alive
By Alex Peng (@aJinTonic)
At some point in your career you'll ship an API endpoint that works perfectly, until it doesn't. Either traffic spikes and the service falls over, or a downstream dependency goes down and your API drags everything with it. Rate limiting and caching are the two tools that address both of these problems. They're often treated as infrastructure concerns, but understanding them deeply makes you a better API designer.
Rate Limiting
Rate limiting is a policy that restricts how many requests a client can make in a given time window. The goal is to protect your backend from being overwhelmed, whether from a misbehaving client, a buggy retry loop, or a genuine traffic spike.
Token Bucket
The most common algorithm is the token bucket. The idea:
- A bucket holds up to N tokens
- Tokens are added at a steady rate (e.g., 10 tokens/second)
- Each request consumes 1 token
- If the bucket is empty, the request is rejected
class TokenBucket {
  private tokens: number
  private lastRefill: number

  constructor(
    private capacity: number,
    private refillRate: number // tokens per ms
  ) {
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  consume(): boolean {
    this.refill()
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }

  private refill() {
    const now = Date.now()
    const elapsed = now - this.lastRefill
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate)
    this.lastRefill = now
  }
}
The key property of the token bucket is that it allows bursting. A client that hasn't made requests recently has accumulated tokens, so it can make a burst of requests. This is usually the right behavior: you want to handle legitimate bursts, not artificially smooth everything to a flat rate.
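The burst behavior is easy to see by driving a bucket by hand. This is a self-contained variant of the class above with time injected manually (so the example is deterministic), not production code:

```typescript
// Minimal token bucket with a manually driven clock, so refill is deterministic.
class ManualTokenBucket {
  private tokens: number

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity // a fresh bucket starts full
  }

  consume(): boolean {
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }

  // Simulate elapsed time instead of reading Date.now().
  refill(elapsedMs: number) {
    this.tokens = Math.min(this.capacity, this.tokens + elapsedMs * this.refillRate)
  }
}

// Capacity 5, refilling at 0.01 tokens/ms (10 tokens/second).
const bucket = new ManualTokenBucket(5, 0.01)

// A burst of 6 requests: the first 5 drain the full bucket, the 6th is rejected.
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.consume())

// After 100 ms of idle time, exactly one token has refilled.
bucket.refill(100)
```

A fixed-rate limiter would have rejected most of that burst; the bucket absorbs it because the client had been idle.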
Sliding Window
The token bucket above keeps its state in process memory, per client. If you're running multiple backend instances, rate-limit state needs to be centralized, typically in Redis. A common distributed approach is a sliding window log:
async function isAllowed(clientId: string, limit: number, windowMs: number): Promise<boolean> {
  const key = `rate_limit:${clientId}`
  const now = Date.now()
  const windowStart = now - windowMs

  const pipeline = redis.pipeline()
  pipeline.zremrangebyscore(key, 0, windowStart) // remove entries older than the window
  pipeline.zadd(key, now, `${now}-${Math.random()}`) // add current request
  pipeline.zcard(key) // count requests in window
  pipeline.expire(key, Math.ceil(windowMs / 1000))

  const results = await pipeline.exec()
  const count = results[2][1] as number // third command (zcard), [err, value] tuple
  return count <= limit
}
This is a sliding window implementation using a Redis sorted set. Each request is stored with its timestamp as the score. To check the rate, you count entries in the current window. This is accurate and handles distributed systems correctly.
What to Return
When you reject a rate-limited request, be explicit:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1700000030
429 Too Many Requests with a Retry-After header tells well-behaved clients exactly when to retry. This is the difference between a client that hammers you and one that backs off gracefully.
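On the client side, honoring that header takes a few lines. A sketch of the delay calculation (`retryDelayMs` is a hypothetical helper; note that the HTTP spec allows Retry-After to be either a number of seconds or an HTTP date, and this handles both):

```typescript
// Compute how long a client should wait before retrying a 429 response.
// Retry-After may be delta-seconds ("30") or an HTTP date; fall back to a
// default delay when the header is missing or unparseable.
function retryDelayMs(retryAfter: string | null, fallbackMs = 1000): number {
  if (retryAfter === null) return fallbackMs

  const seconds = Number(retryAfter)
  if (!Number.isNaN(seconds)) return seconds * 1000 // delta-seconds form

  const date = Date.parse(retryAfter)
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now()) // HTTP-date form

  return fallbackMs
}
```

A well-behaved client would sleep for this long (often combined with jitter and a retry cap) before reissuing the request.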
Caching
Caching stores the result of expensive operations so subsequent requests can skip the work. The challenge isn't implementing a cache; it's knowing what to cache, for how long, and how to handle staleness.
Cache-Aside (Lazy Loading)
The most common pattern. The application manages the cache explicitly:
async function getUser(id: number): Promise<User> {
  const cacheKey = `user:${id}`

  // 1. Check cache
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  // 2. Cache miss: fetch from DB
  const user = await db.query('SELECT * FROM users WHERE id = $1', [id])

  // 3. Populate cache
  await redis.setex(cacheKey, 300, JSON.stringify(user)) // TTL: 5 minutes

  return user
}
This is called cache-aside because the cache sits beside the database: the application decides when to populate it. The downside is that the first request after a cache miss always hits the database. If many requests arrive simultaneously for the same uncached key, you get a thundering herd: all of them hit the database at once.
A simple mitigation is to use a lock:
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function getUser(id: number): Promise<User> {
  const cacheKey = `user:${id}`
  const lockKey = `lock:user:${id}`

  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  // Only let one request fetch from DB; the lock expires after 5s
  // in case the holder crashes before releasing it
  const acquired = await redis.set(lockKey, '1', 'NX', 'EX', 5)
  if (!acquired) {
    // Wait briefly and retry from cache
    await sleep(50)
    return getUser(id)
  }

  try {
    const user = await db.query('SELECT * FROM users WHERE id = $1', [id])
    await redis.setex(cacheKey, 300, JSON.stringify(user))
    return user
  } finally {
    await redis.del(lockKey) // release the lock even if the query throws
  }
}
Cache Invalidation
The hardest part of caching. You have two options:
TTL-based expiry: the cache entry expires after a fixed time. Simple, but you trade accuracy for that simplicity: data might be stale for up to TTL seconds.
Explicit invalidation: when data changes, actively delete or update the cache entry:
async function updateUser(id: number, data: Partial<User>): Promise<User> {
  const user = await db.query(
    'UPDATE users SET name = $1 WHERE id = $2 RETURNING *',
    [data.name, id]
  )
  await redis.del(`user:${id}`) // invalidate cache
  return user
}
This is more accurate but requires you to know every cache key that might be affected by a write. For simple entities this is fine; for complex data with many derived views, it gets complicated fast.
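One way to keep write paths manageable is to define every cache key for an entity in one place, so the read path and every invalidating write path can't drift apart. A sketch of that idea (`userKeys` and `userCacheKeys` are hypothetical names, not from the code above):

```typescript
// Hypothetical registry: one object defines every cache key derived from a
// user row, so writes can invalidate all of them without hunting through code.
const userKeys = {
  profile: (id: number) => `user:${id}`,
  postList: (id: number) => `user:${id}:posts`, // a derived view of the same row
}

// All keys that a write to user `id` should invalidate.
function userCacheKeys(id: number): string[] {
  return Object.values(userKeys).map((keyFn) => keyFn(id))
}
```

An update handler would then call something like `redis.del(...userCacheKeys(id))` instead of remembering each key string individually.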
Choosing a TTL
TTL is a tradeoff between freshness and cache hit rate:
- Short TTL (seconds): nearly always fresh, but a low hit rate; you're paying for the cache infrastructure without much benefit
- Long TTL (hours/days): high hit rate, but stale data is more likely
A useful heuristic: set TTL to the longest staleness you can tolerate. For user profiles, maybe 5 minutes. For product catalog data, maybe an hour. For static reference data (country lists, currency codes), days.
Combining Both
Rate limiting and caching work together. A common pattern in API design:
- The rate limiter sits at the edge (API gateway, middleware): it's fast and rejects bad actors before they reach any business logic
- The cache sits close to the data layer: expensive queries hit the database only once per TTL
- Between them, your application logic handles only the requests that actually need to do real work
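The layering can be sketched end to end with in-memory stand-ins. The fixed counter and Map below are illustrative placeholders for the token bucket and Redis cache described above, just to show the ordering of the checks:

```typescript
type Response = { status: number; body?: string }
type Handler = (clientId: string, key: string) => Promise<Response>

// In-memory stand-ins for the edge limiter and the data-layer cache.
const requestCounts = new Map<string, number>()
const cache = new Map<string, string>()

function withRateLimitAndCache(
  limit: number,
  fetcher: (key: string) => Promise<string> // the "real work": DB query, etc.
): Handler {
  return async (clientId, key) => {
    // 1. Edge: reject over-limit clients before doing any real work.
    const count = (requestCounts.get(clientId) ?? 0) + 1
    requestCounts.set(clientId, count)
    if (count > limit) return { status: 429 }

    // 2. Data layer: serve from cache when possible.
    const cached = cache.get(key)
    if (cached !== undefined) return { status: 200, body: cached }

    // 3. Only allowed, uncached requests reach the expensive path.
    const body = await fetcher(key)
    cache.set(key, body)
    return { status: 200, body }
  }
}
```

The ordering matters: limiting first means a flood of requests can't even generate cache traffic, and caching second means the requests that do get through rarely reach the database.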
Neither is a substitute for the other. Rate limiting doesn't help if a single allowed request triggers a 10-second database query. Caching doesn't help if one client can make unlimited requests and exhaust your cache infrastructure.
Getting these right before you need them is a lot easier than debugging a production incident at 2am.