Rate Limiting and Caching: The Two Things That Keep APIs Alive
By Alex Peng (@aJinTonic)
At some point in your career you'll ship an API endpoint that works perfectly, until it doesn't. Either traffic spikes and the service falls over, or a downstream dependency goes down and your API drags everything with it. Rate limiting and caching are the two tools that address both of these problems. They're often treated as infrastructure concerns, but understanding them deeply makes you a better API designer.
Rate Limiting
Rate limiting is a policy that restricts how many requests a client can make in a given time window. The goal is to protect your backend from being overwhelmed, whether from a misbehaving client, a buggy retry loop, or a genuine traffic spike.
Token Bucket
The most common algorithm is the token bucket. The idea:
- A bucket holds up to N tokens
- Tokens are added at a steady rate (e.g., 10 tokens/second)
- Each request consumes 1 token
- If the bucket is empty, the request is rejected
class TokenBucket {
  private tokens: number
  private lastRefill: number

  constructor(
    private capacity: number,
    private refillRate: number // tokens per ms
  ) {
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  consume(): boolean {
    this.refill()
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }

  private refill() {
    const now = Date.now()
    const elapsed = now - this.lastRefill
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate)
    this.lastRefill = now
  }
}
The key property of the token bucket is that it allows bursting. A client that hasn't made requests recently has accumulated tokens, so it can make a burst of requests. This is usually the right behavior: you want to handle legitimate bursts, not artificially smooth everything to a flat rate.
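The burst behavior is easy to see by driving a bucket by hand. This is a self-contained variant of the class above with time injected manually (so the example is deterministic), not production code:

```typescript
// Minimal token bucket with a manually driven clock, so refill is deterministic.
class ManualTokenBucket {
  private tokens: number

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity // a fresh bucket starts full
  }

  consume(): boolean {
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }

  // Simulate elapsed time instead of reading Date.now().
  refill(elapsedMs: number) {
    this.tokens = Math.min(this.capacity, this.tokens + elapsedMs * this.refillRate)
  }
}

// Capacity 5, refilling at 0.01 tokens/ms (10 tokens/second).
const bucket = new ManualTokenBucket(5, 0.01)

// A burst of 6 requests: the first 5 drain the full bucket, the 6th is rejected.
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.consume())

// After 100 ms of idle time, exactly one token has refilled.
bucket.refill(100)
```

A fixed-rate limiter would have rejected most of that burst; the bucket absorbs it because the client had been idle.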
Sliding Window
The token bucket above keeps its state in process memory, per client. If you're running multiple backend instances, rate-limit state needs to be centralized, typically in Redis. A common distributed approach is a sliding window log:
async function isAllowed(clientId: string, limit: number, windowMs: number): Promise<boolean> {
  const key = `rate_limit:${clientId}`
  const now = Date.now()
  const windowStart = now - windowMs

  const pipeline = redis.pipeline()
  pipeline.zremrangebyscore(key, 0, windowStart) // remove entries older than the window
  pipeline.zadd(key, now, `${now}-${Math.random()}`) // add current request
  pipeline.zcard(key) // count requests in window
  pipeline.expire(key, Math.ceil(windowMs / 1000))

  const results = await pipeline.exec()
  const count = results[2][1] as number // third command (zcard), [err, value] tuple
  return count <= limit
}
This is a sliding window implementation using a Redis sorted set. Each request is stored with its timestamp as the score. To check the rate, you count entries in the current window. This is accurate and handles distributed systems correctly.
What to Return
When you reject a rate-limited request, be explicit:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1700000030
429 Too Many Requests with a Retry-After header tells well-behaved clients exactly when to retry. This is the difference between a client that hammers you and one that backs off gracefully.
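On the client side, honoring that header takes a few lines. A sketch of the delay calculation (`retryDelayMs` is a hypothetical helper; note that the HTTP spec allows Retry-After to be either a number of seconds or an HTTP date, and this handles both):

```typescript
// Compute how long a client should wait before retrying a 429 response.
// Retry-After may be delta-seconds ("30") or an HTTP date; fall back to a
// default delay when the header is missing or unparseable.
function retryDelayMs(retryAfter: string | null, fallbackMs = 1000): number {
  if (retryAfter === null) return fallbackMs

  const seconds = Number(retryAfter)
  if (!Number.isNaN(seconds)) return seconds * 1000 // delta-seconds form

  const date = Date.parse(retryAfter)
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now()) // HTTP-date form

  return fallbackMs
}
```

A well-behaved client would sleep for this long (often combined with jitter and a retry cap) before reissuing the request.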
Caching
Caching stores the result of expensive operations so subsequent requests can skip the work. The challenge isn't implementing a cache; it's knowing what to cache, for how long, and how to handle staleness.
Cache-Aside (Lazy Loading)
The most common pattern. The application manages the cache explicitly:
async function getUser(id: number): Promise<User> {
  const cacheKey = `user:${id}`

  // 1. Check cache
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  // 2. Cache miss: fetch from DB
  const user = await db.query('SELECT * FROM users WHERE id = $1', [id])

  // 3. Populate cache
  await redis.setex(cacheKey, 300, JSON.stringify(user)) // TTL: 5 minutes

  return user
}
This is called cache-aside because the cache sits beside the database: the application decides when to populate it. The downside is that the first request after a cache miss always hits the database. If many requests arrive simultaneously for the same uncached key, you get a thundering herd: all of them hit the database at once.
A simple mitigation is to use a lock:
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function getUser(id: number): Promise<User> {
  const cacheKey = `user:${id}`
  const lockKey = `lock:user:${id}`

  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  // Only let one request fetch from DB; the lock expires after 5s
  // in case the holder crashes before releasing it
  const acquired = await redis.set(lockKey, '1', 'NX', 'EX', 5)
  if (!acquired) {
    // Wait briefly and retry from cache
    await sleep(50)
    return getUser(id)
  }

  try {
    const user = await db.query('SELECT * FROM users WHERE id = $1', [id])
    await redis.setex(cacheKey, 300, JSON.stringify(user))
    return user
  } finally {
    await redis.del(lockKey) // release the lock even if the query throws
  }
}
Cache Invalidation
The hardest part of caching. You have two options:
TTL-based expiry: the cache entry expires after a fixed time. Simple, but you trade accuracy for that simplicity: data might be stale for up to TTL seconds.
Explicit invalidation: when data changes, actively delete or update the cache entry:
async function updateUser(id: number, data: Partial<User>): Promise<User> {
  const user = await db.query(
    'UPDATE users SET name = $1 WHERE id = $2 RETURNING *',
    [data.name, id]
  )
  await redis.del(`user:${id}`) // invalidate cache
  return user
}
This is more accurate but requires you to know every cache key that might be affected by a write. For simple entities this is fine; for complex data with many derived views, it gets complicated fast.
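One way to keep write paths manageable is to define every cache key for an entity in one place, so the read path and every invalidating write path can't drift apart. A sketch of that idea (`userKeys` and `userCacheKeys` are hypothetical names, not from the code above):

```typescript
// Hypothetical registry: one object defines every cache key derived from a
// user row, so writes can invalidate all of them without hunting through code.
const userKeys = {
  profile: (id: number) => `user:${id}`,
  postList: (id: number) => `user:${id}:posts`, // a derived view of the same row
}

// All keys that a write to user `id` should invalidate.
function userCacheKeys(id: number): string[] {
  return Object.values(userKeys).map((keyFn) => keyFn(id))
}
```

An update handler would then call something like `redis.del(...userCacheKeys(id))` instead of remembering each key string individually.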
Choosing a TTL
TTL is a tradeoff between freshness and cache hit rate:
- Short TTL (seconds): nearly always fresh, but a low hit rate; you're paying for the cache infrastructure without much benefit
- Long TTL (hours/days): high hit rate, but stale data is more likely
A useful heuristic: set TTL to the longest staleness you can tolerate. For user profiles, maybe 5 minutes. For product catalog data, maybe an hour. For static reference data (country lists, currency codes), days.
Combining Both
Rate limiting and caching work together. A common pattern in API design:
- The rate limiter sits at the edge (API gateway, middleware): it's fast and rejects bad actors before they reach any business logic
- The cache sits close to the data layer: expensive queries hit the database only once per TTL
- Between them, your application logic handles only the requests that actually need to do real work
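The layering can be sketched end to end with in-memory stand-ins. The fixed counter and Map below are illustrative placeholders for the token bucket and Redis cache described above, just to show the ordering of the checks:

```typescript
type Response = { status: number; body?: string }
type Handler = (clientId: string, key: string) => Promise<Response>

// In-memory stand-ins for the edge limiter and the data-layer cache.
const requestCounts = new Map<string, number>()
const cache = new Map<string, string>()

function withRateLimitAndCache(
  limit: number,
  fetcher: (key: string) => Promise<string> // the "real work": DB query, etc.
): Handler {
  return async (clientId, key) => {
    // 1. Edge: reject over-limit clients before doing any real work.
    const count = (requestCounts.get(clientId) ?? 0) + 1
    requestCounts.set(clientId, count)
    if (count > limit) return { status: 429 }

    // 2. Data layer: serve from cache when possible.
    const cached = cache.get(key)
    if (cached !== undefined) return { status: 200, body: cached }

    // 3. Only allowed, uncached requests reach the expensive path.
    const body = await fetcher(key)
    cache.set(key, body)
    return { status: 200, body }
  }
}
```

The ordering matters: limiting first means a flood of requests can't even generate cache traffic, and caching second means the requests that do get through rarely reach the database.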
Neither is a substitute for the other. Rate limiting doesn't help if a single allowed request triggers a 10-second database query. Caching doesn't help if one client can make unlimited requests and exhaust your cache infrastructure.
Getting these right before you need them is a lot easier than debugging a production incident at 2am.