Sliding Window-Fixed Rate: Practical Rate Limiting for Web APIs

msingh
Oct 31, 2020

Why do we need Rate Limiting?

Picture this: You have a service with a great REST API, and it’s being used by many clients. These could be services within your own organization or third-party applications.

It’s all going well, but one day lightning strikes.

Users are complaining of high latency and your service being intermittently unavailable. You investigate and find that one of your clients was hitting your API at 10X its normal request rate, which led to 100% CPU utilisation on your DB and other resources. When that client cools off, things return to normal. You talk to the client; it turns out it was a bug, and they fix it.

You relax, things are normal and then it happens again.

This time it turns out it was a different client. They had to do a bulk invalidation of a local cache. With no cache to serve traffic, every request to their service also required an API request to your service until their cache warmed up. This resulted in a huge spike of traffic to your service.

In both cases, albeit unintentionally, clients caused a DoS (denial of service) attack on your service. It could also happen intentionally, i.e. a malicious client sending high traffic to your service with the intention of causing a DoS.

That gets you thinking: what can you do to protect your service so that one or more offending clients don’t cause errors or unavailability for your other well-behaved clients? In such cases you need a quota for the clients calling your API. If a client exceeds their quota, they get throttled, i.e. their request is not processed. This concept is commonly known as “Rate Limiting”.

Below we present the implementation of a practical algorithm that runs at scale with very low overhead and has been used in the wild to handle a ton of traffic.

Sliding Window-Fixed Rate Algorithm

Let’s walk through an example -

Say we would like to impose a limit of 500 requests per minute for an API.

So here is how the algorithm works:

Time is divided into buckets of the required size; in this case we need per-minute limits, so the bucket size is 1 minute. An hour can then be represented as 60 buckets, indexed 0–59, so a mental model of this partition looks something like below

Time buckets of 1 Min (60 secs)

Rate Limiter

  1. It keeps a running count of requests per bucket as they arrive.
  2. Say a request arrives at Minute 23, Second 45 (23:45). This request falls in bucket index 23.
  3. We already know two things: a) the count of requests in the previous bucket, i.e. bucket 22, and b) the total number of requests so far in bucket 23. Let’s say the count in bucket 22 was 400 and the count so far in bucket 23 is 250.
  4. Here is the fixed-rate part of the calculation. Pretend that the 400 requests in bucket 22 were spread evenly, i.e. they arrived at a constant rate.
  5. If we draw a window of size 1 minute ending at the request arrival time, the window spans 22:45 to 23:45, and 25% of it (15 of 60 seconds) lies in bucket 22.
  6. So the estimated number of requests in that portion of bucket 22 is 15/60 of its total count, i.e. 25% of 400 = 100.
  7. In bucket 23, the total so far is 250. Hence the estimated number of requests in the 1-minute window ending at 23:45 is 250 + 100 = 350.
  8. Admitting this request would make the total 351, which is below the 500 requests-per-minute threshold we wanted for this API. So the rate limiter increments the counter for bucket 23 to 251 and allows the request (see the sketch right after this list).
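Here is a minimal sketch of this calculation in Go (the function and parameter names are illustrative, not the library’s actual API):

import "time"

// allow reports whether a request arriving at time t should be admitted.
// prevCount and currCount are the request totals for the previous and current
// 1-minute buckets; limit is the allowed number of requests per minute.
func allow(t time.Time, prevCount, currCount, limit int) bool {
    window := time.Minute
    // Time elapsed in the current bucket, e.g. 45s for a request at 23:45.
    elapsed := t.Sub(t.Truncate(window))
    // Fraction of the sliding window that falls in the previous bucket, e.g. 0.25.
    weight := 1 - elapsed.Seconds()/window.Seconds()
    estimated := int(float64(prevCount)*weight) + currCount
    return estimated+1 <= limit
}

With prevCount = 400, currCount = 250 and a request at second 45, this gives 0.25 × 400 + 250 = 350, matching the walkthrough above.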

Advantages of this Algorithm

This algorithm has several advantages.

Storage

Even though conceptually we have multiple buckets, e.g. 60 buckets of 1-minute size for an hour, the calculation only requires the counts for the current and previous buckets. So only two counters need to be managed at any given time.
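For illustration, the two live counters can be addressed by deriving keys from the client and the bucket index. A sketch (this key format is an assumption, not necessarily the library’s actual scheme):

import (
    "fmt"
    "time"
)

// bucketKeys returns the counter keys for the current and previous 1-minute
// buckets. The "client:index" format here is hypothetical.
func bucketKeys(clientKey string, t time.Time) (curr, prev string) {
    idx := t.Minute()
    prevIdx := (idx + 59) % 60 // wrap around the hour boundary
    return fmt.Sprintf("%s:%d", clientKey, idx), fmt.Sprintf("%s:%d", clientKey, prevIdx)
}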

Low Latency

Most services run multiple servers behind the API, so the rate limiter needs to apply a global limit. This means the rate limiters on all servers need to share the counters. A classic setup of distributed API servers looks like below -

Rate Limiter accessing central counter store

For every request, the Rate Limiter fetches two counters (previous and current bucket), does a simple calculation (in server memory) and then increments the current counter.

In our case we use Memcached to store the counters and fetch both bucket counters with a single multi-get call. Counters are incremented using memcached’s atomic incr operation. Each bucket has a TTL based on the window size, so older buckets expire automatically.
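A sketch of that round trip using the bradfitz/gomemcache client (the key names and TTL here are illustrative; see the repository for the actual scheme):

import (
    "strconv"

    "github.com/bradfitz/gomemcache/memcache"
)

// fetchAndIncrement reads both bucket counters in one multi-get, then bumps
// the current bucket. It returns the previous count and the new current count.
func fetchAndIncrement(mc *memcache.Client, prevKey, currKey string) (prev, curr uint64, err error) {
    items, err := mc.GetMulti([]string{prevKey, currKey})
    if err != nil {
        return 0, 0, err
    }
    if it, ok := items[prevKey]; ok {
        prev, _ = strconv.ParseUint(string(it.Value), 10, 64)
    }
    if _, ok := items[currKey]; !ok {
        // memcached's incr only works on existing keys, so seed the counter.
        // A TTL of 120s keeps a 1-minute bucket alive while it is still needed
        // as the "previous" bucket of the next window.
        mc.Add(&memcache.Item{Key: currKey, Value: []byte("0"), Expiration: 120})
    }
    curr, err = mc.Increment(currKey, 1)
    return prev, curr, err
}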

Easy to implement flexible window size

Your bucket size may vary depending on the kind of controls you want to put on the API. For example, an API may have rate limits like 10 requests per second as well as 3,000 requests per 10 minutes. This isn’t difficult to implement with the above approach. With a 10-minute window, the buckets would look like below

Everything else in the logic and calculation remains the same.
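Generalizing the earlier sketch, only the window duration changes (names remain illustrative):

// For "3000 requests per 10 minutes", pass window = 10 * time.Minute.
// time.Truncate gives the start of whichever bucket the request falls in,
// so the same weighted calculation works for any window size.
func allowN(t time.Time, window time.Duration, prevCount, currCount, limit int) bool {
    elapsed := t.Sub(t.Truncate(window))
    weight := 1 - elapsed.Seconds()/window.Seconds()
    estimated := int(float64(prevCount)*weight) + currCount
    return estimated+1 <= limit
}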

Implementation

A concrete implementation of this algorithm, with flexible window-size support, is presented here: https://github.com/monmohan/rate-limiting

Using the library

  • The rate-limiter check needs to happen before the API request is processed by your service. Most web frameworks have a concept of middleware, such as request filters or interceptors, and the allow/deny check is commonly done at this layer.
  • Let’s say we have an API, and we want to allow a maximum of 10,000 requests per client per 15 minutes for this API.
  • In general, clients/callers of the API would be identified by some kind of API key, e.g. an OAuth2 application ID.
  • The library provides a simple way to initialize the rate limiter and use it for any number of clients. Sample code below:
rateLimiter := PerNMinute(10000, 15)
/*
 * Give it a backing store.
 * Below is an example using local memcached as the backing counter store.
 * In production, you would likely use a cluster or something like AWS ElastiCache.
 */
rateLimiter.Store = &memcached.CounterStore{Client: memcache.New("127.0.0.1:11211")}

When the request arrives, identify the client making the request and check with the rate limiter -

clientKey := ... // extract from the incoming request
if !rateLimiter.Allow(clientKey) {
    // reject with HTTP status 429 (Too Many Requests), assuming a net/http
    // handler with ResponseWriter w
    http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
    return
}
// else proceed with handling the request
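Wired up as net/http middleware, the whole check might look like the sketch below (the header used to identify the client is an assumption):

import "net/http"

// rateLimitMiddleware rejects over-quota requests before they reach the handler.
func rateLimitMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        clientKey := r.Header.Get("X-Api-Key") // hypothetical client identifier
        if !rateLimiter.Allow(clientKey) {
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}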

That’s it.

Summary

We have used this library with much success, applying rate-limiting controls to many of our APIs and handling hundreds of thousands of requests per second without any noticeable overhead.

It’s important to note that there are other popular algorithms, like Token Bucket and the related Leaky Bucket, which can also be used for rate limiting and traffic control. However, its simplicity, effectiveness and, most importantly, low contention on the backing counter store make this algorithm a great alternative.
