
I’m on a Rails project using OpenAI. We’re sending over large amounts of text to provide as much context as possible, and recently ran into issues with rate limiting.
As it turns out, OpenAI’s rate limits are a little more complicated than those of most other APIs.
Rate limits are measured in five ways: RPM (requests per minute), RPD (requests per day), TPM (tokens per minute), TPD (tokens per day), and IPM (images per minute). You hit a rate limit as soon as any one of these thresholds is reached, whichever comes first.
In our case, we were hitting our TPM (tokens per minute) rate limit.
Regardless of whether a rate limit is exceeded, OpenAI will return the following headers.
```plain text
x-ratelimit-reset-requests: 1s
x-ratelimit-reset-tokens: 6m0s
```
Note that these values are durations: each one is the amount of time that needs to pass before the corresponding rate limit resets to its initial state.
At the time of this writing, OpenAI does not have a first-party Ruby library, but the community has gravitated towards ruby-openai, which is what our project is using. When a rate limit is hit, it raises Faraday::TooManyRequestsError, which gives us access to those headers via #response_headers.
Because OpenAI will return two headers (one for requests per minute and one for tokens per minute), we play it safe and wait based on the greater of the two values.
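To make “the greater of the two values” concrete, here is a minimal sketch that converts both header values to seconds and picks the larger one. The `ResetHeader` module and its hand-rolled parser are hypothetical and exist only for illustration; it assumes values shaped like `1s`, `6m0s`, or `20ms`.

```ruby
# Hypothetical helper for illustration only; not part of ruby-openai.
module ResetHeader
  # Seconds per unit suffix, assuming values shaped like "1s", "6m0s", or "20ms".
  UNIT_SECONDS = { "h" => 3600.0, "m" => 60.0, "s" => 1.0, "ms" => 0.001 }.freeze

  # Converts a header value such as "6m0s" into seconds (Float).
  # A missing header (nil) yields 0.0.
  def self.to_seconds(value)
    value.to_s.scan(/(\d+(?:\.\d+)?)(ms|h|m|s)/).sum(0.0) do |amount, unit|
      amount.to_f * UNIT_SECONDS.fetch(unit)
    end
  end

  # Picks the longer of the two reset windows.
  def self.wait_for(headers)
    [
      to_seconds(headers["x-ratelimit-reset-requests"]),
      to_seconds(headers["x-ratelimit-reset-tokens"])
    ].max
  end
end

headers = {
  "x-ratelimit-reset-requests" => "1s",
  "x-ratelimit-reset-tokens" => "6m0s"
}

puts ResetHeader.wait_for(headers) # 360.0
```

With the example headers above, the token window (six minutes) wins over the request window (one second), so the job would wait 360 seconds.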
Rather than roll our own parser for the header values, we can use Chronic Duration. We can then define a custom error class in an initializer that builds the wait value.
To use this value, we need to leverage #rescue_from in combination with #retry_job. We have to set the wait dynamically from the headers, and #retry_on does not provide a way to do this: its wait option accepts a fixed duration, a named algorithm, or a proc that only receives the execution count, never the error itself.
Below is a distilled example.
```plain text
# app/jobs/send_prompt_job.rb
class SendPromptJob < ApplicationJob
  queue_as :default

  MAX_ATTEMPTS = 2

  rescue_from OpenAI::RateLimitError do |error|
    if executions < MAX_ATTEMPTS
      backoff = Backoff.polynomially_longer(executions:)
      retry_job wait: error.wait.seconds + backoff
    else
      Rails.logger.info "Exhausted attempts"
    end
  end

  def perform
    OpenAI::Client.new.chat(...)
  rescue Faraday::TooManyRequestsError => error
    raise OpenAI::RateLimitError.new(error)
  end
end

# lib/backoff.rb
class Backoff
  DEFAULT_JITTER = 0.15

  def self.polynomially_longer(executions:, jitter: DEFAULT_JITTER)
    ((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2
  end
end

# config/initializers/openai.rb
module OpenAI
  class RateLimitError < StandardError
    attr_reader :reset_requests_in_seconds, :reset_tokens_in_seconds

    def initialize(faraday_error)
      headers = faraday_error.response_headers&.with_indifferent_access || {}
      @reset_requests_in_seconds = headers.fetch("x-ratelimit-reset-requests", "0s")
      @reset_tokens_in_seconds = headers.fetch("x-ratelimit-reset-tokens", "0s")

      super("The API has hit the rate limit")
    end

    # Wait for whichever limit takes longer to reset.
    def wait
      [
        parse_duration(reset_requests_in_seconds),
        parse_duration(reset_tokens_in_seconds)
      ].max
    end

    private

    def parse_duration(value)
      ChronicDuration.parse(value) || 0
    end
  end
end
```
Since we can’t leverage retry_on (and its attempts option), we need to make sure we eventually stop retrying the job ourselves if it continues to fail.
```plain text
executions < MAX_ATTEMPTS
```
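As a quick sanity check of that guard, the sketch below walks through the values `executions` can take. It assumes Active Job’s behavior of incrementing `executions` each time the job runs, so the first run sees 1. The local `max_attempts` simply mirrors the post’s `MAX_ATTEMPTS`.

```ruby
max_attempts = 2

# Evaluate the guard for each execution count the job can reach.
decisions = (1..max_attempts).map do |executions|
  [executions, executions < max_attempts ? :retry : :give_up]
end

p decisions # [[1, :retry], [2, :give_up]]
```

In other words, a limit of 2 means the job runs at most twice: the original attempt plus one retry.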
You’ll also note that we add a “backoff” on top of the header-based wait, per OpenAI’s recommendation.
```plain text
backoff = Backoff.polynomially_longer(executions:)
retry_job wait: error.wait.seconds + backoff
```
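To get a feel for how much extra delay the backoff adds, here is a small sketch that repeats the post’s `Backoff` class so it runs on its own, then evaluates it with jitter turned off. Passing `jitter: 0` is only a convenience for making the numbers deterministic; the job itself uses the default.

```ruby
# The Backoff class from the post, repeated here so the snippet runs on its own.
class Backoff
  DEFAULT_JITTER = 0.15

  def self.polynomially_longer(executions:, jitter: DEFAULT_JITTER)
    ((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2
  end
end

# With jitter disabled the delay is deterministic: executions**4 + 2.
# The default jitter adds up to 15% of executions**4 on top of that.
(1..3).each do |executions|
  base = Backoff.polynomially_longer(executions: executions, jitter: 0)
  ceiling = base + (executions**4) * Backoff::DEFAULT_JITTER
  puts "executions=#{executions}: #{base}s, below #{ceiling.round(2)}s with jitter"
end
```

The base delay is 3 seconds after the first execution, 18 after the second, and 83 after the third. With the example header value of `6m0s`, the first retry would be scheduled a little over 363 seconds out.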
## Avoid rate limits by being proactive
We took a reactive approach to the problem, but I do want to highlight that there’s an opportunity to be proactive by examining the headers that return the number of requests or tokens remaining before the rate limit is exhausted.
```plain text
x-ratelimit-remaining-requests
x-ratelimit-remaining-tokens
```
Unfortunately, ruby-openai does not return response headers, but there is a workaround: you can create a custom Faraday middleware and pass it to the client in a block.
```plain text
class ExtractRateLimitHeaders < Faraday::Middleware
  def on_complete(env)
    # Store these values somewhere
    remaining_requests = env.response_headers["x-ratelimit-remaining-requests"]
    remaining_tokens = env.response_headers["x-ratelimit-remaining-tokens"]
  end
end

client = OpenAI::Client.new do |faraday|
  faraday.use ExtractRateLimitHeaders
end
```
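The “store these values somewhere” step is left open above. One possible shape, sketched under the assumption that a single-process, in-memory store is good enough, is a small thread-safe object the middleware could write to. `RateLimitBudget` and its methods are hypothetical names, not part of ruby-openai or Faraday.

```ruby
# Hypothetical in-memory store; names are illustrative, not part of ruby-openai.
class RateLimitBudget
  def initialize
    @mutex = Mutex.new
    @remaining_requests = nil
    @remaining_tokens = nil
  end

  # Intended to be called from the middleware's on_complete with the headers.
  def update(headers)
    @mutex.synchronize do
      @remaining_requests = to_integer(headers["x-ratelimit-remaining-requests"])
      @remaining_tokens = to_integer(headers["x-ratelimit-remaining-tokens"])
    end
  end

  # True when nothing is known yet, or when both budgets cover the request.
  def allows?(estimated_tokens)
    @mutex.synchronize do
      requests_ok = @remaining_requests.nil? || @remaining_requests.positive?
      tokens_ok = @remaining_tokens.nil? || estimated_tokens <= @remaining_tokens
      requests_ok && tokens_ok
    end
  end

  private

  # Header values arrive as strings; a missing or malformed value becomes nil.
  def to_integer(value)
    Integer(value.to_s, exception: false)
  end
end

budget = RateLimitBudget.new
budget.update({
  "x-ratelimit-remaining-requests" => "59",
  "x-ratelimit-remaining-tokens" => "1200"
})

puts budget.allows?(800)  # true
puts budget.allows?(5000) # false
```

If jobs run across several processes, an in-memory object like this would not be shared between them, so a shared store (a cache or database) would be the more realistic home for these values.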
You could then use this information to reduce the number of tokens you plan on sending to OpenAI by comparing its size with remaining_tokens. Or, if you’re keeping track of how many requests you’re making, you could compare that value with remaining_requests.
```plain text
# Ensure you're within the request and/or token limit before making a request
if current_requests < remaining_requests && current_tokens < remaining_tokens
  client.chat(...)
end
```
Alternatively, you could temporarily switch to a model with higher token and request limits, or temporarily reduce the number of tokens sent in the request.
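Reducing the tokens you send requires some idea of how many tokens a prompt contains. The sketch below uses the common rule of thumb of roughly four characters per token for English text, which is an assumption and not the model’s actual tokenizer. `TokenBudget` and its methods are hypothetical helpers.

```ruby
# Hypothetical helpers for illustration; the estimate is a rule of thumb,
# not the output of the model's tokenizer.
module TokenBudget
  # Assumption: roughly 4 characters per token for English text.
  CHARS_PER_TOKEN = 4

  # Rough token count for a string.
  def self.estimate(text)
    (text.length / CHARS_PER_TOKEN.to_f).ceil
  end

  # Returns the text unchanged if its estimate fits, otherwise truncates it
  # to the number of characters the remaining budget is estimated to allow.
  def self.fit(text, remaining_tokens)
    return text if estimate(text) <= remaining_tokens

    text[0, [remaining_tokens, 0].max * CHARS_PER_TOKEN]
  end
end

context = "a" * 1000
trimmed = TokenBudget.fit(context, 100)

puts TokenBudget.estimate(context) # 250
puts TokenBudget.estimate(trimmed) # 100
```

Cutting text off at a character boundary is crude. In practice you would more likely drop the least relevant pieces of context first, and use a real tokenizer when the count needs to be accurate.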