How to respect OpenAI's rate limits in Rails

https://thoughtbot.com/blog/openai-rate-limits

I’m on a Rails project using OpenAI. We’re sending over large amounts of text to provide as much context as possible, and recently ran into issues with rate limiting. As it turns out, OpenAI’s rate limits are a little more complicated than other APIs.
Rate limits are measured in five ways: RPM (requests per minute), RPD (requests per day), TPM (tokens per minute), TPD (tokens per day), and IPM (images per minute). You can hit any one of these limits, depending on which is exhausted first. In our case, we were hitting our TPM (tokens per minute) rate limit.

Regardless of whether a rate limit is exceeded, OpenAI will return the following headers.

```plain text
x-ratelimit-reset-requests: 1s
x-ratelimit-reset-tokens: 6m0s
```

These headers represent the amount of time that needs to pass before the rate limit returns to its initial state.

At the time of this writing, OpenAI does not have a first-party Ruby library, but the community has gravitated towards ruby-openai, which is what our project is using. When a rate limit is hit, it raises Faraday::TooManyRequestsError, which gives us access to those headers via #response_headers.

Because OpenAI will return two headers (one for requests per minute and one for tokens per minute), we play it safe and wait based on the greater of the two values. Rather than roll our own script to parse the header values, we can use Chronic Duration to do this for us. We can then define our own custom error class in an initializer to build the wait value for us.

In order to use this value, we need to leverage #rescue_from in combination with #retry_job. This is because we need to set the wait value dynamically based on the headers, and #retry_on does not provide a way to do this.

Below is a distilled example.

```plain text
# app/jobs/send_prompt_job.rb
class SendPromptJob < ApplicationJob
  queue_as :default

  MAX_ATTEMPTS = 2

  rescue_from OpenAI::RateLimitError do |error|
    if executions < MAX_ATTEMPTS
      backoff = Backoff.polynomially_longer(executions:)

      retry_job wait: error.wait.seconds + backoff
    else
      Rails.logger.info "Exhausted attempts"
    end
  end

  def perform()
    OpenAI::Client.new.chat(...)
  rescue Faraday::TooManyRequestsError => error
    raise OpenAI::RateLimitError.new(error)
  end
end

# lib/backoff.rb
class Backoff
  DEFAULT_JITTER = 0.15

  def self.polynomially_longer(executions:, jitter: DEFAULT_JITTER)
    ((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2
  end
end

# config/initializers/openai.rb
module OpenAI
  class RateLimitError < StandardError
    attr_reader :reset_requests_in_seconds, :reset_tokens_in_seconds

    def initialize(faraday_error)
      headers = faraday_error.response_headers&.with_indifferent_access || {}
      @reset_requests_in_seconds = headers.fetch("x-ratelimit-reset-requests", "0s")
      @reset_tokens_in_seconds = headers.fetch("x-ratelimit-reset-tokens", "0s")

      super("The API has hit the rate limit")
    end

    # Wait for the longer of the two reset windows, in seconds.
    def wait
      [
        parse_duration(reset_requests_in_seconds),
        parse_duration(reset_tokens_in_seconds)
      ].max
    end

    private

    def parse_duration(value)
      ChronicDuration.parse(value) || 0
    end
  end
end
```

Since we can’t leverage retry_on, we need to ensure we eventually stop retrying the job if it continues to fail.

```plain text
executions < MAX_ATTEMPTS
```

You’ll also note that we add a “backoff” mechanism per OpenAI’s recommendation.

```plain text
backoff = Backoff.polynomially_longer(executions:)

retry_job wait: error.wait.seconds + backoff
```

## Avoid rate limits by being proactive

We took a reactive approach to the problem, but I do want to highlight that there’s an opportunity to be proactive by examining the headers that return the number of remaining requests or tokens that are permitted before exhausting the rate limit.

```plain text
x-ratelimit-remaining-requests
x-ratelimit-remaining-tokens
```

Unfortunately, ruby-openai does not return response headers, but there is a workaround. You can create a custom Faraday middleware, and pass it to the client in a block.
```plain text
class ExtractRateLimitHeaders < Faraday::Middleware
  def on_complete(env)
    # Store these values somewhere
    remaining_requests = env.response_headers["x-ratelimit-remaining-requests"]
    remaining_tokens = env.response_headers["x-ratelimit-remaining-tokens"]
  end
end

client = OpenAI::Client.new do |faraday|
  faraday.use ExtractRateLimitHeaders
end
```

You could then use this information to reduce the number of tokens you plan on sending to OpenAI by comparing its size with remaining_tokens. Or, if you’re keeping track of how many requests you’re making, you could compare that value with remaining_requests.

```plain text
# Ensure you're within the request and/or token limit before making a request
if (current_requests < remaining_requests && current_tokens < remaining_tokens)
  client.chat(...)
end
```

Alternatively, you could temporarily switch to a model with higher token and request limits, or temporarily reduce the number of tokens sent in the request.
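To make the "store these values somewhere" step of the proactive approach more concrete, here is a minimal sketch of an in-memory holder for the two remaining-budget headers. `RateLimitBudget`, `update`, `allow?`, and `estimate_tokens` are hypothetical names that are not part of ruby-openai or Faraday, and the four-characters-per-token estimate is only a rough rule of thumb, not a real tokenizer. It also assumes a single process; with multiple workers you would likely need a shared store instead.

```ruby
# Hypothetical helper (not part of ruby-openai): remembers the most recent
# "remaining" rate limit headers and answers whether a request likely fits.
class RateLimitBudget
  def initialize
    @mutex = Mutex.new
    @remaining_requests = nil
    @remaining_tokens = nil
  end

  # Very rough estimate, assuming about four characters per token.
  def self.estimate_tokens(text)
    (text.length / 4.0).ceil
  end

  # Record values from a hash-like headers object. Missing or non-numeric
  # values are stored as nil.
  def update(headers)
    @mutex.synchronize do
      @remaining_requests = Integer(headers["x-ratelimit-remaining-requests"], exception: false)
      @remaining_tokens = Integer(headers["x-ratelimit-remaining-tokens"], exception: false)
    end
  end

  # True when no budget is known yet, or when the last known budget
  # covers one more request of the estimated size.
  def allow?(estimated_tokens)
    @mutex.synchronize do
      return true if @remaining_requests.nil? || @remaining_tokens.nil?

      @remaining_requests >= 1 && @remaining_tokens >= estimated_tokens
    end
  end
end

budget = RateLimitBudget.new
budget.update(
  "x-ratelimit-remaining-requests" => "59",
  "x-ratelimit-remaining-tokens" => "1200"
)

puts budget.allow?(500)   # true: 1200 tokens remain
puts budget.allow?(5_000) # false: the request would exceed the remaining tokens
```

A middleware like the one above could call `update` with `env.response_headers` from `on_complete`, and a job could check `allow?` before sending a prompt. The stored values are only as fresh as the last response, so this complements the retry logic rather than replacing it.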
