How to Build a Retry Strategy for LLM API Calls

Rate limit errors, provider timeouts, and transient failures are inevitable. Here is a production-grade retry strategy with exponential backoff, jitter, and fallback routing.

Author: AICredits Team
Published: 3 Apr 2026
Reading time: 6 min read

What goes wrong and why

LLM API calls fail for three distinct reasons, each requiring a different response:

  • 429 Rate limit: you sent too many requests per minute. Wait, then retry.
  • Timeout: the provider took too long. Retry only if the request is idempotent, because the original request may still complete on the provider's side.
  • 500/502/503/504 Provider error: transient infrastructure issue. Retry with backoff.

The worst mistake is treating all failures the same way. Retrying immediately on a 429 makes your rate limit problem worse.

Exponential backoff with jitter using tenacity

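from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    stop_after_delay,
    wait_exponential,
    wait_random,
)
from openai import APIConnectionError, InternalServerError, OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
    max_retries=0,  # turn off the SDK's built-in retries so tenacity is the only retry layer
)

# 429s, network failures and timeouts, and 5xx responses are worth retrying.
RETRYABLE = (RateLimitError, APIConnectionError, InternalServerError)

@retry(
    retry=retry_if_exception_type(RETRYABLE),
    # Exponential backoff clamped to 2-30s, plus up to 1s of random jitter.
    wait=wait_exponential(multiplier=1, min=2, max=30) + wait_random(0, 1),
    # Give up after 4 attempts or 60 seconds, whichever comes first.
    stop=stop_after_attempt(4) | stop_after_delay(60),
)
def call_llm(prompt: str, model: str = "openai/gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

The wait grows exponentially after each failed attempt, never below 2s or above 30s, and wait_random(0, 1) adds up to one second of random jitter so that many clients do not retry in lockstep (the thundering herd problem). Four attempts means at most three waits. Once the attempts are exhausted, tenacity raises RetryError.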
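To see how a schedule like this is built, here is a minimal plain-Python sketch of capped exponential backoff with additive jitter. The function `backoff_delays` and its parameters are illustrative names, not part of tenacity, and the arithmetic shows the general technique rather than tenacity's exact internals.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 2.0, cap: float = 30.0,
                   jitter: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delays to sleep between `attempts` tries (one fewer delay than tries)."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts - 1):
        exponential = min(cap, base * (2 ** n))  # base, 2*base, 4*base, ... up to cap
        delays.append(exponential + rng.uniform(0, jitter))  # spread clients apart
    return delays


if __name__ == "__main__":
    # With jitter disabled the schedule is deterministic.
    print(backoff_delays(4, jitter=0.0))  # [2.0, 4.0, 8.0]
```

With jitter enabled, each client draws a slightly different delay, so clients that failed at the same moment do not all retry at the same moment.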

Which errors to retry

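from typing import Optional

from openai import APIConnectionError, APIStatusError

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def should_retry(error: Exception) -> bool:
    """Return True only for failures that a later attempt could fix."""
    if isinstance(error, APIConnectionError):
        return True  # network issue, always retry
    if isinstance(error, APIStatusError):
        return error.status_code in RETRYABLE_STATUS_CODES
    return False

def retry_after_seconds(error: Exception) -> Optional[float]:
    """Read a numeric Retry-After header from a 429 response, if present."""
    if isinstance(error, APIStatusError) and error.status_code == 429:
        value = error.response.headers.get("Retry-After")
        try:
            return float(value) if value is not None else None
        except ValueError:
            return None  # the header held an HTTP date, not a number of seconds
    return None

# The decision and the wait are kept separate: sleeping inside should_retry
# would stack on top of the backoff delay and block the caller unexpectedly.
# Never retry: 400 (bad request), 401 (auth), 403 (forbidden), 404 (not found)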
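The Retry-After header can carry either a number of seconds or an HTTP date, so a parser that only handles integers can fail on valid responses. The sketch below uses only the standard library. The helpers `parse_retry_after` and `next_wait` are hypothetical names introduced for illustration, and taking the larger of the backoff delay and the server hint is one reasonable policy, not the only one.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "12"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_wait(backoff: float, retry_after: Optional[str]) -> float:
    """Wait at least as long as the server asked, and never less than the backoff."""
    hinted = parse_retry_after(retry_after)
    return backoff if hinted is None else max(backoff, hinted)
```

In a real client it is also worth capping the hinted value, so that a misbehaving server cannot stall the caller indefinitely.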

Full retry + fallback pattern

from tenacity import RetryError

PRIMARY = "openai/gpt-4o"
FALLBACK = "anthropic/claude-3-5-haiku-20241022"

def safe_call(prompt: str) -> str:
    # call_llm already retries with backoff. Wrapping it in a second @retry
    # would nest the two loops and multiply the number of attempts.
    try:
        return call_llm(prompt, model=PRIMARY)
    except RetryError:
        # Retries on the primary are exhausted: fall back to a different provider.
        # Non-retryable errors (400, 401, 403, 404) are not caught and propagate,
        # because a different model will not fix a bad request or an invalid key.
        return call_llm(prompt, model=FALLBACK)

result = safe_call("Summarise the benefits of using an LLM gateway.")
print(result)

Circuit breaker via AICredits

AICredits implements a circuit breaker at the gateway level — unhealthy provider keys are automatically skipped for 30 seconds. This means you do not need to implement circuit breaking in your application code. The gateway handles it transparently.
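For readers unfamiliar with the pattern, the sketch below shows what a circuit breaker does in principle: after repeated failures a key is skipped for a cooldown period, then tried again. It is an illustration of the concept only, not AICredits' implementation. The class name, the failure threshold of 3, and the injectable clock are assumptions; only the 30-second cooldown comes from the text above.

```python
import time
from typing import Callable, Dict


class CircuitBreaker:
    """Skip a key for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold  # assumed value, not from the article
        self.cooldown = cooldown
        self.clock = clock
        self.failures: Dict[str, int] = {}
        self.open_until: Dict[str, float] = {}

    def available(self, key: str) -> bool:
        """True if the key may be used right now."""
        return self.clock() >= self.open_until.get(key, float("-inf"))

    def record_success(self, key: str) -> None:
        self.failures[key] = 0  # a success resets the failure streak

    def record_failure(self, key: str) -> None:
        self.failures[key] = self.failures.get(key, 0) + 1
        if self.failures[key] >= self.threshold:
            # Open the circuit: skip this key until the cooldown has passed.
            self.open_until[key] = self.clock() + self.cooldown
            self.failures[key] = 0
```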

What you do need to handle in your code:

  • Retry on transient errors (429, 5xx) with exponential backoff
  • Application-level fallback to a different model after retries are exhausted
  • Logging: record which model served each request and whether a fallback was triggered
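The logging point in the list above can be sketched as a small wrapper that tries models in order and reports which one served the request. This is a minimal illustration using only the standard library: `call_with_fallback`, the logger name, and the `call` and `fallback_on` parameters are hypothetical, and the log format is an assumption.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple, Type

logger = logging.getLogger("llm.fallback")


def call_with_fallback(
    prompt: str,
    models: Sequence[str],
    call: Callable[[str, str], str],
    fallback_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[str, str, bool]:
    """Try each model in order; return (text, model_used, fallback_triggered)."""
    last_error: Optional[BaseException] = None
    for index, model in enumerate(models):
        try:
            text = call(prompt, model)
        except fallback_on as exc:
            # Record the failure and move on to the next model in the list.
            logger.warning("model=%s failed error=%r", model, exc)
            last_error = exc
            continue
        fallback_triggered = index > 0
        logger.info("served_by=%s fallback=%s", model, fallback_triggered)
        return text, model, fallback_triggered
    raise RuntimeError("all models failed") from last_error
```

With the functions from this article, `call` could be `lambda p, m: call_llm(p, model=m)` and `fallback_on` could be `(RetryError,)`, so that only exhausted retries trigger a fallback and non-retryable errors still propagate.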
