
Latency

Understand and optimise AICredits latency. Benchmarks, latency sources, and tips for reducing time-to-first-token.

This page covers where latency comes from in the AICredits request path, typical benchmarks, and how to optimise your integration.

Request Flow

Every request passes through these steps before the first token is returned:

| Step | Typical Latency | Notes |
| --- | --- | --- |
| API key validation | < 2ms | Redis cache hit; ~10ms on cache miss |
| Rate limit check | < 1ms | Redis counter increment |
| Guardrails (PII masking) | 2–5ms | Only when enabled |
| Provider routing | < 1ms | Health check from in-memory state |
| Provider network + TTFT | 100ms–5s | Varies by model and provider |

The AICredits overhead (all steps except the last) is typically under 10ms when the API key lookup hits the Redis cache.
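The overhead figure can be checked by adding up the upper bounds from the table. The sketch below is only arithmetic on those published numbers; the function name and its two flags are illustrative and not part of any AICredits API.

```python
# Upper bounds in milliseconds, copied from the request-flow table above.
KEY_VALIDATION_HIT_MS = 2    # "< 2ms" on a Redis cache hit
KEY_VALIDATION_MISS_MS = 10  # "~10ms" on a cache miss (approximate)
RATE_LIMIT_MS = 1
GUARDRAILS_MS = 5            # upper end of "2–5ms", only when enabled
ROUTING_MS = 1


def overhead_upper_bound_ms(cache_hit: bool = True, guardrails: bool = False) -> int:
    """Hypothetical helper: worst-case gateway overhead before the provider call."""
    total = KEY_VALIDATION_HIT_MS if cache_hit else KEY_VALIDATION_MISS_MS
    total += RATE_LIMIT_MS + ROUTING_MS
    if guardrails:
        total += GUARDRAILS_MS
    return total


print(overhead_upper_bound_ms(cache_hit=True, guardrails=True))   # 9
print(overhead_upper_bound_ms(cache_hit=False, guardrails=True))  # 17
```

With a cache hit the total stays under 10ms even with guardrails enabled; a key-cache miss can push it past that, which is still small next to provider TTFT.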

Time to First Token (TTFT)

TTFT is the dominant latency component for chat completions and scales with:

  • Model size: larger models have a higher TTFT (GPT-4o is slower than GPT-4o-mini, Claude Sonnet is slower than Haiku)
  • Prompt length: longer prompts take longer to process before the first token is produced
  • Provider load: capacity is shared, so TTFT rises during peak hours

Typical TTFT ranges by model:

| Model | Typical TTFT |
| --- | --- |
| gpt-4o-mini | 150–400ms |
| gpt-4o | 300–800ms |
| anthropic/claude-haiku-4.5 | 200–500ms |
| anthropic/claude-sonnet-4.6 | 400–1200ms |
| gemini-2.0-flash | 200–600ms |
| deepseek-chat | 300–900ms |

Tips for Lower Latency

Use Streaming

For user-facing applications, always enable streaming (stream: true). The user sees the first token after roughly 200–500ms, depending on the model, instead of waiting for the full response. Total generation time is unchanged, but perceived latency is much lower.
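To see the difference between TTFT and total generation time in your own integration, you can time a streamed response on the client. The following is a minimal sketch, not an AICredits utility: `measure_ttft`, `start_stream`, and `is_token` are hypothetical names, and the helper works with any iterable of chunks.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[object]],
    is_token: Callable[[object], bool] = lambda chunk: True,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (seconds to first token, total seconds) for one streamed response."""
    started = clock()
    first_token_at: Optional[float] = None
    # The request is issued inside the timed region, so connection setup
    # and gateway overhead are included in the measurement.
    for chunk in start_stream():
        if first_token_at is None and is_token(chunk):
            first_token_at = clock()
    finished = clock()
    ttft = None if first_token_at is None else first_token_at - started
    return ttft, finished - started
```

With an OpenAI-compatible Python client, `start_stream` could be a lambda that calls `client.chat.completions.create(..., stream=True)`, and `is_token` could be `lambda c: bool(c.choices and c.choices[0].delta.content)` so that empty role-only chunks are not counted as the first token.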

Choose a Faster Model

For low-latency use cases, use a smaller, faster model:

Fast completion
# Assumes `client` is an OpenAI-compatible client and `messages` holds your prompt.
# High-latency: overkill for classification tasks
response = client.chat.completions.create(model="anthropic/claude-sonnet-4.6", messages=messages)

# Low-latency: right-sized for simple tasks
response = client.chat.completions.create(model="openai/gpt-4o-mini", messages=messages)
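One way to make this choice systematic is to filter models by a TTFT budget. The sketch below copies the typical ranges from the table earlier on this page (model IDs as written there); the helper `models_within_budget` is hypothetical, and the numbers are typical ranges, not guarantees.

```python
from typing import Dict, List, Tuple

# (low, high) typical TTFT in milliseconds, taken from the table on this page.
TYPICAL_TTFT_MS: Dict[str, Tuple[int, int]] = {
    "gpt-4o-mini": (150, 400),
    "gpt-4o": (300, 800),
    "anthropic/claude-haiku-4.5": (200, 500),
    "anthropic/claude-sonnet-4.6": (400, 1200),
    "gemini-2.0-flash": (200, 600),
    "deepseek-chat": (300, 900),
}


def models_within_budget(budget_ms: int) -> List[str]:
    """Models whose typical worst-case TTFT fits the budget, fastest first."""
    fits = [
        (high, name)
        for name, (_low, high) in TYPICAL_TTFT_MS.items()
        if high <= budget_ms
    ]
    return [name for _high, name in sorted(fits)]


print(models_within_budget(600))
# ['gpt-4o-mini', 'anthropic/claude-haiku-4.5', 'gemini-2.0-flash']
```

Filtering on the upper end of each range is a conservative choice; measure against your own prompts before committing to a model.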

Keep Prompts Short

Longer prompts increase TTFT. For RAG applications, retrieve only the most relevant chunks rather than sending the entire document corpus. For Claude models, use Prompt Caching to avoid re-processing long, repeated context.
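For the RAG case, trimming can be as simple as keeping the highest-scoring chunks that fit a size budget. This is a sketch under assumptions: the relevance scores come from your own retriever, `select_chunks` is a hypothetical helper, and character count is used as a rough stand-in for token count.

```python
from typing import List, Tuple


def select_chunks(
    scored_chunks: List[Tuple[float, str]],
    max_chunks: int = 3,
    max_chars: int = 2000,
) -> List[str]:
    """Keep the most relevant chunks that fit within the size budget."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least relevant.
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if len(selected) == max_chunks:
            break
        if used + len(text) > max_chars:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(text)
        used += len(text)
    return selected
```

The selected chunks are then joined into the prompt in place of the full document set, which keeps prompt length, and therefore TTFT, bounded.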

Use Semantic Caching

For applications with repetitive queries, Semantic Caching returns cached responses in under 50ms — bypassing the provider entirely.

Place Geographically Close

AICredits is hosted in India. If your users or servers are in India or Southeast Asia, you benefit from lower network latency to the API endpoint (api.aicredits.in).

Latency Debug Headers

When ENABLE_DEBUG_HEADERS is set on the server, responses include headers that help diagnose latency:

| Header | Description |
| --- | --- |
| X-AICredits-Provider | Which provider served the request |
| X-AICredits-Model | Actual model used (may differ if rerouted) |
| X-AICredits-Cache | Semantic cache status: hit, miss, or disabled |

Debug headers are only available on accounts with debug mode enabled. Contact support to enable them for your account during development.
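Once debug headers are enabled, they can be pulled out of a response for logging. The helper below is a hypothetical sketch that accepts any mapping of header names to values; only the three header names come from the table above, and missing headers are returned as None.

```python
from typing import Dict, Mapping, Optional

# Header names as documented in the table above.
DEBUG_HEADERS = ("X-AICredits-Provider", "X-AICredits-Model", "X-AICredits-Cache")


def read_debug_headers(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract the debug headers, matching names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered.get(name.lower()) for name in DEBUG_HEADERS}


def served_from_cache(headers: Mapping[str, str]) -> bool:
    """True when the semantic cache reported a hit for this response."""
    return read_debug_headers(headers)["X-AICredits-Cache"] == "hit"
```

If you use the OpenAI Python SDK against AICredits, `client.chat.completions.with_raw_response.create(...)` returns an object whose `.headers` can be passed to `read_debug_headers`; other HTTP clients expose response headers in a similar mapping.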
