Latency

Understand and optimise AICredits latency. Benchmarks, latency sources, and tips for reducing time-to-first-token.

This page covers where latency comes from in the AICredits request path, typical benchmarks, and how to optimise your integration.

Request Flow

Every request passes through these steps before the first token is returned:

Step	Typical Latency	Notes
API key validation	< 2ms	Redis cache hit; ~10ms on cache miss
Rate limit check	< 1ms	Redis counter increment
Guardrails (PII masking)	2–5ms	Only when enabled
Provider routing	< 1ms	Health check from in-memory state
Provider network + TTFT	100ms–5s	Varies by model and provider

The AICredits overhead (all steps except the last) is typically under 10ms when the API key is served from the Redis cache. A key-cache miss can push it somewhat higher.
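
As a rough check on that figure, the sketch below adds up the upper bounds from the table. The function and constant names are hypothetical and exist only for this illustration; the numbers are the ones quoted above, not new measurements.

```python
# Upper bounds in milliseconds, copied from the request-flow table above.
STEP_UPPER_BOUND_MS = {
    "api_key_validation": 2.0,  # Redis cache hit
    "rate_limit_check": 1.0,
    "guardrails": 5.0,          # only when PII masking is enabled
    "provider_routing": 1.0,
}
KEY_CACHE_MISS_MS = 10.0  # approximate key-validation cost on a cache miss


def overhead_upper_bound_ms(guardrails_enabled: bool = True,
                            key_cache_hit: bool = True) -> float:
    """Sum the per-step upper bounds for everything except the provider call."""
    steps = dict(STEP_UPPER_BOUND_MS)
    if not guardrails_enabled:
        steps["guardrails"] = 0.0
    if not key_cache_hit:
        steps["api_key_validation"] = KEY_CACHE_MISS_MS
    return sum(steps.values())


print(overhead_upper_bound_ms())                          # 9.0
print(overhead_upper_bound_ms(guardrails_enabled=False))  # 4.0
print(overhead_upper_bound_ms(key_cache_hit=False))       # 17.0
```

Even in the worst case shown here, the overhead is small next to a provider TTFT of 100ms or more.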

Time to First Token (TTFT)

TTFT is the dominant latency component for chat completions and scales with:

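Model size: larger models take longer to produce the first token (GPT-4o is slower than GPT-4o-mini, Claude Sonnet is slower than Haiku)
Prompt length: the provider processes the whole prompt before generating, so longer prompts raise TTFT
Provider load: capacity is shared, so TTFT is higher during peak hours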

Typical TTFT ranges by model:

Model	Typical TTFT
gpt-4o-mini	150–400ms
gpt-4o	300–800ms
anthropic/claude-haiku-4.5	200–500ms
anthropic/claude-sonnet-4.6	400–1200ms
gemini-2.0-flash	200–600ms
deepseek-chat	300–900ms

For user-facing applications, always enable streaming (stream: true). The user sees the first token after roughly the model's TTFT (about 200–500ms for the faster models above) instead of waiting for the full response. Total generation time is unchanged, but perceived latency is much lower.
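
To see the difference between TTFT and total generation time in your own integration, you can time a streamed response. The following is a minimal sketch, not part of any AICredits SDK: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a provider so the example runs without network access.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(open_stream: Callable[[], Iterable],
                   get_text: Callable[[object], Optional[str]]
                   ) -> Tuple[Optional[float], float, str]:
    """Open a stream, consume it, and return (ttft_s, total_s, full_text)."""
    start = time.perf_counter()  # start before the request is issued
    ttft = None
    parts = []
    for chunk in open_stream():
        text = get_text(chunk)
        if not text:
            continue  # skip empty or metadata-only chunks
        if ttft is None:
            ttft = time.perf_counter() - start  # first visible token
        parts.append(text)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)


def fake_stream():
    """Stand-in for a provider stream (assumption for this sketch)."""
    time.sleep(0.05)  # simulated time to first token
    yield "Hello"
    time.sleep(0.05)  # simulated remaining generation time
    yield ", world"


ttft, total, text = measure_stream(fake_stream, get_text=lambda c: c)
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, text={text!r}")
```

With an OpenAI-compatible Python client, `open_stream` would typically be a lambda that calls `client.chat.completions.create(..., stream=True)`, and `get_text` would read `chunk.choices[0].delta.content` when `chunk.choices` is non-empty.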

Choose a Faster Model

For low-latency use cases, use a smaller, faster model:

Fast completion

# High-latency: overkill for classification tasks
response = client.chat.completions.create(model="anthropic/claude-sonnet-4.6", ...)

# Low-latency: right-sized for simple tasks
response = client.chat.completions.create(model="openai/gpt-4o-mini", ...)
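
One way to shortlist models is to filter the typical TTFT ranges from the table above against a latency budget. This is a sketch under assumptions: `models_within_budget` is a hypothetical helper, the ranges are the typical figures quoted on this page rather than guarantees, and it ignores output quality, which you still need to evaluate for your task.

```python
from typing import Dict, List, Tuple

# (low, high) typical TTFT in milliseconds, copied from the table above.
TYPICAL_TTFT_MS: Dict[str, Tuple[int, int]] = {
    "gpt-4o-mini": (150, 400),
    "gpt-4o": (300, 800),
    "anthropic/claude-haiku-4.5": (200, 500),
    "anthropic/claude-sonnet-4.6": (400, 1200),
    "gemini-2.0-flash": (200, 600),
    "deepseek-chat": (300, 900),
}


def models_within_budget(budget_ms: int) -> List[str]:
    """Return models whose typical upper-bound TTFT fits the budget, fastest first."""
    fits = [(high, name)
            for name, (_low, high) in TYPICAL_TTFT_MS.items()
            if high <= budget_ms]
    return [name for _high, name in sorted(fits)]


print(models_within_budget(600))
# ['gpt-4o-mini', 'anthropic/claude-haiku-4.5', 'gemini-2.0-flash']
```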

Keep Prompts Short

Longer prompts increase TTFT. For RAG applications, retrieve only the most relevant chunks rather than sending the entire document corpus. For Claude models, use Prompt Caching to avoid re-processing long, repeated context.
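
For the RAG case, trimming can be as simple as keeping the top-scoring chunks that fit a size budget. The sketch below assumes your retriever already returns `(score, text)` pairs. `select_chunks` is a hypothetical helper, and word count is used only as a rough stand-in for token count.

```python
from typing import List, Tuple


def select_chunks(scored_chunks: List[Tuple[float, str]],
                  max_chunks: int = 3,
                  max_words: int = 200) -> List[str]:
    """Keep the highest-scoring chunks that fit within a rough size budget."""
    selected: List[str] = []
    used_words = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, text in ranked:
        if len(selected) == max_chunks:
            break
        words = len(text.split())
        if used_words + words > max_words:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(text)
        used_words += words
    return selected


chunks = [
    (0.91, "Refunds are processed within five days."),
    (0.42, "Our office is closed on public holidays."),
    (0.88, "Refund requests need an order ID."),
    (0.15, "The company was founded by two engineers."),
]
print(select_chunks(chunks, max_chunks=2))
# ['Refunds are processed within five days.', 'Refund requests need an order ID.']
```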

Use Semantic Caching

For applications with repetitive queries, Semantic Caching returns cached responses in under 50ms — bypassing the provider entirely.
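
To illustrate the general idea behind a semantic cache (this is a conceptual sketch, not AICredits' implementation), the example below stores responses keyed by an embedding vector and returns a cached response when a new query's embedding is similar enough. The class name, the threshold of 0.95, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Tuple[float, ...], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self.entries.append((tuple(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(cached_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0], "Paris")
print(cache.get([0.99, 0.05]))  # near-duplicate query: served from cache
print(cache.get([0.0, 1.0]))    # unrelated query: falls through to the provider
```

A hit skips the provider call entirely, which is why cached responses can come back far faster than any model's TTFT.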

Deploy Geographically Close

AICredits is hosted in India. If your users or servers are in India or Southeast Asia, you benefit from lower network latency to the API endpoint (api.aicredits.in).

Latency Debug Headers

When ENABLE_DEBUG_HEADERS is set on the server, responses include debug headers that show how the request was served, which helps attribute latency:

Header	Description
`X-AICredits-Provider`	Which provider served the request
`X-AICredits-Model`	Actual model used (may differ if rerouted)
`X-AICredits-Cache`	`hit`, `miss`, or `disabled` (semantic cache)

Debug headers are only available on accounts with debug mode enabled. Contact support to enable them for your account during development.
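
Once debug headers are enabled, you can pull them out of any response's header mapping. The sketch below is a minimal example: `read_debug_headers` is a hypothetical helper, and the sample values are made up for illustration. HTTP header names are case-insensitive, so the lookup normalises case.

```python
from typing import Dict, Mapping, Optional

# Header names as listed in the table above.
DEBUG_HEADERS = ("X-AICredits-Provider", "X-AICredits-Model", "X-AICredits-Cache")


def read_debug_headers(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract the debug headers; a missing header maps to None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name.lower()) for name in DEBUG_HEADERS}


# Illustrative header values only, not real output.
sample_headers = {
    "content-type": "application/json",
    "x-aicredits-provider": "openai",
    "x-aicredits-model": "gpt-4o-mini",
    "x-aicredits-cache": "miss",
}
debug = read_debug_headers(sample_headers)
print(debug["X-AICredits-Cache"])  # miss
```

With the `requests` library you would pass `response.headers`. With the OpenAI Python SDK, the `with_raw_response` variant of `create` exposes the underlying response headers.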