Latency
Understand and optimise AICredits latency. Benchmarks, latency sources, and tips for reducing time-to-first-token.
This page covers where latency comes from in the AICredits request path, typical benchmarks, and how to optimise your integration.
Request Flow
Every request passes through these steps before the first token is returned:
| Step | Typical Latency | Notes |
|---|---|---|
| API key validation | < 2ms | Redis cache hit; ~10ms on cache miss |
| Rate limit check | < 1ms | Redis counter increment |
| Guardrails (PII masking) | 2–5ms | Only when enabled |
| Provider routing | < 1ms | Health check from in-memory state |
| Provider network + TTFT | 100ms–5s | Varies by model and provider |
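The AICredits overhead (all steps except the last) is typically under 10ms when API key validation hits the Redis cache. On a cache miss, key validation alone takes about 10ms, so the total overhead for that request is higher.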
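As a rough worked example, the per-step figures in the table can be added up to compare the overhead on an API key cache hit and on a miss. The numbers are the upper bounds (or approximate values) from the table; the dictionary and function names are illustrative only and not part of any AICredits API.

```python
# Latencies in milliseconds, taken from the table above.
OVERHEAD_MS = {
    "api_key_validation_hit": 2,    # upper bound on a Redis cache hit
    "api_key_validation_miss": 10,  # approximate figure ("~10ms") on a cache miss
    "rate_limit_check": 1,
    "guardrails": 5,                # only applies when PII masking is enabled
    "provider_routing": 1,
}


def overhead_upper_bound_ms(cache_hit: bool = True, guardrails: bool = True) -> int:
    """Sum the worst-case AICredits overhead for one request."""
    key_step = "api_key_validation_hit" if cache_hit else "api_key_validation_miss"
    total = (
        OVERHEAD_MS[key_step]
        + OVERHEAD_MS["rate_limit_check"]
        + OVERHEAD_MS["provider_routing"]
    )
    if guardrails:
        total += OVERHEAD_MS["guardrails"]
    return total


print(overhead_upper_bound_ms(cache_hit=True))    # 9
print(overhead_upper_bound_ms(cache_hit=False))   # 17
print(overhead_upper_bound_ms(guardrails=False))  # 4
```

Even in the cache-miss case, the overhead stays small next to the 100ms–5s spent on the provider network and TTFT.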
Time to First Token (TTFT)
TTFT is the dominant latency component for chat completions and scales with:
- Model size: larger models have a higher TTFT (GPT-4o is slower than GPT-4o-mini, and Claude Sonnet is slower than Claude Haiku)
- Prompt length: longer prompts take longer to process
- Provider load: capacity is shared, so TTFT is higher during peak hours
Typical TTFT ranges by model:
| Model | Typical TTFT |
|---|---|
| gpt-4o-mini | 150–400ms |
| gpt-4o | 300–800ms |
| anthropic/claude-haiku-4.5 | 200–500ms |
| anthropic/claude-sonnet-4.6 | 400–1200ms |
| gemini-2.0-flash | 200–600ms |
| deepseek-chat | 300–900ms |
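One way to use the table above is to shortlist models against a TTFT budget. The sketch below copies the typical ranges from the table and filters on the upper end of each range. The function name and the idea of filtering on the upper bound are illustrative assumptions, and real TTFT will vary with prompt length and provider load.

```python
from __future__ import annotations

# Typical TTFT ranges in milliseconds (low, high), copied from the table above.
TYPICAL_TTFT_MS = {
    "gpt-4o-mini": (150, 400),
    "gpt-4o": (300, 800),
    "anthropic/claude-haiku-4.5": (200, 500),
    "anthropic/claude-sonnet-4.6": (400, 1200),
    "gemini-2.0-flash": (200, 600),
    "deepseek-chat": (300, 900),
}


def models_within_budget(budget_ms: int) -> list[str]:
    """Return models whose typical upper-end TTFT fits the budget, fastest first."""
    fitting = [
        (high, name)
        for name, (_low, high) in TYPICAL_TTFT_MS.items()
        if high <= budget_ms
    ]
    return [name for _high, name in sorted(fitting)]


print(models_within_budget(600))
# ['gpt-4o-mini', 'anthropic/claude-haiku-4.5', 'gemini-2.0-flash']
```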
Tips for Lower Latency
Use Streaming
For user-facing applications, always enable streaming (stream: true). The user sees the first token in 200–500ms instead of waiting for the full response. Total generation time is the same, but perceived latency is much lower.
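To check the effect of streaming in your own integration, you can time the first content chunk. The sketch below assumes chunks shaped like those of an OpenAI-compatible Python SDK (`chunk.choices[0].delta.content`). `measure_ttft` and `fake_stream` are hypothetical helper names, and the fake stream only stands in for a real response so the example runs offline.

```python
import time
from types import SimpleNamespace
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(open_stream: Callable[[], Iterable]) -> Tuple[Optional[float], str]:
    """Open a stream and return (seconds until first content chunk, full text)."""
    start = time.perf_counter()  # start before the request is sent
    ttft = None
    parts = []
    for chunk in open_stream():
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if ttft is None:
                ttft = time.perf_counter() - start
            parts.append(text)
    return ttft, "".join(parts)


def fake_stream():
    """Offline stand-in for a streamed chat completion (illustrative only)."""
    for piece in ["Hel", "lo", None]:
        time.sleep(0.01)  # simulate network and generation delay
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


ttft, text = measure_ttft(fake_stream)
print(f"first token after {ttft * 1000:.0f} ms, text={text!r}")
```

With a real client, and assuming it is an OpenAI-compatible SDK client configured for AICredits, the call would look like `measure_ttft(lambda: client.chat.completions.create(model="openai/gpt-4o-mini", messages=messages, stream=True))`. Passing a callable keeps the request itself inside the timed window.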
Choose a Faster Model
For low-latency use cases, use a smaller, faster model:
```python
# Assumes `client` and `messages` are already defined.
# High-latency: overkill for classification tasks
response = client.chat.completions.create(model="anthropic/claude-sonnet-4.6", messages=messages)
# Low-latency: right-sized for simple tasks
response = client.chat.completions.create(model="openai/gpt-4o-mini", messages=messages)
```
Keep Prompts Short
Longer prompts increase TTFT. For RAG applications, retrieve only the most relevant chunks rather than sending the entire document corpus. For Claude models, use Prompt Caching to avoid re-processing long, repeated context.
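A minimal sketch of trimming RAG context follows: keep only the highest-scoring chunks, capped by both a chunk count and a character budget. The relevance scores are assumed to come from your own retriever. `select_chunks`, `top_k`, and `max_chars` are hypothetical names chosen for illustration, and a character budget is only a rough proxy for token count.

```python
from __future__ import annotations


def select_chunks(
    scored_chunks: list[tuple[float, str]],
    top_k: int = 3,
    max_chars: int = 2000,
) -> list[str]:
    """Keep the highest-scoring chunks, limited to top_k chunks and max_chars characters."""
    selected: list[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        if len(selected) == top_k:
            break
        if used + len(chunk) > max_chars:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += len(chunk)
    return selected


# Example data: (relevance score, chunk text) pairs from a hypothetical retriever.
chunks = [
    (0.91, "Refunds are processed within 5 business days."),
    (0.15, "Our office is closed on public holidays."),
    (0.78, "Refund requests must include the order ID."),
    (0.40, "Shipping is free for orders above a threshold."),
]
context = "\n".join(select_chunks(chunks, top_k=2))
print(context)
```

Here only the two refund-related chunks are sent to the model, which keeps the prompt short without dropping the most relevant context.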
Use Semantic Caching
For applications with repetitive queries, Semantic Caching returns cached responses in under 50ms — bypassing the provider entirely.
Place Geographically Close
AICredits is hosted in India. If your users or servers are in India or Southeast Asia, you benefit from lower network latency to the API endpoint (api.aicredits.in).
Latency Debug Headers
When ENABLE_DEBUG_HEADERS is set on the server, responses include the following debug headers:
| Header | Description |
|---|---|
| X-AICredits-Provider | Which provider served the request |
| X-AICredits-Model | Actual model used (may differ if rerouted) |
| X-AICredits-Cache | hit, miss, or disabled (semantic cache) |
Debug headers are only available on accounts with debug mode enabled. Contact support to enable them for your account during development.
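The headers in the table above can be read from any HTTP response. The sketch below works on a plain header mapping so it does not depend on a particular client library. `read_debug_headers` and `was_rerouted` are hypothetical helpers, the example header values are made up, and the reroute check assumes the model header uses the same identifier format as the request.

```python
from typing import Dict, Mapping, Optional

DEBUG_HEADERS = ("X-AICredits-Provider", "X-AICredits-Model", "X-AICredits-Cache")


def read_debug_headers(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract the AICredits debug headers; headers that are absent map to None."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered.get(name.lower()) for name in DEBUG_HEADERS}


def was_rerouted(requested_model: str, headers: Mapping[str, str]) -> bool:
    """True when the model header is present and differs from the requested model."""
    actual = read_debug_headers(headers)["X-AICredits-Model"]
    return actual is not None and actual != requested_model


# Illustrative header values only; real values depend on your account and request.
example = {
    "x-aicredits-provider": "openai",
    "x-aicredits-model": "openai/gpt-4o-mini",
    "x-aicredits-cache": "miss",
}
info = read_debug_headers(example)
print(info["X-AICredits-Cache"])                    # miss
print(was_rerouted("openai/gpt-4o-mini", example))  # False
```

If debug headers are not enabled for your account, every value comes back as `None`, and `was_rerouted` returns `False`.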