Platform Features
AICredits features — semantic caching, automatic failover, streaming, JSON response healing, and team credit delegation.
Use this page with an AI assistant
Opens a new chat with this docs URL and the correct AICredits base URLs.
Advanced features that make AICredits more than just a proxy — caching, guardrails, automatic failover, and more.
Semantic Caching
Not currently enabled
Semantic caching is not enabled on the hosted platform at this time. The infrastructure (pgvector embeddings, per-user cache) is built and will be rolled out as a configurable feature. Requests pass through to providers with no overhead in the meantime.
When enabled, AICredits caches responses based on the semantic meaning of your queries using pgvector embeddings. If a new query is semantically similar to a cached one (95% similarity threshold), the cached response is returned instantly — saving both cost and latency.
| Aspect | Details |
|---|---|
| Technology | pgvector (PostgreSQL vector similarity) |
| Similarity Threshold | 95% cosine similarity |
| Scope | Per-user (your cache is private) |
| Default cache TTL | 24 hours (entries expire automatically) |
| Cache hit charge | 0.05× normal model request cost |
Cost Savings — when available
Cache hits are billed at a discounted read rate. No provider completion call is made, and the default charge is 0.05× the normal model request cost, lower than provider prompt-cache read pricing. This is particularly effective for applications with repetitive queries.
Prompt Caching
For Claude models, you can cache large, repeated context (system prompts, document chunks, codebase snippets) so it is only processed once. Subsequent requests that hit the cache are charged at a deeply discounted rate.
Opt-in only — caching costs more on the first call
Cache writes are billed at 1.25× the standard input token rate. Cache reads are billed at 0.1× (a 90% discount). You break even after the second request. Only enable caching when you will reuse the same context across multiple calls.
To request caching, add "cache": true to your request body. The proxy injects the appropriate cache headers for the provider — no provider-specific SDK changes needed.
curl https://api.aicredits.in/v1/chat/completions \
-H "Authorization: Bearer sk-..." \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-sonnet-4.6",
"cache": true,
"messages": [
{
"role": "system",
"content": "You are an expert software engineer. [... large codebase context ...]"
},
{ "role": "user", "content": "Add a unit test for the billing module." }
]
}'| Aspect | Details |
|---|---|
| Supported providers | Claude (Anthropic) only — OpenAI caching is automatic |
| Minimum context size | ~1,000 tokens (≈ 4,000 characters). Smaller prompts are not cached. |
| Cache TTL | 5 minutes (resets each time the cache is hit) |
| Write cost | 1.25× standard input rate |
| Read cost | 0.1× standard input rate (90% discount) |
Best use cases
Large system prompts with your full codebase, long documents, or fixed instructions that stay the same across many calls. Keep the cached portion at the top of your system prompt so the cache key is stable.
Provider Fallback
Requests are routed directly to the upstream provider. On failure, retries happen automatically across healthy keys with exponential backoff.
1. Direct Provider — Calls the provider directly using round-robin across configured API keys. Each key is health-checked — unhealthy keys are skipped automatically.
2. Retry with Backoff — On 429 or 5xx responses, retries across remaining healthy keys with exponential backoff (500ms → 1s → 2s, capped at 5s).
The circuit breaker tracks provider health per API key. Unhealthy keys are skipped for 30 seconds to avoid adding latency from repeated failures.
Streaming
Full streaming support via Server-Sent Events (SSE). Set stream: true in your request. The response format is identical to OpenAI's streaming format, so all compatible SDKs work out of the box.
Streaming works across all providers. AICredits translates provider-specific streaming formats (e.g., Anthropic's Messages API) to the standard OpenAI SSE format.
Response Healing
When you request "response_format": {"type": "json_object"} and the provider returns malformed JSON, AICredits can automatically attempt to repair the response. This includes fixing:
- Truncated JSON (missing closing brackets/braces)
- Extra text before/after the JSON object
- Common formatting issues
Response healing is enabled per deployment. When active, it processes responses transparently — you receive valid JSON even if the provider's raw output was slightly malformed.
Data Retention
Request metadata (model, tokens, cost, timestamp) is retained for 30 days to power your usage history and billing views. A background worker automatically purges records older than 30 days.
Teams & Credit Delegation
Teams let you allocate a portion of your wallet to teammates, students, or collaborators. Each member gets their own credit cap — their API requests are charged from your wallet only when they actually make calls. Unused allocation costs you nothing.
| Action | Who |
|---|---|
| Create a team | Wallet owner |
| Invite member by email + set ₹ cap | Wallet owner |
| Make API calls (charged to owner) | Invited member |
| View per-member usage + spend | Wallet owner |
| Adjust or reclaim allocation | Wallet owner |
A member's spending is capped at their allocated_inr. When they reach the cap, their API key returns a budget-exceeded error — your wallet is never overdrawn beyond what you committed.
Manage your teams at Dashboard → Teams.