Semantic Caching

Cache LLM responses by semantic similarity using pgvector. Queries with the same meaning return the cached response without a provider call, at a discounted read rate.

Not currently enabled

Semantic caching is built and functional but not yet enabled on the hosted platform. Requests pass through to providers with no overhead in the meantime. It will be rolled out as a configurable per-account feature.

Semantic caching stores LLM responses keyed on the meaning of your queries, not just exact text matches. When a new query is semantically similar to a cached one (cosine similarity above the 95% threshold), the cached response is returned without a provider completion call, and the hit is billed at the semantic-cache read rate instead of the full model price.

How It Works

  1. Every request is embedded using a fast embedding model, producing a vector of 1,536 floats
  2. The embedding is compared against your user-scoped cache using pgvector cosine similarity
  3. If a match above 95% similarity is found, the cached response is returned immediately
  4. If no match, the request proceeds to the provider and the response is cached for future use
New request arrives
  ↓ Embed query (< 50ms)
  ↓ pgvector similarity search (< 10ms)
  → Cache HIT  → Return cached response (discounted, ~60ms total)
  → Cache MISS → Forward to provider → Cache response → Return
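
The flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the AICredits implementation: the `lookup_or_forward` function, the `embed` and `call_provider` callables, and the list-based store are hypothetical stand-ins for the embedding model, the provider, and pgvector.

```python
import math
from typing import Callable, List, Optional, Tuple

SIMILARITY_THRESHOLD = 0.95  # documented: matches above 95% count as hits

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup_or_forward(
    query: str,
    cache: List[Tuple[Vector, str]],
    embed: Callable[[str], Vector],
    call_provider: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (response, "hit" or "miss") following the four steps above."""
    query_vec = embed(query)  # step 1: embed the request
    # Step 2: find the most similar cached entry (pgvector would do this in SQL).
    best_score = -1.0
    best_response: Optional[str] = None
    for cached_vec, cached_response in cache:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_response = score, cached_response
    if best_response is not None and best_score > SIMILARITY_THRESHOLD:
        return best_response, "hit"  # step 3: serve the cached response
    response = call_provider(query)  # step 4: forward to the provider, then store
    cache.append((query_vec, response))
    return response, "miss"
```

A real deployment would replace the linear scan with an indexed pgvector query; the scan is used here only to keep the example self-contained.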

Cost Savings

Cache hits are charged at 0.05× the normal model request cost by default. This is lower than provider prompt-cache read pricing. The provider is not called for the completion, but AICredits still charges a small read fee for the cache lookup, storage, and serving infrastructure.

| Request type | Cost |
| --- | --- |
| Cache hit | 0.05× normal model request cost |
| Cache miss (first time) | Normal provider cost |
| Embedding lookup | Included in the cache-hit read fee |

This is particularly effective for:

  • Customer support chatbots with repetitive FAQ-type queries
  • Document Q&A systems where users ask similar questions
  • Internal tools where many users ask the same things
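
As a rough worked example of the pricing above, the average cost per request falls linearly with the cache hit rate. The sketch below assumes every request has the same normal cost; `effective_cost` and its parameters are illustrative names, and the 0.05 multiplier is the documented default.

```python
CACHE_HIT_MULTIPLIER = 0.05  # documented default read charge


def effective_cost(normal_cost: float, hit_rate: float,
                   hit_multiplier: float = CACHE_HIT_MULTIPLIER) -> float:
    """Average cost per request when a fraction `hit_rate` is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_cost = normal_cost * hit_multiplier  # cache hit: discounted read fee
    miss_cost = normal_cost                  # cache miss: normal provider cost
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost
```

With a 40% hit rate, the average cost is 0.4 × 0.05 + 0.6 × 1 = 0.62 of the uncached cost, a 38% saving.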

Cache Scope

Each user's cache is private — your cache entries are never shared with other users. The cache key combines the user ID and the embedding vector, so even semantically identical queries from different users do not share cache entries.

| Aspect | Details |
| --- | --- |
| Scope | Per-user (private) |
| Similarity threshold | 95% cosine similarity |
| Default cache TTL | 24 hours |
| Storage | pgvector (PostgreSQL) |
| Default read charge | 0.05× normal model request cost |
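
A minimal sketch of how per-user scoping and the 24-hour TTL might interact, assuming each entry records its owner and creation time. The `CacheEntry` class and `visible_entries` helper are hypothetical; the hosted service stores entries in pgvector, not in Python lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

DEFAULT_TTL = timedelta(hours=24)  # documented default cache TTL


@dataclass
class CacheEntry:
    user_id: str
    embedding: List[float]
    response: str
    created_at: datetime


def visible_entries(entries: List[CacheEntry], user_id: str,
                    now: datetime, ttl: timedelta = DEFAULT_TTL) -> List[CacheEntry]:
    """Entries a similarity search may consider: same user, not yet expired."""
    return [
        e for e in entries
        if e.user_id == user_id and now - e.created_at < ttl
    ]
```

Filtering by user before the similarity search is what keeps identical queries from different users from sharing a cached response.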

Similarity Threshold

The 95% threshold is chosen to catch semantically equivalent rephrases while avoiding false positives:

| Query | Similarity | Cache hit? |
| --- | --- | --- |
| "What are your pricing plans?" | 100% | Yes (exact match) |
| "How much does it cost?" | ~97% | Yes |
| "Tell me about subscription options" | ~94% | No |
| "How do I reset my password?" | ~12% | No |

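The decisions in the table follow from comparing each score with the threshold. The snippet below reproduces them using the approximate similarities listed above, assuming the first row is the cached query that the others are compared against; `is_cache_hit` is an illustrative helper, not part of the AICredits API.

```python
SIMILARITY_THRESHOLD = 0.95


def is_cache_hit(similarity: float, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """A cached response is reused only when similarity exceeds the threshold."""
    return similarity > threshold


# Approximate similarities from the table (assumed to be relative to the
# cached query "What are your pricing plans?").
examples = {
    "What are your pricing plans?": 1.00,
    "How much does it cost?": 0.97,
    "Tell me about subscription options": 0.94,
    "How do I reset my password?": 0.12,
}

for query, score in examples.items():
    print(f"{query!r}: {'hit' if is_cache_hit(score) else 'miss'}")
```
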
When Semantic Caching Is Available

When semantic caching is enabled on your account, it works transparently: no changes to your API requests are required. Each response includes a header indicating whether it was served from cache:

X-AICredits-Cache: hit
X-AICredits-Cache: miss
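
Client code can branch on this header. The helper below is a sketch that works on any mapping of response headers; the name `served_from_cache` is illustrative, and the case-insensitive lookup is a defensive choice because HTTP header names are case-insensitive.

```python
from typing import Mapping


def served_from_cache(headers: Mapping[str, str]) -> bool:
    """True if the X-AICredits-Cache response header reports a hit."""
    for name, value in headers.items():
        if name.lower() == "x-aicredits-cache":
            return value.strip().lower() == "hit"
    return False  # header absent, e.g. caching not enabled on the account
```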

You can disable semantic caching per-request (for example, when you need a fresh response) by setting:

{ "no_cache": true }

You can also force a provider refresh with the request header:

X-Cache-Force-Refresh: true
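
Both opt-outs can be applied by a small helper when building a request. This sketch assumes `no_cache` is a top-level field of the JSON request body, which this page does not state explicitly; `apply_cache_options` is a hypothetical name, and the returned dictionaries would be passed to whatever HTTP client you use.

```python
from typing import Any, Dict, Tuple


def apply_cache_options(
    body: Dict[str, Any],
    headers: Dict[str, str],
    skip_cache: bool = False,
    force_refresh: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return copies of the request body and headers with cache opt-outs applied."""
    new_body, new_headers = dict(body), dict(headers)
    if skip_cache:
        new_body["no_cache"] = True  # assumed to be a top-level body field
    if force_refresh:
        new_headers["X-Cache-Force-Refresh"] = "true"
    return new_body, new_headers
```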
