Semantic Caching

Cache LLM responses by semantic similarity using pgvector. Identical-meaning queries return cached responses instantly at a discounted read rate.

Not currently enabled

Semantic caching is built and functional but not yet enabled on the hosted platform. Requests pass through to providers with no overhead in the meantime. It will be rolled out as a configurable per-account feature.

Semantic caching stores LLM responses keyed by the meaning of your queries rather than their exact text. When a new query is semantically similar to a cached one (above the 95% cosine similarity threshold), the cached response is returned immediately. No provider completion call is made, and the hit is billed at the semantic-cache read rate instead of the full model price.

How It Works

1. Every request is embedded using a fast embedding model, producing a vector of 1,536 floats.
2. The embedding is compared against your user-scoped cache using pgvector cosine similarity.
3. If a match above 95% similarity is found, the cached response is returned immediately.
4. If no match is found, the request proceeds to the provider and the response is cached for future use.

New request arrives
  ↓ Embed query (< 50ms)
  ↓ pgvector similarity search (< 10ms)
  → Cache HIT  → Return cached response (discounted, ~60ms total)
  → Cache MISS → Forward to provider → Cache response → Return
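The flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the platform's implementation: the `SemanticCache` class, `handle_request`, and the `call_provider` callback are hypothetical names, a plain dictionary stands in for the pgvector table, and the embedding is assumed to have been computed already. The strict `>` comparison follows the "above 95%" wording in the steps.

```python
import math
from dataclasses import dataclass, field

THRESHOLD = 0.95  # similarity cutoff stated in this document


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    # user_id -> list of (embedding, response); hypothetical stand-in for the pgvector table
    entries: dict = field(default_factory=dict)

    def lookup(self, user_id, embedding):
        """Return the most similar cached response above THRESHOLD, or None on a miss."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries.get(user_id, []):
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > THRESHOLD else None

    def store(self, user_id, embedding, response):
        self.entries.setdefault(user_id, []).append((embedding, response))


def handle_request(cache, user_id, embedding, call_provider):
    """Serve from cache on a hit; otherwise call the provider and cache the result."""
    cached = cache.lookup(user_id, embedding)
    if cached is not None:
        return cached, "hit"
    response = call_provider()
    cache.store(user_id, embedding, response)
    return response, "miss"
```

Because lookups only scan the entries stored under the caller's `user_id`, an identical embedding from a different user is still a miss in this sketch.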

Cost Savings

Cache hits are charged at 0.05× the normal model request cost by default. This is lower than provider prompt-cache read pricing. The provider is not called for the completion, but AICredits still charges a small read fee for the cache lookup, storage, and serving infrastructure.

Request type	Cost
Cache hit	0.05× normal model request cost
Cache miss (first time)	Normal provider cost
Embedding lookup	Included in the cache-hit read fee

This is particularly effective for:

Customer support chatbots with repetitive FAQ-type queries
Document Q&A systems where users ask similar questions
Internal tools where many users ask the same things
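To estimate what the default 0.05× read rate means for a bill, the arithmetic can be sketched as follows. The function names are hypothetical, and the sketch assumes every request has the same normal cost, which real workloads will only approximate.

```python
CACHE_READ_MULTIPLIER = 0.05  # default read charge from this document


def blended_cost(requests, hit_rate, cost_per_request, multiplier=CACHE_READ_MULTIPLIER):
    """Total cost when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hits = requests * hit_rate
    misses = requests - hits
    # Misses pay the normal cost; hits pay the discounted read rate.
    return misses * cost_per_request + hits * cost_per_request * multiplier


def savings_fraction(hit_rate, multiplier=CACHE_READ_MULTIPLIER):
    """Fraction of the uncached bill saved at a given hit rate."""
    return hit_rate * (1.0 - multiplier)
```

With made-up numbers, 10,000 requests at $0.01 each and a 40% hit rate come to about $62 instead of $100: $60 for the 6,000 misses plus $2 for the 4,000 hits, a saving of 38%.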

Cache Scope

Each user's cache is private — your cache entries are never shared with other users. The cache key combines the user ID and the embedding vector, so even semantically identical queries from different users do not share cache entries.

Aspect	Details
Scope	Per-user (private)
Similarity threshold	95% cosine similarity
Default cache TTL	24 hours
Storage	pgvector (PostgreSQL)
Default read charge	0.05× normal model request cost
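One way these settings could combine into a single pgvector lookup is sketched below. The table name `semantic_cache` and its columns are assumptions made for illustration, since the actual schema is not documented here. The `<=>` operator is pgvector's cosine distance, so similarity is computed as one minus the distance. The query uses named `%(name)s` placeholders of the kind accepted by drivers such as psycopg; the sketch only builds the statement and parameters and does not connect to a database.

```python
DEFAULT_TTL_HOURS = 24        # default cache TTL from the table above
SIMILARITY_THRESHOLD = 0.95   # similarity threshold from the table above

# Hypothetical table: semantic_cache(user_id, embedding vector(1536), response, created_at)
LOOKUP_SQL = """
SELECT response, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM semantic_cache
WHERE user_id = %(user_id)s
  AND created_at > now() - %(ttl_hours)s * interval '1 hour'
  AND 1 - (embedding <=> %(query)s::vector) > %(threshold)s
ORDER BY embedding <=> %(query)s::vector
LIMIT 1
"""


def to_vector_literal(embedding):
    """Format floats as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_lookup(user_id, embedding, ttl_hours=DEFAULT_TTL_HOURS,
                 threshold=SIMILARITY_THRESHOLD):
    """Return (sql, params) for a per-user, TTL-bounded similarity lookup."""
    params = {
        "user_id": user_id,
        "query": to_vector_literal(embedding),
        "ttl_hours": ttl_hours,
        "threshold": threshold,
    }
    return LOOKUP_SQL, params
```

Filtering on `user_id` inside the query is what keeps entries private in this sketch: rows belonging to other users are never candidates, however similar their embeddings are.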

Similarity Threshold

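The 95% threshold is chosen to catch semantically equivalent rephrasings while avoiding false positives. For example, with "What are your pricing plans?" already cached, incoming queries compare approximately as follows: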

Query	Similarity	Cache hit?
"What are your pricing plans?"	100%	Yes (exact match)
"How much does it cost?"	~97%	Yes
"Tell me about subscription options"	~94%	No
"How do I reset my password?"	~12%	No

When Semantic Caching Is Available

When enabled on your account, semantic caching works transparently, and no changes to your API requests are required. Each response includes a header indicating whether it was served from cache:

X-AICredits-Cache: hit
X-AICredits-Cache: miss
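A client could interpret this header with a small helper like the one below. The function name is hypothetical, and `headers` is assumed to be any mapping of response header names to values, as most HTTP clients provide. Header names are matched case-insensitively because HTTP does not guarantee their casing.

```python
def served_from_cache(headers):
    """Return True for a hit, False for a miss, None if the header is absent."""
    for name, value in headers.items():
        if name.lower() == "x-aicredits-cache":
            return value.strip().lower() == "hit"
    return None
```

Returning `None` when the header is missing lets callers distinguish "caching not enabled on this account" from an explicit miss.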

You can disable semantic caching per-request (for example, when you need a fresh response) by setting:

{ "no_cache": true }

You can also force a provider refresh with the request header:

X-Cache-Force-Refresh: true
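The two opt-outs can be applied when a request is assembled, as in the sketch below. It assumes `no_cache` is a top-level field of the JSON request body, which the snippet above suggests but does not state outright. The `build_request` helper and the `model` and `messages` fields in the usage example are hypothetical placeholders, not documented parameters.

```python
import json


def build_request(payload, skip_cache=False, force_refresh=False):
    """Return (headers, body) with the cache opt-outs applied to a copy of payload."""
    body = dict(payload)  # copy so the caller's payload is not mutated
    headers = {"Content-Type": "application/json"}
    if skip_cache:
        body["no_cache"] = True  # assumed to be a top-level body field
    if force_refresh:
        headers["X-Cache-Force-Refresh"] = "true"
    return headers, json.dumps(body)


# Usage with placeholder request fields:
headers, body = build_request(
    {"model": "example-model", "messages": []},
    skip_cache=True,
)
```

The returned headers and body can then be passed to whichever HTTP client you already use.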
