
Semantic Caching: Cut LLM API Costs by 40% on Repeated Queries
Standard HTTP caching rarely helps with LLMs because users almost never phrase the same question identically. Semantic caching matches by meaning instead and can eliminate 20–40% of your API spend on repetitive workloads.
Author: AICredits Team
Published: 22 Mar 2026
Reading time: 7 min read
Why standard caching does not work for LLMs
Traditional caching keys on exact request equality. "What is the capital of France?" and "What's the capital of France?" are different strings and would not match in a standard cache. "How do I reset my password?" and "I forgot my password — can you help?" are semantically identical but lexically different.
Semantic caching changes the key from exact string match to vector similarity. It embeds each incoming query, searches for stored queries with cosine similarity above a threshold, and returns the cached response if a match is found.
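To make the contrast concrete, here is a minimal sketch comparing an exact-match cache key with a cosine-similarity check. The three-dimensional vectors are made-up stand-ins for real embeddings (which have hundreds or thousands of dimensions), and the helper names `exact_key` and `cosine_similarity` are illustrative only.

```python
import hashlib
import math


def exact_key(query: str) -> str:
    """Key used by a conventional cache: a hash of the raw request text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two phrasings of one question produce different exact-match keys.
q1 = "What is the capital of France?"
q2 = "What's the capital of France?"
keys_match = exact_key(q1) == exact_key(q2)

# Toy vectors standing in for embeddings (assumed values, not real model output).
emb_q1 = [0.90, 0.10, 0.40]
emb_q2 = [0.88, 0.12, 0.42]
emb_other = [0.10, 0.95, 0.20]  # an unrelated query

sim_paraphrase = cosine_similarity(emb_q1, emb_q2)
sim_unrelated = cosine_similarity(emb_q1, emb_other)

print(f"exact keys match: {keys_match}")
print(f"paraphrase similarity: {sim_paraphrase:.3f}")
print(f"unrelated similarity: {sim_unrelated:.3f}")
```

With these toy values the paraphrase clears a 0.92 threshold and the unrelated query does not, even though the exact-match keys differ for both.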
How it works technically
Incoming query
↓
Embed query (text-embedding-3-small, ~50ms, ₹0.0001)
↓
Vector similarity search against stored queries
↓
Similarity ≥ threshold (0.92)?
├── YES → return cached response (near-zero cost, <10ms)
└── NO → call LLM, store {query_embedding, response}, return response
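The similarity-search step in the flow above can be done with a single matrix-vector product when stored embeddings are kept as unit-length rows. This is a sketch under that assumption: `VectorIndex` and its methods are hypothetical names, and random vectors stand in for real embeddings. A larger deployment would more likely use a vector database or an approximate nearest-neighbour index.

```python
import numpy as np


class VectorIndex:
    """Brute-force cosine search over unit-normalized embeddings."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float64)
        self.responses: list[str] = []

    @staticmethod
    def _normalize(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)

    def add(self, embedding, response: str) -> None:
        row = self._normalize(np.asarray(embedding, dtype=np.float64))
        self.vectors = np.vstack([self.vectors, row])
        self.responses.append(response)

    def search(self, embedding, threshold: float = 0.92):
        """Return (response, similarity) for the best match, or None on a miss."""
        if not self.responses:
            return None
        q = self._normalize(np.asarray(embedding, dtype=np.float64))
        sims = self.vectors @ q  # dot product of unit vectors = cosine similarity
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return self.responses[best], float(sims[best])
        return None


# Random vectors stand in for embeddings here (illustration only).
rng = np.random.default_rng(0)
index = VectorIndex(dim=8)
stored = rng.normal(size=8)
index.add(stored, "cached answer")

near = stored + 0.01 * rng.normal(size=8)  # slightly perturbed copy
far = -stored                              # opposite direction

hit = index.search(near)
miss = index.search(far)
print("near:", hit)
print("far:", miss)
```

Normalizing once at insert time means each lookup costs one normalization plus one matrix-vector product, instead of recomputing norms for every stored entry.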
Building a minimal semantic cache in Python
    import numpy as np
    from openai import OpenAI

    client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")


    class SemanticCache:
        def __init__(self, threshold: float = 0.92):
            self.threshold = threshold
            self.store: list[dict] = []  # each entry: {query, embedding, response}

        def _embed(self, text: str) -> list[float]:
            r = client.embeddings.create(model="text-embedding-3-small", input=text)
            return r.data[0].embedding

        def _similarity(self, a: list[float], b: list[float]) -> float:
            # Cosine similarity between two embedding vectors
            a, b = np.array(a), np.array(b)
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def get(self, query: str) -> str | None:
            if not self.store:
                return None
            query_emb = self._embed(query)
            # Score every stored entry once and keep the best match
            score, best = max(
                ((self._similarity(query_emb, e["embedding"]), e) for e in self.store),
                key=lambda pair: pair[0],
            )
            if score >= self.threshold:
                print(f"  [CACHE HIT] similarity={score:.3f}")
                return best["response"]
            return None

        def set(self, query: str, response: str) -> None:
            # Note: this embeds the query again; a production version would reuse
            # the embedding already computed in get()
            self.store.append({"query": query, "embedding": self._embed(query), "response": response})


    cache = SemanticCache(threshold=0.92)


    def cached_llm_call(prompt: str, model: str = "openai/gpt-4o-mini") -> str:
        cached = cache.get(prompt)
        if cached is not None:  # an empty string is still a valid cached response
            return cached
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.choices[0].message.content
        cache.set(prompt, result)
        return result


    # First call: the cache is empty, so this hits the LLM
    print(cached_llm_call("How do I reset my password?"))
    # Second call with different wording: served from the cache if the similarity clears the threshold
    print(cached_llm_call("I forgot my password, how do I get back in?"))

Setting the similarity threshold
| Threshold | Effect |
|-----------|--------|
| 0.99+ | Near-exact matches only; very few cache hits |
| 0.92–0.95 | Good balance; recommended starting point |
| 0.85–0.90 | Risk of false positives: semantically different queries matched |
| < 0.85 | High false positive rate; wrong answers returned |
Start at 0.92 and manually review 100 cache hits. If any returned wrong answers, raise the threshold.
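The review step can be turned into a small calculation. The sketch below assumes each reviewed cache hit has been logged as a (similarity, was_correct) pair. The sample data is invented for illustration, and `false_positive_rate` and `suggest_threshold` are hypothetical helpers rather than part of any library.

```python
def false_positive_rate(reviews: list[tuple[float, bool]], threshold: float) -> float:
    """Share of hits at or above `threshold` that returned a wrong answer."""
    hits = [ok for sim, ok in reviews if sim >= threshold]
    if not hits:
        return 0.0
    return sum(1 for ok in hits if not ok) / len(hits)


def suggest_threshold(reviews: list[tuple[float, bool]], current: float, step: float = 0.01) -> float:
    """Raise the threshold in small steps until no reviewed hit is wrong."""
    threshold = current
    while threshold < 1.0 and false_positive_rate(reviews, threshold) > 0.0:
        threshold = round(threshold + step, 4)
    return threshold


# Invented review log: (similarity of the hit, did a human judge it correct?)
reviews = [
    (0.99, True), (0.97, True), (0.96, True), (0.95, True),
    (0.94, True), (0.93, False), (0.93, True), (0.92, True),
]

rate_now = false_positive_rate(reviews, 0.92)
new_threshold = suggest_threshold(reviews, current=0.92)
print(f"false positive rate at 0.92: {rate_now:.3f}")
print(f"suggested threshold: {new_threshold}")
```

Raising the threshold also lowers the hit rate, so it is worth re-measuring both numbers after each change rather than jumping straight to a very strict value.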
When semantic caching delivers the best results
| Use case | Expected hit rate |
|----------|------------------|
| Support chatbot FAQ | 30–50% |
| Developer documentation Q&A | 20–35% |
| Product onboarding flow | 25–40% |
| Open-ended creative tasks | < 5% |
| Unique data extraction | < 2% |
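To translate a hit rate into money, note that every request pays for an embedding while only cache misses pay for the LLM call. The following is a rough estimate under assumed prices: the ₹0.0001 embedding cost is the figure quoted in the flow earlier, while the per-call LLM cost and the request volume are placeholders to replace with your own numbers. `estimated_savings` is a hypothetical helper.

```python
def estimated_savings(requests: int, hit_rate: float, llm_cost: float, embed_cost: float) -> dict:
    """Compare spend with and without a semantic cache.

    Every request pays for one embedding; only cache misses pay for the LLM.
    Storage and vector-search costs are ignored in this sketch.
    """
    baseline = requests * llm_cost
    with_cache = requests * embed_cost + requests * (1 - hit_rate) * llm_cost
    saved = baseline - with_cache
    return {
        "baseline": baseline,
        "with_cache": with_cache,
        "saved": saved,
        "saved_fraction": saved / baseline if baseline else 0.0,
    }


# Placeholder inputs: 100,000 requests, 30% hit rate, ₹0.05 per LLM call (assumed).
estimate = estimated_savings(requests=100_000, hit_rate=0.30, llm_cost=0.05, embed_cost=0.0001)
for name, value in estimate.items():
    print(f"{name}: {value:.4f}")
```

Under this model the cache pays for itself once the hit rate exceeds `embed_cost / llm_cost`, which is why workloads with hit rates below a few percent gain little.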
Semantic caching in AICredits
AICredits has semantic caching built in and enabled by default. Every chat completion request is checked against the cache before being forwarded to the provider. Cache hits return in under 100ms and cost a fraction of a full LLM call.
View your cache hit rate and estimated savings in the usage dashboard. Bypass caching for a specific request by setting the X-No-Cache: true header.
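With the OpenAI Python SDK, per-request headers can be passed through the `extra_headers` argument. The sketch below only builds the request arguments, so it runs without network access. That AICredits reads the header when it is sent this way is an assumption based on the description above, and `build_request` is a hypothetical helper.

```python
def build_request(prompt: str, model: str = "openai/gpt-4o-mini", bypass_cache: bool = False) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs: dict = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if bypass_cache:
        # Header name as given in the text above; sent only for this request.
        kwargs["extra_headers"] = {"X-No-Cache": "true"}
    return kwargs


fresh = build_request("What is today's exchange rate?", bypass_cache=True)
normal = build_request("How do I reset my password?")
print(fresh["extra_headers"])

# Usage with a configured client (not executed here):
# response = client.chat.completions.create(**fresh)
```

Bypassing the cache is most useful for time-sensitive or user-specific prompts, where a semantically similar earlier answer could be stale or belong to someone else.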