Semantic Caching: Cut LLM API Costs by 40% on Repeated Queries

Standard HTTP caching rarely helps with LLMs because users seldom phrase the same question the same way twice. Semantic caching matches by meaning instead, and on workloads with many repeated queries it can eliminate 20–40% of your API spend.

Author

AICredits Team

Published

22 Mar 2026

Reading time

7 min read

Why standard caching does not work for LLMs

Traditional caching keys on exact request equality. "What is the capital of France?" and "What's the capital of France?" are different strings and would not match in a standard cache. "How do I reset my password?" and "I forgot my password — can you help?" are semantically identical but lexically different.
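To make the miss concrete, here is a minimal sketch of an exact-match cache keyed on a hash of the prompt. The `exact_key` helper and the in-memory dict are illustrative assumptions, not part of any particular caching library.

```python
import hashlib

def exact_key(prompt: str) -> str:
    """Key a cache entry on the exact bytes of the prompt (hypothetical helper)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# A one-entry exact-match cache
exact_cache = {exact_key("What is the capital of France?"): "Paris"}

# The identical string hits; a trivially reworded one misses.
hit = exact_cache.get(exact_key("What is the capital of France?"))
miss = exact_cache.get(exact_key("What's the capital of France?"))
print(hit, miss)
```

A single contraction changes the hash, so the second lookup falls through to the LLM even though the answer is the same.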

Semantic caching changes the key from exact string match to vector similarity. It embeds each incoming query, searches for stored queries with cosine similarity above a threshold, and returns the cached response if a match is found.
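The similarity check itself is small. The sketch below uses made-up 3-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions) purely to show the arithmetic; the vector values and the 0.92 cutoff are illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.92

# Toy "embeddings", invented for illustration only
stored_query = [1.0, 0.0, 0.0]
paraphrase = [0.9, 0.1, 0.0]   # points in almost the same direction
unrelated = [0.0, 1.0, 0.0]    # orthogonal to the stored query

print(cosine_similarity(stored_query, paraphrase) >= THRESHOLD)  # cache hit
print(cosine_similarity(stored_query, unrelated) >= THRESHOLD)   # cache miss
```

Cosine similarity compares direction rather than length, which is why two differently worded queries can still score close to 1.0.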

How it works technically

Incoming query
    ↓
Embed query (text-embedding-3-small, ~50ms, ₹0.0001)
    ↓
Vector similarity search against stored queries
    ↓
Similarity ≥ threshold (0.92)?
    ├── YES → return cached response (near-zero cost, <10ms lookup)
    └── NO  → call LLM, store {query_embedding, response}, return response

Building a minimal semantic cache in Python

import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.store: list[dict] = []  # each entry: {query, embedding, response}

    def embed(self, text: str) -> np.ndarray:
        r = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(r.data[0].embedding)

    @staticmethod
    def _similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity: dot product divided by the product of the norms
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query_emb: np.ndarray) -> str | None:
        if not self.store:
            return None
        # Score every stored entry once, then keep the best match
        scores = [self._similarity(query_emb, e["embedding"]) for e in self.store]
        best = int(np.argmax(scores))
        if scores[best] >= self.threshold:
            print(f"  [CACHE HIT] similarity={scores[best]:.3f}")
            return self.store[best]["response"]
        return None

    def set(self, query: str, query_emb: np.ndarray, response: str) -> None:
        self.store.append({"query": query, "embedding": query_emb, "response": response})

cache = SemanticCache(threshold=0.92)

def cached_llm_call(prompt: str, model: str = "openai/gpt-4o-mini") -> str:
    # Embed once and reuse the vector for both lookup and storage
    query_emb = cache.embed(prompt)
    cached = cache.get(query_emb)
    if cached is not None:  # an empty string is still a valid cached response
        return cached

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    cache.set(prompt, query_emb, result)
    return result

# First call: the cache is empty, so this hits the LLM
print(cached_llm_call("How do I reset my password?"))

# Second call with different wording: served from the cache if its similarity clears the threshold
print(cached_llm_call("I forgot my password, how do I get back in?"))
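Scanning the store one entry at a time gets slow as the cache grows. One possible refinement, sketched below, keeps unit-normalized embeddings in a single NumPy matrix so that every cosine similarity comes out of one matrix-vector product. `VectorIndex` and its method names are hypothetical, and the toy vectors stand in for real embeddings.

```python
import numpy as np

class VectorIndex:
    """Stores unit-length embeddings so cosine similarity reduces to a dot product."""

    def __init__(self, dim: int):
        self.matrix = np.empty((0, dim), dtype=np.float32)
        self.responses: list[str] = []

    def add(self, embedding, response: str) -> None:
        v = np.asarray(embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)                  # normalize once, at insert time
        self.matrix = np.vstack([self.matrix, v])  # copies the matrix; fine for a sketch
        self.responses.append(response)

    def best_match(self, embedding):
        """Return (similarity, response) for the nearest entry, or None if empty."""
        if not self.responses:
            return None
        q = np.asarray(embedding, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q                   # all cosine similarities at once
        i = int(np.argmax(scores))
        return float(scores[i]), self.responses[i]

# Toy 3-dimensional vectors, invented for illustration
index = VectorIndex(dim=3)
index.add([1.0, 0.0, 0.0], "Use the reset link on the login page.")
index.add([0.0, 1.0, 0.0], "Invoices are under Billing.")
print(index.best_match([0.9, 0.1, 0.0]))
```

Beyond a few hundred thousand entries, a dedicated vector store with approximate nearest-neighbor search is likely a better fit than an in-memory matrix.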

Setting the similarity threshold

| Threshold | Effect |
|-----------|--------|
| 0.99+ | Near-exact matches only; very few cache hits |
| 0.92–0.95 | Good balance; recommended starting point |
| 0.85–0.90 | Risk of false positives: semantically different queries matched |
| < 0.85 | High false positive rate; wrong answers returned |

Start at 0.92 and manually review 100 cache hits. If any returned wrong answers, raise the threshold.
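The review step can be scripted once each sampled hit has been labeled. The sketch below assumes a hand-labeled log of `(similarity, answer_was_correct)` pairs; the values shown are hypothetical placeholders, not measurements.

```python
def review_at(reviewed_hits, threshold):
    """Return (false_positive_rate, hits_kept) for hits at or above `threshold`."""
    kept = [correct for sim, correct in reviewed_hits if sim >= threshold]
    if not kept:
        return 0.0, 0
    return kept.count(False) / len(kept), len(kept)

# Hypothetical review log: (similarity, answer_was_correct)
reviewed_hits = [
    (0.97, True), (0.96, True), (0.95, True), (0.94, True),
    (0.93, True), (0.93, False), (0.92, True), (0.92, False),
]

# Compare candidate thresholds: fewer wrong answers versus fewer hits
for t in (0.92, 0.94, 0.96):
    rate, kept = review_at(reviewed_hits, t)
    print(f"threshold={t}: false positive rate={rate:.2f}, hits kept={kept}")
```

Raising the threshold trades hit rate for accuracy, so it helps to look at both numbers before settling on a value.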

When semantic caching delivers the best results

| Use case | Expected hit rate |
|----------|-------------------|
| Support chatbot FAQ | 30–50% |
| Developer documentation Q&A | 20–35% |
| Product onboarding flow | 25–40% |
| Open-ended creative tasks | < 5% |
| Unique data extraction | < 2% |
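Hit rate translates into savings only after subtracting the embedding call that every request pays for, hit or miss. The rough estimate below uses the ₹0.0001 embedding cost from the flow above; the ₹0.05 per-call LLM cost is an assumed figure for illustration, and the model treats a cache hit as free beyond the embedding.

```python
def estimated_savings(hit_rate: float, llm_cost: float, embed_cost: float) -> float:
    """Fraction of LLM spend saved per request, net of embedding overhead.

    Every request pays for one embedding; only cache misses pay for the LLM call.
    """
    cost_without_cache = llm_cost
    cost_with_cache = embed_cost + (1 - hit_rate) * llm_cost
    return (cost_without_cache - cost_with_cache) / cost_without_cache

LLM_COST = 0.05      # assumed average cost per LLM call, in rupees
EMBED_COST = 0.0001  # embedding cost per query, from the flow above

for hit_rate in (0.40, 0.20, 0.02):
    saved = estimated_savings(hit_rate, LLM_COST, EMBED_COST)
    print(f"hit rate {hit_rate:.0%}: net savings {saved:.1%}")
```

Under this model the cache breaks even when the hit rate equals `embed_cost / llm_cost`, so with cheap embeddings even low-hit-rate workloads rarely lose money; they just do not gain much.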

Semantic caching in AICredits

AICredits has semantic caching built in and enabled by default. Every chat completion request is checked against the cache before being forwarded to the provider. Cache hits return in under 100ms and cost a fraction of a full LLM call.

View your cache hit rate and estimated savings in the usage dashboard. Bypass caching for a specific request by setting the X-No-Cache: true header.
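With the OpenAI Python SDK, per-request headers can be passed through the `extra_headers` argument. The sketch below assumes `client` is an `OpenAI` instance pointed at the AICredits base URL; `uncached_completion` is a hypothetical helper name, and the header name and value are taken from the paragraph above.

```python
def uncached_completion(client, prompt: str, model: str = "openai/gpt-4o-mini") -> str:
    """Send one chat completion with the cache-bypass header set."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # extra_headers adds HTTP headers to this request only
        extra_headers={"X-No-Cache": "true"},
    )
    return response.choices[0].message.content
```

This is useful for requests whose answers must always be fresh, such as anything that depends on the current time or on per-user data.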
