Prompt Caching

Cache large system prompts and repeated context for Claude models. Save up to 90% on repeated context costs with cache reads at 0.1x the standard rate.

Prompt caching lets you cache large, repeated context (system prompts, documents, codebase snippets) so it is only processed once. Subsequent requests that hit the cache are charged at a deeply discounted rate.

How It Works

When you enable caching on a request, AICredits injects the appropriate cache control headers for the provider before forwarding the request. You therefore use a single API field ("cache": true) regardless of the provider, with no provider-specific SDK changes.

First request  → Cache WRITE  → billed at 1.25× input rate
Second request → Cache HIT    → billed at 0.10× input rate (90% discount)

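Caching pays for itself on the second request: one write plus one hit costs 1.25× + 0.10× = 1.35×, compared with 2.0× for two uncached requests. For applications that reuse the same context many times, the savings compound quickly.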

Pricing

Operation	Rate
Cache write	1.25× standard input token rate
Cache read (hit)	0.10× standard input token rate
No cache	1.0× standard input token rate

Cache writes cost more

Cache writes are billed at 1.25× the standard rate. Only enable caching when the same context will be reused across at least 2 requests within the cache TTL. For one-off requests, caching increases the cost of the cached tokens by 25%.
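The break-even arithmetic can be checked with a short calculation. The sketch below is illustrative only: it uses the multipliers from the pricing table, counts only the shared context tokens, and assumes every request after the first arrives within the TTL and hits the cache. The function names are hypothetical and not part of any API.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # first request writes the cache
CACHE_READ_MULTIPLIER = 0.10   # later requests read from the cache


def cached_cost(num_requests: int) -> float:
    """Relative cost of the shared context with caching enabled.

    Expressed in units of one uncached pass over the context. Assumes the
    first request is a cache write and all later requests are cache hits.
    """
    if num_requests <= 0:
        return 0.0
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (num_requests - 1)


def uncached_cost(num_requests: int) -> float:
    """Relative cost of the same context without caching (1.0x per request)."""
    return float(max(num_requests, 0))


for n in (1, 2, 10):
    saving = 1 - cached_cost(n) / uncached_cost(n)
    print(f"{n} request(s): cached {cached_cost(n):.2f}x, "
          f"uncached {uncached_cost(n):.2f}x, saving {saving:.1%}")
```

Under these assumptions a single request costs more with caching than without, and the saving approaches 90% only as the number of cache hits grows.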

Enabling Cache

Add "cache": true to your request body:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-key-here",
)

LARGE_SYSTEM_PROMPT = """You are an expert software engineer...[long codebase context]..."""

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    extra_body={"cache": True},
    messages=[
        {"role": "system", "content": LARGE_SYSTEM_PROMPT},
        {"role": "user", "content": "Add a unit test for the billing module."},
    ],
)

print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aicredits.in/v1",
  apiKey: "sk-your-key-here",
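});

// Large, static context to cache; defined here so the snippet runs on its own.
const LARGE_SYSTEM_PROMPT = "You are an expert software engineer...[long codebase context]...";

const response = await client.chat.completions.create({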
  model: "anthropic/claude-sonnet-4.6",
  // @ts-ignore — custom extension field
  cache: true,
  messages: [
    { role: "system", content: LARGE_SYSTEM_PROMPT },
    { role: "user", content: "Add a unit test for the billing module." },
  ],
});

console.log(response.choices[0].message.content);

curl https://api.aicredits.in/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4.6",
    "cache": true,
    "messages": [
      {
        "role": "system",
        "content": "You are an expert engineer. [large context here]"
      },
      {
        "role": "user",
        "content": "Add a unit test for the billing module."
      }
    ]
  }'

Provider Support

Provider	Models	Caching Method
Anthropic	Claude Sonnet, Haiku, Opus	Explicit cache control headers
OpenAI	GPT-4o, GPT-4o-mini	Automatic (no opt-in needed)
Other providers	—	Not supported

For OpenAI models, prompt caching is automatic: OpenAI applies it on its side for prompts over a minimum length. You do not need to set "cache": true for OpenAI models. The cache flag only activates explicit caching for Claude models.
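If one code path serves several providers, the flag can be set conditionally. The helper below is a minimal sketch, not part of the AICredits API: it assumes Claude model IDs carry the "anthropic/" prefix seen in the examples above, and the name cache_extra_body is hypothetical.

```python
def cache_extra_body(model: str) -> dict:
    """Return the extra request fields for a given model ID.

    Assumption: Claude models are addressed with an "anthropic/" prefix,
    as in the examples above. Other models get no cache flag, since
    OpenAI caching is described as automatic and other providers as
    unsupported.
    """
    if model.startswith("anthropic/"):
        return {"cache": True}
    return {}


print(cache_extra_body("anthropic/claude-sonnet-4.6"))
print(cache_extra_body("openai/gpt-4o"))  # hypothetical model ID format
```

The result can be passed directly as extra_body in the Python example, for instance extra_body=cache_extra_body(model).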

Cache TTL & Invalidation

Aspect	Details
TTL	5 minutes per cache entry
Refresh	Each cache hit resets the 5-minute TTL
Minimum size	~1,000 tokens (~4,000 characters)
Invalidation	Modifying the cached portion invalidates the cache
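The size and TTL rules in the table can be mirrored on the client to decide when caching is likely to help. This is a rough sketch under stated assumptions: the 4-characters-per-token ratio is only the approximation implied by the table, the tracker is a local estimate that never queries the provider, and all names here are hypothetical.

```python
import time
from typing import Optional

MIN_CACHEABLE_TOKENS = 1_000  # approximate minimum size from the table
CHARS_PER_TOKEN = 4           # rough ratio implied by ~1,000 tokens / ~4,000 chars
CACHE_TTL_SECONDS = 5 * 60    # 5-minute TTL from the table


def is_probably_cacheable(text: str) -> bool:
    """Estimate whether text is long enough to be cached (heuristic only)."""
    return len(text) / CHARS_PER_TOKEN >= MIN_CACHEABLE_TOKENS


class CacheWarmthTracker:
    """Client-side guess at whether a cache entry is still alive.

    Mirrors the documented rule that each hit resets the TTL. It does not
    observe the provider's cache, so treat the answer as an estimate.
    """

    def __init__(self, ttl_seconds: float = CACHE_TTL_SECONDS) -> None:
        self.ttl_seconds = ttl_seconds
        self.last_used: Optional[float] = None

    def record_request(self, now: Optional[float] = None) -> None:
        """Call after each request that sent the cached context."""
        self.last_used = time.monotonic() if now is None else now

    def is_warm(self, now: Optional[float] = None) -> bool:
        """True if the last request was less than one TTL ago."""
        if self.last_used is None:
            return False
        current = time.monotonic() if now is None else now
        return current - self.last_used < self.ttl_seconds
```

A caller might use is_warm() to decide whether a follow-up request is likely to be billed as a cache read or as a fresh cache write.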

Best Practices

Put the cacheable content at the top of your system prompt. The cache key is derived from the first N tokens of the system message. Keep the static context (instructions, documents, codebase) at the top and the per-request variable content further down.
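One way to keep this ordering consistent is to build the system prompt from a fixed template in which the static context always comes first. The sketch below assumes the same placeholder prompt as the Python example above; build_messages and the "Request-specific notes" label are hypothetical.

```python
STATIC_CONTEXT = "You are an expert software engineer...[long codebase context]..."

# Static context first, per-request content after it, so the leading
# part of the system message is identical on every request.
SYSTEM_TEMPLATE = "{static}\n\nRequest-specific notes:\n{notes}"


def build_messages(per_request_notes: str, user_question: str) -> list:
    """Build a messages list whose system prompt starts with STATIC_CONTEXT."""
    system_prompt = SYSTEM_TEMPLATE.format(
        static=STATIC_CONTEXT, notes=per_request_notes
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]


first = build_messages("Target Python 3.12.", "Add a unit test for the billing module.")
second = build_messages("Use pytest fixtures.", "Refactor the invoice parser.")

# Both system prompts share the same leading text.
print(first[0]["content"].startswith(STATIC_CONTEXT))
print(second[0]["content"].startswith(STATIC_CONTEXT))
```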

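Reuse the exact same text. Even a single character change invalidates the cache. Keep the cached text as a fixed constant and insert variable values through a template placed after it, rather than building the prompt with ad hoc string concatenation.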
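Accidental edits to the cached text (a trailing space, a changed line ending) are easy to miss. A simple local safeguard is to log a fingerprint of the text and compare it across requests. This is only a client-side check and says nothing about how the provider actually computes its cache key; context_fingerprint is a hypothetical helper.

```python
import hashlib


def context_fingerprint(text: str) -> str:
    """Short SHA-256 digest of the cached text, for logging and comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


baseline = context_fingerprint("You are an expert engineer.")
same = context_fingerprint("You are an expert engineer.")
changed = context_fingerprint("You are an expert engineer. ")  # trailing space

print(baseline == same)     # identical text gives an identical fingerprint
print(baseline == changed)  # one extra character gives a different one
```

If the fingerprint changes between deployments, the next request should be expected to pay the cache write rate again.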

Use for large, repeated context. Caching pays for itself from the second request onward. Ideal use cases include a full codebase in the system prompt, legal documents for review, product catalogs, and long conversation history.