
How to Estimate Your LLM API Costs Before You Ship
Unexpected AI bills have killed startups. Here's how to forecast your LLM costs accurately before you go live — with real formulas and a cost calculator.
Author: AICredits Team
Published: 8 Mar 2026
The horror stories are real
A Y Combinator founder posted on X in late 2024: their side project went viral overnight, and by morning they had a $4,300 OpenAI bill with nothing to show for it. No rate limits. No budget cap. No monitoring. Just a recursive prompt loop eating tokens while they slept.
This is not an isolated incident. Developers ship an LLM feature, tweet about it, it gets 5,000 users in 48 hours, and the API bill is three months of runway. The three root causes are almost always the same:
- No budget limits set on API keys. Most LLM providers will happily run up unlimited charges unless you explicitly cap spending.
- Wrong model for the task. Using GPT-4o for a task that GPT-4o Mini handles equally well costs 17× more per request.
- No pre-launch cost estimation. The team shipped without ever running the numbers.
This guide gives you a repeatable process to estimate your LLM costs before you go live, so you can forecast your burn rate, pick the right model, and set hard limits before the first user hits your API.
Token counting fundamentals
Every LLM API charges by the token, not by the word or character. A token is roughly 0.75 words in English, or about 4 characters. The exact split depends on the tokeniser the model uses.
Quick approximations:
- 1,000 words ≈ 1,333 tokens
- 1 page of text ≈ 500–600 tokens
- A short system prompt ("You are a helpful assistant") ≈ 8 tokens
- A detailed 200-word system prompt ≈ 270 tokens
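The rules of thumb above can be wrapped in a small planning helper. This is only a rough sketch built on the ratios stated here (0.75 words or about 4 characters per token); the function names are illustrative, and the results are for back-of-envelope planning, not billing.

```python
def tokens_from_words(word_count: int) -> int:
    """Rough token estimate from an English word count (1 token ≈ 0.75 words)."""
    return round(word_count / 0.75)


def tokens_from_chars(char_count: int) -> int:
    """Rough token estimate from a character count (1 token ≈ 4 characters)."""
    return round(char_count / 4)


print(tokens_from_words(1_000))  # a 1,000-word document
print(tokens_from_words(200))    # a 200-word system prompt
print(tokens_from_chars(2_400))  # a page of roughly 2,400 characters
```

Use estimates like these to size a budget early, then replace them with measured counts once your prompts exist.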
For accurate counts, use OpenAI's tiktoken library:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = """You are a customer support assistant for Acme Corp.
Always respond in a friendly, professional tone.
Do not discuss competitor products."""
token_count = len(enc.encode(text))
# Prints the measured count; a three-sentence prompt like this is a few dozen tokens at most
print(f"System prompt tokens: {token_count}")
```

For Claude models, Anthropic's tokeniser is slightly different, but the 0.75 words/token approximation holds within ±10% for planning purposes. When in doubt, measure your actual prompts.
The cost formula
Every LLM API follows the same billing structure:
cost = (input_tokens × input_price_per_1M / 1_000_000)
+ (output_tokens × output_price_per_1M / 1_000_000)
Input tokens include everything sent to the model: system prompt, conversation history, tool definitions, and the user message. Output tokens are only what the model generates in response.
A typical support chatbot request might look like:
- System prompt: 80 tokens
- Conversation history (3 turns): 420 tokens
- Current user message: 60 tokens
- Total input: 560 tokens
- Model response: 180 tokens
- Total output: 180 tokens
At GPT-4o pricing (₹240/M input, ₹960/M output as of early 2026):
cost = (560 × 240 / 1_000_000) + (180 × 960 / 1_000_000)
= ₹0.134 + ₹0.173
= ₹0.307 per request
At 10,000 requests/day, that is ₹3,070/day or ₹92,100/month — for a single support chatbot.
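The same arithmetic as a small function, in case you want to plug in your own numbers. This is a minimal sketch: `request_cost` is an illustrative name, and the 30-day month is an assumption. Because it carries the unrounded per-request cost (₹0.3072) forward, the monthly figure comes out slightly above the rounded ₹92,100.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Cost of one request, in the same currency as the per-1M-token prices."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


# Worked example from the text: 560 input + 180 output tokens at GPT-4o rates.
per_request = request_cost(560, 180, input_price_per_1m=240, output_price_per_1m=960)
per_day = per_request * 10_000
per_month = per_day * 30  # assumes a 30-day month

print(f"₹{per_request:.4f} per request")
print(f"₹{per_day:,.0f} per day, ₹{per_month:,.0f} per month")
```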
The model cost matrix (early 2026 prices)
Choosing the right model is the highest-leverage cost decision you will make. Here are current prices on AICredits in Indian Rupees per 1 million tokens:
| Model | Input (₹/1M tokens) | Output (₹/1M tokens) | Best for |
|---|---|---|---|
| GPT-4o | ₹240 | ₹960 | Complex reasoning, multi-step tasks |
| GPT-4o Mini | ₹14 | ₹56 | Classification, extraction, simple Q&A |
| Claude 3.5 Sonnet | ₹275 | ₹1,375 | Long-form writing, nuanced analysis |
| Claude 3.5 Haiku | ₹69 | ₹345 | Fast, cheap, good quality for most tasks |
| Gemini 1.5 Flash | ₹9 | ₹35 | Highest volume, cost-sensitive pipelines |
| Gemini 1.5 Pro | ₹115 | ₹345 | Large context windows, multimodal |
The spread between the cheapest rate in the table (Gemini Flash input at ₹9/M) and the most expensive (Claude Sonnet output at ₹1,375/M) is over 150×. Compared like for like, Sonnet costs roughly 30× more than Flash on input and 40× more on output. Most production workloads can use a tiered approach: fast cheap models for triage and classification, frontier models only for tasks that genuinely need them.
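To compare any two models from the table, it helps to look at input and output prices separately. The sketch below copies the table into a dictionary and computes both ratios; `PRICES` and `price_ratio` are illustrative names, not part of any SDK.

```python
# Prices in INR per 1M tokens (input, output), copied from the table above.
PRICES = {
    "GPT-4o": (240, 960),
    "GPT-4o Mini": (14, 56),
    "Claude 3.5 Sonnet": (275, 1375),
    "Claude 3.5 Haiku": (69, 345),
    "Gemini 1.5 Flash": (9, 35),
    "Gemini 1.5 Pro": (115, 345),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) between two models."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


for pair in [("GPT-4o", "GPT-4o Mini"), ("Claude 3.5 Sonnet", "Gemini 1.5 Flash")]:
    in_ratio, out_ratio = price_ratio(*pair)
    print(f"{pair[0]} vs {pair[1]}: {in_ratio:.1f}x input, {out_ratio:.1f}x output")
```

Which ratio matters more depends on your workload: classification is input-heavy, while content generation is dominated by output tokens.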
Building a cost estimator: Python script
Here is a reusable cost estimator you can run against your actual prompt templates before shipping:
```python
import tiktoken
from dataclasses import dataclass

# Pricing per 1M tokens in INR (early 2026)
MODEL_PRICING = {
    "gpt-4o": {"input": 240, "output": 960},
    "gpt-4o-mini": {"input": 14, "output": 56},
    "claude-3-5-sonnet-20241022": {"input": 275, "output": 1375},
    "claude-3-5-haiku-20241022": {"input": 69, "output": 345},
    "gemini-1.5-flash": {"input": 9, "output": 35},
    "gemini-1.5-pro": {"input": 115, "output": 345},
}


@dataclass
class RequestProfile:
    system_prompt: str
    example_user_message: str
    avg_output_words: int
    requests_per_day: int


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken only knows OpenAI models; Claude and Gemini fall back to
        # cl100k_base, so their counts are approximations
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def estimate_monthly_cost(profile: RequestProfile, model: str) -> dict:
    if model not in MODEL_PRICING:
        raise ValueError(f"Unknown model: {model}")
    input_tokens = (
        count_tokens(profile.system_prompt, model)
        + count_tokens(profile.example_user_message, model)
    )
    # Approximate output tokens from word count
    output_tokens = int(profile.avg_output_words * 1.33)
    pricing = MODEL_PRICING[model]
    cost_per_request_inr = (
        input_tokens * pricing["input"] / 1_000_000
        + output_tokens * pricing["output"] / 1_000_000
    )
    daily_cost = cost_per_request_inr * profile.requests_per_day
    monthly_cost = daily_cost * 30
    return {
        "model": model,
        "input_tokens_per_req": input_tokens,
        "output_tokens_per_req": output_tokens,
        "cost_per_request_inr": round(cost_per_request_inr, 4),
        "daily_cost_inr": round(daily_cost, 2),
        "monthly_cost_inr": round(monthly_cost, 2),
    }


# --- Define your workload ---
profile = RequestProfile(
    system_prompt="""You are a customer support assistant for a SaaS product.
Answer questions about billing, features, and account management.
Be concise and friendly. If you don't know the answer, say so.""",
    example_user_message="How do I upgrade my plan to Pro?",
    avg_output_words=80,      # typical response length
    requests_per_day=5_000,   # your expected traffic
)

print(f"{'Model':<35} {'Input/req':>10} {'Output/req':>11} {'₹/req':>8} {'₹/day':>10} {'₹/month':>12}")
print("-" * 90)
for model in MODEL_PRICING:
    result = estimate_monthly_cost(profile, model)
    print(
        f"{result['model']:<35} "
        f"{result['input_tokens_per_req']:>10} "
        f"{result['output_tokens_per_req']:>11} "
        f"{result['cost_per_request_inr']:>8.4f} "
        f"{result['daily_cost_inr']:>10.2f} "
        f"{result['monthly_cost_inr']:>12.2f}"
    )
```

Run this before you pick a model. The output shows what each model costs at your expected traffic volume for a single-turn request. If your app sends conversation history or retrieved context, add those tokens to the input as well.
Workload sizing: the numbers that actually matter
To estimate monthly cost, you need four inputs:
- Requests per day — start with your target DAU × expected interactions per user per day
- Average input tokens per request — system prompt + context window + user message
- Average output tokens per request — how long are your model responses?
- Model price — from the table above
For a customer support chatbot with 500 DAU, 4 interactions per session, 2 sessions per user per day:
- Requests per day: 500 × 4 × 2 = 4,000
- Average input: 600 tokens (80 system + 400 history + 120 user message)
- Average output: 200 tokens
- Model: Claude 3.5 Haiku
Monthly cost = 4,000 × 30 × ((600 × 69 + 200 × 345) / 1,000,000) = 120,000 × (0.0414 + 0.069) = 120,000 × 0.1104 = ₹13,248/month
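The sizing calculation above can be expressed as one function that takes the traffic assumptions directly. This is a sketch under the same assumptions as the example (a 30-day month, flat averages per request); `monthly_cost` and its parameter names are illustrative.

```python
def monthly_cost(dau: int, interactions_per_session: int, sessions_per_day: int,
                 input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float,
                 days: int = 30) -> float:
    """Monthly spend from traffic assumptions and per-request token averages."""
    requests_per_day = dau * interactions_per_session * sessions_per_day
    cost_per_request = (input_tokens * input_price_per_1m
                        + output_tokens * output_price_per_1m) / 1_000_000
    return requests_per_day * days * cost_per_request


# The support chatbot example: 500 DAU on Claude 3.5 Haiku (₹69 in, ₹345 out).
estimate = monthly_cost(dau=500, interactions_per_session=4, sessions_per_day=2,
                        input_tokens=600, output_tokens=200,
                        input_price_per_1m=69, output_price_per_1m=345)
print(f"₹{estimate:,.0f}/month")
```

Changing one argument at a time (DAU, output length, model price) is a quick way to see which input your budget is most sensitive to.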
The hidden costs developers miss
Most cost estimates are wrong because they ignore compounding factors that significantly increase actual token usage:
System prompt on every request. A 300-token system prompt sent with 10,000 requests/day is 3 million tokens of input you never budgeted for. At GPT-4o pricing that is ₹720/day or ₹21,600/month — just for your instructions.
Conversation history grows with each turn. A 10-turn conversation where each turn adds 100 tokens means the final request carries 900 tokens of prior context. If you average 5 turns per session, the average request carries 3× the input tokens of turn 1 (100, 200, 300, 400, then 500 tokens), before counting the system prompt.
Tool and function definitions add tokens. If you are using function calling or tool use, each tool definition is appended to every request. A set of 5 tool definitions at 200 tokens each adds 1,000 tokens to every single API call. At 50,000 requests/day, that is 50 million extra input tokens a day, or ₹12,000/day at GPT-4o input pricing.
Retries. A 2% error rate with automatic retry means 2% of your requests are paid for twice. At high volume, this adds up. A retry storm (cascading failures triggering mass retries) can multiply your daily spend by 3–5× in an hour.
Context stuffing. RAG pipelines that retrieve 5 documents at 500 tokens each add 2,500 tokens of context to every retrieval-augmented request. Budget for this explicitly.
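These overheads are easier to reason about when they are added up in one place. The sketch below is a simplified model, not a measurement: it assumes every turn adds the same number of tokens, that the fixed overhead (system prompt, tool definitions, retrieved context) is sent on every request, and that each failed call is retried once and billed. All function and parameter names are illustrative.

```python
def average_input_tokens(turns: int, tokens_per_turn: int, system_prompt: int = 0,
                         tool_definitions: int = 0, rag_context: int = 0) -> float:
    """Average input tokens per request over a session with growing history.

    Request k (1-based) carries the fixed overhead plus k turns of content:
    k - 1 turns of history and the current message.
    """
    fixed = system_prompt + tool_definitions + rag_context
    per_request = [fixed + k * tokens_per_turn for k in range(1, turns + 1)]
    return sum(per_request) / turns


def with_retries(requests_per_day: int, retry_rate: float) -> float:
    """Billable requests per day if a fraction of calls is retried once."""
    return requests_per_day * (1 + retry_rate)


# History only: a 5-turn session at 100 tokens per turn.
print(average_input_tokens(turns=5, tokens_per_turn=100))

# Same session with a 300-token system prompt, 5 tools at 200 tokens each,
# and 5 retrieved documents at 500 tokens each.
print(average_input_tokens(5, 100, system_prompt=300,
                           tool_definitions=5 * 200, rag_context=5 * 500))

# 10,000 requests/day with a 2% retry rate.
print(with_retries(10_000, retry_rate=0.02))
```

Feed the resulting average into your cost formula in place of a single-turn input count.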
Cost optimisation techniques
Once you have your baseline estimate, here is where to find savings:
1. Model downgrading for simple tasks (saves 70–95% on those tasks) Classify requests by complexity before routing. Use GPT-4o Mini or Gemini Flash for classification, extraction, and template-based generation. Reserve frontier models for genuinely complex reasoning. A hybrid routing setup (cheap model first, escalate if confidence is low) typically reduces spend by 40–60%.
2. Prompt compression (saves 15–30%) Remove unnecessary whitespace, merge redundant instructions, and use shorter variable names in templates. Tools like LLMLingua can compress prompts by 20–30% with minimal quality loss. For a 400-token system prompt, trimming to 280 tokens cuts the system-prompt share of every request by 30%.
3. Semantic caching (saves 20–40% for conversational apps) Cache responses to semantically similar queries. A user asking "what's my balance?" and another asking "show me my current balance" should return the same cached response. AICredits includes built-in semantic caching — enable it and repeated queries never hit the model at all.
4. Setting max_tokens (prevents runaway output costs) Always set max_tokens to a reasonable ceiling for your use case. An unguarded model generating a 2,000-token essay when you needed a 100-token summary bills 20× the output tokens you planned for. For most conversational apps, max_tokens=512 is a reasonable default.
5. Batching for async workloads (saves 50% where available) OpenAI's Batch API charges 50% less for non-real-time processing. If you are running document processing, analysis pipelines, or overnight jobs, batch mode can cut that portion of your bill in half.
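To get a feel for what routing and an output ceiling are worth, you can estimate both before building anything. The sketch below assumes every request is first answered by the cheap model and a fraction is escalated to the frontier model. The 40% escalation rate is a made-up illustration, not a benchmark, and the helper names are hypothetical.

```python
def cost(input_tokens: int, output_tokens: int,
         price_in: float, price_out: float) -> float:
    """Cost of one request given per-1M-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost_per_request(cheap_cost: float, frontier_cost: float,
                             escalation_rate: float) -> float:
    """Expected cost when every request hits the cheap model first and a
    fraction is escalated to the frontier model."""
    return cheap_cost + escalation_rate * frontier_cost


def worst_case_output_cost(max_tokens: int, price_out: float) -> float:
    """Upper bound on output cost for one request with max_tokens set."""
    return max_tokens * price_out / 1_000_000


mini = cost(560, 180, 14, 56)          # GPT-4o Mini rates
frontier = cost(560, 180, 240, 960)    # GPT-4o rates
blended = blended_cost_per_request(mini, frontier, escalation_rate=0.4)
saving = 1 - blended / frontier

print(f"Blended: ₹{blended:.4f}/request, saving {saving:.0%} vs GPT-4o only")
print(f"Output ceiling at max_tokens=512 on GPT-4o: ₹{worst_case_output_cost(512, 960):.4f}")
```

The saving depends almost entirely on the escalation rate, so measure it on a sample of real requests before committing to a number.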
Setting budget guardrails with AICredits
Estimation is only half the job. The other half is enforcing limits so a traffic spike or a bug cannot blow your budget.
On AICredits, every API key supports a hard spending limit. Once the key hits its budget, it stops accepting requests and returns a clear error — no silent runaway spend.
```python
from openai import OpenAI

# Your production key with a ₹5,000/month budget set in the AICredits dashboard
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this support ticket in one sentence."}
    ],
    max_tokens=100,  # Always set this
)
print(response.choices[0].message.content)
```

Best practice: create separate API keys for each environment (dev, staging, production) and each feature. Set lower budgets on dev and staging keys to catch unexpected spend before it reaches production.
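A key-level limit is the real safety net, but a small client-side guard can help your app fail gracefully before that limit is reached. The sketch below is a hypothetical in-process helper: `BudgetGuard` and `BudgetExceeded` are illustrative names, not part of AICredits or the OpenAI SDK. It keeps state in memory only, so a multi-process deployment would need shared storage instead.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured limit."""


class BudgetGuard:
    """Client-side spend tracker (hypothetical helper, in-memory only)."""

    def __init__(self, monthly_limit_inr: float, alert_fraction: float = 0.8):
        self.limit = monthly_limit_inr
        self.alert_at = monthly_limit_inr * alert_fraction
        self.spent = 0.0

    def check(self, worst_case_cost_inr: float) -> None:
        """Call before a request, passing its worst-case cost."""
        if self.spent + worst_case_cost_inr > self.limit:
            raise BudgetExceeded(
                f"₹{self.spent:.2f} of ₹{self.limit:.2f} already spent"
            )

    def record(self, actual_cost_inr: float) -> bool:
        """Call after a request; returns True once spend crosses the alert threshold."""
        self.spent += actual_cost_inr
        return self.spent >= self.alert_at


guard = BudgetGuard(monthly_limit_inr=5_000)
guard.check(0.5)                      # passes: nothing spent yet
alert = guard.record(4_100.0)         # spend is now past 80% of the budget
print(f"Alert threshold crossed: {alert}")
try:
    guard.check(1_000.0)              # would exceed the ₹5,000 limit
except BudgetExceeded as exc:
    print(f"Blocked: {exc}")
```

Catching `BudgetExceeded` in your request handler lets you show a fallback message instead of surfacing a raw API error.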
Monthly cost projections by app type
Use these benchmarks as starting points for your own estimates:
| App Type | Req/Day | Avg Input Tokens | Avg Output Tokens | Recommended Model | Est. ₹/Month |
|---|---|---|---|---|---|
| Customer support chatbot | 5,000 | 600 | 200 | Claude 3.5 Haiku | ₹16,560 |
| Document Q&A (RAG) | 2,000 | 2,500 | 300 | GPT-4o Mini | ₹3,108 |
| Code assistant (IDE plugin) | 10,000 | 1,200 | 500 | GPT-4o | ₹2,30,400 |
| Classification pipeline | 50,000 | 300 | 20 | Gemini 1.5 Flash | ₹5,100 |
| Content generation | 1,000 | 400 | 800 | Claude 3.5 Sonnet | ₹36,300 |

These figures apply the cost formula to the price table above over a 30-day month, with no allowance for retries or growing conversation history. Your actual numbers depend on your specific prompts and usage patterns, so run the estimator script above with your own prompts to get accurate figures.
Cost tracking you can add to any project today
Add this wrapper around your LLM calls to track real spend as it happens. Log it to your database or observability platform:
```python
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key"
)

# INR pricing per 1M tokens (update when prices change)
PRICING_INR = {
    "gpt-4o": {"input": 240, "output": 960},
    "gpt-4o-mini": {"input": 14, "output": 56},
    "claude-3-5-haiku-20241022": {"input": 69, "output": 345},
    "gemini-1.5-flash": {"input": 9, "output": 35},
}


def tracked_completion(model: str, messages: list, **kwargs) -> dict:
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    latency_ms = (time.time() - start) * 1000
    usage = response.usage
    # Unknown models fall back to GPT-4o rates
    pricing = PRICING_INR.get(model, {"input": 240, "output": 960})
    cost_inr = (
        usage.prompt_tokens * pricing["input"] / 1_000_000
        + usage.completion_tokens * pricing["output"] / 1_000_000
    )
    # Log to your observability system
    log_entry = {
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_inr": round(cost_inr, 5),
        "latency_ms": round(latency_ms, 1),
        "timestamp": time.time(),
    }
    print(f"[LLM] {model} | {usage.prompt_tokens}+{usage.completion_tokens} tokens "
          f"| ₹{cost_inr:.5f} | {latency_ms:.0f}ms")
    return {"response": response, "cost_log": log_entry}


# Usage
result = tracked_completion(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify support tickets."},
        {"role": "user", "content": "My payment failed but I was charged."},
    ],
    max_tokens=20,
)
print(result["response"].choices[0].message.content)
```

In production, replace the print with a write to your metrics system (Prometheus, Datadog, or even a simple Postgres table). After a week of real traffic, you will have accurate data to refine your cost model.
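Once a few days of log entries exist, they can be rolled up per model and projected to month end. The sketch below assumes entries shaped like the `cost_log` dict returned by `tracked_completion` above; the projection is a simple linear extrapolation from the daily average, which ignores growth and weekly seasonality. `summarise_costs` and `project_month_end` are illustrative names.

```python
from collections import defaultdict


def summarise_costs(cost_logs: list[dict]) -> dict[str, float]:
    """Total logged spend per model, in INR."""
    totals: dict[str, float] = defaultdict(float)
    for entry in cost_logs:
        totals[entry["model"]] += entry["cost_inr"]
    return dict(totals)


def project_month_end(spend_so_far: float, days_elapsed: int,
                      days_in_month: int = 30) -> float:
    """Linear projection of month-end spend from the daily average so far."""
    return spend_so_far / days_elapsed * days_in_month


# Sample entries; in practice these come from your database or metrics store.
logs = [
    {"model": "gpt-4o-mini", "cost_inr": 0.0021},
    {"model": "gpt-4o-mini", "cost_inr": 0.0019},
    {"model": "gpt-4o", "cost_inr": 0.31},
]
totals = summarise_costs(logs)
for model, total in totals.items():
    print(f"{model}: ₹{total:.4f}")

projection = project_month_end(spend_so_far=3_500.0, days_elapsed=7)
print(f"Projected month-end spend: ₹{projection:,.0f}")
```

Comparing the projection with your budget each day gives you an early warning well before a hard limit is hit.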
INR cost conversion for Indian startups
All LLM providers price in USD. For Indian startups, the forex rate and buffer matter.
At an exchange rate of ₹85/USD with a 5% buffer:
- Effective rate: ₹85 × 1.05 = ₹89.25/USD
- GPT-4o: $3/1M input tokens = ₹267.75/1M input tokens
- GPT-4o Mini: $0.15/1M input tokens = ₹13.39/1M input tokens
AICredits bills entirely in INR from a prepaid wallet, so you are never exposed to intra-month forex fluctuations. You load ₹10,000 into your wallet and spend exactly that — no surprise USD-to-INR conversion at the end of the month, no international transaction fees on your credit card, and no rejected cards because your Indian bank blocked a foreign charge.
For a startup burning ₹50,000/month on LLM API costs, the difference between a 5% forex buffer and a 15% buffer (what some resellers charge) is roughly ₹5,000/month, or about ₹60,000/year that goes directly to your bottom line.
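The conversion is simple enough to script, which makes it easy to rerun when the exchange rate moves. This is a minimal sketch using the ₹85/USD rate and the buffers from this section as defaults; the function names are illustrative, and the buffer comparison treats the monthly spend as the pre-buffer base.

```python
def inr_price(usd_per_1m: float, rate: float = 85.0, buffer: float = 0.05) -> float:
    """Per-1M-token price in INR at a given exchange rate and forex buffer."""
    return usd_per_1m * rate * (1 + buffer)


def buffer_cost_difference(monthly_spend_inr: float,
                           low_buffer: float, high_buffer: float) -> float:
    """Extra monthly cost of paying the higher buffer on the same base spend."""
    return monthly_spend_inr * (high_buffer - low_buffer)


print(f"$3.00/1M -> ₹{inr_price(3.00):.2f}/1M")
print(f"$0.15/1M -> ₹{inr_price(0.15):.2f}/1M")

extra = buffer_cost_difference(50_000, low_buffer=0.05, high_buffer=0.15)
print(f"5% vs 15% buffer: ₹{extra:,.0f}/month, ₹{extra * 12:,.0f}/year")
```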
The pre-launch checklist
Before you ship any LLM-powered feature, run through this list:
- [ ] Measured actual token counts on your real prompt templates (not estimates)
- [ ] Ran the cost estimator at 1×, 10×, and 100× expected traffic
- [ ] Chosen the cheapest model that meets your quality bar for each task
- [ ] Set max_tokens on every API call
- [ ] Created a dedicated API key for production with a monthly budget limit
- [ ] Added per-request cost logging to your application
- [ ] Set up an alert when monthly spend crosses 80% of budget
- [ ] Tested what happens when the budget limit is hit (does your app fail gracefully?)
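For the traffic item in the checklist, a quick way to see the 1×, 10×, and 100× scenarios side by side is to scale a single per-request cost. The sketch below reuses the ₹0.1104 per-request figure and 4,000 requests/day from the support chatbot example. It assumes cost scales linearly with traffic, which ignores volume discounts and rate limits, and `scaled_monthly_costs` is an illustrative name.

```python
def scaled_monthly_costs(cost_per_request_inr: float, requests_per_day: int,
                         multipliers: tuple[int, ...] = (1, 10, 100),
                         days: int = 30) -> dict[int, float]:
    """Monthly cost at each traffic multiplier, assuming linear scaling."""
    return {
        m: cost_per_request_inr * requests_per_day * m * days
        for m in multipliers
    }


scenarios = scaled_monthly_costs(0.1104, 4_000)
for multiplier, monthly in scenarios.items():
    print(f"{multiplier:>3}x traffic: ₹{monthly:,.0f}/month")
```

If the 100× figure would end the company, that is the number to set your hard budget limit against.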
LLM API costs are predictable if you measure them upfront. The teams that get surprise bills are the ones who skip the math. Spend an hour with the estimator script above before you launch — it is considerably cheaper than finding out the hard way.