Evaluate & Optimize
Evaluate LLM outputs systematically, A/B test models, measure quality metrics, and optimize cost without sacrificing accuracy.
Don't pick models by intuition — measure. This guide covers building eval sets, comparing models systematically, and finding the cheapest model that meets your quality bar.
Overview
The goal of evaluation is to find the model (and prompt) that maximises quality at the lowest cost for your specific task. A model that suits one task (e.g., GPT-4o for legal summarisation) may be overkill for another, such as sentiment classification, where GPT-4o-mini may be enough.
Because AICredits gives you access to all models through one API, you can run the same eval set against multiple models with minimal code changes.
Build an Eval Set
Start with 50–200 representative examples from your production traffic. For each example, capture:
- The input (user message + system prompt)
- The expected output (ground truth or reference answer)
- The evaluation criteria (exact match, contains key phrase, LLM judge score)
[
  {
    "id": "support-001",
    "input": "My payment failed but money was deducted. What should I do?",
    "expected_topics": ["refund", "contact support", "3-5 business days"],
    "must_not_contain": ["I don't know", "I cannot help"]
  },
  {
    "id": "classify-001",
    "input": "This product is absolutely terrible and I want my money back",
    "expected_sentiment": "negative",
    "expected_intent": "refund_request"
  }
]
Model Comparison
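Comparing models needs a score for each output. For the rule-based criteria in the eval set above (`expected_topics`, `must_not_contain`), a minimal sketch could look like the following. It assumes case-insensitive substring matching is good enough for your task; `score_output` is a hypothetical helper, not part of any AICredits SDK.

```python
def score_output(example: dict, output: str) -> dict:
    """Score one model output against an example's rule-based criteria."""
    text = output.lower()
    topics = example.get("expected_topics", [])
    banned = example.get("must_not_contain", [])

    # Case-insensitive substring checks (an assumption; swap in your own matcher).
    hits = [t for t in topics if t.lower() in text]
    violations = [p for p in banned if p.lower() in text]

    # Fraction of expected topics mentioned; 1.0 when the example lists none.
    coverage = len(hits) / len(topics) if topics else 1.0
    return {
        "id": example["id"],
        "topic_coverage": coverage,
        "violations": violations,
        "passed": coverage == 1.0 and not violations,
    }


example = {
    "id": "support-001",
    "input": "My payment failed but money was deducted. What should I do?",
    "expected_topics": ["refund", "contact support", "3-5 business days"],
    "must_not_contain": ["I don't know", "I cannot help"],
}
good = score_output(example, "Please contact support; your refund arrives in 3-5 business days.")
bad = score_output(example, "I don't know, sorry.")
print(good["passed"], bad["passed"])
```

Applying this to each entry returned by an eval run gives a pass rate per model that can sit next to latency and token counts.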
import json
import time

from openai import OpenAI

# AICredits exposes an OpenAI-compatible endpoint, so the standard client works.
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-key-here",
)

MODELS = [
    "openai/gpt-4o-mini",
    "openai/gpt-4o",
    "anthropic/claude-haiku-4.5-20251001",
    "anthropic/claude-sonnet-4.5",
    "google/gemini-2.0-flash",
]

SYSTEM_PROMPT = "You are a helpful customer support agent for an e-commerce platform."


def run_eval(eval_examples: list[dict], model: str) -> list[dict]:
    """Run every eval example against one model and record output, latency, and tokens."""
    results = []
    for ex in eval_examples:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["input"]},
            ],
            max_tokens=200,
        )
        # Wall-clock latency of the full request, in milliseconds.
        latency_ms = (time.perf_counter() - start) * 1000
        output = response.choices[0].message.content
        results.append({
            "id": ex["id"],
            "output": output,
            "latency_ms": latency_ms,
            "tokens": response.usage.total_tokens,
        })
    return results


with open("eval_set.json") as f:
    eval_set = json.load(f)

# Same eval set, same prompt; only the model name changes between runs.
for model in MODELS:
    print(f"\nRunning {model}...")
    results = run_eval(eval_set, model)
    avg_latency = sum(r["latency_ms"] for r in results) / len(results)
    print(f"Avg latency: {avg_latency:.0f}ms")
LLM-as-Judge
For tasks without a clear ground truth (e.g., response quality, helpfulness), use a strong model to score outputs:
def llm_judge(question: str, response: str, criteria: str) -> dict:
    """Use GPT-4o-mini to score a response on a 1-5 scale."""
    prompt = f"""Score the following response on a scale of 1-5 for the given criteria.
Criteria: {criteria}
User Question: {question}
Response to evaluate:
{response}
Respond with JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""
    result = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return json.loads(result.choices[0].message.content)


judgment = llm_judge(
    question="My payment failed but money was deducted.",
    response="I understand your frustration. Please contact our support team...",
    criteria="Is the response helpful, empathetic, and actionable?",
)
print(f"Score: {judgment['score']}/5 - {judgment['reasoning']}")
Use a cheap, fast model (GPT-4o-mini) as the judge for most criteria. Reserve expensive models (GPT-4o, Claude Sonnet) for judging subtle quality dimensions.
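Judge replies are model output, so they can occasionally be malformed or out of range. The sketch below validates a raw judge reply before it is counted and averages the valid scores. It assumes the `{"score", "reasoning"}` shape requested in the judge prompt; `parse_judgment` and `mean_score` are hypothetical helper names.

```python
import json
from statistics import mean
from typing import Optional


def parse_judgment(raw: str) -> Optional[dict]:
    """Return {"score": int, "reasoning": str}, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


def mean_score(raw_replies: list) -> Optional[float]:
    """Average the valid scores; None when no reply could be parsed."""
    parsed = [j for j in map(parse_judgment, raw_replies) if j is not None]
    return mean(j["score"] for j in parsed) if parsed else None


# Illustrative replies, not real judge output.
replies = [
    '{"score": 4, "reasoning": "clear and actionable"}',
    '{"score": 9, "reasoning": "out of range"}',
    "not json",
    '{"score": 5}',
]
print(mean_score(replies))
```

Dropping invalid replies silently can hide a broken judge prompt, so it may be worth logging how many replies were rejected per run.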
Cost vs Quality Trade-off
After running evals, compare your models on cost versus quality:
results = {
    "openai/gpt-4o-mini": {"avg_score": 3.8, "avg_cost_inr": 0.02, "avg_latency_ms": 380},
    "anthropic/claude-haiku-4.5-20251001": {"avg_score": 4.1, "avg_cost_inr": 0.03, "avg_latency_ms": 420},
    "openai/gpt-4o": {"avg_score": 4.6, "avg_cost_inr": 0.18, "avg_latency_ms": 850},
    "anthropic/claude-sonnet-4.5": {"avg_score": 4.7, "avg_cost_inr": 0.22, "avg_latency_ms": 920},
}

# Decision:
# - If quality >= 4.0 is acceptable: use claude-haiku (cheap, good quality)
# - If quality >= 4.5 required: use gpt-4o or claude-sonnet (roughly 6-7x the cost of haiku)
# - Routing: use haiku for simple queries, sonnet for complex ones
print("\nModel Comparison:")
print(f"{'Model':<45} {'Score':>7} {'Cost (₹)':>10} {'Latency':>10}")
print("-" * 75)
for model, stats in results.items():
    # Latency column is 8 digits plus "ms" so it lines up with the 10-wide header.
    print(f"{model:<45} {stats['avg_score']:>7.1f} {stats['avg_cost_inr']:>10.3f} {stats['avg_latency_ms']:>8.0f}ms")
Tracking with Usage Logs
AICredits logs every request automatically. Use the usage dashboard or API to pull real production cost data per model:
GET /api/billing/usage?page=1&limit=100
# Response includes per-request:
{
  "logs": [
    {
      "model": "openai/gpt-4o-mini",
      "total_tokens": 234,
      "cost_inr": 0.019,
      "created_at": "2026-03-16T10:30:00Z"
    }
  ]
}
Cross-reference your eval scores with real production usage logs to validate that the model performing best in offline evals also performs best with real user inputs.
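One way to do that cross-referencing is to average the logged cost per model and then pick the cheapest model whose eval score clears your quality bar. The sketch below assumes log entries shaped like the response above (`model`, `cost_inr`) have already been fetched; the helper names and the sample numbers are illustrative, not real usage data.

```python
from collections import defaultdict
from typing import Optional


def avg_cost_per_request(logs: list) -> dict:
    """Average cost in INR per request, grouped by model."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for entry in logs:
        totals[entry["model"]] += entry["cost_inr"]
        counts[entry["model"]] += 1
    return {model: totals[model] / counts[model] for model in totals}


def cheapest_meeting_bar(scores: dict, costs: dict, min_score: float) -> Optional[str]:
    """Cheapest model whose eval score is >= min_score; None if none qualifies."""
    candidates = [m for m, s in scores.items() if s >= min_score and m in costs]
    return min(candidates, key=lambda m: costs[m]) if candidates else None


# Illustrative log entries in the shape of the usage response.
logs = [
    {"model": "openai/gpt-4o-mini", "total_tokens": 234, "cost_inr": 0.019},
    {"model": "openai/gpt-4o-mini", "total_tokens": 250, "cost_inr": 0.021},
    {"model": "anthropic/claude-haiku-4.5-20251001", "total_tokens": 240, "cost_inr": 0.03},
    {"model": "openai/gpt-4o", "total_tokens": 260, "cost_inr": 0.18},
]
# Offline eval scores, e.g. averaged judge scores per model.
eval_scores = {
    "openai/gpt-4o-mini": 3.8,
    "anthropic/claude-haiku-4.5-20251001": 4.1,
    "openai/gpt-4o": 4.6,
}

costs = avg_cost_per_request(logs)
print(cheapest_meeting_bar(eval_scores, costs, min_score=4.0))
```

Production cost per request can differ from eval cost because real inputs vary in length, so the logged averages are usually the better basis for this decision.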