Evaluate & Optimize

Evaluate LLM outputs systematically, A/B test models, measure quality metrics, and optimize cost without sacrificing accuracy.

Don't pick models by intuition: measure. This guide covers building eval sets, comparing models systematically, and finding the cheapest model that meets your quality bar.

Overview

The goal of evaluation is to find the model (and prompt) that meets your quality bar at the lowest cost for your specific task. A model that suits one task (e.g., GPT-4o for legal summarisation) may be overkill for another (e.g., sentiment classification, where GPT-4o-mini may be enough).

Because AICredits gives you access to all models through one API, you can run the same eval set against multiple models with minimal code changes.

Build an Eval Set

Start with 50–200 representative examples from your production traffic. For each example, capture:

The input (user message + system prompt)
The expected output (ground truth or reference answer)
The evaluation criteria (exact match, contains key phrase, LLM judge score)

eval_set.json

[
  {
    "id": "support-001",
    "input": "My payment failed but money was deducted. What should I do?",
    "expected_topics": ["refund", "contact support", "3-5 business days"],
    "must_not_contain": ["I don't know", "I cannot help"]
  },
  {
    "id": "classify-001",
    "input": "This product is absolutely terrible and I want my money back",
    "expected_sentiment": "negative",
    "expected_intent": "refund_request"
  }
]
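
The keyword criteria in the first example can be checked mechanically. The sketch below is one possible scorer, not part of AICredits: `score_example` is a hypothetical helper name, and it assumes case-insensitive substring matching, which is crude (it will miss paraphrases such as "reimbursement" for "refund").

```python
def score_example(example: dict, output: str) -> dict:
    """Score one model output against the keyword criteria of an eval example."""
    text = output.lower()
    topics = example.get("expected_topics", [])
    banned = example.get("must_not_contain", [])

    # Fraction of expected topics that appear in the output (case-insensitive).
    hits = [t for t in topics if t.lower() in text]
    coverage = len(hits) / len(topics) if topics else 1.0

    # Banned phrases that appear in the output.
    violations = [b for b in banned if b.lower() in text]

    return {
        "id": example["id"],
        "topic_coverage": coverage,
        "violations": violations,
        # Assumption: an example passes only with full coverage and no violations.
        "passed": coverage == 1.0 and not violations,
    }


example = {
    "id": "support-001",
    "expected_topics": ["refund", "contact support", "3-5 business days"],
    "must_not_contain": ["I don't know", "I cannot help"],
}
output = "Please contact support; your refund arrives in 3-5 business days."
print(score_example(example, output))
```

Examples that carry other fields, such as `expected_sentiment`, would need their own check (for instance an exact match on a parsed label).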

Model Comparison

run_eval.py

import json
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-key-here",
)

MODELS = [
    "openai/gpt-4o-mini",
    "openai/gpt-4o",
    "anthropic/claude-haiku-4.5-20251001",
    "anthropic/claude-sonnet-4.5",
    "google/gemini-2.0-flash",
]

SYSTEM_PROMPT = "You are a helpful customer support agent for an e-commerce platform."

def run_eval(eval_examples: list[dict], model: str) -> list[dict]:
    """Run every eval example through `model`, recording output, latency, and token usage."""
    results = []
    for ex in eval_examples:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["input"]},
            ],
            max_tokens=200,
        )
        latency_ms = (time.perf_counter() - start) * 1000
        output = response.choices[0].message.content

        results.append({
            "id": ex["id"],
            "output": output,
            "latency_ms": latency_ms,
            "tokens": response.usage.total_tokens,
        })
    return results

with open("eval_set.json") as f:
    eval_set = json.load(f)

for model in MODELS:
    print(f"\nRunning {model}...")
    results = run_eval(eval_set, model)
    avg_latency = sum(r["latency_ms"] for r in results) / len(results)
    print(f"Avg latency: {avg_latency:.0f}ms")
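
Average latency can hide slow outliers, so it may help to report a median and a tail percentile as well. The following is a minimal sketch that assumes result dicts shaped like the ones `run_eval` returns (`latency_ms` and `tokens` keys); `summarize` is a hypothetical helper name and the sample data is made up.

```python
import statistics


def summarize(results: list[dict]) -> dict:
    """Summarise latency and token usage for one model's eval run."""
    if not results:
        raise ValueError("no results to summarise")
    latencies = sorted(r["latency_ms"] for r in results)
    n = len(latencies)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), done in integer math.
    p95_rank = (95 * n + 99) // 100
    return {
        "n": n,
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[p95_rank - 1],
        "avg_tokens": statistics.mean(r["tokens"] for r in results),
    }


# Made-up sample: 20 requests with latencies of 100 ms, 200 ms, ..., 2000 ms.
sample = [{"latency_ms": 100.0 * i, "tokens": 100} for i in range(1, 21)]
print(summarize(sample))
```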

LLM-as-Judge

For tasks without a clear ground truth (e.g., response quality, helpfulness), use a separate judge model to score outputs:

LLM-as-judge

def llm_judge(question: str, response: str, criteria: str) -> dict:
    """Use a judge model (GPT-4o-mini here) to score a response on a 1-5 scale."""
    prompt = f"""Score the following response on a scale of 1-5 for the given criteria.

Criteria: {criteria}

User Question: {question}

Response to evaluate:
{response}

Respond with JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    result = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

judgment = llm_judge(
    question="My payment failed but money was deducted.",
    response="I understand your frustration. Please contact our support team...",
    criteria="Is the response helpful, empathetic, and actionable?",
)
print(f"Score: {judgment['score']}/5 — {judgment['reasoning']}")

Use a cheap, fast model (e.g., GPT-4o-mini) as the judge for most criteria. Reserve more expensive models (GPT-4o, Claude Sonnet) for judging subtle quality dimensions.
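
A judge can occasionally return malformed JSON or a score outside the requested range, and one bad reply can skew an average. The sketch below shows one way to validate a judge reply before using it; `parse_judgment` is a hypothetical helper, and rejecting non-integer scores is an assumption rather than a requirement of the API.

```python
import json


def parse_judgment(raw: str) -> dict:
    """Parse and validate a judge reply; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


print(parse_judgment('{"score": 4, "reasoning": "Helpful and actionable."}'))
```

Examples whose judgment fails validation can be retried or excluded from the average, as long as the number of exclusions is reported alongside the score.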

Cost vs Quality Trade-off

After running evals, compare your models on cost versus quality:

Cost/quality comparison

results = {
    "openai/gpt-4o-mini": {"avg_score": 3.8, "avg_cost_inr": 0.02, "avg_latency_ms": 380},
    "anthropic/claude-haiku-4.5-20251001": {"avg_score": 4.1, "avg_cost_inr": 0.03, "avg_latency_ms": 420},
    "openai/gpt-4o": {"avg_score": 4.6, "avg_cost_inr": 0.18, "avg_latency_ms": 850},
    "anthropic/claude-sonnet-4.5": {"avg_score": 4.7, "avg_cost_inr": 0.22, "avg_latency_ms": 920},
}

# Decision:
# - If quality >= 4.0 is acceptable: use claude-haiku (cheap, good quality)
# - If quality >= 4.5 is required: use gpt-4o or claude-sonnet (about 6-7x the cost of claude-haiku)
# - Routing: use haiku for simple queries, sonnet for complex ones

print("\nModel Comparison:")
print(f"{'Model':<45} {'Score':>7} {'Cost (₹)':>10} {'Latency':>10}")
print("-" * 75)
for model, stats in results.items():
    print(f"{model:<45} {stats['avg_score']:>7.1f} {stats['avg_cost_inr']:>10.3f} {stats['avg_latency_ms']:>9.0f}ms")
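
The decision rule in the comments can also be expressed as code. This is a minimal sketch: `cheapest_meeting_bar` is a hypothetical helper, and `sample_results` repeats the figures from the table above so the block runs on its own.

```python
from typing import Optional


def cheapest_meeting_bar(results: dict, min_score: float) -> Optional[str]:
    """Return the lowest-cost model whose average score meets the bar, or None."""
    candidates = [
        (stats["avg_cost_inr"], model)
        for model, stats in results.items()
        if stats["avg_score"] >= min_score
    ]
    # min() compares cost first; ties fall back to the model name.
    return min(candidates)[1] if candidates else None


sample_results = {
    "openai/gpt-4o-mini": {"avg_score": 3.8, "avg_cost_inr": 0.02},
    "anthropic/claude-haiku-4.5-20251001": {"avg_score": 4.1, "avg_cost_inr": 0.03},
    "openai/gpt-4o": {"avg_score": 4.6, "avg_cost_inr": 0.18},
    "anthropic/claude-sonnet-4.5": {"avg_score": 4.7, "avg_cost_inr": 0.22},
}

print(cheapest_meeting_bar(sample_results, 4.0))  # anthropic/claude-haiku-4.5-20251001
print(cheapest_meeting_bar(sample_results, 4.5))  # openai/gpt-4o
print(cheapest_meeting_bar(sample_results, 5.0))  # None
```

A latency ceiling could be added as a second filter in the same list comprehension if response time matters for the task.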

Tracking with Usage Logs

AICredits logs every request automatically. Use the usage dashboard or API to pull real production cost data per model:

Fetch usage logs

GET /api/billing/usage?page=1&limit=100

# The response includes one entry per request:
{
  "logs": [
    {
      "model": "openai/gpt-4o-mini",
      "total_tokens": 234,
      "cost_inr": 0.019,
      "created_at": "2026-03-16T10:30:00Z"
    }
  ]
}

Cross-reference your eval scores with real production usage logs to check that the model that looks best in offline evals is also cost-effective on real user inputs.
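
Once the logs are fetched, per-model totals are a simple aggregation. The sketch below assumes the response shape shown above (a `logs` list with `model`, `total_tokens`, and `cost_inr` fields) and leaves out the HTTP call and authentication; `cost_by_model` is a hypothetical helper, and the second and third log entries are made up for illustration.

```python
from collections import defaultdict


def cost_by_model(logs: list[dict]) -> dict:
    """Aggregate request count, tokens, and cost per model from usage log entries."""
    totals = defaultdict(lambda: {"requests": 0, "total_tokens": 0, "cost_inr": 0.0})
    for entry in logs:
        t = totals[entry["model"]]
        t["requests"] += 1
        t["total_tokens"] += entry["total_tokens"]
        t["cost_inr"] += entry["cost_inr"]
    # Average cost per request, comparable to the avg_cost_inr measured in evals.
    for t in totals.values():
        t["avg_cost_inr"] = t["cost_inr"] / t["requests"]
    return dict(totals)


payload = {
    "logs": [
        {"model": "openai/gpt-4o-mini", "total_tokens": 234, "cost_inr": 0.019},
        {"model": "openai/gpt-4o-mini", "total_tokens": 300, "cost_inr": 0.025},  # made up
        {"model": "openai/gpt-4o", "total_tokens": 410, "cost_inr": 0.21},  # made up
    ]
}

for model, stats in cost_by_model(payload["logs"]).items():
    print(f"{model}: {stats['requests']} requests, avg cost {stats['avg_cost_inr']:.3f} INR")
```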