How to Evaluate LLM Outputs: A Practical Guide to Building Evals

Shipping an LLM feature without evals is flying blind. Here's how to build evaluation systems that tell you if your prompts are actually working.

Author: AICredits Team
Published: 10 Mar 2026
Reading time: 12 min read

The vibe-check problem

You write a prompt, run it against five inputs, the outputs look reasonable, and you ship it. Two weeks later a user files a bug: the model hallucinated a product feature that does not exist. You tweak the prompt to fix it. Now three other cases break.

This is the vibe-check loop — informal manual testing where "looks good to me" is the only acceptance criterion. It does not scale, it does not catch regressions, and it gives you no way to compare prompt versions objectively.

Evals fix this. An eval is a repeatable, automated process that runs your prompt against a fixed dataset and produces a numeric score. When you change your prompt, you run the eval again. If the score goes up, the change is an improvement. If it goes down, you revert.

The goal of this guide is to get you from "I manually check outputs" to "I have a scoring harness I run before every deploy."


Types of evaluation

Different tasks call for different grading strategies. Use the simplest one that gives you a reliable signal.

Exact match

Best for structured outputs with a deterministic correct answer: JSON extraction, entity tagging, yes/no classification, numeric answers.

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

Normalise before comparing (lowercase, strip whitespace, parse JSON before comparing keys). Exact match is fast and cheap — use it wherever your task allows.
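The normalisation steps above can be sketched as follows. This is a minimal illustration, not the only reasonable policy: the helper names `normalise_text` and `normalised_match` are hypothetical, and the choice to collapse internal whitespace is an assumption you may not want for every task.

```python
import json


def normalise_text(value: str) -> str:
    """Lowercase and collapse every run of whitespace to a single space."""
    return " ".join(value.lower().split())


def normalised_match(output: str, expected: str) -> float:
    """Exact match after normalisation; JSON is compared as parsed data."""
    try:
        # If both sides parse as JSON, compare the parsed values so that key
        # order and insignificant whitespace do not cause false failures.
        return 1.0 if json.loads(output) == json.loads(expected) else 0.0
    except (json.JSONDecodeError, TypeError):
        # Otherwise fall back to comparing normalised plain text.
        return 1.0 if normalise_text(output) == normalise_text(expected) else 0.0


print(normalised_match("  Billing\n", "billing"))
print(normalised_match('{"b": 2, "a": 1}', '{"a": 1, "b": 2}'))
print(normalised_match("technical", "billing"))
```

Comparing parsed JSON rather than raw strings means a model that reorders keys or changes indentation is not penalised for it.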

Semantic similarity

For tasks where the wording can vary but the meaning should be equivalent — summaries, paraphrases, open-ended Q&A — compute cosine similarity between the output embedding and the expected answer embedding.

import numpy as np
from openai import OpenAI
 
client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-key")
 
def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding
 
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
def semantic_score(output: str, expected: str) -> float:
    return cosine_similarity(embed(output), embed(expected))

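As a rough starting point, a score above 0.85 usually indicates a strong match and a score below 0.70 a meaningfully different answer. The useful cut-offs depend on the embedding model and the task, so check them against a handful of hand-labelled pairs before relying on them.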

LLM-as-judge

For subjective quality — helpfulness, tone, accuracy of open-domain answers — use a stronger model to grade outputs. This is the most flexible approach and works where reference answers do not exist.

The grader prompt is your most important design decision (more on this below).

Human evaluation

The gold standard. Have a domain expert label a sample of outputs as pass/fail or on a 1–5 scale. Human evals are expensive and slow, so use them to calibrate your automated graders rather than as your primary eval loop.

A typical workflow: run automated evals continuously, run human evals on a 200-example sample once per quarter, and use the human scores to verify that your automated grader is still well-calibrated.
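One way to do the calibration check described above is to compare the automated grader's pass/fail decisions with the human labels on the same sample. The sketch below assumes both are available as parallel lists of booleans; the function name `grader_agreement` and the three reported rates are illustrative choices, not a standard.

```python
def grader_agreement(auto_passed: list[bool], human_passed: list[bool]) -> dict:
    """Compare automated pass/fail decisions with human labels on the same examples."""
    if not auto_passed or len(auto_passed) != len(human_passed):
        raise ValueError("need two non-empty lists of equal length")
    n = len(auto_passed)
    pairs = list(zip(auto_passed, human_passed))
    return {
        # Fraction of examples where grader and human reach the same verdict.
        "agreement": sum(a == h for a, h in pairs) / n,
        # Grader passed an output the human rejected (the riskier error).
        "false_pass_rate": sum(a and not h for a, h in pairs) / n,
        # Grader failed an output the human accepted.
        "false_fail_rate": sum(h and not a for a, h in pairs) / n,
    }


print(grader_agreement([True, True, False, False], [True, False, False, True]))
```

If agreement drifts down between quarterly checks, that is a hint the grader (or the rubric) needs attention before you trust its scores again.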


Building a simple eval harness in Python

Here is a complete, runnable eval harness. It loads a dataset, runs your prompt against each input, grades each output, and prints a summary report.

import json
import time
from dataclasses import dataclass, field
from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
# --- Data types ---
 
@dataclass
class EvalExample:
    id: str
    input: str
    expected: str
    metadata: dict = field(default_factory=dict)
 
@dataclass
class EvalResult:
    id: str
    input: str
    expected: str
    output: str
    score: float
    latency_ms: float
    cost_tokens: int
    passed: bool
 
# --- Model under test ---
 
SYSTEM_PROMPT = """You are a customer support classifier.
Given a user message, respond with exactly one of these categories:
billing, technical, account, general
 
Respond with only the category name, nothing else."""
 
def run_model(user_input: str, model: str = "gpt-4o-mini") -> tuple[str, float, int]:
    """Returns (output, latency_ms, total_tokens)."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0,
        max_tokens=16,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    output = resp.choices[0].message.content.strip().lower()
    tokens = resp.usage.total_tokens
    return output, latency_ms, tokens
 
# --- Grader ---
 
def grade(output: str, expected: str) -> float:
    return exact_match(output, expected)
 
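def exact_match(output: str, expected: str) -> float:
    # Normalise both sides so the grader does not rely on run_model() having cleaned the output.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0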
 
# --- Harness ---
 
def run_eval(dataset: list[EvalExample], model: str = "gpt-4o-mini") -> list[EvalResult]:
    results = []
    for ex in dataset:
        output, latency_ms, tokens = run_model(ex.input, model=model)
        score = grade(output, ex.expected)
        results.append(EvalResult(
            id=ex.id,
            input=ex.input,
            expected=ex.expected,
            output=output,
            score=score,
            latency_ms=latency_ms,
            cost_tokens=tokens,
            passed=score >= 1.0,
        ))
    return results
 
def print_report(results: list[EvalResult]):
    total = len(results)
    passed = sum(r.passed for r in results)
    avg_latency = sum(r.latency_ms for r in results) / total
    total_tokens = sum(r.cost_tokens for r in results)
    accuracy = passed / total
 
    print(f"\n{'='*50}")
    print(f"Eval Results — {total} examples")
    print(f"{'='*50}")
    print(f"Accuracy:      {accuracy:.1%}  ({passed}/{total})")
    print(f"Avg latency:   {avg_latency:.0f} ms")
    print(f"Total tokens:  {total_tokens:,}")
    print(f"\nFailed cases:")
    for r in results:
        if not r.passed:
            print(f"  [{r.id}] input={r.input!r}")
            print(f"         expected={r.expected!r}  got={r.output!r}")
 
# --- Example dataset ---
 
dataset = [
    EvalExample("001", "My invoice is wrong", "billing"),
    EvalExample("002", "The API keeps timing out", "technical"),
    EvalExample("003", "I need to reset my password", "account"),
    EvalExample("004", "What are your business hours?", "general"),
    EvalExample("005", "I was charged twice this month", "billing"),
    EvalExample("006", "How do I integrate with Python?", "technical"),
]
 
if __name__ == "__main__":
    results = run_eval(dataset)
    print_report(results)

Run this after any prompt change. A drop in accuracy is an immediate signal that your edit caused a regression.


LLM-as-judge: writing reliable grader prompts

When exact match is too strict, use a language model to grade the output. The key is writing a grader prompt that produces consistent, calibrated scores.

Rules for reliable grader prompts:

  1. Give a rubric, not vibes. Define what a 1, 2, 3, 4, 5 means explicitly. Do not ask "is this a good response?"
  2. Ask for a score and reasoning separately. Ask for <reasoning> before <score> so the model is not anchored to a number before explaining itself.
  3. Use a cheaper model for simple judgements. For binary pass/fail, gpt-4o-mini as judge is usually accurate enough and roughly 17× cheaper than GPT-4o.
  4. Validate grader consistency. Run the same example through the grader five times. If scores vary, tighten the rubric.

JUDGE_PROMPT = """You are an expert evaluator for a customer support AI.
 
Your job is to assess whether the AI's response correctly addresses the user's question.
 
Rubric:
5 — Fully correct: accurate, complete, and clearly addresses the question
4 — Mostly correct: minor omissions or slight inaccuracies, but helpful overall
3 — Partially correct: addresses part of the question but misses key points
2 — Mostly incorrect: the response is largely unhelpful or misleading
1 — Completely wrong: factually incorrect or entirely off-topic
 
Respond in this exact format:
<reasoning>One to three sentences explaining your score.</reasoning>
<score>N</score>
 
User question: {question}
AI response: {response}
Reference answer: {reference}"""
 
import re
 
def llm_judge(question: str, response: str, reference: str,
              judge_model: str = "gpt-4o-mini") -> tuple[float, str]:
    prompt = JUDGE_PROMPT.format(
        question=question,
        response=response,
        reference=reference,
    )
    result = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=256,
    )
    text = result.choices[0].message.content or ""

    # Accept only rubric values 1-5 and tolerate whitespace inside the tag.
    score_match = re.search(r"<score>\s*([1-5])\s*</score>", text)
    reasoning_match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)

    # A missing or malformed score becomes 0.0, so parse failures count as fails.
    score = int(score_match.group(1)) / 5.0 if score_match else 0.0
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
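
The consistency check from rule 4 can be sketched like this. To keep the example runnable without an API call, it takes any zero-argument callable that returns a score; the name `grader_consistency` and the reported fields are assumptions made for illustration, and the stub judge stands in for a real grader.

```python
from collections import Counter
from typing import Callable


def grader_consistency(judge: Callable[[], float], runs: int = 5) -> dict:
    """Call the same judge repeatedly and summarise how much its scores vary."""
    scores = [judge() for _ in range(runs)]
    modal_score, count = Counter(scores).most_common(1)[0]
    return {
        "scores": scores,
        "modal_score": modal_score,            # most frequent score
        "agreement": count / runs,             # share of runs that gave the modal score
        "spread": max(scores) - min(scores),   # 0.0 means perfectly stable
    }


# Stub judge for demonstration only: replays a fixed sequence of scores.
_fake_scores = iter([0.8, 0.8, 0.6, 0.8, 0.8])
print(grader_consistency(lambda: next(_fake_scores)))
```

With the real grader you would pass something like `lambda: llm_judge(question, response, reference)[0]`. Even at temperature 0 the scores can differ between calls, which is why the check is worth running.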

Metrics to track

Beyond accuracy, monitor these signals across your eval runs:

| Metric | What it measures | How to compute |
|--------|------------------|----------------|
| Accuracy | Fraction of outputs that pass the grader | passed / total |
| Hallucination rate | Fraction of outputs with fabricated facts | LLM-as-judge with a fact-check rubric |
| Format compliance | Fraction of outputs in the expected format (JSON, category name, etc.) | Parse + validate |
| p50/p95 latency | Response time distribution | Collect latency_ms per request |
| Cost per example | Tokens × price | total_tokens * price_per_token |

Track these as a time-series: one row per eval run with a timestamp and the current prompt hash. Plotting accuracy over time makes regressions immediately visible.
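
A minimal sketch of such a run log, assuming a local CSV file is good enough as the store. The column names, the 12-character truncated SHA-256 prompt hash, and the nearest-rank percentile method are all illustrative choices rather than requirements.

```python
import csv
import hashlib
import math
from datetime import datetime, timezone
from pathlib import Path


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def append_run(log_path: str, prompt: str, passed: list[bool],
               latencies_ms: list[float]) -> dict:
    """Append one row per eval run; write the header only for a new or empty file."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "prompt_hash": prompt_hash(prompt),
        "examples": len(passed),
        "accuracy": round(sum(passed) / len(passed), 4),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }
    path = Path(log_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Hashing the prompt text means two runs with the same hash are directly comparable, and a change in the hash marks the exact run where a prompt edit landed.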


Test set construction

The quality of your eval is bounded by the quality of your dataset. Fifty bad examples tell you less than twenty representative ones.

Size. Aim for 50–200 examples to start. Fewer than 50 gives high variance (at 50 examples, each one already moves your score by 2 percentage points). More than 200 has diminishing returns for most tasks unless you have strong class imbalance.
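
To make the variance point concrete, here is a sketch that puts a confidence interval around an accuracy score. It uses the Wilson score interval, which is one common choice for binomial proportions and not something the harness in this guide computes; the function name is hypothetical and `z = 1.96` assumes a 95% interval.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Same 90% accuracy, two dataset sizes: the larger set gives a tighter interval.
for passed, total in [(45, 50), (180, 200)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: {low:.3f} to {high:.3f}")
```

If the intervals of two prompt versions overlap heavily, the difference between them may be noise rather than a real improvement.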

Coverage. Your dataset must cover:

  • The happy path (clear, easy inputs)
  • Edge cases (ambiguous inputs, inputs near decision boundaries)
  • Known failure modes (inputs you have seen break the model before)
  • Each output class roughly proportionally (or deliberately oversampling rare categories)

Collection strategies:

  1. Production logs. Export real user inputs (with PII stripped) from your existing system. These are the most valuable because they represent actual usage.
  2. Adversarial generation. Prompt a model to generate inputs designed to trick your classifier: "Generate ten ambiguous customer support messages that could reasonably be classified as either billing or technical."
  3. Manual curation. Domain experts write examples targeting known hard cases.

Labelling. For classification tasks, two independent labellers plus a tiebreaker reduce label noise significantly. Track inter-annotator agreement: if two humans disagree on 20% of examples, your task definition is ambiguous and the model cannot be expected to do better.
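
Raw percentage agreement overstates how well labellers agree when one class dominates. A common correction is Cohen's kappa, sketched below; it is offered as one option for tracking inter-annotator agreement, and the function name and example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two labellers over the same examples."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labellers assigned classes independently
    # at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both labellers used a single identical class throughout
    return (observed - expected) / (1 - expected)


a = ["billing", "billing", "technical", "technical"]
b = ["billing", "technical", "technical", "technical"]
print(cohens_kappa(a, b))
```

A kappa of 1.0 means perfect agreement and 0.0 means no better than chance, so a low value is a signal to tighten the category definitions before blaming the model.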


Regression testing: before and after prompt changes

Treat prompt changes like code changes. Before merging a prompt update:

# Save current score
python eval.py --prompt prompts/v1.txt --output results/v1.json
 
# Edit your prompt...
 
# Score the new version
python eval.py --prompt prompts/v2.txt --output results/v2.json
 
# Compare
python compare.py results/v1.json results/v2.json

A minimal comparison script:

import json, sys
 
def load(path):
    with open(path) as f:
        return json.load(f)
 
def compare(path_a, path_b):
    a = load(path_a)
    b = load(path_b)
    acc_a = sum(r["passed"] for r in a) / len(a)
    acc_b = sum(r["passed"] for r in b) / len(b)
    delta = acc_b - acc_a
    symbol = "+" if delta >= 0 else ""
    print(f"Accuracy: {acc_a:.1%} -> {acc_b:.1%}  ({symbol}{delta:.1%})")
 
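    # Show examples that changed (only IDs present in both runs are compared,
    # so examples newly added to the dataset are not miscounted as improvements)
    by_id_a = {r["id"]: r for r in a}
    common = [r for r in b if r["id"] in by_id_a]
    regressions = [r for r in common if not r["passed"] and by_id_a[r["id"]]["passed"]]
    improvements = [r for r in common if r["passed"] and not by_id_a[r["id"]]["passed"]]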
 
    if regressions:
        print(f"\nRegressions ({len(regressions)}):")
        for r in regressions:
            print(f"  [{r['id']}] {r['input']!r}")
    if improvements:
        print(f"\nImprovements ({len(improvements)}):")
        for r in improvements:
            print(f"  [{r['id']}] {r['input']!r}")
 
compare(sys.argv[1], sys.argv[2])

Never ship a prompt change that reduces accuracy on your eval set without a deliberate, documented decision.


Automated evals in CI/CD

The highest-leverage thing you can do is run evals automatically on every pull request. This catches regressions before they reach production.

A minimal GitHub Actions workflow:

# .github/workflows/evals.yml
name: LLM Evals
 
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/llm/**"
 
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install openai numpy
      - run: python eval.py --threshold 0.85
        env:
          OPENAI_API_KEY: ${{ secrets.AICREDITS_API_KEY }}
          OPENAI_BASE_URL: "https://api.aicredits.in/v1"

The --threshold flag makes the eval step fail the CI build if accuracy drops below 85%. Adjust based on your task requirements.
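
The harness shown earlier does not define a --threshold flag, so here is one way it could be added. This is a sketch under assumptions: `parse_args` and `exit_code` are hypothetical helpers, and the default of 0.0 (never fail) is an arbitrary choice.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the CLI flags for the eval script."""
    parser = argparse.ArgumentParser(description="Run the eval and gate on accuracy.")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.0,
        help="minimum accuracy (0 to 1) required for a zero exit code",
    )
    return parser.parse_args(argv)


def exit_code(accuracy: float, threshold: float) -> int:
    """Return 0 when accuracy meets the threshold, 1 otherwise (non-zero fails CI)."""
    return 0 if accuracy >= threshold else 1


args = parse_args(["--threshold", "0.85"])
print(exit_code(accuracy=0.90, threshold=args.threshold))
print(exit_code(accuracy=0.80, threshold=args.threshold))
```

In the real script you would call `parse_args(sys.argv[1:])` and finish with `sys.exit(exit_code(accuracy, args.threshold))`. The workflow above also passes credentials through OPENAI_API_KEY and OPENAI_BASE_URL; the OpenAI Python client reads both from the environment when constructed as `OpenAI()` with no arguments, so the hardcoded key in the harness should be dropped for CI.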

To keep CI costs low, maintain a small "smoke" dataset (20–30 examples) that runs on every PR, and a full dataset that runs on merges to main.
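
One way to derive the smoke dataset is to sample a fixed number of examples per category from the full set, so rare classes are not dropped by chance. The sketch assumes examples are dicts with an "expected" label, as in the worked example later in this guide; `smoke_subset` and its defaults are hypothetical.

```python
import random
from collections import defaultdict


def smoke_subset(dataset: list[dict], per_class: int = 5, seed: int = 0) -> list[dict]:
    """Pick up to per_class examples from each label, reproducibly."""
    rng = random.Random(seed)  # fixed seed keeps the smoke set stable across PRs
    by_label: dict[str, list[dict]] = defaultdict(list)
    for example in dataset:
        by_label[example["expected"]].append(example)
    subset = []
    for label in sorted(by_label):  # sorted so label order cannot change the sample
        group = by_label[label]
        subset.extend(rng.sample(group, min(per_class, len(group))))
    return subset


full = [
    {"id": f"{label}-{i}", "expected": label}
    for label in ("billing", "technical", "account", "general")
    for i in range(10)
]
print(len(smoke_subset(full)))
```

Keeping the seed fixed matters here: if the smoke set changed on every run, score differences between PRs would partly reflect the sample rather than the prompt.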


Common failure modes

Prompt sensitivity. Small wording changes can shift accuracy by 5–15%. If you change "Respond with only the category" to "Only respond with the category", you may get different results. This is normal — it is exactly why you need an eval to detect it.

Provider differences. The same prompt behaves differently across providers. GPT-4o Mini and Claude Haiku have different default behaviours for capitalization, punctuation in structured outputs, and handling of edge cases. If you evaluate on one provider and deploy on another, your eval scores are not valid. Test on the model you will run in production.

Context length effects. Models perform measurably worse on tasks near their context limit. If your production inputs can be long, include long-context examples in your eval set. A common mistake is building an eval with short examples and then deploying against user-uploaded documents.

Distribution shift. Your eval set reflects your data at collection time. If user behaviour changes — new terminology, new product features, different query patterns — your eval set becomes stale. Schedule a quarterly review to add new examples from recent production logs.


Eval tooling overview

| Tool | Strengths | Weaknesses |
|------|-----------|------------|
| PromptFoo | Open-source, YAML config, built-in graders, CI integration | Limited custom grader flexibility |
| Braintrust | Hosted, experiment tracking, dataset management, rich UI | Paid for teams, vendor lock-in |
| OpenAI Evals | Reference framework, good LLM-as-judge patterns | Opinionated structure, Python-only |
| Custom harness (this guide) | Full control, no vendor dependency, minimal overhead | You build and maintain it |

For most teams building on top of an LLM API, a custom harness gets you 80% of the value in an afternoon. Graduate to a managed tool when you need experiment tracking across many concurrent prompt experiments.


Budget-conscious evals with AICredits

Running evals on every PR adds API cost. The key is using cheaper models for evaluation wherever the task allows it.

Use mini models for classification evals. gpt-4o-mini costs roughly 1/17th of GPT-4o. On simple classification tasks graded by exact match, it often scores close to the larger model; run the eval on both to confirm before you switch.

Use cheaper models as judge. For LLM-as-judge, claude-3-5-haiku or gpt-4o-mini as grader is accurate enough for most rubrics. Reserve frontier models for calibrating your grader periodically, not for running it at scale.

Estimate your CI budget:

# gpt-4o-mini pricing (approximate)
INPUT_COST_PER_1M = 0.15   # USD
OUTPUT_COST_PER_1M = 0.60  # USD
 
examples = 100
avg_input_tokens = 150   # prompt + user message
avg_output_tokens = 10   # short structured output
 
input_cost = (examples * avg_input_tokens / 1_000_000) * INPUT_COST_PER_1M
output_cost = (examples * avg_output_tokens / 1_000_000) * OUTPUT_COST_PER_1M
total_usd = input_cost + output_cost
 
print(f"Cost per eval run: ${total_usd:.4f}")
# -> about $0.003 per run (0.00225 input + 0.0006 output)

A 100-example eval on GPT-4o Mini costs roughly $0.003 per run. At 50 runs per day, that is about $0.14/day, well under any reasonable budget. Use AICredits' prepaid INR wallet to keep costs predictable and avoid unexpected charges from direct provider billing.

client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",  # prepaid INR wallet, no USD card needed
)

Worked example: evaluating a customer support classifier

Here is the full end-to-end eval for a real task — classifying inbound support messages into four categories.

import json
import time
import re
from dataclasses import dataclass, field
from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
# Prompt under test
SYSTEM_PROMPT = """Classify the customer message into exactly one category:
billing, technical, account, general
 
Rules:
- billing: payment issues, invoices, refunds, subscription charges
- technical: API errors, bugs, integration problems, performance issues
- account: login, password, profile settings, access permissions
- general: pricing questions, product info, business hours, other
 
Respond with only the category name in lowercase."""
 
DATASET = [
    {"id": "001", "input": "I was charged twice for my subscription", "expected": "billing"},
    {"id": "002", "input": "The API returns 429 errors under load", "expected": "technical"},
    {"id": "003", "input": "How do I change my email address?", "expected": "account"},
    {"id": "004", "input": "Do you offer a free trial?", "expected": "general"},
    {"id": "005", "input": "My credit card was declined", "expected": "billing"},
    {"id": "006", "input": "Embeddings endpoint is timing out", "expected": "technical"},
    {"id": "007", "input": "I forgot my password", "expected": "account"},
    {"id": "008", "input": "What models do you support?", "expected": "general"},
    {"id": "009", "input": "I need a refund for last month", "expected": "billing"},
    {"id": "010", "input": "Getting a 401 on every request", "expected": "technical"},
    {"id": "011", "input": "Can I add a team member to my account?", "expected": "account"},
    {"id": "012", "input": "Is there a Python SDK?", "expected": "technical"},
    {"id": "013", "input": "My invoice shows the wrong amount", "expected": "billing"},
    {"id": "014", "input": "How do rate limits work?", "expected": "general"},
    {"id": "015", "input": "I cannot log in after changing my password", "expected": "account"},
]
 
VALID_CATEGORIES = {"billing", "technical", "account", "general"}
 
def run_example(ex: dict, model: str = "gpt-4o-mini") -> dict:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ex["input"]},
        ],
        temperature=0,
        max_tokens=16,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    output = resp.choices[0].message.content.strip().lower()
    tokens = resp.usage.total_tokens
 
    exact = output == ex["expected"]
    valid_format = output in VALID_CATEGORIES
 
    return {
        "id": ex["id"],
        "input": ex["input"],
        "expected": ex["expected"],
        "output": output,
        "passed": exact,
        "format_ok": valid_format,
        "latency_ms": round(latency_ms, 1),
        "tokens": tokens,
    }
 
def run_eval(dataset: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    print(f"Running eval on {len(dataset)} examples with model={model}")
    return [run_example(ex, model=model) for ex in dataset]
 
def report(results: list[dict]):
    n = len(results)
    accuracy = sum(r["passed"] for r in results) / n
    format_rate = sum(r["format_ok"] for r in results) / n
    avg_latency = sum(r["latency_ms"] for r in results) / n
    total_tokens = sum(r["tokens"] for r in results)
 
    print(f"\n{'='*55}")
    print(f"{'Metric':<25} {'Value':>10}")
    print(f"{'='*55}")
    print(f"{'Accuracy':<25} {accuracy:>9.1%}")
    print(f"{'Format compliance':<25} {format_rate:>9.1%}")
    print(f"{'Avg latency':<25} {avg_latency:>8.0f}ms")
    print(f"{'Total tokens':<25} {total_tokens:>10,}")
    print(f"{'='*55}")
 
    failures = [r for r in results if not r["passed"]]
    if failures:
        print(f"\nFailed ({len(failures)}/{n}):")
        for r in failures:
            mark = "BAD_FORMAT" if not r["format_ok"] else "WRONG"
            print(f"  [{r['id']}] {mark}: {r['input']!r}")
            print(f"         expected={r['expected']!r}  got={r['output']!r}")
    else:
        print("\nAll examples passed.")
 
if __name__ == "__main__":
    results = run_eval(DATASET)
    report(results)
 
    with open("eval_results.json", "w") as f:
        json.dump(results, f, indent=2)
    print("\nResults saved to eval_results.json")

A run typically takes around 15 seconds and costs well under half a cent. The script writes its results to eval_results.json; commit that file to your repo and use the comparison script from the regression testing section to diff runs across prompt versions.


Where to go from here

The pattern above covers the fundamentals. As your system grows, layer in:

  • Automatic dataset expansion. Log production requests that get low user satisfaction scores and feed them into your eval set.
  • Multi-turn evals. For agents and chatbots, evaluate full conversation flows, not just single-turn responses.
  • Latency budgets. Treat p95 latency as a hard constraint alongside accuracy. A more accurate prompt that adds 500ms may not be worth it.
  • Model comparison matrix. Run the same eval across multiple models to find the cheapest one that meets your accuracy threshold. With AICredits' unified API, swapping models is a one-line change.
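
The last two items can be combined into a simple selection rule: among the models that meet both the accuracy threshold and the latency budget, pick the cheapest. The sketch below assumes you have already run the eval per model and collected summary numbers; the model names and figures are invented placeholders, not measurements.

```python
from typing import Optional


def cheapest_passing(candidates: list[dict], min_accuracy: float,
                     max_p95_ms: float) -> Optional[dict]:
    """Return the lowest-cost candidate that meets both constraints, or None."""
    eligible = [
        c for c in candidates
        if c["accuracy"] >= min_accuracy and c["p95_ms"] <= max_p95_ms
    ]
    return min(eligible, key=lambda c: c["cost_per_run_usd"], default=None)


# Placeholder summaries: replace with the output of your own eval runs.
matrix = [
    {"model": "model-a", "accuracy": 0.95, "p95_ms": 900, "cost_per_run_usd": 0.050},
    {"model": "model-b", "accuracy": 0.91, "p95_ms": 600, "cost_per_run_usd": 0.004},
    {"model": "model-c", "accuracy": 0.84, "p95_ms": 400, "cost_per_run_usd": 0.002},
]
print(cheapest_passing(matrix, min_accuracy=0.90, max_p95_ms=1000))
```

Treating accuracy and latency as hard constraints and cost as the thing to minimise keeps the decision explicit, and returning None forces a conscious choice when no model qualifies.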

Evals are the engineering discipline that separates prototypes from production AI. The harness above takes an afternoon to set up and pays for itself the first time it catches a regression before it ships.
