Few-Shot vs Zero-Shot Prompting: When to Use Which (With Cost Benchmarks)

Adding examples to your prompt improves accuracy but costs more in tokens. Here is a practical framework for deciding when the quality gain is worth the extra spend.

Author

AICredits Team

Published

28 Mar 2026

Reading time

6 min read

Definitions

Zero-shot: instructions only, no examples. The model performs the task using its training knowledge.

Few-shot: a small number of input/output examples (typically 2–3) included in the prompt. The examples teach the model your specific format and style.

Beyond 5–6 examples, marginal gains diminish sharply while token cost rises linearly. Most practitioners find the sweet spot at 2–3 well-chosen examples.

When zero-shot is sufficient

Zero-shot works well for general tasks the model has seen abundant training data for: summarisation, translation, factual Q&A, standard sentiment classification, simple code generation.

For these tasks, a well-written instruction with an explicit output format specification often matches few-shot quality at zero extra token cost.
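
As a rough illustration of what an explicit output format specification can look like, the sketch below spells out the allowed labels in the instruction and validates the reply before it is used. It assumes the three urgency labels from the classification example in this article; `build_zero_shot_messages` and `parse_label` are hypothetical helper names, not part of any SDK.

```python
from typing import Optional

ALLOWED_LABELS = ("low", "medium", "high")

def build_zero_shot_messages(text: str) -> list[dict[str, str]]:
    """Build a zero-shot prompt that states the output format explicitly."""
    instruction = (
        "Classify support ticket urgency. "
        f"Respond with exactly one of: {', '.join(ALLOWED_LABELS)}. "
        "Use lowercase, no punctuation, no explanation."
    )
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": text},
    ]

def parse_label(raw: str) -> Optional[str]:
    """Normalise a model reply; return None if it is not an allowed label."""
    cleaned = raw.strip().strip(".!\"'").lower()
    return cleaned if cleaned in ALLOWED_LABELS else None
```

A reply that fails `parse_label` is a cheap signal that the instruction alone is not holding the format, which is one of the cases where adding examples tends to help.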

When few-shot significantly helps

from openai import OpenAI
 
client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")
 
# Zero-shot — model may use inconsistent output format
def zero_shot_classify(text: str) -> str:
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify support ticket urgency as: low, medium, or high. Return only the label."},
            {"role": "user",   "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
 
# Few-shot — examples anchor the exact output format
FEW_SHOT = [
    {"role": "user",      "content": "My order hasn't arrived after 3 weeks."},
    {"role": "assistant", "content": "high"},
    {"role": "user",      "content": "Can I change my billing email?"},
    {"role": "assistant", "content": "low"},
    {"role": "user",      "content": "The export feature is broken on my account."},
    {"role": "assistant", "content": "medium"},
]
 
def few_shot_classify(text: str) -> str:
    messages = [
        # Same system prompt as zero_shot_classify, so the examples are the only difference
        {"role": "system", "content": "Classify support ticket urgency as: low, medium, or high. Return only the label."},
        *FEW_SHOT,
        {"role": "user",   "content": text},
    ]
    response = client.chat.completions.create(model="openai/gpt-4o-mini", messages=messages)
    return response.choices[0].message.content.strip()
 
ticket = "Payment keeps failing but I have sufficient balance."
print("Zero-shot:", zero_shot_classify(ticket))
print("Few-shot: ", few_shot_classify(ticket))

Few-shot consistently wins for: entity extraction from domain-specific text (medical, legal, financial), intent classification with overlapping categories, and any task where the model defaults to a different format than you need.

The cost tradeoff

Each example pair costs roughly 100–300 tokens. At Claude Sonnet input prices (₹289/M tokens), three 200-token examples add 600 input tokens, or about ₹0.17, to every request:

| Volume | Extra cost (3 examples) |
|--------|------------------------|
| 1,000 requests | ₹173.40 |
| 100,000 requests | ₹17,340 |
| 1,000,000 requests | ₹173,400 |

At 1M requests/month, three examples cost about ₹173,400. Compare that to the cost of human review time for zero-shot errors: if the examples prevent errors on even 0.1% of requests (1,000 tickets), they break even once each avoided error would have cost about ₹173 to review. Above that threshold, the few-shot investment wins.
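
The extra spend depends on only three numbers: examples per request, tokens per example, and the input price. A minimal sketch for plugging in your own values follows; the function name is hypothetical, and the defaults are the figures quoted in this section (three 200-token examples at ₹289 per million input tokens).

```python
def extra_input_cost_inr(
    requests: int,
    examples: int = 3,
    tokens_per_example: int = 200,
    price_per_million_inr: float = 289.0,
) -> float:
    """Extra input-token cost (in INR) of adding few-shot examples to every request."""
    extra_tokens = requests * examples * tokens_per_example
    return extra_tokens * price_per_million_inr / 1_000_000

for volume in (1_000, 100_000, 1_000_000):
    print(f"{volume:>9,} requests: INR {extra_input_cost_inr(volume):,.2f}")
```

Only input tokens are counted here; the output is a single label in both the zero-shot and few-shot variants, so output cost should be about the same.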

Selecting good examples

Choose examples representative of the most common case, not interesting edge cases
Include at least one negative example when the model has strong defaults you need to override
Keep examples short — one good example of the common case beats five edge-case examples
Test your examples on a held-out set before deploying
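
The held-out check in the last point can be as simple as an accuracy function run against both variants. The sketch below assumes `classify` is any function that maps a ticket to a label (for instance the zero-shot and few-shot classifiers from this article) and that the held-out pairs do not overlap with the examples in the prompt. The `accuracy` helper and the `HELD_OUT` name are hypothetical.

```python
from typing import Callable

def accuracy(classify: Callable[[str], str], held_out: list[tuple[str, str]]) -> float:
    """Fraction of held-out (text, expected_label) pairs the classifier gets right."""
    if not held_out:
        raise ValueError("held_out must contain at least one labelled example")
    correct = sum(
        1 for text, expected in held_out
        if classify(text).strip().lower() == expected
    )
    return correct / len(held_out)

# Usage (makes API calls, so shown as comments):
# print("zero-shot:", accuracy(zero_shot_classify, HELD_OUT))
# print("few-shot: ", accuracy(few_shot_classify, HELD_OUT))
```

Running both variants on the same held-out set gives the quality gain to weigh against the added token cost.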

Dynamic few-shot: selecting examples at runtime

For tasks with diverse input types, retrieve the most relevant examples at request time:

from openai import OpenAI
import numpy as np
 
client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")
 
# Example bank: (input, label) pairs
EXAMPLE_BANK = [
    ("My order hasn't arrived after 3 weeks.", "high"),
    ("Can I change my billing email?", "low"),
    ("The export feature is broken.", "medium"),
    ("API keys are not working.", "medium"),
    ("I want to cancel my subscription.", "low"),
    ("My account has been charged twice.", "high"),
]
 
def get_embedding(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding
 
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
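# Embeddings for EXAMPLE_BANK, filled on first use and then reused
_BANK_EMBS: list[list[float]] = []
 
def dynamic_few_shot_classify(query: str, k: int = 2) -> str:
    query_emb = get_embedding(query)
    # Embed the bank only once; re-embedding it per request would add one API call per example
    if not _BANK_EMBS:
        _BANK_EMBS.extend(get_embedding(ex[0]) for ex in EXAMPLE_BANK)
    bank_embs = _BANK_EMBS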
 
    scored = sorted(zip(EXAMPLE_BANK, bank_embs), key=lambda x: cosine_similarity(query_emb, x[1]), reverse=True)
    top_k = [ex for ex, _ in scored[:k]]
 
    messages = [{"role": "system", "content": "Classify urgency as: low, medium, or high. Return only the label."}]
    for inp, label in top_k:
        messages += [{"role": "user", "content": inp}, {"role": "assistant", "content": label}]
    messages.append({"role": "user", "content": query})
 
    response = client.chat.completions.create(model="openai/gpt-4o-mini", messages=messages)
    return response.choices[0].message.content.strip()

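The embedding and retrieval step adds roughly 50 ms and a small embedding cost per request (one embedding call for the query, provided the example bank's embeddings are computed once and reused), and it tends to outperform static examples on varied inputs.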
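
The ranking logic itself can be checked without any API calls if it is kept separate from the embedding call. The following is a sketch under that assumption: `top_k_examples` takes precomputed vectors, and `toy_embed` is a hypothetical keyword-count stand-in for a real embedding model, meant for offline tests only.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_examples(query_vec, bank, bank_vecs, k=2):
    """Return the k (input, label) pairs whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(bank, bank_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [example for example, _ in ranked[:k]]

# Toy embedding for offline tests only: counts of a few hand-picked keywords.
KEYWORDS = ("order", "billing", "export", "charged")

def toy_embed(text: str) -> list[float]:
    words = text.lower().replace(".", " ").replace("?", " ").split()
    return [float(words.count(kw)) for kw in KEYWORDS]
```

In production the vectors would come from the real embedding model; the toy version only makes it possible to verify that the nearest examples are the ones being selected.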
