Building a Simple LLM Router in Python (Best Model for Each Task)

Route simple tasks to cheap models and demanding tasks to capable ones. A practical Python implementation that can cut API spend by 40–70% with minimal quality loss.

Author

AICredits Team

Published

26 Mar 2026

Reading time

8 min read

Why routing saves money

Most AI applications have a mix of task types: some require frontier model quality (complex reasoning, nuanced generation), and some are easily handled by cheaper models (classification, extraction, simple Q&A). Using GPT-4o for everything means paying frontier prices for tasks that GPT-4o Mini could handle at 17× lower cost.

A router classifies each request and dispatches it to the appropriate model. A well-tuned router typically reduces per-request cost by 40–70% with less than 2% quality degradation.

Model tiers

| Tier | Models | Input cost (INR per 1M tokens) | Best for |
|------|--------|--------------------------------|----------|
| Simple | GPT-4o Mini, Gemini Flash | ₹7–14 | Classification, extraction, short Q&A |
| Medium | Claude 3.5 Haiku | ₹96 | Summarisation, structured generation |
| Complex | Claude 3.5 Sonnet, GPT-4o | ₹240–289 | Long-form writing, code, multi-step reasoning |
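To see how the tier prices translate into savings, here is a rough back-of-the-envelope sketch. It uses the upper end of each input price range from the table; the traffic mix is a hypothetical example, and the calculation assumes every tier receives prompts of similar length and ignores output-token pricing.

```python
# Input prices in INR per 1M tokens (upper end of each range in the tier table).
PRICE_PER_M = {"simple": 14.0, "medium": 96.0, "complex": 289.0}


def blended_price(mix: dict[str, float]) -> float:
    """Average input price per 1M tokens for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * PRICE_PER_M[tier] for tier, share in mix.items())


def saving_vs_complex(mix: dict[str, float]) -> float:
    """Fractional saving compared with sending everything to the complex tier."""
    return 1.0 - blended_price(mix) / PRICE_PER_M["complex"]


# Hypothetical traffic mix: 40% simple, 30% medium, 30% complex.
mix = {"simple": 0.4, "medium": 0.3, "complex": 0.3}
print(f"blended: ₹{blended_price(mix):.2f}/M, saving: {saving_vs_complex(mix):.0%}")
# blended: ₹121.10/M, saving: 58%
```

Plugging in your own traffic mix shows quickly whether routing is worth the engineering effort for your workload.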

Rule-based router (zero latency, zero cost)

from openai import OpenAI
 
client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")
 
MODELS = {
    "simple":  "openai/gpt-4o-mini",
    "medium":  "anthropic/claude-3-5-haiku-20241022",
    "complex": "anthropic/claude-3-5-sonnet-20241022",
}
 
COMPLEX_KEYWORDS = {"analyse", "analyze", "detailed", "compare", "refactor", "architecture", "design", "debug"}
SIMPLE_KEYWORDS  = {"classify", "extract", "yes or no", "true or false", "label", "categorise", "categorize"}
 
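def _matches(text: str, keywords: set[str]) -> bool:
    # Single words are matched as whole tokens with punctuation stripped;
    # multi-word phrases such as "yes or no" are matched as substrings.
    words = {w.strip(".,:;!?\"'()") for w in text.split()}
    return any((k in text) if " " in k else (k in words) for k in keywords)

def route(prompt: str) -> str:
    text = prompt.lower()
    if _matches(text, COMPLEX_KEYWORDS):
        return MODELS["complex"]
    if _matches(text, SIMPLE_KEYWORDS) or len(prompt.split()) < 50:
        return MODELS["simple"]
    return MODELS["medium"]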
 
def ask(prompt: str) -> str:
    model = route(prompt)
    print(f"→ routing to: {model}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
 
# Simple → cheap model (~₹0.002 per call)
print(ask("Classify as positive or negative: 'The product arrived on time.'"))
 
# Complex → capable model (~₹0.05 per call)
print(ask("Analyse the tradeoffs of microservices vs monolith for a 5-person startup."))
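Before sending real traffic through a rule-based router, it can help to dry-run it over a sample of logged prompts and check how the traffic would be split. The sketch below makes no API calls. `tier_distribution` and `demo_route` are hypothetical names used for illustration; in practice you would pass the `route` function defined above.

```python
from collections import Counter
from typing import Callable, Iterable


def tier_distribution(
    prompts: Iterable[str], route_fn: Callable[[str], str]
) -> dict[str, float]:
    """Share of prompts that route_fn sends to each model. Makes no API calls."""
    counts = Counter(route_fn(p) for p in prompts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: n / total for model, n in counts.items()}


# Stand-in router for illustration only; substitute the real route() here.
def demo_route(prompt: str) -> str:
    return "complex" if "debug" in prompt.lower() else "simple"


sample = [
    "Classify this ticket as billing or technical.",
    "Extract the invoice number from this email.",
    "Debug this stack trace and suggest a fix.",
    "Is this review positive? Yes or no.",
]
print(tier_distribution(sample, demo_route))
```

If almost everything lands in one tier, the keyword sets or the 50-word threshold probably need adjusting.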

LLM-based router (more accurate, ~₹0.003 overhead)

For tasks where rules are insufficient, use a tiny model to classify complexity:

def llm_route(prompt: str) -> str:
    """Use GPT-4o Mini to classify task complexity. Costs ~₹0.003."""
    classification = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify the complexity of this request as: simple, medium, or complex. Return only one word.",
        }, {
            "role": "user",
            "content": prompt[:200],  # truncated to keep the call cheap; assumes the task is evident from the first 200 characters
        }],
        max_tokens=5,
    )
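    # content can be None; also tolerate replies such as "Simple." or "complex\n".
    tier = (classification.choices[0].message.content or "").strip().lower().rstrip(".")
    # Fall back to the medium tier if the reply is not a known tier name.
    return MODELS.get(tier, MODELS["medium"])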
 
def smart_ask(prompt: str) -> str:
    model = llm_route(prompt)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Cache the routing decision for repeated queries: if the same request comes in again, reuse the stored classification instead of making another API call.
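One way to cache routing decisions is a small in-memory map keyed by a hash of the normalised prompt. The sketch below is a minimal illustration, not a production cache: `RouteCache` and `fake_classifier` are hypothetical names, the classifier is injected so that `llm_route` could be passed in, and only exact matches after lowercasing and whitespace normalisation count as hits. Matching requests that are merely similar would need something like embedding similarity, which is not covered here.

```python
import hashlib
from typing import Callable


class RouteCache:
    """In-memory cache mapping a normalised prompt to its routing decision."""

    def __init__(self, classify: Callable[[str], str], max_entries: int = 10_000):
        self._classify = classify
        self._max = max_entries
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        model = self._classify(prompt)
        if len(self._store) >= self._max:
            # Evict the oldest entry (dicts preserve insertion order).
            self._store.pop(next(iter(self._store)))
        self._store[key] = model
        return model


# Stand-in classifier that records how often it is called.
calls = []

def fake_classifier(prompt: str) -> str:
    calls.append(prompt)
    return "openai/gpt-4o-mini"

cache = RouteCache(fake_classifier)
cache.route("Classify this review")
cache.route("  classify   THIS review ")  # same prompt after normalisation
print(cache.hits, cache.misses)  # 1 1
```

The eviction policy here is simple first-in, first-out; an LRU policy or an external store such as Redis may suit high-traffic services better.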

Measuring routing quality

A router is only as good as its tier boundaries. Sample 100 requests and run each through both the routed model and the most capable tier, which serves as the reference:

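def evaluate_routing(test_prompts: list[str], expected_quality: list[str]) -> dict:
    # Assumption: expected_quality[i] names the tier ("simple", "medium" or "complex")
    # that a reviewer considers sufficient for test_prompts[i].
    tiers = list(MODELS)  # ordered from cheapest to most capable
    results = {"correct": 0, "downgraded": 0, "total": len(test_prompts)}

    for prompt, expected in zip(test_prompts, expected_quality):
        routed_model  = route(prompt)
        routed_tier   = next(t for t, m in MODELS.items() if m == routed_model)
        routed_result = ask(prompt)
        ideal_result  = client.chat.completions.create(
            model=MODELS["complex"],  # always use best model as ground truth
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

        if routed_tier == expected:
            results["correct"] += 1
        elif tiers.index(routed_tier) < tiers.index(expected):
            results["downgraded"] += 1  # routed to a cheaper tier than expected

        # Compare manually or with an LLM judge
        print(f"Routed ({routed_model}): {routed_result[:100]}")
        print(f"Ideal:                  {ideal_result[:100]}\n")

    return results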

If the routed model's output matches the reference output in quality on 95% or more of the sampled requests, your routing is well calibrated.
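The 95% check can be turned into a small helper once you have pairs of routed and reference outputs. The sketch below is an assumption-laden illustration: `agreement_rate`, `is_well_calibrated`, and `naive_judge` are hypothetical names, and the judge is injected as a callable so it can be a human verdict lookup or an LLM judge. The placeholder judge shown only does a normalised exact match, which suits classification-style outputs and little else.

```python
from typing import Callable


def agreement_rate(
    pairs: list[tuple[str, str]], judge: Callable[[str, str], bool]
) -> float:
    """Fraction of (routed, reference) output pairs the judge accepts."""
    if not pairs:
        raise ValueError("need at least one pair")
    accepted = sum(1 for routed, reference in pairs if judge(routed, reference))
    return accepted / len(pairs)


def is_well_calibrated(rate: float, threshold: float = 0.95) -> bool:
    """Apply the 95% rule of thumb from the text; the threshold is adjustable."""
    return rate >= threshold


# Placeholder judge: exact match after trimming whitespace and lowercasing.
def naive_judge(routed: str, reference: str) -> bool:
    return routed.strip().lower() == reference.strip().lower()


pairs = [
    ("Positive", "positive"),
    ("Negative", "negative "),
    ("Neutral", "positive"),
    ("yes", "Yes"),
]
rate = agreement_rate(pairs, naive_judge)
print(rate, is_well_calibrated(rate))  # 0.75 False
```

For free-form outputs, replace `naive_judge` with a rubric-based comparison, since two good answers rarely match word for word.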
