
Building a Simple LLM Router in Python (Best Model for Each Task)
Route cheap tasks to cheap models and expensive tasks to capable ones. A practical Python implementation that cuts API spend by 40–70% without sacrificing quality.
Author: AICredits Team
Published: 26 Mar 2026
Reading time: 8 min read
Why routing saves money
Most AI applications have a mix of task types: some require frontier model quality (complex reasoning, nuanced generation), and some are easily handled by cheaper models (classification, extraction, simple Q&A). Using GPT-4o for everything means paying frontier prices for tasks that GPT-4o Mini could handle at 17× lower cost.
A router classifies each request and dispatches it to the appropriate model. A well-tuned router typically reduces per-request cost by 40–70% with less than 2% quality degradation.
Model tiers
| Tier | Models | Input cost (INR/M tokens) | Best for |
|------|--------|---------------------------|----------|
| Simple | GPT-4o Mini, Gemini Flash | ₹7–14 | Classification, extraction, short Q&A |
| Medium | Claude 3.5 Haiku | ₹96 | Summarisation, structured generation |
| Complex | Claude 3.5 Sonnet, GPT-4o | ₹240–289 | Long-form writing, code, multi-step reasoning |
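To see how these tier prices turn into savings, the sketch below computes a blended input price for a traffic mix and compares it with sending everything to the complex tier. The prices are the upper bounds from the table above; the 40/30/30 traffic split and the helper names `blended_price` and `savings_vs_all_complex` are assumptions for illustration, not measured figures. Output-token prices are not covered.

```python
# Input prices in INR per million tokens, taken from the table above.
# Where the table gives a range, the upper bound is used (a conservative choice).
PRICE_PER_M = {"simple": 14.0, "medium": 96.0, "complex": 289.0}

# Hypothetical traffic mix: share of input tokens handled by each tier.
TRAFFIC_MIX = {"simple": 0.40, "medium": 0.30, "complex": 0.30}


def blended_price(prices: dict[str, float], mix: dict[str, float]) -> float:
    """Weighted average input price (INR per million tokens) for a traffic mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(prices[tier] * share for tier, share in mix.items())


def savings_vs_all_complex(prices: dict[str, float], mix: dict[str, float]) -> float:
    """Fraction of input spend saved versus sending everything to the complex tier."""
    return 1.0 - blended_price(prices, mix) / prices["complex"]


if __name__ == "__main__":
    routed = blended_price(PRICE_PER_M, TRAFFIC_MIX)
    saved = savings_vs_all_complex(PRICE_PER_M, TRAFFIC_MIX)
    print(f"blended: ₹{routed:.2f}/M tokens, saved: {saved:.0%}")
```

With this assumed mix the blended price is about ₹121 per million input tokens, roughly 58% below the all-complex baseline. Your own split will move that number, so measure your traffic before relying on it.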
Rule-based router (zero latency, zero cost)
```python
import re

from openai import OpenAI

client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")

MODELS = {
    "simple": "openai/gpt-4o-mini",
    "medium": "anthropic/claude-3-5-haiku-20241022",
    "complex": "anthropic/claude-3-5-sonnet-20241022",
}

COMPLEX_KEYWORDS = {"analyse", "analyze", "detailed", "compare", "refactor", "architecture", "design", "debug"}
SIMPLE_KEYWORDS = {"classify", "extract", "label", "categorise", "categorize"}
# Multi-word phrases can never match a set of single words, so check them as substrings.
SIMPLE_PHRASES = ("yes or no", "true or false")


def route(prompt: str) -> str:
    text = prompt.lower()
    # Tokenise on letters only so punctuation ("debug," or "classify:") does not block a match.
    words = set(re.findall(r"[a-z]+", text))
    if words & COMPLEX_KEYWORDS:
        return MODELS["complex"]
    is_simple = bool(words & SIMPLE_KEYWORDS) or any(p in text for p in SIMPLE_PHRASES)
    if is_simple or len(prompt.split()) < 50:
        return MODELS["simple"]
    return MODELS["medium"]


def ask(prompt: str) -> str:
    model = route(prompt)
    print(f"→ routing to: {model}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Simple → cheap model (~₹0.002 per call)
print(ask("Classify as positive or negative: 'The product arrived on time.'"))
# Complex → capable model (~₹0.05 per call)
print(ask("Analyse the tradeoffs of microservices vs monolith for a 5-person startup."))
```
LLM-based router (more accurate, ~₹0.003 overhead)
For tasks where rules are insufficient, use a tiny model to classify complexity:
```python
def llm_route(prompt: str) -> str:
    """Use GPT-4o Mini to classify task complexity. Costs ~₹0.003."""
    classification = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify the complexity of this request as: simple, medium, or complex. Return only one word.",
        }, {
            "role": "user",
            "content": prompt[:200],  # the first 200 chars are enough for classification
        }],
        max_tokens=5,
    )
    # content can be None; also drop stray punctuation such as "Simple."
    tier = (classification.choices[0].message.content or "").strip().lower().strip(".")
    # Fall back to the medium tier if the reply is not a known tier name.
    return MODELS.get(tier, MODELS["medium"])


def smart_ask(prompt: str) -> str:
    model = llm_route(prompt)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Cache the routing decision for repeated queries: if the same type of request comes in again, reuse the classification without an extra API call.
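One way to cache routing decisions is to key them on a normalised form of the prompt. The sketch below wraps any routing function in a small in-memory cache. `RouteCache` and `normalise` are hypothetical names. Treating "the same type of request" as "same text after lowercasing and whitespace collapsing" is an assumption; a production setup might use embeddings or a shared store instead.

```python
import hashlib
from typing import Callable


def normalise(prompt: str, prefix_chars: int = 200) -> str:
    """Lowercase, collapse whitespace, and keep roughly the prefix the classifier sees."""
    return " ".join(prompt.lower().split())[:prefix_chars]


class RouteCache:
    """Memoise a routing function so repeated prompts skip the classifier call."""

    def __init__(self, route_fn: Callable[[str], str], max_entries: int = 10_000):
        self._route_fn = route_fn
        self._max_entries = max_entries
        self._cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        model = self._route_fn(prompt)
        if len(self._cache) >= self._max_entries:
            # Evict the oldest entry (dicts preserve insertion order).
            self._cache.pop(next(iter(self._cache)))
        self._cache[key] = model
        return model
```

In the article's setup this would be used as `cached_route = RouteCache(llm_route)`, with `smart_ask` calling `cached_route(prompt)` in place of `llm_route(prompt)`. The `hits` and `misses` counters give a quick read on how many classifier calls the cache is actually saving.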
Measuring routing quality
A router is only as good as its tier boundaries. Sample 100 requests and run each through both the routed model and the most capable model, which serves as the reference:
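```python
TIER_RANK = {"simple": 0, "medium": 1, "complex": 2}
MODEL_TIER = {model: tier for tier, model in MODELS.items()}


def evaluate_routing(test_prompts: list[str], expected_quality: list[str]) -> dict:
    """expected_quality is assumed to hold the tier name a reviewer expects for each prompt."""
    results = {"correct": 0, "downgraded": 0, "total": len(test_prompts)}
    for prompt, expected in zip(test_prompts, expected_quality):
        routed_model = route(prompt)
        routed_tier = MODEL_TIER[routed_model]
        if routed_tier == expected:
            results["correct"] += 1
        elif TIER_RANK[routed_tier] < TIER_RANK[expected]:
            results["downgraded"] += 1  # routed to a cheaper tier than expected
        routed_result = ask(prompt)
        ideal_result = client.chat.completions.create(
            model=MODELS["complex"],  # always use best model as ground truth
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Compare manually or with an LLM judge
        print(f"Routed ({routed_model}): {routed_result[:100]}")
        print(f"Ideal: {ideal_result[:100]}\n")
    return results
```
If the routed model matches the reference on 95%+ of sampled requests, your routing is well calibrated.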
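The 95% check can be made mechanical once each routed/reference pair has a verdict. This is a minimal sketch, assuming a `judge` callable that returns `True` when the routed answer is acceptable against the reference. The helper names `match_rate`, `is_well_calibrated`, and `exact_match_judge` are hypothetical, and the exact-match judge is only a placeholder for manual labels or an LLM judge.

```python
from typing import Callable, Iterable


def match_rate(
    pairs: Iterable[tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> float:
    """Share of (routed, reference) answer pairs that the judge accepts."""
    verdicts = [judge(routed, reference) for routed, reference in pairs]
    if not verdicts:
        raise ValueError("need at least one pair to evaluate")
    return sum(verdicts) / len(verdicts)


def is_well_calibrated(rate: float, threshold: float = 0.95) -> bool:
    """Apply the 95% rule of thumb from the text."""
    return rate >= threshold


def exact_match_judge(routed: str, reference: str) -> bool:
    """Placeholder judge: case-insensitive exact match.

    Only suitable for short label-style answers; free-form text needs
    a human reviewer or an LLM judge.
    """
    return routed.strip().lower() == reference.strip().lower()
```

Keeping the judge as a parameter means the same scoring code works whether verdicts come from a spreadsheet of human labels or from a model call.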