Smart Model Routing: Using Cheap Models First, Expensive Models When Needed
Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing.
The Model Routing Problem
Most teams default to using their best (and most expensive) model for every request. A customer asking "What are your business hours?" gets the same GPT-4o treatment as someone asking for a complex multi-step analysis. This is like sending every package via overnight express shipping — it works, but it destroys your margins.
Smart model routing classifies requests by complexity and routes them to the cheapest model that can handle them well. In practice, 60–80% of agent queries are simple enough for a small, fast model, meaning you only need the expensive model for the remaining 20–40%.
Designing a Two-Tier Router
The simplest effective pattern uses two tiers: a fast/cheap model for straightforward requests and a powerful/expensive model for complex ones. A lightweight classifier decides which tier handles each request.
from dataclasses import dataclass
from enum import Enum

import openai


class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_ratio: float  # relative to always using the expensive model


TIER_CONFIG = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "max_tokens": 1024,
        "cost_ratio": 0.06,  # ~6% the cost of gpt-4o
    },
    Complexity.COMPLEX: {
        "model": "gpt-4o",
        "max_tokens": 4096,
        "cost_ratio": 1.0,
    },
}
class ModelRouter:
    def __init__(self, client: openai.OpenAI):
        self.client = client

    def classify_complexity(self, user_message: str) -> RoutingDecision:
        classification_prompt = (
            "Classify this user message as SIMPLE or COMPLEX.\n"
            "SIMPLE: factual lookups, greetings, yes/no questions, "
            "status checks, single-step tasks.\n"
            "COMPLEX: multi-step reasoning, analysis, code generation, "
            "creative writing, comparisons, ambiguous queries.\n"
            f"Message: {user_message}\n"
            "Respond with only SIMPLE or COMPLEX."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": classification_prompt}],
            max_tokens=10,
            temperature=0,
        )
        label = response.choices[0].message.content.strip().upper()
        complexity = Complexity.COMPLEX if "COMPLEX" in label else Complexity.SIMPLE
        config = TIER_CONFIG[complexity]
        return RoutingDecision(
            complexity=complexity,
            model=config["model"],
            reason=label,
            estimated_cost_ratio=config["cost_ratio"],
        )

    def route_and_respond(self, user_message: str, system_prompt: str) -> dict:
        decision = self.classify_complexity(user_message)
        config = TIER_CONFIG[decision.complexity]
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=config["max_tokens"],
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": decision.model,
            "complexity": decision.complexity.value,
            "cost_ratio": decision.estimated_cost_ratio,
        }
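To see the router in action, here is a minimal usage sketch that wires it to an OpenAI client and sends a question that should land in the cheap tier. The client setup, system prompt, and example question are placeholders; adapt them to your own agent.

# Minimal usage sketch; the system prompt and question are placeholders.
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
router = ModelRouter(client)

result = router.route_and_respond(
    user_message="What are your business hours?",
    system_prompt="You are a helpful support agent for an online store.",
)
print(result["model_used"], result["cost_ratio"])  # likely gpt-4o-mini, 0.06
print(result["response"])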
Adding Quality Gates
Routing is only valuable if quality stays high. Add a quality gate that catches cases where the cheap model underperforms and automatically retries with the expensive model.
class QualityGatedRouter(ModelRouter):
    def __init__(self, client: openai.OpenAI, quality_threshold: float = 0.7):
        super().__init__(client)
        self.quality_threshold = quality_threshold

    def check_response_quality(self, question: str, answer: str) -> float:
        check_prompt = (
            "Rate this answer's quality from 0.0 to 1.0.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Respond with only a number."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            max_tokens=5,
            temperature=0,
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5  # unparseable score: treat as borderline rather than failing

    def route_with_fallback(self, user_message: str, system_prompt: str) -> dict:
        result = self.route_and_respond(user_message, system_prompt)
        if result["complexity"] == "simple":
            score = self.check_response_quality(user_message, result["response"])
            if score < self.quality_threshold:
                # Escalate straight to the expensive tier instead of
                # re-classifying, which could route to the cheap model again.
                config = TIER_CONFIG[Complexity.COMPLEX]
                retry = self.client.chat.completions.create(
                    model=config["model"],
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    max_tokens=config["max_tokens"],
                )
                result = {
                    "response": retry.choices[0].message.content,
                    "model_used": config["model"],
                    "complexity": Complexity.COMPLEX.value,
                    "cost_ratio": config["cost_ratio"],
                    "escalated": True,
                    "original_quality_score": score,
                }
        return result
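Usage mirrors the plain router. The sketch below, again with placeholder prompts, shows how to spot escalations so you can log and track them:

# Placeholder prompts; the escalated flag tells you the quality gate fired.
gated = QualityGatedRouter(openai.OpenAI(), quality_threshold=0.7)

result = gated.route_with_fallback(
    user_message="Compare our last two quarterly reports and explain the revenue change.",
    system_prompt="You are an analyst assistant.",
)
if result.get("escalated"):
    print(f"Escalated to {result['model_used']}, "
          f"initial quality score {result['original_quality_score']:.2f}")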
Cost Tracking Across Routes
To prove that routing is paying off, record each request's complexity, model, token count, and cost, then compare actual spend against what the same traffic would have cost if every request had gone to the expensive model.
class RoutingCostTracker:
    def __init__(self):
        self.requests = []

    def record(self, complexity: str, model: str, tokens_used: int, cost: float):
        self.requests.append({
            "complexity": complexity,
            "model": model,
            "tokens": tokens_used,
            "cost": cost,
        })

    def savings_report(self) -> dict:
        if not self.requests:
            return {}
        total_actual = sum(r["cost"] for r in self.requests)
        # Baseline: what the same token volume would have cost on the expensive
        # model alone (blended $ per 1M tokens; adjust to your provider's pricing).
        total_if_always_expensive = sum(
            r["tokens"] / 1_000_000 * 12.50 for r in self.requests
        )
        savings = total_if_always_expensive - total_actual
        return {
            "actual_cost": round(total_actual, 4),
            "cost_without_routing": round(total_if_always_expensive, 4),
            "savings": round(savings, 4),
            "savings_pct": round((savings / total_if_always_expensive) * 100, 1),
            "simple_pct": round(
                len([r for r in self.requests if r["complexity"] == "simple"])
                / len(self.requests) * 100, 1
            ),
        }
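Wiring the tracker into the serving loop might look like the sketch below. The token counts and dollar amounts are placeholders; in practice they come from your provider's usage data for each response.

tracker = RoutingCostTracker()

# Record each routed request (placeholder numbers; real token counts and costs
# come from the API response's usage data and your pricing).
tracker.record(complexity="simple", model="gpt-4o-mini", tokens_used=850, cost=0.0005)
tracker.record(complexity="complex", model="gpt-4o", tokens_used=3200, cost=0.04)

print(tracker.savings_report())
# e.g. {'actual_cost': 0.0405, 'cost_without_routing': 0.0506, 'savings': 0.0101, ...}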
When Not to Route
Avoid model routing for safety-critical applications (medical, legal, financial advice), tasks requiring consistent voice or style across responses, and scenarios where the classification cost exceeds the routing savings — which happens with very short queries where the classifier itself costs more than the difference between models.
FAQ
Does the classifier itself add significant cost?
The classifier call uses a cheap model with very few output tokens (just "SIMPLE" or "COMPLEX"), so it costs roughly $0.00001–$0.00005 per classification. At typical volumes, the classifier cost is 0.1–0.5% of total LLM spend. The savings from routing far outweigh this overhead.
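As a back-of-envelope check, the sketch below estimates per-classification cost; the per-token prices are illustrative assumptions, so substitute your provider's current gpt-4o-mini rates.

# Rough classifier overhead estimate; prices are assumptions, not quoted rates.
input_tokens = 120        # classifier prompt plus a typical user message
output_tokens = 2         # "SIMPLE" or "COMPLEX"
price_in_per_1m = 0.15    # assumed $ per 1M input tokens
price_out_per_1m = 0.60   # assumed $ per 1M output tokens

cost_per_classification = (
    input_tokens / 1_000_000 * price_in_per_1m
    + output_tokens / 1_000_000 * price_out_per_1m
)
print(f"${cost_per_classification:.6f}")  # ~$0.000019 with these assumptions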
What if the classifier misroutes a complex query to the cheap model?
This is where quality gates matter. The fallback pattern detects low-quality responses and automatically escalates to the expensive model. Track your escalation rate — if it exceeds 15–20%, retune your classifier prompt or switch to a rule-based pre-filter for known complex patterns.
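One simple way to watch this, assuming you log each routed result as in the tracker above, is to compute the escalation rate over a recent window; the helper below is a sketch using the 15% rule of thumb and a placeholder list of logged results.

def escalation_rate(results: list[dict]) -> float:
    """Fraction of requests the quality gate escalated to the expensive tier."""
    if not results:
        return 0.0
    escalated = sum(1 for r in results if r.get("escalated"))
    return escalated / len(results)

# Placeholder for logged route_with_fallback outputs over some window.
recent_results = [{"escalated": False}, {"escalated": False}, {"escalated": True}]

if escalation_rate(recent_results) > 0.15:
    print("Escalation rate high: retune the classifier prompt or add rule-based pre-filters.")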
Can I use more than two tiers?
Absolutely. Three-tier systems (small/medium/large) work well at scale. The key is keeping the classifier logic simple enough that it does not become a cost center itself. Start with two tiers and add a middle tier only when you have enough traffic data to justify the complexity.
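A three-tier variant only changes the enum, the config table, and the classifier labels. A minimal sketch follows; the middle-tier model name and its cost ratio are placeholders, not a recommendation.

from enum import Enum

class Complexity3(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

# The middle-tier entry is a placeholder; substitute whichever mid-sized model
# and pricing you actually use.
THREE_TIER_CONFIG = {
    Complexity3.SIMPLE: {"model": "gpt-4o-mini", "max_tokens": 1024, "cost_ratio": 0.06},
    Complexity3.MODERATE: {"model": "your-mid-tier-model", "max_tokens": 2048, "cost_ratio": 0.25},
    Complexity3.COMPLEX: {"model": "gpt-4o", "max_tokens": 4096, "cost_ratio": 1.0},
}

# The classifier prompt then asks for SIMPLE, MODERATE, or COMPLEX instead of a
# binary label; the rest of ModelRouter stays the same.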
#ModelRouting #CostOptimization #LLMSelection #AIArchitecture #SmartRouting #AgenticAI #LearnAI #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.