Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity
Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs.
The Cost of Using One Model for Everything
Most agent systems use a single model for all requests. If you choose a powerful model like GPT-4o, you get reliable results but pay premium prices for simple tasks that a smaller model could handle. If you choose a cheap model, complex queries fail. This is a false trade-off.
In practice, 60-80% of agent queries are straightforward — simple lookups, classification, template-based responses, or short factual answers. Only 20-40% require deep reasoning, long context processing, or complex multi-step chains. Model routing exploits this distribution by sending easy queries to small, fast, cheap models and reserving expensive models for hard queries.
A well-designed router can reduce LLM costs by 40-70%, and sometimes more, while maintaining quality on the queries that matter.
Router Architecture
A model router sits between the agent and the LLM providers. It inspects each query, classifies its complexity, and forwards it to the appropriate model:
from dataclasses import dataclass
from enum import Enum

from openai import OpenAI


class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass
class ModelTier:
    model: str
    base_url: str
    api_key: str
    max_tokens: int
    cost_per_1k_tokens: float  # USD per 1K input tokens


class ModelRouter:
    def __init__(self):
        self.tiers = {
            Complexity.SIMPLE: ModelTier(
                model="llama3.1:8b",
                base_url="http://localhost:11434/v1",
                api_key="ollama",
                max_tokens=512,
                cost_per_1k_tokens=0.0,  # Free, local
            ),
            Complexity.MODERATE: ModelTier(
                model="gpt-4o-mini",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=2048,
                cost_per_1k_tokens=0.00015,  # $0.15 per 1M input tokens
            ),
            Complexity.COMPLEX: ModelTier(
                model="gpt-4o",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=4096,
                cost_per_1k_tokens=0.0025,  # $2.50 per 1M input tokens
            ),
        }

    def route(self, messages: list) -> tuple[str, OpenAI]:
        complexity = self.classify_complexity(messages)
        tier = self.tiers[complexity]
        client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
        return tier.model, client

    def classify_complexity(self, messages: list) -> Complexity:
        user_msg = messages[-1]["content"] if messages else ""
        # Rule-based classification (fast, free)
        return self._rule_based_classify(user_msg)

    def _rule_based_classify(self, text: str) -> Complexity:
        text_lower = text.lower()
        word_count = len(text.split())

        # Simple: short queries, greetings, yes/no questions
        simple_indicators = [
            word_count < 15,
            text_lower.startswith(("what is", "who is", "define", "list")),
            text_lower in ("hello", "hi", "thanks", "bye"),
        ]

        # Complex: long context, multi-step reasoning, analysis
        complex_indicators = [
            word_count > 200,
            any(kw in text_lower for kw in [
                "analyze", "compare", "explain why", "step by step",
                "write a", "debug", "refactor", "design",
            ]),
            text.count("\n") > 5,  # Multi-line input
        ]

        if sum(complex_indicators) >= 2:
            return Complexity.COMPLEX
        if sum(simple_indicators) >= 2:
            return Complexity.SIMPLE
        return Complexity.MODERATE
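To sanity-check the heuristic without the full class, the indicator-voting rule can be reduced to a standalone sketch (the keyword lists mirror the ones above; the example queries are our own):

```python
# Minimal standalone sketch of the indicator-voting heuristic.
def classify(text: str) -> str:
    t = text.lower()
    words = len(text.split())
    simple = [
        words < 15,
        t.startswith(("what is", "who is", "define", "list")),
        t in ("hello", "hi", "thanks", "bye"),
    ]
    complex_ = [
        words > 200,
        any(k in t for k in ("analyze", "compare", "explain why",
                             "step by step", "write a", "debug",
                             "refactor", "design")),
        text.count("\n") > 5,
    ]
    # Complex wins ties; otherwise require two simple votes.
    if sum(complex_) >= 2:
        return "complex"
    if sum(simple) >= 2:
        return "simple"
    return "moderate"

print(classify("What is Python?"))  # simple
# A short multi-line code paste trips both the keyword and newline votes:
print(classify("Debug this function:\n" + "\n" * 6 + "def f(): pass"))  # complex
```

Note that a short query with only one strong signal (e.g. a 14-word request with no keyword match) falls through to moderate, which is the safe default.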
LLM-Based Classification
Rule-based routing is fast but brittle. For higher accuracy, use a small, fast model to classify query complexity:
class LLMClassifier:
    def __init__(self):
        # Use a small local model for classification
        self.client = OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )

    def classify(self, query: str) -> Complexity:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{
                "role": "user",
                "content": (
                    "Classify this query complexity as SIMPLE, MODERATE, or COMPLEX.\n"
                    "SIMPLE: factual lookups, definitions, yes/no, greetings\n"
                    "MODERATE: explanations, summaries, single-step tasks\n"
                    "COMPLEX: analysis, multi-step reasoning, code generation, comparisons\n"
                    "Respond with one word only.\n\n"
                    f"Query: {query}"
                ),
            }],
            temperature=0.0,
            max_tokens=5,
        )
        label = response.choices[0].message.content.strip().upper()
        return Complexity[label] if label in Complexity.__members__ else Complexity.MODERATE
The classification call adds 50-200ms of latency, and served by a local model it costs nothing in API fees. The savings from routing simple queries to cheap models far outweigh this overhead.
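If traffic contains repeated or templated queries, the classifier output can also be memoized so only novel queries pay that latency. A minimal sketch, assuming the classifier is deterministic (temperature=0.0 above makes that reasonable); `make_cached_classifier` is a hypothetical helper, not part of any library:

```python
from functools import lru_cache

def make_cached_classifier(classify_fn, maxsize: int = 4096):
    """Wrap a classifier so identical queries are only classified once."""
    @lru_cache(maxsize=maxsize)
    def cached(query: str):
        return classify_fn(query)
    return cached

# Demonstrate with a stub that counts how often it is actually invoked:
calls = []
def fake_classify(query):
    calls.append(query)
    return "SIMPLE"

classify = make_cached_classifier(fake_classify)
classify("hello")
classify("hello")  # served from cache, no second call
print(len(calls))  # 1
```

In production you would pass `LLMClassifier().classify` as `classify_fn`; the cache key is the raw query string, so normalizing whitespace and casing first improves the hit rate.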
Implementing Fallback Chains
Even with good routing, the selected model may fail — it might produce a low-quality response, hit a rate limit, or time out. Implement automatic escalation:
import time
import logging

logger = logging.getLogger(__name__)


class RoutingAgent:
    def __init__(self, router: ModelRouter):
        self.router = router
        self.escalation_order = [
            Complexity.SIMPLE,
            Complexity.MODERATE,
            Complexity.COMPLEX,
        ]

    def query(self, messages: list, system_prompt: str = "") -> str:
        complexity = self.router.classify_complexity(messages)
        start_idx = self.escalation_order.index(complexity)

        full_messages = []
        if system_prompt:
            full_messages.append({"role": "system", "content": system_prompt})
        full_messages.extend(messages)

        # Try the classified tier, then escalate on failure
        for tier_complexity in self.escalation_order[start_idx:]:
            tier = self.router.tiers[tier_complexity]
            client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
            try:
                start = time.time()
                response = client.chat.completions.create(
                    model=tier.model,
                    messages=full_messages,
                    max_tokens=tier.max_tokens,
                    temperature=0.2,
                    timeout=30,
                )
                elapsed = time.time() - start
                # content can be None (e.g. tool-call responses); guard it
                result = response.choices[0].message.content or ""

                # Quality gate: if response is too short, escalate
                if len(result.strip()) < 20 and tier_complexity != Complexity.COMPLEX:
                    logger.warning(f"Short response from {tier.model}, escalating")
                    continue

                logger.info(
                    f"Routed to {tier.model} "
                    f"({tier_complexity.value}) in {elapsed:.2f}s"
                )
                return result
            except Exception as e:
                logger.error(f"Failed on {tier.model}: {e}")
                continue

        return "I'm unable to process this request at the moment."
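The escalation loop can be exercised without network access by stubbing out the per-tier call; `call_tier` below is a hypothetical stand-in for the OpenAI client, and the tier list is reduced to model names:

```python
# Standalone sketch of the escalation loop, showing how a provider error
# or a too-short answer falls through to the next tier.
TIERS = ["llama3.1:8b", "gpt-4o-mini", "gpt-4o"]

def query_with_escalation(call_tier, start_idx: int = 0, min_len: int = 20) -> str:
    for i, model in enumerate(TIERS[start_idx:], start=start_idx):
        try:
            result = call_tier(model)
        except Exception:
            continue  # provider error / rate limit / timeout: escalate
        if len(result.strip()) < min_len and i < len(TIERS) - 1:
            continue  # quality gate: too short, escalate
        return result
    return "I'm unable to process this request at the moment."

# A stub where the small model times out and the mid tier answers too briefly:
def flaky(model):
    if model == "llama3.1:8b":
        raise TimeoutError
    return "ok" if model == "gpt-4o-mini" else "A full, detailed answer..."

print(query_with_escalation(flaky))  # falls through to gpt-4o
```

Unit-testing the escalation path this way is worthwhile: the failure branches are exactly the ones that rarely fire in development but fire constantly under production rate limits.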
Measuring Router Performance
Track routing decisions to optimize over time:
from collections import defaultdict
import json


class RouterMetrics:
    def __init__(self):
        self.decisions = defaultdict(int)
        self.escalations = 0
        self.costs = defaultdict(float)

    def record(self, classified: Complexity, actual: Complexity, cost: float):
        self.decisions[classified.value] += 1
        if actual != classified:
            self.escalations += 1
        self.costs[actual.value] += cost

    def report(self) -> dict:
        total = sum(self.decisions.values())
        return {
            "total_queries": total,
            "distribution": {
                k: f"{v/total*100:.1f}%"
                for k, v in self.decisions.items()
            },
            "escalation_rate": f"{self.escalations/total*100:.1f}%"
            if total > 0 else "0%",
            "total_cost": f"${sum(self.costs.values()):.4f}",
            "cost_by_tier": {
                k: f"${v:.4f}" for k, v in self.costs.items()
            },
        }


metrics = RouterMetrics()
# After each query:
# metrics.record(classified_complexity, actual_tier_used, cost)
print(json.dumps(metrics.report(), indent=2))
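The `cost` argument passed to `record` has to come from somewhere. One way to derive it from an OpenAI-style response is from `usage.prompt_tokens` and `usage.completion_tokens` (a sketch; the per-1K rates shown are illustrative, not live pricing):

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               in_per_1k: float, out_per_1k: float) -> float:
    """Dollar cost of one query, with separate input and output rates."""
    return (prompt_tokens / 1000) * in_per_1k + (completion_tokens / 1000) * out_per_1k

# e.g. a gpt-4o-mini call: 1,200 prompt tokens, 400 completion tokens,
# at assumed rates of $0.00015/1K in and $0.0006/1K out:
print(round(query_cost(1200, 400, 0.00015, 0.0006), 6))  # 0.00042
```

Output tokens are typically priced several times higher than input tokens, so tracking them separately keeps the per-tier cost report honest.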
Advanced: Embedding-Based Routing
For even better routing accuracy, use semantic similarity to a set of labeled example queries:
from sentence_transformers import SentenceTransformer
import numpy as np


class EmbeddingRouter:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Labeled example queries for each complexity tier
        self.examples = {
            Complexity.SIMPLE: [
                "What is Python?",
                "Define machine learning",
                "Hello",
                "What time is it?",
            ],
            Complexity.MODERATE: [
                "Explain how neural networks learn",
                "Summarize the benefits of microservices",
                "What are the pros and cons of NoSQL?",
            ],
            Complexity.COMPLEX: [
                "Design a distributed event-sourcing system for an e-commerce platform",
                "Compare transformer and LSTM architectures for time-series forecasting",
                "Debug this multi-threaded Python code that has a race condition",
            ],
        }

        # Pre-compute example embeddings
        self.tier_embeddings = {}
        for tier, texts in self.examples.items():
            self.tier_embeddings[tier] = self.embedder.encode(
                texts, normalize_embeddings=True
            )

    def classify(self, query: str) -> Complexity:
        query_emb = self.embedder.encode([query], normalize_embeddings=True)
        best_tier = Complexity.MODERATE
        best_score = -1.0
        for tier, embeddings in self.tier_embeddings.items():
            similarities = np.dot(embeddings, query_emb.T).flatten()
            max_sim = float(similarities.max())
            if max_sim > best_score:
                best_score = max_sim
                best_tier = tier
        return best_tier
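The nearest-example decision rule is independent of the embedding model. With hand-made 2-D "embeddings" (purely illustrative; real vectors come from the encoder above) it reduces to a dot-product argmax:

```python
import numpy as np

# Toy, approximately unit-length vectors standing in for real embeddings;
# each tier keeps a few labeled example vectors.
examples = {
    "simple":  np.array([[1.0, 0.0], [0.9, 0.436]]),
    "complex": np.array([[0.0, 1.0], [0.436, 0.9]]),
}

def classify(query_vec: np.ndarray) -> str:
    best_tier, best_score = None, -1.0
    for tier, vecs in examples.items():
        # Cosine similarity equals the dot product for normalized vectors;
        # the query inherits the label of its single nearest example.
        score = float(np.dot(vecs, query_vec).max())
        if score > best_score:
            best_tier, best_score = tier, score
    return best_tier

print(classify(np.array([0.95, 0.31])))  # simple
```

Max-over-examples (rather than a per-tier centroid) is a deliberate choice: it keeps multi-modal tiers intact, since "Hello" and "What time is it?" need not be near each other to both count as simple.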
Cost Savings in Practice
Consider an agent handling 100,000 queries per month, averaging roughly 1,000 input tokens per query, with this distribution after routing:
- 60% simple (local Llama 8B): $0
- 30% moderate (GPT-4o-mini at $0.15 per 1M input tokens): ~$4.50
- 10% complex (GPT-4o at $2.50 per 1M input tokens): ~$25
Total with routing: ~$30/month. Without routing (all GPT-4o): ~$250/month.
That is roughly an 88% cost reduction on input tokens (output tokens, priced higher, scale similarly) while maintaining full quality on the complex queries that actually need it.
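The projection can be recomputed for any workload mix. A small sketch, assuming ~1,000 input tokens per query and per-1M input prices; substitute your own traffic shares, token counts, and rates:

```python
def monthly_cost(queries: int, mix: dict, price_per_1m: dict,
                 tokens_per_query: int = 1000) -> float:
    """Input-token cost for a month: mix maps tier -> traffic fraction,
    price_per_1m maps tier -> dollars per 1M input tokens."""
    tokens = queries * tokens_per_query
    return sum(mix[t] * tokens / 1e6 * price_per_1m[t] for t in mix)

mix = {"simple": 0.6, "moderate": 0.3, "complex": 0.1}
prices = {"simple": 0.0, "moderate": 0.15, "complex": 2.50}

print(round(monthly_cost(100_000, mix, prices), 2))                        # routed
print(round(monthly_cost(100_000, {"complex": 1.0}, {"complex": 2.50}), 2))  # all GPT-4o
```

The ratio between the two numbers is what matters: it is driven almost entirely by the share of traffic the router keeps off the most expensive tier.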
FAQ
Does the routing classification itself add meaningful latency?
Rule-based routing adds less than 1ms. LLM-based classification with a local 2B model adds 50-200ms. Embedding-based routing adds 10-30ms. For most agent applications where LLM inference takes 500ms-3s, the routing overhead is negligible — and the latency savings from using a faster model for simple queries often more than compensate.
What if the router misclassifies a complex query as simple?
This is why fallback chains are essential. If the small model produces a short, low-quality, or incoherent response, the quality gate detects this and escalates to the next tier. In practice, misclassification rates below 15% have minimal impact on user experience because the escalation mechanism catches most errors.
Can I use model routing with tool-calling agents?
Yes, but route based on the tool complexity, not just the query text. Simple tool calls (single lookup, single API call) route to small models. Complex orchestration (multi-tool chains, conditional logic) routes to large models. You can inspect the agent's tool definitions to inform the routing decision.
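One hedged sketch of that idea: a heuristic that looks at the tool surface as well as the query text. `route_for_tools` and its thresholds are illustrative assumptions, and the substring checks are deliberately crude:

```python
def route_for_tools(query: str, tools: list[dict]) -> str:
    """Pick a model tier from the query plus the agent's available tools."""
    # Many available tools, or sequencing language in the query, hints at
    # a multi-step plan that warrants the larger model.
    needs_orchestration = len(tools) > 3 or any(
        kw in query.lower() for kw in ("then", "after that", "for each")
    )
    return "gpt-4o" if needs_orchestration else "gpt-4o-mini"

print(route_for_tools("Look up the order status", [{"name": "get_order"}]))  # gpt-4o-mini
```

A single-lookup query against one tool stays on the small model; the same router escalates as soon as the tool list grows or the query starts chaining steps.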
#ModelRouting #CostOptimization #AgentArchitecture #MultiModel #LLMOrchestration #AgenticAI #LearnAI #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.