Learn Agentic AI

Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity

Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs.

The Cost of Using One Model for Everything

Most agent systems use a single model for all requests. If you choose a powerful model like GPT-4o, you get reliable results but pay premium prices for simple tasks that a smaller model could handle. If you choose a cheap model, complex queries fail. This is a false trade-off.

In practice, 60-80% of agent queries are straightforward — simple lookups, classification, template-based responses, or short factual answers. Only 20-40% require deep reasoning, long context processing, or complex multi-step chains. Model routing exploits this distribution by sending easy queries to small, fast, cheap models and reserving expensive models for hard queries.

A well-designed router can reduce LLM costs by 40-70% while maintaining quality on the queries that matter.

Router Architecture

A model router sits between the agent and the LLM providers. It inspects each query, classifies its complexity, and forwards it to the appropriate model:

from dataclasses import dataclass
from enum import Enum
from openai import OpenAI

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

@dataclass
class ModelTier:
    model: str
    base_url: str
    api_key: str
    max_tokens: int
    cost_per_1m_tokens: float  # USD per 1M input tokens

class ModelRouter:
    def __init__(self):
        self.tiers = {
            Complexity.SIMPLE: ModelTier(
                model="llama3.1:8b",
                base_url="http://localhost:11434/v1",
                api_key="ollama",
                max_tokens=512,
                cost_per_1m_tokens=0.0,  # free, runs locally
            ),
            Complexity.MODERATE: ModelTier(
                model="gpt-4o-mini",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",  # load from an environment variable in practice
                max_tokens=2048,
                cost_per_1m_tokens=0.15,
            ),
            Complexity.COMPLEX: ModelTier(
                model="gpt-4o",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=4096,
                cost_per_1m_tokens=2.50,
            ),
        }

    def route(self, messages: list) -> tuple[str, OpenAI]:
        complexity = self.classify_complexity(messages)
        tier = self.tiers[complexity]
        client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
        return tier.model, client

    def classify_complexity(self, messages: list) -> Complexity:
        user_msg = messages[-1]["content"] if messages else ""
        # Rule-based classification (fast, free)
        return self._rule_based_classify(user_msg)

    def _rule_based_classify(self, text: str) -> Complexity:
        text_lower = text.lower()
        word_count = len(text.split())

        # Simple: short queries, greetings, yes/no questions
        simple_indicators = [
            word_count < 15,
            text_lower.startswith(("what is", "who is", "define", "list")),
            text_lower in ("hello", "hi", "thanks", "bye"),
        ]

        # Complex: long context, multi-step reasoning, analysis
        complex_indicators = [
            word_count > 200,
            any(kw in text_lower for kw in [
                "analyze", "compare", "explain why", "step by step",
                "write a", "debug", "refactor", "design",
            ]),
            text.count("\n") > 5,  # Multi-line input
        ]

        if sum(complex_indicators) >= 2:
            return Complexity.COMPLEX
        if sum(simple_indicators) >= 2:
            return Complexity.SIMPLE
        return Complexity.MODERATE

LLM-Based Classification

Rule-based routing is fast but brittle. For higher accuracy, use a small, fast model to classify query complexity:

class LLMClassifier:
    def __init__(self):
        # Use a small local model for classification
        self.client = OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )

    def classify(self, query: str) -> Complexity:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{
                "role": "user",
                "content": (
                    "Classify this query complexity as SIMPLE, MODERATE, or COMPLEX.\n"
                    "SIMPLE: factual lookups, definitions, yes/no, greetings\n"
                    "MODERATE: explanations, summaries, single-step tasks\n"
                    "COMPLEX: analysis, multi-step reasoning, code generation, comparisons\n"
                    "Respond with one word only.\n\n"
                    f"Query: {query}"
                ),
            }],
            temperature=0.0,
            max_tokens=5,
        )

        label = response.choices[0].message.content.strip().upper()
        return Complexity[label] if label in Complexity.__members__ else Complexity.MODERATE

The classification call adds 50-200ms of latency and, when it runs on a local model, costs essentially nothing. The savings from routing simple queries to cheap models far outweigh this overhead.

Implementing Fallback Chains

Even with good routing, the selected model can fail: it might produce a low-quality response, hit a rate limit, or time out. Implement automatic escalation:


import time
import logging

logger = logging.getLogger(__name__)

class RoutingAgent:
    def __init__(self, router: ModelRouter):
        self.router = router
        self.escalation_order = [
            Complexity.SIMPLE,
            Complexity.MODERATE,
            Complexity.COMPLEX,
        ]

    def query(self, messages: list, system_prompt: str = "") -> str:
        complexity = self.router.classify_complexity(messages)
        start_idx = self.escalation_order.index(complexity)

        full_messages = []
        if system_prompt:
            full_messages.append({"role": "system", "content": system_prompt})
        full_messages.extend(messages)

        # Try the classified tier, then escalate on failure
        for tier_complexity in self.escalation_order[start_idx:]:
            tier = self.router.tiers[tier_complexity]
            client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)

            try:
                start = time.time()
                response = client.chat.completions.create(
                    model=tier.model,
                    messages=full_messages,
                    max_tokens=tier.max_tokens,
                    temperature=0.2,
                    timeout=30,
                )
                elapsed = time.time() - start
                result = response.choices[0].message.content or ""

                # Quality gate: if response is too short, escalate
                if len(result.strip()) < 20 and tier_complexity != Complexity.COMPLEX:
                    logger.warning(
                        f"Short response from {tier.model}, escalating"
                    )
                    continue

                logger.info(
                    f"Routed to {tier.model} "
                    f"({tier_complexity.value}) in {elapsed:.2f}s"
                )
                return result

            except Exception as e:
                logger.error(f"Failed on {tier.model}: {e}")
                continue

        return "I'm unable to process this request at the moment."

Measuring Router Performance

Track routing decisions to optimize over time:

from collections import defaultdict
import json

class RouterMetrics:
    def __init__(self):
        self.decisions = defaultdict(int)
        self.escalations = 0
        self.costs = defaultdict(float)

    def record(self, classified: Complexity, actual: Complexity, cost: float):
        self.decisions[classified.value] += 1
        if actual != classified:
            self.escalations += 1
        self.costs[actual.value] += cost

    def report(self) -> dict:
        total = sum(self.decisions.values())
        if total == 0:
            return {"total_queries": 0}
        return {
            "total_queries": total,
            "distribution": {
                k: f"{v/total*100:.1f}%"
                for k, v in self.decisions.items()
            },
            "escalation_rate": f"{self.escalations/total*100:.1f}%",
            "total_cost": f"${sum(self.costs.values()):.4f}",
            "cost_by_tier": {
                k: f"${v:.4f}" for k, v in self.costs.items()
            },
        }

metrics = RouterMetrics()
# After each query:
# metrics.record(classified_complexity, actual_tier_used, cost)
print(json.dumps(metrics.report(), indent=2))

Advanced: Embedding-Based Routing

For even better routing accuracy, use semantic similarity to a set of labeled example queries:

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingRouter:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Labeled example queries for each complexity tier
        self.examples = {
            Complexity.SIMPLE: [
                "What is Python?",
                "Define machine learning",
                "Hello",
                "What time is it?",
            ],
            Complexity.MODERATE: [
                "Explain how neural networks learn",
                "Summarize the benefits of microservices",
                "What are the pros and cons of NoSQL?",
            ],
            Complexity.COMPLEX: [
                "Design a distributed event-sourcing system for an e-commerce platform",
                "Compare transformer and LSTM architectures for time-series forecasting",
                "Debug this multi-threaded Python code that has a race condition",
            ],
        }

        # Pre-compute example embeddings
        self.tier_embeddings = {}
        for tier, texts in self.examples.items():
            self.tier_embeddings[tier] = self.embedder.encode(
                texts, normalize_embeddings=True
            )

    def classify(self, query: str) -> Complexity:
        query_emb = self.embedder.encode([query], normalize_embeddings=True)

        best_tier = Complexity.MODERATE
        best_score = -1.0

        for tier, embeddings in self.tier_embeddings.items():
            similarities = np.dot(embeddings, query_emb.T).flatten()
            max_sim = float(similarities.max())

            if max_sim > best_score:
                best_score = max_sim
                best_tier = tier

        return best_tier
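The nearest-example rule at the heart of this router can be exercised with toy unit vectors, independent of any embedding model (the vectors below are hand-made stand-ins, not real embeddings):

```python
import numpy as np

# Hand-made unit vectors standing in for precomputed example embeddings.
tier_embeddings = {
    "simple":  np.array([[1.0, 0.0], [0.9, 0.436]]),
    "complex": np.array([[0.0, 1.0]]),
}
query_emb = np.array([[0.95, 0.312]])  # a query vector near the "simple" cluster

# Pick the tier whose closest example has the highest cosine similarity
# (dot product, since all vectors are normalized).
best_tier, best_score = None, -1.0
for tier, embs in tier_embeddings.items():
    sims = np.dot(embs, query_emb.T).flatten()
    if sims.max() > best_score:
        best_score, best_tier = float(sims.max()), tier

print(best_tier)  # "simple"
```

Because routing compares against the *maximum* similarity per tier rather than the mean, one strong match is enough to claim a query for a tier.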

Cost Savings in Practice

Consider an agent handling 100,000 queries per month, averaging roughly 1,000 input tokens each, with this distribution after routing:

  • 60% simple (local Llama 8B): $0
  • 30% moderate (GPT-4o-mini at $0.15/1M input tokens): ~$4.50
  • 10% complex (GPT-4o at $2.50/1M input tokens): ~$25

Total with routing: ~$30/month. Without routing (all GPT-4o): ~$250/month.

That is a cost reduction of nearly 90% on input tokens, larger than the typical 40-70% range because the simple tier runs locally at zero marginal cost, while full quality is maintained on the complex queries that actually need it.
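The arithmetic can be sanity-checked in a few lines; the traffic volume, token average, and per-1M-token prices below are the illustrative assumptions from this section, so substitute your own measurements:

```python
# Rough monthly cost model: 100,000 queries, ~1,000 input tokens each
# (both figures are assumptions; plug in your own traffic data).
QUERIES = 100_000
TOKENS_PER_QUERY = 1_000
PRICE_PER_1M = {"simple": 0.0, "moderate": 0.15, "complex": 2.50}  # USD, input

def monthly_cost(shares: dict[str, float]) -> float:
    """Sum input-token cost across tiers for a given traffic split."""
    return sum(
        QUERIES * share * TOKENS_PER_QUERY / 1_000_000 * PRICE_PER_1M[tier]
        for tier, share in shares.items()
    )

routed = monthly_cost({"simple": 0.60, "moderate": 0.30, "complex": 0.10})
unrouted = monthly_cost({"complex": 1.0})  # everything on the top tier
print(f"routed: ${routed:.2f}/mo, unrouted: ${unrouted:.2f}/mo, "
      f"savings: {(1 - routed / unrouted) * 100:.0f}%")
```

Varying the shares shows how sensitive the savings are to the router's classification accuracy.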

FAQ

Does the routing classification itself add meaningful latency?

Rule-based routing adds less than 1ms. LLM-based classification with a local 2B model adds 50-200ms. Embedding-based routing adds 10-30ms. For most agent applications where LLM inference takes 500ms-3s, the routing overhead is negligible — and the latency savings from using a faster model for simple queries often more than compensate.

What if the router misclassifies a complex query as simple?

This is why fallback chains are essential. If the small model produces a short, low-quality, or incoherent response, the quality gate detects this and escalates to the next tier. In practice, misclassification rates below 15% have minimal impact on user experience because the escalation mechanism catches most errors.

Can I use model routing with tool-calling agents?

Yes, but route based on the tool complexity, not just the query text. Simple tool calls (single lookup, single API call) route to small models. Complex orchestration (multi-tool chains, conditional logic) routes to large models. You can inspect the agent's tool definitions to inform the routing decision.
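A minimal sketch of tool-aware routing, assuming OpenAI-style tool schemas; the orchestration tool names and thresholds are hypothetical placeholders, not a standard:

```python
from enum import Enum

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

# Hypothetical names of orchestration-style tools -- adjust to your toolset.
ORCHESTRATION_TOOLS = {"run_workflow", "execute_plan", "spawn_subagent"}

def classify_with_tools(query: str, tools: list[dict]) -> Complexity:
    """Route on tool surface area as well as query text (illustrative thresholds)."""
    names = {t["function"]["name"] for t in tools}  # OpenAI-style tool schema
    if names & ORCHESTRATION_TOOLS or len(tools) > 5:
        return Complexity.COMPLEX   # multi-tool chains, conditional logic
    if tools:
        return Complexity.MODERATE  # a single lookup or API call
    return Complexity.SIMPLE if len(query.split()) < 15 else Complexity.MODERATE
```

In a real agent you would combine this signal with the text-based classifier rather than replace it.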


#ModelRouting #CostOptimization #AgentArchitecture #MultiModel #LLMOrchestration #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

