---
title: "Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity"
description: "Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs."
canonical: https://callsphere.ai/blog/model-routing-directing-agent-queries-optimal-model-complexity
category: "Learn Agentic AI"
tags: ["Model Routing", "Cost Optimization", "Agent Architecture", "Multi-Model", "LLM Orchestration"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.592Z
---

# Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity

> Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs.

## The Cost of Using One Model for Everything

Most agent systems use a single model for all requests. If you choose a powerful model like GPT-4o, you get reliable results but pay premium prices for simple tasks that a smaller model could handle. If you choose a cheap model, complex queries fail. This is a false trade-off.

In practice, 60-80% of agent queries are straightforward — simple lookups, classification, template-based responses, or short factual answers. Only 20-40% require deep reasoning, long context processing, or complex multi-step chains. Model routing exploits this distribution by sending easy queries to small, fast, cheap models and reserving expensive models for hard queries.

A well-designed router can reduce LLM costs by 40-70% while maintaining quality on the queries that matter.

## Router Architecture

A model router sits between the agent and the LLM providers. It inspects each query, classifies its complexity, and forwards it to the appropriate model:

```mermaid
flowchart LR
    Q["Agent query"] --> CL{"Complexity
classifier"}
    CL -->|simple| T1["Local Llama 8B
free, fast"]
    CL -->|moderate| T2["gpt-4o-mini
low cost"]
    CL -->|complex| T3["gpt-4o
premium"]
    T1 -.->|quality gate fails| T2
    T2 -.->|quality gate fails| T3
    style T1 fill:#4f46e5,stroke:#4338ca,color:#fff
    style T2 fill:#4f46e5,stroke:#4338ca,color:#fff
    style T3 fill:#059669,stroke:#047857,color:#fff
```

```python
from dataclasses import dataclass
from enum import Enum
from openai import OpenAI

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

@dataclass
class ModelTier:
    model: str
    base_url: str
    api_key: str
    max_tokens: int
    cost_per_1k_tokens: float

class ModelRouter:
    def __init__(self):
        self.tiers = {
            Complexity.SIMPLE: ModelTier(
                model="llama3.1:8b",
                base_url="http://localhost:11434/v1",
                api_key="ollama",
                max_tokens=512,
                cost_per_1k_tokens=0.0,  # Free, local
            ),
            Complexity.MODERATE: ModelTier(
                model="gpt-4o-mini",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=2048,
                cost_per_1k_tokens=0.15,
            ),
            Complexity.COMPLEX: ModelTier(
                model="gpt-4o",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=4096,
                cost_per_1k_tokens=2.50,
            ),
        }

    def route(self, messages: list) -> tuple[str, OpenAI]:
        complexity = self.classify_complexity(messages)
        tier = self.tiers[complexity]
        client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
        return tier.model, client

    def classify_complexity(self, messages: list) -> Complexity:
        user_msg = messages[-1]["content"] if messages else ""
        # Rule-based classification (fast, free)
        return self._rule_based_classify(user_msg)

    def _rule_based_classify(self, text: str) -> Complexity:
        text_lower = text.lower()
        word_count = len(text.split())

        # Simple: short queries, greetings, yes/no questions
        simple_indicators = [
            word_count < 15,
            any(kw in text_lower for kw in [
                "hello", "thanks", "what is", "define", "yes or no",
            ]),
            text_lower.rstrip().endswith("?") and word_count < 10,
        ]

        # Complex: long or multi-line input, analytical keywords
        complex_indicators = [
            word_count > 200,
            any(kw in text_lower for kw in [
                "analyze", "compare", "explain why", "step by step",
                "write a", "debug", "refactor", "design",
            ]),
            text.count("\n") > 5,  # Multi-line input
        ]

        if sum(complex_indicators) >= 2:
            return Complexity.COMPLEX
        if sum(simple_indicators) >= 2:
            return Complexity.SIMPLE
        return Complexity.MODERATE
```

## LLM-Based Classification

Rule-based routing is fast but brittle. For higher accuracy, use a small, fast model to classify query complexity:

```python
class LLMClassifier:
    def __init__(self):
        # Use a small local model for classification
        self.client = OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )

    def classify(self, query: str) -> Complexity:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{
                "role": "user",
                "content": (
                    "Classify this query complexity as SIMPLE, MODERATE, or COMPLEX.\n"
                    "SIMPLE: factual lookups, definitions, yes/no, greetings\n"
                    "MODERATE: explanations, summaries, single-step tasks\n"
                    "COMPLEX: analysis, multi-step reasoning, code generation, comparisons\n"
                    "Respond with one word only.\n\n"
                    f"Query: {query}"
                ),
            }],
            temperature=0.0,
            max_tokens=5,
        )

        label = response.choices[0].message.content.strip().upper()
        return Complexity[label] if label in Complexity.__members__ else Complexity.MODERATE
```

The classification call adds 50-200ms of latency and, when served by a local model, costs nothing; a hosted classifier costs a fraction of a cent per call. Either way, the savings from routing simple queries to cheap models far outweigh this overhead.

## Implementing Fallback Chains

Even with good routing, the selected model may still fail: it might produce a low-quality response, hit a rate limit, or time out. Implement automatic escalation:

```python
import time
import logging

logger = logging.getLogger(__name__)

class RoutingAgent:
    def __init__(self, router: ModelRouter):
        self.router = router
        self.escalation_order = [
            Complexity.SIMPLE,
            Complexity.MODERATE,
            Complexity.COMPLEX,
        ]

    def query(self, messages: list, system_prompt: str = "") -> str:
        complexity = self.router.classify_complexity(messages)
        start_idx = self.escalation_order.index(complexity)

        full_messages = []
        if system_prompt:
            full_messages.append({"role": "system", "content": system_prompt})
        full_messages.extend(messages)

        # Try the classified tier, then escalate on failure
        for tier_complexity in self.escalation_order[start_idx:]:
            tier = self.router.tiers[tier_complexity]
            client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)

            try:
                start = time.time()
                response = client.chat.completions.create(
                    model=tier.model,
                    messages=full_messages,
                    max_tokens=tier.max_tokens,
                    temperature=0.2,
                    timeout=30,
                )
                elapsed = time.time() - start
                result = response.choices[0].message.content

                # Quality gate: if the response is suspiciously short,
                # escalate to the next tier instead of returning it
                if len(result.strip()) < 20 and tier_complexity != Complexity.COMPLEX:
                    logger.warning(
                        "Short response from %s (%.2fs), escalating",
                        tier.model, elapsed,
                    )
                    continue

                return result

            except Exception as exc:
                logger.warning("Tier %s failed (%s), escalating", tier.model, exc)
                continue

        raise RuntimeError("All model tiers failed for this query")
```

## Measuring Routing Accuracy

Track where queries are classified, how often they escalate, and what each tier actually costs. Without these metrics you cannot tell whether the router is earning its keep:

```python
import json
from collections import defaultdict

class RouterMetrics:
    def __init__(self):
        self.decisions = defaultdict(int)    # classified tier -> query count
        self.escalations = 0                 # queries escalated past their tier
        self.costs = defaultdict(float)      # actual tier used -> dollars spent

    def record(self, classified: Complexity, actual: Complexity, cost: float):
        self.decisions[classified.value] += 1
        self.costs[actual.value] += cost
        if actual != classified:
            self.escalations += 1

    def report(self) -> dict:
        total = sum(self.decisions.values())
        return {
            "total_queries": total,
            "distribution": {
                k: f"{v/total*100:.1f}%"
                for k, v in self.decisions.items()
            },
            "escalation_rate": f"{self.escalations/total*100:.1f}%"
            if total > 0 else "0%",
            "total_cost": f"${sum(self.costs.values()):.4f}",
            "cost_by_tier": {
                k: f"${v:.4f}" for k, v in self.costs.items()
            },
        }

metrics = RouterMetrics()
# After each query:
# metrics.record(classified_complexity, actual_tier_used, cost)
print(json.dumps(metrics.report(), indent=2))
```

## Advanced: Embedding-Based Routing

For even better routing accuracy, use semantic similarity to a set of labeled example queries:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingRouter:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Labeled example queries for each complexity tier
        self.examples = {
            Complexity.SIMPLE: [
                "What is Python?",
                "Define machine learning",
                "Hello",
                "What time is it?",
            ],
            Complexity.MODERATE: [
                "Explain how neural networks learn",
                "Summarize the benefits of microservices",
                "What are the pros and cons of NoSQL?",
            ],
            Complexity.COMPLEX: [
                "Design a distributed event-sourcing system for an e-commerce platform",
                "Compare transformer and LSTM architectures for time-series forecasting",
                "Debug this multi-threaded Python code that has a race condition",
            ],
        }

        # Pre-compute example embeddings
        self.tier_embeddings = {}
        for tier, texts in self.examples.items():
            self.tier_embeddings[tier] = self.embedder.encode(
                texts, normalize_embeddings=True
            )

    def classify(self, query: str) -> Complexity:
        query_emb = self.embedder.encode([query], normalize_embeddings=True)

        best_tier = Complexity.MODERATE
        best_score = -1.0

        for tier, embeddings in self.tier_embeddings.items():
            similarities = np.dot(embeddings, query_emb.T).flatten()
            max_sim = float(similarities.max())

            if max_sim > best_score:
                best_score = max_sim
                best_tier = tier

        return best_tier
```

## Cost Savings in Practice

Consider an agent handling 100,000 queries per month with this distribution after routing:

- **60% simple** (local Llama 8B): $0
- **30% moderate** (GPT-4o-mini at $0.15/1K tokens): ~$45
- **10% complex** (GPT-4o at $2.50/1K tokens): ~$25

**Total with routing: ~$70/month**
**Without routing (all GPT-4o): ~$250/month**

That is a 72% cost reduction while maintaining full quality on the complex queries that actually need it.
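The back-of-the-envelope math above is sensitive to traffic mix and token volume. A minimal calculator makes that sensitivity explicit; the `monthly_cost` helper and the flat 500-tokens-per-query figure below are illustrative assumptions, so plug in your own measurements rather than these numbers:

```python
def monthly_cost(queries: int, mix: dict[str, float],
                 price_per_1k: dict[str, float],
                 avg_tokens: dict[str, int]) -> float:
    """Estimate monthly spend given a traffic mix and per-tier pricing."""
    return sum(
        queries * share * avg_tokens[tier] / 1000 * price_per_1k[tier]
        for tier, share in mix.items()
    )

# Illustrative assumptions: 100k queries/month, 500 tokens per query
prices = {"simple": 0.0, "moderate": 0.15, "complex": 2.50}
tokens = {"simple": 500, "moderate": 500, "complex": 500}

routed = monthly_cost(
    100_000, {"simple": 0.6, "moderate": 0.3, "complex": 0.1}, prices, tokens
)
unrouted = monthly_cost(100_000, {"complex": 1.0}, prices, tokens)
print(f"routed: ${routed:,.0f}  unrouted: ${unrouted:,.0f}  "
      f"saving: {1 - routed / unrouted:.0%}")
```

Whatever the token assumptions, the savings are dominated by one variable: the share of traffic that stays off the premium tier.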

## FAQ

### Does the routing classification itself add meaningful latency?

Rule-based routing adds less than 1ms. LLM-based classification with a local 2B model adds 50-200ms. Embedding-based routing adds 10-30ms. For most agent applications where LLM inference takes 500ms-3s, the routing overhead is negligible — and the latency savings from using a faster model for simple queries often more than compensate.

### What if the router misclassifies a complex query as simple?

This is why fallback chains are essential. If the small model produces a short, low-quality, or incoherent response, the quality gate detects this and escalates to the next tier. In practice, misclassification rates below 15% have minimal impact on user experience because the escalation mechanism catches most errors.
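A quality gate along those lines can be a few lines of heuristics. This is a sketch: the 20-character threshold and the refusal phrases are illustrative values to tune against your own traffic, not standards:

```python
def should_escalate(response: str, min_chars: int = 20) -> bool:
    """Heuristic quality gate: flag responses too thin to trust."""
    text = response.strip()
    if len(text) < min_chars:
        return True  # suspiciously short
    refusal_markers = ("i can't", "i cannot", "i'm not sure", "as an ai")
    return text.lower().startswith(refusal_markers)

print(should_escalate("Yes."))                                # True: too short
print(should_escalate("I'm not sure I can help with that."))  # True: refusal
```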

### Can I use model routing with tool-calling agents?

Yes, but route based on the tool complexity, not just the query text. Simple tool calls (single lookup, single API call) route to small models. Complex orchestration (multi-tool chains, conditional logic) routes to large models. You can inspect the agent's tool definitions to inform the routing decision.
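As a sketch of that idea (the tool-count threshold and the chaining keywords below are hypothetical heuristics, not part of any provider API), the router can inspect both the query text and the attached tool definitions:

```python
def classify_tool_query(query: str, tools: list[dict]) -> str:
    """Route tool-calling queries by orchestration complexity (illustrative heuristic)."""
    chain_markers = ("then", "after that", "for each", "and also")
    needs_chaining = any(m in query.lower() for m in chain_markers)

    if needs_chaining or len(tools) > 3:
        return "complex"   # multi-tool chains or conditional logic
    if tools:
        return "moderate"  # a single lookup or API call
    return "simple"        # plain text, no tools involved

weather_tool = [{"name": "get_weather"}]
print(classify_tool_query("What's the weather in Paris?", weather_tool))           # moderate
print(classify_tool_query("Get the weather, then email the team", weather_tool))   # complex
```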

---

#ModelRouting #CostOptimization #AgentArchitecture #MultiModel #LLMOrchestration #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/model-routing-directing-agent-queries-optimal-model-complexity
