Learn Agentic AI

Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity

Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs.

The Cost of Using One Model for Everything

Most agent systems use a single model for all requests. If you choose a powerful model like GPT-4o, you get reliable results but pay premium prices for simple tasks that a smaller model could handle. If you choose a cheap model, complex queries fail. This is a false trade-off.

In practice, 60-80% of agent queries are straightforward — simple lookups, classification, template-based responses, or short factual answers. Only 20-40% require deep reasoning, long context processing, or complex multi-step chains. Model routing exploits this distribution by sending easy queries to small, fast, cheap models and reserving expensive models for hard queries.

A well-designed router can reduce LLM costs by 40-70% while maintaining quality on the queries that matter.

Router Architecture

A model router sits between the agent and the LLM providers. It inspects each query, classifies its complexity, and forwards it to the appropriate model:

from dataclasses import dataclass
from enum import Enum
from openai import OpenAI

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

@dataclass
class ModelTier:
    model: str
    base_url: str
    api_key: str
    max_tokens: int
    cost_per_1m_tokens: float  # USD per 1M input tokens

class ModelRouter:
    def __init__(self):
        self.tiers = {
            Complexity.SIMPLE: ModelTier(
                model="llama3.1:8b",
                base_url="http://localhost:11434/v1",
                api_key="ollama",
                max_tokens=512,
                cost_per_1m_tokens=0.0,  # free, runs locally
            ),
            Complexity.MODERATE: ModelTier(
                model="gpt-4o-mini",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",  # load from an environment variable in practice
                max_tokens=2048,
                cost_per_1m_tokens=0.15,
            ),
            Complexity.COMPLEX: ModelTier(
                model="gpt-4o",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=4096,
                cost_per_1m_tokens=2.50,
            ),
        }

    def route(self, messages: list) -> tuple[str, OpenAI]:
        complexity = self.classify_complexity(messages)
        tier = self.tiers[complexity]
        client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
        return tier.model, client

    def classify_complexity(self, messages: list) -> Complexity:
        user_msg = messages[-1]["content"] if messages else ""
        # Rule-based classification (fast, free)
        return self._rule_based_classify(user_msg)

    def _rule_based_classify(self, text: str) -> Complexity:
        text_lower = text.lower()
        word_count = len(text.split())

        # Simple: short queries, greetings, yes/no questions
        simple_indicators = [
            word_count < 15,
            text_lower.startswith(("what is", "who is", "define", "list")),
            text_lower in ("hello", "hi", "thanks", "bye"),
        ]

        # Complex: long context, multi-step reasoning, analysis
        complex_indicators = [
            word_count > 200,
            any(kw in text_lower for kw in [
                "analyze", "compare", "explain why", "step by step",
                "write a", "debug", "refactor", "design",
            ]),
            text.count("\n") > 5,  # Multi-line input
        ]

        if sum(complex_indicators) >= 2:
            return Complexity.COMPLEX
        if sum(simple_indicators) >= 2:
            return Complexity.SIMPLE
        return Complexity.MODERATE

LLM-Based Classification

Rule-based routing is fast but brittle. For higher accuracy, use a small, fast model to classify query complexity:

class LLMClassifier:
    def __init__(self):
        # Use a small local model for classification
        self.client = OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )

    def classify(self, query: str) -> Complexity:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{
                "role": "user",
                "content": (
                    "Classify this query complexity as SIMPLE, MODERATE, or COMPLEX.\n"
                    "SIMPLE: factual lookups, definitions, yes/no, greetings\n"
                    "MODERATE: explanations, summaries, single-step tasks\n"
                    "COMPLEX: analysis, multi-step reasoning, code generation, comparisons\n"
                    "Respond with one word only.\n\n"
                    f"Query: {query}"
                ),
            }],
            temperature=0.0,
            max_tokens=5,
        )

        label = response.choices[0].message.content.strip().upper()
        return Complexity[label] if label in Complexity.__members__ else Complexity.MODERATE

The classification call adds 50-200ms of latency and, when it runs on a local model, costs essentially nothing. The savings from routing simple queries to cheap models far outweigh this overhead.

Implementing Fallback Chains

Even with good routing, the selected model can fail: it might produce a low-quality response, hit a rate limit, or time out. Implement automatic escalation:


import time
import logging

logger = logging.getLogger(__name__)

class RoutingAgent:
    def __init__(self, router: ModelRouter):
        self.router = router
        self.escalation_order = [
            Complexity.SIMPLE,
            Complexity.MODERATE,
            Complexity.COMPLEX,
        ]

    def query(self, messages: list, system_prompt: str = "") -> str:
        complexity = self.router.classify_complexity(messages)
        start_idx = self.escalation_order.index(complexity)

        full_messages = []
        if system_prompt:
            full_messages.append({"role": "system", "content": system_prompt})
        full_messages.extend(messages)

        # Try the classified tier, then escalate on failure
        for tier_complexity in self.escalation_order[start_idx:]:
            tier = self.router.tiers[tier_complexity]
            client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)

            try:
                start = time.time()
                response = client.chat.completions.create(
                    model=tier.model,
                    messages=full_messages,
                    max_tokens=tier.max_tokens,
                    temperature=0.2,
                    timeout=30,
                )
                elapsed = time.time() - start
                result = response.choices[0].message.content or ""

                # Quality gate: if response is too short, escalate
                if len(result.strip()) < 20 and tier_complexity != Complexity.COMPLEX:
                    logger.warning(
                        f"Short response from {tier.model}, escalating"
                    )
                    continue

                logger.info(
                    f"Routed to {tier.model} "
                    f"({tier_complexity.value}) in {elapsed:.2f}s"
                )
                return result

            except Exception as e:
                logger.error(f"Failed on {tier.model}: {e}")
                continue

        return "I'm unable to process this request at the moment."

Measuring Router Performance

Track routing decisions to optimize over time:

from collections import defaultdict
import json

class RouterMetrics:
    def __init__(self):
        self.decisions = defaultdict(int)
        self.escalations = 0
        self.costs = defaultdict(float)

    def record(self, classified: Complexity, actual: Complexity, cost: float):
        self.decisions[classified.value] += 1
        if actual != classified:
            self.escalations += 1
        self.costs[actual.value] += cost

    def report(self) -> dict:
        total = sum(self.decisions.values())
        if total == 0:
            return {"total_queries": 0}
        return {
            "total_queries": total,
            "distribution": {
                k: f"{v/total*100:.1f}%"
                for k, v in self.decisions.items()
            },
            "escalation_rate": f"{self.escalations/total*100:.1f}%",
            "total_cost": f"${sum(self.costs.values()):.4f}",
            "cost_by_tier": {
                k: f"${v:.4f}" for k, v in self.costs.items()
            },
        }

metrics = RouterMetrics()
# After each query:
# metrics.record(classified_complexity, actual_tier_used, cost)
print(json.dumps(metrics.report(), indent=2))

Advanced: Embedding-Based Routing

For even better routing accuracy, use semantic similarity to a set of labeled example queries:

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingRouter:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Labeled example queries for each complexity tier
        self.examples = {
            Complexity.SIMPLE: [
                "What is Python?",
                "Define machine learning",
                "Hello",
                "What time is it?",
            ],
            Complexity.MODERATE: [
                "Explain how neural networks learn",
                "Summarize the benefits of microservices",
                "What are the pros and cons of NoSQL?",
            ],
            Complexity.COMPLEX: [
                "Design a distributed event-sourcing system for an e-commerce platform",
                "Compare transformer and LSTM architectures for time-series forecasting",
                "Debug this multi-threaded Python code that has a race condition",
            ],
        }

        # Pre-compute example embeddings
        self.tier_embeddings = {}
        for tier, texts in self.examples.items():
            self.tier_embeddings[tier] = self.embedder.encode(
                texts, normalize_embeddings=True
            )

    def classify(self, query: str) -> Complexity:
        query_emb = self.embedder.encode([query], normalize_embeddings=True)

        best_tier = Complexity.MODERATE
        best_score = -1.0

        for tier, embeddings in self.tier_embeddings.items():
            similarities = np.dot(embeddings, query_emb.T).flatten()
            max_sim = float(similarities.max())

            if max_sim > best_score:
                best_score = max_sim
                best_tier = tier

        return best_tier
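The nearest-example rule at the heart of this router can be exercised with toy unit vectors, independent of any embedding model (the vectors below are hand-made stand-ins, not real embeddings):

```python
import numpy as np

# Hand-made unit vectors standing in for precomputed example embeddings.
tier_embeddings = {
    "simple":  np.array([[1.0, 0.0], [0.9, 0.436]]),
    "complex": np.array([[0.0, 1.0]]),
}
query_emb = np.array([[0.95, 0.312]])  # a query vector near the "simple" cluster

# Pick the tier whose closest example has the highest cosine similarity
# (dot product, since all vectors are normalized).
best_tier, best_score = None, -1.0
for tier, embs in tier_embeddings.items():
    sims = np.dot(embs, query_emb.T).flatten()
    if sims.max() > best_score:
        best_score, best_tier = float(sims.max()), tier

print(best_tier)  # "simple"
```

Because routing compares against the *maximum* similarity per tier rather than the mean, one strong match is enough to claim a query for a tier.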

Cost Savings in Practice

Consider an agent handling 100,000 queries per month, averaging roughly 1,000 input tokens each, with this distribution after routing:

  • 60% simple (local Llama 8B): $0
  • 30% moderate (GPT-4o-mini at $0.15/1M input tokens): ~$4.50
  • 10% complex (GPT-4o at $2.50/1M input tokens): ~$25

Total with routing: ~$30/month. Without routing (all GPT-4o): ~$250/month.

That is a cost reduction of nearly 90% on input tokens, larger than the typical 40-70% range because the simple tier runs locally at zero marginal cost, while full quality is maintained on the complex queries that actually need it.
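The arithmetic can be sanity-checked in a few lines; the traffic volume, token average, and per-1M-token prices below are the illustrative assumptions from this section, so substitute your own measurements:

```python
# Rough monthly cost model: 100,000 queries, ~1,000 input tokens each
# (both figures are assumptions; plug in your own traffic data).
QUERIES = 100_000
TOKENS_PER_QUERY = 1_000
PRICE_PER_1M = {"simple": 0.0, "moderate": 0.15, "complex": 2.50}  # USD, input

def monthly_cost(shares: dict[str, float]) -> float:
    """Sum input-token cost across tiers for a given traffic split."""
    return sum(
        QUERIES * share * TOKENS_PER_QUERY / 1_000_000 * PRICE_PER_1M[tier]
        for tier, share in shares.items()
    )

routed = monthly_cost({"simple": 0.60, "moderate": 0.30, "complex": 0.10})
unrouted = monthly_cost({"complex": 1.0})  # everything on the top tier
print(f"routed: ${routed:.2f}/mo, unrouted: ${unrouted:.2f}/mo, "
      f"savings: {(1 - routed / unrouted) * 100:.0f}%")
```

Varying the shares shows how sensitive the savings are to the router's classification accuracy.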

FAQ

Does the routing classification itself add meaningful latency?

Rule-based routing adds less than 1ms. LLM-based classification with a local 2B model adds 50-200ms. Embedding-based routing adds 10-30ms. For most agent applications where LLM inference takes 500ms-3s, the routing overhead is negligible — and the latency savings from using a faster model for simple queries often more than compensate.

What if the router misclassifies a complex query as simple?

This is why fallback chains are essential. If the small model produces a short, low-quality, or incoherent response, the quality gate detects this and escalates to the next tier. In practice, misclassification rates below 15% have minimal impact on user experience because the escalation mechanism catches most errors.

Can I use model routing with tool-calling agents?

Yes, but route based on the tool complexity, not just the query text. Simple tool calls (single lookup, single API call) route to small models. Complex orchestration (multi-tool chains, conditional logic) routes to large models. You can inspect the agent's tool definitions to inform the routing decision.
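A minimal sketch of tool-aware routing, assuming OpenAI-style tool schemas; the orchestration tool names and thresholds are hypothetical placeholders, not a standard:

```python
from enum import Enum

class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

# Hypothetical names of orchestration-style tools -- adjust to your toolset.
ORCHESTRATION_TOOLS = {"run_workflow", "execute_plan", "spawn_subagent"}

def classify_with_tools(query: str, tools: list[dict]) -> Complexity:
    """Route on tool surface area as well as query text (illustrative thresholds)."""
    names = {t["function"]["name"] for t in tools}  # OpenAI-style tool schema
    if names & ORCHESTRATION_TOOLS or len(tools) > 5:
        return Complexity.COMPLEX   # multi-tool chains, conditional logic
    if tools:
        return Complexity.MODERATE  # a single lookup or API call
    return Complexity.SIMPLE if len(query.split()) < 15 else Complexity.MODERATE
```

In a real agent you would combine this signal with the text-based classifier rather than replace it.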


#ModelRouting #CostOptimization #AgentArchitecture #MultiModel #LLMOrchestration #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

