Learn Agentic AI

Smart Model Routing: Using Cheap Models First, Expensive Models When Needed

Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing.

The Model Routing Problem

Most teams default to using their best (and most expensive) model for every request. A customer asking "What are your business hours?" gets the same GPT-4o treatment as someone asking for a complex multi-step analysis. This is like sending every package via overnight express shipping — it works, but it destroys your margins.

Smart model routing classifies requests by complexity and routes them to the cheapest model that can handle them well. In practice, 60–80% of agent queries are simple enough for a small, fast model, meaning you only need the expensive model for the remaining 20–40%.
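The arithmetic is worth sanity-checking before building anything. The sketch below (illustrative numbers, not provider pricing) computes the blended cost of a routed workload relative to always using the expensive model:

```python
# Back-of-envelope check of the savings math above.
def blended_cost_ratio(simple_share: float, cheap_cost_ratio: float) -> float:
    """Routed spend as a fraction of always-expensive spend."""
    return simple_share * cheap_cost_ratio + (1.0 - simple_share) * 1.0

# Assumed split: 70% of traffic is simple, cheap model at ~6% of the
# expensive model's price.
ratio = blended_cost_ratio(0.70, 0.06)
print(f"routed spend is {ratio:.1%} of baseline, saving {1 - ratio:.1%}")
```

With that assumed split, routed spend comes to roughly a third of the always-expensive baseline, before classifier overhead and escalations eat into the margin.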

Designing a Two-Tier Router

The simplest effective pattern uses two tiers: a fast/cheap model for straightforward requests and a powerful/expensive model for complex ones. A lightweight classifier decides which tier handles each request.

from dataclasses import dataclass
from enum import Enum

import openai

class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_ratio: float  # relative to always using the expensive model

TIER_CONFIG = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "max_tokens": 1024,
        "cost_ratio": 0.06,  # ~6% the cost of gpt-4o
    },
    Complexity.COMPLEX: {
        "model": "gpt-4o",
        "max_tokens": 4096,
        "cost_ratio": 1.0,
    },
}

class ModelRouter:
    def __init__(self, client: openai.OpenAI):
        self.client = client

    def classify_complexity(self, user_message: str) -> RoutingDecision:
        classification_prompt = (
            "Classify this user message as SIMPLE or COMPLEX.\n"
            "SIMPLE: factual lookups, greetings, yes/no questions, "
            "status checks, single-step tasks.\n"
            "COMPLEX: multi-step reasoning, analysis, code generation, "
            "creative writing, comparisons, ambiguous queries.\n"
            f"Message: {user_message}\n"
            "Respond with only SIMPLE or COMPLEX."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": classification_prompt}],
            max_tokens=10,
            temperature=0,
        )
        label = response.choices[0].message.content.strip().upper()
        complexity = Complexity.COMPLEX if "COMPLEX" in label else Complexity.SIMPLE
        config = TIER_CONFIG[complexity]
        return RoutingDecision(
            complexity=complexity,
            model=config["model"],
            reason=label,
            estimated_cost_ratio=config["cost_ratio"],
        )

    def route_and_respond(self, user_message: str, system_prompt: str) -> dict:
        decision = self.classify_complexity(user_message)
        config = TIER_CONFIG[decision.complexity]
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=config["max_tokens"],
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": decision.model,
            "complexity": decision.complexity.value,
            "cost_ratio": decision.estimated_cost_ratio,
        }

Adding Quality Gates

Routing is only valuable if quality stays high. Add a quality gate that catches cases where the cheap model underperforms and automatically retries with the expensive model.


class QualityGatedRouter(ModelRouter):
    def __init__(self, client: openai.OpenAI, quality_threshold: float = 0.7):
        super().__init__(client)
        self.quality_threshold = quality_threshold

    def check_response_quality(self, question: str, answer: str) -> float:
        check_prompt = (
            "Rate this answer's quality from 0.0 to 1.0.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Respond with only a number."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            max_tokens=5,
            temperature=0,
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5  # unparseable rating: treat as borderline
        return max(0.0, min(1.0, score))  # clamp in case the model strays outside the range

    def route_with_fallback(self, user_message: str, system_prompt: str) -> dict:
        result = self.route_and_respond(user_message, system_prompt)
        if result["complexity"] == "simple":
            score = self.check_response_quality(user_message, result["response"])
            if score < self.quality_threshold:
                # Retry directly on the expensive tier: re-running the router
                # would likely classify the message as SIMPLE again.
                config = TIER_CONFIG[Complexity.COMPLEX]
                retry = self.client.chat.completions.create(
                    model=config["model"],
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    max_tokens=config["max_tokens"],
                )
                result.update(
                    response=retry.choices[0].message.content,
                    model_used=config["model"],
                    cost_ratio=config["cost_ratio"],
                    escalated=True,
                    original_quality_score=score,
                )
        return result

Cost Tracking Across Routes

To prove routing is paying off, record every request's tier, model, token count, and cost, then compare actual spend against an always-expensive baseline:

class RoutingCostTracker:
    def __init__(self):
        self.requests = []

    def record(self, complexity: str, model: str, tokens_used: int, cost: float):
        self.requests.append({
            "complexity": complexity,
            "model": model,
            "tokens": tokens_used,
            "cost": cost,
        })

    # Assumed blended gpt-4o price per 1M tokens (input + output combined);
    # adjust to match your expensive tier's actual pricing.
    EXPENSIVE_PRICE_PER_MTOK = 12.50

    def savings_report(self) -> dict:
        if not self.requests:
            return {}
        total_actual = sum(r["cost"] for r in self.requests)
        total_if_always_expensive = sum(
            r["tokens"] / 1_000_000 * self.EXPENSIVE_PRICE_PER_MTOK
            for r in self.requests
        )
        savings = total_if_always_expensive - total_actual
        return {
            "actual_cost": round(total_actual, 4),
            "cost_without_routing": round(total_if_always_expensive, 4),
            "savings": round(savings, 4),
            "savings_pct": round((savings / total_if_always_expensive) * 100, 1),
            "simple_pct": round(
                len([r for r in self.requests if r["complexity"] == "simple"])
                / len(self.requests) * 100, 1
            ),
        }

When Not to Route

Avoid model routing for safety-critical applications (medical, legal, or financial advice), for tasks that require a consistent voice or style across responses, and for workloads dominated by very short queries, where the classification call can cost more than the price gap between the two models.
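That last break-even point can be sketched directly. All dollar figures below are illustrative assumptions, not provider pricing:

```python
# Routing pays off only when the expected savings per request exceed the
# classifier's cost per request. (Illustrative numbers throughout.)
def routing_worth_it(
    simple_share: float,     # fraction of traffic the cheap tier can handle
    cheap_cost: float,       # avg cost per request on the cheap model ($)
    expensive_cost: float,   # avg cost per request on the expensive model ($)
    classifier_cost: float,  # cost of one classification call ($)
) -> bool:
    expected_savings = simple_share * (expensive_cost - cheap_cost)
    return expected_savings > classifier_cost

# Typical answers: the cost gap dwarfs the classifier call.
print(routing_worth_it(0.7, 0.0006, 0.01, 0.00003))    # True
# Very short queries: the gap is tiny and classifier overhead wins.
print(routing_worth_it(0.7, 0.00002, 0.0001, 0.0001))  # False
```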

FAQ

Does the classifier itself add significant cost?

The classifier call uses a cheap model with very few output tokens (just "SIMPLE" or "COMPLEX"), so it costs roughly $0.00001–$0.00005 per classification. At typical volumes, the classifier cost is 0.1–0.5% of total LLM spend. The savings from routing far outweigh this overhead.
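As a rough sanity check of that figure, assuming gpt-4o-mini list pricing of $0.15 per 1M input tokens and $0.60 per 1M output tokens (verify against current pricing before relying on this):

```python
# Illustrative per-classification cost under assumed gpt-4o-mini pricing.
INPUT_PRICE_PER_MTOK = 0.15   # assumed $/1M input tokens
OUTPUT_PRICE_PER_MTOK = 0.60  # assumed $/1M output tokens

prompt_tokens, output_tokens = 80, 2  # short prompt, "SIMPLE"/"COMPLEX" reply
cost = (prompt_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
print(f"${cost:.8f} per classification")  # about $0.0000132
```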

What if the classifier misroutes a complex query to the cheap model?

This is where quality gates matter. The fallback pattern detects low-quality responses and automatically escalates to the expensive model. Track your escalation rate — if it exceeds 15–20%, retune your classifier prompt or switch to a rule-based pre-filter for known complex patterns.
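A rule-based pre-filter can be as simple as a handful of regexes. The patterns below are hypothetical examples; tune them to the complex-query shapes you actually observe in your traffic:

```python
import re

# Messages matching a known "complex" pattern skip the LLM classifier and
# go straight to the expensive tier. Patterns here are illustrative only.
COMPLEX_PATTERNS = [
    r"\bcompare\b",
    r"\banalyz\w+",
    r"\bwrite (a|an|some) \w+",
    r"\bstep[- ]by[- ]step\b",
    r"```",  # inline code blocks usually mean a coding task
]

def prefilter_complex(message: str) -> bool:
    """Return True if the message matches a known complex pattern."""
    lowered = message.lower()
    return any(re.search(p, lowered) for p in COMPLEX_PATTERNS)

print(prefilter_complex("Compare Postgres and MySQL for analytics"))  # True
print(prefilter_complex("What are your business hours?"))             # False
```

Because the pre-filter runs before the classifier, misses cost nothing extra: unmatched messages simply fall through to the normal classification call.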

Can I use more than two tiers?

Absolutely. Three-tier systems (small/medium/large) work well at scale. The key is keeping the classifier logic simple enough that it does not become a cost center itself. Start with two tiers and add a middle tier only when you have enough traffic data to justify the complexity.
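A three-tier setup extends the two-tier config naturally. Model names and cost ratios below are illustrative assumptions; like the two-tier router, unexpected classifier labels fall back to the cheapest tier:

```python
from enum import Enum

class Tier(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"

# Cost ratios are relative to the medium tier; all values are assumptions.
THREE_TIER_CONFIG = {
    Tier.SMALL:  {"model": "gpt-4o-mini", "cost_ratio": 0.06},
    Tier.MEDIUM: {"model": "gpt-4o",      "cost_ratio": 1.0},
    Tier.LARGE:  {"model": "o1",          "cost_ratio": 6.0},
}

def pick_tier(label: str) -> Tier:
    """Map a classifier label to a tier, defaulting to the cheap tier."""
    mapping = {"SMALL": Tier.SMALL, "MEDIUM": Tier.MEDIUM, "LARGE": Tier.LARGE}
    return mapping.get(label.strip().upper(), Tier.SMALL)

print(pick_tier("medium").value)  # medium
```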


#ModelRouting #CostOptimization #LLMSelection #AIArchitecture #SmartRouting #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

