Learn Agentic AI

Smart Model Routing: Using Cheap Models First, Expensive Models When Needed

Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing.

The Model Routing Problem

Most teams default to using their best (and most expensive) model for every request. A customer asking "What are your business hours?" gets the same GPT-4o treatment as someone asking for a complex multi-step analysis. This is like sending every package via overnight express shipping — it works, but it destroys your margins.

Smart model routing classifies requests by complexity and routes them to the cheapest model that can handle them well. In practice, 60–80% of agent queries are simple enough for a small, fast model, meaning you only need the expensive model for the remaining 20–40%.
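The arithmetic is worth sanity-checking before building anything. The sketch below (illustrative numbers, not provider pricing) computes the blended cost of a routed workload relative to always using the expensive model:

```python
# Back-of-envelope check of the savings math above.
def blended_cost_ratio(simple_share: float, cheap_cost_ratio: float) -> float:
    """Routed spend as a fraction of always-expensive spend."""
    return simple_share * cheap_cost_ratio + (1.0 - simple_share) * 1.0

# Assumed split: 70% of traffic is simple, cheap model at ~6% of the
# expensive model's price.
ratio = blended_cost_ratio(0.70, 0.06)
print(f"routed spend is {ratio:.1%} of baseline, saving {1 - ratio:.1%}")
```

With that assumed split, routed spend comes to roughly a third of the always-expensive baseline, before classifier overhead and escalations eat into the margin.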

Designing a Two-Tier Router

The simplest effective pattern uses two tiers: a fast/cheap model for straightforward requests and a powerful/expensive model for complex ones. A lightweight classifier decides which tier handles each request.

from dataclasses import dataclass
from enum import Enum

import openai

class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_ratio: float  # relative to always using the expensive model

TIER_CONFIG = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "max_tokens": 1024,
        "cost_ratio": 0.06,  # ~6% the cost of gpt-4o
    },
    Complexity.COMPLEX: {
        "model": "gpt-4o",
        "max_tokens": 4096,
        "cost_ratio": 1.0,
    },
}

class ModelRouter:
    def __init__(self, client: openai.OpenAI):
        self.client = client

    def classify_complexity(self, user_message: str) -> RoutingDecision:
        classification_prompt = (
            "Classify this user message as SIMPLE or COMPLEX.\n"
            "SIMPLE: factual lookups, greetings, yes/no questions, "
            "status checks, single-step tasks.\n"
            "COMPLEX: multi-step reasoning, analysis, code generation, "
            "creative writing, comparisons, ambiguous queries.\n"
            f"Message: {user_message}\n"
            "Respond with only SIMPLE or COMPLEX."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": classification_prompt}],
            max_tokens=10,
            temperature=0,
        )
        label = response.choices[0].message.content.strip().upper()
        complexity = Complexity.COMPLEX if "COMPLEX" in label else Complexity.SIMPLE
        config = TIER_CONFIG[complexity]
        return RoutingDecision(
            complexity=complexity,
            model=config["model"],
            reason=label,
            estimated_cost_ratio=config["cost_ratio"],
        )

    def route_and_respond(self, user_message: str, system_prompt: str) -> dict:
        decision = self.classify_complexity(user_message)
        config = TIER_CONFIG[decision.complexity]
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=config["max_tokens"],
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": decision.model,
            "complexity": decision.complexity.value,
            "cost_ratio": decision.estimated_cost_ratio,
        }

Adding Quality Gates

Routing is only valuable if quality stays high. Add a quality gate that catches cases where the cheap model underperforms and automatically retries with the expensive model.


class QualityGatedRouter(ModelRouter):
    def __init__(self, client: openai.OpenAI, quality_threshold: float = 0.7):
        super().__init__(client)
        self.quality_threshold = quality_threshold

    def check_response_quality(self, question: str, answer: str) -> float:
        check_prompt = (
            "Rate this answer's quality from 0.0 to 1.0.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Respond with only a number."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            max_tokens=5,
            temperature=0,
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5  # unparseable rating: treat as borderline
        return max(0.0, min(1.0, score))  # clamp in case the model strays outside the range

    def route_with_fallback(self, user_message: str, system_prompt: str) -> dict:
        result = self.route_and_respond(user_message, system_prompt)
        if result["complexity"] == "simple":
            score = self.check_response_quality(user_message, result["response"])
            if score < self.quality_threshold:
                # Retry directly on the expensive tier: re-running the router
                # would likely classify the message as SIMPLE again.
                config = TIER_CONFIG[Complexity.COMPLEX]
                retry = self.client.chat.completions.create(
                    model=config["model"],
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    max_tokens=config["max_tokens"],
                )
                result.update(
                    response=retry.choices[0].message.content,
                    model_used=config["model"],
                    cost_ratio=config["cost_ratio"],
                    escalated=True,
                    original_quality_score=score,
                )
        return result

Cost Tracking Across Routes

To prove routing is paying off, record every request's tier, model, token count, and cost, then compare actual spend against an always-expensive baseline:

class RoutingCostTracker:
    def __init__(self):
        self.requests = []

    def record(self, complexity: str, model: str, tokens_used: int, cost: float):
        self.requests.append({
            "complexity": complexity,
            "model": model,
            "tokens": tokens_used,
            "cost": cost,
        })

    # Assumed blended gpt-4o price per 1M tokens (input + output combined);
    # adjust to match your expensive tier's actual pricing.
    EXPENSIVE_PRICE_PER_MTOK = 12.50

    def savings_report(self) -> dict:
        if not self.requests:
            return {}
        total_actual = sum(r["cost"] for r in self.requests)
        total_if_always_expensive = sum(
            r["tokens"] / 1_000_000 * self.EXPENSIVE_PRICE_PER_MTOK
            for r in self.requests
        )
        savings = total_if_always_expensive - total_actual
        return {
            "actual_cost": round(total_actual, 4),
            "cost_without_routing": round(total_if_always_expensive, 4),
            "savings": round(savings, 4),
            "savings_pct": round((savings / total_if_always_expensive) * 100, 1),
            "simple_pct": round(
                len([r for r in self.requests if r["complexity"] == "simple"])
                / len(self.requests) * 100, 1
            ),
        }

When Not to Route

Avoid model routing for safety-critical applications (medical, legal, or financial advice), for tasks that require a consistent voice or style across responses, and for workloads dominated by very short queries, where the classification call can cost more than the price gap between the two models.
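That last break-even point can be sketched directly. All dollar figures below are illustrative assumptions, not provider pricing:

```python
# Routing pays off only when the expected savings per request exceed the
# classifier's cost per request. (Illustrative numbers throughout.)
def routing_worth_it(
    simple_share: float,     # fraction of traffic the cheap tier can handle
    cheap_cost: float,       # avg cost per request on the cheap model ($)
    expensive_cost: float,   # avg cost per request on the expensive model ($)
    classifier_cost: float,  # cost of one classification call ($)
) -> bool:
    expected_savings = simple_share * (expensive_cost - cheap_cost)
    return expected_savings > classifier_cost

# Typical answers: the cost gap dwarfs the classifier call.
print(routing_worth_it(0.7, 0.0006, 0.01, 0.00003))    # True
# Very short queries: the gap is tiny and classifier overhead wins.
print(routing_worth_it(0.7, 0.00002, 0.0001, 0.0001))  # False
```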

FAQ

Does the classifier itself add significant cost?

The classifier call uses a cheap model with very few output tokens (just "SIMPLE" or "COMPLEX"), so it costs roughly $0.00001–$0.00005 per classification. At typical volumes, the classifier cost is 0.1–0.5% of total LLM spend. The savings from routing far outweigh this overhead.
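As a rough sanity check of that figure, assuming gpt-4o-mini list pricing of $0.15 per 1M input tokens and $0.60 per 1M output tokens (verify against current pricing before relying on this):

```python
# Illustrative per-classification cost under assumed gpt-4o-mini pricing.
INPUT_PRICE_PER_MTOK = 0.15   # assumed $/1M input tokens
OUTPUT_PRICE_PER_MTOK = 0.60  # assumed $/1M output tokens

prompt_tokens, output_tokens = 80, 2  # short prompt, "SIMPLE"/"COMPLEX" reply
cost = (prompt_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
print(f"${cost:.8f} per classification")  # about $0.0000132
```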

What if the classifier misroutes a complex query to the cheap model?

This is where quality gates matter. The fallback pattern detects low-quality responses and automatically escalates to the expensive model. Track your escalation rate — if it exceeds 15–20%, retune your classifier prompt or switch to a rule-based pre-filter for known complex patterns.
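A rule-based pre-filter can be as simple as a handful of regexes. The patterns below are hypothetical examples; tune them to the complex-query shapes you actually observe in your traffic:

```python
import re

# Messages matching a known "complex" pattern skip the LLM classifier and
# go straight to the expensive tier. Patterns here are illustrative only.
COMPLEX_PATTERNS = [
    r"\bcompare\b",
    r"\banalyz\w+",
    r"\bwrite (a|an|some) \w+",
    r"\bstep[- ]by[- ]step\b",
    r"```",  # inline code blocks usually mean a coding task
]

def prefilter_complex(message: str) -> bool:
    """Return True if the message matches a known complex pattern."""
    lowered = message.lower()
    return any(re.search(p, lowered) for p in COMPLEX_PATTERNS)

print(prefilter_complex("Compare Postgres and MySQL for analytics"))  # True
print(prefilter_complex("What are your business hours?"))             # False
```

Because the pre-filter runs before the classifier, misses cost nothing extra: unmatched messages simply fall through to the normal classification call.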

Can I use more than two tiers?

Absolutely. Three-tier systems (small/medium/large) work well at scale. The key is keeping the classifier logic simple enough that it does not become a cost center itself. Start with two tiers and add a middle tier only when you have enough traffic data to justify the complexity.
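A three-tier setup extends the two-tier config naturally. Model names and cost ratios below are illustrative assumptions; like the two-tier router, unexpected classifier labels fall back to the cheapest tier:

```python
from enum import Enum

class Tier(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"

# Cost ratios are relative to the medium tier; all values are assumptions.
THREE_TIER_CONFIG = {
    Tier.SMALL:  {"model": "gpt-4o-mini", "cost_ratio": 0.06},
    Tier.MEDIUM: {"model": "gpt-4o",      "cost_ratio": 1.0},
    Tier.LARGE:  {"model": "o1",          "cost_ratio": 6.0},
}

def pick_tier(label: str) -> Tier:
    """Map a classifier label to a tier, defaulting to the cheap tier."""
    mapping = {"SMALL": Tier.SMALL, "MEDIUM": Tier.MEDIUM, "LARGE": Tier.LARGE}
    return mapping.get(label.strip().upper(), Tier.SMALL)

print(pick_tier("medium").value)  # medium
```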


#ModelRouting #CostOptimization #LLMSelection #AIArchitecture #SmartRouting #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

