---
title: "Smart Model Routing: Using Cheap Models First, Expensive Models When Needed"
description: "Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing."
canonical: https://callsphere.ai/blog/smart-model-routing-cheap-models-first-expensive-when-needed
category: "Learn Agentic AI"
tags: ["Model Routing", "Cost Optimization", "LLM Selection", "AI Architecture", "Smart Routing"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-08T18:48:04.554Z
---

# Smart Model Routing: Using Cheap Models First, Expensive Models When Needed

> Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing.

## The Model Routing Problem

Most teams default to using their best (and most expensive) model for every request. A customer asking "What are your business hours?" gets the same GPT-4o treatment as someone asking for a complex multi-step analysis. This is like sending every package via overnight express shipping — it works, but it destroys your margins.

Smart model routing classifies requests by complexity and routes them to the cheapest model that can handle them well. In practice, 60–80% of agent queries are simple enough for a small, fast model, meaning you only need the expensive model for the remaining 20–40%.
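A quick back-of-envelope sketch makes the economics concrete. The numbers below are illustrative assumptions (a cheap tier at roughly 6% of the expensive tier's per-token cost, handling 70% of traffic), not measured data:

```python
def blended_cost_ratio(simple_share: float, cheap_cost_ratio: float) -> float:
    """Blended cost relative to sending every request to the expensive model.

    simple_share: fraction of traffic the cheap tier handles (e.g. 0.7).
    cheap_cost_ratio: cheap tier's per-token cost as a fraction of the
    expensive tier's (e.g. 0.06, an assumed figure).
    """
    return simple_share * cheap_cost_ratio + (1 - simple_share) * 1.0

# With 70% of traffic on a model at 6% of the cost:
ratio = blended_cost_ratio(0.7, 0.06)
print(f"blended cost: {ratio:.1%} of baseline, {1 - ratio:.1%} saved")
```

At 70% simple traffic the blended cost is about 34% of the all-expensive baseline; real-world savings land lower once classifier overhead and quality-gate escalations are counted, which is where the 40–60% figure comes from.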

## Designing a Two-Tier Router

The simplest effective pattern uses two tiers: a fast/cheap model for straightforward requests and a powerful/expensive model for complex ones. A lightweight classifier decides which tier handles each request.

```mermaid
flowchart LR
    R["Incoming request"] --> CL["Lightweight classifier
(cheap model)"]
    CL -->|SIMPLE| S["Cheap tier
gpt-4o-mini"]
    CL -->|COMPLEX| X["Expensive tier
gpt-4o"]
    S --> QG{"Quality gate
passes?"}
    QG -->|yes| OUT(("Response"))
    QG -->|no| X
    X --> OUT
    style S fill:#4f46e5,stroke:#4338ca,color:#fff
    style X fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from dataclasses import dataclass
from enum import Enum
import openai

class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_ratio: float  # relative to always using the expensive model

TIER_CONFIG = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "max_tokens": 1024,
        "cost_ratio": 0.06,  # ~6% the cost of gpt-4o
    },
    Complexity.COMPLEX: {
        "model": "gpt-4o",
        "max_tokens": 4096,
        "cost_ratio": 1.0,
    },
}

class ModelRouter:
    def __init__(self, client: openai.OpenAI):
        self.client = client

    def classify_complexity(self, user_message: str) -> RoutingDecision:
        classification_prompt = (
            "Classify this user message as SIMPLE or COMPLEX.\n"
            "SIMPLE: factual lookups, greetings, yes/no questions, "
            "status checks, single-step tasks.\n"
            "COMPLEX: multi-step reasoning, analysis, code generation, "
            "creative writing, comparisons, ambiguous queries.\n"
            f"Message: {user_message}\n"
            "Respond with only SIMPLE or COMPLEX."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": classification_prompt}],
            max_tokens=10,
            temperature=0,
        )
        label = response.choices[0].message.content.strip().upper()
        complexity = Complexity.COMPLEX if "COMPLEX" in label else Complexity.SIMPLE
        config = TIER_CONFIG[complexity]
        return RoutingDecision(
            complexity=complexity,
            model=config["model"],
            reason=label,
            estimated_cost_ratio=config["cost_ratio"],
        )

    def route_and_respond(self, user_message: str, system_prompt: str) -> dict:
        decision = self.classify_complexity(user_message)
        config = TIER_CONFIG[decision.complexity]
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=config["max_tokens"],
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": decision.model,
            "complexity": decision.complexity.value,
            "cost_ratio": decision.estimated_cost_ratio,
        }
```

## Adding Quality Gates

Routing is only valuable if quality stays high. Add a quality gate that catches cases where the cheap model underperforms and automatically retries with the expensive model.

```python
class QualityGatedRouter(ModelRouter):
    def __init__(self, client: openai.OpenAI, quality_threshold: float = 0.7):
        super().__init__(client)
        self.quality_threshold = quality_threshold

    def check_response_quality(self, question: str, answer: str) -> float:
        check_prompt = (
            "Rate this answer's quality from 0.0 to 1.0.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Respond with only a number."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            max_tokens=5,
            temperature=0,
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    def route_with_fallback(self, user_message: str, system_prompt: str) -> dict:
        result = self.route_and_respond(user_message, system_prompt)
        if result["complexity"] == "simple":
            score = self.check_response_quality(user_message, result["response"])
            if score < self.quality_threshold:
                # Cheap response fell short: retry with the expensive tier
                config = TIER_CONFIG[Complexity.COMPLEX]
                response = self.client.chat.completions.create(
                    model=config["model"],
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    max_tokens=config["max_tokens"],
                )
                result = {
                    "response": response.choices[0].message.content,
                    "model_used": config["model"],
                    "complexity": Complexity.COMPLEX.value,
                    "cost_ratio": config["cost_ratio"],
                    "escalated": True,
                }
        return result
```

## Measuring the Savings

To prove the router is paying for itself, log every request's token count, actual cost, and complexity label, then compare against the counterfactual of always using the expensive model.

```python
class CostTracker:
    # Blended $/1M tokens for the expensive tier (gpt-4o)
    EXPENSIVE_PRICE_PER_M = 12.50

    def __init__(self):
        self.requests: list[dict] = []

    def record(self, tokens: int, cost: float, complexity: str) -> None:
        self.requests.append(
            {"tokens": tokens, "cost": cost, "complexity": complexity}
        )

    def savings_summary(self) -> dict:
        total_actual = sum(r["cost"] for r in self.requests)
        total_if_always_expensive = sum(
            r["tokens"] / 1_000_000 * self.EXPENSIVE_PRICE_PER_M
            for r in self.requests
        )
        savings = total_if_always_expensive - total_actual
        return {
            "actual_cost": round(total_actual, 4),
            "cost_without_routing": round(total_if_always_expensive, 4),
            "savings": round(savings, 4),
            "savings_pct": round((savings / total_if_always_expensive) * 100, 1),
            "simple_pct": round(
                len([r for r in self.requests if r["complexity"] == "simple"])
                / len(self.requests) * 100, 1
            ),
        }
```

## When Not to Route

Avoid model routing for safety-critical applications (medical, legal, or financial advice), for tasks that require a consistent voice or style across responses, and for workloads dominated by very short queries, where the classifier call itself can cost more than the price difference between the two tiers.
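The break-even condition for that last case can be written down directly. This is a sketch with illustrative per-request dollar figures, not real pricing:

```python
def routing_worth_it(
    classifier_cost: float,     # $ per classification call
    expensive_cost: float,      # $ to answer this request on the expensive model
    cheap_cost: float,          # $ to answer this request on the cheap model
    simple_probability: float,  # expected share of requests routed cheap
) -> bool:
    """Routing pays off only when the expected saving beats the overhead."""
    expected_saving = simple_probability * (expensive_cost - cheap_cost)
    return expected_saving > classifier_cost

# A normal-length request: saving comfortably exceeds classifier overhead
print(routing_worth_it(0.00003, 0.0002, 0.000012, 0.7))

# A very short request: the gap between tiers is smaller than the overhead
print(routing_worth_it(0.00003, 0.00005, 0.000003, 0.5))
```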

## FAQ

### Does the classifier itself add significant cost?

The classifier call uses a cheap model with very few output tokens (just "SIMPLE" or "COMPLEX"), so it costs roughly $0.00001–$0.00005 per classification. At typical volumes, the classifier cost is 0.1–0.5% of total LLM spend. The savings from routing far outweigh this overhead.
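The per-classification figure above is easy to reproduce. The prices below are assumed gpt-4o-mini list prices at the time of writing ($0.15/1M input, $0.60/1M output); check your provider's current pricing:

```python
# Assumed per-token prices for the classifier model
INPUT_PRICE_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 0.60 / 1_000_000

def classifier_cost(prompt_tokens: int, output_tokens: int = 2) -> float:
    """Cost of one classification call; output is just 'SIMPLE' or 'COMPLEX'."""
    return (prompt_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# A ~100-token classification prompt costs on the order of $0.0000162
print(f"${classifier_cost(100):.8f}")
```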

### What if the classifier misroutes a complex query to the cheap model?

This is where quality gates matter. The fallback pattern detects low-quality responses and automatically escalates to the expensive model. Track your escalation rate — if it exceeds 15–20%, retune your classifier prompt or switch to a rule-based pre-filter for known complex patterns.
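A rule-based pre-filter of the kind mentioned above can be a few regexes that short-circuit the LLM classifier. This is a minimal sketch; the patterns are illustrative assumptions, not a tested production list:

```python
import re

# Hypothetical patterns that always route straight to the expensive tier
KNOWN_COMPLEX_PATTERNS = [
    re.compile(r"\b(compare|analyz|refactor|debug|step[- ]by[- ]step)", re.I),
    re.compile(r"```"),           # fenced code usually means a coding task
    re.compile(r"\?.*\?", re.S),  # multiple questions in one message
]

def prefilter_complex(message: str) -> bool:
    """Return True when a message matches a known-complex pattern,
    skipping the LLM classifier entirely."""
    return any(p.search(message) for p in KNOWN_COMPLEX_PATTERNS)
```

Requests the pre-filter flags skip classification and go directly to the expensive tier; everything else still passes through the LLM classifier.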

### Can I use more than two tiers?

Absolutely. Three-tier systems (small/medium/large) work well at scale. The key is keeping the classifier logic simple enough that it does not become a cost center itself. Start with two tiers and add a middle tier only when you have enough traffic data to justify the complexity.
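Extending the two-tier table to three is mostly configuration. A sketch of what that might look like, where `"mid-tier-model"` is a placeholder for whatever mid-size model your provider offers:

```python
# Hypothetical three-tier table; cost ratios are illustrative
THREE_TIER_CONFIG = {
    "simple":   {"model": "gpt-4o-mini",    "cost_ratio": 0.06},
    "moderate": {"model": "mid-tier-model", "cost_ratio": 0.30},
    "complex":  {"model": "gpt-4o",         "cost_ratio": 1.0},
}

def pick_tier(label: str) -> dict:
    """Map a classifier label to a tier config. Unknown or garbled labels
    fall back to the top tier: misrouting expensive is safer than cheap."""
    return THREE_TIER_CONFIG.get(
        label.strip().lower(), THREE_TIER_CONFIG["complex"]
    )
```

The fallback direction matters: a classifier that emits an unexpected label should cost you a few extra cents, not a bad answer.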

---

#ModelRouting #CostOptimization #LLMSelection #AIArchitecture #SmartRouting #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/smart-model-routing-cheap-models-first-expensive-when-needed
