---
title: "Agent Cost Optimization: Tokens, Caching, and Smart Routing"
description: "Reduce AI agent costs by 60-80% using token tracking, prompt caching with prompt_cache_retention, model routing, context truncation, and real-time cost dashboards with the OpenAI Agents SDK."
canonical: https://callsphere.ai/blog/agent-cost-optimization-tokens-caching-smart-routing
category: "Learn Agentic AI"
tags: ["OpenAI", "Cost Optimization", "Tokens", "Caching"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-06T01:02:41.587Z
---

# Agent Cost Optimization: Tokens, Caching, and Smart Routing

> Reduce AI agent costs by 60-80% using token tracking, prompt caching with prompt_cache_retention, model routing, context truncation, and real-time cost dashboards with the OpenAI Agents SDK.

## Why Agent Costs Spiral Out of Control

A single agent call costs fractions of a cent. A multi-agent workflow with tool calls and retries costs a few cents. Multiply by thousands of users and millions of daily requests, and you are looking at thousands of dollars per day. Agent costs scale non-linearly because each conversation turn adds to the context window, each tool call adds a generation round, and each handoff passes the full conversation history to the next agent.
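
A back-of-envelope sketch makes the growth concrete. The per-turn token counts and the per-million-token rate below are illustrative assumptions, not real pricing:

```python
# Illustrative sketch: input cost per turn grows with the whole history,
# so total conversation cost grows roughly quadratically in turn count.
SYSTEM_TOKENS = 1_500    # assumed system prompt size
TOKENS_PER_TURN = 400    # assumed user message + assistant reply
RATE_PER_MTOK = 2.00     # placeholder input rate, USD per million tokens

total_cost = 0.0
history = SYSTEM_TOKENS
for turn in range(1, 21):
    input_tokens = history          # the full history is re-sent each turn
    total_cost += input_tokens / 1_000_000 * RATE_PER_MTOK
    history += TOKENS_PER_TURN      # this turn's messages join the history

print(f"20-turn conversation input cost: ${total_cost:.4f}")
# Turn 20 alone costs ~6x turn 1, because the context keeps accumulating.
```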

This post covers practical techniques to reduce agent costs by 60-80% without sacrificing quality.

## Technique 1: Token Tracking and Visibility

You cannot optimize what you cannot measure. Start by tracking token usage per agent, per tool call, and per workflow:

```python
from agents import Agent, Runner
from dataclasses import dataclass

@dataclass
class TokenReport:
    agent_name: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.input_tokens

async def run_with_tracking(agent: Agent, input_text: str) -> tuple[str, list[TokenReport]]:
    """Run an agent and return detailed token reports."""
    result = await Runner.run(agent, input=input_text)

    reports = []
    for response in result.raw_responses:
        if response.usage:
            # raw_responses holds one ModelResponse per model call; the SDK
            # does not attach an agent name or model to each response, so we
            # report the agent this run started with.
            details = getattr(response.usage, "input_tokens_details", None)
            reports.append(TokenReport(
                agent_name=agent.name,
                model=str(agent.model or "unknown"),
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                cached_tokens=getattr(details, "cached_tokens", 0),
            ))

    return result.final_output, reports

# Usage
agent = Agent(name="CostTracker", model="gpt-4.1", instructions="Be concise.")
output, reports = await run_with_tracking(agent, "Explain quantum computing.")

for r in reports:
    print(f"{r.agent_name} ({r.model}): {r.input_tokens}in + {r.output_tokens}out = {r.total_tokens} total")
    print(f"  Cache hit rate: {r.cache_hit_rate:.1%}")
```

## Technique 2: Prompt Caching

OpenAI automatically caches prompt prefixes of roughly 1,024 tokens or more that remain stable across requests. For agents with long system instructions, this can reduce input token costs by 50% or more. Use `prompt_cache_retention` to control how long cached prefixes persist:

```python
from agents import Agent, ModelSettings

# Long, detailed system instructions get cached automatically
detailed_agent = Agent(
    name="DetailedAgent",
    model="gpt-4.1",
    instructions="""You are an expert financial analyst assistant.

    ## Response Format
    Always structure your responses as follows:
    1. Executive Summary (2-3 sentences)
    2. Key Findings (bullet points)
    3. Detailed Analysis (paragraphs)
    4. Recommendations (numbered list)
    5. Risk Factors (bullet points)

    ## Data Handling Rules
    - Always cite specific numbers and dates
    - Convert all currencies to USD unless asked otherwise
    - Use trailing twelve months (TTM) for financial ratios
    - Flag any data older than 6 months as potentially stale

    ## Analysis Framework
    - Compare against industry benchmarks
    - Identify trends over 3+ periods
    - Note any anomalies or red flags
    - Consider macroeconomic context
    """,
    model_settings=ModelSettings(
        # Extended cache retention; the API accepts "24h" here
        # (the default is "in-memory"), not a number of seconds.
        extra_body={"prompt_cache_retention": "24h"},
    ),
)
```

The first request pays full price for the system instructions. Subsequent requests within the retention window pay a reduced rate for cached input tokens. For agents with 2000+ token system prompts that handle dozens of requests per hour, this alone cuts input costs by 40-50%.
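
As a sanity check on that claim, the effective input cost depends on how much of the prompt is served from cache and on the cached-token discount. A quick sketch, where the 75% discount is an assumption; check current pricing for your model:

```python
def effective_input_cost_ratio(prompt_tokens: int, cached_prefix_tokens: int,
                               cached_discount: float = 0.75) -> float:
    """Fraction of the full input price paid once the prefix is cached.

    cached_discount is the assumed price reduction on cached tokens
    (0.75 means cached tokens cost a quarter of the normal input rate).
    """
    uncached = prompt_tokens - cached_prefix_tokens
    discounted = cached_prefix_tokens * (1 - cached_discount)
    return (uncached + discounted) / prompt_tokens

# A 3,000-token prompt whose 2,000-token system prefix is cached:
ratio = effective_input_cost_ratio(3_000, 2_000)
print(f"Input cost vs. no caching: {ratio:.0%}")  # 50% -> half the input spend saved
```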

## Technique 3: Context Truncation

As conversations grow, the context window fills with old messages that may not be relevant. Use automatic truncation to manage costs:

```python
from agents import Agent, ModelSettings

# Automatically truncate long conversations
agent = Agent(
    name="TruncatingAgent",
    model="gpt-4.1",
    instructions="Help users with their questions. Focus on the most recent context.",
    model_settings=ModelSettings(
        truncation="auto",  # SDK manages context window automatically
    ),
)
```

The `truncation="auto"` setting lets the SDK automatically drop older messages when the context window approaches its limit. This prevents the conversation from growing unboundedly and keeps costs predictable.

For more control, implement manual context management:

```python
import tiktoken

def trim_conversation(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the system message and most recent messages within budget."""
    # gpt-4.1-family models use the o200k_base encoding; older tiktoken
    # releases may not recognize the model name directly.
    encoding = tiktoken.get_encoding("o200k_base")

    # Always keep the system message
    system_messages = [m for m in messages if m["role"] == "system"]
    other_messages = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(len(encoding.encode(m["content"])) for m in system_messages)
    budget = max_tokens - system_tokens

    # Add messages from most recent, working backwards
    trimmed = []
    running_tokens = 0
    for msg in reversed(other_messages):
        msg_tokens = len(encoding.encode(msg["content"]))
        if running_tokens + msg_tokens > budget:
            break
        trimmed.insert(0, msg)
        running_tokens += msg_tokens

    return system_messages + trimmed
```
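
For example, trimming a conversation that has outgrown the budget (the message contents are placeholders):

```python
conversation = [
    {"role": "system", "content": "You are a helpful support agent."},
    *[
        {"role": "user" if i % 2 == 0 else "assistant",
         "content": f"Message {i}: " + "details " * 400}
        for i in range(40)
    ],
]

trimmed = trim_conversation(conversation, max_tokens=8000)
print(f"Kept {len(trimmed)} of {len(conversation)} messages")
# The system message always survives; the oldest turns are dropped first.
```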

## Technique 4: Smart Model Routing

Route requests to the cheapest model that can handle the task:

```python
from agents import Agent, Runner

# Model tier definitions
MODELS = {
    "simple": "gpt-4.1-nano",   # Cheapest: classification, extraction
    "standard": "gpt-4.1-mini",  # Mid-tier: most conversational tasks
    "complex": "gpt-4.1",        # Premium: tool-heavy, coding
    "reasoning": "gpt-5",        # Expensive: complex analysis
}

async def classify_complexity(user_input: str) -> str:
    """Use the cheapest model to classify request complexity."""
    classifier = Agent(
        name="Classifier",
        model=MODELS["simple"],
        instructions=(
            "Classify the complexity of this request. "
            "Reply with exactly one word: simple, standard, complex, or reasoning."
        ),
    )
    result = await Runner.run(classifier, input=user_input)
    complexity = result.final_output.strip().lower()
    if complexity not in MODELS:
        complexity = "standard"
    return complexity

async def cost_optimized_run(user_input: str) -> dict:
    """Route to the cheapest appropriate model."""
    complexity = await classify_complexity(user_input)
    model = MODELS[complexity]

    agent = Agent(
        name="OptimizedAgent",
        model=model,
        instructions="Provide helpful, accurate responses.",
    )

    result = await Runner.run(agent, input=user_input)

    return {
        "response": result.final_output,
        "model": model,
        "complexity": complexity,
    }
```

The classifier itself runs on the cheapest model. The total cost of classifier + routed model is still lower than running everything on GPT-4.1.
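
A blended-cost sketch shows why. The relative prices below are placeholder ratios normalized to the premium model, and the traffic mix is assumed:

```python
# Placeholder relative cost per request, normalized so gpt-4.1 = 1.0.
# Substitute your provider's actual per-token pricing.
RELATIVE_COST = {"gpt-4.1-nano": 0.05, "gpt-4.1-mini": 0.2, "gpt-4.1": 1.0}

# Assumed post-routing traffic mix.
MIX = {"gpt-4.1-nano": 0.30, "gpt-4.1-mini": 0.50, "gpt-4.1": 0.20}

# Every request also pays for one short classifier call on the nano model.
classifier_overhead = RELATIVE_COST["gpt-4.1-nano"]

blended = sum(RELATIVE_COST[m] * share for m, share in MIX.items()) + classifier_overhead
print(f"Blended cost vs. all-gpt-4.1: {blended:.0%}")  # roughly a third of baseline
```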

## Technique 5: Response Length Control

Controlling output length is one of the simplest cost reductions:

```python
from agents import Agent, ModelSettings

# Enforce concise outputs
concise_agent = Agent(
    name="ConciseAgent",
    model="gpt-4.1",
    instructions=(
        "Answer questions accurately and concisely. "
        "Use bullet points. Never exceed 200 words."
    ),
    model_settings=ModelSettings(
        max_tokens=300,  # Hard limit on output tokens
    ),
)
```

Combining instruction-level guidance ("be concise") with a hard `max_tokens` limit gives you both quality and cost control.
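
The savings are easy to estimate, since output spend drops in proportion to average response length. With illustrative averages of 500 output tokens before and 300 after:

```python
avg_before, avg_after = 500, 300  # assumed average output tokens per response
reduction = 1 - avg_after / avg_before
print(f"Output token spend reduced by {reduction:.0%}")  # 40%
```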

## Technique 6: Caching Agent Responses

For idempotent queries, cache the agent's response to avoid paying for the same computation twice:

```python
import hashlib
import time
from typing import Any

from agents import Agent, Runner

# Simple in-memory cache (use Redis in production)
_response_cache: dict[str, Any] = {}

def cache_key(agent_name: str, model: str, input_text: str) -> str:
    """Generate a deterministic cache key."""
    raw = f"{agent_name}:{model}:{input_text}"
    return hashlib.sha256(raw.encode()).hexdigest()

async def cached_run(agent: Agent, input_text: str, ttl: int = 3600) -> str:
    """Run an agent with response caching."""
    key = cache_key(agent.name, str(agent.model or ""), input_text)

    if key in _response_cache:
        entry = _response_cache[key]
        if time.time() - entry["timestamp"] < ttl:
            return entry["response"]  # Cache hit: no API call, no cost

    result = await Runner.run(agent, input=input_text)
    _response_cache[key] = {"response": result.final_output, "timestamp": time.time()}
    return result.final_output
```

## Technique 7: Real-Time Cost Dashboards

Token counts only matter once they are translated into dollars. Aggregate per-request usage into a running summary so spending anomalies surface the same day rather than on next month's invoice. A minimal in-memory sketch (a production version would persist records to a database):

```python
from collections import defaultdict
from datetime import datetime

class CostDashboard:
    """Accumulates per-request usage records and summarizes daily spend."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def record(self, model: str, input_tokens: int, output_tokens: int, cost: float) -> None:
        """Log one request; compute cost from your provider's current per-token pricing."""
        self.records.append({
            "timestamp": datetime.utcnow(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
        })

    def daily_summary(self) -> dict:
        today = datetime.utcnow().date()
        today_records = [r for r in self.records if r["timestamp"].date() == today]

        by_model = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
        for r in today_records:
            m = by_model[r["model"]]
            m["requests"] += 1
            m["tokens"] += r["input_tokens"] + r["output_tokens"]
            m["cost"] += r["cost"]

        total_cost = sum(m["cost"] for m in by_model.values())

        return {
            "date": str(today),
            "total_cost": round(total_cost, 4),
            "total_requests": len(today_records),
            "by_model": dict(by_model),
        }
```
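
Wiring the dashboard to the `run_with_tracking` helper from Technique 1 closes the loop. The per-million-token rates below are placeholders; substitute current pricing:

```python
# Placeholder rates in USD per million tokens; replace with current pricing.
PRICE_PER_MTOK = {"gpt-4.1": {"input": 2.00, "output": 8.00}}

dashboard = CostDashboard()

agent = Agent(name="CostTracker", model="gpt-4.1", instructions="Be concise.")
output, reports = await run_with_tracking(agent, "Summarize our Q3 numbers.")

for r in reports:
    rates = PRICE_PER_MTOK.get(r.model, {"input": 0.0, "output": 0.0})
    cost = (r.input_tokens * rates["input"] + r.output_tokens * rates["output"]) / 1_000_000
    dashboard.record(r.model, r.input_tokens, r.output_tokens, cost)

print(dashboard.daily_summary())
```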

## Optimization Priority Order

When optimizing agent costs, apply techniques in this order for maximum impact:

1. **Model routing** — Moving 70% of traffic from GPT-4.1 to GPT-4.1-mini saves 80% on those requests
2. **Prompt caching** — Free with proper system prompt design; 40-50% input cost reduction
3. **Context truncation** — Prevents cost from growing linearly with conversation length
4. **Response length control** — Reduces output tokens by 30-50% with minimal quality impact
5. **Response caching** — Eliminates duplicate computation entirely
6. **Token tracking** — Provides visibility to identify the next optimization target

The key insight is that cost optimization is not a one-time exercise. Deploy tracking first, identify your highest-cost agents and workflows, and apply targeted optimizations. Most teams find that 80% of their costs come from 20% of their agent workflows — focus there first.

