
Agent Cost Optimization: Tokens, Caching, and Smart Routing

Reduce AI agent costs by 60-80% using token tracking, prompt caching with prompt_cache_retention, model routing, context truncation, and real-time cost dashboards with the OpenAI Agents SDK.

Why Agent Costs Spiral Out of Control

A single agent call costs fractions of a cent. A multi-agent workflow with tool calls and retries costs a few cents. Multiply by thousands of users and millions of daily requests, and you are looking at thousands of dollars per day. Agent costs scale non-linearly because each conversation turn adds to the context window, each tool call adds a generation round, and each handoff passes the full conversation history to the next agent.

This post covers practical techniques to reduce agent costs by 60-80% without sacrificing quality.

Technique 1: Token Tracking and Visibility

You cannot optimize what you cannot measure. Start by tracking token usage per agent, per tool call, and per workflow:

from agents import Agent, Runner
from dataclasses import dataclass


@dataclass
class TokenReport:
    agent_name: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.input_tokens


async def run_with_tracking(agent: Agent, input_text: str) -> tuple[str, list[TokenReport]]:
    """Run an agent and return detailed token reports."""
    result = await Runner.run(agent, input=input_text)

    reports = []
    for response in result.raw_responses:
        if response.usage:
            # raw_responses carry usage but not an agent name or model,
            # so attribute them to the agent we started with
            details = getattr(response.usage, "input_tokens_details", None)
            reports.append(TokenReport(
                agent_name=agent.name,
                model=str(agent.model or "unknown"),
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                cached_tokens=getattr(details, "cached_tokens", 0) if details else 0,
            ))

    return result.final_output, reports


# Usage
import asyncio


async def main() -> None:
    agent = Agent(name="CostTracker", model="gpt-4.1", instructions="Be concise.")
    output, reports = await run_with_tracking(agent, "Explain quantum computing.")
    for r in reports:
        print(f"{r.agent_name} ({r.model}): {r.input_tokens}in + {r.output_tokens}out = {r.total_tokens} total")
        print(f"  Cache hit rate: {r.cache_hit_rate:.1%}")


asyncio.run(main())
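With per-response reports in hand, rolling them up per agent shows where a multi-agent workflow actually spends its tokens. A minimal aggregation sketch over plain dicts (field names mirror the TokenReport above):

```python
from collections import defaultdict


def aggregate_by_agent(reports: list[dict]) -> dict[str, dict]:
    """Sum input/output tokens and call counts per agent across a workflow run."""
    totals: dict[str, dict] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "calls": 0}
    )
    for r in reports:
        t = totals[r["agent_name"]]
        t["input_tokens"] += r["input_tokens"]
        t["output_tokens"] += r["output_tokens"]
        t["calls"] += 1
    return dict(totals)


reports = [
    {"agent_name": "Triage", "input_tokens": 1200, "output_tokens": 80},
    {"agent_name": "Research", "input_tokens": 4000, "output_tokens": 900},
    {"agent_name": "Research", "input_tokens": 4500, "output_tokens": 1100},
]
print(aggregate_by_agent(reports))
```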

Technique 2: Prompt Caching

OpenAI automatically caches prompt prefixes that remain stable across requests. For agents with long system instructions, this can reduce input token costs by 50% or more. Use prompt_cache_retention to control how long cached prompts persist:

from agents import Agent, ModelSettings

# Long, detailed system instructions get cached automatically
detailed_agent = Agent(
    name="DetailedAgent",
    model="gpt-4.1",
    instructions="""You are an expert financial analyst assistant.

    ## Response Format
    Always structure your responses as follows:
    1. Executive Summary (2-3 sentences)
    2. Key Findings (bullet points)
    3. Detailed Analysis (paragraphs)
    4. Recommendations (numbered list)
    5. Risk Factors (bullet points)

    ## Data Handling Rules
    - Always cite specific numbers and dates
    - Convert all currencies to USD unless asked otherwise
    - Use trailing twelve months (TTM) for financial ratios
    - Flag any data older than 6 months as potentially stale

    ## Analysis Framework
    - Compare against industry benchmarks
    - Identify trends over 3+ periods
    - Note any anomalies or red flags
    - Consider macroeconomic context
    """,
    model_settings=ModelSettings(
        # Retain the cached prompt for up to 24 hours instead of the
        # default in-memory lifetime
        extra_body={"prompt_cache_retention": "24h"},
    ),
)

The first request pays full price for the system instructions. Subsequent requests within the retention window pay a reduced rate for cached input tokens. For agents with 2000+ token system prompts that handle dozens of requests per hour, this alone cuts input costs by 40-50%.
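The arithmetic behind that claim, assuming $2.00 per 1M input tokens and a 75% discount on cached tokens (gpt-4.1's published cached-input rate at the time of writing; verify against current pricing):

```python
# Illustrative savings from prompt caching. Assumes $2.00 per 1M input tokens
# and cached tokens billed at 25% of the normal rate.
PRICE_IN = 2.00          # $ per 1M input tokens
CACHED_DISCOUNT = 0.75   # discount applied to cached prefix tokens


def input_cost(system_tokens: int, user_tokens: int, requests: int, cached: bool) -> float:
    per_request_user = user_tokens * PRICE_IN / 1e6
    if cached:
        # First request pays full price for the prefix; the rest hit the cache
        prefix = system_tokens * PRICE_IN / 1e6
        prefix += (requests - 1) * system_tokens * PRICE_IN * (1 - CACHED_DISCOUNT) / 1e6
    else:
        prefix = requests * system_tokens * PRICE_IN / 1e6
    return prefix + requests * per_request_user


uncached = input_cost(2000, 300, 100, cached=False)
cached = input_cost(2000, 300, 100, cached=True)
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}")
```

With a 2,000-token prompt and short user turns, most of each request is cacheable prefix, so the realized savings depend heavily on the prompt-to-input ratio.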

Technique 3: Context Truncation

As conversations grow, the context window fills with old messages that may not be relevant. Use automatic truncation to manage costs:

from agents import Agent, ModelSettings

# Automatically truncate long conversations
agent = Agent(
    name="TruncatingAgent",
    model="gpt-4.1",
    instructions="Help users with their questions. Focus on the most recent context.",
    model_settings=ModelSettings(
        truncation="auto",  # SDK manages context window automatically
    ),
)

The truncation="auto" setting lets the SDK drop older messages automatically when the context window approaches its limit. This keeps the conversation from growing without bound and makes costs predictable.


For more control, implement manual context management:

import tiktoken


def trim_conversation(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the system message and most recent messages within budget."""
    # gpt-4.1 uses the o200k_base encoding; fall back explicitly in case the
    # installed tiktoken version does not recognize the model name
    try:
        encoding = tiktoken.encoding_for_model("gpt-4.1")
    except KeyError:
        encoding = tiktoken.get_encoding("o200k_base")

    # Always keep the system message
    system_messages = [m for m in messages if m["role"] == "system"]
    user_messages = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(len(encoding.encode(m["content"])) for m in system_messages)
    budget = max_tokens - system_tokens

    # Add messages from most recent, working backwards
    trimmed = []
    running_tokens = 0
    for msg in reversed(user_messages):
        msg_tokens = len(encoding.encode(msg["content"]))
        if running_tokens + msg_tokens > budget:
            break
        trimmed.insert(0, msg)
        running_tokens += msg_tokens

    return system_messages + trimmed

Technique 4: Smart Model Routing

Route requests to the cheapest model that can handle the task:

from agents import Agent, Runner

# Model tier definitions
MODELS = {
    "simple": "gpt-4.1-nano",   # Cheapest: classification, extraction
    "standard": "gpt-4.1-mini",  # Mid-tier: most conversational tasks
    "complex": "gpt-4.1",        # Premium: tool-heavy, coding
    "reasoning": "gpt-5",        # Expensive: complex analysis
}


async def classify_complexity(user_input: str) -> str:
    """Use the cheapest model to classify request complexity."""
    classifier = Agent(
        name="Classifier",
        model=MODELS["simple"],
        instructions=(
            "Classify the complexity of this request. "
            "Reply with exactly one word: simple, standard, complex, or reasoning."
        ),
    )
    result = await Runner.run(classifier, input=user_input)
    complexity = result.final_output.strip().lower()
    if complexity not in MODELS:
        complexity = "standard"
    return complexity


async def cost_optimized_run(user_input: str) -> dict:
    """Route to the cheapest appropriate model."""
    complexity = await classify_complexity(user_input)
    model = MODELS[complexity]

    agent = Agent(
        name="OptimizedAgent",
        model=model,
        instructions="Provide helpful, accurate responses.",
    )

    result = await Runner.run(agent, input=user_input)

    return {
        "response": result.final_output,
        "model": model,
        "complexity": complexity,
    }

The classifier itself runs on the cheapest model. The total cost of classifier + routed model is still lower than running everything on GPT-4.1.
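A quick blended-cost sketch shows why. The traffic split below is hypothetical, and charging the classifier a full nano rate per million tokens is deliberately pessimistic (its classification calls are tiny):

```python
# Illustrative blended input-cost comparison for model routing.
# Prices are $ per 1M input tokens; traffic shares are hypothetical.
PRICES = {"gpt-4.1": 2.00, "gpt-4.1-mini": 0.40, "gpt-4.1-nano": 0.10}
TRAFFIC = {"gpt-4.1": 0.30, "gpt-4.1-mini": 0.70}  # 70% routed to mini
CLASSIFIER_OVERHEAD = PRICES["gpt-4.1-nano"]       # pessimistic: full nano rate


def blended_price(traffic: dict[str, float]) -> float:
    return sum(share * PRICES[model] for model, share in traffic.items())


baseline = PRICES["gpt-4.1"]
routed = blended_price(TRAFFIC) + CLASSIFIER_OVERHEAD
print(f"baseline ${baseline:.2f}/1M vs routed ${routed:.2f}/1M "
      f"({(1 - routed / baseline):.0%} cheaper)")
```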

Technique 5: Response Length Control

Controlling output length is one of the simplest cost reductions:

from agents import Agent, ModelSettings

# Enforce concise outputs
concise_agent = Agent(
    name="ConciseAgent",
    model="gpt-4.1",
    instructions=(
        "Answer questions accurately and concisely. "
        "Use bullet points. Never exceed 200 words."
    ),
    model_settings=ModelSettings(
        max_tokens=300,  # Hard limit on output tokens
    ),
)

Combining instruction-level guidance ("be concise") with a hard max_tokens limit gives you both quality and cost control.

Technique 6: Caching Agent Responses

For idempotent queries, cache the agent's response to avoid paying for the same computation twice:

import hashlib
import time
from typing import Any

from agents import Agent, Runner

# Simple in-memory cache (use Redis in production)
_response_cache: dict[str, Any] = {}


def cache_key(agent_name: str, model: str, input_text: str) -> str:
    """Generate a deterministic cache key."""
    raw = f"{agent_name}:{model}:{input_text}"
    return hashlib.sha256(raw.encode()).hexdigest()


async def cached_run(agent: Agent, input_text: str, ttl: int = 3600) -> str:
    """Run an agent with response caching."""
    key = cache_key(agent.name, str(agent.model or ""), input_text)

    if key in _response_cache:
        entry = _response_cache[key]
        if time.time() - entry["timestamp"] < ttl:
            return entry["response"]

    result = await Runner.run(agent, input=input_text)

    _response_cache[key] = {
        "response": result.final_output,
        "timestamp": time.time(),
    }

    return result.final_output

This is especially effective for FAQ-style agents, knowledge base lookups, and any agent that answers the same questions repeatedly.
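Exact-string keys miss near-duplicate queries. Normalizing casing and whitespace before hashing raises the hit rate with little quality risk (a sketch; more aggressive normalization depends on your domain):

```python
import hashlib
import re


def normalized_cache_key(agent_name: str, model: str, input_text: str) -> str:
    """Collapse whitespace and casing so trivially different phrasings share a key."""
    normalized = re.sub(r"\s+", " ", input_text.strip().lower())
    raw = f"{agent_name}:{model}:{normalized}"
    return hashlib.sha256(raw.encode()).hexdigest()


k1 = normalized_cache_key("FAQ", "gpt-4.1-mini", "What are your hours?")
k2 = normalized_cache_key("FAQ", "gpt-4.1-mini", "  what are your   hours?  ")
print(k1 == k2)  # both variants map to the same cache entry
```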

Building a Cost Dashboard

Combine all these techniques with a dashboard to monitor costs in real time:

from dataclasses import dataclass, field
from datetime import datetime
from collections import defaultdict

# Prices in $ per 1M tokens (verify against current OpenAI pricing)
MODEL_PRICING = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}


@dataclass
class CostDashboard:
    records: list[dict] = field(default_factory=list)

    def record(self, agent: str, model: str, input_tokens: int, output_tokens: int):
        pricing = MODEL_PRICING.get(model, MODEL_PRICING["gpt-4.1"])
        cost = (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )
        self.records.append({
            "agent": agent,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
            "timestamp": datetime.utcnow(),
        })

    def daily_summary(self) -> dict:
        today = datetime.utcnow().date()
        today_records = [r for r in self.records if r["timestamp"].date() == today]

        by_model = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
        for r in today_records:
            m = by_model[r["model"]]
            m["requests"] += 1
            m["tokens"] += r["input_tokens"] + r["output_tokens"]
            m["cost"] += r["cost"]

        total_cost = sum(m["cost"] for m in by_model.values())

        return {
            "date": str(today),
            "total_cost": round(total_cost, 4),
            "total_requests": len(today_records),
            "by_model": dict(by_model),
        }
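As a sanity check on the record() arithmetic: a 3,000-token-in, 500-token-out request on gpt-4.1 at the table's rates works out to one cent.

```python
# One request on gpt-4.1: $2.00 per 1M input tokens, $8.00 per 1M output tokens
input_tokens, output_tokens = 3_000, 500
cost = input_tokens / 1_000_000 * 2.00 + output_tokens / 1_000_000 * 8.00
print(f"${cost:.4f} per request")  # $0.0060 input + $0.0040 output
```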

Optimization Priority Order

When optimizing agent costs, apply techniques in this order for maximum impact:

  1. Model routing — Moving 70% of traffic from GPT-4.1 to GPT-4.1-mini saves 80% on those requests
  2. Prompt caching — Free with proper system prompt design; 40-50% input cost reduction
  3. Context truncation — Prevents cost from growing linearly with conversation length
  4. Response length control — Reduces output tokens by 30-50% with minimal quality impact
  5. Response caching — Eliminates duplicate computation entirely
  6. Token tracking — Provides visibility to identify the next optimization target

The key insight is that cost optimization is not a one-time exercise. Deploy tracking first, identify your highest-cost agents and workflows, and apply targeted optimizations. Most teams find that 80% of their costs come from 20% of their agent workflows — focus there first.
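Finding that expensive 20% is straightforward once cost records exist. A sketch over record dicts shaped like the dashboard's (agent name plus cost):

```python
from collections import defaultdict


def top_cost_agents(records: list[dict], share: float = 0.8) -> list[str]:
    """Return the smallest set of agents accounting for `share` of total cost."""
    by_agent: dict[str, float] = defaultdict(float)
    for r in records:
        by_agent[r["agent"]] += r["cost"]
    total = sum(by_agent.values())
    ranked = sorted(by_agent.items(), key=lambda kv: kv[1], reverse=True)
    picked, running = [], 0.0
    for agent, cost in ranked:
        picked.append(agent)
        running += cost
        if running >= share * total:
            break
    return picked


records = [
    {"agent": "Research", "cost": 7.0},
    {"agent": "Triage", "cost": 1.0},
    {"agent": "FAQ", "cost": 2.0},
]
print(top_cost_agents(records))
```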

Written by

CallSphere Team
