Agent Cost Optimization: Tokens, Caching, and Smart Routing
Reduce AI agent costs by 60-80% using token tracking, prompt caching with prompt_cache_retention, model routing, context truncation, and real-time cost dashboards with the OpenAI Agents SDK.
Why Agent Costs Spiral Out of Control
A single agent call costs fractions of a cent. A multi-agent workflow with tool calls and retries costs a few cents. Multiply by thousands of users and millions of daily requests, and you are looking at thousands of dollars per day. Agent costs scale non-linearly because each conversation turn adds to the context window, each tool call adds a generation round, and each handoff passes the full conversation history to the next agent.
This post covers practical techniques to reduce agent costs by 60-80% without sacrificing quality.
Technique 1: Token Tracking and Visibility
You cannot optimize what you cannot measure. Start by tracking token usage per agent, per tool call, and per workflow:
```python
import asyncio
from dataclasses import dataclass

from agents import Agent, Runner

@dataclass
class TokenReport:
    agent_name: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.input_tokens

async def run_with_tracking(agent: Agent, input_text: str) -> tuple[str, list[TokenReport]]:
    """Run an agent and return detailed token reports."""
    result = await Runner.run(agent, input=input_text)
    reports = []
    for response in result.raw_responses:
        if response.usage:
            # Cached-token counts live in usage.input_tokens_details when present
            details = getattr(response.usage, "input_tokens_details", None)
            reports.append(TokenReport(
                agent_name=agent.name,
                model=str(agent.model or "unknown"),
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                cached_tokens=getattr(details, "cached_tokens", 0) or 0,
            ))
    return result.final_output, reports

# Usage
async def main():
    agent = Agent(name="CostTracker", model="gpt-4.1", instructions="Be concise.")
    output, reports = await run_with_tracking(agent, "Explain quantum computing.")
    for r in reports:
        print(f"{r.agent_name} ({r.model}): {r.input_tokens}in + {r.output_tokens}out = {r.total_tokens} total")
        print(f"  Cache hit rate: {r.cache_hit_rate:.1%}")

asyncio.run(main())
```
Technique 2: Prompt Caching
OpenAI automatically caches prompt prefixes that remain stable across requests. For agents with long system instructions, this can reduce input token costs by 50% or more. Use prompt_cache_retention to control how long cached prompts persist:
```python
from agents import Agent, ModelSettings

# Long, detailed system instructions get cached automatically
detailed_agent = Agent(
    name="DetailedAgent",
    model="gpt-4.1",
    instructions="""You are an expert financial analyst assistant.

## Response Format
Always structure your responses as follows:
1. Executive Summary (2-3 sentences)
2. Key Findings (bullet points)
3. Detailed Analysis (paragraphs)
4. Recommendations (numbered list)
5. Risk Factors (bullet points)

## Data Handling Rules
- Always cite specific numbers and dates
- Convert all currencies to USD unless asked otherwise
- Use trailing twelve months (TTM) for financial ratios
- Flag any data older than 6 months as potentially stale

## Analysis Framework
- Compare against industry benchmarks
- Identify trends over 3+ periods
- Note any anomalies or red flags
- Consider macroeconomic context
""",
    model_settings=ModelSettings(
        # Extend prompt cache retention to 24 hours
        extra_body={"prompt_cache_retention": "24h"},
    ),
)
```
The first request pays full price for the system instructions. Subsequent requests within the retention window pay a reduced rate for cached input tokens. For agents with 2000+ token system prompts that handle dozens of requests per hour, this alone cuts input costs by 40-50%.
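To put rough numbers on this, here is a sketch of the savings, assuming cached input tokens bill at a quarter of the normal input rate (this matches gpt-4.1's listed cached-input price at the time of writing; verify for your model) and a 2,000-token system prompt:

```python
def input_cost_usd(total_input: int, cached: int,
                   rate_per_m: float = 2.00, cached_discount: float = 0.25) -> float:
    """Input cost in USD with prompt caching.

    Assumes cached tokens bill at `cached_discount` times the normal rate;
    `rate_per_m` is the gpt-4.1 list price per 1M input tokens.
    """
    uncached = total_input - cached
    return (uncached * rate_per_m + cached * rate_per_m * cached_discount) / 1_000_000

# 2,000-token system prompt + 500-token user message
cold = input_cost_usd(2_500, cached=0)      # first request: nothing cached
warm = input_cost_usd(2_500, cached=2_000)  # later requests: prompt prefix cached
savings = 1 - warm / cold                   # fraction saved on input cost
```

Under these assumptions the warm request costs 60% less on input than the cold one, which is why long stable system prompts benefit so much.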
Technique 3: Context Truncation
As conversations grow, the context window fills with old messages that may not be relevant. Use automatic truncation to manage costs:
```python
from agents import Agent, ModelSettings

# Automatically truncate long conversations
agent = Agent(
    name="TruncatingAgent",
    model="gpt-4.1",
    instructions="Help users with their questions. Focus on the most recent context.",
    model_settings=ModelSettings(
        truncation="auto",  # SDK manages context window automatically
    ),
)
```
The `truncation="auto"` setting lets the SDK drop older messages automatically when the context window approaches its limit. This prevents the conversation from growing without bound and keeps costs predictable.
For more control, implement manual context management:
```python
import tiktoken

def trim_conversation(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the system message and most recent messages within budget."""
    try:
        encoding = tiktoken.encoding_for_model("gpt-4.1")
    except KeyError:
        # Older tiktoken releases may not recognize newer model names
        encoding = tiktoken.get_encoding("o200k_base")
    # Always keep the system message
    system_messages = [m for m in messages if m["role"] == "system"]
    user_messages = [m for m in messages if m["role"] != "system"]
    system_tokens = sum(len(encoding.encode(m["content"])) for m in system_messages)
    budget = max_tokens - system_tokens
    # Add messages from most recent, working backwards
    trimmed = []
    running_tokens = 0
    for msg in reversed(user_messages):
        msg_tokens = len(encoding.encode(msg["content"]))
        if running_tokens + msg_tokens > budget:
            break
        trimmed.insert(0, msg)
        running_tokens += msg_tokens
    return system_messages + trimmed
```
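If you want the same keep-system, drop-oldest policy without a tokenizer dependency, a rough characters-per-token estimate is enough for budgeting. The ~4 chars/token ratio is an English-text heuristic, and `rough_tokens` and `trim_by_estimate` are illustrative names, not SDK functions:

```python
def rough_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def trim_by_estimate(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the system message, then fill the budget from the newest messages back."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(rough_tokens(m["content"]) for m in system)
    kept, used = [], 0
    for msg in reversed(rest):
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.insert(0, msg)
        used += cost
    return system + kept

history = [
    {"role": "system", "content": "You are a helpful assistant."},  # always kept
    {"role": "user", "content": "old question " * 40},              # oldest, dropped first
    {"role": "assistant", "content": "old answer " * 40},
    {"role": "user", "content": "What is my current balance?"},     # newest, kept
]
trimmed = trim_by_estimate(history, max_tokens=60)
# With this tight budget, only the system message and the newest user turn survive
```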
Technique 4: Smart Model Routing
Route requests to the cheapest model that can handle the task:
```python
from agents import Agent, Runner

# Model tier definitions
MODELS = {
    "simple": "gpt-4.1-nano",    # Cheapest: classification, extraction
    "standard": "gpt-4.1-mini",  # Mid-tier: most conversational tasks
    "complex": "gpt-4.1",        # Premium: tool-heavy, coding
    "reasoning": "gpt-5",        # Expensive: complex analysis
}

async def classify_complexity(user_input: str) -> str:
    """Use the cheapest model to classify request complexity."""
    classifier = Agent(
        name="Classifier",
        model=MODELS["simple"],
        instructions=(
            "Classify the complexity of this request. "
            "Reply with exactly one word: simple, standard, complex, or reasoning."
        ),
    )
    result = await Runner.run(classifier, input=user_input)
    complexity = result.final_output.strip().lower()
    if complexity not in MODELS:
        complexity = "standard"
    return complexity

async def cost_optimized_run(user_input: str) -> dict:
    """Route to the cheapest appropriate model."""
    complexity = await classify_complexity(user_input)
    model = MODELS[complexity]
    agent = Agent(
        name="OptimizedAgent",
        model=model,
        instructions="Provide helpful, accurate responses.",
    )
    result = await Runner.run(agent, input=user_input)
    return {
        "response": result.final_output,
        "model": model,
        "complexity": complexity,
    }
```
The classifier itself runs on the cheapest model. The total cost of classifier + routed model is still lower than running everything on GPT-4.1.
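A back-of-the-envelope check makes this concrete. The prices below are the gpt-4.1-family list prices in USD per 1M tokens (verify current rates), and the 500-in/400-out request and ~50-token classifier round trip are assumed figures:

```python
# USD per 1M tokens (gpt-4.1 family list prices; verify current rates)
PRICE = {
    "gpt-4.1": {"in": 2.00, "out": 8.00},
    "gpt-4.1-mini": {"in": 0.40, "out": 1.60},
    "gpt-4.1-nano": {"in": 0.10, "out": 0.40},
}

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Per-request cost in USD for a given model and token counts."""
    p = PRICE[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# A typical request: 500 input tokens, 400 output tokens, sent straight to gpt-4.1
direct = cost("gpt-4.1", 500, 400)
# Routed: a short nano classification pass, then the same request on gpt-4.1-mini
routed = cost("gpt-4.1-nano", 550, 5) + cost("gpt-4.1-mini", 500, 400)
```

Under these assumptions the routed path costs roughly a fifth of the direct path, with the classifier itself contributing only a few hundredths of a cent.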
Technique 5: Response Length Control
Controlling output length is one of the simplest cost reductions:
```python
from agents import Agent, ModelSettings

# Enforce concise outputs
concise_agent = Agent(
    name="ConciseAgent",
    model="gpt-4.1",
    instructions=(
        "Answer questions accurately and concisely. "
        "Use bullet points. Never exceed 200 words."
    ),
    model_settings=ModelSettings(
        max_tokens=300,  # Hard limit on output tokens
    ),
)
```
Combining instruction-level guidance ("be concise") with a hard max_tokens limit gives you both quality and cost control.
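A quick sketch of the output-side savings, assuming gpt-4.1's listed $8.00 per 1M output tokens (verify current rates) and a typical 500-token unconstrained answer:

```python
OUTPUT_RATE_PER_M = 8.00  # gpt-4.1 output price, USD per 1M tokens (verify current rates)

def output_cost(tokens: int, rate_per_m: float = OUTPUT_RATE_PER_M) -> float:
    """Output-token cost in USD."""
    return tokens * rate_per_m / 1_000_000

uncapped = output_cost(500)  # a typical unconstrained answer
capped = output_cost(300)    # the same answer under max_tokens=300
reduction = 1 - capped / uncapped
```

Under these assumptions the cap cuts output cost by 40% per response, squarely in the 30-50% range you can expect from length control.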
Technique 6: Caching Agent Responses
For idempotent queries, cache the agent's response to avoid paying for the same computation twice:
```python
import hashlib
import time
from typing import Any

from agents import Agent, Runner

# Simple in-memory cache (use Redis in production)
_response_cache: dict[str, Any] = {}

def cache_key(agent_name: str, model: str, input_text: str) -> str:
    """Generate a deterministic cache key."""
    raw = f"{agent_name}:{model}:{input_text}"
    return hashlib.sha256(raw.encode()).hexdigest()

async def cached_run(agent: Agent, input_text: str, ttl: int = 3600) -> str:
    """Run an agent with response caching."""
    key = cache_key(agent.name, str(agent.model or ""), input_text)
    if key in _response_cache:
        entry = _response_cache[key]
        if time.time() - entry["timestamp"] < ttl:
            return entry["response"]
    result = await Runner.run(agent, input=input_text)
    _response_cache[key] = {
        "response": result.final_output,
        "timestamp": time.time(),
    }
    return result.final_output
```
This is especially effective for FAQ-style agents, knowledge base lookups, and any agent that answers the same questions repeatedly.
Building a Cost Dashboard
Combine all these techniques with a dashboard to monitor costs in real time:
```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

# USD per 1M tokens; verify against current OpenAI pricing before relying on these
MODEL_PRICING = {
    "gpt-5": {"input": 10.00, "output": 30.00},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

@dataclass
class CostDashboard:
    records: list[dict] = field(default_factory=list)

    def record(self, agent: str, model: str, input_tokens: int, output_tokens: int):
        pricing = MODEL_PRICING.get(model, MODEL_PRICING["gpt-4.1"])
        cost = (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )
        self.records.append({
            "agent": agent,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
            "timestamp": datetime.now(timezone.utc),
        })

    def daily_summary(self) -> dict:
        today = datetime.now(timezone.utc).date()
        today_records = [r for r in self.records if r["timestamp"].date() == today]
        by_model = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
        for r in today_records:
            m = by_model[r["model"]]
            m["requests"] += 1
            m["tokens"] += r["input_tokens"] + r["output_tokens"]
            m["cost"] += r["cost"]
        total_cost = sum(m["cost"] for m in by_model.values())
        return {
            "date": str(today),
            "total_cost": round(total_cost, 4),
            "total_requests": len(today_records),
            "by_model": dict(by_model),
        }
```
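The summary dictionary plugs straight into budget alerting. A minimal sketch, where `check_budget`, the $50 daily budget, and the 80% warning threshold are all hypothetical choices:

```python
def check_budget(summary: dict, daily_budget_usd: float = 50.0) -> list[str]:
    """Return alert messages based on a daily-summary dict's total_cost."""
    alerts = []
    spend = summary["total_cost"]
    if spend >= daily_budget_usd:
        alerts.append(f"Daily budget exceeded: ${spend:.2f} of ${daily_budget_usd:.2f}")
    elif spend >= 0.8 * daily_budget_usd:
        alerts.append(f"80% of daily budget used: ${spend:.2f}")
    return alerts

# Example: $42.50 spent against a $50 budget triggers the warning threshold
alerts = check_budget({"total_cost": 42.50}, daily_budget_usd=50.0)
```

Wiring this into a scheduled job that calls `daily_summary()` gives you proactive cost alerts rather than end-of-month surprises.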
Optimization Priority Order
When optimizing agent costs, apply techniques in this order for maximum impact:
- Model routing — Moving 70% of traffic from GPT-4.1 to GPT-4.1-mini saves 80% on those requests
- Prompt caching — Free with proper system prompt design; 40-50% input cost reduction
- Context truncation — Prevents cost from growing linearly with conversation length
- Response length control — Reduces output tokens by 30-50% with minimal quality impact
- Response caching — Eliminates duplicate computation entirely
- Token tracking — Provides visibility to identify the next optimization target
The key insight is that cost optimization is not a one-time exercise. Deploy tracking first, identify your highest-cost agents and workflows, and apply targeted optimizations. Most teams find that 80% of their costs come from 20% of their agent workflows — focus there first.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.