Skip to content
Agentic AI
Agentic AI7 min read0 views

Cut Claude Agent Token Costs: Caching, Batching, Speed

Keep Claude agents cheap and fast: prompt caching, the Batch API, model routing across Opus/Sonnet/Haiku, and context discipline that cut token spend.

The bill arrives at the end of the first real month, and it is not the number you modeled. A startup's agent that worked beautifully in testing turns out to cost a few dollars per task at scale, and you are running thousands of tasks a day. Suddenly unit economics matter more than capabilities. The good news is that most agent token spend is waste — the same long system prompt re-read on every turn, multi-agent fan-outs that didn't need to fan out, and context that grows unbounded across a session. This post is about wringing that waste out without making the agent dumber.

Cost and latency are the same problem viewed from two angles. Tokens cost money and tokens take time to produce, so almost every optimization that lowers your bill also tightens your p95 latency. For an interactive agent answering a user, that double payoff is exactly what you want. Let's go through the levers in rough order of impact.

Where the tokens actually go

Before optimizing, measure. Instrument every Claude call with input tokens, output tokens, cache-read tokens, and the model used, tagged by task type. Almost every team that does this discovers the same surprise: the dominant cost is input tokens, not output, because an agent re-sends its entire growing context — system prompt, tool definitions, and full conversation history — on every single turn. A ten-turn agent run can re-read the same 8,000-token system prompt ten times. That is where caching pays off enormously.

The second surprise is the long tail. A handful of pathological runs — ones that looped, or that pulled a giant tool result into context — account for a disproportionate share of spend. So track per-run cost distribution, not just the average. Capping those outliers (with tool-call limits and result truncation) often saves more than micro-optimizing the median run.

Prompt caching is the biggest lever

Claude supports prompt caching, which lets you mark a stable prefix of your request — the system prompt, tool definitions, few-shot examples, a large reference document — so that on subsequent calls Claude reads it from cache instead of reprocessing it. Cached input tokens are dramatically cheaper and faster than fresh ones. For an agent loop where only the last user turn changes between calls, this is transformative: you pay full price for the prefix once, then a fraction of that on every following turn within the cache window.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

The trick is structuring your request so the cacheable part is stable and comes first. Put your system prompt, tool schemas, and any fixed reference material at the top and mark the cache boundary after them; let only the volatile conversation tail follow. The diagram below shows how a single request splits across cache reads and fresh compute.

flowchart TD
  A["New turn"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Read system + tools from cache (cheap)"]
  B -->|No| D["Process full prefix once, write cache"]
  C --> E["Process only new user turn"]
  D --> E
  E --> F["Claude responds"]
  F --> G{"Big tool result returned?"}
  G -->|Yes| H["Summarize before appending to context"]
  G -->|No| I["Append result, continue loop"]
  H --> I

Two rules make caching reliable. First, keep the cached prefix byte-stable — even a timestamp injected into the system prompt invalidates the cache every call, so move dynamic values into the user message. Second, order tools and instructions so the parts that change least are highest in the request. Teams that get this right routinely see input-token costs on multi-turn agents fall by a large fraction with zero quality change.

Batch what doesn't need to be live

Not every agent task is interactive. Overnight enrichment, bulk classification, generating summaries for a backlog — these don't need sub-second responses. Claude's Batch API processes large volumes of requests asynchronously at a significant discount versus real-time calls. For a startup, the pattern is to split your workload: keep the live, user-facing path on standard low-latency calls, and route every fire-and-forget job through batching. The cost difference on the batchable half of your traffic is often the single biggest line-item saving you can make.

Batching also smooths your rate-limit pressure. Instead of hammering the API during business hours and risking throttling, you queue background work and let it drain on the provider's schedule. That makes your live capacity headroom larger when you actually need it.

Right-size the model and the context

Model routing is the next lever. The Claude family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 for balanced everyday work, and Haiku 4.5 for fast, cheap, high-volume calls. A common mistake is running every step of an agent on the most capable model. Instead, route by difficulty: use Haiku for extraction, routing, and short classification steps, escalate to Sonnet for the main reasoning, and reserve Opus for genuinely hard planning. A well-tiered agent can cut cost several-fold while keeping end-quality nearly identical, because most steps in a real workflow are easy.

Context discipline is the quiet multiplier. Every token you let accumulate in the conversation gets re-sent on every later turn and re-priced. So summarize large tool results before appending them, drop stale turns once they're no longer relevant, and don't dump entire files into context when a targeted excerpt will do. Multi-agent designs deserve special scrutiny here: an orchestrator with several subagents can use several times the tokens of a single agent, so only fan out when the parallelism genuinely buys you better results, not by default.

Make cost a first-class metric

The teams that stay cheap treat cost-per-task like latency or error rate — a dashboarded metric with a budget, watched per release. When a prompt change quietly doubles average tokens, you want to see it the next day, not in the invoice. Tag spend by task type, alert on per-run outliers, and review your model-routing mix periodically as the workload shifts.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Keeping a Claude agent cheap and fast comes down to not paying twice for the same tokens: cache the stable prefix, batch everything that isn't interactive, route each step to the smallest model that can do it, and keep context tight so it doesn't compound. Done together, these routinely cut agent spend by a large fraction while making the agent feel faster to users.

Frequently asked questions

What is prompt caching and how much does it save?

Prompt caching marks a stable prefix of your request — system prompt, tool definitions, reference material — so Claude reads it from cache on later calls instead of reprocessing it. Cached input tokens are far cheaper and faster than fresh ones, which is dramatic for multi-turn agents where only the last turn changes between calls.

When should I use the Batch API for an agent?

Use batching for any non-interactive workload — overnight enrichment, bulk classification, backlog summarization — where you don't need a sub-second response. It runs asynchronously at a significant discount and reduces rate-limit pressure on your live path. Keep user-facing turns on standard low-latency calls.

How do I choose between Opus, Sonnet, and Haiku?

Route by step difficulty rather than running everything on the biggest model. Use Haiku 4.5 for extraction, routing, and short classification; Sonnet 4.6 for the main reasoning; and reserve Opus 4.8 for genuinely hard planning. Most steps in a real workflow are easy, so tiering cuts cost several-fold with little quality loss.

Why is my agent's input token cost so high?

Because an agent re-sends its entire growing context on every turn, so the same system prompt and history get re-priced repeatedly. Fix it with prompt caching for the stable prefix, summarizing large tool results before appending, and dropping stale turns so context doesn't compound across the run.

Bringing fast, affordable agents to live calls

CallSphere runs these cost and latency disciplines — caching, model routing, and tight context — inside voice and chat agents where every extra second and token is felt by a caller on the line. See how lean agentic runs handle real conversations at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.