Skip to content
Agentic AI
Agentic AI8 min read0 views

Cut Claude agent token costs: caching, batching, cheap runs

Lower Claude agent cost and latency with prompt caching, batching, model routing, and context discipline. A practical performance guide for agentic runs.

An agent that works but costs a dollar a run is a prototype, not a product. The moment a Claude Cowork plugin or an Agent SDK service moves from your laptop to real traffic, token cost and latency stop being abstractions and become the line items that decide whether the thing ships. The good news is that agentic cost is highly compressible - most expensive runs are expensive for predictable, fixable reasons. This article is about finding those reasons and squeezing them out without degrading quality.

We will work through where the tokens actually go, then through the levers that matter most: prompt caching, request batching, model routing across Opus, Sonnet, and Haiku, and the context discipline that prevents your message history from ballooning into the biggest bill of all.

Where the tokens really go

Before optimizing anything, measure. In a typical multi-step Claude agent run, the input tokens dwarf the output tokens, and they dwarf them more with every step. The reason is simple and easy to miss: at each step the model re-reads the entire conversation so far - the system prompt, every tool definition, every prior tool result, all of it - and that history grows monotonically. A ten-step run does not cost ten times a single call; it can cost far more, because step ten re-ingests everything from steps one through nine.

This is why agentic cost is dominated by repeated input, not by generated output. A multi-agent run compounds the effect: each subagent carries its own context, and orchestrator-subagent systems routinely use several times more tokens than a single agent solving the same task. That multiplier is fine when the parallelism buys real value and ruinous when it does not, so the first cost question is always whether you needed multiple agents at all.

Prompt caching: stop paying for the same prefix

The single biggest lever is prompt caching. Because the start of your prompt - system instructions, tool definitions, skill content, long reference documents - is identical across every step and often across every run, you should not pay full price to process it each time. Prompt caching stores the model's processing of a stable prefix so that subsequent requests reusing that prefix are read from cache at a steep discount instead of being recomputed.

The diagram below shows how a cached prefix flows through a multi-step run.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Step 1 request"] --> B["Stable prefix: system + tools + docs"]
  B --> C{"Prefix in cache?"}
  C -->|No| D["Process full prefix, write to cache"]
  C -->|Yes| E["Read prefix from cache at discount"]
  D --> F["Append step-specific tokens"]
  E --> F
  F --> G["Model responds, run continues to next step"]

To make caching pay off, structure your prompt so the stable parts come first and the volatile parts come last. Put system instructions, tool schemas, and any large fixed reference material at the very front, then place the dynamic conversation and tool results after the cache boundary. If you interleave changing content into the prefix, you invalidate the cache on every step and pay full price anyway. The ordering discipline is the whole game: stable-then-volatile, never the reverse.

Caching has a freshness window, so it favors workloads with steady traffic. A high-throughput agent that runs constantly keeps its prefix warm; an occasional batch job may see the cache expire between runs. Design for it: for bursty workloads, group related work close together in time so the cache stays warm across the burst.

Batching and parallelism done right

Not every token needs to be processed at interactive speed. If you have a backlog of work that does not need a real-time answer - classifying a queue of tickets, enriching a list of records, generating summaries for a nightly report - batch processing trades latency for a meaningful cost reduction. Send the work as a batch and accept results within a longer window instead of paying the premium for instant responses. For anything offline or scheduled, this is free money.

Parallelism is the other side of the coin, and it cuts wall-clock time rather than cost. When an agent has several independent things to do - read three files, query two APIs, check four records - it should issue those tool calls together rather than one at a time, so the work overlaps. Claude can request multiple tool calls in a single step; design your tools and prompts so independent operations fan out instead of serializing. Just keep the distinction clear in your head: batching saves money on non-urgent work, parallel tool calls save time on urgent work, and neither is a substitute for the other.

Route the right model to the right step

Using the most capable model for every step is the most common way to overspend. A run might need deep reasoning to plan, but the individual steps - extracting a field, classifying a result, formatting an answer - are well within the reach of a smaller, faster, cheaper model. The pattern is model routing: reserve Opus-class reasoning for the hard planning and synthesis, and push the routine steps down to Sonnet or Haiku.

Model routing is the practice of selecting the cheapest model that can reliably complete a given step, rather than using one model for the entire run. In an orchestrator-subagent design this maps cleanly: the orchestrator that decomposes the problem and integrates results may warrant the strongest model, while subagents doing narrow, well-specified subtasks can run on a lighter one. The savings compound because the cheap steps are usually the frequent ones.

Validate routing with evals, not vibes. Drop a smaller model into a step, run your evaluation set, and confirm quality holds before you keep the change. The wins are real but they are not free of risk; a model that is too small for a step fails quietly, and a quiet quality regression costs more in trust than you saved in tokens.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Context discipline: the cheapest token is the one you never send

The most underrated lever is simply sending less. Because input grows every step, anything you can keep out of the context pays off on every subsequent step. Do not dump an entire file into the conversation when the agent needs one function; give it a tool to fetch the specific slice it asks for. Do not paste a giant API response verbatim; have the tool return a compact, structured summary with the fields that matter.

For long-running agents, manage the window actively. Summarize and compact older turns once they are no longer load-bearing, so the history does not grow without bound. Strip verbose tool outputs down to their essentials before they enter the permanent record. Keep tool result payloads tight by design - paginate, filter, and project at the tool layer rather than letting the model wade through noise. Every kilobyte you keep out of step three is a kilobyte you also keep out of steps four through twenty.

Put these levers together and the typical expensive agent gets dramatically cheaper without losing capability: cache the stable prefix, batch the non-urgent work, parallelize independent calls, route routine steps to smaller models, and keep the context lean. Measure before and after with real traces, because the only optimization that counts is the one you can see in the token totals.

Frequently asked questions

Why does my Claude agent cost so much more than a single API call?

Because the conversation history is re-read at every step. Each step re-ingests the system prompt, tool definitions, and all prior tool results, so a long run pays for its early context many times over. Multi-agent runs multiply this further. Caching and context discipline target exactly this repeated input.

How much does prompt caching actually help?

It helps most when a large, stable prefix is reused across many steps or runs - system instructions, tool schemas, and fixed reference docs. Reading that prefix from cache is much cheaper than recomputing it. The benefit depends on keeping the prefix stable and ordering your prompt stable-first, volatile-last.

When should I use batching versus parallel tool calls?

Use batching for non-urgent, offline work where you can tolerate a longer turnaround in exchange for lower cost. Use parallel tool calls when a single agent step has several independent operations and you want to cut wall-clock latency. They solve different problems - cost versus speed - and are not interchangeable.

Is it safe to use a smaller model for some steps?

Yes, for steps that are narrow and well-specified - extraction, classification, formatting - a smaller model often matches a larger one. The rule is to validate with an eval set before committing, since an under-powered model fails quietly and a silent quality drop can cost more than the tokens you saved.

Fast, frugal agents on every call

The same economics - cache the stable prompt, route cheap steps to small models, and keep context lean - are what let a voice agent stay snappy and affordable at thousands of concurrent calls. CallSphere brings this performance engineering to multi-agent voice and chat assistants that handle every conversation, call tools mid-dialogue, and book work nonstop. Try it at callsphere.ai.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.