Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting token cost in Claude Code: caching and batching

Keep dynamic Claude Code workflows cheap and fast with prompt caching, batching, tight context scoping, and deliberate multi-agent use.

A dynamic Claude Code workflow that works is satisfying right up until the bill arrives. An agent that reads files, runs commands, reasons across a large context, and iterates can quietly consume millions of tokens in a single afternoon — and a multi-agent version of the same job can cost several times more. The capability is real, but so is the spend, and most teams discover the cost only after they've shipped something they now can't afford to run at scale. The good news is that token cost in agentic workflows is highly controllable once you understand where it actually accumulates.

This post is about the levers that matter: prompt caching, batching, scoping context tightly, and being deliberate about when parallelism is worth its price. None of these require sacrificing quality. Most of them make the workflow faster as a bonus.

Where the tokens actually go

Before optimizing, you have to know where the spend lives, and it's rarely where people guess. The dominant cost in most agentic workflows is not the model's output — it's the input context that gets re-sent on every turn. An agentic loop works by appending each tool result to a growing conversation and resending the whole thing to the model on the next step. A workflow with thirty tool calls re-sends an ever-larger prompt thirty times, so the same file you read once is paid for again and again as it rides along in the context.

This is the single most important insight for cost control: in a long agentic run, input tokens dominate, and they grow with every step. A workflow that reads ten large files early and then runs twenty more steps carries those files' tokens through all twenty steps. Reducing what lives in context, and reducing how often you pay full price for it, is where the savings are.

The second cost center is redundant work — re-reading a file the agent already saw, re-running a search it already ran, re-deriving a fact already established. Agentic loops are prone to this because each turn is somewhat fresh, and an agent without good memory of what it already learned will happily pay to relearn it.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Prompt caching: pay full price once

Prompt caching is the highest-leverage cost lever for agentic workflows, because it directly attacks the re-sent-context problem. The idea is that a stable prefix of your prompt — system instructions, tool definitions, a large reference document, the project's context — can be cached after the first request, so subsequent requests that reuse that prefix are billed at a steep discount for the cached portion instead of full input price.

The structural rule that makes caching pay off is put the stable, large content at the front and the volatile content at the back. Caching works on prefixes, so anything before the first change is cacheable and anything after it is not. If you sprinkle a changing timestamp near the top of your prompt, you've invalidated everything after it. Order your context so the immovable mass — instructions, schemas, big docs — sits first and the turn-by-turn deltas come last.

flowchart TD
  A["Workflow turn"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Reuse cache: pay discounted rate"]
  B -->|No| D["Pay full input price & write cache"]
  C --> E["Append only new delta to context"]
  D --> E
  E --> F{"Context near limit?"}
  F -->|Yes| G["Summarize & prune old turns"]
  F -->|No| H["Continue loop"]
  G --> H

The diagram shows the loop you want: a cached, stable prefix; only the new delta appended each turn; and active pruning when context grows large. Caches typically have a short lifetime, so the win is largest in bursty, iterative sessions where many requests land close together — which is exactly the shape of an agentic workflow.

Batching and scoping: do less, pay less

The cheapest token is the one you never send. Aggressively scoping what the agent reads is the most direct cost reduction available. An agent that's told to read a specific module instead of the whole repository pays for a fraction of the context. Precise instructions — "the routing logic lives in src/routing" — keep the agent from grepping the world and dragging thousands of irrelevant tokens into context where they'll be re-billed on every later turn.

Batching is the complement. When a workflow needs to perform many similar independent operations — classify two hundred records, enrich a list of accounts, summarize fifty documents — running them as one-call-each in a tight loop is wasteful and slow. For high-volume, non-interactive jobs, a batch API processes many requests together at a reduced rate, trading immediacy for a substantial discount. If the work doesn't need an answer this second, batch it.

Model selection is the quiet third lever. Not every step needs the most capable model. A workflow can route a simple classification or extraction step to a smaller, cheaper model like Haiku and reserve a frontier model like Opus for the genuinely hard reasoning. Matching model to task difficulty, rather than using the biggest model for everything, often cuts cost dramatically with no quality loss on the easy steps.

When parallelism is worth the premium

Multi-agent workflows are powerful and expensive. An orchestrator that spawns several subagents to investigate a problem in parallel typically burns several times more tokens than a single agent doing the work serially, because each subagent carries its own context and the orchestrator pays to coordinate them. That premium is sometimes absolutely worth it and sometimes pure waste.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

It's worth it when breadth or wall-clock speed has real value — investigating an incident from several angles at once, or auditing many independent data sources where serial processing would take too long. It's waste when the task is fundamentally sequential, where each step depends on the last and parallelism buys nothing but a bigger bill. The discipline is to default to single-agent and reach for parallelism deliberately, when you can name the speedup or coverage you're buying.

Frequently asked questions

What's the biggest token cost in a Claude Code workflow?

Re-sent input context. Each agentic turn resends the growing conversation, so files and instructions loaded early get paid for again on every subsequent step. Controlling what lives in context and caching the stable parts is where the largest savings are.

How does prompt caching save money in practice?

It lets you pay full price for a stable prefix once and a steep discount on every reuse. Put large, immovable content — system instructions, tool definitions, reference docs — at the front of the prompt and volatile content at the back, since caching works on prefixes and any change invalidates everything after it.

When should I use a multi-agent run versus a single agent?

Use multi-agent only when breadth or speed genuinely pays off, like parallel investigation of independent sources, because it typically costs several times more tokens than a single agent. For sequential work where each step depends on the last, a single agent is cheaper and just as effective.

Can I mix models to save cost?

Yes, and you should. Route easy steps — classification, extraction, formatting — to a smaller, cheaper model and reserve a frontier model for hard reasoning. Matching model size to task difficulty often cuts spend significantly with no quality loss on the simple work.

Bringing agentic AI to your phone lines

The same cost discipline — cache the stable context, batch the bulk work, parallelize only when it pays — keeps CallSphere's voice and chat agents fast and affordable as they answer every call, use tools mid-conversation, and book work around the clock. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.