Cutting Token Cost in Claude Multi-Agent Systems
Use prompt caching, batching, model tiering, and tight context scoping to keep Claude multi-agent runs fast and cheap without losing quality.
Multi-agent systems have a cost problem that catches teams by surprise. A single Claude agent answering a question might spend a few thousand tokens. Spin up an orchestrator that fans out to five subagents, each with its own context and its own back-and-forth with tools, and the same task can consume several times more tokens — sometimes an order of magnitude more. The capability is real, but so is the bill, and a system that is brilliant and unaffordable does not ship. Keeping multi-agent runs cheap and fast is an engineering problem with concrete levers, and most teams leave the easiest ones unpulled.
This post covers the three biggest levers — prompt caching, batching, and ruthless context scoping — plus the measurement discipline that tells you whether your changes actually helped.
Where the tokens actually go
Before optimizing, you have to know where the spend lives. In an orchestrator-subagent system, cost concentrates in three places. First, the system prompt and tool definitions are re-sent on every model call, and in a long agent loop that fixed preamble is paid for again and again. Second, each subagent carries its own context window, so spawning five subagents means five separate contexts that do not share memory. Third, the orchestrator re-reads the growing transcript on every turn, so its per-call cost climbs as the run progresses.
A useful definition: effective token cost is the sum of input tokens across every model call in a run, weighted by whether each call's input was a cache hit or a full-price read. That framing matters because the same logical context can cost wildly different amounts depending on how well you cache it. Two systems doing identical work can differ fivefold in cost purely on caching hygiene.
Lever one: prompt caching
Prompt caching is the highest-leverage optimization available, and it is underused. Claude lets you mark a stable prefix of a prompt — system instructions, tool schemas, reference documents — so that repeated calls reuse the cached prefix at a steep discount instead of paying full input price each time. In an agent loop where the system prompt and tool definitions are identical across dozens of turns, caching that prefix turns a large recurring cost into a small one.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Build prompt"] --> B["Stable prefix: system + tools + docs"]
A --> C["Volatile suffix: latest turn"]
B --> D{"Prefix unchanged from last call?"}
D -->|Yes| E["Cache hit: pay discounted rate"]
D -->|No| F["Cache miss: full price + re-cache"]
E --> G["Send call with cheap prefix"]
F --> G
C --> G
The structural rule is to order your prompt from most stable to least stable. Put the system prompt and tool definitions first, reference material next, and the changing conversation last. Caching keys on an exact prefix match, so a single token inserted near the top invalidates everything after it. Teams routinely sabotage their own cache by injecting a timestamp or a per-turn counter early in the prompt; move that to the end and the cache holds across the whole run. For multi-agent systems, share an identical cached preamble across subagents of the same role so the cache warms once and benefits all of them.
Lever two: batching independent work
The reason to use multiple agents at all is parallelism, and parallelism is also a cost tool when used well. If five subagents each do independent work, running them concurrently does not reduce total tokens, but it dramatically reduces wall-clock time — and for many workloads, latency is the cost that users feel. The mistake is fanning out work that is not actually independent, so subagents duplicate effort or wait on each other, paying for parallelism without getting it.
For high-volume, latency-tolerant work — scoring a backlog of records, generating many summaries, running an eval suite — the Message Batches API is the right tool. It processes large sets of requests asynchronously at a substantial discount compared to real-time calls. The trade is latency: you submit a batch and collect results later. For anything that does not need an instant answer, batching is close to free money. Reserve real-time calls for the interactive path and push everything else to batch.
Match the model to the job as part of batching. Not every subagent needs Opus. A classification or extraction subagent often runs perfectly on Haiku at a fraction of the cost, while the orchestrator and the genuinely hard reasoning steps stay on Sonnet or Opus. A tiered fleet — cheap models for narrow tasks, expensive models only where capability is required — is one of the largest cost reductions available, and it costs nothing but configuration.
Lever three: scope the context hard
Every token a subagent does not need is a token you should not send. The instinct to hand each subagent the full conversation history "just in case" is the single most expensive habit in multi-agent design. A subagent that summarizes a document needs the document and a tight instruction — not the orchestrator's entire transcript. Give each subagent the minimum context to do its job and have it return a compact result, not its full reasoning trace.
This is where the orchestrator earns its keep: it holds the global state and hands each subagent a precise, bounded slice. When subagents return, they should return distilled outputs — a decision, a structured record, a short summary — rather than dumping their working context back into the orchestrator's window, which would inflate every subsequent orchestrator call. Compaction at the handoff boundary keeps the orchestrator's context from ballooning as the run grows.
Measure before and after every change
None of these levers should be applied on faith. Instrument every run to record total input tokens, output tokens, cache-read tokens, and cache-write tokens, broken down per agent. The cache-hit ratio is the metric that tells you whether your prefix ordering is working; if it is low, your cache is being invalidated and you should find what is changing early in the prompt. Track cost per completed task, not cost per call, because a cheaper call that triggers more retries is not actually cheaper.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Set a token budget per run and treat exceeding it as a failure to investigate, the same way you would treat a latency regression. The combination of caching, batching, model tiering, and tight scoping routinely cuts multi-agent cost by a large margin — but only if you are measuring, because without numbers you cannot tell an optimization from a regression.
Frequently asked questions
Why do multi-agent systems cost so much more than a single agent?
Each subagent runs its own context window with no shared memory, the orchestrator re-reads a growing transcript every turn, and fixed preambles are re-sent on every call. These multiply, so a fan-out can cost several times — sometimes an order of magnitude — more than one agent doing the work serially.
How much can prompt caching realistically save?
It depends on how much of your prompt is stable, but in agent loops the system prompt and tool schemas are identical across every turn, so caching that prefix turns a large recurring input cost into a small discounted one. Order prompts stable-first and avoid injecting volatile tokens early, or the cache breaks.
When should I use the Batches API instead of real-time calls?
Use batching for high-volume work that tolerates latency — scoring backlogs, bulk generation, running eval suites — where you can submit requests and collect results later at a significant discount. Keep real-time calls for the interactive path where a user is waiting.
Do all subagents need the most capable model?
No. Run narrow tasks like classification and extraction on Haiku, reserve Sonnet or Opus for the orchestrator and genuinely hard reasoning, and you cut cost sharply with no quality loss on the simple steps. Tiering models to tasks is one of the cheapest wins available.
Bringing agentic AI to your phone lines
CallSphere applies the same cost discipline to live voice and chat agents — cached preambles, tiered models, and tight context so every call is answered fast without runaway spend. See it in action at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.