Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Token Cost in Claude Multi-Agent Systems (Building Multi Agent Systems)

Use prompt caching, batching, and model routing to keep Claude multi-agent runs cheap and fast without sacrificing output quality.

There is a moment most teams hit a few weeks after their first multi-agent system goes live: the bill arrives. A workflow that felt cheap in testing turns out to cost several times what a single Claude agent would, because every subagent carries its own context, makes its own tool calls, and re-reads the same documents the orchestrator already read. Multi-agent architectures buy you parallelism and specialization, but they spend tokens to do it, and that spend is not automatic — it is something you engineer down.

The good news is that token cost in a Claude multi-agent system is highly controllable once you understand where it goes. Most of the waste falls into three buckets: context that gets re-sent on every turn, work that gets done serially when it could be batched, and expensive models doing cheap work. Tackle all three and you can often cut cost by more than half without touching output quality.

Where the tokens actually go

Before optimizing, measure. Instrument every agent run to record input and output tokens per turn, per agent, and per tool call. When you do this for the first time the result is almost always surprising: the bulk of spend is rarely the model's clever reasoning. It is input tokens — the system prompt, the tool definitions, the retrieved documents, and the conversation history re-sent on every single turn.

This matters because it tells you where to aim. A multi-agent run with ten orchestrator turns and four subagents, each running eight turns, is not paying for forty-two clever thoughts. It is paying to re-transmit the same large prompts dozens of times. The optimization target is the repeated input, not the reasoning, and that is exactly what prompt caching is built to attack.

Prompt caching: stop paying for the same prefix

Prompt caching lets you mark a stable prefix of your prompt — system instructions, tool definitions, long reference documents — so that on subsequent calls Claude reads it from cache at a steep discount instead of full price. In a multi-turn agent loop, where that prefix is identical on every turn, the savings compound across the whole run.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

The discipline that makes caching pay off is prompt ordering. Put everything stable at the front and everything that changes at the back. If you interleave a changing timestamp or a per-turn variable into the middle of your system prompt, you break the cached prefix and lose the discount. Structure your prompts so the boundary between cached and dynamic content is clean, and verify with your token metrics that cache reads are actually landing.

flowchart TD
  A["Incoming agent turn"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Read prefix from cache (cheap)"]
  B -->|No| D["Pay full price, write cache"]
  C --> E["Append dynamic suffix"]
  D --> E
  E --> F{"Task batchable?"}
  F -->|Yes| G["Queue for batch API"]
  F -->|No| H["Route by difficulty: Haiku / Sonnet / Opus"]
  G --> H

Batching: do many things in one pass

A lot of multi-agent work is embarrassingly parallel and accidentally serial. If your system classifies a hundred support tickets one agent call at a time, you are paying per-call overhead a hundred times. Two batching strategies fix this. The first is within a single prompt: ask one Claude call to process a batch of items and return structured results, amortizing the system prompt and tool definitions across all of them.

The second is the asynchronous batch API, which trades latency for a meaningful per-token discount on work that does not need an immediate answer. Overnight enrichment, bulk summarization, eval runs over a large dataset — anything where a result in an hour is fine — belongs here. The architectural move is to separate your real-time agent path from your background batch path so that only genuinely interactive work pays real-time prices.

Model routing: send cheap work to cheap models

Not every step in a multi-agent system needs your most capable model. The 2026 Claude family spans Opus for the hardest reasoning, Sonnet for balanced general work, and Haiku for fast, cheap, high-volume tasks. A common and costly mistake is running every subagent on Opus because the orchestrator needed it. Routing by difficulty is one of the largest single levers you have.

The pattern that works: use a capable model for the orchestrator and for genuinely hard reasoning, and route mechanical subagent work — extraction, classification, formatting, simple lookups — to Haiku. You can even let the orchestrator declare a difficulty hint when it spawns a subagent and map that hint to a model tier. The result is that your expensive model is reserved for the small fraction of turns that actually need it, and the long tail of routine work runs at a fraction of the cost.

Trimming context before it reaches an agent

Every token you put into a subagent's context is a token you pay for on every one of its turns, so context hygiene is a direct cost lever. Do not pass a subagent the orchestrator's entire history; pass it a tight, task-specific brief. Do not stuff a whole document into context when retrieval can hand the agent just the relevant section. And summarize tool results aggressively — a tool that returns a 50KB JSON blob the agent only needs three fields from is pure waste re-sent on every later turn.

This is also where multi-agent architecture earns its keep on cost rather than just spending it. The reason to split work across subagents is so each one operates with a small, focused context instead of one giant agent dragging an enormous history through every turn. Done well, specialization keeps per-turn input small, and small input is the whole game.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Measuring cost as a first-class metric

Speed and cost should sit in your dashboards next to quality, not in a monthly billing surprise. Track tokens and latency per workflow, and watch them over time the way you watch error rates. When a change to a prompt quietly doubles context size, your metrics should catch it the same day. Treat a regression in cost-per-run as a real regression that needs a fix, and the system stays cheap as it grows.

Frequently asked questions

How much more do multi-agent systems cost than single-agent?

Multi-agent runs typically consume several times more tokens than a single-agent equivalent, because each subagent carries its own context and tool calls. That multiplier is the price of parallelism and specialization, but caching, batching, and model routing can claw most of it back without hurting quality.

What gives the biggest token savings fastest?

Prompt caching, usually. In a multi-turn agent loop the stable prefix — system prompt, tool definitions, reference docs — gets re-sent every turn, and caching it cuts the cost of that repetition sharply. Order your prompts so all stable content sits at the front and only dynamic content varies at the end.

When should I use the batch API versus real-time calls?

Use the asynchronous batch API for anything that does not need an immediate response — overnight enrichment, bulk classification, large eval runs — because it offers a meaningful per-token discount in exchange for latency. Keep only genuinely interactive, user-facing turns on the real-time path.

Should every subagent use the same model?

No. Reserve your most capable model for the orchestrator and hard reasoning, and route mechanical subagent work like extraction and classification to a fast, cheap model such as Haiku. Letting the orchestrator pass a difficulty hint that maps to a model tier is a clean way to automate this.

Bringing efficient agents to your phone lines

CallSphere applies these same cost and latency patterns to voice and chat — multi-agent assistants that answer every call and message fast, cache what they can, and keep per-conversation cost low at scale. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.