---
title: "Cutting token cost in Claude Code dynamic workflows"
description: "Keep Claude Code dynamic workflows cheap and fast with prompt caching, batching, context scoping, and the right multi-agent tradeoff."
canonical: https://callsphere.ai/blog/cutting-token-cost-in-claude-code-dynamic-workflows
category: "Agentic AI"
tags: ["agentic ai", "claude", "claude code", "prompt caching", "token cost", "performance", "dynamic workflows"]
author: "CallSphere Team"
published: 2026-05-28T11:23:11.000Z
updated: 2026-06-06T21:47:41.498Z
---

# Cutting token cost in Claude Code dynamic workflows

> Keep Claude Code dynamic workflows cheap and fast with prompt caching, batching, context scoping, and the right multi-agent tradeoff.

A dynamic workflow that works is easy to fall in love with and easy to overspend on. The agent reads a little more context than it needs, re-reads it on the next turn, spawns a subagent that re-derives what the parent already knew, and your token bill quietly triples. The agent still produces good output, so nobody notices until the monthly invoice or the latency complaints arrive. Cost and speed are not afterthoughts in agentic systems — they are design constraints you build for from the first run.

The good news is that most of the waste in a Claude Code dynamic workflow comes from a handful of predictable patterns, and each has a clean fix. This post covers the three highest-leverage levers — prompt caching, batching, and scoping context — plus the multi-agent tradeoff that quietly dominates everything else. The goal is a workflow that is cheap and fast by construction, not one you have to babysit.

## Where the tokens actually go

Before optimizing, measure. The single most useful habit is to log per-run token usage broken down by input and output, and to attribute it to the tool calls that drove it. When you do this for the first time, the result is almost always surprising: the bulk of the cost is input tokens, not output, and most of those input tokens are the same context being re-sent on every turn of the conversation.

This is fundamental to how agentic loops work. Each turn, the model receives the entire conversation so far — system prompt, every prior tool call, every tool result — plus the new step. A ten-turn workflow re-sends turn one's context ten times. If your system prompt and tool definitions are large, you pay for them on every single turn. That is the number to attack first, because it scales with conversation length and dominates long runs.

## Prompt caching: the biggest single win

Prompt caching lets you mark a stable prefix of the request — system prompt, tool definitions, large reference documents — so that on subsequent turns those tokens are read from cache at a fraction of the normal input price instead of being reprocessed. For a long dynamic workflow where the same prefix is resent every turn, this is frequently the largest cost reduction available, often turning the dominant line item into a rounding error.

```mermaid
flowchart TD
  A["New turn"] --> B{"Stable prefix
matches cache?"}
  B -->|Yes| C["Read prefix from cache
(cheap)"]
  B -->|No| D["Process full prefix
(full price)"]
  D --> E["Write prefix to cache"]
  C --> F["Process only new
turn tokens"]
  E --> F
  F --> G["Model responds"]
```

To use caching well, structure your request so the stable parts come first and the volatile parts come last. Put the system prompt, tool definitions, and any large fixed reference material at the front of the context where they can be cached, and keep the changing conversation tail after them. The moment you insert volatile content before stable content, you invalidate the cache for everything after it and pay full price again.

The trap to avoid is cache thrash. If you regenerate your system prompt with a timestamp, a random ID, or freshly-ordered tool definitions on every run, the prefix never matches and caching does nothing. Keep the cacheable prefix byte-for-byte identical across turns and runs. A stable prefix is worth engineering for deliberately, not leaving to chance.

## Batching: do more per turn

Every turn of an agentic loop carries fixed overhead — the resent context, the model's reasoning, the round-trip latency. The fewer turns it takes to finish, the cheaper and faster the run. Batching is the practice of accomplishing more work per turn so the loop is shorter. Instead of reading one file, deciding, reading the next, decide once to read all the files you know you will need, then reason over them together.

This is partly a tool design problem. If your tools only operate on one item at a time, the model is forced into one-item-per-turn loops. Provide tools that accept multiple targets — read several files in one call, run several independent checks together — and the model can collapse what would have been six turns into one. Fewer turns means less resent context, which compounds with caching for a multiplicative win.

Batching also applies to parallelism. When subtasks are genuinely independent, Claude Code can run them concurrently rather than serially, which cuts wall-clock latency even if total tokens are similar. The key word is independent: batching only helps when the items do not depend on each other's results. If step two needs step one's output, forcing them into one turn just produces a guess.

## Scope the context you feed in

The cheapest token is the one you never send. A surprising amount of waste comes from over-feeding the agent: dumping an entire large file when only one function matters, or loading a reference document the task will never touch. Every token of context you include is paid for on the turn it enters and, without caching, on every turn after. Be deliberate about what enters the context window.

This is where Agent Skills earn their keep. Rather than stuffing every possible instruction into the system prompt, skills let Claude load instructions and resources dynamically only when the task is relevant. A workflow that supports twenty capabilities does not pay for twenty capabilities' worth of prompt on every run — it loads the one or two it needs. Designing your reference material to be loaded on demand keeps the baseline context lean.

For retrieval-style workflows, prefer fetching the specific span over the whole document. Return the matching function, the relevant section, the rows that satisfy the query — not the entire file or table. This keeps each turn's input small and, as a bonus, improves quality, because the model reasons better over a tight, relevant context than over a haystack it has to search.

## The multi-agent tradeoff

The largest cost decision in agentic design is whether to use multiple agents at all. Multi-agent runs typically consume several times more tokens than a single agent solving the same task, because each subagent carries its own context and the orchestrator pays to coordinate them. That cost can be worth it for genuinely parallel, well-isolated subtasks — but it is rarely worth it for work a single focused agent could do in sequence.

The discipline is to reach for multi-agent only when the parallelism is real and the subtasks are cleanly separable, and to pick the cheapest capable model for each role. A subagent doing mechanical extraction does not need your most powerful model; reserve the heavy model for the orchestration and judgment, and use a lighter, faster model for the bulk grunt work. Matching model size to task difficulty is one of the most reliable cost levers there is.

## Frequently asked questions

### What is prompt caching and why does it matter for agents?

Prompt caching lets you mark a stable prefix — system prompt, tool definitions, fixed references — so it is read from cache cheaply on later turns instead of reprocessed. Because agentic loops resend the full context every turn, caching the stable prefix is often the single largest cost reduction in a long dynamic workflow.

### How do I make a Claude Code workflow run faster?

Shorten the loop. Use batching tools that handle multiple items per call so the agent finishes in fewer turns, run genuinely independent subtasks in parallel to cut wall-clock time, and scope the context so each turn processes less. Fewer turns and smaller turns both reduce latency.

### When is a multi-agent setup worth the extra tokens?

Only when subtasks are genuinely parallel and cleanly isolated. Multi-agent runs cost several times more tokens than a single agent, so reserve them for true parallelism and use the cheapest capable model for each subagent role rather than running everything on your most powerful model.

### Why is most of my cost input tokens rather than output?

Because each turn resends the entire conversation so far. A long workflow reprocesses its system prompt and tool definitions on every turn. Cache the stable prefix and trim what you feed in, and the input-token line item usually drops dramatically.

## Agentic AI that pays for itself on your phone lines

CallSphere uses these same cost and latency tactics — caching, batching, and right-sized models — to run **voice and chat** agents that answer every call and message and book work 24/7 without runaway cost. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-token-cost-in-claude-code-dynamic-workflows
