Cutting Token Cost in Claude Clinical Abstraction Runs
Use prompt caching, batching, and context pruning to keep Claude clinical-abstraction agents fast and cheap at registry scale in 2026.
Clinical abstraction at scale is a volume problem. A hospital quality team might run an agent across tens of thousands of charts a month, each chart sprawling across dozens of pages of notes, labs, and reports. If you build that agent naively — full chart in the context, full system prompt re-sent on every turn, one chart per request — your token bill and your latency both explode, and the project dies in a budget meeting. Keeping abstraction runs cheap and fast is its own engineering discipline, separate from getting the answers right.
The good news is that abstraction workloads are unusually friendly to optimization. The instructions are identical across charts, the tool definitions never change, and most of each chart is irrelevant to any given field. That structure is exactly what prompt caching, batching, and aggressive context pruning are built to exploit. This post covers how to apply each one to a Claude abstraction agent without sacrificing accuracy.
Where the tokens actually go
Before optimizing, measure. Instrument each run to report input tokens, output tokens, and cache-read versus cache-write tokens per turn. In a typical unoptimized abstraction agent, the dominant cost is input tokens re-sent every turn: the system prompt, the abstraction guidelines, the tool schemas, and the full chart all get re-billed on each step of the loop. A five-step abstraction over a long chart can re-send the same 30,000-token chart five times. That repetition, not the model's thinking, is your bill.
Output tokens matter less in pure cost but more in latency, because they're generated serially. An agent that narrates every step in prose is slower than one that emits terse tool calls. So the two levers are: stop re-sending identical input (caching), and stop generating unnecessary output (terse formats and smaller context).
Prompt caching: the single biggest win
Prompt caching lets you mark a stable prefix of the request so Claude stores it and bills cache reads at a fraction of normal input cost on subsequent calls. For abstraction, the stable prefix is large and obvious: system prompt, abstraction rules, code-set references, and tool definitions are byte-for-byte identical across every chart and every turn. Put all of that at the front, mark it as cacheable, and you stop paying full price to re-send it thousands of times.
flowchart TD
A["Build request"] --> B["Cacheable prefix: rules + code sets + tools"]
B --> C["Per-chart suffix: this chart's text"]
C --> D{"Prefix in cache?"}
D -->|Yes| E["Cache read: cheap"]
D -->|No| F["Cache write: full price once"]
E --> G["Claude abstracts fields"]
F --> GThe ordering rule is strict: cacheable content must come first and must be stable. The moment you interleave the per-chart text into the middle of your guidelines, the cache breaks and you pay full price again. Keep a clean boundary — invariant prefix, then variable chart suffix. Across a month of runs this routinely turns the largest line item on the bill into a rounding error, because the expensive prefix is written to cache once and read cheaply forever after.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Cache entries expire after a short idle window, so caching pays off best when you keep volume flowing. If you abstract in bursts with long gaps, the cache goes cold between batches and each batch re-pays the write. Schedule runs to keep the cache warm during a processing window rather than dribbling charts in one at a time across the day.
Batching and concurrency
Each chart is independent, which makes abstraction embarrassingly parallel. Run many charts concurrently rather than serially — the wall-clock time for ten thousand charts drops by your concurrency factor, and the shared cacheable prefix is warm for all of them. For non-urgent backfills where latency doesn't matter, an asynchronous batch path trades immediate results for a meaningful per-token discount, which is ideal for monthly registry submissions that have a deadline but no real-time requirement.
Be deliberate about not over-parallelizing the wrong layer. Concurrency across charts is pure win. Spawning multiple subagents per chart, by contrast, multiplies tokens several times over and rarely helps a bounded extraction task. Reserve multi-agent fan-out for genuinely decomposable work; for single-chart abstraction, one well-instrumented agent per chart, many charts in parallel, is the cheap and fast shape.
Prune the context to the fields you need
The biggest chart-side saving comes from not sending the whole chart. Most fields live in predictable sections — the principal diagnosis is in the discharge summary, the ejection fraction in the echo report. A cheap pre-pass with a small model like Claude Haiku 4.5 can route each field to the one or two sections likely to contain it, so the expensive abstraction step sees a few thousand tokens instead of thirty thousand. You pay a little for routing and save a lot on the main call.
Pair this with tool results that return spans, not whole documents. When the agent fetches a section, return just that section, and when it searches, return the matching passages with a little surrounding context — not the entire note. Every token you don't put in the context is a token you don't pay for on this turn and every subsequent turn of the loop. Lean context is the gift that keeps giving in an agentic loop because the loop re-bills whatever you leave in it.
Keep output terse and structured
Ask the agent to emit structured results — field, value, source quote, confidence — rather than narrating its reasoning in long prose. Structured output is shorter, faster to generate, easier to validate, and easier to cache downstream. If you need the reasoning for audit, capture it in the trajectory log rather than in the final output the agent must generate token by token. Reserve extended thinking for the genuinely hard fields, not for routine ones where it just burns output tokens and latency for no accuracy gain.
The same logic applies to the loop shape itself. A bounded extraction task should resolve in a small number of tool calls, so cap the agent's step budget and make each step do real work. An agent allowed to wander will rack up turns — and every extra turn re-bills the cached prefix read plus whatever variable context you left in the window. Tightening the loop is not just a reliability win; it is directly a cost win, because tokens in an agentic system are paid per turn, not per task.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Watch the cost dashboard like a latency budget
Cost optimization decays if you don't measure it continuously. Build a simple per-run record — input tokens, output tokens, cache-read tokens, cache-write tokens, tool calls, and wall-clock time — and aggregate it into a dashboard you actually look at. The two numbers that tell you whether your caching is working are the cache-read ratio (you want most input tokens served from cache) and the average tool calls per chart (you want it low and stable). When either drifts, something regressed: a prompt edit broke the cache boundary, or a change made the agent loopier.
Set alerts on those ratios the way you'd set alerts on p95 latency. A silent regression — someone reorders the prompt and the cacheable prefix stops matching — can double your bill overnight with no error and no failing test. The teams that keep abstraction cheap over months are the ones that treat token economics as a monitored system property, not a one-time tuning pass at launch.
Frequently asked questions
How much can prompt caching realistically save here?
It depends on how much of your request is invariant, but for abstraction the invariant prefix — rules, code sets, tool schemas — is usually large and re-sent on every turn. Caching it converts that from full-price input on each call to a cheap cache read. When the prefix dwarfs the per-chart text, the savings on input cost are substantial and compound across thousands of charts.
Does pruning the chart hurt accuracy?
Only if the router sends the wrong section. Build the router conservatively — when unsure, include an extra section rather than fewer — and keep a fallback that widens the context if a field comes back not-documented. Done carefully, pruning preserves accuracy while cutting most of the token cost.
Should I use batch processing or real-time calls?
Use real-time for anything a clinician is waiting on, and asynchronous batch for backfills and scheduled registry runs where a delay of hours is fine. Batch paths typically cost less per token, so routing non-urgent volume through them lowers the bill without touching quality.
Is a cheaper model enough on its own?
Using Haiku for routing and reserving a stronger model for the actual extraction is a good split, but model choice alone won't fix re-sent context. Caching and pruning attack the structural waste; model selection tunes the per-token rate. You want both.
Efficient agents on the phone, too
CallSphere applies the same cost discipline to live voice and chat — cached instructions, lean context, and tight tool calls so every conversation stays fast and affordable at scale. See it at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.