Cutting Token Cost in Claude Legal Agents: Caching & Batching
Prompt caching, batching, and model routing that keep Claude agents in legal workflows fast and cheap — practical cost engineering for law firm deployments.
A single Claude agent that reviews a 90-page master services agreement can cost more than a paralegal's hourly rate if you build it carelessly. Multiply that by a firm reviewing hundreds of contracts a week, and the bill becomes the deciding factor in whether the deployment survives its first quarter. When you deploy Claude across the legal industry, performance and cost are not afterthoughts — they are the difference between a pilot and a platform.
The good news is that legal workloads are unusually friendly to optimization. The documents are long but stable; the same clause library and the same firm playbook get sent on every run; many tasks are embarrassingly parallel. Each of those properties maps onto a specific cost lever. This post walks through the levers in the order they pay off.
Where the tokens actually go
Before optimizing, measure. Instrument every run to record input tokens, output tokens, and which portion of the input is static versus per-request. In a typical contract-review agent, the static portion — system prompt, firm playbook, clause definitions, tool schemas — dwarfs the dynamic portion. We have seen the fixed preamble run four to five times longer than the actual contract excerpt being analyzed.
That ratio is the whole game. If 80% of your input tokens are identical on every call, you are paying full price to re-process the same text thousands of times a day. The first optimization is to stop doing that.
Prompt caching: pay once for the boilerplate
Prompt caching lets Claude store a prefix of your prompt and reuse it across requests at a steep discount, so you pay full rate only for the new tokens after the cached prefix. For legal agents this is transformational, because your playbook and clause library are the perfect cache prefix: large, stable, and reused on every matter.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Structure your prompt so the static content comes first and the cache breakpoint sits right before the per-matter document. Order matters: system prompt, then tool definitions, then firm playbook and clause library, then — last — the specific contract text. Anything above the breakpoint is cached; anything below is fresh. Keep the cached region byte-for-byte identical between calls; a single changed character invalidates the cache and you pay full price again.
flowchart TD
A["Incoming review request"] --> B{"Cached prefix valid?"}
B -->|Yes| C["Reuse playbook & schema at discount"]
B -->|No| D["Process full prefix, write cache"]
C --> E["Append per-matter contract text"]
D --> E
E --> F{"Task urgent?"}
F -->|No| G["Queue to batch API"]
F -->|Yes| H["Run live, route by complexity"]
G --> I["Collect results & store"]
H --> IOne caution specific to legal work: cache invalidation is easy to trigger by accident. If you inject the current date, a request ID, or a freshly shuffled tool order into the cached region, you destroy the discount silently. Pin the static region and audit it. A good test is to hash the cached prefix on every request and alert if the hash changes when it shouldn't.
Batching the work that isn't urgent
A surprising share of legal-agent work is not interactive. Overnight due-diligence sweeps, bulk clause extraction across a deal room, retroactive risk scoring of an archive — none of these need a sub-second response. For those, the batch API processes large volumes asynchronously at a meaningful discount over live calls. Route non-urgent jobs to a queue, submit them as a batch, and collect results when they complete.
The architectural move is to separate the request's latency class from its content. Tag each task as interactive or deferrable at intake. Interactive tasks — a lawyer asking a live question about a contract on screen — run synchronously. Deferrable tasks fall into the batch lane. This single split often cuts the blended cost of a legal-agent platform substantially, because so much of the volume is bulk processing that nobody is waiting on in real time.
Right-sizing the model for each step
Not every step needs the most capable model. A pipeline that runs Opus-class reasoning for clause extraction, conflict detection, and final summarization alike is overpaying for the easy steps. Use the model family deliberately: a fast, cheap model such as Haiku for classification, routing, and "is this clause present" checks; a mid-tier Sonnet for standard extraction and drafting; and the most capable Opus only for the genuinely hard reasoning — novel risk analysis, ambiguous cross-references, adversarial redlines.
Implement this as explicit routing. A cheap first-pass model triages each document and decides which sections warrant escalation; only those sections go to the expensive model. In legal review, most clauses are boilerplate the firm has seen a thousand times; reserving heavy reasoning for the few that aren't can cut spend dramatically without touching quality on the hard cases.
Trimming the loop itself
Beyond per-call cost, the number of turns drives spend. A multi-agent legal pipeline can use several times more tokens than a single well-prompted agent, so reach for orchestration only when the task genuinely needs parallel specialists. When you do use subagents, give each a narrow brief and the smallest context that lets it succeed; passing the full document to every subagent is the most common token leak in these systems.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Also cap retrieval. If your agent pulls ten documents when two would do, you pay for eight irrelevant ones and you dilute the model's attention. Retrieve narrowly, summarize early, and discard intermediate context you no longer need. Cheaper runs and better answers usually come from the same discipline: send the model only what the current step requires.
Frequently asked questions
How much can prompt caching save on legal agents?
It depends on your static-to-dynamic ratio, but legal agents are ideal candidates because the playbook and clause library — the bulk of the prompt — repeat on every run. Cached prefix tokens bill at a fraction of full rate, so the larger and more stable your preamble, the bigger the saving.
When should I use the batch API instead of live calls?
Whenever no human is waiting on the result — overnight due diligence, bulk extraction, archive scoring. Tag tasks as interactive or deferrable at intake and route deferrable ones to batch for a real per-token discount.
Does using a cheaper model hurt accuracy?
Only if you use it for the wrong steps. Route easy work — classification, presence checks, routing — to a fast model and escalate genuinely hard reasoning to the most capable one. Most legal clauses are boilerplate, so this preserves quality where it matters while cutting cost elsewhere.
Why did my cache discount disappear?
Something in the cached prefix changed. A dynamic date, a request ID, or a reshuffled tool order inside the cached region invalidates it. Pin the static region byte-for-byte and hash it on each request to catch accidental drift.
Bringing fast, affordable agents to your phone lines
CallSphere applies the same caching, batching, and model-routing discipline to voice and chat agents — keeping real-time conversations fast and cheap while they answer calls, use tools, and book work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.