Skip to content
Agentic AI
Agentic AI8 min read0 views

Token Cost and Speed for Grounded Claude: Caching

Cut cost on citation-grounded Claude: prompt caching, the Batches API, chunk budgeting, and per-stage model selection across Claude 4.x.

Grounding is expensive by construction. To cite well, Claude needs the source text in front of it, which means stuffing retrieved chunks into the context window on every single turn. A naive citation pipeline re-sends the same system prompt, the same tool definitions, and a fresh pile of documents for each question, and the bill scales linearly with traffic. The good news: most of those tokens are repeated, and repeated tokens are exactly what Claude's caching and batching features were built to make cheap.

This post is about keeping a grounded, citation-producing Claude system fast and affordable without weakening the grounding. We will look at prompt caching for the stable parts, batching for throughput-tolerant workloads, model selection across the Claude 4.x family, and the chunk-budgeting decisions that quietly drive most of your spend.

Key takeaways

  • Cache the stable prefix — system prompt, tool defs, and any fixed reference docs — so you pay full price for them once, not per request.
  • Order context cache-first: stable content before volatile content, or the cache never gets a clean hit.
  • Batch the latency-tolerant work (offline grounding, eval runs, backfills) for a substantial per-token discount.
  • Right-size the model: Haiku 4.5 for retrieval and extraction, Sonnet 4.6 for most grounded answers, Opus 4.8 only when reasoning is the bottleneck.
  • Retrieve fewer, better chunks — top-k is a cost dial, not a free quality lever.

Where the tokens actually go

Before optimizing, measure. In a typical grounded turn, the token budget splits across four buckets: the system prompt and citation instructions (stable), the tool definitions (stable), the retrieved chunks (volatile, often the largest), and the user question plus conversation history (volatile). The retrieved chunks usually dominate, and they are also the part people over-provision — pulling twenty chunks "to be safe" when five would ground the answer.

The stable buckets are pure waste if you re-send them at full price every turn, because they are byte-for-byte identical across thousands of requests. That is the opening for prompt caching. The volatile buckets are where retrieval discipline pays off: every chunk you do not include is input tokens you do not buy and latency you do not incur. Profiling a few hundred real requests to see this split is the cheapest hour you will spend on cost.

Prompt caching: pay once for what does not change

Prompt caching lets you mark a prefix of the prompt as cacheable; on subsequent requests that reuse that exact prefix, those tokens are served from cache at a steep discount instead of being reprocessed. For a grounded agent, the natural cache boundary is right after your tool definitions and any always-present reference material, and right before the per-request retrieved chunks and question.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
resp = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {"type": "text", "text": CITATION_INSTRUCTIONS},
        {"type": "text", "text": STYLE_GUIDE,
         "cache_control": {"type": "ephemeral"}},  # cache up to here
    ],
    tools=SEARCH_TOOLS,  # stable, sits inside the cached prefix
    messages=[
        {"role": "user", "content": retrieved_chunks + user_question},
    ],
)

The rule that makes or breaks caching: stable content must come first, volatile content last. The cache matches on an exact prefix, so if you interleave a fresh document into the middle of your system block, everything after it misses. Put the citation instructions, style guide, and tool definitions up front and freeze their wording. A surprising number of "caching does nothing" reports trace to a timestamp or session ID accidentally placed inside the cached region.

flowchart TD
  A["Incoming question"] --> B["Assemble prompt"]
  B --> C{"Stable prefix cached?"}
  C -->|Yes| D["Reuse cached prefix (cheap)"]
  C -->|No| E["Process full prefix once, store cache"]
  D --> F["Append retrieved chunks + question"]
  E --> F
  F --> G{"Latency tolerant?"}
  G -->|Yes| H["Send via Batches API (discounted)"]
  G -->|No| I["Send real-time"]

Batching the work that can wait

Not every grounded request needs an answer in two seconds. Overnight document enrichment, pre-computing citations for an FAQ corpus, running an eval suite, backfilling summaries — these are throughput problems, not latency problems. The Message Batches API exists for exactly this: submit a large set of requests, accept higher latency, and pay a meaningful per-token discount on the whole batch.

A practical pattern is to split your traffic by SLA. Interactive user questions go through the real-time API with caching on. Everything offline — nightly grounding refreshes, eval gates in CI, bulk reprocessing after a corpus change — goes through batches. The mental model is simple: if no human is staring at a spinner, batch it. Teams that adopt this split often find the majority of their token volume was never latency-sensitive in the first place.

Choosing the right model per stage

Grounded pipelines have stages with very different difficulty, and using one model for all of them overpays. Retrieval-adjacent tasks — query rewriting, deciding which chunks are relevant, extracting a quote span — are well within Haiku 4.5's range and run fast and cheap. The grounded answer with citations is the sweet spot for Sonnet 4.6: strong reasoning, faithful to sources, sensible cost. Reserve Opus 4.8 for genuinely hard synthesis where multiple sources conflict and the reasoning is the bottleneck.

A multi-model pipeline does add orchestration complexity, so do not split prematurely. Start everything on Sonnet, profile, and only break out a stage to Haiku once you can see it is simple and high-volume, or escalate to Opus once you can see Sonnet failing on a specific class of hard questions. Model choice is a measured decision, not a default.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Common pitfalls

  • Putting volatile data inside the cached prefix. A single per-request token in the cached region busts every hit. Keep timestamps, IDs, and chunks strictly after the cache boundary.
  • Over-retrieving chunks. Top-k of twenty when five suffices triples your input cost and can lower answer quality by burying the relevant span.
  • Batching latency-sensitive traffic. Never route interactive questions through batches; users feel the delay and the discount is not worth it.
  • Using Opus everywhere. The most capable model is rarely the right default for extraction and routing; reserve it for the hard reasoning.
  • Optimizing before measuring. Without a token breakdown per request, you will tune the wrong bucket. Profile first.

Cut grounded inference cost in 5 steps

  1. Profile a few hundred real requests and break tokens into stable vs. volatile buckets.
  2. Move all stable content to the front and mark a cache boundary after the tool definitions.
  3. Drop top-k to the smallest value that keeps answer quality flat on your eval set.
  4. Route all latency-tolerant jobs through the Batches API.
  5. Demote simple stages (query rewrite, extraction) to Haiku and confirm quality holds.

Cost levers compared

LeverBest forTradeoff
Prompt cachingRepeated stable prefixesRequires strict ordering discipline
Message batchingOffline / bulk groundingHigher latency, not for interactive use
Smaller top-kMost grounded answersRisk of missing a needed chunk
Model right-sizingMulti-stage pipelinesMore orchestration to maintain

Frequently asked questions

What is prompt caching and how does it help grounded agents?

Prompt caching is a feature that lets you mark a stable prefix of a prompt so repeated requests reuse those processed tokens at a large discount instead of paying full price each time. For grounded agents, the system prompt, citation instructions, and tool definitions are identical across requests, so caching them removes most of the fixed per-turn cost.

When should I use the Batches API instead of real-time calls?

Use batching whenever no human is waiting on the result — overnight grounding refreshes, eval suites, bulk reprocessing after a corpus change. You trade higher latency for a per-token discount across the whole batch. Keep interactive user questions on the real-time API.

Does retrieving more chunks make grounded answers better?

Not usually. Beyond a small number of well-ranked chunks, extra context mostly adds cost and can bury the relevant span, lowering quality. Treat top-k as a tunable dial: lower it until your eval scores start to drop, then stop. Better ranking beats more chunks almost every time.

Which Claude model should ground my answers?

Sonnet 4.6 is the sensible default for grounded answers with citations — strong reasoning and source fidelity at moderate cost. Use Haiku 4.5 for high-volume simple stages like query rewriting and extraction, and reserve Opus 4.8 for hard synthesis where sources conflict.

Fast, grounded answers on every call

CallSphere runs these same caching and batching patterns behind voice and chat agents — grounded responses that stay fast and affordable even at call-center volume, pulling live from your knowledge base. See it in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.