---
title: "Claude Prompt Caching Patterns: Structuring Prompts and Tools"
description: "Reusable code-level caching patterns for Claude: layer stable vs volatile content, freeze tools, and place breakpoints for multi-turn and fan-out work."
canonical: https://callsphere.ai/blog/claude-prompt-caching-patterns-structuring-prompts-and-tools
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "anthropic", "prompt engineering", "tool use", "agent design"]
author: "CallSphere Team"
published: 2026-02-06T08:46:22.000Z
updated: 2026-06-07T01:28:24.092Z
---

# Claude Prompt Caching Patterns: Structuring Prompts and Tools

> Reusable code-level caching patterns for Claude: layer stable vs volatile content, freeze tools, and place breakpoints for multi-turn and fan-out work.

Once caching works for a single document, the interesting problem becomes structural: how do you organize a real application's prompts, tool definitions, and context so that caching happens automatically across every code path, not just the one you hand-tuned? The teams that get this right do not sprinkle `cache_control` markers reactively. They design the prompt-assembly layer around a small set of patterns, so that stable content always renders before volatile content and breakpoints land at natural stability boundaries. This post collects those reusable patterns.

Think of it as a pattern catalog for the prompt-building code itself — the function that takes your inputs and produces the `tools`, `system`, and `messages` payload. Get the layering right there and caching becomes a property of the architecture rather than a per-request afterthought.

## Key takeaways

- **Layer by stability:** never-changes content first, per-session content next, per-turn content last, per-request junk eliminated or pushed to the very end.
- Serialize tools deterministically (sorted keys, stable order) so the position-0 block is byte-identical across requests and users.
- For shared-prefix-varying-suffix workloads, put the breakpoint at the end of the *shared* part, not the end of the whole prompt.
- In multi-turn conversations, mark the last block of the newest turn; earlier breakpoints stay valid as read points.
- For fan-out, send one request first, await its first token, then fire the rest so they read the cache the first one wrote.

## Pattern: layer content by how often it changes

The foundational pattern is to classify every input that flows into the prompt by its change frequency, then physically order the prompt so the slowest-changing content comes first. There are four buckets, and each has a home.

| Bucket | Examples | Where it goes |
| --- | --- | --- |
| Never changes | Agent persona, tool schemas, policy doc | Tools + start of system, behind a breakpoint |
| Per session | User profile, retrieved knowledge base | After the global prefix, own breakpoint |
| Per turn | Conversation history, latest message | End of messages, marked on newest turn |
| Per request | Timestamps, request UUIDs, random IDs | Eliminate, or place dead last after all breakpoints |

This ordering is not optional decoration; it is what makes the prefix match. A per-request UUID that sneaks into the system header sits in front of everything and invalidates the entire cache on every call. The same UUID placed in the final user message invalidates nothing before that message. The pattern is: classify, then sort by stability ascending, then place breakpoints at the bucket boundaries.

## Pattern: a deterministic, frozen tool surface

Tools render at position zero, so any nondeterminism in how you serialize them poisons the whole cache for every request that follows. The pattern is to build the tool list once, sort it canonically, and never vary it per user or per request.

```
def build_tools():
    tools = [
        {"name": "get_order", "description": "Look up an order by ID",
         "input_schema": {"type": "object",
             "properties": {"order_id": {"type": "string"}},
             "required": ["order_id"]}},
        {"name": "check_inventory", "description": "Check stock for a SKU",
         "input_schema": {"type": "object",
             "properties": {"sku": {"type": "string"}},
             "required": ["sku"]}},
    ]
    return sorted(tools, key=lambda t: t["name"])  # stable order every time
```

Sorting by name guarantees byte-identical output regardless of how the list was assembled. If you genuinely need different tools for different request types, resist the urge to swap the tool set inline — that is the single most expensive cache invalidation. Instead, keep one superset of tools and steer with `tool_choice` or message content, both of which preserve the tools cache. If the tool library is large, tool search appends discovered schemas rather than swapping them, so the existing prefix survives.

```mermaid
flowchart TD
  A["Inputs"] --> B["Frozen tools (sorted)"]
  B --> C["Stable system prompt"]
  C --> D{"breakpoint 1"}
  D --> E["Per-session: profile + retrieved docs"]
  E --> F{"breakpoint 2"}
  F --> G["Conversation history"]
  G --> H["Newest user turn"]
  H --> I{"breakpoint 3"}
```

The diagram shows the canonical three-breakpoint layout: one after the frozen tools-plus-system block, one after the per-session context, one on the newest turn. Each breakpoint is a read point that stays valid as long as everything before it is unchanged, so a new turn only invalidates from breakpoint three forward.

## Pattern: shared prefix, varying suffix

A huge class of workloads send a large fixed preamble — few-shot examples, a retrieved document, a rubric — followed by a small varying question. The mistake is marking the end of the whole prompt, which makes every request write a distinct entry that is never read. The pattern is to mark the end of the *shared* portion and leave the question unmarked.

```
messages = [{"role": "user", "content": [
    {"type": "text", "text": SHARED_CONTEXT,
     "cache_control": {"type": "ephemeral"}},   # breakpoint here
    {"type": "text", "text": user_question},      # no marker — varies
]}]
```

Now the shared context is written once and read by every subsequent request regardless of the question. This is the structural fix for the classic "my cache writes but never reads" symptom: the breakpoint was simply too late in the stream.

## Pattern: multi-turn conversation caching

In a chat or agent loop, the conversation grows by appending turns. The pattern is to place the breakpoint on the last content block of the most recently appended turn. Each new request reuses the entire prior conversation as a cached prefix, and hits accrue incrementally as the dialog lengthens.

```
def add_user_turn(messages, text):
    messages.append({"role": "user", "content": [
        {"type": "text", "text": text,
         "cache_control": {"type": "ephemeral"}}
    ]})
    return messages
```

Two refinements matter for long agent turns. First, because a breakpoint walks back only twenty content blocks, a turn that emits more than twenty tool_use and tool_result pairs needs an intermediate breakpoint roughly every fifteen blocks or the next request misses. Second, when an operator instruction arrives mid-conversation — a mode switch, injected state — append it as a `{"role": "system", ...}` message rather than editing the top-level system prompt, which would invalidate the whole cached history.

## Pattern: fork operations reuse the parent prefix

Side computations — summarization, compaction, a sub-agent — often spin up a separate API call. The pattern that keeps them cheap is to copy the parent request's `system`, `tools`, and `model` verbatim, then append only the fork-specific instruction at the end. If the fork rebuilds the prefix with any difference, even a reworded system line, it misses the parent's cache entirely and pays full price for the shared context twice.

The same discipline applies to concurrency. When you need N parallel calls over a shared prefix at cold start, they cannot read a cache that is still being written. Send one request, await its first streamed token so the entry becomes readable, then fan out the remaining N minus one. They will all read the cache the first call just created instead of each paying the write premium.

## Common pitfalls

- **Marking the end of the whole prompt for a shared-prefix workload.** Every request becomes a unique write. Mark the end of the shared portion instead.
- **Non-deterministic JSON serialization.** `json.dumps` without `sort_keys=True`, or iterating a `set`, produces different bytes per request and silently breaks the prefix match.
- **Swapping tools to implement modes.** Tools are at position zero; any change invalidates everything. Use `tool_choice` or message content for mode steering.
- **Editing the system prompt mid-session.** It sits ahead of the whole history, so every cached turn is reprocessed. Append a `role: "system"` message instead.
- **Ignoring the twenty-block lookback in agent loops.** Long tool-heavy turns silently fall out of the window; add intermediate breakpoints.

## Frequently asked questions

### How many cache breakpoints should a typical agent use?

Most agents are well served by three: one after the frozen tools-plus-system block, one after per-session context like a user profile or retrieved documents, and one on the newest conversation turn. That uses three of the four available breakpoints and maps cleanly onto the stability layers.

### Can I cache tool definitions separately from the system prompt?

They render adjacently — tools first, then system — so a single breakpoint on the last system block caches both together. There is no separate tools-only breakpoint; the placement on the system block is what captures the tool definitions in the same cached prefix.

### What happens to my breakpoints when the conversation gets compacted?

Compaction summarizes earlier history into a compaction block, which changes the prefix from that point forward, so cached entries before the compaction are invalidated. After compaction the new summarized prefix becomes the cacheable content; place your breakpoint on the most recent turn as usual and it warms again on the next request.

### Does using structured outputs interact with caching?

Structured output configuration affects the response format, not the cached prefix of `tools`, `system`, and history. You can cache the prefix and still constrain output with `output_config.format`; the breakpoint placement rules are unchanged.

## Bringing agentic AI to your phone lines

CallSphere applies these layering and breakpoint patterns to keep **voice and chat** agents cheap at scale — multi-agent assistants that answer every call and message, call tools mid-conversation, and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/claude-prompt-caching-patterns-structuring-prompts-and-tools
