---
title: "Anthropic Prompt Caching for System Prompts: 90% Off (2026)"
description: "Anthropic's prompt cache cuts cached input tokens to 0.1x base price — a 90% discount. We map the 5-min vs 1-hour TTL math, the cache_control placement rules, and the layout that drops CallSphere's Claude bill from ~$4,200/mo to ~$420/mo on a 12k system prompt."
canonical: https://callsphere.ai/blog/vw9g-anthropic-prompt-caching-system-prompts-2026
category: "AI Engineering"
tags: ["Prompt Engineering", "Anthropic", "Prompt Caching", "Cost Optimization", "Claude"]
author: "CallSphere Team"
published: 2026-03-29T00:00:00.000Z
updated: 2026-05-08T17:26:02.565Z
---

# Anthropic Prompt Caching for System Prompts: 90% Off (2026)

> Anthropic's prompt cache cuts cached input tokens to 0.1x base price — a 90% discount. We map the 5-min vs 1-hour TTL math, the cache_control placement rules, and the layout that drops CallSphere's Claude bill from ~$4,200/mo to ~$420/mo on a 12k system prompt.

> **TL;DR** — Anthropic's prompt cache reads back at 0.1x base input price (90% off). Writes are 1.25x (5-min TTL) or 2x (1-hour TTL). For a 12k-token system prompt hit ≥2 times in 5 minutes, caching pays for itself immediately. Place `cache_control` on the *last* static block — everything earlier is cached transitively.

## The technique

The Claude API exposes a `cache_control: {type: "ephemeral"}` marker on any block in `system`, `messages`, or `tools`. Place it on the final block of the *static* prefix:

```json
{
  "system": [
    {"type":"text","text":"",
     "cache_control":{"type":"ephemeral"}}
  ],
  "tools": [ /* tool defs, all cached */ ],
  "messages": [ /* dynamic part */ ]
}
```

Anthropic caches *up to and including* the marked block, following the fixed prefix order `tools` → `system` → `messages` (which is why the tool definitions above are cached even though the marker sits on the `system` block). Up to 4 cache breakpoints per request. Default TTL is 5 minutes (refreshed on each hit); a 1-hour TTL is available as `cache_control: {type:"ephemeral", ttl:"1h"}`.
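Two constraints worth knowing when you mix breakpoints: longer TTLs must appear before shorter ones, and prefixes below the model's minimum cacheable length (1,024 tokens on Sonnet-class models, counted across tools and system together) are silently not cached. A minimal sketch mixing both TTLs, assuming an SDK version that exposes the `ttl` field; the prompt constants are placeholders:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Placeholders: PERSONA changes rarely, DAILY_CONTEXT changes once a day.
const PERSONA = "You are a scheduling assistant for a clinic. ...";
const DAILY_CONTEXT = "Today's open slots: ...";

const res = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  system: [
    // Longer TTLs must precede shorter ones.
    { type: "text", text: PERSONA,
      cache_control: { type: "ephemeral", ttl: "1h" } }, // breakpoint 1 (1-hour)
    { type: "text", text: DAILY_CONTEXT,
      cache_control: { type: "ephemeral" } },            // breakpoint 2 (5-min)
  ],
  messages: [{ role: "user", content: "Can I book Tuesday at 3pm?" }],
});

console.log(res.usage); // cache_creation_input_tokens vs cache_read_input_tokens
```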

## Why it works

Cache reads cost 0.1x base input. For Sonnet 4.6 ($3 / $15 per 1M tokens), cached reads are $0.30/M vs $3/M — 90% off. Writes cost 1.25x ($3.75/M for the 5-min TTL), so a *single* cached read more than covers the write premium; the 1-hour write (2x base, $6/M) needs two reads to break even.
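The break-even arithmetic, spelled out with the Sonnet 4.6 prices above (per million prefix tokens; the first request writes the cache):

```ts
const BASE = 3.0; // $/M input, Sonnet 4.6

// Total prefix cost over n requests: one cache write, then (n - 1) cached reads.
const cached = (n: number, writeMult: number) =>
  writeMult * BASE + (n - 1) * 0.1 * BASE;
const uncached = (n: number) => n * BASE;

console.log(cached(2, 1.25), uncached(2)); // 4.05 vs 6 -> 5-min wins at request 2
console.log(cached(2, 2.0),  uncached(2)); // 6.3  vs 6 -> 1-hour still behind
console.log(cached(3, 2.0),  uncached(3)); // 6.6  vs 9 -> 1-hour wins at request 3
```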

Real-world impact: workloads with stable system prompts (agents, chatbots, doc Q&A) see 70–90% cost reduction. ProjectDiscovery cut its LLM cost 59%. One reported case dropped a $720/mo bill to $72.

```mermaid
flowchart LR
  REQ[Request] -->|first hit| WR[Cache write 1.25x]
  WR --> RESP[Response]
  REQ2[Request 2 within TTL] -->|cache hit| RD[Cache read 0.1x]
  RD --> RESP2[Response 90% cheaper]
  REQ3[Request after TTL] -->|miss| WR2[Re-write]
```

## CallSphere implementation

CallSphere's healthcare voice agent runs a 12,000-token system prompt (14 tools + role + refusal taxonomy + 5-shot examples). At Scale-tier volume (~350k calls/mo), the un-cached prompt cost runs ~$4,200/mo on Sonnet 4.6; with ephemeral caching it drops to ~$420/mo, a ~$3,800/mo saving on one agent. Across **37 agents** in **6 verticals**, the same layout saves more than $25k/mo.

We pin tool definitions and the agent persona before the cache breakpoint; conversation history flows after it, uncached. Aria, the OneRoof triage agent, caches its 800-token routing prompt the same way: small prompts still pay off when QPS is high (the combined tools + system prefix just needs to clear Anthropic's minimum cacheable length). Plans are **Starter $149**, **Growth $499**, and **Scale $1,499**, each with a **14-day trial** and a **22% affiliate** program; see [the pricing page](https://callsphere.ai/pricing).

## Build steps with prompt code

```ts
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const res = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    { type: "text", text: ROLE_AND_TAXONOMY },                        // 1.5k
    { type: "text", text: TOOL_DESCRIPTIONS },                        // 4k
    { type: "text", text: FEW_SHOT_EXAMPLES,
      cache_control: { type: "ephemeral", ttl: "1h" } },              // 6.5k cached
  ],
  tools: TOOLS_ARRAY,
  messages: [
    { role: "user", content: turn },                                  // dynamic, uncached
  ],
});
```

## FAQ

**Q: How do I check if the cache hit?**
Response includes `usage.cache_read_input_tokens` (hits) and `usage.cache_creation_input_tokens` (writes). Log both.
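A quick logging sketch over the Messages API usage object (`res` is any `messages.create()` response, like the one in the build steps above):

```ts
const u = res.usage;
console.log({
  cache_write: u.cache_creation_input_tokens ?? 0, // billed at 1.25x or 2x
  cache_read:  u.cache_read_input_tokens ?? 0,     // billed at 0.1x
  fresh_input: u.input_tokens,                     // uncached tokens, billed at 1x
});
// Healthy steady state: cache_read stays near your prefix size, cache_write near 0.
```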

**Q: Does caching change output quality?**
No — same model, same weights. Only the prefix is reused.

**Q: 5-min or 1-hour TTL?**
5-min for chatty interactive workloads (each hit refreshes the TTL at no extra write cost). 1-hour for batch evals or low-QPS agents whose gaps between requests would let a 5-minute entry expire.

**Q: What invalidates the cache?**
Any byte-level change to the cached prefix. A new tool, a date in the system prompt — anything. Keep the prefix static; put dynamic data in messages.
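For example, injecting a timestamp into the system prompt regenerates the prefix on every request. A sketch of the fix, with placeholder prompt constants:

```ts
const STATIC_PROMPT = "You are a scheduling assistant. ..."; // placeholder
const turn = "Can I book Tuesday at 3pm?";                   // placeholder

// Cache-busting: the timestamp makes the prefix unique on every request.
const badSystem = [{
  type: "text" as const,
  text: `${STATIC_PROMPT}\nCurrent time: ${new Date().toISOString()}`,
  cache_control: { type: "ephemeral" as const },
}];

// Cache-friendly: the prefix stays byte-identical; the timestamp rides in messages.
const goodSystem = [{
  type: "text" as const,
  text: STATIC_PROMPT,
  cache_control: { type: "ephemeral" as const },
}];
const messages = [{
  role: "user" as const,
  content: `Current time: ${new Date().toISOString()}\n\n${turn}`,
}];
```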

## Sources

- [Anthropic Prompt Caching Docs](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Anthropic 2026 Pricing Guide — Finout](https://www.finout.io/blog/anthropic-api-pricing)
- [Prompt Caching 2026 Cost & Latency — AI Checker Hub](https://aicheckerhub.com/anthropic-prompt-caching-2026-cost-latency-guide)
- [How We Cut LLM Costs 59% with Caching — ProjectDiscovery](https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching)
- [Prompt Caching Saved $648/mo — Du'An Lightfoot](https://medium.com/@labeveryday/prompt-caching-is-a-must-how-i-went-from-spending-720-to-72-monthly-on-api-costs-3086f3635d63)

## Prompt caching in production

Prompt caching sounds like a single decision, but in production it splits into three concerns: eval design, prompt cost, and observability. The deeper you push toward live traffic, the more they pull against each other: better evals catch silent failures, prompt cost limits how often you can re-run them, and weak observability hides which retries are actually saving conversations versus burning latency budget.

## Shipping the agent to production

Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs **37 agents** across 6 verticals, each with its own eval suite — synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop.
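A toy version of that nightly assertion pass; `runAgentExtraction` and the entity shape are hypothetical stand-ins for CallSphere internals:

```ts
type Entities = { date?: string; time?: string; partySize?: number };

// Hypothetical: replays a transcript through the agent, returns extracted entities.
declare function runAgentExtraction(transcript: string): Promise<Entities>;

async function checkTranscript(transcript: string, expected: Entities) {
  const got = await runAgentExtraction(transcript);
  const failures = (Object.keys(expected) as (keyof Entities)[])
    .filter((k) => got[k] !== expected[k])
    .map((k) => `${k}: want ${expected[k]}, got ${got[k]}`);
  return { pass: failures.length === 0, failures };
}
```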

Structured tools beat free-form text every time. Our **90+ function tools** all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine — booking → confirmation → SMS — so context survives turn boundaries.
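A compressed sketch of that validate-retry-fallback path; the zod schema, `retryWithHint`, and the deterministic fallback are illustrative, not CallSphere's actual code:

```ts
import { z } from "zod";

const BookSlot = z.object({ patientId: z.string(), slotIso: z.string() });

// Hypothetical hooks into the agent loop.
declare function retryWithHint(hint: string): Promise<unknown>; // re-asks the model
declare function bookDeterministically(): Promise<void>;        // non-LLM fallback

async function handleBookSlot(args: unknown) {
  let parsed = BookSlot.safeParse(args);
  if (!parsed.success) {
    // One corrective retry: feed the validation error back to the model.
    const hint = `Tool input failed validation: ${parsed.error.message}. Re-emit valid JSON.`;
    parsed = BookSlot.safeParse(await retryWithHint(hint));
  }
  if (!parsed.success) return bookDeterministically();
  return parsed.data; // safe to hand to the real booking integration
}
```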

The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in **115+ database tables** spanning all 6 verticals.

## Production FAQ

**What's the right way to scope the proof-of-concept?**
CallSphere runs 37 production agents and 90+ function tools across 115+ database tables in 6 verticals, so most workflows you'd want already have a template. For prompt caching specifically, that means you're not starting from scratch; you're configuring an agent template that's already been hardened across thousands of conversations.

**What does the pilot timeline look like?**
Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.

**Does a managed platform keep scaling, or do we eventually need to run our own models?**
The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer.

## Talk to us

Want to see how this maps to your stack? Book a live walkthrough at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting), or try the vertical-specific demo at [healthcare.callsphere.tech](https://healthcare.callsphere.tech). 14-day trial, no credit card, pilot live in 3–5 business days.

