
Prompt Caching Pricing 2026: Anthropic, OpenAI, Google, and the Savings Math

Prompt caching pricing varies widely across providers in 2026. Here are the numbers, the savings math, and how to architect for cache hits.

What Prompt Caching Is

Modern LLM providers cache the prefix tokens of your prompts. When you submit a prompt that shares a long prefix with a recent prompt, the cached prefix is much cheaper to process. For agentic systems with stable system prompts and tool definitions, this is the single largest cost lever in 2026.

This piece walks through what each major provider charges, the savings math, and how to architect for high cache hit rates.

How Each Provider Charges

flowchart TB
    Anthropic[Anthropic] --> A1[Cache write: 1.25x base]
    Anthropic --> A2[Cache read: 0.1x base]
    Anthropic --> A3[5-min default TTL, 1-hr extended]
    OAI[OpenAI] --> O1[Cache hit: 0.5x base, automatic]
    OAI --> O2[No write surcharge]
    OAI --> O3[~5-10 min TTL, no extended]
    Goo[Google] --> G1[Implicit cache: 0.25x base]
    Goo --> G2[Explicit cache: ~0.1x base]
    Goo --> G3[Configurable TTL, paid by storage]

Three different models:

  • Anthropic: explicit caching with a small write surcharge and a large read discount. 5-minute default TTL, 1-hour extended TTL (priced higher to write).
  • OpenAI: automatic caching for prompts above a threshold (typically 1024 tokens). Cache hit is roughly half-price; no write surcharge.
  • Google Gemini: both implicit (automatic) and explicit (developer-managed) caching. Implicit is automatic and cheap; explicit has paid storage and configurable TTL.

The pricing details and exact discounts shift; the structural differences are stable.
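
On Anthropic, you opt in by marking the end of the stable prefix with a cache_control block. A minimal sketch using the anthropic Python SDK (the model name and prompt contents are illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support agent for ..."  # imagine thousands of stable tokens

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# Usage reports cache writes and reads separately from fresh input tokens.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)

The first request pays the 1.25x write surcharge on the marked prefix; every request within the TTL reads it back at 0.1x. OpenAI needs no equivalent marker, since its caching is automatic.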

Savings Math

For a typical agent with a 6K-token system prompt, 3K-token tool definitions, 1K tokens of retrieved context, and a 500-token user message (a ~10.5K-token prompt in total) plus a 2K-token output:

Without caching, every request pays for 10K input + 2K output. With caching after the first request and assuming the system prompt and tool definitions are reused:

  • Anthropic with caching: ~9K cached input at ~0.1x base + 1.5K fresh input + 2K output
  • OpenAI with auto-caching: ~9K cached input at 0.5x base + 1.5K fresh + 2K output
  • Google with explicit caching: ~9K cached at ~0.1x base (plus storage) + 1.5K fresh + 2K output

Net cost reduction on input tokens is roughly 40-80 percent for repeated prompts, depending on the provider's read discount. Output tokens are not cached.
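
A quick back-of-envelope check of those numbers, using the read multipliers from the diagram above (assumptions for illustration, not published price quotes):

CACHED, FRESH = 9_000, 1_500  # stable prefix vs per-request input tokens

def input_cost(read_multiplier: float) -> float:
    """Relative input cost with caching, in units of base-priced tokens."""
    return CACHED * read_multiplier + FRESH

baseline = CACHED + FRESH  # no caching: every input token at base price
for name, mult in [("Anthropic (0.1x read)", 0.10),
                   ("OpenAI (0.5x hit)", 0.50),
                   ("Google explicit (~0.1x read)", 0.10)]:
    print(f"{name}: {1 - input_cost(mult) / baseline:.0%} input savings")
# -> ~77% for Anthropic and Google explicit, ~43% for OpenAI

Google's explicit cache also bills for storage, which eats into the read discount for low-traffic prefixes.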


What Triggers a Cache Miss

A few things invalidate the cache:

  • Any change in the cached prefix (tool definitions changed, system prompt edited)
  • Cache TTL expires (5 min default for Anthropic; 5-10 min for OpenAI; configurable for Google)
  • Cache eviction (rare for active prefixes)
  • Different model version

Cache management is the hidden discipline: a casual one-line edit to a shared system prompt wipes the cache for every request that uses it.
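
One cheap defense is to fingerprint the cacheable prefix and alert (or block the deploy) when it changes unexpectedly. A minimal sketch; the helper here is hypothetical:

import hashlib
import json

def prefix_fingerprint(system_prompt: str, tool_defs: list[dict]) -> str:
    """Stable hash of everything that must stay byte-identical between requests."""
    blob = json.dumps({"system": system_prompt, "tools": tool_defs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# On deploy: compare against the last known fingerprint before shipping.
# Even a one-character edit produces a new hash -- and a cold cache.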

Architecting for Hit Rates

flowchart LR
    Stable[Stable content first:<br/>system prompt, tool defs, big reference docs] --> Cached[Cached]
    Var[Variable content last:<br/>user message, retrieved snippet] --> Fresh[Fresh]

The pattern, with a sketch after this list:

  • Put the stable, reusable content at the start of the prompt
  • Put the request-specific content at the end
  • Avoid changing the cacheable prefix between requests
  • Use explicit cache control where the API supports it (Anthropic's cache_control, Google's CachedContent)
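
A minimal assembly sketch following those rules; the function and field names are illustrative rather than any specific SDK's:

def build_request(system_prompt: str, tool_defs: list[dict],
                  reference_docs: str, retrieved_snippets: str,
                  user_message: str) -> dict:
    # Stable prefix: byte-identical on every request, so it stays cached.
    stable_system = f"{system_prompt}\n\n{reference_docs}"
    # Variable tail: only this part is billed at the fresh-token rate.
    user_content = f"{retrieved_snippets}\n\n{user_message}"
    return {
        "system": stable_system,  # stable, first: cacheable
        "tools": tool_defs,       # stable: cacheable
        "messages": [{"role": "user", "content": user_content}],  # variable, last
    }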

When Caching Doesn't Help

  • Long-tail one-off prompts where prefixes do not repeat
  • Highly varied system prompts (per-user customization that breaks reuse)
  • Cold-start workloads where TTL expires before reuse
  • Outputs that are streamed and depend on heavy variable context

For agent platforms with stable system prompts and tool definitions, caching helps a lot. For one-off creative generation, less so.

Cross-Provider Strategy

Multi-provider deployments need to think about caching across providers (a toy routing sketch follows this list):

  • Each provider has its own cache; switching providers means a fresh cache cold start
  • Some workloads make more sense pinned to one provider for caching benefits
  • Routing decisions should consider cache locality
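
A toy sketch of cache-locality-aware routing; the bookkeeping is hypothetical and ignores everything else a real router weighs (latency, quality, quotas):

import time

# (provider, prefix_hash) -> timestamp until which the cache is assumed warm
cache_warm_until: dict[tuple[str, str], float] = {}

def pick_provider(prefix_hash: str, candidates: list[str]) -> str:
    now = time.time()
    for provider in candidates:
        if cache_warm_until.get((provider, prefix_hash), 0.0) > now:
            return provider  # warm cache: discounted input tokens
    return candidates[0]     # all cold: fall back to the default

def record_request(provider: str, prefix_hash: str, ttl_seconds: float = 300.0):
    # Each request refreshes the TTL window (a 5-minute TTL assumed here).
    cache_warm_until[(provider, prefix_hash)] = time.time() + ttl_seconds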

Real Numbers from CallSphere

For our healthcare voice agent on Anthropic with extensive caching:

  • System prompt + tool definitions: ~5K tokens, cached
  • Per-call retrieved patient context: ~1K tokens, fresh
  • User turn: ~50-200 tokens, fresh
  • Cache hit rate after warmup: ~92%
  • Net cost reduction vs no-cache baseline: ~73%

These numbers are typical for production agent workloads with stable prompts.

What's Coming

  • Cross-region cache sharing (some providers experimenting)
  • Cross-model cache where models share architecture
  • More aggressive automatic caching that obviates the need for explicit control
  • Cache for chains (caching at the multi-call level, not just per-call)

Practical Guidance

For any production agent in 2026:

  • Enable caching on every provider where available
  • Restructure prompts to maximize stable prefix
  • Audit your prompts for unnecessary variation in cacheable sections
  • Track cache hit rate as a first-class operational metric (see the sketch after this list)
  • Consider extended TTL for high-traffic stable prefixes if your provider offers it
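
For the hit-rate metric, Anthropic's Messages API reports cache reads and writes in the usage object alongside fresh input tokens (other providers expose similar fields, e.g. OpenAI's cached_tokens). A minimal aggregation sketch:

def cache_hit_rate(usages: list) -> float:
    """Fraction of input tokens served from cache across a batch of responses."""
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    written = sum(getattr(u, "cache_creation_input_tokens", 0) or 0 for u in usages)
    fresh = sum(u.input_tokens for u in usages)  # uncached input tokens
    total = read + written + fresh
    return read / total if total else 0.0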

