
Prompt Caching Pricing 2026: Anthropic, OpenAI, Google, and the Savings Math

Prompt caching pricing varies widely across providers in 2026. Here are the numbers, the savings math, and how to architect for cache hits.

What Prompt Caching Is

Modern LLM providers cache the prefix tokens of your prompts. When you submit a prompt that shares a long prefix with a recent prompt, the cached prefix is much cheaper to process. For agentic systems with stable system prompts and tool definitions, this is the single largest cost lever in 2026.

This piece walks through what each major provider charges, the savings math, and how to architect for high cache hit rates.

How Each Provider Charges

flowchart TB
    Anthropic[Anthropic] --> A1[Cache write: 1.25x base]
    Anthropic --> A2[Cache read: 0.1x base]
    Anthropic --> A3[5-min default TTL, 1-hr extended]
    OAI[OpenAI] --> O1[Cache hit: 0.5x base, automatic]
    OAI --> O2[No write surcharge]
    OAI --> O3[~5-10 min TTL, no extended]
    Goo[Google] --> G1[Implicit cache: 0.25x base]
    Goo --> G2[Explicit cache: ~0.1x base]
    Goo --> G3[Configurable TTL, paid by storage]

Three different models:

  • Anthropic: explicit caching with a small write surcharge and a large read discount. 5-minute default TTL, 1-hour extended TTL (priced higher to write).
  • OpenAI: automatic caching for prompts above a threshold (typically 1024 tokens). Cache hit is roughly half-price; no write surcharge.
  • Google Gemini: both implicit (automatic) and explicit (developer-managed) caching. Implicit is automatic and cheap; explicit has paid storage and configurable TTL.

The pricing details and exact discounts shift; the structural differences are stable.
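
On Anthropic, you opt in by marking the end of the stable prefix with a cache_control block. A minimal sketch using the anthropic Python SDK (the model name and prompt contents are illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support agent for ..."  # imagine thousands of stable tokens

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# Usage reports cache writes and reads separately from fresh input tokens.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)

The first request pays the 1.25x write surcharge on the marked prefix; every request within the TTL reads it back at 0.1x. OpenAI needs no equivalent marker, since its caching is automatic.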

Savings Math

For a typical agent with a 6K-token system prompt, 3K-token tool definitions, 1K tokens of retrieved context, and a 500-token user message (a ~10.5K-token prompt in total) plus a 2K-token output:

Without caching, every request pays for 10K input + 2K output. With caching after the first request and assuming the system prompt and tool definitions are reused:

  • Anthropic with caching: ~9K cached input at ~0.1x base + 1.5K fresh input + 2K output
  • OpenAI with auto-caching: ~9K cached input at 0.5x base + 1.5K fresh + 2K output
  • Google with explicit caching: ~9K cached at ~0.1x base (plus storage) + 1.5K fresh + 2K output

Net cost reduction on input tokens is roughly 40-80 percent for repeated prompts, depending on the provider's read discount. Output tokens are not cached.
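
A quick back-of-envelope check of those numbers, using the read multipliers from the diagram above (assumptions for illustration, not published price quotes):

CACHED, FRESH = 9_000, 1_500  # stable prefix vs per-request input tokens

def input_cost(read_multiplier: float) -> float:
    """Relative input cost with caching, in units of base-priced tokens."""
    return CACHED * read_multiplier + FRESH

baseline = CACHED + FRESH  # no caching: every input token at base price
for name, mult in [("Anthropic (0.1x read)", 0.10),
                   ("OpenAI (0.5x hit)", 0.50),
                   ("Google explicit (~0.1x read)", 0.10)]:
    print(f"{name}: {1 - input_cost(mult) / baseline:.0%} input savings")
# -> ~77% for Anthropic and Google explicit, ~43% for OpenAI

Google's explicit cache also bills for storage, which eats into the read discount for low-traffic prefixes.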


What Triggers a Cache Miss

A few things invalidate the cache:

  • Any change in the cached prefix (tool definitions changed, system prompt edited)
  • Cache TTL expires (5 min default for Anthropic; 5-10 min for OpenAI; configurable for Google)
  • Cache eviction (rare for active prefixes)
  • Different model version

Cache management is the hidden discipline: a casual one-line edit to a shared system prompt wipes the cache for every request that uses it.
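
One cheap defense is to fingerprint the cacheable prefix and alert (or block the deploy) when it changes unexpectedly. A minimal sketch; the helper here is hypothetical:

import hashlib
import json

def prefix_fingerprint(system_prompt: str, tool_defs: list[dict]) -> str:
    """Stable hash of everything that must stay byte-identical between requests."""
    blob = json.dumps({"system": system_prompt, "tools": tool_defs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# On deploy: compare against the last known fingerprint before shipping.
# Even a one-character edit produces a new hash -- and a cold cache.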

Architecting for Hit Rates

flowchart LR
    Stable[Stable content first:<br/>system prompt, tool defs, big reference docs] --> Cached[Cached]
    Var[Variable content last:<br/>user message, retrieved snippet] --> Fresh[Fresh]

The pattern, with a sketch after this list:

  • Put the stable, reusable content at the start of the prompt
  • Put the request-specific content at the end
  • Avoid changing the cacheable prefix between requests
  • Use explicit cache control where the API supports it (Anthropic's cache_control, Google's CachedContent)
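
A minimal assembly sketch following those rules; the function and field names are illustrative rather than any specific SDK's:

def build_request(system_prompt: str, tool_defs: list[dict],
                  reference_docs: str, retrieved_snippets: str,
                  user_message: str) -> dict:
    # Stable prefix: byte-identical on every request, so it stays cached.
    stable_system = f"{system_prompt}\n\n{reference_docs}"
    # Variable tail: only this part is billed at the fresh-token rate.
    user_content = f"{retrieved_snippets}\n\n{user_message}"
    return {
        "system": stable_system,  # stable, first: cacheable
        "tools": tool_defs,       # stable: cacheable
        "messages": [{"role": "user", "content": user_content}],  # variable, last
    }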

When Caching Doesn't Help

  • Long-tail one-off prompts where prefixes do not repeat
  • Highly varied system prompts (per-user customization that breaks reuse)
  • Cold-start workloads where TTL expires before reuse
  • Outputs that are streamed and depend on heavy variable context

For agent platforms with stable system prompts and tool definitions, caching helps a lot. For one-off creative generation, less so.

Cross-Provider Strategy

Multi-provider deployments need to think about caching across providers (a toy routing sketch follows this list):

  • Each provider has its own cache; switching providers means a fresh cache cold start
  • Some workloads make more sense pinned to one provider for caching benefits
  • Routing decisions should consider cache locality
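
A toy sketch of cache-locality-aware routing; the bookkeeping is hypothetical and ignores everything else a real router weighs (latency, quality, quotas):

import time

# (provider, prefix_hash) -> timestamp until which the cache is assumed warm
cache_warm_until: dict[tuple[str, str], float] = {}

def pick_provider(prefix_hash: str, candidates: list[str]) -> str:
    now = time.time()
    for provider in candidates:
        if cache_warm_until.get((provider, prefix_hash), 0.0) > now:
            return provider  # warm cache: discounted input tokens
    return candidates[0]     # all cold: fall back to the default

def record_request(provider: str, prefix_hash: str, ttl_seconds: float = 300.0):
    # Each request refreshes the TTL window (a 5-minute TTL assumed here).
    cache_warm_until[(provider, prefix_hash)] = time.time() + ttl_seconds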

Real Numbers from CallSphere

For our healthcare voice agent on Anthropic with extensive caching:

  • System prompt + tool definitions: ~5K tokens, cached
  • Per-call retrieved patient context: ~1K tokens, fresh
  • User turn: ~50-200 tokens, fresh
  • Cache hit rate after warmup: ~92%
  • Net cost reduction vs no-cache baseline: ~73%

These numbers are typical for production agent workloads with stable prompts.

What's Coming

  • Cross-region cache sharing (some providers experimenting)
  • Cross-model cache where models share architecture
  • More aggressive automatic caching that obviates the need for explicit control
  • Cache for chains (caching at the multi-call level, not just per-call)

Practical Guidance

For any production agent in 2026:

  • Enable caching on every provider where available
  • Restructure prompts to maximize stable prefix
  • Audit your prompts for unnecessary variation in cacheable sections
  • Track cache hit rate as a first-class operational metric (see the sketch after this list)
  • Consider extended TTL for high-traffic stable prefixes if your provider offers it
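
For the hit-rate metric, Anthropic's Messages API reports cache reads and writes in the usage object alongside fresh input tokens (other providers expose similar fields, e.g. OpenAI's cached_tokens). A minimal aggregation sketch:

def cache_hit_rate(usages: list) -> float:
    """Fraction of input tokens served from cache across a batch of responses."""
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    written = sum(getattr(u, "cache_creation_input_tokens", 0) or 0 for u in usages)
    fresh = sum(u.input_tokens for u in usages)  # uncached input tokens
    total = read + written + fresh
    return read / total if total else 0.0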

