AI Engineering

Prompt Caching for Voice Agents: The Real 90% Savings in 2026

Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there.


The cost problem

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

*CallSphere reference architecture*

Voice agent system prompts are huge. A typical production prompt for a healthcare intake or sales discovery flow runs 8,000 to 22,000 tokens — clinical guardrails, tool schemas, tone rules, escalation paths, FAQ snippets. That prompt re-charges every turn on naive token billing.

A 12-turn call with a 22k-token prompt charges 264k input tokens just for the prompt repetition. At GPT-4o text rates ($2.50/M input) that is $0.66 per call before the model says a word. Caching is no longer optional.
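That repetition cost is easy to model. A minimal sketch (the function name is illustrative, not from any SDK):

```python
def prompt_cost(prompt_tokens: int, turns: int, rate_per_m: float) -> float:
    """Input-token cost of re-sending the same prompt on every turn
    under naive (uncached) billing, at rate_per_m dollars per million."""
    return prompt_tokens * turns * rate_per_m / 1_000_000

# 22k-token prompt, 12 turns, GPT-4o text input at $2.50/M -> $0.66
cost = prompt_cost(22_000, 12, 2.50)
```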

How prompt caching prices it

OpenAI (May 2026):

  • Cached input: 90% discount on most text models (gpt-4o $2.50→$0.25 per M); realtime audio is discounted even more deeply (gpt-realtime $32→$0.40 per M)
  • Implicit cache, 5-minute TTL, automatic on repeated prefixes
  • No special configuration required for stable prefixes

Anthropic Claude (May 2026):

  • Cached input: 90% discount (0.1× standard rate)
  • Cache write: 1.25× standard input rate (one-time)
  • 5-minute or 1-hour TTL options
  • Explicit cache_control markers required

Google Gemini:

  • Implicit cache: 25% discount automatically on repeated content
  • Explicit context cache: up to 75% discount when you cache via CachedContent
  • 1-hour default TTL, configurable

Honest math

Without caching, 12-turn call with 22k system prompt:

  • 22k × 12 turns = 264k input tokens
  • gpt-4o text: 264k × $2.50 / 1M = $0.66 per call

With OpenAI implicit caching (assuming every turn after the first hits the cache):

  • Turn 1: 22k × $2.50 / 1M = $0.055
  • Turns 2–12: 22k × 11 × $0.25 / 1M = $0.061
  • Total: $0.116 per call (82% savings)

With Anthropic explicit caching:

  • Cache write: 22k × $3.75 / 1M = $0.0825 (one-time)
  • Cache reads: 22k × 11 × $0.30 / 1M = $0.0726
  • Output (constant): ~$0.05
  • Total: $0.205 per call (roughly 76% below the ~$0.84 this call would cost uncached at Claude's $3/M input rate)
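The arithmetic above, as a reusable sketch. Function names are mine; both cover the prompt side only (output tokens excluded), and the Anthropic version assumes the 1.25× write / 0.1× read multipliers from the pricing section:

```python
def openai_cached_cost(tokens: int, turns: int,
                       rate: float, cached_rate: float) -> float:
    """Turn 1 pays the full input rate (cache miss); every later turn
    reads the prefix at the cached rate."""
    first = tokens * rate / 1_000_000
    rest = tokens * (turns - 1) * cached_rate / 1_000_000
    return first + rest

def anthropic_cached_cost(tokens: int, turns: int, base_rate: float) -> float:
    """One cache write at 1.25x the input rate, then reads at 0.1x."""
    write = tokens * base_rate * 1.25 / 1_000_000
    reads = tokens * (turns - 1) * base_rate * 0.10 / 1_000_000
    return write + reads

# 22k prompt, 12 turns: $0.1155 on gpt-4o, $0.1551 prompt-side on Claude
openai = openai_cached_cost(22_000, 12, 2.50, 0.25)
anthropic = anthropic_cached_cost(22_000, 12, 3.00)
```

Adding the ~$0.05 of output tokens to the Anthropic figure gives the $0.205 total above.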

The pattern: savings are real and big, but the engineering matters. A few rules:

  1. The cached portion has to be a prefix: caching stops at the first dynamic token, so anything after a dynamic insert is billed at full rate.
  2. TTL is short (5 min default) — cold-call patterns underperform.
  3. Cache write costs 25% extra one-time on Anthropic; OpenAI is implicit, no write penalty.
  4. Tool schemas should be in the prefix portion, not appended later.
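Rules 1 and 4 in practice: a minimal sketch of assembling an OpenAI-style messages array where everything static sits at the front so the prefix byte-matches across calls. The names and placeholder strings here are illustrative, not CallSphere's actual prompt:

```python
# Static head: playbook, guardrails, tool rules. Must be byte-identical
# across calls or the implicit cache never matches.
STATIC_HEAD = "<8.4k tokens of playbook, guardrails, and tool rules>"

def build_messages(prospect_name: str, lead_score: int,
                   transcript_turns: list[dict]) -> list[dict]:
    """Stable head first, per-call facts appended after it, so only the
    tail of the system prompt varies between calls."""
    dynamic_tail = f"Prospect: {prospect_name}\nLead score: {lead_score}"
    return [
        {"role": "system", "content": STATIC_HEAD + "\n\n" + dynamic_tail},
        *transcript_turns,
    ]

msgs = build_messages("Ada", 82, [{"role": "user", "content": "hi"}])
```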

How CallSphere optimizes

CallSphere runs three caching patterns across 6 verticals (37 agents, 90+ tools, 115+ DB tables):

Pattern 1: Healthcare post-call analytics with GPT-4o-mini. A 14k-token clinical analysis prompt runs against every Healthcare call's transcript at end-of-call. We hit 96% cache rate because the prompt prefix is identical across calls and only the transcript varies in the user message (post-prefix). Cost: $0.0024 per analysis vs $0.024 uncached — a 90% savings.

Pattern 2: Sales product live agent with ElevenLabs Sarah voice + GPT-4o-mini brain. The 9k-token sales playbook is split into a static head (8.4k, cached) and a dynamic tail (600 tokens, per-call: prospect name, lead score, last touch). We hit 91% cache rate. Cost: roughly $0.018 per minute LLM-only.


Pattern 3: Healthcare Voice Agent on OpenAI Realtime PCM16 24kHz. An 18k-token clinical prompt with 14 tools. Same split approach — 16.4k stable head, 1.6k dynamic tail. 91% cache hit on the realtime audio cache rate ($32 → $0.40 per M, 98.75% off). Net effective LLM cost: under $0.05/min on the voice path.

The pricing tiers ($149 / $499 / $1499) bake this caching savings into the margin. Without caching we could not run the 14-day no-card trial without burning cash. Caching is the difference between a sustainable SMB price point and an enterprise-only product.

Optimization checklist

  1. Split your prompt into stable head + dynamic tail.
  2. Put tool schemas in the stable head — not appended to the user message.
  3. Keep dynamic tail under 10% of total prompt size for max cache benefit.
  4. On Anthropic, set explicit cache_control markers at boundary points.
  5. On OpenAI, just keep the prefix stable — implicit cache handles it.
  6. Monitor your hit rate via the API response prompt_tokens_details field.
  7. Pre-warm cache with a low-cost call at start-of-shift if traffic is bursty.
  8. Use 1-hour TTL on Anthropic only when calls are frequent enough — 1.25× write cost amortizes.
  9. Never put PII in cached content — clinical prompts are fine, patient names are not.
  10. Re-measure quarterly — both vendors keep tweaking the cache discount rate.
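For item 4, Anthropic's Messages API takes explicit cache_control markers on system content blocks. A minimal sketch (the helper name is mine; you would pass the result as the system= parameter of client.messages.create):

```python
def cached_system_blocks(stable_head: str, dynamic_tail: str) -> list[dict]:
    """System content blocks for the Anthropic Messages API: the stable
    head carries a cache_control marker (the cache boundary); the
    per-call tail after it is never cached."""
    return [
        {
            "type": "text",
            "text": stable_head,
            "cache_control": {"type": "ephemeral"},  # cached up to here
        },
        {"type": "text", "text": dynamic_tail},      # per-call, uncached
    ]

blocks = cached_system_blocks(
    "<16.4k tokens of clinical guardrails and tool rules>",
    "Caller context: returning patient, Spanish preferred",
)
```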

FAQ

Is OpenAI prompt caching truly automatic? Yes: implicit caching on identical prefixes triggers automatically with a 5-minute TTL once the prompt exceeds the 1,024-token minimum prefix length. No code change required.

Why does Anthropic charge for cache write? The cache state is stored on Anthropic infrastructure; the 1.25× write fee covers that. Reads are 0.1× input.

What is the typical cache hit rate in production? 80–95% for stable prompts in chat agents; 85–96% for voice agents because turns repeat the prefix.
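You can measure that hit rate yourself (checklist item 6). A small helper (the name is mine) over the usage payload an OpenAI response carries, where prompt_tokens_details.cached_tokens counts cache-served prompt tokens:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, from an OpenAI-style
    usage payload (usage.prompt_tokens_details.cached_tokens)."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# e.g. 20k of a 22k prompt served from cache -> ~0.91 hit rate
ratio = cache_hit_ratio({
    "prompt_tokens": 22_000,
    "prompt_tokens_details": {"cached_tokens": 20_000},
})
```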

Does caching work with tool calls? Yes — tool schemas are part of the prompt prefix and benefit from the cache.

Can I cache the user message? On Anthropic, yes (place cache_control markers on message content blocks). On OpenAI, the implicit cache covers the entire repeated prefix of the request, so earlier conversation turns can be cached too, as long as nothing before them changes.
