Prompt Caching for Voice Agents: The Real 90% Savings in 2026
Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there.
The cost problem
```mermaid
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]
```
Voice agent system prompts are huge. A typical production prompt for a healthcare intake or sales discovery flow runs 8,000 to 22,000 tokens — clinical guardrails, tool schemas, tone rules, escalation paths, FAQ snippets. That prompt re-charges every turn on naive token billing.
A 12-turn call with a 22k-token prompt charges 264k input tokens just for the prompt repetition. At GPT-4o text rates ($2.50/M input) that is $0.66 per call before the model says a word. Caching is no longer optional.
How prompt caching prices it
OpenAI (May 2026):
- Cached input: 90% discount on most models (gpt-4o text $2.50→$0.25 per M; gpt-realtime audio $32→$0.40 per M, an even deeper cut)
- Implicit cache, 5-minute TTL, automatic on repeated prefixes
- No special configuration required for stable prefixes
Anthropic Claude (May 2026):
- Cached input: 90% discount (0.1× standard rate)
- Cache write: 1.25× standard input rate (one-time)
- 5-minute or 1-hour TTL options
- Explicit cache_control markers required
Google Gemini:
- Implicit cache: 25% discount automatically on repeated content
- Explicit context cache: up to 75% discount when you cache via CachedContent
- 1-hour default TTL, configurable
Honest math
Without caching, 12-turn call with 22k system prompt:
- 22k × 12 turns = 264k input tokens
- gpt-4o text: 264k × $2.50 / 1M = $0.66 per call
With OpenAI implicit caching (90% hit rate after turn 1):
- Turn 1: 22k × $2.50 / 1M = $0.055
- Turns 2–12: 22k × 11 × $0.25 / 1M = $0.061
- Total: $0.116 per call (82% savings)
With Anthropic explicit caching:
- Cache write: 22k × $3.75 / 1M = $0.0825 (one-time)
- Cache reads: 22k × 11 × $0.30 / 1M = $0.0726
- Output (constant): ~$0.05
- Total: $0.205 per call (uncached input alone would be 264k × $3.00/M = $0.79, so roughly a 75% saving)
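The arithmetic in this section can be reproduced in a few lines of Python, using the per-million-token rates and token counts quoted above:

```python
# Per-call LLM input cost for a 12-turn call with a 22k-token system prompt.
# Rates are the per-million-token figures quoted in this article.
PROMPT_TOKENS = 22_000
TURNS = 12

def cost(tokens: int, rate_per_m: float) -> float:
    """Dollar cost of `tokens` billed at `rate_per_m` dollars per million."""
    return tokens * rate_per_m / 1_000_000

# Uncached: the full prompt is billed at the standard rate on every turn.
uncached = cost(PROMPT_TOKENS * TURNS, 2.50)

# OpenAI implicit cache: full rate on turn 1, 90%-discounted reads on turns 2-12.
openai_cached = cost(PROMPT_TOKENS, 2.50) + cost(PROMPT_TOKENS * (TURNS - 1), 0.25)

# Anthropic explicit cache: one 1.25x write ($3.75/M), then 0.1x reads ($0.30/M),
# plus the ~$0.05 of output tokens that is constant across scenarios.
anthropic = cost(PROMPT_TOKENS, 3.75) + cost(PROMPT_TOKENS * (TURNS - 1), 0.30) + 0.05

print(f"uncached ${uncached:.3f}  openai ${openai_cached:.4f}  anthropic ${anthropic:.4f}")
```

Swapping in your own prompt size and turn count is usually the fastest way to decide whether explicit caching (with its write premium) beats implicit caching for your traffic shape.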
The pattern: savings are real and big, but the engineering matters. A few rules:
- The cached portion has to be a prefix — anything after a dynamic insert breaks the cache.
- TTL is short (5 min default) — cold-call patterns underperform.
- Cache write costs 25% extra one-time on Anthropic; OpenAI is implicit, no write penalty.
- Tool schemas should be in the prefix portion, not appended later.
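The head/tail split these rules describe can be sketched against Anthropic's Messages API. The cache_control block is Anthropic's documented ephemeral marker; the prompt text, tool schema, and model name here are placeholders, not CallSphere's actual prompts:

```python
# Sketch: build an Anthropic Messages request with the stable prompt head
# cached and the per-call tail left uncached. All strings are illustrative.
STABLE_HEAD = "You are a clinical intake agent. <guardrails, tone rules, FAQs...>"
TOOL_SCHEMAS = [
    {"name": "book_appointment", "description": "Book a patient visit.",
     "input_schema": {"type": "object", "properties": {}}},
]

def build_request(prospect_name: str, lead_score: int) -> dict:
    dynamic_tail = f"Caller: {prospect_name}. Lead score: {lead_score}."
    return {
        "model": "claude-sonnet-4-5",   # placeholder model name
        "max_tokens": 1024,
        # Tools are serialized before the system prompt, so they land in
        # the cached prefix along with the stable head.
        "tools": TOOL_SCHEMAS,
        "system": [
            # Everything up to and including this block is cached (5-min TTL).
            {"type": "text", "text": STABLE_HEAD,
             "cache_control": {"type": "ephemeral"}},
            # Dynamic tail sits after the marker: never cached, and its
            # per-call changes never invalidate the cached prefix.
            {"type": "text", "text": dynamic_tail},
        ],
        "messages": [{"role": "user", "content": "Hi, I'd like to book a visit."}],
    }

req = build_request("Dana", 87)
```

On OpenAI the same split works with no markers at all: keep the head byte-identical across calls and append the tail after it.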
How CallSphere optimizes
CallSphere runs three caching patterns across 6 verticals (37 agents, 90+ tools, 115+ DB tables):
Pattern 1: Healthcare post-call analytics with GPT-4o-mini. A 14k-token clinical analysis prompt runs against every Healthcare call's transcript at end-of-call. We hit a 96% cache hit rate because the prompt prefix is identical across calls and only the transcript varies in the user message (post-prefix). Cost: $0.0024 per analysis vs $0.024 uncached — a 90% savings.
Pattern 2: Sales product live agent with ElevenLabs Sarah voice + GPT-4o-mini brain. The 9k-token sales playbook is split into a static head (8.4k, cached) and a dynamic tail (600 tokens, per-call: prospect name, lead score, last touch). We hit a 91% cache hit rate. Cost: roughly $0.018 per minute LLM-only.
Pattern 3: Healthcare Voice Agent on OpenAI Realtime PCM16 24kHz. An 18k-token clinical prompt with 14 tools. Same split approach — 16.4k stable head, 1.6k dynamic tail. A 91% cache hit rate against the realtime audio cached-input rate ($32 → $0.40 per M, a 98.75% discount). Net effective LLM cost: under $0.05/min on the voice path.
The pricing tiers ($149 / $499 / $1499) bake this caching savings into the margin. Without caching we could not run the 14-day no-card trial without burning cash. Caching is the difference between a sustainable SMB price point and an enterprise-only product.
Optimization checklist
- Split your prompt into stable head + dynamic tail.
- Put tool schemas in the stable head — not appended to the user message.
- Keep dynamic tail under 10% of total prompt size for max cache benefit.
- On Anthropic, set explicit cache_control markers at boundary points.
- On OpenAI, just keep the prefix stable — implicit cache handles it.
- Monitor your hit rate via the prompt_tokens_details field in the API response.
- Pre-warm the cache with a low-cost call at start-of-shift if traffic is bursty.
- Use 1-hour TTL on Anthropic only when calls are frequent enough — 1.25× write cost amortizes.
- Never put PII in cached content — clinical prompts are fine, patient names are not.
- Re-measure quarterly — both vendors keep tweaking the cache discount rate.
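The hit-rate monitoring item above can be sketched against the Chat Completions usage object, where prompt_tokens_details.cached_tokens reports the cached portion. The sample numbers below are invented for illustration:

```python
# Compute cache hit rate from an OpenAI response's usage block.
# Field names follow the Chat Completions usage object; values are made up.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens served from the cache (0.0 if no prompt)."""
    prompt = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

sample_usage = {  # e.g. response.usage from turn 5 of a call
    "prompt_tokens": 23_400,
    "prompt_tokens_details": {"cached_tokens": 21_760},
    "completion_tokens": 180,
}
rate = cache_hit_rate(sample_usage)
print(f"cache hit rate: {rate:.1%}")
```

Logging this per turn and alerting when it drops below your baseline (85–90% for the patterns in this article) catches prefix-busting regressions — a reordered tool schema, a timestamp leaking into the head — before the bill does.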
FAQ
Is OpenAI prompt caching truly automatic? Yes — implicit caching on identical prefixes triggers automatically with 5-minute TTL. No code change required.
Why does Anthropic charge for cache write? The cache state is stored on Anthropic infrastructure; the 1.25× write fee covers that. Reads are 0.1× input.
What is the typical cache hit rate in production? 80–95% for stable prompts in chat agents; 85–96% for voice agents because turns repeat the prefix.
Does caching work with tool calls? Yes — tool schemas are part of the prompt prefix and benefit from the cache.
Can I cache the user message? On Anthropic yes, with cache_control markers; on OpenAI the implicit cache covers whatever prefix stays byte-identical across requests, so a user message benefits only if it sits inside that stable prefix.
Sources
- OpenAI Prompt Caching announcement — https://openai.com/index/api-prompt-caching/
- Anthropic Prompt Caching docs — https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- OpenAI API Pricing — https://openai.com/api/pricing/
- Anthropic API Pricing — https://platform.claude.com/docs/en/about-claude/pricing
- ngrok prompt caching benchmark — https://ngrok.com/blog/prompt-caching
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.