By Sagar Shankaran, Founder of CallSphere
Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there.
Key takeaways
Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there.
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]Voice agent system prompts are huge. A typical production prompt for a healthcare intake or sales discovery flow runs 8,000 to 22,000 tokens — clinical guardrails, tool schemas, tone rules, escalation paths, FAQ snippets. That prompt re-charges every turn on naive token billing.
A 12-turn call with a 22k-token prompt charges 264k input tokens just for the prompt repetition. At GPT-4o text rates ($2.50/M input) that is $0.66 per call before the model says a word. Caching is no longer optional.
OpenAI (May 2026):
Anthropic Claude (May 2026):
cache_control markers requiredGoogle Gemini:
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
CachedContentWithout caching, 12-turn call with 22k system prompt:
With OpenAI implicit caching (90% hit rate after turn 1):
With Anthropic explicit caching:
The pattern: savings are real and big, but the engineering matters. A few rules:
CallSphere runs three caching patterns across 6 verticals (37 agents, 90+ tools, 115+ DB tables):
Pattern 1: Healthcare post-call analytics with GPT-4o-mini. A 14k-token clinical analysis prompt runs against every Healthcare call's transcript at end-of-call. We hit 96% cache rate because the prompt prefix is identical across calls and only the transcript varies in the user message (post-prefix). Cost: $0.0024 per analysis vs $0.024 uncached — a 90% savings.
Pattern 2: Sales product live agent with ElevenLabs Sarah voice + GPT-4o-mini brain. The 9k-token sales playbook is split into a static head (8.4k, cached) and a dynamic tail (600 tokens, per-call: prospect name, lead score, last touch). We hit 91% cache rate. Cost: roughly $0.018 per minute LLM-only.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Pattern 3: Healthcare Voice Agent on OpenAI Realtime PCM16 24kHz. An 18k-token clinical prompt with 14 tools. Same split approach — 16.4k stable head, 1.6k dynamic tail. 91% cache hit on the realtime audio cache rate ($32 → $0.40 per M, 98.75% off). Net effective LLM cost: under $0.05/min on the voice path.
The pricing tiers ($149 / $499 / $1499) bake this caching savings into the margin. Without caching we could not run the 14-day no-card trial without burning cash. Caching is the difference between a sustainable SMB price point and an enterprise-only product.
cache_control markers at boundary points.prompt_tokens_details field.Is OpenAI prompt caching truly automatic? Yes — implicit caching on identical prefixes triggers automatically with 5-minute TTL. No code change required.
Why does Anthropic charge for cache write? The cache state is stored on Anthropic infrastructure; the 1.25× write fee covers that. Reads are 0.1× input.
What is the typical cache hit rate in production? 80–95% for stable prompts in chat agents; 85–96% for voice agents because turns repeat the prefix.
Does caching work with tool calls? Yes — tool schemas are part of the prompt prefix and benefit from the cache.
Can I cache the user message? On Anthropic yes (with markers); on OpenAI not directly, only the system prompt portion benefits from implicit cache.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
A founder's guide to texto a voz (text-to-speech in Spanish): LATAM vs Castilian voices, free options, and how CallSphere ships Spanish agents.
A founder's guide to the female voice generator landscape: AI female voices, Japanese voices, robot voices, and how CallSphere ships 57+ voices live.
A founder's guide to the Siri voice generator landscape: how AI voice cloning works, what is legal, and how CallSphere uses 57+ voices in production.
A founder's guide to AI voice assistants for ecommerce: customer service, order lookup, and how CallSphere fits in versus virtual receptionists.
Robot text to speech in 2026: how I pick TTS APIs, when robotic voices help, and how CallSphere ships 57+ language voice agents. Hands-on guide.
The customer support specialist role in 2026 is half human, half AI. Here is what the job looks like, the AI tools that pair with it, and how we ship it.
© 2026 CallSphere LLC. All rights reserved.