---
title: "Prompt Caching for Voice Agents: The Real 90% Savings in 2026"
description: "Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there."
canonical: https://callsphere.ai/blog/vw2c-prompt-caching-voice-agents-real-savings-2026
category: "AI Engineering"
tags: ["Prompt Caching", "Cost", "OpenAI", "Anthropic", "Voice AI"]
author: "CallSphere Team"
published: 2026-03-22T00:00:00.000Z
updated: 2026-05-07T09:32:11.130Z
---

# Prompt Caching for Voice Agents: The Real 90% Savings in 2026

> Anthropic and OpenAI both offer 90%+ prompt cache discounts on stable input. We measured 91% cache hit rates in production — here is the engineering pattern that gets you there.

## The cost problem

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

*CallSphere reference architecture*

Voice agent system prompts are huge. A typical production prompt for a healthcare intake or sales discovery flow runs 8,000 to 22,000 tokens — clinical guardrails, tool schemas, tone rules, escalation paths, FAQ snippets. Without caching, that entire prompt is billed again on every conversational turn.

A 12-turn call with a 22k-token prompt charges 264k input tokens just for the prompt repetition. At GPT-4o text rates ($2.50/M input) that is $0.66 per call before the model says a word. Caching is no longer optional.

## How prompt caching prices it

**OpenAI (May 2026):**

- Cached input: 90% discount on most text models (gpt-4o $2.50→$0.25 per M); cached audio input on gpt-realtime is discounted even more steeply ($32→$0.40 per M)
- Implicit cache, 5-minute TTL, automatic on repeated prefixes
- No special configuration required for stable prefixes

**Anthropic Claude (May 2026):**

- Cached input: 90% discount (0.1× standard rate)
- Cache write: 1.25× standard input rate (one-time)
- 5-minute or 1-hour TTL options
- Explicit `cache_control` markers required (see the sketch below)
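
A minimal sketch of those markers against the Anthropic Python SDK (the model id and prompt text are placeholders, not production values):

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_PROMPT = "<~22k tokens of guardrails, tone rules, tool policy, FAQ>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder id; use your deployed model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_PROMPT,
            # Everything up to and including this block is written to the
            # cache once (1.25x input rate), then read back at 0.1x.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Turn 1 caller transcript chunk..."}],
)
print(response.usage)  # cache_creation_input_tokens vs cache_read_input_tokens
```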

**Google Gemini:**

- Implicit cache: 25% discount automatically on repeated content
- Explicit context cache: up to 75% discount when you cache via `CachedContent`
- 1-hour default TTL, configurable

## Honest math

**Without caching, 12-turn call with 22k system prompt:**

- 22k × 12 turns = 264k input tokens
- gpt-4o text: 264k × $2.50 / 1M = **$0.66 per call**

**With OpenAI implicit caching (90% hit rate after turn 1):**

- Turn 1: 22k × $2.50 / 1M = $0.055
- Turns 2–12: 22k × 11 × $0.25 / 1M = $0.061
- **Total: $0.116 per call (82% savings)**

**With Anthropic explicit caching:**

- Cache write: 22k × $3.75 / 1M = $0.0825 (one-time)
- Cache reads: 22k × 11 × $0.30 / 1M = $0.0726
- Output (constant): ~$0.05
- **Total: $0.205 per call (roughly 76% below the ~$0.84 the same call costs uncached at Claude's $3/M input rate)**
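
Those bullets reduce to a few lines of arithmetic. A quick sketch reproducing the input-side numbers above (rates in dollars per million tokens; output cost excluded):

```python
PROMPT_TOKENS = 22_000
TURNS = 12

def prompt_cost(first_turn_rate: float, later_turn_rate: float) -> float:
    """Dollar cost of re-sending the system prompt on every turn."""
    first = PROMPT_TOKENS * first_turn_rate / 1e6
    rest = PROMPT_TOKENS * (TURNS - 1) * later_turn_rate / 1e6
    return first + rest

print(prompt_cost(2.50, 2.50))         # 0.6600: gpt-4o, uncached
print(prompt_cost(2.50, 0.25))         # 0.1155: gpt-4o, implicit cache
print(prompt_cost(3.00 * 1.25, 0.30))  # 0.1551: Claude, 1.25x write then 0.1x reads
```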

The pattern: **savings are real and big, but the engineering matters.** A few rules:

1. The cached portion has to be a *prefix* — anything that follows a dynamic insert falls out of the cache (see the sketch after this list).
2. TTL is short (5 min default) — cold-call patterns underperform.
3. Cache write costs 25% extra one-time on Anthropic; OpenAI is implicit, no write penalty.
4. Tool schemas should be in the prefix portion, not appended later.
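
The split that satisfies rules 1 and 4, sketched in Python (the file paths and the `build_system_prompt` helper are illustrative):

```python
from pathlib import Path

# Stable head: byte-identical on every call, so it stays a cacheable prefix.
STABLE_HEAD = (
    Path("prompts/guardrails.md").read_text()
    + Path("prompts/tool_schemas.json").read_text()  # rule 4: schemas live in the head
)

def build_system_prompt(prospect_name: str, lead_score: int) -> str:
    # Rule 1: dynamic values go strictly AFTER the head. Inserting them any
    # earlier changes the prefix and forfeits the cache for the whole prompt.
    tail = f"\n## This call\nProspect: {prospect_name}\nLead score: {lead_score}\n"
    return STABLE_HEAD + tail
```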

## How CallSphere optimizes

CallSphere runs three caching patterns across 6 verticals (37 agents, 90+ tools, 115+ DB tables):

**Pattern 1: Healthcare post-call analytics with GPT-4o-mini.** A 14k-token clinical analysis prompt runs against every Healthcare call's transcript at end-of-call. We hit a 96% cache hit rate because the prompt prefix is identical across calls and only the transcript varies in the user message (post-prefix). Cost: $0.0024 per analysis vs $0.024 uncached, a 90% savings.
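
In code the pattern is simple. A hedged sketch (the prompt path is illustrative, not our production module):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()
# ~14k tokens, byte-identical across calls -> implicit cache after the first hit
ANALYSIS_PROMPT = Path("prompts/clinical_analysis.md").read_text()

def analyze_call(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},  # stable prefix
            {"role": "user", "content": transcript},         # only this varies
        ],
    )
    return resp.choices[0].message.content
```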

**Pattern 2: Sales product live agent with ElevenLabs Sarah voice + GPT-4o-mini brain.** The 9k-token sales playbook is split into a static head (8.4k, cached) and a dynamic tail (600 tokens, per-call: prospect name, lead score, last touch). We hit a 91% cache hit rate. Cost: roughly $0.018 per minute, LLM-only.

**Pattern 3: Healthcare Voice Agent on OpenAI Realtime PCM16 24kHz.** An 18k-token clinical prompt with 14 tools. Same split approach — a 16.4k stable head and a 1.6k dynamic tail. A 91% cache hit rate at the realtime cached-audio rate ($32 → $0.40 per M, 98.75% off). Net effective LLM cost: under $0.05/min on the voice path.
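
On the Realtime API the same split lands in the session `instructions`. A hedged sketch: `ws` is assumed to be an open websocket to the Realtime endpoint, and the head/tail variables follow the convention above:

```python
import json

def configure_session(ws, stable_head: str, dynamic_tail: str) -> None:
    # session.update sets instructions once per call; keeping the head
    # byte-identical across calls is what earns the cached audio rate.
    ws.send(json.dumps({
        "type": "session.update",
        "session": {"instructions": stable_head + dynamic_tail},
    }))
```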

The [pricing tiers](/pricing) ($149 / $499 / $1499) bake this caching savings into the margin. Without caching we could not run the [14-day no-card trial](/trial) without burning cash. Caching is the difference between a sustainable SMB price point and an enterprise-only product.

## Optimization checklist

1. Split your prompt into stable head + dynamic tail.
2. Put tool schemas in the stable head — not appended to the user message.
3. Keep dynamic tail under 10% of total prompt size for max cache benefit.
4. On Anthropic, set explicit `cache_control` markers at boundary points.
5. On OpenAI, just keep the prefix stable — implicit cache handles it.
6. Monitor your hit rate via the API response `prompt_tokens_details` field (see the sketch after this list).
7. Pre-warm cache with a low-cost call at start-of-shift if traffic is bursty.
8. Use the 1-hour TTL on Anthropic only when calls are frequent enough for the write premium to amortize — 1-hour cache writes are priced at 2× base input, versus 1.25× for the 5-minute tier.
9. Never put PII in cached content — clinical prompts are fine, patient names are not.
10. Re-measure quarterly — both vendors keep tweaking the cache discount rate.
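
For item 6 (and the pre-warm in item 7), the hit rate falls out of the usage block on every response. A sketch against the OpenAI SDK (`STABLE_HEAD` is a placeholder for your real prompt):

```python
from openai import OpenAI

client = OpenAI()
STABLE_HEAD = "<your stable system prompt, 1k+ tokens>"  # cache needs a long prefix

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STABLE_HEAD},
        {"role": "user", "content": "ping"},  # doubles as a cheap start-of-shift pre-warm
    ],
)
usage = resp.usage
cached = usage.prompt_tokens_details.cached_tokens
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
print(f"cached {cached} of {usage.prompt_tokens} prompt tokens ({hit_rate:.0%})")
```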

## FAQ

**Is OpenAI prompt caching truly automatic?**
Yes — implicit caching on identical prefixes triggers automatically with 5-minute TTL. No code change required.

**Why does Anthropic charge for cache write?**
The cache state is stored on Anthropic infrastructure; the 1.25× write fee covers that. Reads are 0.1× input.

**What is the typical cache hit rate in production?**
80–95% for stable prompts in chat agents; 85–96% for voice agents because turns repeat the prefix.

**Does caching work with tool calls?**
Yes — tool schemas are part of the prompt prefix and benefit from the cache.

**Can I cache the user message?**
On Anthropic, yes: place `cache_control` markers on stable user-turn content. On OpenAI the implicit cache matches any exact prefix, including prior conversation turns, so repeated history benefits; only the fresh user message at the tail does not.

## Sources

- OpenAI Prompt Caching announcement — [https://openai.com/index/api-prompt-caching/](https://openai.com/index/api-prompt-caching/)
- Anthropic Prompt Caching docs — [https://platform.claude.com/docs/en/build-with-claude/prompt-caching](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- OpenAI API Pricing — [https://openai.com/api/pricing/](https://openai.com/api/pricing/)
- Anthropic API Pricing — [https://platform.claude.com/docs/en/about-claude/pricing](https://platform.claude.com/docs/en/about-claude/pricing)
- ngrok prompt caching benchmark — [https://ngrok.com/blog/prompt-caching](https://ngrok.com/blog/prompt-caching)

