OpenAI Realtime API Cost Per Minute: The Real Math for 2026
We modeled 11 real call profiles against OpenAI's published gpt-realtime audio token rates. The honest answer: between $0.11 and $0.46 per minute uncached, with prompt caching pulling it under $0.10.
The cost problem
```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

Every founder building on OpenAI Realtime asks the same question on day three: "What does this actually cost me per minute?" The OpenAI pricing page lists rates per million audio tokens, not per minute, and the conversion depends on who is talking and how long they pause. Builders quote each other numbers between $0.06 and $0.60 per minute, and they are all kind of right, depending on the call profile.
The result is that nobody trusts their own unit economics. We solved this for our own fleet and want to share the math so you do not have to.
How OpenAI prices it
The published rates for gpt-realtime (as of May 2026) are:
- Audio input: $32 per million tokens
- Cached audio input: $0.40 per million (a 98.75% discount on cache hits — yes, that high)
- Audio output: $64 per million
- Text input: $4 per million
- Cached text input: $0.40 per million
- Text output: $16 per million
Audio tokens are duration-encoded. User audio is 1 token per 100 ms. Assistant audio is 1 token per 50 ms. So 60 seconds of user speech equals 600 tokens; 60 seconds of assistant TTS equals 1,200 tokens.
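The encoding above reduces to two constants. A minimal sketch (the helper name is mine, the rates are the published ones):

```python
# 1 user-audio token per 100 ms; 1 assistant-audio token per 50 ms.
USER_TOKENS_PER_SEC = 10
AGENT_TOKENS_PER_SEC = 20

def audio_tokens(user_seconds: float, agent_seconds: float) -> tuple[int, int]:
    """(user input tokens, assistant output tokens) for the given durations."""
    return (int(user_seconds * USER_TOKENS_PER_SEC),
            int(agent_seconds * AGENT_TOKENS_PER_SEC))

print(audio_tokens(60, 60))  # (600, 1200), matching the figures above
```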
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Honest math (real call profiles)
For a real customer-service call (60% caller talk, 40% agent talk, 5-minute average), the math is:
- Caller audio in: 5 min × 60% = 180 seconds = 1,800 tokens × $32 / 1M = $0.0576
- Agent audio out: 5 min × 40% = 120 seconds = 2,400 tokens × $64 / 1M = $0.1536
- System prompt + tools (uncached, 12k tokens text in, repeats every turn × 8 turns): 96k × $4 / 1M = $0.384
- Reasoning text out (small, ~2k): $0.032
- Total uncached: $0.627 per call = $0.125 per minute
That is way over the "$0.06/min" napkin number because the system prompt is re-charged every turn. Now with prompt caching (90%+ hit rate on the stable portion of the system prompt):
- Cached system prompt: 96k × $0.40 / 1M = $0.0384 (saves $0.346)
- Cached total: $0.282 per call = $0.056 per minute
For a chattier sales call (50/50 talk split, 8 minutes, 14k token prompt, 12 turns):
- Uncached: $0.92 per call = $0.115/min
- Cached: $0.41 per call = $0.051/min
For a complex healthcare intake (heavy tool calls, 12 minutes, 22k token prompt, 18 turns, 6 tool round-trips):
- Uncached: $2.18 per call = $0.182/min
- Cached + structured: $0.96 per call = $0.080/min
The honest range across our 11 profiles: $0.11–$0.46/min uncached, $0.05–$0.10/min with prompt caching applied properly.
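All of these profiles come from the same arithmetic. A sketch of the model (function and parameter names are mine; the rates are OpenAI's published ones):

```python
AUDIO_IN, AUDIO_OUT = 32.0, 64.0          # $ per 1M audio tokens
TEXT_IN, TEXT_IN_CACHED = 4.0, 0.40       # $ per 1M text tokens
TEXT_OUT = 16.0

def call_cost(minutes, caller_frac, agent_frac, prompt_tokens, turns,
              text_out_tokens, cached=False):
    """Estimated dollars per call for one profile."""
    user_tok = minutes * 60 * caller_frac * 10     # 1 token / 100 ms
    agent_tok = minutes * 60 * agent_frac * 20     # 1 token / 50 ms
    prompt_tok = prompt_tokens * turns             # prompt re-sent every turn
    prompt_rate = TEXT_IN_CACHED if cached else TEXT_IN
    return (user_tok * AUDIO_IN + agent_tok * AUDIO_OUT
            + prompt_tok * prompt_rate + text_out_tokens * TEXT_OUT) / 1e6

# Customer-service profile: 5 min, 60/40 split, 12k prompt, 8 turns, 2k text out
uncached = call_cost(5, 0.6, 0.4, 12_000, 8, 2_000)             # ≈ $0.627
cached = call_cost(5, 0.6, 0.4, 12_000, 8, 2_000, cached=True)  # ≈ $0.282
print(f"${uncached:.3f}/call = ${uncached / 5:.3f}/min; cached ${cached:.3f}/call")
```

Swap in your own talk split, prompt size, and turn count to get your real per-minute number.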
How CallSphere optimizes
CallSphere runs OpenAI Realtime on the Healthcare Voice Agent (FastAPI on :8084, 14 tools, PCM16 at 24kHz). We hit roughly $0.087/min average across 6 verticals on the production cluster, after cache + prompt diet.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Three things moved the number:
- Aggressive prompt caching. Our 18,000-token healthcare system prompt is split into a stable static head (16,400 tokens, cached) and a per-call dynamic tail (1,600 tokens, uncached). 91% cache hit rate.
- Tool result trimming. We strip tool-return JSON to the fields the model actually consumes. A 4kB FHIR observation becomes a 380-byte summary line. That cut our reasoning token bill by 41%.
- Model end-of-turn instead of fixed-silence VAD. Server VAD with a 500ms silence threshold costs 60–120 extra audio-out tokens per turn from the model "thinking out loud." Switching to model end-of-turn detection cut that to zero.
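The first two levers can be sketched in a few lines. This is a hedged sketch: `STATIC_HEAD` stands in for the real ~16.4k-token prompt, and the FHIR field names are illustrative, not a fixed schema:

```python
# Hypothetical stand-in for the real ~16.4k-token static system prompt.
STATIC_HEAD = "You are a healthcare intake agent. [stable rules, tool specs, policies...]"

def build_instructions(patient_name: str, visit_context: str) -> str:
    """Stable head first, per-call tail last. The prompt cache matches on
    exact prefixes, so the head must be byte-identical on every call and
    anything dynamic must come strictly after it."""
    dynamic_tail = f"\n\n## This call\nPatient: {patient_name}\nContext: {visit_context}"
    return STATIC_HEAD + dynamic_tail

def trim_observation(obs: dict) -> str:
    """Collapse a verbose FHIR-style Observation into the one line the
    model actually consumes (e.g. a 4 kB payload -> 'Heart rate: 72 bpm')."""
    code = obs.get("code", {}).get("text", "unknown")
    qty = obs.get("valueQuantity", {})
    return f"{code}: {qty.get('value')} {qty.get('unit', '')}".strip()
```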
Across the 6 verticals on the production cluster — 37 agents, 90+ tools, 115+ DB tables — the same caching policy applies. Healthcare uses GPT-4o-mini for post-call analytics with 90% cache hit, ElevenLabs Sarah voice runs on the Sales product, and Realtime PCM16 24kHz powers Healthcare. The pricing tiers ($149 / $499 / $1499) are sized so SMB margins survive a $0.10/min ceiling on inference. There is a 14-day no-card trial that lets you measure the same on your own traffic.
Optimization checklist
- Split your system prompt into a stable head and a dynamic tail.
- Send the stable head first every turn so cache hits trigger.
- Use `prompt_cache_key` for explicit cache scoping where supported.
- Strip tool-result JSON to fields the model actually reads.
- Use `max_output_tokens` to cap runaway responses.
- Switch from server VAD to model-end-of-turn detection.
- Disable text logging unless you need it (text-out adds up).
- Move post-call analytics to GPT-4o-mini with batch where possible.
- Compare your real per-minute against the $0.10/min ceiling — that is the SMB-friendly target.
- Re-measure weekly; OpenAI cuts these prices on a quarterly cadence.
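Several checklist items land in the Realtime session config. A sketch of what that event might look like, assuming current field names (`semantic_vad` is the model end-of-turn option; verify both fields against the live API reference before shipping):

```python
import json

# Hedged sketch of a Realtime session.update applying two checklist items.
session_update = {
    "type": "session.update",
    "session": {
        # Model-driven end-of-turn instead of a fixed silence threshold.
        "turn_detection": {"type": "semantic_vad"},
        # Cap runaway responses so one rambling answer can't blow the budget.
        "max_response_output_tokens": 4096,
    },
}
payload = json.dumps(session_update)  # send over the Realtime WebSocket
```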
FAQ
What is the actual per-minute cost of gpt-realtime in 2026? Between $0.11 and $0.46/min uncached for typical agents; $0.05 to $0.10/min once you turn on prompt caching and trim tool outputs.
Why is the napkin "$0.06/min" number wrong? It assumes your system prompt is tiny and ignores tool calls. Real production prompts are 8–22k tokens, and that re-charges every turn unless cached.
Does prompt caching really save 90%+? Yes. Cached text input drops from $4 to $0.40 per million (a 90% discount), and cached audio input drops from $32 to $0.40 per million (98.75%). Hit rate determines effective savings; 80%+ is realistic.
What about gpt-realtime-mini? Roughly 60% cheaper across all rates. We use it for the lower-tier products in our pricing where we can trade some reasoning depth for unit economics.
How do I measure my own?
Look at the usage object on each completed response (the response.done event) in the Realtime API. It breaks out input, output, and cached audio and text token counts. Sum across the session and divide by call minutes.
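A sketch of the sum-and-divide step, assuming you have collected one usage dict per completed response. The flat key names here are my own simplification; map them from the actual token-detail fields on your events:

```python
RATES = {  # $ per 1M tokens, matching the published gpt-realtime rates
    "audio_in": 32.0, "audio_in_cached": 0.40, "audio_out": 64.0,
    "text_in": 4.0, "text_in_cached": 0.40, "text_out": 16.0,
}

def session_cost(usages: list[dict], minutes: float) -> tuple[float, float]:
    """Sum simplified usage dicts across a session -> (total $, $ per minute)."""
    total = sum(u.get(k, 0) * rate
                for u in usages
                for k, rate in RATES.items()) / 1e6
    return total, total / minutes
```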
Sources
- OpenAI API Pricing — https://openai.com/api/pricing/
- OpenAI Developers Pricing — https://developers.openai.com/api/docs/pricing
- OpenAI Prompt Caching announcement — https://openai.com/index/api-prompt-caching/
- eesel.ai GPT Realtime Mini pricing analysis — https://www.eesel.ai/blog/gpt-realtime-mini-pricing
- forasoft Realtime API production guide — https://www.forasoft.com/blog/article/openai-realtime-api-voice-agent-production-guide-2026
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.