GPT-Realtime-Whisper vs Deepgram: Streaming STT in 2026

OpenAI's GPT-Realtime-Whisper launches at $0.017/min for streaming STT. Side-by-side latency, accuracy, and cost math vs Deepgram and the field.

The Announcement, Plain English

On May 7, 2026, OpenAI shipped GPT-Realtime-Whisper, a streaming speech-to-text model priced at $0.017 per minute. It is purpose-built for low-latency transcription — the kind of STT that sits in front of voice agents, live captioning, and real-time analytics.

For teams that have been on Deepgram, AssemblyAI, Azure Speech, or Google Speech, this changes the cost-and-vendor calculus for the first time in two years.

Why Streaming STT Matters Independently

Most voice teams in 2026 still split their stack: a dedicated streaming STT vendor for transcription, a separate LLM for reasoning, and a TTS for output. Even with GPT-Realtime-2's end-to-end voice support, the split-stack pattern remains popular because:

  • Some flows do not need a full conversational model (live captioning, transcription of recorded calls, supervisor coaching feeds).
  • Pricing is often lower per minute than end-to-end voice models.
  • The transcript itself is the product (medical scribe, legal record, sales coaching analytics).

A dedicated streaming STT line item is therefore not going away. The question is which vendor wins it.

The Cost Math At Volume

Streaming STT pricing in 2026 (typical published rates):

  • GPT-Realtime-Whisper: $0.017/min
  • Deepgram Nova-3 streaming: ~$0.0043/min at high volume tiers
  • AssemblyAI Universal streaming: ~$0.015/min
  • Azure Speech streaming: ~$0.011/min
  • Google Speech-to-Text streaming: ~$0.024/min

On raw price-per-minute, Deepgram still wins at volume. GPT-Realtime-Whisper sits in the mid-tier: roughly four times Deepgram's high-volume rate, slightly above AssemblyAI, above Azure, and below Google.
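To make the spread concrete, here is a small sketch that ranks the rates listed above and expresses each one as a multiple of Deepgram's high-volume rate. The dictionary and function names are illustrative, and the rates are the approximate published figures quoted in this article, so substitute your own negotiated pricing.

```python
# Approximate published per-minute rates quoted above (USD).
# These are the article's figures, not live pricing.
RATES_PER_MIN = {
    "GPT-Realtime-Whisper": 0.017,
    "Deepgram Nova-3": 0.0043,
    "AssemblyAI Universal": 0.015,
    "Azure Speech": 0.011,
    "Google Speech-to-Text": 0.024,
}


def price_multiples(rates, baseline):
    """Return (vendor, rate, multiple of the baseline rate), cheapest first."""
    base = rates[baseline]
    return [
        (vendor, rate, round(rate / base, 2))
        for vendor, rate in sorted(rates.items(), key=lambda kv: kv[1])
    ]


for vendor, rate, multiple in price_multiples(RATES_PER_MIN, "Deepgram Nova-3"):
    print(f"{vendor:<24} ${rate:.4f}/min  {multiple:.2f}x Deepgram")
```

Sorting by rate first keeps the output readable as a price ladder, with the baseline vendor at 1.00x.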

The trade-off is accuracy and consistency. Whisper's lineage gives it strong out-of-the-box performance on accented English, code-switched audio, and noisier phone audio. Deepgram is faster and cheaper, but it has historically needed more domain tuning to reach production-grade word error rate (WER) on healthcare or financial vocabulary.

Where Whisper Wins

Three categories where GPT-Realtime-Whisper is the right call:

  1. Multilingual transcription without tuning. Whisper's training set carries it across languages where Deepgram and others need separate models.
  2. Single-vendor simplicity. If you are already on GPT-Realtime-2 for the agent, adding Whisper for raw transcripts means one bill, one auth, one SDK.
  3. Quality on accented and noisy audio. Whisper-family models have historically posted strong out-of-the-box numbers on real phone-quality data.

Where Deepgram Still Wins

  • Pure cost-per-minute at scale. If you are doing 5M+ minutes per month and your audio is clean English, Deepgram remains the cheapest reliable option.
  • Lowest latency targets. Deepgram is hard to beat on first-word latency for English.
  • Custom model training. Deepgram's tooling for domain-tuned models is more mature.

The Real Numbers For A 50K-Call Month

Assume average 5 minutes per call, 50,000 calls/month = 250,000 minutes:

  • Whisper streaming: 250,000 x $0.017 = $4,250/mo
  • Deepgram Nova-3: 250,000 x $0.0043 = ~$1,075/mo
  • AssemblyAI Universal: 250,000 x $0.015 = $3,750/mo

Whisper costs $3,175 more per month than Deepgram at that volume, or about $38,100 a year. For some teams that gap is irrelevant next to the simplification of running fewer vendors; for others it is a budget line worth fighting for.
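The same arithmetic can be sketched as a small function so you can plug in your own call volume and average handle time. The function and constant names here are illustrative, and the rates are the figures quoted above rather than a quote from any vendor.

```python
def monthly_stt_cost(calls_per_month, avg_minutes_per_call, rate_per_min):
    """Monthly STT spend in USD, rounded to cents."""
    minutes = calls_per_month * avg_minutes_per_call
    return round(minutes * rate_per_min, 2)


# Scenario from the article: 50,000 calls at 5 minutes each.
CALLS_PER_MONTH = 50_000
AVG_MINUTES_PER_CALL = 5

whisper = monthly_stt_cost(CALLS_PER_MONTH, AVG_MINUTES_PER_CALL, 0.017)
deepgram = monthly_stt_cost(CALLS_PER_MONTH, AVG_MINUTES_PER_CALL, 0.0043)
assemblyai = monthly_stt_cost(CALLS_PER_MONTH, AVG_MINUTES_PER_CALL, 0.015)

monthly_gap = round(whisper - deepgram, 2)
annual_gap = round(monthly_gap * 12, 2)

print(f"Whisper:    ${whisper:,.2f}/mo")
print(f"Deepgram:   ${deepgram:,.2f}/mo")
print(f"AssemblyAI: ${assemblyai:,.2f}/mo")
print(f"Whisper vs Deepgram gap: ${monthly_gap:,.2f}/mo, ${annual_gap:,.2f}/yr")
```

With the inputs above it reproduces the $4,250, $1,075, and $3,750 monthly figures. Changing the two constants shows how quickly the gap grows or shrinks with volume.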

Production Tradeoffs

  • Diarization. Deepgram has had production-grade diarization for a long time. The Whisper-family approach has historically been weaker here. Verify on your audio.
  • Confidence scores per word. Critical for HIPAA-grade medical scribing and legal applications. Check the API surface, not the marketing page.
  • Profanity filtering and PII redaction. Both vendors have it; the defaults differ.

Where CallSphere Fits

CallSphere is a managed voice and chat agent platform — you buy outcomes, not STT minutes. Underneath, we route STT to whichever streaming model best fits the language, latency, and accuracy profile of the call. Teams that just want "the phone agent works in English, Spanish, and 55 other languages" do not need to pick between Whisper and Deepgram themselves. Pricing: Starter $149/mo (2,000 interactions), Growth $499/mo (10,000), Scale $1,499/mo (50,000). Launch in 3–5 business days.

See pricing: callsphere.ai/pricing.

What To Do This Week

  1. Pull 30 minutes of real call audio from your worst-performing queue. Run it through both Whisper and Deepgram. Compare WER yourself — vendor benchmarks are not your data.
  2. Decide if you optimize for cost or for vendor-count. Both are valid.
  3. If you have multilingual traffic, weight Whisper higher than the raw price-per-minute suggests.
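For step 1, a minimal sketch of a WER calculation, assuming you already have a human-verified reference transcript and each vendor's output as plain text. It uses the standard definition (word-level edit distance divided by the number of reference words) with only lowercase-and-split normalization; a real evaluation would also normalize punctuation and numerals. The sample transcripts are invented for illustration and are not output from either vendor.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented sample data; replace with your own audio's transcripts.
reference = "the patient takes ten milligrams of lisinopril daily"
vendor_a_output = "the patient takes ten milligrams of lisinopril daily"
vendor_b_output = "the patient takes ten milligrams of listen pro daily"

print(f"vendor A WER: {word_error_rate(reference, vendor_a_output):.2%}")
print(f"vendor B WER: {word_error_rate(reference, vendor_b_output):.2%}")
```

Run the same function over every utterance in your 30-minute sample and compare the per-vendor averages, paying particular attention to the domain terms that matter to you.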

FAQ

Q: Is GPT-Realtime-Whisper the same as the open-source Whisper?
A: No. It is the streaming, hosted, low-latency variant, with a different latency profile, different pricing, and a different SLA. The open-source Whisper is still a great batch transcription tool.

Q: Can I use Whisper alongside a non-OpenAI conversational model?
A: Yes. It is a separate API; you can pipe transcripts anywhere.

Q: Will Deepgram match the $0.017/min price?
A: Probably not, because Deepgram is already below it. The competitive pressure falls on the mid-tier (AssemblyAI, Azure) more than on Deepgram.
