GCP Vertex AI Speech & Live Pricing vs Alternatives in 2026
GCP Speech-to-Text Chirp at $0.016 per 15s and Vertex Live multimodal pricing change the math. Where Google Cloud's voice stack beats AWS and OpenAI — and where it does not.
The cost problem
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]
Google Cloud has three voice-relevant products that overlap awkwardly: Speech-to-Text (the standard STT API with Chirp model), Cloud Text-to-Speech (Polly equivalent with Studio voices), and Vertex AI Live (the multimodal Gemini realtime endpoint). Each one prices differently and the documentation sprawls.
If you are evaluating a GCP voice stack, you need to figure out which combination wins for your workload — and whether Vertex Live's bundled approach beats stitching the components.
How GCP prices it
Cloud Speech-to-Text (May 2026):
- Chirp model standard real-time: $0.016 per 15 seconds = $0.064/min
- Chirp_2 / Telephony: similar tier
- Free tier: 60 minutes/month
- Volume discounts available at enterprise spend
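As a quick check on the per-minute figure, here is a minimal Python sketch of the conversion from 15-second billing units. It assumes usage rounds up to the next 15-second unit, which is an illustrative assumption rather than a confirmed billing rule, and the function name and `free_seconds` parameter are hypothetical.

```python
import math

CHIRP_RATE_PER_15S = 0.016  # USD per 15-second unit, from the list above


def stt_cost(seconds: float, free_seconds: float = 0.0) -> float:
    """Estimate Speech-to-Text cost in USD for `seconds` of audio.

    Assumption (not confirmed by the article): usage is rounded up to
    whole 15-second units after subtracting any free allowance.
    """
    billable = max(0.0, seconds - free_seconds)
    units = math.ceil(billable / 15)
    return round(units * CHIRP_RATE_PER_15S, 6)


print(stt_cost(60))      # one minute of audio
print(stt_cost(5 * 60))  # a 5-minute call
```

Passing `free_seconds=3600` models the 60-minute monthly free tier when you feed in a whole month of audio at once.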
Cloud Text-to-Speech:
- Standard: $4 per 1M characters
- WaveNet: $16 per 1M characters
- Studio: $160 per 1M characters
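To make the gap between tiers concrete, the sketch below prices the same text at each rate. The 1,500-character sample is an illustrative input, and the function and dictionary names are hypothetical.

```python
# USD per 1M characters, taken from the list above.
TTS_RATES_PER_MILLION = {
    "standard": 4.0,
    "wavenet": 16.0,
    "studio": 160.0,
}


def tts_cost(characters: int, voice: str) -> float:
    """Return the estimated USD cost of synthesizing `characters` characters."""
    if voice not in TTS_RATES_PER_MILLION:
        raise ValueError(f"unknown voice tier: {voice}")
    return characters / 1_000_000 * TTS_RATES_PER_MILLION[voice]


# Price the same 1,500 characters at each tier.
for voice in TTS_RATES_PER_MILLION:
    print(f"{voice}: ${tts_cost(1500, voice):.4f}")
```

Studio costs ten times WaveNet per character, so the choice of tier matters far more than small changes in response length.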
Vertex AI Gemini (May 2026):
- Gemini 2.5 Flash: $0.075/M input · $0.30/M output text tokens
- Gemini 2.5 Pro: $1.25/M input · $5/M output
- Gemini Live audio: similar token model with audio input/output meters
- Context caching: implicit cache 25% off, explicit cache up to 75% off
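The token rates and cache discounts above combine as in the following sketch. It treats the cache discount as a flat percentage off the cached share of input tokens and ignores any cache storage fees; both are simplifying assumptions, and `TokenRates`, `llm_cost`, and the parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Text-token rates from the list above.
FLASH = TokenRates(input_per_million=0.075, output_per_million=0.30)
PRO = TokenRates(input_per_million=1.25, output_per_million=5.00)


def llm_cost(rates: TokenRates, input_tokens: int, output_tokens: int,
             cached_fraction: float = 0.0, cache_discount: float = 0.0) -> float:
    """Estimate USD cost; the discount applies only to the cached input share."""
    if not 0.0 <= cached_fraction <= 1.0 or not 0.0 <= cache_discount <= 1.0:
        raise ValueError("fractions must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    effective_input = fresh + cached * (1.0 - cache_discount)
    input_cost = effective_input / 1_000_000 * rates.input_per_million
    output_cost = output_tokens / 1_000_000 * rates.output_per_million
    return input_cost + output_cost


# A 100k-token prompt with 5k output on Pro, uncached vs fully cached at 75% off.
print(llm_cost(PRO, 100_000, 5_000))
print(llm_cost(PRO, 100_000, 5_000, cached_fraction=1.0, cache_discount=0.75))
```

Output tokens are never discounted here, so the saving from caching shrinks as responses get longer relative to the prompt.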
Honest math
Profile A — 5-minute support call, GCP stitched (Chirp + WaveNet + Gemini 2.5 Flash):
- Speech-to-Text: 5 × $0.064 = $0.32
- TTS WaveNet (2 min × 750 chars ÷ 1M × $16): $0.024
- Gemini 2.5 Flash (12k input cached + 2k output): ~$0.015
- Total: ~$0.359/call → $0.072/min
That is about 3.8× more expensive than the cascaded Deepgram + GPT-4o-mini + Aura-2 stack ($0.019/min). Speech-to-Text Chirp is the line item killing it: at $0.32 it accounts for roughly 89% of the per-call cost.
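The Profile A arithmetic can be reproduced in a few lines. This sketch takes the Gemini line item (~$0.015) as given rather than recomputing it from token counts, and the variable names are illustrative.

```python
CALL_MINUTES = 5

stt = CALL_MINUTES * 0.064        # Chirp at $0.064/min
tts = 2 * 750 / 1_000_000 * 16    # 2 min x 750 chars at $16 per 1M chars (WaveNet)
llm = 0.015                       # Gemini 2.5 Flash estimate, taken as given

total = stt + tts + llm
per_minute = total / CALL_MINUTES
ratio_vs_cascaded = per_minute / 0.019  # cascaded stack at $0.019/min
stt_share = stt / total

print(f"total per call: ${total:.3f}")
print(f"per minute:     ${per_minute:.3f}")
print(f"vs cascaded:    {ratio_vs_cascaded:.1f}x")
print(f"STT share:      {stt_share:.0%}")
```

Changing `CALL_MINUTES` or swapping the TTS rate shows quickly how little the non-STT items move the total.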
Profile B — 12-min healthcare intake, GCP Live (Gemini 2.5 Pro audio):
Per-minute Gemini Live cost lands at roughly $0.20–$0.35/min, depending on prompt size and cache hit rate, similar to uncached gpt-realtime.
Profile C — Same as B with cache and Flash variant:
- ~$0.06–$0.09/min
So GCP wins when you go all-in on Gemini Live with caching and the Flash model. GCP loses when you stitch components around Speech-to-Text Chirp, because Chirp pricing is uncompetitive against Deepgram or even Amazon Transcribe Tier 2.
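To see what these per-minute figures mean at volume, the sketch below scales them to a monthly bill. The 10,000-minute volume is a hypothetical input, the dictionary keys are illustrative labels, and the rates are the per-minute numbers quoted in the profiles above.

```python
# (low, high) USD per minute, from Profiles A-C and the cascaded comparison.
PER_MINUTE_USD = {
    "gcp_stitched_chirp": (0.072, 0.072),
    "gemini_live_pro": (0.20, 0.35),
    "gemini_live_flash_cached": (0.06, 0.09),
    "cascaded_deepgram_stack": (0.019, 0.019),
}


def monthly_range(minutes: float) -> dict[str, tuple[float, float]]:
    """Scale each per-minute range to a monthly (low, high) cost in USD."""
    return {
        name: (low * minutes, high * minutes)
        for name, (low, high) in PER_MINUTE_USD.items()
    }


# Hypothetical volume: 10,000 call minutes per month.
for name, (low, high) in monthly_range(10_000).items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```

Keeping the ranges as pairs rather than midpoints preserves the uncertainty in the Live estimates, which depend on prompt size and cache hit rate.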
When GCP wins
- Multimodal flows (audio + video together) — Gemini Live is the strongest
- You already have committed GCP spend
- Long context windows (Gemini 2.5 Pro handles 2M tokens cleanly)
- You want context caching (75% explicit cache discount is competitive)
- Search and grounding integrations — Vertex AI Search beats most alternatives
When GCP loses
- Pure voice STT-only workloads — Deepgram is 13× cheaper
- Latency-sensitive premium support — gpt-realtime wins on TTFT
- Studio voices are $160/M chars — only justifiable for branded recordings, not live agents
- The pricing surface area is hard to navigate — you will spend ops time decoding it
How CallSphere optimizes
CallSphere does not run a GCP-native voice path in production today. We use Vertex Search for one B2B research feature and we evaluate Gemini 2.5 Pro for long-context post-call summarization where the 2M context window helps.
For live voice we land on OpenAI Realtime PCM16 24kHz on Healthcare and ElevenLabs Sarah on Sales, with Deepgram Nova-3 for the cascaded paths. Across 6 verticals — 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 aligned — the routing logic gives Gemini a fair shake on long-context analytics but rarely picks it for live audio.
The pricing tiers on our site ($149 / $499 / $1499) are deliberately designed so we can swap providers per agent without breaking margin. If you want to feel the GCP-vs-OpenAI difference in your own data, the ROI calculator plugs your existing usage into a per-provider cost model. The 14-day no-card trial lets you measure live.
Optimization checklist
- Use Vertex Live (not stitched) if you commit to GCP — bundled is cheaper at scale.
- Lean on Gemini 2.5 Flash where possible — the Pro upcharge is usually not worth it.
- Use explicit context caching aggressively — 75% off cached input is competitive.
- Avoid Studio voices for live agents — WaveNet is good enough.
- If you only need STT, Deepgram or Transcribe Tier 2 beat GCP on cost.
- Measure Chirp accuracy on your accent profile — strong on broad English, weaker on rare accents than Deepgram Nova-3.
- Watch for Gemini Live audio-token-rate updates — Google has cut prices three times in 2026 already.
- Use Cloud Logging for per-call cost attribution.
- Pin to a single region for Vertex — multi-region routing adds latency.
- Re-evaluate quarterly — GCP voice pricing moves more than AWS.
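For the per-call cost attribution item, one lightweight approach is to emit a JSON line per call and filter on its fields afterwards. This sketch assumes the service runs in an environment that forwards JSON written to stdout into Cloud Logging as structured entries (Cloud Run is one example); the field names other than `severity` and `message` are hypothetical.

```python
import json
import sys


def cost_log_line(call_id: str, provider: str, minutes: float, cost_usd: float) -> str:
    """Build one JSON log line carrying per-call cost fields."""
    entry = {
        "severity": "INFO",
        "message": "voice_call_cost",
        # Hypothetical attribution fields; rename to fit your own schema.
        "call_id": call_id,
        "provider": provider,
        "minutes": minutes,
        "cost_usd": round(cost_usd, 6),
    }
    return json.dumps(entry, sort_keys=True)


# Example: log the 5-minute stitched call costed earlier in the article.
print(cost_log_line("call-0001", "gcp_stitched", 5.0, 0.359), file=sys.stdout)
```

Logging cost alongside the provider name makes it possible to compare providers per call without joining against billing exports.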
FAQ
Is Vertex AI Live cheaper than OpenAI Realtime? Roughly equivalent. Both land at $0.06–$0.10/min with caching on typical workloads.
Why is Speech-to-Text Chirp so expensive? GCP positioned Chirp as premium quality. For pure STT, Deepgram Nova-3 is dramatically cheaper.
What is context caching on Gemini? A discount on repeated input tokens — implicit gets 25% off, explicit gets up to 75% off. Useful for big system prompts.
Can I use Vertex Live with HIPAA? Yes — Vertex AI is HIPAA-eligible with a BAA in place.
Should I use Gemini for cost-sensitive flows? 2.5 Flash is competitive with GPT-4o-mini for short-context flows. For long-context, Gemini 2.5 Pro wins on context window size.
Sources
- Google Cloud Speech-to-Text Pricing — https://cloud.google.com/speech-to-text/pricing
- Google Cloud Text-to-Speech Pricing — https://cloud.google.com/text-to-speech/pricing
- Vertex AI Generative AI Pricing — https://cloud.google.com/vertex-ai/generative-ai/pricing
- nOps Vertex AI Pricing 2026 guide — https://www.nops.io/blog/vertex-ai-pricing/