By Sagar Shankaran, Founder of CallSphere
GCP Speech-to-Text Chirp at $0.016 per 15s and Vertex Live multimodal pricing change the math. Where Google Cloud's voice stack beats AWS and OpenAI — and where it does not.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]
Google Cloud has three voice-relevant products that overlap awkwardly: Speech-to-Text (the standard STT API, with the Chirp model), Cloud Text-to-Speech (the Amazon Polly equivalent, with Studio voices), and Vertex AI Live (the multimodal Gemini realtime endpoint). Each one is priced differently, and the documentation sprawls.
If you are evaluating a GCP voice stack, you need to figure out which combination wins for your workload — and whether Vertex Live's bundled approach beats stitching the components.
Cloud Speech-to-Text (May 2026):
Cloud Text-to-Speech:
Vertex AI Gemini (May 2026):
Profile A — 5-minute support call, GCP stitched (Chirp + WaveNet + Gemini 2.5 Flash):
That is roughly 3× the cost of the cascaded Deepgram + GPT-4o-mini + Aura-2 stack ($0.019/min). The Speech-to-Text Chirp line item is what drives the total up.
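To see how much the Chirp line item contributes, here is a minimal sketch of the arithmetic. It takes the $0.016 per 15 s rate and the $0.019/min cascaded figure quoted in this article at face value. It also assumes the whole call duration is transcribed and that billing rounds up to the next 15-second increment. The function name `chirp_cost_usd` is hypothetical, and this covers the STT line item only.

```python
import math

CHIRP_USD_PER_15S = 0.016     # rate quoted in this article
CASCADED_USD_PER_MIN = 0.019  # cascaded stack figure quoted in this article


def chirp_cost_usd(audio_seconds: float, increment_s: int = 15) -> float:
    """STT cost, assuming billing rounds up to the next 15 s increment."""
    increments = math.ceil(audio_seconds / increment_s)
    return increments * CHIRP_USD_PER_15S


call_seconds = 5 * 60  # Profile A: 5-minute support call
stt_total = chirp_cost_usd(call_seconds)
stt_per_min = stt_total / (call_seconds / 60)
ratio = stt_per_min / CASCADED_USD_PER_MIN

print(f"Chirp STT for the call: ${stt_total:.3f}")
print(f"Chirp STT per minute:   ${stt_per_min:.3f}")
print(f"Ratio to cascaded stack: {ratio:.1f}x")
```

Under these assumptions the STT line item alone comes to about $0.064/min, before any TTS or LLM cost is added.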
Profile B — 12-min healthcare intake, GCP Live (Gemini 2.5 Pro audio):
Per-minute Gemini Live cost lands at roughly $0.20–$0.35/min depending on prompt size and cache hit rate, similar to uncached gpt-realtime.
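As a rough worked example, the per-minute range above can be turned into a per-call range for the 12-minute intake in Profile B. This is only a sketch: the range is the approximate one quoted in this article, and the helper name `call_cost_range` is hypothetical.

```python
GEMINI_LIVE_USD_PER_MIN = (0.20, 0.35)  # rough range quoted in this article


def call_cost_range(minutes: float, rate_range=GEMINI_LIVE_USD_PER_MIN):
    """Return the (low, high) cost in USD for a call of the given length."""
    low_rate, high_rate = rate_range
    return minutes * low_rate, minutes * high_rate


low, high = call_cost_range(12)  # Profile B: 12-minute healthcare intake
print(f"12-minute intake call: ${low:.2f} to ${high:.2f}")
```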
Profile C — Same as B with cache and Flash variant:
So GCP wins when you go all-in on Gemini Live with caching and the Flash model. GCP loses when you stitch with Speech-to-Text Chirp because Chirp pricing is uncompetitive vs Deepgram or even Transcribe Tier 2.
CallSphere does not run a GCP-native voice path in production today. We use Vertex Search for one B2B research feature and we evaluate Gemini 2.5 Pro for long-context post-call summarization where the 2M context window helps.
For live voice we land on OpenAI Realtime PCM16 24kHz on Healthcare and ElevenLabs Sarah on Sales, with Deepgram Nova-3 for the cascaded paths. Across 6 verticals — 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 aligned — the routing logic gives Gemini a fair shake on long-context analytics but rarely picks it for live audio.
The pricing tiers on our site ($149 / $499 / $1499) are deliberately designed so we can swap providers per agent without breaking margin. If you want to feel the GCP-vs-OpenAI difference in your own data, the ROI calculator plugs your existing usage into a per-provider cost model. The 14-day no-card trial lets you measure live.
Is Vertex AI Live cheaper than OpenAI Realtime? They are roughly equivalent: both land at $0.06–$0.10/min with caching on typical workloads.
Why is Speech-to-Text Chirp so expensive? GCP positioned Chirp as premium quality. For pure STT, Deepgram Nova-3 is dramatically cheaper.
What is context caching on Gemini? A discount on repeated input tokens — implicit gets 25% off, explicit gets up to 75% off. Useful for big system prompts.
Can I use Vertex Live with HIPAA? Yes — Vertex AI is HIPAA-eligible with a BAA in place.
Should I use Gemini for cost-sensitive flows? 2.5 Flash is competitive with GPT-4o-mini for short-context flows. For long-context, Gemini 2.5 Pro wins on context window size.
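The context-caching answer above can be made concrete with a small sketch. It applies the discounts stated in this article (25% implicit, up to 75% explicit) to the cached share of input tokens. The price of $1.00 per million input tokens is a placeholder rather than a real Gemini rate, the 80% cache-hit share is an arbitrary example, and `cached_input_cost` is a hypothetical helper.

```python
def cached_input_cost(total_tokens: int, cached_fraction: float,
                      usd_per_million: float, discount: float) -> float:
    """Input-token cost when a fraction of tokens is billed at a cache discount."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    rate = usd_per_million / 1_000_000
    cached = total_tokens * cached_fraction
    uncached = total_tokens - cached
    return uncached * rate + cached * rate * (1.0 - discount)


PLACEHOLDER_USD_PER_MILLION = 1.00  # hypothetical price, for illustration only
TOKENS = 1_000_000
HIT_SHARE = 0.8                     # assumed share of input tokens served from cache

no_cache = cached_input_cost(TOKENS, 0.0, PLACEHOLDER_USD_PER_MILLION, 0.0)
implicit = cached_input_cost(TOKENS, HIT_SHARE, PLACEHOLDER_USD_PER_MILLION, 0.25)
explicit = cached_input_cost(TOKENS, HIT_SHARE, PLACEHOLDER_USD_PER_MILLION, 0.75)

for label, cost in [("no cache", no_cache), ("implicit", implicit), ("explicit", explicit)]:
    print(f"{label:>9}: ${cost:.2f}")
```

The saving scales with the cached share, which is why the discount matters most for large system prompts that repeat on every turn.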
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.