AI Infrastructure

GCP Vertex AI Speech & Live Pricing vs Alternatives in 2026

GCP Speech-to-Text Chirp at $0.016 per 15s and Vertex Live multimodal pricing change the math. Where Google Cloud's voice stack beats AWS and OpenAI — and where it does not.


The cost problem

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
CallSphere reference architecture

Google Cloud has three voice-relevant products that overlap awkwardly: Speech-to-Text (the standard STT API with the Chirp model), Cloud Text-to-Speech (the Amazon Polly equivalent, with Studio voices at the top tier), and Vertex AI Live (the multimodal Gemini realtime endpoint). Each one prices differently and the documentation sprawls.

If you are evaluating a GCP voice stack, you need to figure out which combination wins for your workload — and whether Vertex Live's bundled approach beats stitching the components.

How GCP prices it

Cloud Speech-to-Text (May 2026):

  • Chirp model standard real-time: $0.016 per 15 seconds = $0.064/min
  • Chirp_2 / Telephony: similar tier
  • Free tier: 60 minutes/month
  • Volume discounts available at enterprise spend
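As a sanity check, the 15-second billing unit converts to per-call cost like this. A minimal sketch: the round-up-to-15-second metering is an assumption, so verify granularity against your actual bill, since it changes the math on short calls.

```python
import math

# Convert Chirp's $0.016-per-15-seconds rate into a per-call cost.
# Assumption: audio is metered in 15-second increments, rounded up.
RATE_PER_15S = 0.016

def chirp_cost(seconds: float) -> float:
    """Estimated STT cost for one audio stream of the given length."""
    return math.ceil(seconds / 15) * RATE_PER_15S

print(chirp_cost(60))   # one minute  -> $0.064
print(chirp_cost(300))  # 5-min call  -> $0.32
```

Note the round-up behavior: a 61-second stream bills five increments, not four, which is why metering granularity matters for short IVR-style calls.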

Cloud Text-to-Speech:

  • Standard: $4 per 1M characters
  • WaveNet: $16 per 1M characters
  • Studio: $160 per 1M characters
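To see what those per-character rates mean for live audio, convert them to dollars per minute of synthesized speech. The ~750 characters per spoken minute is an assumption (it matches the figure used in the Profile A math later in this article):

```python
# Per-minute cost of synthesized speech for each Cloud TTS tier.
# Assumption: ~750 characters of text per minute of audio.
CHARS_PER_MIN = 750
TIERS_PER_M_CHARS = {"Standard": 4.0, "WaveNet": 16.0, "Studio": 160.0}

per_min = {name: rate * CHARS_PER_MIN / 1_000_000
           for name, rate in TIERS_PER_M_CHARS.items()}
for name, cost in per_min.items():
    print(f"{name}: ${cost:.3f}/min")
# Standard ~$0.003/min, WaveNet ~$0.012/min, Studio ~$0.120/min
```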

Vertex AI Gemini (May 2026):

  • Gemini 2.5 Flash: $0.075/M input · $0.30/M output text tokens
  • Gemini 2.5 Pro: $1.25/M input · $5/M output
  • Gemini Live audio: same token model, with separate meters for audio input and output
  • Context caching: implicit cache 25% off, explicit cache up to 75% off
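A quick sketch of how the cache discounts shift effective input cost, using the Flash input rate above. The split between cached and fresh tokens is a modeling assumption; actual cached-token rates are on the Vertex pricing page.

```python
FLASH_INPUT_PER_M = 0.075  # $/M input tokens, Gemini 2.5 Flash

def effective_input_cost(tokens: int, cached_fraction: float,
                         discount: float = 0.75) -> float:
    """Input cost when `cached_fraction` of tokens hit an explicit cache
    (75% off is the explicit-cache ceiling quoted above)."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh + cached * (1 - discount)) * FLASH_INPUT_PER_M / 1_000_000

print(effective_input_cost(12_000, 0.0))  # no cache hit   -> $0.0009
print(effective_input_cost(12_000, 1.0))  # fully cached   -> $0.000225
```

On a 12k-token system prompt the cache cuts the input line item by 4× per turn, which is why big-prompt voice agents care about this discount.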

Honest math

Profile A — 5-minute support call, GCP stitched (Chirp + WaveNet + Gemini 2.5 Flash):

  • Speech-to-Text: 5 × $0.064 = $0.32
  • TTS WaveNet (2 min × 750 chars ÷ 1M × $16): $0.024
  • Gemini 2.5 Flash (12k input cached + 2k output): ~$0.015
  • Total: ~$0.359/call → $0.072/min
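The bullet math above, reproduced as a script so you can swap in your own call shape. The $0.015 LLM figure is the article's cached-prompt estimate, not a computed value.

```python
# Profile A: 5-minute support call on the stitched GCP stack.
CALL_MIN = 5
stt = CALL_MIN * 0.064                # Chirp at $0.064/min
tts = 2 * 750 / 1_000_000 * 16        # ~2 min of WaveNet speech at $16/M chars
llm = 0.015                           # Gemini 2.5 Flash, cached prompt (estimate)
total = stt + tts + llm
per_min = total / CALL_MIN
print(round(total, 3), round(per_min, 3))  # ~$0.359/call, ~$0.072/min
```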

At $0.072/min, that is nearly 4× the cascaded Deepgram + GPT-4o-mini + Aura-2 stack ($0.019/min). Speech-to-Text Chirp is the line item killing it.

Profile B — 12-min healthcare intake, GCP Live (Gemini 2.5 Pro audio):

Per-minute Gemini Live cost lands at roughly $0.20–$0.35/min depending on prompt size and cache hit rate, comparable to uncached gpt-realtime.

Profile C — Same as B with cache and Flash variant:

  • ~$0.06–$0.09/min

So GCP wins when you go all-in on Gemini Live with caching and the Flash model. GCP loses when you stitch with Speech-to-Text Chirp because Chirp pricing is uncompetitive vs Deepgram or even Transcribe Tier 2.

When GCP wins

  • Multimodal flows (audio + video together) — Gemini Live is the strongest
  • You already have committed GCP spend
  • Long context windows (Gemini 2.5 Pro handles 2M tokens cleanly)
  • You want context caching (75% explicit cache discount is competitive)
  • Search and grounding integrations — Vertex AI Search beats most alternatives

When GCP loses

  • Pure voice STT-only workloads — Deepgram is 13× cheaper
  • Latency-sensitive premium support — gpt-realtime wins on TTFT
  • Studio voices are $160/M chars — only justifiable for branded recordings, not live agents
  • The pricing surface area is hard to navigate — you will spend ops time decoding it

How CallSphere optimizes

CallSphere does not run a GCP-native voice path in production today. We use Vertex Search for one B2B research feature and we evaluate Gemini 2.5 Pro for long-context post-call summarization where the 2M context window helps.


For live voice we land on OpenAI Realtime PCM16 24kHz on Healthcare and ElevenLabs Sarah on Sales, with Deepgram Nova-3 for the cascaded paths. Across 6 verticals — 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 aligned — the routing logic gives Gemini a fair shake on long-context analytics but rarely picks it for live audio.

The pricing tiers on our site ($149 / $499 / $1499) are deliberately designed so we can swap providers per agent without breaking margin. If you want to feel the GCP-vs-OpenAI difference in your own data, the ROI calculator plugs your existing usage into a per-provider cost model. The 14-day no-card trial lets you measure live.

Optimization checklist

  1. Use Vertex Live (not stitched) if you commit to GCP — bundled is cheaper at scale.
  2. Lean on Gemini 2.5 Flash where possible — the Pro upcharge is usually not worth it.
  3. Use explicit context caching aggressively — 75% off cached input is competitive.
  4. Avoid Studio voices for live agents — WaveNet is good enough.
  5. If you only need STT, Deepgram or Transcribe Tier 2 beat GCP on cost.
  6. Measure Chirp accuracy on your accent profile — strong on broad English, weaker than Deepgram Nova-3 on rare accents.
  7. Watch for Gemini Live audio-token-rate updates — Google has cut prices three times in 2026 already.
  8. Use Cloud Logging for per-call cost attribution.
  9. Pin to a single region for Vertex — multi-region routing adds latency.
  10. Re-evaluate quarterly — GCP voice pricing moves more than AWS.
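To make item 5 and the profile numbers concrete, here is a sketch that ranks the per-minute figures from this article at a given monthly volume. The 50,000-minute volume is an illustrative assumption, and the Live figure uses the low end of the Profile C range.

```python
# Monthly cost at an assumed volume, using per-minute rates cited above.
RATES = {
    "GCP stitched (Chirp + WaveNet + Flash)": 0.072,  # Profile A
    "Vertex Live, Flash + explicit cache": 0.06,      # low end of $0.06-$0.09
    "Deepgram + GPT-4o-mini + Aura-2": 0.019,         # cascaded alternative
}
MINUTES = 50_000  # assumed monthly volume

ranked = sorted(RATES.items(), key=lambda kv: kv[1])
for name, rate in ranked:
    print(f"{name}: ${rate * MINUTES:,.0f}/mo")
```

At this volume the gap is thousands of dollars a month, which is why the stitched-vs-Live decision is worth modeling before you commit spend.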

FAQ

Is Vertex AI Live cheaper than OpenAI Realtime? Roughly equivalent. Both land $0.06–$0.10/min cached on typical workloads.

Why is Speech-to-Text Chirp so expensive? GCP positioned Chirp as premium quality. For pure STT, Deepgram Nova-3 is dramatically cheaper.

What is context caching on Gemini? A discount on repeated input tokens — implicit gets 25% off, explicit gets up to 75% off. Useful for big system prompts.

Can I use Vertex Live with HIPAA? Yes — Vertex AI is HIPAA-eligible with a BAA in place.

Should I use Gemini for cost-sensitive flows? 2.5 Flash is competitive with GPT-4o-mini for short-context flows. For long-context, Gemini 2.5 Pro wins on context window size.

