By Sagar Shankaran, Founder of CallSphere
Serverless GPU at $0.59–$3.95 per hour looks tempting until you measure cold start. Here is the honest break-even for self-hosting voice TTS or STT vs paying Deepgram or ElevenLabs.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]
When voice teams hit ~$5k/month on Deepgram or ElevenLabs, someone always asks: "Should we self-host an open-source STT or TTS model on Modal/Replicate/Baseten?" Serverless GPU pricing ($1.10/hr for an A10, $2.10/hr for an A100-40GB, $3.95/hr for an H100) looks dramatically cheaper than $0.0048/min × thousands of minutes.
But the simple "GPU $/hr ÷ minutes per hour" math is wrong. It ignores cold starts, idle time, model loading, batching, and the engineering cost of running GPUs in production.
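A rough sketch of why the naive division misleads: the function below folds idle time and cold starts into an effective cost per audio minute. Only the $1.10/hr A10 rate comes from the text above. The function name, its parameters, and the sample values (25% utilization, six 30-second cold starts per hour, one real-time stream per GPU) are illustrative assumptions, not measurements.

```python
def effective_cost_per_audio_minute(
    gpu_usd_per_hour: float,
    realtime_factor: float,
    utilization: float,
    cold_starts_per_hour: float = 0.0,
    cold_start_seconds: float = 0.0,
) -> float:
    """Estimate USD per minute of audio processed on one billed GPU.

    realtime_factor: audio minutes processed per wall-clock minute while busy.
    utilization: fraction of billed time spent on useful work, in (0, 1].
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if realtime_factor <= 0.0:
        raise ValueError("realtime_factor must be positive")

    # Billed seconds per hour lost to container boot and model loading.
    cold_seconds = cold_starts_per_hour * cold_start_seconds
    if cold_seconds >= 3600.0:
        raise ValueError("cold starts consume the whole billed hour")

    productive_fraction = utilization * (1.0 - cold_seconds / 3600.0)
    audio_minutes_per_hour = 60.0 * realtime_factor * productive_fraction
    return gpu_usd_per_hour / audio_minutes_per_hour


# Naive view: the GPU is busy every second and never cold-starts.
naive = effective_cost_per_audio_minute(1.10, realtime_factor=1.0, utilization=1.0)

# Hypothetical traffic pattern: 25% busy, six 30-second cold starts per hour.
realistic = effective_cost_per_audio_minute(
    1.10,
    realtime_factor=1.0,
    utilization=0.25,
    cold_starts_per_hour=6,
    cold_start_seconds=30,
)

print(f"naive: ${naive:.4f}/min, with idle and cold starts: ${realistic:.4f}/min")
```

Batching would raise `realtime_factor` and pull the number back down, which is why the result depends heavily on your own traffic shape rather than on the list price alone.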
Modal (May 2026):
Replicate:
Baseten:
Suppose you run 100k minutes/month of streaming STT.
Buy from Deepgram Nova-3: 100k × $0.0048 = $480/month
Self-host Whisper-large-v3 on Modal A10:
So self-hosting Whisper on Modal is 4–8× more expensive than Deepgram at this volume. Modal wins only if (a) Deepgram cannot meet your latency or accuracy bar, (b) you need on-prem / air-gapped, or (c) you scale past Deepgram's enterprise commit pricing.
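To make the comparison concrete, the snippet below back-solves what the stated 4–8× multiple implies in dollars. The volume, the Deepgram rate, the multiple, and the A10 rate are taken from the text; the GPU-hours figure is an upper bound that assumes the entire self-host bill is A10 time, which is a simplification since engineering cost is part of the picture.

```python
MINUTES_PER_MONTH = 100_000
DEEPGRAM_USD_PER_MIN = 0.0048        # Nova-3 rate quoted in the article
A10_USD_PER_HOUR = 1.10              # A10 rate quoted in the article
MULTIPLE_LOW, MULTIPLE_HIGH = 4, 8   # stated self-host vs vendor range

vendor_monthly = MINUTES_PER_MONTH * DEEPGRAM_USD_PER_MIN

# Monthly self-host spend implied by the stated multiple.
self_host_low = vendor_monthly * MULTIPLE_LOW
self_host_high = vendor_monthly * MULTIPLE_HIGH

# Upper bound on billed A10 hours if the whole bill were GPU time (assumption).
gpu_hours_low = self_host_low / A10_USD_PER_HOUR
gpu_hours_high = self_host_high / A10_USD_PER_HOUR

print(f"vendor: ${vendor_monthly:,.0f}/month")
print(f"self-host (implied): ${self_host_low:,.0f} to ${self_host_high:,.0f}/month")
print(f"billed A10 hours (upper bound): {gpu_hours_low:,.0f} to {gpu_hours_high:,.0f}")
```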
100k minutes of agent speech ≈ 50M characters at typical talk speeds (about 500 characters per minute).
Buy from ElevenLabs Flash: 50M × $0.05 / 1k = $2,500/month
Buy from Deepgram Aura-2: 50M × $0.030 / 1k = $1,500/month
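The vendor side of the TTS comparison is plain per-character arithmetic, sketched below with the figures quoted above. The helper name `monthly_tts_cost` is illustrative; swap in your own volume and negotiated rates.

```python
CHARS_PER_MONTH = 50_000_000
AGENT_MINUTES_PER_MONTH = 100_000

# USD per 1,000 characters, as quoted in the article.
USD_PER_1K_CHARS = {
    "ElevenLabs Flash": 0.05,
    "Deepgram Aura-2": 0.030,
}


def monthly_tts_cost(chars: int, usd_per_1k_chars: float) -> float:
    """Monthly spend for a per-1k-character priced TTS API."""
    return chars / 1000 * usd_per_1k_chars


costs = {
    name: monthly_tts_cost(CHARS_PER_MONTH, rate)
    for name, rate in USD_PER_1K_CHARS.items()
}

# Speech density implied by the 100k minutes ≈ 50M characters estimate.
chars_per_minute = CHARS_PER_MONTH / AGENT_MINUTES_PER_MONTH

for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}/month")
print(f"implied density: {chars_per_minute:.0f} characters per minute")
```

The cheaper vendor figure is the number a self-hosted TTS stack has to beat once GPU time and operations are both counted.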
Self-host F5-TTS on Modal A10:
So TTS self-host roughly matches ElevenLabs and is more expensive than Aura-2 at this scale. Self-host wins for TTS only when:
CallSphere does not self-host live STT or TTS today — Deepgram, ElevenLabs, and OpenAI win on cost and latency at our 6-vertical scale (37 agents, 90+ tools, 115+ DB tables).
We do use Modal for two specific async paths:
The decision rule we follow: if a serverless GPU saves under 30% vs the equivalent vendor API, we do not self-host because the operational tax is real. The pricing tiers ($149 / $499 / $1499) plus the 14-day no-card trial keep us honest — we cannot afford to pay an ops team to babysit GPUs unless the savings are substantial.
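The 30% rule can be written as a small check. This is a minimal sketch, not CallSphere's actual tooling: the function name is hypothetical, and it assumes the self-host figure you pass in already includes GPU time plus whatever operational cost you attribute to it.

```python
SAVINGS_THRESHOLD = 0.30  # from the rule above: under 30% savings, stay on the vendor


def should_self_host(
    vendor_monthly_usd: float,
    self_host_monthly_usd: float,
    threshold: float = SAVINGS_THRESHOLD,
) -> bool:
    """Return True only when self-hosting saves at least `threshold` vs the vendor."""
    if vendor_monthly_usd <= 0:
        raise ValueError("vendor_monthly_usd must be positive")
    savings = (vendor_monthly_usd - self_host_monthly_usd) / vendor_monthly_usd
    return savings >= threshold


# Self-hosting costs more than the vendor: negative savings, so stay put.
print(should_self_host(1_500, 2_500))    # False
# 25% savings is below the bar.
print(should_self_host(1_000, 750))      # False
# 40% savings clears the bar.
print(should_self_host(25_000, 15_000))  # True
```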
Is self-hosting STT cheaper than Deepgram? Below 1M min/month, almost never. Above that with negotiated commits, sometimes.
What about open-source Whisper vs Deepgram quality? Whisper-large-v3 matches Deepgram on broad English; Deepgram wins on streaming TTFT and on phone audio.
Should I use Replicate or Modal? Replicate for prototyping (no infra setup). Modal for production scale.
What is Baseten's value prop? Production reliability, enterprise SLAs, and embedded engineering support: you pay a premium for less ops risk.
When should I switch to fully self-hosted GPUs? Above ~$25k/month in vendor inference, on stable workloads, with a dedicated ML platform team.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.