By Sagar Shankaran, Founder of CallSphere
SageMaker Serverless Inference still doesn't support GPUs in 2026. Here's the honest blueprint for voice on AWS — Bedrock for LLM, real-time GPU endpoints for TTS, and Nova for chat.
Key takeaways
TL;DR — In 2026, SageMaker Serverless Inference is still CPU-only — no GPUs, no AWS Marketplace models, no VPC. For voice, that means: use Bedrock (Nova / Claude) for LLM, SageMaker real-time GPU endpoints for TTS/STT, and Lambda for orchestration. April 2026 added Inference Recommendations (auto-tune deployment configs) and serverless customization for Qwen3.5 — useful for chat, not voice.
A lot of teams arrive at SageMaker expecting "serverless" to mean "GPU on demand." It doesn't. Voice models need GPU; therefore voice on AWS = real-time managed endpoints (provisioned concurrency, auto-scaling) plus Bedrock for the LLM step. Get this wrong and you waste 3 weeks discovering serverless can't run Whisper.
flowchart LR
CALLER[Connect / Chime] -->|PCM| LAMBDA[Lambda Orchestrator]
LAMBDA --> SM_STT[SageMaker Real-Time GPU - Whisper]
SM_STT -->|text| BR[Bedrock Nova 2 / Claude 4.7]
BR -->|reply| SM_TTS[SageMaker Real-Time GPU - Kokoro]
SM_TTS -->|audio| CALLER
CallSphere's AWS path is reserved for enterprise customers who require BAA-locked, single-tenant deployments. 37 agents · 90+ tools · 115+ DB tables · 6 verticals ride on Bedrock + SageMaker Real-Time + Connect. Standard plans $149 / $499 / $1,499, 14-day /trial, 22% affiliate at /affiliate.
ml.g5.xlarge (A10G).anthropic.claude-sonnet-4-7-20250620-v1:0 or amazon.nova-2-lite-v1:0 (custom-fine-tuned via the new Nova inference path).ml.g5.xlarge.create-inference-recommendations-job to auto-pick instance + batch size.Q: Why not just use Bedrock + Polly? A: You can. The pure Bedrock + Polly path is the easiest and most teams should start there. SageMaker enters when you need custom models.
Q: HIPAA? A: Bedrock + SageMaker + Polly all BAA-eligible. Connect is BAA-eligible in healthcare configurations. See /industries/healthcare.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Q: Cost?
A: SageMaker ml.g5.xlarge ≈ $1.20/hr; Bedrock Nova Lite ≈ $0.06/M input. CallSphere /pricing abstracts this.
Q: Edge inference? A: Use SageMaker Edge Manager + Greengrass for IoT voice; otherwise Bedrock + Connect is regional, not edge.
Q: AgentCore? A: Bedrock AgentCore Runtime (GA Apr 2026) packages Pipecat-style voice agents into a managed runtime — good for prototypes.
AWS SageMaker for Voice AI in 2026: When Serverless Works (and When It Doesn't) forces a tension most teams underestimate: agent handoff state. A single LLM call is easy. A booking agent that hands a confirmed slot to a billing agent that hands a follow-up to an escalation agent — that's where context loss, hallucinated IDs, and double-bookings live. Solving it well means treating the conversation as a stateful workflow, not a chat.
The big fork is managed (OpenAI Realtime, ElevenLabs Conversational AI) versus self-hosted on GPUs you operate. Managed wins on cold-start, model freshness, and zero-ops; self-hosted wins on unit economics past a certain conversation volume and on data residency for regulated verticals. CallSphere runs hybrid: Realtime for live calls, self-hosted Whisper + a hosted LLM for async, both routed through a Go gateway that enforces per-tenant rate limits.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Latency budgets are non-negotiable on voice. End-to-end target is sub-800ms ASR-to-first-token and sub-1.4s first-audio-out; anything beyond that and turn-taking feels stilted. GPU residency in the same region as your TURN servers matters more than choosing a slightly bigger model.
Observability is the unglamorous backbone — every conversation produces logs, traces, sentiment scoring, and cost attribution piped to a per-tenant dashboard. HIPAA + SOC 2 aligned isolation keeps healthcare traffic separated from salon traffic at the storage layer, not just the API.
What's the right way to scope the proof-of-concept?
Real Estate runs as a 6-container pod (frontend, gateway, ai-worker, voice-server, NATS event bus, Redis) backed by Postgres realestate_voice with row-level security so multi-tenant data never crosses tenants. For a topic like "AWS SageMaker for Voice AI in 2026: When Serverless Works (and When It Doesn't)", that means you're not starting from scratch — you're configuring an agent template that's already been hardened across thousands of conversations.
How do you handle compliance and data isolation? Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.
When does it make sense to switch from a managed model to a self-hosted one? The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer.
Want to see how this maps to your stack? Book a live walkthrough at calendly.com/sagar-callsphere/new-meeting, or try the vertical-specific demo at salon.callsphere.tech. 14-day trial, no credit card, pilot live in 3–5 business days.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
AWS HealthScribe became the open scribe layer EHR vendors built on top of in 2026. Here's the API surface, the per-encounter pricing, the BAA terms.
AWS Multi-Agent Orchestrator ships supervisor routing, classifier, and shared memory. How to compose a customer-support agent team on Bedrock that scales cleanly.
AWS Trainium 2 supply caught up with demand in April 2026, prompting a re-set of EC2 Trn2 instance pricing and a fresh push into mid-market AI workloads.
Amazon's late-April 2026 earnings confirmed AWS AI revenue is 'multi-billion-dollar quarterly run-rate' with Trainium 2 supply outpacing demand for the first time.
Bedrock Claude + Transcribe streaming + Polly Neural runs $0.06–$0.10 per minute on paper. The honest math reveals where the AWS-native stack beats and where it loses to OpenAI Realtime.
Inside Amazon's ~$8B cumulative investment in Anthropic, Trainium exclusivity, AWS Bedrock distribution, and what compute capture means for governance independence and enterprise risk.
© 2026 CallSphere LLC. All rights reserved.