
Speech-to-Speech LLMs 2026: GPT-4o-realtime vs Gemini Live vs Sesame Maya

The three production-grade native speech-to-speech LLMs of 2026, side by side. Latency, prosody quality, function calling, and where each one breaks.

What "Native Speech-to-Speech" Actually Means

Until 2024, voice agents were ASR → LLM → TTS pipelines. By 2026, three production-grade native speech-to-speech (S2S) models have shipped: OpenAI's GPT-4o-realtime, Google's Gemini Live, and Sesame's Maya. "Native" means the model takes audio in, emits audio out, and reasons in a joint audio-text space. In practice this matters for three reasons: lower latency, preserved prosody, and clean interruption handling (barge-in).

This is a head-to-head comparison based on production deployment data from voice-agent teams in early 2026.

The Architecture Difference

flowchart LR
    subgraph Pipeline[Pipeline 2024]
        A1[Audio In] --> ASR --> LLM --> TTS --> A2[Audio Out]
    end
    subgraph Native[Native S2S 2026]
        B1[Audio In] --> M[Multimodal LLM] --> B2[Audio Out]
    end

The native architecture eliminates two transcoding steps and the loss-of-prosody problem. Round-trip latency drops from 700-1500ms (pipeline) to 300-700ms (native).
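As a back-of-envelope check, the quoted round-trip ranges decompose into per-stage budgets. A minimal sketch, where the per-stage numbers are assumptions chosen to land inside the article's 700-1500ms and 300-700ms ranges, not measured values:

```python
# Illustrative latency budgets (milliseconds) for the two architectures.
PIPELINE_STAGES = {
    "asr": (150, 400),      # speech-to-text
    "llm": (300, 700),      # text generation to first token
    "tts": (150, 300),      # text-to-speech synthesis start
    "network": (100, 100),  # transport overhead
}

NATIVE_STAGES = {
    "s2s_model": (200, 600),  # joint audio-in / audio-out model
    "network": (100, 100),
}

def round_trip(stages):
    """Sum best-case and worst-case stage latencies."""
    lo = sum(a for a, _ in stages.values())
    hi = sum(b for _, b in stages.values())
    return lo, hi

print("pipeline:", round_trip(PIPELINE_STAGES))  # (700, 1500)
print("native:  ", round_trip(NATIVE_STAGES))    # (300, 700)
```

The point of the decomposition: the pipeline's floor is the sum of three serial stages, so no single-stage optimization can reach the native architecture's floor.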


GPT-4o-realtime

OpenAI's offering, refreshed in early 2026 with the GPT-4o-realtime preview line. It is the most-deployed S2S model in production agents.

  • Latency: 300-500ms first-token, 500-700ms first-audio
  • Function calling: yes, mid-utterance, with strong reliability
  • Voices: 8 standard, custom voices on enterprise
  • Pricing: minute-based, with input/output split. Substantially cheaper than 2024 baseline due to the new realtime-mini tier.
  • Strengths: best-in-class function-calling reliability mid-conversation, mature SDK
  • Weaknesses: limited prosody control vs Sesame; latency degrades on noisy connections

CallSphere's healthcare voice agent runs on GPT-4o-realtime in production for this reason — function calling under barge-in is the make-or-break feature.
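What "function calling under barge-in" demands of the integration layer can be sketched as a tiny event dispatcher. This is an illustrative skeleton, not any vendor's actual wire protocol: the event names ("speech_started", "tool_call", "audio_delta") and the VoiceSession class are placeholders.

```python
import json

class VoiceSession:
    """Toy session state: dispatch realtime events, support barge-in."""

    def __init__(self, tools):
        self.tools = tools      # tool name -> callable
        self.speaking = False   # is the agent currently emitting audio?

    def handle_event(self, event):
        kind = event["type"]
        if kind == "speech_started" and self.speaking:
            # Barge-in: the caller started talking over the agent,
            # so cancel the in-flight spoken response.
            self.speaking = False
            return {"type": "cancel_response"}
        if kind == "tool_call":
            # Mid-utterance function call: execute the tool and hand the
            # result back so the model can fold it into the spoken reply.
            fn = self.tools[event["name"]]
            result = fn(**json.loads(event["arguments"]))
            return {"type": "tool_result", "output": result}
        if kind == "audio_delta":
            self.speaking = True
        return None

# Hypothetical booking tool, for illustration only.
session = VoiceSession(tools={"book_slot": lambda date, time: f"booked {date} {time}"})
reply = session.handle_event(
    {"type": "tool_call", "name": "book_slot",
     "arguments": json.dumps({"date": "2026-03-01", "time": "10:00"})}
)
```

The make-or-break behavior is the first branch: a tool call that cannot be cleanly cancelled or resumed when the caller interrupts is what separates the models in production.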

Gemini Live

Google's S2S, integrated into Vertex AI. Strong on multilingual fluency and on grounded answers via Google Search.


  • Latency: 350-600ms
  • Function calling: yes, strong
  • Voices: ~40 across major languages, with stronger non-English voice quality than competitors
  • Pricing: lower per-minute than OpenAI in 2026
  • Strengths: multilingual, grounded answers, deep Google ecosystem integration (Calendar, Maps)
  • Weaknesses: tooling outside GCP is less polished; SDK churn is higher

Sesame Maya

Sesame is the dark horse. Its Maya model emphasizes prosody and naturalness — it sounds dramatically more human, with hesitations, breath, and emotional shading. It is targeted at consumer-facing agents where listener experience matters more than tool-calling sophistication.

  • Latency: 250-450ms
  • Function calling: introduced 2025, still maturing
  • Voices: small set, very high quality
  • Pricing: per-minute, premium
  • Strengths: best naturalness of any 2026 voice model, lowest barrier-to-engagement in user studies
  • Weaknesses: function calling less robust; smaller language coverage

Side-by-Side Decision Tree

flowchart TD
    Q1{Function-calling-heavy?} -->|Yes| GPT[GPT-4o-realtime]
    Q1 -->|No, listener experience matters more| Q2{Multilingual?}
    Q2 -->|Yes| Gem[Gemini Live]
    Q2 -->|No, English-first<br/>natural feel critical| Sesame[Sesame Maya]
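The decision tree above can be encoded as a small helper, useful when the choice is made per-deployment rather than once:

```python
def pick_s2s_model(function_calling_heavy: bool, multilingual: bool) -> str:
    """Encode the article's decision tree for choosing a 2026 S2S model."""
    if function_calling_heavy:
        return "GPT-4o-realtime"   # tool reliability dominates
    if multilingual:
        return "Gemini Live"       # strongest non-English coverage
    return "Sesame Maya"           # English-first, naturalness critical

pick_s2s_model(function_calling_heavy=True, multilingual=False)   # "GPT-4o-realtime"
pick_s2s_model(function_calling_heavy=False, multilingual=True)   # "Gemini Live"
```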

What the Production Data Shows

We ran a head-to-head on the same booking-flow scripts across 1500 customer calls per model. Headline numbers (your mileage will vary by use case):

  • Booking completion rate: GPT-4o-realtime 82%, Gemini Live 78%, Sesame Maya 71%
  • "Sounded human" CSAT (1-5): Sesame Maya 4.5, Gemini Live 4.0, GPT-4o-realtime 3.9
  • Function-call error rate: GPT-4o-realtime 2.1%, Gemini Live 3.3%, Sesame Maya 6.7%
  • p95 latency: Sesame Maya 480ms, GPT-4o-realtime 580ms, Gemini Live 640ms

The takeaway is unambiguous: production voice agents that need to actually do things (book, lookup, transact) lean GPT-4o-realtime. Customer-facing brand experiences where the conversation is the product lean Sesame.
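To adapt these headline numbers to your own priorities, restate them as data and re-weight. The weights below are illustrative defaults, not a recommendation; a brand-experience team would push weight toward CSAT.

```python
# The head-to-head numbers from above, as data.
METRICS = {
    "GPT-4o-realtime": {"completion": 0.82, "csat": 3.9, "fc_error": 0.021, "p95_ms": 580},
    "Gemini Live":     {"completion": 0.78, "csat": 4.0, "fc_error": 0.033, "p95_ms": 640},
    "Sesame Maya":     {"completion": 0.71, "csat": 4.5, "fc_error": 0.067, "p95_ms": 480},
}

def score(m, w_completion=0.5, w_csat=0.3, w_error=0.2):
    """Weighted composite: normalize CSAT to 0-1, penalize tool errors."""
    return (w_completion * m["completion"]
            + w_csat * m["csat"] / 5.0
            - w_error * m["fc_error"] * 10)

ranked = sorted(METRICS, key=lambda k: score(METRICS[k]), reverse=True)
```

With these default weights the ranking matches the article's takeaway; shift `w_csat` high enough and Sesame Maya moves to the top, which is exactly the brand-experience case.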

Where All Three Still Break

  • Background-noise-heavy environments (drive-throughs, factory floors): all three lose 5-10 points of task completion
  • Heavy overlap and cross-talk: barge-in handling is serviceable in all three, but none is reliable under sustained cross-talk
  • Mid-utterance code-switching between languages: Gemini Live handles it best; the others struggle


How This Plays Out in Production

Past the high-level comparison above, the engineering reality you inherit on day one is graceful degradation when the realtime model stalls: fallback voices, repeat prompts, and confident "let me transfer you" lines that still feel human. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast instrument the loop end-to-end before they tune any single component, because the bottleneck is rarely where intuition puts it.

Voice Agent Architecture, End to End

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer, typically OpenAI Realtime or ElevenLabs Conversational AI, with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable; otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence.

Post-call, every transcript runs through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption at rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.
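The post-call pipeline reduces to a small schema. A sketch of the normalized record, with field names taken from the pipeline description above; the extraction step itself (LLM or rules) is out of scope here.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CallRecord:
    """One row of structured data per call, per the pipeline above."""
    call_id: str
    sentiment: float                    # -1.0 (negative) .. 1.0 (positive)
    intent: str                         # e.g. "book_appointment"
    lead_score: int                     # 0 .. 100
    escalate: bool                      # flagged for human follow-up
    # Normalized slot extraction; None when the slot was not captured.
    name: Optional[str] = None
    callback_number: Optional[str] = None
    reason: Optional[str] = None
    urgency: Optional[str] = None       # e.g. "low" | "medium" | "high"

record = CallRecord(
    call_id="c-001", sentiment=0.4, intent="book_appointment",
    lead_score=72, escalate=False, name="Dana", urgency="medium",
)
row = asdict(record)  # ready to insert into a warehouse table
```

Making the slots explicit (and nullable) is what turns "just a recording" into queryable data: a missing callback number is a visible gap, not a silent one.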
FAQ

How do you actually ship a voice agent this way?
Treat the architecture in this post as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target under 1s for voice, under 3s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

What are the failure modes of voice agent deployments at scale?
The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.

How does the IT Helpdesk product (U Rack IT) handle RAG and tool calls?
U Rack IT runs 10 specialist agents with 15 tools and a ChromaDB-backed RAG index over runbooks and ticket history, so the agent can pull the exact resolution steps for a known issue instead of hallucinating. Tickets open, route, and close end-to-end without a human in the loop on the easy 60%.

See It Live

Book a 30-minute working session at calendly.com/sagar-callsphere/new-meeting and bring a real call flow. We will walk it through the live IT helpdesk agent (U Rack IT) at urackit.callsphere.tech and show you exactly where the production wiring sits.
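The retry-with-backoff and audit-log pattern from the FAQ can be sketched in a few lines. The RateLimited exception and the injectable sleep parameter are illustrative, not part of any real SDK:

```python
import json
import time

class RateLimited(Exception):
    """Raised by a tool when the upstream API rate-limits the call."""

def call_tool_with_audit(tool, args, audit_log, session_id,
                         max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Invoke a tool with exponential backoff; log every attempt.

    Each attempt is appended to audit_log as a JSON line pinned to the
    session ID, so the whole exchange can be replayed later.
    """
    for attempt in range(1, max_attempts + 1):
        entry = {"session": session_id, "tool": tool.__name__,
                 "args": args, "attempt": attempt}
        try:
            result = tool(**args)
            entry["status"] = "ok"
            audit_log.append(json.dumps(entry))
            return result
        except RateLimited:
            entry["status"] = "rate_limited"
            audit_log.append(json.dumps(entry))
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Logging before and after the retry decision is the point: a replayable audit trail shows not just that a call eventually succeeded, but how many rate-limit hits it absorbed on the way.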