
Cartesia Sonic 3 (April 2026): Real-Time TTS Learns to Laugh

Cartesia's Sonic 3 brings AI laughter, emotion, and 40ms real-time latency to TTS in 2026. Here is how it stacks up for production voice agents.

What changed

```mermaid
flowchart TD
  In["Inbound voice call"] --> VAD["Server VAD"]
  VAD --> Triage["Triage Agent"]
  Triage -->|booking| Book["Booking Agent"]
  Triage -->|inquiry| Info["Inquiry Agent"]
  Triage -->|reschedule| Resched["Reschedule Agent"]
  Book --> DB[("Postgres + Prisma")]
  Info --> DB
  Resched --> DB
  DB --> Out["Spoken response · ElevenLabs"]
```
CallSphere reference architecture

Cartesia released Sonic 3 in early 2026 as the successor to the well-regarded Sonic 2.0 (which itself shipped after Cartesia's $64M Series A from Kleiner Perkins). The headline numbers: 40ms real-time latency for the streaming model and 90ms for the full-quality model, plus first-class support for non-verbal audio — laughter, sighs, breaths — generated inline from natural-language tags.

The Cartesia + Vapi partnership made Sonic 2.0 (and now Sonic 3) the default TTS option on Vapi as of mid-2026. Sonic 3 is also live on Together AI and SignalWire. Voice cloning is a two-step flow: upload a 10-second sample, then reference the returned voice ID in your TTS calls; a usable clone is ready in under a minute. Accents in English (American, British, Australian, Indian) are first-class.
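
The upload step is just an HTTP call. Here is a minimal Python sketch of that two-step flow; the base URL, headers, endpoint path, and field names are assumptions based on Cartesia's published REST API at the time of writing, so confirm them against the current docs before relying on this.

```python
import requests

API_KEY = "sk_car_..."            # your Cartesia API key (placeholder)
BASE = "https://api.cartesia.ai"  # assumption: current REST base URL
HEADERS = {
    "X-API-Key": API_KEY,
    "Cartesia-Version": "2024-06-10",  # assumption: pin whatever version the docs list
}

# Step 1: upload a ~10-second clean sample; the response carries the new voice's ID.
with open("brand_voice_sample.wav", "rb") as sample:
    resp = requests.post(
        f"{BASE}/voices/clone",
        headers=HEADERS,
        files={"clip": sample},
        data={"name": "Brand voice", "language": "en"},
    )
resp.raise_for_status()
voice_id = resp.json()["id"]  # assumption: field name for the cloned voice ID

# Step 2: reference voice_id in normal TTS requests (see the timing sketch later on).
print("Cloned voice ready:", voice_id)
```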

Sonic's underlying architecture is a state-space model rather than a Transformer — that is the engineering reason it can hit 40ms streaming. The trade-off historically was expressive range; Sonic 3 has largely closed that gap.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Why it matters for voice agent builders

Sub-100ms TTS first-byte changes the conversational physics. Once you cross under the human reaction-time threshold (~200ms voice-to-voice), interruptions, back-channels ("uh-huh, mm-hmm"), and overlap become possible. That is the territory where voice agents start to feel like they are co-present, not turn-taking.

Concrete implications:

  1. Pipelines that were impossible become feasible. STT (50ms) + LLM TTFT (300ms) + TTS first-byte (40ms) = 390ms voice-to-voice with overlap support; the sketch after this list walks through the budget.
  2. Laughter and back-channels finally sound natural. Inline tags for non-verbal audio mean the agent can respond "[laughs] oh, that's a good one" without a recorded clip splice.
  3. Voice cloning at the speed of thought. 10 seconds of audio is enough to onboard a new voice — that is a customer service product feature in itself.
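
As a quick sanity check on that budget, here is a small self-contained sketch. The numbers are the illustrative figures from the list above, not measurements, and the "feels instant" target is an assumed rule of thumb.

```python
# Illustrative voice-to-voice latency budget (figures match the example above).
PIPELINE_MS = {
    "stt_final_transcript": 50,
    "llm_time_to_first_token": 300,
    "tts_first_byte_streaming": 40,  # Sonic 3 streaming model
}

HUMAN_REACTION_MS = 200   # rough voice-to-voice threshold mentioned earlier
FEELS_INSTANT_MS = 500    # assumed target for a "co-present" feel

total_ms = sum(PIPELINE_MS.values())
print(f"voice-to-voice first audio: {total_ms} ms")                       # -> 390 ms
print(f"headroom vs {FEELS_INSTANT_MS} ms target: {FEELS_INSTANT_MS - total_ms} ms")
```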

How CallSphere applies this

CallSphere uses Cartesia for two specific patterns. OneRoof Real Estate (10 specialist agents, vision on property photos, OpenAI Agents SDK, WebRTC) routes its outbound buyer-callback flow through Sonic 3 because the agent speaks uninterrupted for long stretches while reading property descriptions, and Sonic 3's 90ms full-quality model with inline pacing delivers listings with a natural realtor cadence rather than the staccato of pre-Sonic models.

For the Salon GlamBook flow (4 agents, ElevenLabs TTS/STT, GB-YYYYMMDD-### booking refs), we A/B-tested Sonic 3 vs Eleven v3 over a sample of 4,500 booking calls. ElevenLabs won on emotional warmth in the salon receptionist persona; Sonic 3 won on response speed and was cheaper per minute. We kept ElevenLabs for the brand voice but added Sonic 3 as the fallback for high-volume outbound reminders.
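
The routing behind that dual-vendor setup is small. The sketch below is a hypothetical illustration of the pattern, not CallSphere's actual code; the synth_elevenlabs and synth_sonic3 helpers and the call-type labels are placeholders you would wire to each vendor's SDK.

```python
from typing import Callable

# Placeholder per-vendor synthesis helpers; in practice each wraps that vendor's SDK.
def synth_elevenlabs(text: str) -> bytes:
    return b""  # stub

def synth_sonic3(text: str) -> bytes:
    return b""  # stub

# Best tool per job: the brand-voice persona stays on ElevenLabs,
# high-volume outbound reminders take the faster, cheaper Sonic 3 path.
ROUTES: dict[str, Callable[[str], bytes]] = {
    "inbound_booking": synth_elevenlabs,
    "outbound_reminder": synth_sonic3,
}

def synthesize(call_type: str, text: str) -> bytes:
    primary = ROUTES.get(call_type, synth_elevenlabs)
    try:
        return primary(text)
    except Exception:
        # Cross-vendor fallback keeps the call alive if the primary TTS degrades.
        fallback = synth_sonic3 if primary is synth_elevenlabs else synth_elevenlabs
        return fallback(text)
```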

This dual-vendor pattern is core to how the 37-agent CallSphere fleet operates: best tool per job, locked behind one billing line at $149 / $499 / $1499 with the 14-day no-card trial.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Build and migration steps

  1. Get a Cartesia API key, pick a voice ID, and select the sonic-3 model in the API.
  2. Test the same prompt across sonic-3-streaming (40ms) and sonic-3-quality (90ms) — for live agents the streaming model is almost always right (see the timing sketch after this list).
  3. Add laughter tags inline ("That's hilarious [laughs]") and verify the non-verbal audio renders on your stack.
  4. If you self-host Pipecat or LiveKit, swap the TTS adapter — both already ship with Cartesia support.
  5. Clone a brand voice with 10 seconds of clean audio, then run a 100-call A/B against your existing TTS.
  6. Re-tune your turn-end VAD threshold — with Sonic 3 you can shrink silence detection from 700ms to ~400ms.
  7. Track WER + opinion scores; we recommend a 1,000-call eval before flipping production.
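
A minimal timing harness for steps 2 and 3 might look like the sketch below. The model IDs are the ones named in the steps above; the endpoint path, headers, and request-body fields are assumptions based on Cartesia's published REST shape, and the measured number includes network time, so treat it as a relative comparison rather than the model's quoted latency.

```python
import time
import requests

API_KEY = "sk_car_..."            # placeholder
BASE = "https://api.cartesia.ai"  # assumption: REST base URL
HEADERS = {"X-API-Key": API_KEY, "Cartesia-Version": "2024-06-10"}  # version date is an assumption
VOICE_ID = "your-voice-id"        # placeholder (e.g. a cloned brand voice)

# Inline non-verbal tag from step 3.
TRANSCRIPT = "That's hilarious [laughs], let me pull up your booking."

def first_chunk_ms(model_id: str) -> float:
    """Time until the first audio chunk arrives; request shape is assumed."""
    payload = {
        "model_id": model_id,
        "transcript": TRANSCRIPT,
        "voice": {"mode": "id", "id": VOICE_ID},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    }
    start = time.perf_counter()
    with requests.post(f"{BASE}/tts/bytes", headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1024))  # block until the first audio bytes land
    return (time.perf_counter() - start) * 1000

for model in ("sonic-3-streaming", "sonic-3-quality"):
    print(f"{model}: {first_chunk_ms(model):.0f} ms to first chunk (includes network)")
```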

FAQ

What is Cartesia Sonic 3? Cartesia's third-generation real-time text-to-speech model, released in early 2026. It supports 40ms streaming latency, 90ms full-model latency, inline non-verbal audio (laughter, sighs), and accent localization in English.

How is Sonic 3 different from Sonic 2.0? Sonic 3 adds inline non-verbal audio support (laughter, emotion), tighter pacing controls, and a refined voice cloning pipeline. Latency targets are similar to Sonic 2.0.

Can I run Sonic 3 on Vapi? Yes — Cartesia is a default TTS option on Vapi as of 2026, including Sonic 3. The integration ships with both real-time and full models exposed.

What languages does Sonic 3 support? English with American, British, Australian, and Indian accents is the most polished tier. Multilingual support is expanding but not yet class-leading; for global deployments many builders pair Sonic with Soniox or Deepgram for STT and add a translation step.

Is Sonic 3 cheaper than ElevenLabs v3? Generally yes on a per-minute basis, especially in high-volume real-time use. ElevenLabs still leads on character-level voice quality in blind tests for emotional content.


