
Cartesia Sonic 3 (April 2026): Real-Time TTS Learns to Laugh

Cartesia's Sonic 3 brings AI laughter, emotion, and 40ms real-time latency to TTS in 2026. Here is how it stacks up for production voice agents.

What changed

```mermaid
flowchart TD
  In["Inbound voice call"] --> VAD["Server VAD"]
  VAD --> Triage["Triage Agent"]
  Triage -->|booking| Book["Booking Agent"]
  Triage -->|inquiry| Info["Inquiry Agent"]
  Triage -->|reschedule| Resched["Reschedule Agent"]
  Book --> DB[("Postgres + Prisma")]
  Info --> DB
  Resched --> DB
  DB --> Out["Spoken response · ElevenLabs"]
```
CallSphere reference architecture

Cartesia released Sonic 3 in early 2026 as the successor to the well-regarded Sonic 2.0 (which itself shipped after Cartesia's $64M Series A from Kleiner Perkins). The headline numbers: 40ms real-time latency for the streaming model and 90ms for the full-quality model, plus first-class support for non-verbal audio — laughter, sighs, breaths — generated inline from natural-language tags.

The Cartesia + Vapi partnership made Sonic 2.0 (and now Sonic 3) the default TTS option on Vapi as of mid-2026. Sonic 3 is also live on Together AI and SignalWire. Voice cloning is a two-step flow: upload a 10-second sample, then reference the returned voice ID in your TTS calls; a usable clone is ready in under a minute. Accents in English (American, British, Australian, Indian) are first-class.
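
The upload step is just an HTTP call. Here is a minimal Python sketch of that two-step flow; the base URL, headers, endpoint path, and field names are assumptions based on Cartesia's published REST API at the time of writing, so confirm them against the current docs before relying on this.

```python
import requests

API_KEY = "sk_car_..."            # your Cartesia API key (placeholder)
BASE = "https://api.cartesia.ai"  # assumption: current REST base URL
HEADERS = {
    "X-API-Key": API_KEY,
    "Cartesia-Version": "2024-06-10",  # assumption: pin whatever version the docs list
}

# Step 1: upload a ~10-second clean sample; the response carries the new voice's ID.
with open("brand_voice_sample.wav", "rb") as sample:
    resp = requests.post(
        f"{BASE}/voices/clone",
        headers=HEADERS,
        files={"clip": sample},
        data={"name": "Brand voice", "language": "en"},
    )
resp.raise_for_status()
voice_id = resp.json()["id"]  # assumption: field name for the cloned voice ID

# Step 2: reference voice_id in normal TTS requests (see the timing sketch later on).
print("Cloned voice ready:", voice_id)
```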

Sonic's underlying architecture is a state-space model rather than a Transformer — that is the engineering reason it can hit 40ms streaming. The trade-off historically was expressive range; Sonic 3 has largely closed that gap.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Why it matters for voice agent builders

Sub-100ms TTS first-byte changes the conversational physics. Once you cross under the human reaction-time threshold (~200ms voice-to-voice), interruptions, back-channels ("uh-huh, mm-hmm"), and overlap become possible. That is the territory where voice agents start to feel like they are co-present, not turn-taking.

Concrete implications:

  1. Pipelines that were impossible become feasible. STT (50ms) + LLM TTFT (300ms) + TTS first-byte (40ms) = 390ms voice-to-voice with overlap support; the sketch after this list walks through the budget.
  2. Laughter and back-channels finally sound natural. Inline tags for non-verbal audio mean the agent can respond "[laughs] oh, that's a good one" without a recorded clip splice.
  3. Voice cloning at the speed of thought. 10 seconds of audio is enough to onboard a new voice — that is a customer service product feature in itself.
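
As a quick sanity check on that budget, here is a small self-contained sketch. The numbers are the illustrative figures from the list above, not measurements, and the "feels instant" target is an assumed rule of thumb.

```python
# Illustrative voice-to-voice latency budget (figures match the example above).
PIPELINE_MS = {
    "stt_final_transcript": 50,
    "llm_time_to_first_token": 300,
    "tts_first_byte_streaming": 40,  # Sonic 3 streaming model
}

HUMAN_REACTION_MS = 200   # rough voice-to-voice threshold mentioned earlier
FEELS_INSTANT_MS = 500    # assumed target for a "co-present" feel

total_ms = sum(PIPELINE_MS.values())
print(f"voice-to-voice first audio: {total_ms} ms")                       # -> 390 ms
print(f"headroom vs {FEELS_INSTANT_MS} ms target: {FEELS_INSTANT_MS - total_ms} ms")
```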

How CallSphere applies this

CallSphere uses Cartesia for two specific patterns. OneRoof Real Estate (10 specialist agents, vision on property photos, OpenAI Agents SDK, WebRTC) routes its outbound buyer-callback flow through Sonic 3 because the agent speaks uninterrupted for long stretches while reading property descriptions, and Sonic 3's 90ms full-quality model with inline pacing delivers listings with a natural realtor cadence rather than the staccato of pre-Sonic models.

For the Salon GlamBook flow (4 agents, ElevenLabs TTS/STT, GB-YYYYMMDD-### booking refs), we A/B-tested Sonic 3 vs Eleven v3 over a sample of 4,500 booking calls. ElevenLabs won on emotional warmth in the salon receptionist persona; Sonic 3 won on response speed and was cheaper per minute. We kept ElevenLabs for the brand voice but added Sonic 3 as the fallback for high-volume outbound reminders.
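
The routing behind that dual-vendor setup is small. The sketch below is a hypothetical illustration of the pattern, not CallSphere's actual code; the synth_elevenlabs and synth_sonic3 helpers and the call-type labels are placeholders you would wire to each vendor's SDK.

```python
from typing import Callable

# Placeholder per-vendor synthesis helpers; in practice each wraps that vendor's SDK.
def synth_elevenlabs(text: str) -> bytes:
    return b""  # stub

def synth_sonic3(text: str) -> bytes:
    return b""  # stub

# Best tool per job: the brand-voice persona stays on ElevenLabs,
# high-volume outbound reminders take the faster, cheaper Sonic 3 path.
ROUTES: dict[str, Callable[[str], bytes]] = {
    "inbound_booking": synth_elevenlabs,
    "outbound_reminder": synth_sonic3,
}

def synthesize(call_type: str, text: str) -> bytes:
    primary = ROUTES.get(call_type, synth_elevenlabs)
    try:
        return primary(text)
    except Exception:
        # Cross-vendor fallback keeps the call alive if the primary TTS degrades.
        fallback = synth_sonic3 if primary is synth_elevenlabs else synth_elevenlabs
        return fallback(text)
```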

This dual-vendor pattern is core to how the 37-agent CallSphere fleet operates: best tool per job, locked behind one billing line at $149 / $499 / $1499 with the 14-day no-card trial.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Build and migration steps

  1. Get a Cartesia API key, pick a voice ID, and select the sonic-3 model in the API.
  2. Test the same prompt across sonic-3-streaming (40ms) and sonic-3-quality (90ms) — for live agents the streaming model is almost always right (see the timing sketch after this list).
  3. Add laughter tags inline ("That's hilarious [laughs]") and verify the non-verbal audio renders on your stack.
  4. If you self-host Pipecat or LiveKit, swap the TTS adapter — both already ship with Cartesia support.
  5. Clone a brand voice with 10 seconds of clean audio, then run a 100-call A/B against your existing TTS.
  6. Re-tune your turn-end VAD threshold — with Sonic 3 you can shrink silence detection from 700ms to ~400ms.
  7. Track WER + opinion scores; we recommend a 1,000-call eval before flipping production.
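
A minimal timing harness for steps 2 and 3 might look like the sketch below. The model IDs are the ones named in the steps above; the endpoint path, headers, and request-body fields are assumptions based on Cartesia's published REST shape, and the measured number includes network time, so treat it as a relative comparison rather than the model's quoted latency.

```python
import time
import requests

API_KEY = "sk_car_..."            # placeholder
BASE = "https://api.cartesia.ai"  # assumption: REST base URL
HEADERS = {"X-API-Key": API_KEY, "Cartesia-Version": "2024-06-10"}  # version date is an assumption
VOICE_ID = "your-voice-id"        # placeholder (e.g. a cloned brand voice)

# Inline non-verbal tag from step 3.
TRANSCRIPT = "That's hilarious [laughs], let me pull up your booking."

def first_chunk_ms(model_id: str) -> float:
    """Time until the first audio chunk arrives; request shape is assumed."""
    payload = {
        "model_id": model_id,
        "transcript": TRANSCRIPT,
        "voice": {"mode": "id", "id": VOICE_ID},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    }
    start = time.perf_counter()
    with requests.post(f"{BASE}/tts/bytes", headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1024))  # block until the first audio bytes land
    return (time.perf_counter() - start) * 1000

for model in ("sonic-3-streaming", "sonic-3-quality"):
    print(f"{model}: {first_chunk_ms(model):.0f} ms to first chunk (includes network)")
```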

FAQ

What is Cartesia Sonic 3? Cartesia's third-generation real-time text-to-speech model, released in early 2026. It supports 40ms streaming latency, 90ms full-model latency, inline non-verbal audio (laughter, sighs), and accent localization in English.

How is Sonic 3 different from Sonic 2.0? Sonic 3 adds inline non-verbal audio support (laughter, emotion), tighter pacing controls, and a refined voice cloning pipeline. Latency targets are similar to Sonic 2.0.

Can I run Sonic 3 on Vapi? Yes — Cartesia is a default TTS option on Vapi as of 2026, including Sonic 3. The integration ships with both real-time and full models exposed.

What languages does Sonic 3 support? English with American, British, Australian, and Indian accents is the most polished tier. Multilingual support is expanding but not yet class-leading; for global deployments many builders pair Sonic with Soniox or Deepgram for STT and add a translation step.

Is Sonic 3 cheaper than ElevenLabs v3? Generally yes on a per-minute basis, especially in high-volume real-time use. ElevenLabs still leads on character-level voice quality in blind tests for emotional content.


