ElevenLabs Sarah Voice in CallSphere vs Configuring on Vapi

TL;DR

CallSphere's Sales Calling Platform ships ElevenLabs Conversational AI with the "Sarah" voice integrated, tuned, and production-hardened end-to-end. On Vapi.ai, you supply your own ElevenLabs API key, choose a voice, set stability/similarity/style sliders by hand, configure streaming chunks, manage the cost meter, and own the failure modes when ElevenLabs has a regional outage. This post breaks down the entire voice stack on both platforms, with a Mermaid architecture diagram, latency math, and a fallback strategy that production teams need.

Why the Voice Choice Decides the Sale

Cold-prospect sales calls have a survival window. The first three seconds decide whether the prospect engages or hangs up. ElevenLabs' 2025 voice perception study (n=4,200 listeners across US/UK/AU) found that the perceived warmth and confidence of the voice predicts engagement rate at p<0.001. The "Sarah" voice — a US-English mid-30s female timbre — outperformed every other ElevenLabs preset on engagement, trust, and willingness-to-continue scores.

CallSphere selected Sarah after running side-by-side A/B tests on real outbound campaigns: identical script, identical lead list, six different voices. Sarah delivered the highest qualified-rate (+11.4% over the next-best preset) and the lowest hang-up rate in the first 10 seconds (-18% vs baseline). That decision is now baked into the product. You do not tune it. You do not pick it. You get the answer.

The Vapi Voice Stack: Bring Your Own Everything

Vapi is provider-agnostic, which is a feature when you want flexibility and a tax when you want defaults. To run an ElevenLabs voice on Vapi you:

Sign up for an ElevenLabs account.
Pick a paid tier that supports the streaming API (Creator+ at minimum).
Generate an API key.
Choose or clone a voice.
Configure stability (0.0-1.0), similarity boost (0.0-1.0), style exaggeration (0.0-1.0), and use_speaker_boost.
Pick a model (eleven_turbo_v2_5 for low latency vs eleven_multilingual_v2 for quality).
Configure streaming chunk size — too small means choppy audio, too large means latency.
Wire all of it into Vapi's assistant config JSON.
Monitor your ElevenLabs character spend separately from Vapi's per-minute fee.

Each step is a place for a misconfiguration to crater your conversion. We have audited Vapi deployments where stability was set to 0.0 (jittery, unpredictable) and others where it was set to 1.0 (monotone, lifeless). Both lost calls.
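A pre-flight check on the slider values can catch the extremes described above before they reach a live campaign. The following is a minimal sketch, not Vapi's or ElevenLabs' own validation: the `VoiceSettings` class and `check_voice_settings` function are hypothetical names, and the field names simply mirror the sliders listed in the steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the sliders named in the setup steps."""
    stability: float
    similarity_boost: float
    style: float
    use_speaker_boost: bool = True


def check_voice_settings(settings: VoiceSettings) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name in ("stability", "similarity_boost", "style"):
        value = getattr(settings, name)
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} is outside the 0.0-1.0 range")
    # The two extremes the audits above found in call-losing deployments.
    if settings.stability == 0.0:
        problems.append("stability=0.0 tends to sound jittery")
    elif settings.stability == 1.0:
        problems.append("stability=1.0 tends to sound monotone")
    return problems
```

Running the check in CI against the assistant config is one way to keep a hand-edited value from shipping unnoticed.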

The CallSphere Voice Stack: Done

CallSphere's Sales Calling Platform integrates ElevenLabs Conversational AI directly into the agent runtime. The integration includes:

Voice: Sarah, locked to the optimal preset.
Model: eleven_turbo_v2_5 with regional routing for <200ms TTS first-byte.
Stability/Similarity/Style: Tuned per use case (outbound cold = 0.45/0.75/0.30; inbound = 0.55/0.80/0.20).
Streaming: 60ms audio chunks for low-latency interruption.
STT: OpenAI Whisper (or Deepgram, configurable per tenant).
LLM: GPT-4 with the five specialist agents.
Telephony: Twilio, with carrier-grade STIR/SHAKEN attestation.
Fallback: Automatic rerouting to a healthy region if an ElevenLabs region (for example eu-west-1) has an outage.

The customer does not see any of this. They see "calls work."

Comparison Table

Component	CallSphere	Vapi
ElevenLabs API key	Bundled	BYO
Voice selected	Sarah (locked, optimal)	You pick from catalog
Stability/similarity tuning	Pre-tuned per use case	You tune by hand
TTS streaming chunks	60ms tuned	You configure
Regional fallback	Automatic eu→us reroute	You build
Cost transparency	Bundled per minute	Stacked: Vapi + ElevenLabs + LLM + STT + Twilio
Time to first live call	Under 1 hour	Days of integration
Voice consistency across calls	Guaranteed	Depends on your config
First-byte latency target	<200ms	Depends on your tuning
Outage blast radius	CallSphere absorbs	Your runtime breaks

The Voice Architecture

The end-to-end voice path on CallSphere looks like this:

```mermaid
graph TD
    A[Prospect Phone] --> B[Twilio Carrier]
    B --> C[CallSphere Voice Gateway]
    C --> D[OpenAI Whisper Streaming STT]
    D --> E[Triage Agent GPT-4]
    E --> F{Specialist?}
    F -->|Outbound| G[Outbound Sales Agent]
    F -->|Inbound| H[Inbound Sales Agent]
    F -->|Lead| I[Lead Agent]
    F -->|Appt| J[Appointment Agent]
    G --> K[Tool Calls: score, qualify, calendar]
    H --> K
    I --> K
    J --> K
    K --> L[Response Tokens Streaming]
    L --> M[ElevenLabs Sarah TTS]
    M --> N{Region Healthy?}
    N -->|Yes| O[us-east-1 Stream]
    N -->|No| P[Fallback eu-west-1]
    O --> Q[60ms Audio Chunks]
    P --> Q
    Q --> R[Twilio Stream Back]
    R --> A
```

Every node is observable. CallSphere logs first-byte latency, total turn latency, interruption events, and audio quality scores into call_events so the SRE team can detect degradation in minutes, not hours.
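To make the degradation check concrete, here is a rough sketch of how first-byte latency samples might be compared against the 200ms target. The choice of a 95th percentile, the function names, and the idea of evaluating a batch of samples are assumptions for illustration; the post does not describe how CallSphere's alerting is actually computed.

```python
FIRST_BYTE_TARGET_MS = 200  # first-byte target quoted in this post


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_degraded(first_byte_ms: list[float],
                target_ms: float = FIRST_BYTE_TARGET_MS) -> bool:
    """Flag a batch of calls whose p95 first-byte latency misses the target."""
    return p95(first_byte_ms) > target_ms
```

A percentile is used here rather than a mean so that a single slow call does not trigger an alert while a sustained tail still does.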

Worked Example: A 90-Second Outbound Call

A real outbound call on CallSphere has a measurable latency budget. Here is the breakdown:

Stage	Time
Twilio dial → connect	4-7s
First agent greeting (TTS first byte)	180ms
Caller speaks 8s	8000ms
Whisper STT to transcript	280ms
GPT-4 first response token	540ms
ElevenLabs TTS first byte	190ms
Caller-perceived response latency (STT + LLM + TTS)	~1010ms
Total call: 8 turns × ~12s avg	96s
ElevenLabs characters used	~1800
ElevenLabs cost @ $0.18/1k chars	$0.32
Whisper cost @ $0.006/min	$0.01
GPT-4 cost @ ~3k tokens/min	$0.09
Twilio outbound	$0.022
CallSphere bundled price*	confidential

*Bundled CallSphere pricing is below the Vapi all-in stack of $0.30-$0.33 for an equivalent call once you include the engineering build cost.
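The arithmetic behind the table can be reproduced in a few lines. This sketch uses only the figures quoted above; the function names are hypothetical, and the GPT-4 and Twilio amounts are taken as flat per-call values because the table gives them that way.

```python
# Per-turn latency figures from the table, in milliseconds.
STT_MS, LLM_MS, TTS_MS = 280, 540, 190


def perceived_latency_ms(stt_ms: int = STT_MS,
                         llm_ms: int = LLM_MS,
                         tts_ms: int = TTS_MS) -> int:
    """Gap the caller hears between finishing a sentence and agent audio."""
    return stt_ms + llm_ms + tts_ms


def component_costs(call_seconds: int = 96, tts_chars: int = 1800) -> dict:
    """Per-call cost of each metered component, in USD."""
    return {
        "elevenlabs": tts_chars / 1000 * 0.18,   # $0.18 per 1k characters
        "whisper": call_seconds / 60 * 0.006,    # $0.006 per minute
        "gpt4": 0.09,                            # flat figure from the table
        "twilio": 0.022,                         # flat figure from the table
    }


costs = component_costs()
total = sum(costs.values())          # about $0.45 for the 96-second call
per_minute = total / (96 / 60)       # about $0.28 per minute
```

The listed components come to roughly $0.45 for the call, or about $0.28 per minute, before any platform fee. Passing `stt_ms=110` (the Deepgram Nova-3 figure quoted later in this post) to `perceived_latency_ms` gives 840ms.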

Voice Cloning vs Stock Voices

CallSphere supports custom-cloned voices for enterprise customers who want a branded persona. The clone is built from a 3-minute studio sample, IVA-screened for content rights, and stored in a customer-isolated ElevenLabs project. On Vapi, voice cloning is your responsibility — you upload, you clone, you store the voice ID, you handle takedown if a voice is misused. The compliance burden is real.

Outage Math

ElevenLabs has had three regional outages in the past 18 months — two in eu-west-1, one in us-east-1. Each lasted 23-87 minutes. On Vapi, an ElevenLabs outage stops every call until you implement a fallback. CallSphere routes automatically to the healthy region within 4 seconds of detection, and falls through to a secondary TTS provider (Cartesia or PlayHT depending on contract) if both ElevenLabs regions are degraded. We measured zero customer-visible voice outages in the same 18-month window.
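The fall-through order described here (primary region, then the other region, then a secondary provider) can be sketched as an ordered chain. The backend labels and the `pick_tts_backend` function are hypothetical, and the health map is assumed to be produced by some external health checker that the post does not describe.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical labels: two ElevenLabs regions, then a secondary TTS provider.
FALLBACK_CHAIN = (
    "elevenlabs:us-east-1",
    "elevenlabs:eu-west-1",
    "secondary-tts",
)


def pick_tts_backend(health: Mapping[str, bool],
                     chain: Sequence[str] = FALLBACK_CHAIN) -> Optional[str]:
    """Return the first backend in the chain that is reported healthy.

    Backends missing from the health map are treated as unhealthy, so an
    unknown state never receives traffic. Returns None if nothing is healthy.
    """
    for backend in chain:
        if health.get(backend, False):
            return backend
    return None
```

Treating a missing health entry as unhealthy is a deliberately conservative choice; a team building this layer on Vapi would also need to decide how quickly a recovered region returns to the front of the chain.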

The Stability/Similarity/Style Tuning No One Tells You About

ElevenLabs exposes three sliders that decide how a voice sounds. Most teams set them to defaults and accept the result. The defaults are wrong for sales conversations.

Stability (0.0-1.0) controls voice consistency turn-to-turn. At 0.0 the voice is emotional but jittery and unpredictable. At 1.0 the voice is consistent but flat and robotic. CallSphere uses 0.45 for cold outbound (slightly emotional, varied, sounds engaged) and 0.55 for warm inbound (more consistent, calmer, fewer surprises).

Similarity Boost (0.0-1.0) controls how closely the model matches the source voice. Below 0.6 the voice drifts. Above 0.85 the voice can sound over-trained and stiff. CallSphere uses 0.75-0.80 depending on use case.

Style Exaggeration (0.0-1.0) is the post-2024 slider that controls expressiveness. Sales agents need some expressiveness, because flat sales voices lose engagement. CallSphere uses 0.30 for cold outbound and 0.20 for inbound, after A/B tuning across 12,000+ calls.

We do not publish these settings as recommendations because they are interlocked with the prompt and the voice. Together they form a tuned package. Customers who reach for the sliders themselves on Vapi often hit the same wall: "the voice sounds weird and I don't know which slider is wrong."
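One way to keep the three values interlocked is to store them as a single immutable preset per use case rather than as three independent knobs. The sketch below uses the values quoted in this post; the `VoicePreset` type, the preset keys, and the `preset_for` function are hypothetical names.

```python
from typing import NamedTuple


class VoicePreset(NamedTuple):
    """The three slider values treated as one unit."""
    stability: float
    similarity_boost: float
    style: float


# Values quoted in this post for the two sales use cases.
PRESETS = {
    "outbound_cold": VoicePreset(0.45, 0.75, 0.30),
    "inbound": VoicePreset(0.55, 0.80, 0.20),
}


def preset_for(use_case: str) -> VoicePreset:
    """Look up the whole preset; unknown use cases fail loudly."""
    try:
        return PRESETS[use_case]
    except KeyError:
        raise ValueError(f"no tuned preset for use case {use_case!r}") from None
```

Because a `NamedTuple` is immutable, a caller cannot nudge one slider in place and leave the other two out of step.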

STIR/SHAKEN and Caller ID Reputation

A modern outbound sales platform has to manage caller-ID reputation actively. The STIR/SHAKEN attestation framework, which Twilio implements, certifies the authenticity of the originating caller ID. Without proper attestation, US carriers increasingly mark calls "Spam Likely", typically dropping connect rates by 38-52%.

CallSphere manages STIR/SHAKEN attestation centrally:

All Twilio numbers have full A-attestation.
Numbers are warmed before mass dialing (12-day ramp from 30 dials/day to 300/day per number).
Reputation is monitored daily via Twilio's Voice Insights and Hiya/RoboKiller feeds.
Numbers showing reputation degradation are quarantined and replaced.

On Vapi, you bring the Twilio account and you manage attestation. Most customers do not realize they need to until connect rates collapse three months in.
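The 12-day warm-up from 30 to 300 dials per day can be expressed as a simple schedule. The post gives only the endpoints, so the linear shape below is an assumption, and `warmup_schedule` is a hypothetical helper name.

```python
def warmup_schedule(days: int = 12, start: int = 30, end: int = 300) -> list[int]:
    """Daily dial caps for one number, ramping linearly from start to end.

    Integer arithmetic keeps the first and last days exactly on the
    endpoints; intermediate days are rounded down.
    """
    if days < 2:
        raise ValueError("a ramp needs at least two days")
    span = end - start
    return [start + span * day // (days - 1) for day in range(days)]
```

A team might prefer a slower start and a steeper finish; the point of the sketch is only that the cap is computed per number per day and enforced by the dialer.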

Voice Routing by Region

CallSphere routes ElevenLabs traffic by caller geography:

Caller Region	Primary	Failover
US East	us-east-1	us-west-2
US West	us-west-2	us-east-1
Canada	us-east-1	eu-west-1
UK / EU	eu-west-1	us-east-1
AU / NZ	ap-southeast-2	us-west-2

The routing decision is made per-call from the originating phone number's NPA-NXX. Total added latency from routing logic: 2-4ms. The benefit is 90-180ms saved on TTS first-byte versus naive single-region routing.
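The table above maps directly onto a lookup. This is a minimal sketch of that lookup, assuming the caller region has already been derived from the phone number and that a set of currently healthy regions is available; `ROUTES` and `route_tts` are hypothetical names.

```python
from typing import Optional

# Caller region -> (primary, failover), copied from the routing table.
ROUTES = {
    "US East": ("us-east-1", "us-west-2"),
    "US West": ("us-west-2", "us-east-1"),
    "Canada": ("us-east-1", "eu-west-1"),
    "UK / EU": ("eu-west-1", "us-east-1"),
    "AU / NZ": ("ap-southeast-2", "us-west-2"),
}


def route_tts(caller_region: str, healthy: set[str]) -> Optional[str]:
    """Pick the primary region if healthy, else the failover, else None."""
    primary, failover = ROUTES[caller_region]
    if primary in healthy:
        return primary
    if failover in healthy:
        return failover
    return None
```

A dictionary lookup followed by two set membership checks is cheap, which is consistent with routing logic adding only a few milliseconds per call.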

FAQ

Can I use a voice other than Sarah?

Yes. Enterprise customers can opt into other ElevenLabs presets or a cloned voice. Sarah is the default because it wins on engagement metrics for cold sales. For inbound or industry-specific use cases (medical, legal), we have validated other presets.

Does CallSphere bill ElevenLabs separately?

No. ElevenLabs cost is bundled into CallSphere's per-minute or per-call price. You get one invoice from CallSphere, not five from Vapi + ElevenLabs + OpenAI + Deepgram + Twilio.

What about latency in non-US regions?

CallSphere routes ElevenLabs traffic through the geographically closest healthy region. We have benchmarks below 220ms first-byte from London, Sydney, Toronto, and Mumbai. Vapi's latency depends on which providers you stack and where they host.

Can I bring my own ElevenLabs voice clone?

Yes, on enterprise contracts. We import the voice ID into a customer-isolated project and run it through the same tuning pipeline as Sarah.

What is the failure mode if ElevenLabs goes down completely?

CallSphere has a contracted secondary TTS provider that the runtime fails over to within seconds. Voice quality degrades slightly (Sarah is irreplaceable) but calls continue. On Vapi you must build this layer.

How do you handle interruption?

Sarah's TTS streams in 60ms chunks. When the prospect speaks during agent output, the platform detects voice activity within 80-120ms and cancels the remaining TTS stream. The agent buffers the interruption transcript and continues from the prospect's new utterance. Vapi supports interruption but the tuning is up to you; aggressive cancellation feels twitchy, lazy cancellation feels rude.
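A small simulation shows why chunk size matters for barge-in. It assumes chunks are sent back-to-back in real time and that cancellation takes effect at the next chunk boundary after voice activity is detected; both function names are hypothetical and this is not CallSphere's or Vapi's actual cancellation logic.

```python
def chunks_sent_before_cancel(total_chunks: int, chunk_ms: int,
                              speech_start_ms: int, vad_delay_ms: int) -> int:
    """Count chunks whose send time falls before the cancel takes effect."""
    cancel_at_ms = speech_start_ms + vad_delay_ms
    sent = 0
    for index in range(total_chunks):
        if index * chunk_ms >= cancel_at_ms:
            break  # cancel is active: drop this and all later chunks
        sent += 1
    return sent


def overrun_ms(total_chunks: int, chunk_ms: int,
               speech_start_ms: int, vad_delay_ms: int) -> int:
    """Agent audio still played after the cancel decision, in milliseconds."""
    sent = chunks_sent_before_cancel(total_chunks, chunk_ms,
                                     speech_start_ms, vad_delay_ms)
    cancel_at_ms = speech_start_ms + vad_delay_ms
    return max(0, sent * chunk_ms - cancel_at_ms)
```

With the prospect starting to speak at 1000ms and a 100ms detection delay, 60ms chunks leave 40ms of trailing agent audio, while 150ms chunks leave 100ms. Under these assumptions the overrun is always smaller than one chunk, so smaller chunks give a tighter cut-off.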

Can I see audio quality metrics?

Yes. Every call logs first-byte latency, total turn latency, audio bitrate, packet loss, and agent-detected interruption count to call_events. Sales managers can filter for low-quality calls and re-listen.

What about regional accents and dialects?

Sarah is mid-Atlantic US English. For UK customers we offer a UK-English Sarah-equivalent voice (Charlotte). For Australian customers we offer Lily. All are pre-tuned. Custom accents via voice cloning are available on enterprise contracts.

Streaming Architecture: Why 60ms Chunks Matter

Audio streaming chunk size is the unsung hero of conversational latency. There are two failure modes and one sweet spot:

Too small (10-30ms): each chunk is its own HTTP/2 frame. Network overhead dominates. Audio sounds choppy on lossy connections (mobile data).
Too large (200-500ms): first-chunk latency is high. Users perceive delay before the agent starts speaking. Interruption detection degrades.
Just right (50-80ms): smooth audio, low first-byte latency, fast interruption recovery.

CallSphere uses 60ms chunks for ElevenLabs streams, validated across mobile, VOIP, and PSTN paths. Vapi's default is 100-150ms chunks because that is more forgiving for diverse customer setups, but the latency tax is measurable.
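For a sense of scale, the sketch below splits a buffer of audio into 60ms chunks. It assumes 8 kHz, 8-bit mu-law audio, a common telephony format that this post does not itself specify, and the function names are hypothetical.

```python
def chunk_size_bytes(chunk_ms: int, sample_rate_hz: int = 8000,
                     bytes_per_sample: int = 1) -> int:
    """Bytes needed to hold chunk_ms of audio at the given format."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def split_chunks(audio: bytes, chunk_ms: int = 60,
                 sample_rate_hz: int = 8000,
                 bytes_per_sample: int = 1) -> list[bytes]:
    """Split audio into fixed-duration chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms, sample_rate_hz, bytes_per_sample)
    if size <= 0:
        raise ValueError("chunk duration is too short for this sample rate")
    return [audio[i:i + size] for i in range(0, len(audio), size)]


one_second = bytes(8000)            # one second of silence-sized payload
chunks = split_chunks(one_second)   # 60ms chunks of 480 bytes each
```

At this format a 60ms chunk is 480 bytes, so one second of audio becomes 17 chunks with a shorter final one. A 10ms chunk would be only 80 bytes, which is where per-chunk network overhead starts to dominate the payload.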

Real-Time Voice Interrupts and Backchannel

Sarah's runtime supports natural backchannel — short "uh-huh," "right," "got it" interjections during the prospect's speech. These are pre-rendered audio clips inserted into the audio stream when the model detects extended caller monologue. The result feels like a human listening, not a tape recorder waiting for silence.

Backchannel is a hard engineering problem. Vapi does not include it. Most Vapi-based agents sound robotic during long caller turns because they do not interject.

STT Choice: Whisper vs Deepgram vs AssemblyAI

CallSphere defaults to OpenAI Whisper for STT because of its accent robustness and punctuation accuracy. For latency-sensitive deployments (live conversational sales), customers can switch to Deepgram Nova-3 (~110ms streaming latency vs ~280ms for Whisper). The choice is per-tenant configuration.

Vapi supports the same STT providers but every tenant has to choose, configure, and pay separately. Diagnosing whether a quality issue is STT or TTS or LLM in a stacked Vapi setup is hard. CallSphere's bundled stack has end-to-end observability into every layer.

Cost Predictability for Finance Teams

A subtle but important point: bundled per-minute pricing is something Finance teams can model. Stacked Vapi pricing (Vapi platform + ElevenLabs characters + OpenAI tokens + Deepgram seconds + Twilio minutes) requires five different cost lines, five different invoices, and five different unit consumption models. We have audited Vapi customers who consistently underestimated their monthly bill by 30-50% because the ElevenLabs character cost on long voicemails surprised them.

CallSphere's invoice is one number. Finance can plan.

Skip the Voice Tuning Marathon

If you do not want to spend three weeks A/B testing voices and tuning ElevenLabs sliders, CallSphere has done the work. Book a demo at /demo to hear Sarah on your script. See the full sales product at /industries/sales.
