ElevenLabs Sarah Voice in CallSphere vs Configuring on Vapi
CallSphere ships the ElevenLabs Sarah voice tuned for sales conversations. On Vapi you bring your own ElevenLabs API key and tune everything yourself.
TL;DR
CallSphere's Sales Calling Platform ships ElevenLabs Conversational AI with the "Sarah" voice integrated, tuned, and production-hardened end-to-end. On Vapi.ai, you supply your own ElevenLabs API key, choose a voice, set stability/similarity/style sliders by hand, configure streaming chunks, manage the cost meter, and own the failure modes when ElevenLabs has a regional outage. This post breaks down the entire voice stack on both platforms, with a Mermaid architecture diagram, latency math, and a fallback strategy that production teams need.
Why the Voice Choice Decides the Sale
Cold-prospect sales calls have a survival window. The first three seconds decide whether the prospect engages or hangs up. ElevenLabs' 2025 voice perception study (n=4,200 listeners across US/UK/AU) found that the perceived warmth and confidence of the voice predicts engagement rate at p<0.001. The "Sarah" voice — a US-English mid-30s female timbre — outperformed every other ElevenLabs preset on engagement, trust, and willingness-to-continue scores.
CallSphere selected Sarah after running side-by-side A/B tests on real outbound campaigns: identical script, identical lead list, six different voices. Sarah delivered the highest qualified-rate (+11.4% over the next-best preset) and the lowest hang-up rate in the first 10 seconds (-18% vs baseline). That decision is now baked into the product. You do not tune it. You do not pick it. You get the answer.
The Vapi Voice Stack: Bring Your Own Everything
Vapi is provider-agnostic, which is a feature when you want flexibility and a tax when you want defaults. To run an ElevenLabs voice on Vapi you:
- Sign up for an ElevenLabs account.
- Pick a paid tier that supports the streaming API (Creator+ at minimum).
- Generate an API key.
- Choose or clone a voice.
- Configure stability (0.0-1.0), similarity boost (0.0-1.0), style exaggeration (0.0-1.0), and use_speaker_boost.
- Pick a model (eleven_turbo_v2_5 for low latency vs eleven_multilingual_v2 for quality).
- Configure streaming chunk size — too small means choppy audio, too large means latency.
- Wire all of it into Vapi's assistant config JSON.
- Monitor your ElevenLabs character spend separately from Vapi's per-minute fee.
Each step is a place where a misconfiguration can crater your conversion. We have audited Vapi deployments where stability was set to 0.0 (jittery, unpredictable) and others where it was set to 1.0 (monotone, lifeless). Both lost calls.
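The slider ranges in the list above can be checked before they ever reach an assistant config. This is a minimal sketch, assuming the ElevenLabs `voice_settings` field names (`stability`, `similarity_boost`, `style`, `use_speaker_boost`). The `VoiceSettings` class and its warning messages are hypothetical helpers, not part of the Vapi or ElevenLabs SDKs.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the ElevenLabs sliders listed above."""
    stability: float
    similarity_boost: float
    style: float
    use_speaker_boost: bool = True

    def __post_init__(self):
        # Each slider is described above as a 0.0-1.0 range.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside 0.0-1.0")

    def warnings(self) -> list[str]:
        """Flag the two stability extremes seen in the audited deployments."""
        notes = []
        if self.stability == 0.0:
            notes.append("stability 0.0: expect jittery output")
        if self.stability == 1.0:
            notes.append("stability 1.0: expect monotone output")
        return notes

    def to_payload(self) -> dict:
        """Return a plain dict that can be serialized to JSON."""
        return asdict(self)
```

Running a check like this in CI catches an out-of-range value or an extreme stability setting before it costs live calls. How the resulting dict maps onto Vapi's assistant JSON depends on the Vapi schema in use and is not shown here.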
The CallSphere Voice Stack: Done
CallSphere's Sales Calling Platform integrates ElevenLabs Conversational AI directly into the agent runtime. The integration includes:
- Voice: Sarah, locked to the optimal preset.
- Model: eleven_turbo_v2_5 with regional routing for <200ms TTS first-byte.
- Stability/Similarity/Style: Tuned per use case (outbound cold = 0.45/0.75/0.30; inbound = 0.55/0.80/0.20).
- Streaming: 60ms audio chunks for low-latency interruption.
- STT: OpenAI Whisper (or Deepgram, configurable per tenant).
- LLM: GPT-4 with the five specialist agents.
- Telephony: Twilio, with carrier-grade STIR/SHAKEN attestation.
- Fallback: Automatic regional rerouting if ElevenLabs eu-west-1 has an outage.
The customer does not see any of this. They see "calls work."
Comparison Table
| Component | CallSphere | Vapi |
|---|---|---|
| ElevenLabs API key | Bundled | BYO |
| Voice selected | Sarah (locked, optimal) | You pick from catalog |
| Stability/similarity tuning | Pre-tuned per use case | You tune by hand |
| TTS streaming chunks | 60ms tuned | You configure |
| Regional fallback | Automatic eu→us reroute | You build |
| Cost transparency | Bundled per minute | Stacked: Vapi + ElevenLabs + LLM + STT + Twilio |
| Time to first live call | Under 1 hour | Days of integration |
| Voice consistency across calls | Guaranteed | Depends on your config |
| First-byte latency target | <200ms | Depends on your tuning |
| Outage blast radius | CallSphere absorbs | Your runtime breaks |
The Voice Architecture
The end-to-end voice path on CallSphere looks like this:
```mermaid
graph TD
    A[Prospect Phone] --> B[Twilio Carrier]
    B --> C[CallSphere Voice Gateway]
    C --> D[OpenAI Whisper Streaming STT]
    D --> E[Triage Agent GPT-4]
    E --> F{Specialist?}
    F -->|Outbound| G[Outbound Sales Agent]
    F -->|Inbound| H[Inbound Sales Agent]
    F -->|Lead| I[Lead Agent]
    F -->|Appt| J[Appointment Agent]
    G --> K[Tool Calls: score, qualify, calendar]
    H --> K
    I --> K
    J --> K
    K --> L[Response Tokens Streaming]
    L --> M[ElevenLabs Sarah TTS]
    M --> N{Region Healthy?}
    N -->|Yes| O[us-east-1 Stream]
    N -->|No| P[Fallback eu-west-1]
    O --> Q[60ms Audio Chunks]
    P --> Q
    Q --> R[Twilio Stream Back]
    R --> A
```
Every node is observable. CallSphere logs first-byte latency, total turn latency, interruption events, and audio quality scores into call_events so the SRE team can detect degradation in minutes, not hours.
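As an illustration of how logged first-byte latency could drive a degradation alert, here is a minimal sketch. It assumes a window of first-byte samples has already been read from `call_events`. The function names and the choice of a p95 threshold are hypothetical, and only the 200ms target comes from this post.

```python
from statistics import quantiles

FIRST_BYTE_TARGET_MS = 200  # the <200ms TTS first-byte target cited in this post


def p95(values: list[float]) -> float:
    """95th percentile, using the inclusive method of statistics.quantiles."""
    return quantiles(values, n=20, method="inclusive")[-1]


def is_degraded(first_byte_ms: list[float],
                target_ms: float = FIRST_BYTE_TARGET_MS) -> bool:
    """Return True when the window's p95 first-byte latency misses the target."""
    if len(first_byte_ms) < 2:
        return False  # quantiles() needs at least two data points
    return p95(first_byte_ms) >= target_ms
```

A percentile is used rather than a mean so that a handful of slow turns is visible even when most turns are fast.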
Worked Example: A 90-Second Outbound Call
A real outbound call on CallSphere has a measurable latency budget. Here is the breakdown:
| Stage / line item | Time or cost |
|---|---|
| Twilio dial → connect | 4-7s |
| First agent greeting (TTS first byte) | 180ms |
| Caller speaks 8s | 8000ms |
| Whisper STT to transcript | 280ms |
| GPT-4 first response token | 540ms |
| ElevenLabs TTS first byte | 190ms |
| Caller perceives latency | ~1010ms |
| Total call: 8 turns × ~12s avg | 96s |
| ElevenLabs characters used | ~1800 |
| ElevenLabs cost @ $0.18/1k chars | $0.32 |
| Whisper cost @ $0.006/min | $0.01 |
| GPT-4 cost @ ~3k tokens/min | $0.09 |
| Twilio outbound | $0.022 |
| CallSphere bundled price* | confidential |
*Bundled CallSphere pricing is below the Vapi all-in stack of $0.30-$0.33 for an equivalent call once you include the engineering build cost.
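The arithmetic behind the table can be reproduced in a few lines. The sketch below uses only figures from the table, and the variable names are illustrative. The ElevenLabs cost is recomputed from the character count rather than taken from the rounded $0.32.

```python
# Figures copied from the table above.
LATENCY_MS = {"whisper_stt": 280, "gpt4_first_token": 540, "tts_first_byte": 190}
COST_USD = {
    "elevenlabs": 1800 / 1000 * 0.18,  # ~1800 characters at $0.18 per 1k
    "whisper": 0.01,
    "gpt4": 0.09,
    "twilio": 0.022,
}
CALL_SECONDS = 8 * 12  # 8 turns at ~12s average

# Latency the caller perceives between finishing a sentence and hearing audio.
perceived_latency_ms = sum(LATENCY_MS.values())

# Provider line items only; no platform fee or engineering cost is included.
provider_cost_usd = sum(COST_USD.values())
cost_per_minute_usd = provider_cost_usd / (CALL_SECONDS / 60)

print(f"perceived latency: {perceived_latency_ms} ms")
print(f"provider cost per call: ${provider_cost_usd:.2f}")
print(f"provider cost per minute: ${cost_per_minute_usd:.2f}")
```

Under these figures the three latency stages sum to the ~1010ms in the table, and the provider line items total about $0.45 for the 96-second call, or about $0.28 per minute, before any platform fee is added.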
Voice Cloning vs Stock Voices
CallSphere supports custom-cloned voices for enterprise customers who want a branded persona. The clone is built from a 3-minute studio sample, IVA-screened for content rights, and stored in a customer-isolated ElevenLabs project. On Vapi, voice cloning is your responsibility — you upload, you clone, you store the voice ID, you handle takedown if a voice is misused. The compliance burden is real.
Outage Math
ElevenLabs has had three regional outages in the past 18 months — two in eu-west-1, one in us-east-1. Each lasted 23-87 minutes. On Vapi, an ElevenLabs outage stops every call until you implement a fallback. CallSphere routes automatically to the healthy region within 4 seconds of detection, and falls through to a secondary TTS provider (Cartesia or PlayHT depending on contract) if both ElevenLabs regions are degraded. We measured zero customer-visible voice outages in the same 18-month window.
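To put those outage figures in perspective, the sketch below converts them into a downtime range. It assumes, pessimistically, a deployment with no fallback that was exposed to all three incidents, and it approximates 18 months as 540 days. Both are simplifications, not measurements.

```python
OUTAGES = 3
MIN_MINUTES, MAX_MINUTES = 23, 87    # shortest and longest outage cited above
WINDOW_MINUTES = 18 * 30 * 24 * 60   # 18 months, approximated as 30-day months

best_case = OUTAGES * MIN_MINUTES    # every outage at the short end
worst_case = OUTAGES * MAX_MINUTES   # every outage at the long end


def availability(downtime_minutes: int) -> float:
    """Fraction of the window during which TTS was reachable."""
    return 1 - downtime_minutes / WINDOW_MINUTES


print(f"downtime range: {best_case}-{worst_case} minutes")
print(f"availability range: {availability(worst_case):.4%} to {availability(best_case):.4%}")
```

The raw availability still looks high. The cost is concentrated: every call placed during those 69 to 261 minutes fails unless a fallback exists.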
The Stability/Similarity/Style Tuning No One Tells You About
ElevenLabs exposes three sliders that decide how a voice sounds. Most teams set them to defaults and accept the result. The defaults are wrong for sales conversations.
Stability (0.0-1.0) controls voice consistency turn-to-turn. At 0.0 the voice is jittery and emotional but unpredictable. At 1.0 the voice is consistent but flat and robotic. CallSphere uses 0.45 for cold outbound (slightly emotional, varied, sounds engaged) and 0.55 for warm inbound (more consistent, calmer, less surprise).
Similarity Boost (0.0-1.0) controls how closely the model matches the source voice. Below 0.6 the voice drifts. Above 0.85 the voice can sound over-trained and stiff. CallSphere uses 0.75-0.80 depending on use case.
Style Exaggeration (0.0-1.0) is the post-2024 slider that controls expressiveness. Sales agents need some expressiveness — flat sales voices lose engagement. CallSphere uses 0.30 for cold and 0.20 for inbound, after running 12,000+ calls of A/B tuning.
We do not publish these settings as recommendations because they are interlocked with the prompt and the voice. Together they form a tuned package. Customers who reach for the sliders themselves on Vapi often hit the same wall: "the voice sounds weird and I don't know which slider is wrong."
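For readers who want to see how the numbers in this section fit together, here is a minimal sketch of a preset lookup. The values are the ones quoted in this post. The `PRESETS` table, the function names, and the band labels are hypothetical, and the key names follow the ElevenLabs `voice_settings` convention.

```python
# Slider values quoted in this post, keyed by use case.
PRESETS = {
    "outbound_cold": {"stability": 0.45, "similarity_boost": 0.75, "style": 0.30},
    "inbound": {"stability": 0.55, "similarity_boost": 0.80, "style": 0.20},
}


def voice_settings_for(use_case: str) -> dict:
    """Return a copy of the preset so callers cannot mutate the table."""
    if use_case not in PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return dict(PRESETS[use_case])


def similarity_band(value: float) -> str:
    """Classify a similarity boost using the thresholds described above."""
    if value < 0.6:
        return "drift"   # below 0.6 the voice drifts from the source
    if value > 0.85:
        return "stiff"   # above 0.85 it can sound over-trained
    return "ok"
```

Because the sliders are interlocked with the prompt and the voice, a lookup like this is only a starting point for testing, not a recommendation.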
STIR/SHAKEN and Caller ID Reputation
A modern outbound sales platform has to manage caller-ID reputation actively. STIR/SHAKEN is the industry caller-ID authentication framework, which Twilio supports: the originating provider signs each call with an attestation level so that downstream carriers can verify the caller ID. Without proper attestation, calls increasingly get marked "Spam Likely" by US carriers, typically dropping connect rates by 38-52%.
CallSphere manages STIR/SHAKEN attestation centrally:
- All Twilio numbers have full A-attestation.
- Numbers are warmed before mass dialing (12-day ramp from 30 dials/day to 300/day per number).
- Reputation is monitored daily via Twilio's Voice Insights and Hiya/RoboKiller feeds.
- Numbers showing reputation degradation are quarantined and replaced.
On Vapi, you bring the Twilio account and you manage attestation. Most customers do not realize they need to until connect rates collapse three months in.
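The warm-up ramp mentioned in the list above can be expressed as a per-number daily cap. The post does not say whether the ramp is linear or geometric, so this sketch assumes a linear ramp. The `warmup_schedule` function is hypothetical.

```python
def warmup_schedule(days: int = 12, start: int = 30, end: int = 300) -> list[int]:
    """Daily dial caps rising linearly from `start` on day 1 to `end` on the last day."""
    if days < 2:
        raise ValueError("a ramp needs at least two days")
    step = (end - start) / (days - 1)
    return [round(start + step * day) for day in range(days)]


# One cap per day for a single number; the dialer would enforce it per number.
for day, cap in enumerate(warmup_schedule(), start=1):
    print(f"day {day:2d}: up to {cap} dials")
```

With the defaults, the cap rises by roughly 25 dials per day.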
Voice Routing by Region
CallSphere routes ElevenLabs traffic by caller geography:
| Caller Region | Primary | Failover |
|---|---|---|
| US East | us-east-1 | us-west-2 |
| US West | us-west-2 | us-east-1 |
| Canada | us-east-1 | eu-west-1 |
| UK / EU | eu-west-1 | us-east-1 |
| AU / NZ | ap-southeast-2 | us-west-2 |
The routing decision is made per-call from the originating phone number's NPA-NXX. Total added latency from routing logic: 2-4ms. The benefit is 90-180ms saved on TTS first-byte versus naive single-region routing.
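The routing table above translates naturally into a lookup with a health check. This is a minimal sketch: the `ROUTES` data comes from the table, while `pick_region` and the `healthy` set are hypothetical. Mapping an NPA-NXX prefix to a caller region is not shown.

```python
from typing import Optional

# Caller region -> (primary, failover), copied from the table above.
ROUTES = {
    "US East": ("us-east-1", "us-west-2"),
    "US West": ("us-west-2", "us-east-1"),
    "Canada": ("us-east-1", "eu-west-1"),
    "UK / EU": ("eu-west-1", "us-east-1"),
    "AU / NZ": ("ap-southeast-2", "us-west-2"),
}


def pick_region(caller_region: str, healthy: set[str]) -> Optional[str]:
    """Return the primary region if healthy, else the failover, else None."""
    primary, failover = ROUTES[caller_region]
    if primary in healthy:
        return primary
    if failover in healthy:
        return failover
    return None  # both regions degraded: caller must use a secondary TTS provider


print(pick_region("UK / EU", healthy={"us-east-1", "us-west-2"}))
```

A dictionary lookup and two set membership tests are cheap, which is consistent with routing logic adding only a few milliseconds per call.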
FAQ
Can I use a voice other than Sarah?
Yes. Enterprise customers can opt into other ElevenLabs presets or a cloned voice. Sarah is the default because it wins on engagement metrics for cold sales. For inbound or industry-specific use cases (medical, legal), we have validated other presets.
Does CallSphere bill ElevenLabs separately?
No. ElevenLabs cost is bundled into CallSphere's per-minute or per-call price. You get one invoice from CallSphere, not five from Vapi + ElevenLabs + OpenAI + Deepgram + Twilio.
What about latency in non-US regions?
CallSphere routes ElevenLabs traffic through the geographically closest healthy region. We have benchmarks below 220ms first-byte from London, Sydney, Toronto, and Mumbai. Vapi's latency depends on which providers you stack and where they host.
Can I bring my own ElevenLabs voice clone?
Yes for enterprise contracts. We import the voice ID into a customer-isolated project and run it through the same tuning pipeline as Sarah.
What is the failure mode if ElevenLabs goes down completely?
CallSphere has a contracted secondary TTS provider that the runtime fails over to within seconds. Voice quality degrades slightly (Sarah is irreplaceable) but calls continue. On Vapi you must build this layer.
How do you handle interruption?
Sarah's TTS streams in 60ms chunks. When the prospect speaks during agent output, the platform detects voice activity within 80-120ms and cancels the remaining TTS stream. The agent buffers the interruption transcript and continues from the prospect's new utterance. Vapi supports interruption but the tuning is up to you; aggressive cancellation feels twitchy, lazy cancellation feels rude.
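The interaction between chunk size and detection delay can be made concrete. The sketch below assumes speech onset lines up with a chunk boundary and that a chunk which has started playing is not cut off mid-chunk. Both are simplifications, and the function names are hypothetical.

```python
import math

CHUNK_MS = 60  # TTS chunk duration quoted above


def chunks_before_cancel(detect_ms: int, chunk_ms: int = CHUNK_MS) -> int:
    """Chunks whose playback begins before voice activity is detected."""
    return math.ceil(detect_ms / chunk_ms)


def cancel_stream(pending: list[bytes],
                  detect_ms: int) -> tuple[list[bytes], list[bytes]]:
    """Split queued chunks into (still played, dropped) once barge-in is detected."""
    played = min(chunks_before_cancel(detect_ms), len(pending))
    return pending[:played], pending[played:]


queue = [b"chunk-1", b"chunk-2", b"chunk-3", b"chunk-4"]
played, dropped = cancel_stream(queue, detect_ms=100)
print(len(played), "chunks played,", len(dropped), "dropped")
```

With 60ms chunks and an 80-120ms detection window, at most two chunks (120ms of audio) play over the prospect before the stream is cancelled. Larger chunks would lengthen that overlap, which is why chunk size and cancellation tuning interact.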
Can I see audio quality metrics?
Yes. Every call logs first-byte latency, total turn latency, audio bitrate, packet loss, and agent-detected interruption count to call_events. Sales managers can filter for low-quality calls and re-listen.
What about regional accents and dialects?
Sarah is mid-Atlantic US English. For UK customers we offer a UK-English Sarah-equivalent voice (Charlotte). For Australian customers we offer Lily. All are pre-tuned. Custom accents via voice cloning are available on enterprise contracts.
Skip the Voice Tuning Marathon
If you do not want to spend three weeks A/B testing voices and tuning ElevenLabs sliders, CallSphere has done the work. Book a demo at /demo to hear Sarah on your script. See the full sales product at /industries/sales.