---
title: "ElevenLabs Sarah Voice in CallSphere vs Configuring on Vapi"
description: "CallSphere ships the ElevenLabs Sarah voice tuned for sales conversations. On Vapi you bring your own ElevenLabs API key and tune everything yourself."
canonical: https://callsphere.ai/blog/elevenlabs-sarah-voice-callsphere-vs-vapi-config
category: "Technical Guides"
tags: ["ElevenLabs", "Voice AI", "TTS", "Vapi Comparison", "Sales AI", "CallSphere"]
author: "CallSphere Team"
published: 2026-04-17T00:00:00.000Z
updated: 2026-05-04T21:47:35.103Z
---

# ElevenLabs Sarah Voice in CallSphere vs Configuring on Vapi

> CallSphere ships the ElevenLabs Sarah voice tuned for sales conversations. On Vapi you bring your own ElevenLabs API key and tune everything yourself.

## TL;DR

CallSphere's Sales Calling Platform ships **ElevenLabs Conversational AI with the "Sarah" voice** integrated, tuned, and production-hardened end-to-end. On Vapi.ai, you supply your own ElevenLabs API key, choose a voice, set stability/similarity/style sliders by hand, configure streaming chunks, manage the cost meter, and own the failure modes when ElevenLabs has a regional outage. This post breaks down the entire voice stack on both platforms, with a Mermaid architecture diagram, latency math, and a fallback strategy that production teams need.

## Why the Voice Choice Decides the Sale

Cold-prospect sales calls have a survival window. The first three seconds decide whether the prospect engages or hangs up. ElevenLabs' 2025 voice perception study (n=4,200 listeners across US/UK/AU) found that **the perceived warmth and confidence of the voice predicts engagement rate** at p […]

```mermaid
flowchart TD
    A --> B[Twilio Carrier]
    B --> C[CallSphere Voice Gateway]
    C --> D[OpenAI Whisper Streaming STT]
    D --> E[Triage Agent GPT-4]
    E --> F{Specialist?}
    F -->|Outbound| G[Outbound Sales Agent]
    F -->|Inbound| H[Inbound Sales Agent]
    F -->|Lead| I[Lead Agent]
    F -->|Appt| J[Appointment Agent]
    G --> K[Tool Calls: score, qualify, calendar]
    H --> K
    I --> K
    J --> K
    K --> L[Response Tokens Streaming]
    L --> M[ElevenLabs Sarah TTS]
    M --> N{Region Healthy?}
    N -->|Yes| O[us-east-1 Stream]
    N -->|No| P[Fallback eu-west-1]
    O --> Q[60ms Audio Chunks]
    P --> Q
    Q --> R[Twilio Stream Back]
    R --> A
```

Every node is observable. CallSphere logs first-byte latency, total turn latency, interruption events, and audio quality scores into `call_events` so the SRE team can detect degradation in minutes, not hours.

## Worked Example: A 90-Second Outbound Call

A real outbound call on CallSphere has a measurable latency and cost budget. Here is the breakdown:

| Stage / line item | Value |
| --- | --- |
| Twilio dial → connect | 4-7s |
| First agent greeting (TTS first byte) | 180ms |
| Caller speaks 8s | 8000ms |
| Whisper STT to transcript | 280ms |
| GPT-4 first response token | 540ms |
| ElevenLabs TTS first byte | 190ms |
| Caller-perceived response latency (STT + LLM + TTS) | ~1010ms |
| Total call: 8 turns × ~12s avg | 96s |
| ElevenLabs characters used | ~1800 |
| ElevenLabs cost @ $0.18/1k chars | $0.32 |
| Whisper cost @ $0.006/min | $0.01 |
| GPT-4 cost @ ~3k tokens/min | $0.09 |
| Twilio outbound | $0.022 |
| CallSphere bundled price* | confidential |

*Bundled CallSphere pricing is below the Vapi all-in stack of $0.30-$0.33 for an equivalent call once you include the engineering build cost.
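
The derived rows in the table can be checked with a few lines of arithmetic. The sketch below only re-computes figures already quoted above (the constant names are illustrative, not CallSphere identifiers); the GPT-4 and Twilio rows are left out because the table gives them as per-call totals rather than rate times quantity.

```python
# Figures copied from the table above; nothing here is independently measured.
STT_MS = 280   # Whisper STT to transcript
LLM_MS = 540   # GPT-4 first response token
TTS_MS = 190   # ElevenLabs TTS first byte

# Latency the caller perceives between finishing a sentence and hearing audio.
perceived_ms = STT_MS + LLM_MS + TTS_MS

CALL_SECONDS = 8 * 12            # 8 turns x ~12 s average
TTS_CHARS = 1800                 # ElevenLabs characters used
TTS_USD_PER_1K_CHARS = 0.18
STT_USD_PER_MIN = 0.006

tts_cost = TTS_CHARS / 1000 * TTS_USD_PER_1K_CHARS
stt_cost = CALL_SECONDS / 60 * STT_USD_PER_MIN

print(f"perceived latency: {perceived_ms} ms")
print(f"call length: {CALL_SECONDS} s")
print(f"ElevenLabs: ${tts_cost:.2f}  Whisper: ${stt_cost:.2f}")
```

The same pattern extends to any other stack: swap in that provider's rate and the measured quantity per call.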

## Voice Cloning vs Stock Voices

CallSphere supports custom-cloned voices for enterprise customers who want a branded persona. The clone is built from a 3-minute studio sample, IVA-screened for content rights, and stored in a customer-isolated ElevenLabs project. On Vapi, voice cloning is your responsibility — you upload, you clone, you store the voice ID, you handle takedown if a voice is misused. The compliance burden is real.

## Outage Math

ElevenLabs has had three regional outages in the past 18 months — two in eu-west-1, one in us-east-1. Each lasted 23-87 minutes. On Vapi, an ElevenLabs outage stops every call until you implement a fallback. CallSphere routes automatically to the healthy region within 4 seconds of detection, and falls through to a secondary TTS provider (Cartesia or PlayHT depending on contract) if both ElevenLabs regions are degraded. We measured zero customer-visible voice outages in the same 18-month window.
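
The fall-through described above (primary region, then the other ElevenLabs region, then a secondary provider) amounts to walking an ordered preference list. The following is a minimal sketch of that idea, not CallSphere's implementation: the route labels, the `pick_tts_route` function, and the shape of the health map are all hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple

# (provider, region). Region names come from the text; the provider keys
# are illustrative labels, not real configuration values.
Route = Tuple[str, Optional[str]]

DEFAULT_CHAIN: Sequence[Route] = (
    ("elevenlabs", "us-east-1"),
    ("elevenlabs", "eu-west-1"),
    ("secondary_tts", None),  # e.g. Cartesia or PlayHT, depending on contract
)


def pick_tts_route(
    healthy: Mapping[Route, bool],
    chain: Sequence[Route] = DEFAULT_CHAIN,
) -> Route:
    """Return the first route in the chain that is currently marked healthy."""
    for route in chain:
        # Routes missing from the health map are treated as unhealthy.
        if healthy.get(route, False):
            return route
    raise RuntimeError("no healthy TTS route available")


# Example: both ElevenLabs regions degraded, secondary provider up.
health = {
    ("elevenlabs", "us-east-1"): False,
    ("elevenlabs", "eu-west-1"): False,
    ("secondary_tts", None): True,
}
print(pick_tts_route(health))
```

A team building this on Vapi would also need the health probe that fills the map, which is where most of the real work sits.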

## The Stability/Similarity/Style Tuning No One Tells You About

ElevenLabs exposes three sliders that decide how a voice sounds. Most teams set them to defaults and accept the result. The defaults are wrong for sales conversations.

**Stability (0.0-1.0)** controls voice consistency turn-to-turn. At 0.0 the voice is jittery and emotional but unpredictable. At 1.0 the voice is consistent but flat and robotic. CallSphere uses **0.45 for cold outbound** (slightly emotional, varied, sounds engaged) and **0.55 for warm inbound** (more consistent, calmer, less surprise).

**Similarity Boost (0.0-1.0)** controls how closely the model matches the source voice. Below 0.6 the voice drifts. Above 0.85 the voice can sound over-trained and stiff. CallSphere uses **0.75-0.80** depending on use case.

**Style Exaggeration (0.0-1.0)** is the post-2024 slider that controls expressiveness. Sales agents need some expressiveness, because flat sales voices lose engagement. CallSphere uses **0.30 for cold** and **0.20 for inbound**, after A/B tuning across 12,000+ calls.

We share these values for reference, not as recommendations, because they are interlocked with the prompt and the voice. Together they form a tuned package. Customers who reach for the sliders themselves on Vapi often hit the same wall: "the voice sounds weird and I don't know which slider is wrong."
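
For teams configuring this by hand, the three sliders map onto a small settings object. The sketch below assumes the field names `stability`, `similarity_boost`, and `style` from the ElevenLabs text-to-speech `voice_settings` body; verify them against the current API reference before relying on them. The preset names and helper functions are hypothetical, and the text does not say which end of the 0.75-0.80 similarity range goes with which use case, so that split is a guess.

```python
import json

# Values quoted in the text. The similarity_boost assignment per use case
# is an ASSUMPTION; the text only gives the range 0.75-0.80.
PRESETS = {
    "cold_outbound": {"stability": 0.45, "similarity_boost": 0.80, "style": 0.30},
    "warm_inbound": {"stability": 0.55, "similarity_boost": 0.75, "style": 0.20},
}


def voice_settings(use_case: str) -> dict:
    """Return a validated copy of the slider values for one use case."""
    settings = dict(PRESETS[use_case])
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside the 0.0-1.0 slider range")
    return settings


def tts_request_body(text: str, use_case: str) -> dict:
    """Build a request body; sending it to the API is left out of this sketch."""
    return {"text": text, "voice_settings": voice_settings(use_case)}


print(json.dumps(tts_request_body("Hi, this is Sarah.", "cold_outbound"), indent=2))
```

Keeping presets in one place makes it easier to A/B one slider at a time, which helps with the "which slider is wrong" problem.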

## STIR/SHAKEN and Caller ID Reputation

A modern outbound sales platform has to manage caller-ID reputation actively. STIR/SHAKEN, the industry caller-ID authentication framework that Twilio implements, attests to the authenticity of the originating call. Without proper attestation, US carriers increasingly mark calls "Spam Likely", typically dropping connect rates by 38-52%.

CallSphere manages STIR/SHAKEN attestation centrally:

- All Twilio numbers have full A-attestation.
- Numbers are warmed before mass dialing (12-day ramp from 30 dials/day to 300/day per number).
- Reputation is monitored daily via Twilio's Voice Insights and Hiya/RoboKiller feeds.
- Numbers showing reputation degradation are quarantined and replaced.

On Vapi, you bring the Twilio account and you manage attestation. Most customers do not realize they need to until connect rates collapse three months in.
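
The warm-up ramp is easy to turn into a per-day dial cap. The text gives only the endpoints (30 dials/day rising to 300/day over 12 days), so the linear shape below is an assumption, and `warmup_schedule` is a hypothetical helper rather than a CallSphere or Twilio API.

```python
from typing import List


def warmup_schedule(start: int = 30, end: int = 300, days: int = 12) -> List[int]:
    """Daily dial caps for one number, interpolated linearly (assumed shape)."""
    if days < 2:
        raise ValueError("a ramp needs at least two days")
    step = (end - start) / (days - 1)
    return [round(start + step * day) for day in range(days)]


caps = warmup_schedule()
for day, cap in enumerate(caps, start=1):
    print(f"day {day:2d}: up to {cap} dials")
```

A geometric ramp would front-load fewer dials; either way, the dialer should enforce the cap per number rather than per campaign.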

## Voice Routing by Region

CallSphere routes ElevenLabs traffic by caller geography:

| Caller Region | Primary | Failover |
| --- | --- | --- |
| US East | us-east-1 | us-west-2 |
| US West | us-west-2 | us-east-1 |
| Canada | us-east-1 | eu-west-1 |
| UK / EU | eu-west-1 | us-east-1 |
| AU / NZ | ap-southeast-2 | us-west-2 |

The routing decision is made per-call from the originating phone number's NPA-NXX. Total added latency from routing logic: 2-4ms. The benefit is 90-180ms saved on TTS first-byte versus naive single-region routing.
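
The routing table reduces to a lookup plus a health check. This is a minimal sketch under stated assumptions: the table contents are copied from above, `pick_region` is a hypothetical name, and the step that maps an NPA-NXX prefix to a caller region is not shown because the text does not describe it.

```python
from typing import Dict, Set, Tuple

# caller region -> (primary, failover), copied from the table above.
ROUTES: Dict[str, Tuple[str, str]] = {
    "US East": ("us-east-1", "us-west-2"),
    "US West": ("us-west-2", "us-east-1"),
    "Canada": ("us-east-1", "eu-west-1"),
    "UK / EU": ("eu-west-1", "us-east-1"),
    "AU / NZ": ("ap-southeast-2", "us-west-2"),
}


def pick_region(caller_region: str, healthy: Set[str]) -> str:
    """Prefer the primary region; use the failover if the primary is unhealthy."""
    primary, failover = ROUTES[caller_region]
    if primary in healthy:
        return primary
    if failover in healthy:
        return failover
    raise LookupError(f"no healthy region for {caller_region!r}")


print(pick_region("UK / EU", {"us-east-1", "eu-west-1"}))
print(pick_region("AU / NZ", {"us-west-2"}))
```

Because the lookup is a dictionary access and two set checks, it is plausible for the routing logic itself to cost only a few milliseconds per call.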

## FAQ

### Can I use a voice other than Sarah?

Yes. Enterprise customers can opt into other ElevenLabs presets or a cloned voice. Sarah is the default because it wins on engagement metrics for cold sales. For inbound or industry-specific use cases (medical, legal), we have other validated presets.

### Does CallSphere bill ElevenLabs separately?

No. ElevenLabs cost is bundled into CallSphere's per-minute or per-call price. You get one invoice from CallSphere, not five from Vapi + ElevenLabs + OpenAI + Deepgram + Twilio.

### What about latency in non-US regions?

CallSphere routes ElevenLabs traffic through the geographically closest healthy region. We have benchmarks below 220ms first-byte from London, Sydney, Toronto, and Mumbai. Vapi's latency depends on which providers you stack and where they host.

### Can I bring my own ElevenLabs voice clone?

Yes, on enterprise contracts. We import the voice ID into a customer-isolated project and run it through the same tuning pipeline as Sarah.

### What is the failure mode if ElevenLabs goes down completely?

CallSphere has a contracted secondary TTS provider that the runtime fails over to within seconds. Voice quality degrades slightly (Sarah is irreplaceable) but calls continue. On Vapi you must build this layer.

### How do you handle interruption?

Sarah's TTS streams in 60ms chunks. When the prospect speaks during agent output, the platform detects voice activity within 80-120ms and cancels the remaining TTS stream. The agent buffers the interruption transcript and continues from the prospect's new utterance. Vapi supports interruption but the tuning is up to you; aggressive cancellation feels twitchy, lazy cancellation feels rude.

### Can I see audio quality metrics?

Yes. Every call logs first-byte latency, total turn latency, audio bitrate, packet loss, and agent-detected interruption count to `call_events`. Sales managers can filter for low-quality calls and re-listen.

### What about regional accents and dialects?

Sarah is mid-Atlantic US English. For UK customers we offer a UK-English Sarah-equivalent voice (Charlotte). For Australian customers we offer Lily. All are pre-tuned. Custom accents via voice cloning are available on enterprise contracts.

## Streaming Architecture: Why 60ms Chunks Matter

Audio streaming chunk size is the unsung hero of conversational latency. It falls into three regimes, two of which fail:

- **Too small (10-30ms)**: each chunk is its own HTTP/2 frame. Network overhead dominates. Audio sounds choppy on lossy connections (mobile data).
- **Too large (200-500ms)**: first-chunk latency is high. Users perceive delay before the agent starts speaking. Interruption detection degrades.
- **Just right (50-80ms)**: smooth audio, low first-byte latency, fast interruption recovery.

CallSphere uses 60ms chunks for ElevenLabs streams, validated across mobile, VoIP, and PSTN paths. Vapi's default is 100-150ms chunks because that is more forgiving for diverse customer setups, but the latency tax is measurable.
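
The trade-off can be made concrete with some back-of-the-envelope numbers. The sketch below assumes 8 kHz, 8-bit mu-law mono audio (one byte per sample, the usual telephony format) and a simplified interruption model in which a chunk that has already been handed off plays out in full. `chunk_stats` and the 120ms detection default are illustrative choices, not measured platform behavior.

```python
import math

# ASSUMPTION: 8 kHz, 8-bit mu-law mono, i.e. one byte per sample.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1


def chunk_stats(chunk_ms: int, detect_ms: int = 120) -> dict:
    """Payload size, send rate, and worst-case barge-in overshoot for a chunk size."""
    bytes_per_chunk = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    chunks_per_second = 1000 / chunk_ms
    # Simplification: audio keeps playing in whole chunks until voice
    # activity detection completes, so overshoot rounds up to a chunk boundary.
    overshoot_ms = math.ceil(detect_ms / chunk_ms) * chunk_ms
    return {
        "bytes_per_chunk": bytes_per_chunk,
        "chunks_per_second": chunks_per_second,
        "overshoot_ms": overshoot_ms,
    }


for size in (20, 60, 150, 500):
    print(size, chunk_stats(size))
```

Under these assumptions, small chunks pay in message count (50 sends per second at 20ms) while large chunks pay in how long the agent keeps talking after the prospect interrupts.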

## Real-Time Voice Interrupts and Backchannel

Sarah's runtime supports natural backchannel — short "uh-huh," "right," "got it" interjections during the prospect's speech. These are pre-rendered audio clips inserted into the audio stream when the model detects extended caller monologue. The result feels like a human listening, not a tape recorder waiting for silence.

Backchannel is a hard engineering problem. Vapi does not include it. Most Vapi-based agents sound robotic during long caller turns because they do not interject.

## STT Choice: Whisper vs Deepgram vs AssemblyAI

CallSphere defaults to OpenAI Whisper for STT because of its accent robustness and punctuation accuracy. For latency-sensitive deployments (live conversational sales), customers can switch to Deepgram Nova-3 (~110ms streaming latency vs ~280ms for Whisper). The choice is per-tenant configuration.
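
To see what the STT swap buys per turn, the quoted STT latencies can be added to the LLM and TTS figures from the worked example earlier in this post. This is a rough sketch: the dictionary keys are hypothetical configuration names, and it assumes the LLM (540ms) and TTS (190ms) stages are unchanged by the STT choice.

```python
# STT latencies quoted in the text; the keys are illustrative names only.
STT_MS = {"whisper": 280, "deepgram-nova-3": 110}

# From the worked example: GPT-4 first token and ElevenLabs first byte.
LLM_FIRST_TOKEN_MS = 540
TTS_FIRST_BYTE_MS = 190


def turn_latency_ms(stt_provider: str) -> int:
    """Approximate caller-perceived latency for one turn with a given STT."""
    return STT_MS[stt_provider] + LLM_FIRST_TOKEN_MS + TTS_FIRST_BYTE_MS


for provider in STT_MS:
    print(f"{provider}: ~{turn_latency_ms(provider)} ms per turn")
```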

Vapi supports the same STT providers but every tenant has to choose, configure, and pay separately. Diagnosing whether a quality issue is STT or TTS or LLM in a stacked Vapi setup is hard. CallSphere's bundled stack has end-to-end observability into every layer.

## Cost Predictability for Finance Teams

A subtle but important point: bundled per-minute pricing is something Finance teams can model. Stacked Vapi pricing (Vapi platform + ElevenLabs characters + OpenAI tokens + Deepgram seconds + Twilio minutes) requires five different cost lines, five different invoices, and five different unit consumption models. We have audited Vapi customers who consistently underestimated their monthly bill by 30-50% because the ElevenLabs character cost on long voicemails surprised them.

CallSphere's invoice is one number. Finance can plan.

## Skip the Voice Tuning Marathon

If you do not want to spend three weeks A/B testing voices and tuning ElevenLabs sliders, CallSphere has done the work. Book a demo at [/demo](/demo) to hear Sarah on your script. See the full sales product at [/industries/sales](/industries/sales).

---

Source: https://callsphere.ai/blog/elevenlabs-sarah-voice-callsphere-vs-vapi-config