Server-Side VAD Reliability: CallSphere vs Vapi Turn Detection
Server-side VAD via the OpenAI Realtime API beats client-side VAD on false triggers and cut-offs. How CallSphere's turn detection compares with Vapi's pipeline.
TL;DR
Voice activity detection (VAD) decides when a caller has stopped speaking and the AI should respond. Get it wrong and the agent talks over the caller, cuts them off mid-sentence, or sits silent waiting for a phantom continuation. CallSphere uses server-side VAD inside the OpenAI Realtime API, which has direct access to the model's acoustic state and outperforms classical client-side VAD on noisy lines, accented speech, and multi-clause utterances. Vapi.ai relies on its hosted pipeline's own VAD (typically a Silero or webrtcvad derivative running before the LLM). For most calls both work; under stress (background noise, long pauses, polite filler words) server-side VAD wins on cut-off rate and turn-taking naturalness.
Why VAD Is Harder Than It Looks
A naive VAD says: when the audio energy drops below a threshold for 300ms, the user is done. This breaks the moment a caller says "I'd like to book... uh... a tour for Saturday." That filler "uh" is often longer than 300ms. The naive VAD cuts the caller off, the agent jumps in, and the conversation derails.
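A minimal sketch of such a naive detector makes the failure concrete (names and values here are illustrative, assuming 20 ms audio frames and a simple RMS energy measure):

```typescript
type Frame = { rms: number }; // RMS energy of one 20 ms audio frame

// Naive VAD: declare end-of-turn after `silenceMs` of low-energy frames.
function naiveTurnEnded(
  frames: Frame[],
  energyThreshold = 0.02,
  silenceMs = 300,
  frameMs = 20,
): boolean {
  const needed = Math.ceil(silenceMs / frameMs); // consecutive quiet frames
  let quiet = 0;
  for (const f of frames) {
    quiet = f.rms < energyThreshold ? quiet + 1 : 0;
    if (quiet >= needed) return true; // fires on ANY long-enough pause
  }
  return false;
}

// "I'd like to book... uh..." — a 400 ms hesitation of quiet frames.
const hesitation: Frame[] = [
  ...Array(10).fill({ rms: 0.2 }),   // speech
  ...Array(20).fill({ rms: 0.005 }), // 400 ms pause (20 quiet frames)
  ...Array(10).fill({ rms: 0.2 }),   // speech resumes
];
```

On this timeline the detector fires during the hesitation, mid-utterance, which is exactly the cut-off described above.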
A second-generation VAD adds an ML classifier (Silero, webrtcvad) that distinguishes speech from silence. Better — but still acoustic-only. It doesn't know whether the caller's sentence is grammatically complete. So it still cuts off "I want to book a tour for..." mid-thought when the caller pauses to remember the date.
The third generation, server-side VAD with model awareness, combines acoustic features with the model's own incremental understanding of the caller's utterance. It detects end-of-turn when the meaning is complete, not just when the energy drops. This is what the OpenAI Realtime API offers.
How Vapi Handles Turn Detection
Vapi's hosted pipeline runs a VAD before the LLM, typically a Silero variant. It exposes parameters like silenceTimeoutSeconds, responseDelaySeconds, and numWordsToInterruptAssistant. These are real and tunable. They give you a lot of room to dial in turn-taking for a given call type.
Limitations:
- The VAD doesn't share state with the LLM. The classifier decides the user has stopped before the LLM sees the full turn, so semantic completion is not a signal.
- Filler words trip the threshold more often. "Um," "uh," and trailing "you know..." cut the caller off if the silence threshold is too tight.
- Tuning is per-account, not per-utterance. You set one threshold and live with it across all callers.
How CallSphere Uses Server-Side VAD
CallSphere routes audio directly into the OpenAI Realtime API with PCM16 24kHz framing. The Realtime API's server-side VAD lives inside the model's compute graph. It uses both acoustic energy and the model's incremental reasoning about whether the utterance is complete.
In practice this means:
- A caller can pause mid-sentence to think — the VAD waits because the model knows the sentence is incomplete.
- A caller can finish abruptly with a tone-marker like "...so yeah" — the VAD fires because the model recognizes a soft close.
- Background noise (TV, baby crying, traffic) is easier to ignore because the model has acoustic context, not just energy levels.
The configuration looks like:
```json
{
  "type": "server_vad",
  "threshold": 0.5,
  "prefix_padding_ms": 300,
  "silence_duration_ms": 500
}
```
CallSphere defaults to a 500ms silence duration and 300ms prefix padding for natural-sounding turn-taking. Healthcare's clinical intake flow tunes those slightly higher to give patients time to recall details.
VAD State Machine
```mermaid
stateDiagram-v2
  [*] --> Idle
  Idle --> SpeechDetected: energy + model signal
  SpeechDetected --> Speaking: prefix_padding_ms (300ms)
  Speaking --> PossibleEnd: energy drop
  PossibleEnd --> Speaking: speech resumes within silence_duration_ms
  PossibleEnd --> TurnEnded: silence_duration_ms (500ms) elapsed AND model deems utterance complete
  TurnEnded --> ResponseStreaming: model emits audio + tool calls
  ResponseStreaming --> Idle: response complete
  Speaking --> Interrupted: user barge-in detected
  Interrupted --> Speaking: re-engage user audio
```
The state machine highlights two signals merging at the TurnEnded transition: the time-based silence elapsed AND the model's semantic check. Either alone is insufficient. Both together produce robust turn detection.
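The AND condition at the TurnEnded transition can be sketched as a small step function (a simplified model for illustration, not CallSphere's implementation; the state and signal names are hypothetical):

```typescript
type VadState = "Speaking" | "PossibleEnd" | "TurnEnded";

interface TurnSignals {
  speechActive: boolean;      // acoustic: is the caller speaking right now?
  silenceMs: number;          // silence elapsed since the energy drop
  utteranceComplete: boolean; // semantic: model judges the utterance done
}

const SILENCE_DURATION_MS = 500;

function step(state: VadState, s: TurnSignals): VadState {
  switch (state) {
    case "Speaking":
      return s.speechActive ? "Speaking" : "PossibleEnd";
    case "PossibleEnd":
      if (s.speechActive) return "Speaking"; // speech resumed, abort the end
      // Both signals must agree before the turn ends.
      return s.silenceMs >= SILENCE_DURATION_MS && s.utteranceComplete
        ? "TurnEnded"
        : "PossibleEnd";
    default:
      return state;
  }
}
```

Note that 600 ms of silence alone is not enough: with `utteranceComplete` still false, the machine stays in PossibleEnd and keeps waiting.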
Head-to-Head Comparison
| VAD Property | CallSphere (OpenAI Server VAD) | Vapi (Hosted Silero/webrtcvad) |
|---|---|---|
| Acoustic model | Native to LLM | Separate classifier |
| Semantic completion signal | Yes | No |
| Per-call tuning | Yes via session config | Account-level |
| Filler-word tolerance | High | Moderate |
| Background noise robustness | High | Moderate |
| Barge-in support | Native | Native |
| Latency to turn detection | ~300-500ms | ~300-500ms |
| Cut-off rate (anecdotal, healthcare) | <2% of turns | ~5-8% of turns |
The cut-off rate numbers are based on internal CallSphere QA across the Healthcare vertical's 14-tool intake flow, where turn detection accuracy directly impacts whether function calls fire on the right user input.
Real-World Failure Modes Server VAD Avoids
Failure 1: Polite caller waits. A caller says "Yes, I'd like to schedule" and pauses to look at their calendar. Classical VAD fires after 500ms of silence and the agent blurts out "Great, what date?" before the caller is ready. Server VAD, knowing the sentence had a dangling intent, holds.
Failure 2: Numeric dictation. "My number is five five five... two two two..." Each digit pause is enough to trip a classical VAD. Server VAD recognizes the number is incomplete and waits.
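The dictation failure can be reproduced with a simple timeline simulation (illustrative numbers: a 500 ms silence threshold against 700 ms pauses between digit groups):

```typescript
// Segments of the caller's audio: alternating speech and pause, in ms.
type Segment = { speaking: boolean; ms: number };

// Classical VAD fires at the first pause >= thresholdMs. We only count a
// pause as a premature cut-off if speech resumes afterward (i < length - 1).
function classicalFiresEarly(timeline: Segment[], thresholdMs: number): boolean {
  for (let i = 0; i < timeline.length - 1; i++) {
    const seg = timeline[i];
    if (!seg.speaking && seg.ms >= thresholdMs) return true;
  }
  return false;
}

const dictation: Segment[] = [
  { speaking: true, ms: 900 },  // "five five five"
  { speaking: false, ms: 700 }, // caller recalls the next group
  { speaking: true, ms: 900 },  // "two two two"
];
```

With a 500 ms threshold the classical detector cuts the caller off between digit groups; only an implausibly long threshold (here, above 700 ms) avoids it, at the cost of sluggish turn-taking everywhere else.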
Failure 3: Two-clause replies. "I want a haircut and also could you check if Sarah is available." Classical VAD sometimes fires after "haircut" because the energy dipped. Server VAD usually catches the conjunction.
Failure 4: ESL speaker pauses. Non-native English speakers pause more between words. Classical VAD with a 300ms threshold cuts them off. Server VAD adapts because the model handles ESL acoustic patterns natively.
Tuning Knobs CallSphere Exposes
| Knob | Default | When to change |
|---|---|---|
| silence_duration_ms | 500 | Raise for clinical intake, lower for quick sales scripts |
| prefix_padding_ms | 300 | Raise if first words get clipped |
| threshold | 0.5 | Lower for quiet speakers, raise for noisy environments |
These are per-vertical defaults. Healthcare runs at 700/300/0.5, Sales at 400/250/0.55, Real Estate at 500/300/0.5.
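Those per-vertical defaults can be expressed as a lookup with a documented fallback (the values come from the table above; the shape of the config object and function names are illustrative):

```typescript
interface VadConfig {
  silence_duration_ms: number;
  prefix_padding_ms: number;
  threshold: number;
}

// Per-vertical server-VAD defaults (silence / prefix / threshold).
const VERTICAL_VAD_DEFAULTS: Record<string, VadConfig> = {
  healthcare: { silence_duration_ms: 700, prefix_padding_ms: 300, threshold: 0.5 },
  sales:      { silence_duration_ms: 400, prefix_padding_ms: 250, threshold: 0.55 },
  realEstate: { silence_duration_ms: 500, prefix_padding_ms: 300, threshold: 0.5 },
};

function vadConfigFor(vertical: string): VadConfig {
  // Unknown verticals fall back to the documented global defaults.
  return (
    VERTICAL_VAD_DEFAULTS[vertical] ?? {
      silence_duration_ms: 500,
      prefix_padding_ms: 300,
      threshold: 0.5,
    }
  );
}
```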
Code Sketch: Configuring Server VAD
```javascript
// WebSocket#send returns void, so no await is needed here.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
    },
    input_audio_format: 'pcm16',
    output_audio_format: 'pcm16',
  },
}));
```
That's the full configuration. No client-side classifier to tune, no separate VAD container to monitor.
When Vapi's VAD Is Good Enough
If your call paths are short, English-only, and on clean lines, Vapi's pipeline VAD will rarely embarrass you. Where it strains is multi-language verticals (CallSphere supports 57+ languages), elderly callers with longer pauses, and any call with background noise (clinic waiting rooms, salon chairs, real estate showings).
FAQ
What is server-side VAD?
Voice activity detection that runs on the same compute as the language model, sharing acoustic and semantic state. It detects end-of-turn when the model considers the utterance complete, not just when audio energy drops.
Does CallSphere ever use client-side VAD?
For barge-in detection on the caller leg, yes — we run a thin energy-based detector to know when to duck the agent's TTS. Turn detection itself happens server-side.
Can I tune VAD per call?
Yes. CallSphere agents can update the VAD parameters mid-session if needed (for example, raising silence duration when entering a Q&A loop where the caller may pause to consider).
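As a sketch, a mid-session adjustment is just another `session.update` event over the same WebSocket. Building the payload in a helper keeps it testable (the function name is hypothetical; the event shape follows the Realtime API's turn_detection config):

```typescript
// Build a session.update event that changes silence_duration_ms,
// e.g. raising it when entering a Q&A loop where callers pause to think.
function buildVadUpdate(silenceDurationMs: number) {
  return {
    type: 'session.update',
    session: {
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: silenceDurationMs,
      },
    },
  };
}

// Usage on an open Realtime connection:
//   ws.send(JSON.stringify(buildVadUpdate(800)));
```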
Is server-side VAD slower?
No. The end-to-end latency is comparable to client-side VAD because the server VAD computation overlaps with the model's incremental processing.
What about phone-line noise (PSTN hiss)?
The model has been trained on telephony audio and tolerates standard PSTN hiss without false triggers. We additionally apply a 300Hz high-pass filter at the gateway for very noisy lines.
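For illustration, a 300 Hz high-pass can be as simple as a one-pole filter over the sample stream (a sketch of the idea only; the production gateway DSP may differ, and 8 kHz is assumed as the PSTN sample rate):

```typescript
// One-pole high-pass filter: y[i] = alpha * (y[i-1] + x[i] - x[i-1]).
// Attenuates rumble and DC below the cutoff while passing voice energy.
function highPass(samples: number[], cutoffHz = 300, sampleRate = 8000): number[] {
  const rc = 1 / (2 * Math.PI * cutoffHz);
  const dt = 1 / sampleRate;
  const alpha = rc / (rc + dt);
  const out: number[] = new Array(samples.length);
  let prevIn = 0;
  let prevOut = 0;
  for (let i = 0; i < samples.length; i++) {
    prevOut = alpha * (prevOut + samples[i] - prevIn);
    prevIn = samples[i];
    out[i] = prevOut;
  }
  return out;
}
```

A constant (DC) input decays toward zero at the output, which is the behavior that keeps steady line hiss and hum from registering as speech energy.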
How We Tune VAD for a New Vertical
When CallSphere launches a new vertical, VAD tuning is one of the first three configuration tasks. The process is empirical, not theoretical.
Step one is collecting a baseline of 100-200 real calls with default settings (silence_duration_ms=500, prefix_padding_ms=300, threshold=0.5). We tag each turn with a quality label: clean, cut-off, agent-talked-over, agent-jumped-in-too-soon.
Step two is reviewing the failure modes. If cut-offs dominate, raise silence_duration_ms by 100ms. If the agent jumps in too soon, raise threshold to 0.55. If first words get clipped, raise prefix_padding_ms by 50ms.
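One pass of those step-two rules can be written down directly (the 2% trigger fraction, the increments, and all names here are illustrative, not CallSphere's production tuner):

```typescript
interface TurnLabelCounts {
  clean: number;
  cutOff: number;
  jumpedInTooSoon: number;
  firstWordsClipped: number;
}

interface VadSettings {
  silence_duration_ms: number;
  prefix_padding_ms: number;
  threshold: number;
}

// Nudge whichever knob the dominant failure mode calls for:
// cut-offs -> longer silence; early jump-ins -> higher threshold;
// clipped first words -> more prefix padding.
function adjust(labels: TurnLabelCounts, s: VadSettings): VadSettings {
  const next = { ...s };
  const total =
    labels.clean + labels.cutOff + labels.jumpedInTooSoon + labels.firstWordsClipped;
  if (labels.cutOff / total > 0.02) next.silence_duration_ms += 100;
  if (labels.jumpedInTooSoon / total > 0.02) {
    next.threshold = Math.min(0.9, next.threshold + 0.05);
  }
  if (labels.firstWordsClipped / total > 0.02) next.prefix_padding_ms += 50;
  return next;
}
```

Running a cohort with 5% cut-offs through this pass moves silence_duration_ms from 500 to 600 and leaves the other knobs alone, which is the shape of the healthcare iteration described below.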
Step three is re-running the same caller cohort against the new settings (we replay the captured PCM frames). This gives a controlled A/B without burning live calls.
Step four is shipping the new defaults to the vertical's session config and watching the dashboards for a week. Healthcare's 700/300/0.5 wasn't theoretical — it came from this exact loop, run twice, with the second iteration moving silence_duration_ms from 500 to 700 because patients pause more when reciting medication lists.
Edge Case: The Two-Caller Conference
One pattern that breaks classical VAD entirely is a three-way call where two humans speak alongside the AI agent (for example, a patient and a family member). The classical VAD picks up speech from either human as continuous and never hands the turn back. Server VAD with the model in the loop handles it better because the model can recognize that the conversation has paused for the agent's input even when overlapping voices remain audible. We don't claim it's perfect — multi-party conferences are still a hard problem — but the failure rate drops noticeably with server-side VAD compared to a classical pipeline.
Try CallSphere
Experience server-side VAD on a real call. Book a demo or read the features overview.