Voice Activity Detection and Turn Management in Conversational AI
Master voice activity detection algorithms, turn-taking strategies, overlapping speech handling, and silence threshold tuning to build natural-sounding conversational AI agents.
The Invisible Foundation of Voice Agents
When you talk to another person, you instinctively know when they have finished speaking. You detect pauses, falling intonation, syntactic completeness, and body language. Machines have none of these instincts. They need Voice Activity Detection (VAD) and explicit turn management logic to decide when to listen, when to speak, and when to yield.
Get this wrong and your voice agent either cuts users off mid-sentence or sits in awkward silence for seconds after they stop talking. Get it right and the conversation feels as fluid as talking to a human colleague.
What Is Voice Activity Detection?
VAD is the process of determining whether an audio frame contains human speech or is just background noise. It sounds simple, but the real world is messy: keyboard clicks, air conditioning hum, dogs barking, other people talking in the background. A production VAD system must distinguish intentional speech from all of this.
Energy-Based VAD
The simplest approach measures the signal energy (volume) of each audio frame:
import numpy as np
def energy_vad(audio_frame: np.ndarray, threshold: float = 0.02) -> bool:
"""Return True if the frame contains speech based on energy."""
rms = np.sqrt(np.mean(audio_frame ** 2))
return rms > threshold
Energy-based VAD is fast and cheap but fails in noisy environments. A loud air conditioner can register as speech, while a soft-spoken user can fall below the threshold.
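One common mitigation is to track a running estimate of the background noise floor and require speech to exceed it by a margin, rather than using a fixed absolute threshold. A minimal sketch of that idea follows; the `margin` and `smoothing` values are illustrative defaults, not tuned constants:

```python
import numpy as np

def adaptive_energy_vad(
    audio_frame: np.ndarray,
    noise_floor: float,
    margin: float = 3.0,
    smoothing: float = 0.95,
) -> tuple[bool, float]:
    """Energy VAD with a slowly adapting noise-floor estimate.

    Returns (is_speech, updated_noise_floor). The floor is only
    updated during non-speech frames, so speech does not inflate it.
    """
    rms = float(np.sqrt(np.mean(audio_frame ** 2)))
    is_speech = rms > noise_floor * margin
    if not is_speech:
        noise_floor = smoothing * noise_floor + (1 - smoothing) * rms
    return is_speech, noise_floor
```

This keeps the detector usable next to a steady air conditioner: the hum raises the floor, and only audio well above the hum counts as speech. It still fails for non-stationary noise, which is where the methods below come in.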
Zero-Crossing Rate VAD
Speech has characteristic patterns in how often the audio signal crosses zero. Combining zero-crossing rate with energy gives a more robust detector:
def zero_crossing_rate(audio_frame: np.ndarray) -> float:
"""Calculate the zero-crossing rate of an audio frame."""
signs = np.sign(audio_frame)
crossings = np.sum(np.abs(np.diff(signs)) > 0)
return crossings / len(audio_frame)
def combined_vad(
audio_frame: np.ndarray,
energy_threshold: float = 0.02,
zcr_range: tuple = (0.1, 0.5),
) -> bool:
"""Combine energy and zero-crossing rate for VAD."""
rms = np.sqrt(np.mean(audio_frame ** 2))
zcr = zero_crossing_rate(audio_frame)
has_energy = rms > energy_threshold
has_speech_zcr = zcr_range[0] <= zcr <= zcr_range[1]
return has_energy and has_speech_zcr
Neural VAD Models
Modern production systems use dedicated VAD models such as Silero VAD (a lightweight neural network) or WebRTC VAD (a classic statistical detector). Neural models like Silero are trained on massive, diverse datasets and handle noise far better than heuristic methods:
import torch
# Silero VAD — lightweight, runs on CPU in real time
model, utils = torch.hub.load(
repo_or_dir="snakers4/silero-vad",
model="silero_vad",
force_reload=False,
)
(get_speech_timestamps, _, read_audio, _, _) = utils
def detect_speech_segments(audio_path: str) -> list:
"""Return timestamps of speech segments in the audio file."""
wav = read_audio(audio_path, sampling_rate=16000)
speech_timestamps = get_speech_timestamps(
wav, model, sampling_rate=16000
)
return speech_timestamps
Silero VAD processes short fixed-size chunks (512 samples, about 32 ms at 16 kHz) and returns a speech probability between 0 and 1. A threshold of 0.5 works well for most environments, but you can tune it based on your deployment context.
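In streaming use, thresholding the raw per-chunk probability tends to flicker near the boundary. A common fix is hysteresis: enter the speaking state above one threshold and leave it only below a lower one. A minimal sketch (the two threshold values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class HysteresisVAD:
    """Smooth per-chunk speech probabilities (e.g. from Silero VAD)
    with two thresholds so the speaking state does not flicker."""
    enter_threshold: float = 0.5   # probability needed to start speech
    exit_threshold: float = 0.35   # must drop below this to end speech
    speaking: bool = False

    def update(self, prob: float) -> bool:
        if self.speaking:
            if prob < self.exit_threshold:
                self.speaking = False
        elif prob > self.enter_threshold:
            self.speaking = True
        return self.speaking
```

A probability hovering around 0.45 then keeps whatever state it was in, instead of toggling speech on and off every chunk.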
Turn-Taking Strategies
Detecting speech is only the first step. You also need to decide when a user has finished their turn so the agent can respond. This is the turn-taking problem.
Silence-Based Turn Detection
The most common strategy: if the user stops speaking for a configurable duration, assume their turn is complete.
import time
from dataclasses import dataclass, field
@dataclass
class TurnDetector:
silence_threshold: float = 0.7 # seconds of silence before turn ends
_last_speech_time: float = field(default=0.0, init=False)
_is_speaking: bool = field(default=False, init=False)
def process_frame(self, is_speech: bool) -> str:
"""Process a VAD result and return the turn state."""
now = time.time()
if is_speech:
self._last_speech_time = now
if not self._is_speaking:
self._is_speaking = True
return "turn_started"
return "speaking"
if self._is_speaking:
silence_duration = now - self._last_speech_time
if silence_duration >= self.silence_threshold:
self._is_speaking = False
return "turn_ended"
return "pause"
return "idle"
The silence threshold is the single most impactful parameter in turn management. Too short (under 0.4 seconds) and you cut off users who are pausing to think. Too long (over 1.5 seconds) and the agent feels sluggish.
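In a real audio pipeline it is often easier to count fixed-size frames than to read the wall clock, which also makes the detector deterministic and testable. A frame-based variant of the same state machine, assuming 20 ms frames (so 35 silent frames is 700 ms; the frame size is an assumption, not from the original):

```python
from dataclasses import dataclass, field

@dataclass
class FrameTurnDetector:
    """Silence-based turn detection driven by frame counts
    instead of wall-clock time. Assumes fixed-size frames."""
    silence_frames_needed: int = 35  # 35 x 20 ms = 700 ms
    _silent_frames: int = field(default=0, init=False)
    _is_speaking: bool = field(default=False, init=False)

    def process_frame(self, is_speech: bool) -> str:
        if is_speech:
            self._silent_frames = 0
            if not self._is_speaking:
                self._is_speaking = True
                return "turn_started"
            return "speaking"
        if self._is_speaking:
            self._silent_frames += 1
            if self._silent_frames >= self.silence_frames_needed:
                self._is_speaking = False
                return "turn_ended"
            return "pause"
        return "idle"
```

Feeding it one VAD result per frame yields the same event sequence as the clock-based version, but a unit test can replay an exact frame pattern without sleeping.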
Adaptive Silence Thresholds
A fixed threshold does not fit every situation. Some users speak quickly with short pauses; others think carefully between phrases. Adaptive thresholds adjust in real time:
@dataclass
class AdaptiveTurnDetector:
base_threshold: float = 0.7
min_threshold: float = 0.4
max_threshold: float = 1.5
adaptation_rate: float = 0.1
_pause_history: list = field(default_factory=list, init=False)
_current_threshold: float = field(default=0.7, init=False)
def record_pause(self, pause_duration: float):
"""Record a mid-turn pause to adapt the threshold."""
self._pause_history.append(pause_duration)
if len(self._pause_history) > 20:
self._pause_history.pop(0)
if len(self._pause_history) >= 3:
avg_pause = sum(self._pause_history) / len(self._pause_history)
target = avg_pause * 2.0 # 2x the average pause
self._current_threshold += (
(target - self._current_threshold) * self.adaptation_rate
)
self._current_threshold = max(
self.min_threshold,
min(self.max_threshold, self._current_threshold),
)
@property
def threshold(self) -> float:
return self._current_threshold
This detector learns the user's speaking rhythm. If a user consistently pauses for 0.3 seconds between thoughts, the threshold settles around 0.6 seconds — fast enough to feel responsive but not so fast that it interrupts mid-thought pauses.
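That convergence claim can be checked with a quick standalone simulation of the update rule, restating the same defaults as the class above in a single function:

```python
def simulate_adaptation(
    pauses: list[float],
    threshold: float = 0.7,
    rate: float = 0.1,
    bounds: tuple[float, float] = (0.4, 1.5),
) -> float:
    """Feed observed mid-turn pauses through the exponential update
    rule (target = 2x the running average of the last 20 pauses)
    and return the final threshold."""
    history: list[float] = []
    for pause in pauses:
        history.append(pause)
        history = history[-20:]
        if len(history) >= 3:
            target = 2.0 * sum(history) / len(history)
            threshold += (target - threshold) * rate
            threshold = max(bounds[0], min(bounds[1], threshold))
    return threshold

# A user who consistently pauses 0.3 s pulls the threshold
# from the 0.7 default toward 0.6
final = simulate_adaptation([0.3] * 50)
```

With an adaptation rate of 0.1, each update closes 10% of the gap to the target, so after 50 pauses the threshold sits within a millisecond or two of 0.6.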
Handling Overlapping Speech
Real conversations have overlap. Users sometimes start speaking before the agent finishes, or they provide brief acknowledgments ("uh-huh", "yeah") while the agent is talking. Your system must handle these gracefully.
Overlap Classification
Not all overlaps are the same. Classify them to respond appropriately:
from enum import Enum
class OverlapType(str, Enum):
BACKCHANNEL = "backchannel" # "uh-huh", "yeah", "ok"
INTERRUPTION = "interruption" # user wants to take the floor
COLLISION = "collision" # both started at the same time
def classify_overlap(
user_audio_energy: float,
user_speech_duration: float,
agent_is_speaking: bool,
) -> OverlapType:
"""Classify the type of speech overlap."""
if not agent_is_speaking:
return OverlapType.COLLISION
# Short, low-energy speech during agent turn = backchannel
if user_speech_duration < 0.5 and user_audio_energy < 0.05:
return OverlapType.BACKCHANNEL
# Sustained speech during agent turn = interruption
return OverlapType.INTERRUPTION
For backchannels, the agent should continue speaking. For interruptions, the agent should stop and yield the floor. This distinction prevents the agent from halting every time a user says "mm-hmm."
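That policy can be made explicit as a small dispatcher from classification to agent action. The enum is restated here for self-containment; the action names are illustrative, not a fixed API:

```python
from enum import Enum

class OverlapType(str, Enum):
    BACKCHANNEL = "backchannel"
    INTERRUPTION = "interruption"
    COLLISION = "collision"

def overlap_policy(overlap: OverlapType) -> str:
    """Map an overlap classification to an agent action."""
    if overlap is OverlapType.BACKCHANNEL:
        return "continue_speaking"    # keep the floor; "mm-hmm" is not a turn bid
    if overlap is OverlapType.INTERRUPTION:
        return "stop_and_listen"      # flush the TTS queue and yield the floor
    return "yield_with_backoff"       # collision: pause briefly, let the user go first
```

Keeping the policy separate from the classifier means you can tune one without touching the other, for example yielding more aggressively in a support context than in an outbound survey.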
Integrating VAD with OpenAI Realtime API
The OpenAI Realtime API provides built-in server-side VAD, but understanding how to configure it is essential:
import json
import websockets
async def configure_realtime_session(ws):
"""Configure the OpenAI Realtime API session with VAD settings."""
await ws.send(json.dumps({
"type": "session.update",
"session": {
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 700,
},
"input_audio_transcription": {
"model": "whisper-1",
},
},
}))
The three key parameters are threshold (VAD sensitivity, 0.0 to 1.0), prefix_padding_ms (how much audio before detected speech to include, preventing clipped beginnings), and silence_duration_ms (how long to wait after speech ends before finalizing the turn).
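With server VAD enabled, the API emits turn events on the same WebSocket, such as input_audio_buffer.speech_started and input_audio_buffer.speech_stopped. A minimal dispatcher over the parsed event payloads, where the returned action names are illustrative local conventions rather than part of the API:

```python
import json

def handle_realtime_event(raw: str) -> str:
    """Map a Realtime API server event to a local turn-management action."""
    event = json.loads(raw)
    etype = event.get("type", "")
    if etype == "input_audio_buffer.speech_started":
        return "user_turn_started"   # good place to cancel any agent audio
    if etype == "input_audio_buffer.speech_stopped":
        return "user_turn_ended"     # server VAD has finalized the turn
    if etype == "response.done":
        return "agent_turn_ended"
    return "ignored"
```

In a real session loop you would call this on every message received from the socket and feed the resulting actions into your turn state machine.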
Production Tuning Guidelines
After deploying VAD and turn management across multiple voice agents, these guidelines consistently produce the best results:
- Start with server VAD at threshold 0.5 and silence 700ms, then tune based on user feedback
- Log every turn event — turn_started, turn_ended, interruption, backchannel — with timestamps for analysis
- Measure end-of-turn latency as the time between the user stopping speech and the agent beginning its response; target under 500ms total
- Test with diverse audio conditions: quiet rooms, noisy cafes, speakerphone, Bluetooth headsets
- Add a visual indicator (for screen-based agents) showing whether the system thinks the user is speaking — this helps users adjust their behavior
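The end-of-turn latency metric above is easy to instrument directly. A hypothetical helper, assuming you can hook the two relevant moments in your pipeline (turn finalized, first agent audio out):

```python
import time
from typing import Optional

class LatencyTracker:
    """Measure end-of-turn latency: user speech end -> first agent audio."""

    def __init__(self) -> None:
        self._turn_ended_at: Optional[float] = None
        self.samples: list[float] = []

    def mark_turn_ended(self) -> None:
        self._turn_ended_at = time.monotonic()

    def mark_agent_audio_started(self) -> None:
        # Ignore spurious calls when no turn end is pending
        if self._turn_ended_at is not None:
            self.samples.append(time.monotonic() - self._turn_ended_at)
            self._turn_ended_at = None

    def p95_ms(self) -> float:
        """95th-percentile latency in milliseconds (0.0 if no samples)."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx] * 1000.0
```

Tracking the p95 rather than the mean matters here: a few multi-second outliers are what users remember, and the mean hides them.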
The difference between a frustrating voice agent and a delightful one often comes down to 200 milliseconds of silence threshold tuning. Invest the time to get it right.
Written by
CallSphere Team