
Handling Voice Agent Interruptions and Barge-In

Learn how to handle user interruptions and barge-in events in voice agents with lifecycle management, audio muting, graceful cancellation, and response resumption strategies.

Why Interruptions Are Inevitable

In natural conversation, people interrupt each other constantly. A user might say "actually, never mind" halfway through the agent's response. They might correct a misunderstood detail before the agent finishes acting on it. Or they might already know the information being delivered and want to skip ahead.

A voice agent that ignores interruptions — that bulldozes through its response regardless of what the user says — feels robotic and frustrating. Handling barge-in correctly is one of the hallmarks of a well-built voice experience.

The Barge-In Lifecycle

Barge-in occurs when a user starts speaking while the agent is still producing audio output. Handling it well involves a sequence of steps:

  1. Detect — VAD identifies user speech during agent playback
  2. Classify — Determine if it is a true interruption or a backchannel
  3. Cancel — Stop the agent's current audio output
  4. Capture — Record and transcribe the user's interrupting speech
  5. Resume — Process the interruption and generate an appropriate response
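Before diving into each step, the whole lifecycle can be sketched as a minimal async loop. This is an illustrative skeleton only — `vad_events`, `classify`, `cancel_output`, `transcribe`, and `respond` are hypothetical stand-ins for the components built in the rest of this post:

```python
async def barge_in_loop(vad_events, classify, cancel_output, transcribe, respond):
    """Run detect -> classify -> cancel -> capture -> resume as an event loop."""
    async for segment in vad_events:                # 1. Detect
        kind = classify(segment)                    # 2. Classify
        if kind == "backchannel":
            continue                                # agent keeps talking
        await cancel_output()                       # 3. Cancel
        transcript = await transcribe(segment)      # 4. Capture
        await respond(transcript)                   # 5. Resume
```

The important structural point is that backchannels short-circuit the loop early, so only true interruptions pay the cost of cancellation and re-planning.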
These lifecycle stages, and the taxonomy of interruption types, can be modeled with a couple of dataclasses:

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import asyncio
import time

class InterruptionType(str, Enum):
    CORRECTION = "correction"       # "No, I said Tuesday"
    CANCELLATION = "cancellation"   # "Never mind" / "Stop"
    REDIRECT = "redirect"           # "Actually, can you help with..."
    BACKCHANNEL = "backchannel"     # "Uh-huh" / "OK"
    CLARIFICATION = "clarification" # "Wait, what was that?"

@dataclass
class InterruptionEvent:
    timestamp: float
    type: InterruptionType
    user_transcript: str
    agent_was_saying: str
    agent_progress_pct: float  # how far through the response
    handled: bool = False

Detecting True Interruptions vs Backchannels

Not every user utterance during agent speech is an interruption. The first challenge is distinguishing between a backchannel ("mm-hmm") and a genuine attempt to take the floor. We covered the basics in the VAD post — here we build a more sophisticated classifier:

@dataclass
class BargeInDetector:
    energy_threshold: float = 0.04
    duration_threshold: float = 0.6  # seconds
    backchannel_words: set = field(default_factory=lambda: {
        "uh-huh", "mm-hmm", "yeah", "yes", "ok", "okay",
        "right", "sure", "got it", "i see", "mhm",
    })
    _speech_start: Optional[float] = field(default=None, init=False)
    _accumulated_text: str = field(default="", init=False)

    def on_user_speech_start(self):
        """Called when VAD detects user speech during agent output."""
        self._speech_start = time.time()
        self._accumulated_text = ""

    def on_partial_transcript(self, text: str) -> Optional[InterruptionType]:
        """Process partial transcription to classify the interruption."""
        self._accumulated_text = text.strip().lower()

        # Check for backchannel
        if self._accumulated_text in self.backchannel_words:
            return InterruptionType.BACKCHANNEL

        # Check for explicit cancellation
        cancel_phrases = {"stop", "never mind", "nevermind", "cancel", "shut up"}
        if self._accumulated_text in cancel_phrases:
            return InterruptionType.CANCELLATION

        # Check for corrections
        if self._accumulated_text.startswith(("no ", "not ", "actually ")):
            return InterruptionType.CORRECTION

        # Check for redirects
        if self._accumulated_text.startswith(("can you ", "what about ", "instead ")):
            return InterruptionType.REDIRECT

        # If speech has been going long enough, it is a real interruption
        if self._speech_start and (time.time() - self._speech_start) > self.duration_threshold:
            return InterruptionType.REDIRECT

        return None  # Not enough data yet

The key insight is that classification is progressive. You start making a decision as soon as partial transcription arrives and refine it as more words come in. This minimizes the delay between the user speaking and the agent reacting.
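To make the progressive decision concrete, here is the same decision order condensed into a self-contained free function — the word lists and thresholds are illustrative, mirroring the detector above:

```python
from typing import Optional

# Condensed, stand-alone version of the detector's decision order.
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "okay", "right", "sure"}
CANCEL_PHRASES = {"stop", "never mind", "nevermind", "cancel"}

def classify_partial(text: str, elapsed_s: float,
                     duration_threshold: float = 0.6) -> Optional[str]:
    """Classify a partial transcript; None means keep listening."""
    text = text.strip().lower()
    if text in BACKCHANNELS:
        return "backchannel"
    if text in CANCEL_PHRASES:
        return "cancellation"
    if text.startswith(("no ", "not ", "actually ")):
        return "correction"
    if text.startswith(("can you ", "what about ", "instead ")):
        return "redirect"
    if elapsed_s > duration_threshold:
        return "redirect"  # sustained speech matching no pattern: user wants the floor
    return None
```

Feeding successive partials lets the caller flip from None to a concrete label the moment there is enough evidence, rather than waiting for a final transcript.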


Muting and Cancelling Agent Output

Once you determine the user is truly interrupting, you need to stop the agent's audio output immediately. With the OpenAI Realtime API, this means sending a cancel event:

import json

async def cancel_agent_response(ws):
    """Cancel the in-progress agent response on the Realtime API."""
    await ws.send(json.dumps({
        "type": "response.cancel",
    }))

async def truncate_audio_output(ws, item_id: str, content_index: int, audio_end_ms: int):
    """Truncate the audio output at the current playback position."""
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    }))

On the client side, you also need to immediately stop audio playback. If there is buffered audio waiting to be played, flush it:

@dataclass
class AudioPlaybackManager:
    _buffer: list = field(default_factory=list, init=False)
    _is_playing: bool = field(default=False, init=False)
    _muted: bool = field(default=False, init=False)

    def mute(self):
        """Immediately stop playback and clear the buffer."""
        self._muted = True
        self._is_playing = False
        self._buffer.clear()

    def unmute(self):
        """Allow playback to resume."""
        self._muted = False

    def enqueue(self, audio_chunk: bytes):
        """Add audio to the playback buffer."""
        if not self._muted:
            self._buffer.append(audio_chunk)

    def flush(self):
        """Clear all buffered audio without playing it."""
        self._buffer.clear()

Graceful Cancellation Patterns

Abruptly cutting the agent off mid-word sounds jarring. A more polished approach is to stop output cleanly and then have the agent acknowledge the interruption in a way that matches its type:

async def handle_interruption(
    ws,
    event: InterruptionEvent,
    playback: AudioPlaybackManager,
):
    """Handle a classified interruption event."""
    if event.type == InterruptionType.BACKCHANNEL:
        # Do nothing — agent continues speaking
        return

    # Stop agent audio
    playback.mute()

    if event.type == InterruptionType.CANCELLATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Understood, I will stop. What would you like to do instead?",
        )

    elif event.type == InterruptionType.CORRECTION:
        playback.flush()
        await send_agent_message(
            ws,
            f"Sorry about that. Let me address your correction: "
            f"{event.user_transcript}",
        )

    elif event.type == InterruptionType.REDIRECT:
        playback.flush()
        await send_agent_message(
            ws,
            "Of course, let me help with that instead.",
        )

    elif event.type == InterruptionType.CLARIFICATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Let me repeat that more clearly.",
        )

    event.handled = True
    playback.unmute()

async def send_agent_message(ws, text: str):
    """Inject a text message for the agent to speak."""
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": text}],
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))

Tracking Interruption Context

The agent needs to know what it was saying when interrupted so it can resume or adjust. Track the context:

@dataclass
class ConversationTracker:
    _current_response_text: str = field(default="", init=False)
    _current_item_id: Optional[str] = field(default=None, init=False)
    _interruption_history: list = field(default_factory=list, init=False)

    def on_response_text_delta(self, item_id: str, delta: str):
        """Track the agent's response as it streams."""
        self._current_item_id = item_id
        self._current_response_text += delta

    def on_interruption(self, user_text: str) -> InterruptionEvent:
        """Create an interruption event with full context."""
        # Without the full response length, approximate progress from the
        # characters streamed so far (the +50 roughly assumes a phrase remains).
        progress = len(self._current_response_text)
        event = InterruptionEvent(
            timestamp=time.time(),
            type=InterruptionType.REDIRECT,  # default; refine with BargeInDetector
            user_transcript=user_text,
            agent_was_saying=self._current_response_text,
            agent_progress_pct=min(progress / max(progress + 50, 1), 1.0),
        )
        self._interruption_history.append(event)
        self._current_response_text = ""
        return event

    @property
    def interruption_rate(self) -> float:
        """Track how often the user interrupts — high rates suggest issues."""
        if not self._interruption_history:
            return 0.0
        recent = [
            e for e in self._interruption_history
            if time.time() - e.timestamp < 300  # last 5 minutes
        ]
        return len(recent) / 5.0  # interruptions per minute

A high interruption rate is a signal that something is wrong. The agent might be speaking too slowly, providing irrelevant information, or misunderstanding the user. Log and monitor this metric.
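A sliding-window monitor makes that metric cheap to compute and easy to alert on. This is a sketch — the 5-minute window and the 2-per-minute alert threshold are assumed defaults, not recommendations from any particular platform:

```python
import time
from collections import deque
from typing import Optional

class InterruptionRateMonitor:
    """Sliding-window interruption rate with a configurable alert threshold."""

    def __init__(self, window_s: float = 300.0, alert_per_min: float = 2.0):
        self.window_s = window_s
        self.alert_per_min = alert_per_min
        self._events: deque = deque()  # timestamps of interruptions

    def record(self, timestamp: Optional[float] = None) -> None:
        self._events.append(time.time() if timestamp is None else timestamp)

    def rate_per_min(self, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) / (self.window_s / 60.0)

    def should_alert(self, now: Optional[float] = None) -> bool:
        return self.rate_per_min(now) >= self.alert_per_min
```

Wiring `record()` into `ConversationTracker.on_interruption` and polling `should_alert()` from a metrics loop turns the tracker's raw history into an actionable signal.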

Production Best Practices

  1. Always prefer false negatives over false positives — it is better to miss a backchannel than to incorrectly stop a response due to a cough
  2. Add a minimum speech duration (200-300ms) before triggering barge-in to filter out transient noises
  3. Track what was interrupted so the agent can offer to continue: "I was explaining the refund policy. Would you like me to continue?"
  4. Test with real users early — interruption patterns vary wildly between people, cultures, and contexts
  5. Log every interruption event with timestamps, classification, and user transcript for iterative improvement
  6. Set up alerts on interruption rate spikes — they often indicate a regression in agent behavior or audio quality
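Practice 2 can be enforced with a small debounce in front of the barge-in classifier. A sketch, with 250 ms as an assumed default:

```python
from typing import Optional

class SpeechDebouncer:
    """Suppress barge-in until user speech has lasted at least min_duration_ms."""

    def __init__(self, min_duration_ms: int = 250):
        self.min_duration_ms = min_duration_ms
        self._start_ms: Optional[int] = None

    def on_speech_start(self, now_ms: int) -> None:
        self._start_ms = now_ms

    def on_speech_end(self) -> None:
        self._start_ms = None  # transient noise never reached the threshold

    def should_trigger(self, now_ms: int) -> bool:
        """True once speech has been sustained long enough to count as barge-in."""
        return (self._start_ms is not None
                and now_ms - self._start_ms >= self.min_duration_ms)
```

A cough that ends within the threshold never triggers cancellation; sustained speech passes through to the classifier unchanged.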

Handling interruptions well is what separates a demo-grade voice agent from one that users actually want to talk to. The investment in barge-in logic pays off in every single conversation.


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
