---
title: "Handling Voice Agent Interruptions and Barge-In"
description: "Learn how to handle user interruptions and barge-in events in voice agents with lifecycle management, audio muting, graceful cancellation, and response resumption strategies."
canonical: https://callsphere.ai/blog/handling-voice-agent-interruptions-barge-in
category: "Learn Agentic AI"
tags: ["OpenAI", "Interruptions", "Barge-In", "Voice UX"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-12T18:34:50.210Z
---

# Handling Voice Agent Interruptions and Barge-In

> Learn how to handle user interruptions and barge-in events in voice agents with lifecycle management, audio muting, graceful cancellation, and response resumption strategies.

## Why Interruptions Are Inevitable

In natural conversation, people interrupt each other constantly. A user might say "actually, never mind" halfway through the agent's response. They might correct a misunderstood detail before the agent finishes acting on it. Or they might already know the information being delivered and want to skip ahead.

A voice agent that ignores interruptions — that bulldozes through its response regardless of what the user says — feels robotic and frustrating. Handling barge-in correctly is one of the hallmarks of a well-built voice experience.

## The Barge-In Lifecycle

Barge-in is the event where a user starts speaking while the agent is still producing audio output. Within the overall call pipeline shown below, handling it involves a sequence of steps:

```mermaid
flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT
Deepgram or Whisper"]
        NLU{"Intent and
Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS
ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and
Schedule")]
        KB[("Knowledge Base
and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS --> CRM
    TOOLS --> CAL
    TOOLS --> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937
```

1. **Detect** — VAD identifies user speech during agent playback
2. **Classify** — Determine if it is a true interruption or a backchannel
3. **Cancel** — Stop the agent's current audio output
4. **Capture** — Record and transcribe the user's interrupting speech
5. **Resume** — Process the interruption and generate an appropriate response

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import asyncio
import time

class InterruptionType(str, Enum):
    CORRECTION = "correction"       # "No, I said Tuesday"
    CANCELLATION = "cancellation"   # "Never mind" / "Stop"
    REDIRECT = "redirect"           # "Actually, can you help with..."
    BACKCHANNEL = "backchannel"     # "Uh-huh" / "OK"
    CLARIFICATION = "clarification" # "Wait, what was that?"

@dataclass
class InterruptionEvent:
    timestamp: float
    type: InterruptionType
    user_transcript: str
    agent_was_saying: str
    agent_progress_pct: float  # how far through the response
    handled: bool = False
```
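Wiring the five steps together looks roughly like this. The `classify`, `stop_playback`, and `respond` callbacks are placeholders for the components built in the rest of this post; injecting them keeps the control flow testable without real audio or a websocket:

```python
def handle_barge_in(classify, stop_playback, respond, transcript: str) -> str:
    """Run Classify -> Cancel -> Capture/Resume once Detect has fired.

    classify, stop_playback, and respond are injected callbacks so the
    sequence can be exercised in isolation.
    """
    kind = classify(transcript)            # Classify
    if kind == "backchannel":
        return kind                        # agent keeps talking; nothing to cancel
    stop_playback()                        # Cancel
    respond(kind, transcript)              # Capture + Resume
    return kind
```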

## Detecting True Interruptions vs Backchannels

Not every user utterance during agent speech is an interruption. The first challenge is distinguishing between a backchannel ("mm-hmm") and a genuine attempt to take the floor. We covered the basics in the VAD post — here we build a more sophisticated classifier:

```python
@dataclass
class BargeInDetector:
    energy_threshold: float = 0.04
    duration_threshold: float = 0.6  # seconds
    backchannel_words: set = field(default_factory=lambda: {
        "uh-huh", "mm-hmm", "yeah", "yes", "ok", "okay",
        "right", "sure", "got it", "i see", "mhm",
    })
    _speech_start: Optional[float] = field(default=None, init=False)
    _accumulated_text: str = field(default="", init=False)

    def on_user_speech_start(self):
        """Called when VAD detects user speech during agent output."""
        self._speech_start = time.time()
        self._accumulated_text = ""

    def on_partial_transcript(self, text: str) -> Optional[InterruptionType]:
        """Process partial transcription to classify the interruption."""
        self._accumulated_text = text.strip().lower()

        # Check for backchannel
        if self._accumulated_text in self.backchannel_words:
            return InterruptionType.BACKCHANNEL

        # Check for explicit cancellation
        cancel_phrases = {"stop", "never mind", "nevermind", "cancel", "shut up"}
        if self._accumulated_text in cancel_phrases:
            return InterruptionType.CANCELLATION

        # Check for corrections
        if self._accumulated_text.startswith(("no ", "not ", "actually ")):
            return InterruptionType.CORRECTION

        # Check for redirects
        if self._accumulated_text.startswith(("can you ", "what about ", "instead ")):
            return InterruptionType.REDIRECT

        # If speech has been going long enough, it is a real interruption
        if self._speech_start and (time.time() - self._speech_start) > self.duration_threshold:
            return InterruptionType.REDIRECT

        return None  # Not enough data yet
```

The key insight is that classification is **progressive**. You start making a decision as soon as partial transcription arrives and refine it as more words come in. This minimizes the delay between the user speaking and the agent reacting.
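To see the progressive refinement in isolation, here is a dependency-free condensation of the detector above. The phrase lists and the 0.6-second threshold mirror `BargeInDetector` but are illustrative:

```python
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "okay", "right", "sure"}

def classify_partial(text: str, elapsed_s: float,
                     duration_threshold: float = 0.6):
    """Classify a partial transcript, or return None while still ambiguous.

    Returning None lets the caller keep feeding newer partials without
    committing to a decision too early.
    """
    text = text.strip().lower()
    if text in BACKCHANNELS:
        return "backchannel"
    if text.startswith(("no ", "not ", "actually ")):
        return "correction"
    if elapsed_s > duration_threshold:
        return "redirect"  # sustained speech matching no known pattern
    return None

# The same utterance is re-classified as partials arrive:
#   "no"                -> None (could still become "no problem, thanks")
#   "no i said tuesday" -> "correction"
```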

## Muting and Cancelling Agent Output

Once you determine the user is truly interrupting, you need to stop the agent's audio output immediately. With the OpenAI Realtime API, this means sending a cancel event:

```python
import json

async def cancel_agent_response(ws):
    """Cancel the in-progress agent response on the Realtime API."""
    await ws.send(json.dumps({
        "type": "response.cancel",
    }))

async def truncate_audio_output(ws, item_id: str, content_index: int, audio_end_ms: int):
    """Truncate the audio output at the current playback position."""
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    }))
```
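`audio_end_ms` should reflect the caller's actual playback position, not how much audio you have received; the two diverge by however much is still buffered client-side. A small conversion helper, assuming PCM16 at the Realtime API's default 24 kHz sample rate:

```python
def played_audio_ms(samples_played: int, sample_rate_hz: int = 24_000) -> int:
    """Convert the number of samples actually played into milliseconds.

    24 kHz is the Realtime API's default PCM16 rate; pass your own rate
    if you negotiated a different audio format.
    """
    return samples_played * 1000 // sample_rate_hz
```

Feed the result into `truncate_audio_output` so the server's view of the conversation matches what the caller actually heard.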

On the client side, you also need to immediately stop audio playback. If there is buffered audio waiting to be played, flush it:

```python
@dataclass
class AudioPlaybackManager:
    _buffer: list = field(default_factory=list, init=False)
    _is_playing: bool = field(default=False, init=False)
    _muted: bool = field(default=False, init=False)

    def mute(self):
        """Immediately stop playback and clear the buffer."""
        self._muted = True
        self._is_playing = False
        self._buffer.clear()

    def unmute(self):
        """Allow playback to resume."""
        self._muted = False

    def enqueue(self, audio_chunk: bytes):
        """Add audio to the playback buffer."""
        if not self._muted:
            self._buffer.append(audio_chunk)

    def flush(self):
        """Clear all buffered audio without playing it."""
        self._buffer.clear()
```

## Graceful Cancellation Patterns

Abruptly stopping mid-word sounds jarring. A more polished approach is to let the current word or phrase finish before stopping, then acknowledge the interruption:

```python
async def handle_interruption(
    ws,
    event: InterruptionEvent,
    playback: AudioPlaybackManager,
):
    """Handle a classified interruption event."""
    if event.type == InterruptionType.BACKCHANNEL:
        # Do nothing — agent continues speaking
        return

    # Stop agent audio
    playback.mute()

    if event.type == InterruptionType.CANCELLATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Understood, I will stop. What would you like to do instead?",
        )

    elif event.type == InterruptionType.CORRECTION:
        playback.flush()
        await send_agent_message(
            ws,
            f"Sorry about that. Let me address your correction: "
            f"{event.user_transcript}",
        )

    elif event.type == InterruptionType.REDIRECT:
        playback.flush()
        await send_agent_message(
            ws,
            "Of course, let me help with that instead.",
        )

    elif event.type == InterruptionType.CLARIFICATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Let me repeat that more clearly.",
        )

    event.handled = True
    playback.unmute()

async def send_agent_message(ws, text: str):
    """Inject a text message for the agent to speak."""
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": text}],  # assistant items use "text", not "input_text"
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))
```
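The "let the current word finish" idea can be approximated with a short grace delay before the hard stop. This is a heuristic sketch; true word-boundary stopping would need timing metadata from the TTS engine:

```python
import asyncio

async def stop_at_phrase_boundary(stop_fn, grace_ms: int = 150) -> None:
    """Delay the hard stop briefly so the word in flight can finish.

    A fixed 100-200 ms grace period covers most word tails without
    feeling laggy; it is a crude stand-in for real boundary detection.
    """
    await asyncio.sleep(grace_ms / 1000)
    stop_fn()
```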

## Tracking Interruption Context

The agent needs to know what it was saying when interrupted so it can resume or adjust. Track the context:

```python
@dataclass
class ConversationTracker:
    _current_response_text: str = field(default="", init=False)
    _current_item_id: Optional[str] = field(default=None, init=False)
    _interruption_history: list = field(default_factory=list, init=False)

    def on_response_text_delta(self, item_id: str, delta: str):
        """Track the agent's response as it streams."""
        self._current_item_id = item_id
        self._current_response_text += delta

    def on_interruption(self, user_text: str) -> InterruptionEvent:
        """Create an interruption event with full context."""
        # Rough heuristic: only the streamed character count is known, not
        # the final response length, so assume ~50 characters remain.
        progress = len(self._current_response_text)
        event = InterruptionEvent(
            timestamp=time.time(),
            type=InterruptionType.REDIRECT,
            user_transcript=user_text,
            agent_was_saying=self._current_response_text,
            agent_progress_pct=min(progress / max(progress + 50, 1), 1.0),
        )
        self._interruption_history.append(event)
        self._current_response_text = ""
        return event

    @property
    def interruption_rate(self) -> float:
        """Track how often the user interrupts — high rates suggest issues."""
        if not self._interruption_history:
            return 0.0
        window_s = 300.0  # look at the last 5 minutes
        recent = [
            e for e in self._interruption_history
            if time.time() - e.timestamp < window_s
        ]
        return len(recent) / (window_s / 60.0)  # interruptions per minute
```

A high interruption rate is a signal that something is wrong. The agent might be speaking too slowly, providing irrelevant information, or misunderstanding the user. Log and monitor this metric.
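Putting a number on "high" takes a sliding window. A minimal monitor sketch; the 2-per-minute default threshold is a placeholder to calibrate against your own call logs:

```python
from collections import deque
from typing import Optional
import time

class InterruptionRateMonitor:
    """Sliding-window interruption rate with a simple threshold alert."""

    def __init__(self, window_s: float = 300.0, threshold_per_min: float = 2.0):
        self.window_s = window_s
        self.threshold_per_min = threshold_per_min
        self._events: deque = deque()

    def record(self, ts: Optional[float] = None) -> None:
        self._events.append(time.time() if ts is None else ts)

    def rate_per_min(self, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) / (self.window_s / 60.0)

    def should_alert(self, now: Optional[float] = None) -> bool:
        return self.rate_per_min(now) > self.threshold_per_min
```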

## Production Best Practices

1. **Always prefer false negatives over false positives** — it is better to miss a backchannel than to incorrectly stop a response due to a cough
2. **Add a minimum speech duration** (200-300ms) before triggering barge-in to filter out transient noises
3. **Track what was interrupted** so the agent can offer to continue: "I was explaining the refund policy. Would you like me to continue?"
4. **Test with real users early** — interruption patterns vary wildly between people, cultures, and contexts
5. **Log every interruption event** with timestamps, classification, and user transcript for iterative improvement
6. **Set up alerts** on interruption rate spikes — they often indicate a regression in agent behavior or audio quality
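Practice 2 is small enough to show in full. A minimal duration gate, using the 200-300ms range from above:

```python
def passes_min_duration(speech_start_s: float, now_s: float,
                        min_ms: int = 250) -> bool:
    """True once speech has lasted long enough to rule out a cough or click."""
    return (now_s - speech_start_s) * 1000 >= min_ms
```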

Handling interruptions well is what separates a demo-grade voice agent from one that users actually want to talk to. The investment in barge-in logic pays off in every single conversation.

