Handling Voice Agent Interruptions and Barge-In
Learn how to handle user interruptions and barge-in events in voice agents with lifecycle management, audio muting, graceful cancellation, and response resumption strategies.
Why Interruptions Are Inevitable
In natural conversation, people interrupt each other constantly. A user might say "actually, never mind" halfway through the agent's response. They might correct a misunderstood detail before the agent finishes acting on it. Or they might already know the information being delivered and want to skip ahead.
A voice agent that ignores interruptions — that bulldozes through its response regardless of what the user says — feels robotic and frustrating. Handling barge-in correctly is one of the hallmarks of a well-built voice experience.
The Barge-In Lifecycle
Barge-in is the event where a user starts speaking while the agent is still producing audio output. Handling it involves a sequence of steps:
- Detect — VAD identifies user speech during agent playback
- Classify — Determine if it is a true interruption or a backchannel
- Cancel — Stop the agent's current audio output
- Capture — Record and transcribe the user's interrupting speech
- Resume — Process the interruption and generate an appropriate response
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import asyncio
import time
class InterruptionType(str, Enum):
CORRECTION = "correction" # "No, I said Tuesday"
CANCELLATION = "cancellation" # "Never mind" / "Stop"
REDIRECT = "redirect" # "Actually, can you help with..."
BACKCHANNEL = "backchannel" # "Uh-huh" / "OK"
CLARIFICATION = "clarification" # "Wait, what was that?"
@dataclass
class InterruptionEvent:
timestamp: float
type: InterruptionType
user_transcript: str
agent_was_saying: str
agent_progress_pct: float # how far through the response
handled: bool = False
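The five steps above can also be made explicit as a small phase machine. This is purely illustrative scaffolding (the names are not part of any SDK), but it pins down one important branch: a backchannel returns straight to listening without ever reaching the cancel step.

```python
from enum import Enum

class BargeInPhase(str, Enum):
    DETECT = "detect"      # VAD fires during agent playback
    CLASSIFY = "classify"  # backchannel vs. true interruption
    CANCEL = "cancel"      # stop agent audio
    CAPTURE = "capture"    # transcribe the user's speech
    RESUME = "resume"      # respond, then back to listening

def next_phase(phase: BargeInPhase, is_backchannel: bool = False) -> BargeInPhase:
    """Advance the lifecycle; a backchannel skips cancellation
    and returns directly to DETECT."""
    if phase is BargeInPhase.CLASSIFY and is_backchannel:
        return BargeInPhase.DETECT
    order = list(BargeInPhase)  # members in definition order
    return order[(order.index(phase) + 1) % len(order)]
```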
Detecting True Interruptions vs Backchannels
Not every user utterance during agent speech is an interruption. The first challenge is distinguishing between a backchannel ("mm-hmm") and a genuine attempt to take the floor. We covered the basics in the VAD post — here we build a more sophisticated classifier:
@dataclass
class BargeInDetector:
energy_threshold: float = 0.04
duration_threshold: float = 0.6 # seconds
backchannel_words: set = field(default_factory=lambda: {
"uh-huh", "mm-hmm", "yeah", "yes", "ok", "okay",
"right", "sure", "got it", "i see", "mhm",
})
_speech_start: Optional[float] = field(default=None, init=False)
_accumulated_text: str = field(default="", init=False)
def on_user_speech_start(self):
"""Called when VAD detects user speech during agent output."""
self._speech_start = time.time()
self._accumulated_text = ""
def on_partial_transcript(self, text: str) -> Optional[InterruptionType]:
"""Process partial transcription to classify the interruption."""
self._accumulated_text = text.strip().lower()
# Check for backchannel
if self._accumulated_text in self.backchannel_words:
return InterruptionType.BACKCHANNEL
# Check for explicit cancellation
cancel_phrases = {"stop", "never mind", "nevermind", "cancel", "shut up"}
if self._accumulated_text in cancel_phrases:
return InterruptionType.CANCELLATION
# Check for corrections
if self._accumulated_text.startswith(("no ", "not ", "actually ")):
return InterruptionType.CORRECTION
# Check for redirects
if self._accumulated_text.startswith(("can you ", "what about ", "instead ")):
return InterruptionType.REDIRECT
# If speech has been going long enough, it is a real interruption
if self._speech_start and (time.time() - self._speech_start) > self.duration_threshold:
return InterruptionType.REDIRECT
return None # Not enough data yet
The key insight is that classification is progressive. You start making a decision as soon as partial transcription arrives and refine it as more words come in. This minimizes the delay between the user speaking and the agent reacting.
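To see the precedence order in isolation, the same rules can be written as a pure function, a simplified stand-in for on_partial_transcript without the timing state (word sets trimmed for brevity):

```python
from typing import Optional

BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "okay", "right", "sure"}
CANCELS = {"stop", "never mind", "nevermind", "cancel"}

def classify_partial(text: str, speech_duration: float = 0.0) -> Optional[str]:
    """Classify a partial transcript; None means 'keep listening'."""
    t = text.strip().lower()
    if t in BACKCHANNELS:
        return "backchannel"
    if t in CANCELS:
        return "cancellation"
    if t.startswith(("no ", "not ", "actually ")):
        return "correction"
    if t.startswith(("can you ", "what about ", "instead ")):
        return "redirect"
    if speech_duration > 0.6:
        return "redirect"  # sustained speech that matches no pattern
    return None
```

Fed successive partials, "no" returns None (keep listening) while "no i said tuesday" resolves to a correction; the classification firms up as words arrive.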
Muting and Cancelling Agent Output
Once you determine the user is truly interrupting, you need to stop the agent's audio output immediately. With the OpenAI Realtime API, this means sending a cancel event:
import json
async def cancel_agent_response(ws):
    """Cancel the in-progress agent response on the Realtime API."""
    await ws.send(json.dumps({
        "type": "response.cancel",
    }))
async def truncate_audio_output(ws, item_id: str, content_index: int, audio_end_ms: int):
"""Truncate the audio output at the current playback position."""
await ws.send(json.dumps({
"type": "conversation.item.truncate",
"item_id": item_id,
"content_index": content_index,
"audio_end_ms": audio_end_ms,
}))
On the client side, you also need to immediately stop audio playback. If there is buffered audio waiting to be played, flush it:
@dataclass
class AudioPlaybackManager:
_buffer: list = field(default_factory=list, init=False)
_is_playing: bool = field(default=False, init=False)
_muted: bool = field(default=False, init=False)
def mute(self):
"""Immediately stop playback and clear the buffer."""
self._muted = True
self._is_playing = False
self._buffer.clear()
def unmute(self):
"""Allow playback to resume."""
self._muted = False
def enqueue(self, audio_chunk: bytes):
"""Add audio to the playback buffer."""
if not self._muted:
self._buffer.append(audio_chunk)
def flush(self):
"""Clear all buffered audio without playing it."""
self._buffer.clear()
Graceful Cancellation Patterns
Abruptly stopping mid-word sounds jarring. A more polished approach is to acknowledge the interruption explicitly rather than going silent; letting the current word or phrase finish before cutting out is a further refinement:
async def handle_interruption(
ws,
event: InterruptionEvent,
playback: AudioPlaybackManager,
):
"""Handle a classified interruption event."""
if event.type == InterruptionType.BACKCHANNEL:
# Do nothing — agent continues speaking
return
# Stop agent audio
playback.mute()
if event.type == InterruptionType.CANCELLATION:
playback.flush()
await send_agent_message(
ws,
"Understood, I will stop. What would you like to do instead?",
)
elif event.type == InterruptionType.CORRECTION:
playback.flush()
await send_agent_message(
ws,
f"Sorry about that. Let me address your correction: "
f"{event.user_transcript}",
)
elif event.type == InterruptionType.REDIRECT:
playback.flush()
await send_agent_message(
ws,
f"Of course, let me help with that instead.",
)
elif event.type == InterruptionType.CLARIFICATION:
playback.flush()
await send_agent_message(
ws,
"Let me repeat that more clearly.",
)
event.handled = True
playback.unmute()
async def send_agent_message(ws, text: str):
"""Inject a text message for the agent to speak."""
await ws.send(json.dumps({
"type": "conversation.item.create",
"item": {
"type": "message",
"role": "assistant",
"content": [{"type": "input_text", "text": text}],
},
}))
await ws.send(json.dumps({"type": "response.create"}))
Tracking Interruption Context
The agent needs to know what it was saying when interrupted so it can resume or adjust. Track the context:
@dataclass
class ConversationTracker:
_current_response_text: str = field(default="", init=False)
_current_item_id: Optional[str] = field(default=None, init=False)
_interruption_history: list = field(default_factory=list, init=False)
def on_response_text_delta(self, item_id: str, delta: str):
"""Track the agent's response as it streams."""
self._current_item_id = item_id
self._current_response_text += delta
def on_interruption(self, user_text: str) -> InterruptionEvent:
"""Create an interruption event with full context."""
progress = len(self._current_response_text)
event = InterruptionEvent(
timestamp=time.time(),
type=InterruptionType.REDIRECT,
user_transcript=user_text,
agent_was_saying=self._current_response_text,
agent_progress_pct=min(progress / max(progress + 50, 1), 1.0),  # rough heuristic: assumes ~50 chars remained
)
self._interruption_history.append(event)
self._current_response_text = ""
return event
@property
def interruption_rate(self) -> float:
"""Track how often the user interrupts — high rates suggest issues."""
if not self._interruption_history:
return 0.0
recent = [
e for e in self._interruption_history
if time.time() - e.timestamp < 300 # last 5 minutes
]
return len(recent) / 5.0 # interruptions per minute
A high interruption rate is a signal that something is wrong. The agent might be speaking too slowly, providing irrelevant information, or misunderstanding the user. Log and monitor this metric.
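A minimal spike check you might run on the logged metric is sketched below; the threshold values are illustrative, and comparing against a rolling baseline avoids alerting on users who simply interrupt often:

```python
def should_alert(
    rate_per_min: float,
    baseline_per_min: float,
    spike_factor: float = 2.0,
    floor: float = 1.0,
) -> bool:
    """Flag a spike when the current rate is well above baseline
    and above an absolute floor (avoids noise at tiny baselines)."""
    return rate_per_min >= floor and rate_per_min >= spike_factor * baseline_per_min
```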
Production Best Practices
- Always prefer false negatives over false positives — it is better to miss a backchannel than to incorrectly stop a response due to a cough
- Add a minimum speech duration (200-300ms) before triggering barge-in to filter out transient noises
- Track what was interrupted so the agent can offer to continue: "I was explaining the refund policy. Would you like me to continue?"
- Test with real users early — interruption patterns vary wildly between people, cultures, and contexts
- Log every interruption event with timestamps, classification, and user transcript for iterative improvement
- Set up alerts on interruption rate spikes — they often indicate a regression in agent behavior or audio quality
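The minimum-duration filter from the list above is small but high-impact. One way to sketch it over per-frame VAD decisions (the 20 ms frame length and 250 ms threshold are illustrative):

```python
def sustained_speech(frames: list[bool], frame_ms: int = 20, min_ms: int = 250) -> bool:
    """True once speech frames have been continuously active for at
    least min_ms; a single non-speech frame resets the run, so brief
    coughs and clicks never trigger barge-in."""
    run_ms = 0
    for is_speech in frames:
        run_ms = run_ms + frame_ms if is_speech else 0
        if run_ms >= min_ms:
            return True
    return False
```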
Handling interruptions well is what separates a demo-grade voice agent from one that users actually want to talk to. The investment in barge-in logic pays off in every single conversation.
Written by
CallSphere Team