StreamedAudioInput: Real-Time Voice Interaction with Activity Detection
Build real-time voice agents using StreamedAudioInput with continuous microphone streaming, voice activity detection (VAD), turn detection, and lifecycle events for natural conversational flow.
The Problem with Fixed-Duration Recording
In the previous tutorial, we recorded audio for a fixed duration (5 seconds), sent it through the pipeline, and played the response. This works for demos, but it fails in real conversations for several reasons:
- The user might finish speaking in 1 second but then has to wait 4 seconds for the recording to end
- The user might need more than 5 seconds for a complex question and get cut off
- There is no way to interrupt the agent while it is speaking
- The interaction feels robotic — speak, wait, listen, repeat
Real voice conversations are fluid. People start and stop speaking naturally. They pause mid-thought. They interrupt when they already know the answer. A production voice agent needs to handle all of this.
StreamedAudioInput solves these problems by accepting a continuous stream of audio rather than a fixed buffer. Combined with voice activity detection (VAD), it automatically detects when the user starts and stops speaking, enabling natural turn-taking.
StreamedAudioInput vs AudioInput
The key difference between the two input types:
AudioInput — a complete audio buffer. You record everything, then send it all at once. The pipeline processes it as a single utterance.
StreamedAudioInput — a live audio stream. You push audio chunks as they arrive from the microphone. The pipeline uses VAD to detect speech boundaries and processes each utterance as it completes.
```python
from agents.voice import AudioInput, StreamedAudioInput

# AudioInput — fixed buffer, all at once
audio = AudioInput(buffer=numpy_array)
result = await pipeline.run(audio)

# StreamedAudioInput — continuous stream
streamed = StreamedAudioInput()
result = await pipeline.run(streamed)

# Push chunks as they arrive from the microphone
# (add_audio is a coroutine, so it must be awaited)
await streamed.add_audio(chunk_1)
await streamed.add_audio(chunk_2)
await streamed.add_audio(chunk_3)
```
With StreamedAudioInput, you call pipeline.run() first and then push audio into the stream. The pipeline runs concurrently, processing audio as it arrives.
Voice Activity Detection (VAD)
VAD is the technology that determines when the user is speaking versus when there is silence or background noise. The Agents SDK includes a built-in VAD implementation that runs locally (no API call required).
The VAD works by analyzing audio energy levels and spectral characteristics in real time:
```
Audio stream: [silence][silence][SPEECH][SPEECH][SPEECH][silence][silence]
VAD output:   [  off  ][  off  ][  on  ][  on  ][  on  ][  off  ][  off  ]
                                ^                       ^
                                speech start            speech end
                                (start listening)       (trigger STT)
```
When VAD detects the transition from silence to speech, the pipeline starts buffering audio for transcription. When it detects the transition from speech to silence (with a configurable delay), it sends the buffered audio to STT and triggers the agent.
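The transitions in the diagram above can be sketched with a toy energy-based detector. This is illustrative only: the SDK's VAD is more sophisticated (it also weighs spectral characteristics, as described above), and the threshold and frame counts here are arbitrary choices.

```python
import numpy as np

def frame_energy(frame: np.ndarray) -> float:
    """RMS energy of an int16 audio frame, normalized to the 0..1 range."""
    samples = frame.astype(np.float64) / 32768.0
    return float(np.sqrt(np.mean(samples ** 2)))

def detect_speech_bounds(frames, threshold=0.1, silence_frames_to_end=2):
    """Return (start, end) frame indices of the first utterance, or None.

    Mirrors the diagram: the off->on transition marks speech start, and
    speech ends only after `silence_frames_to_end` consecutive quiet
    frames (the role silence_duration_ms plays in the real pipeline).
    """
    start = None
    silence_run = 0
    for i, frame in enumerate(frames):
        speaking = frame_energy(frame) >= threshold
        if start is None:
            if speaking:
                start = i                      # silence -> speech
        elif speaking:
            silence_run = 0
        else:
            silence_run += 1
            if silence_run >= silence_frames_to_end:
                return start, i - silence_frames_to_end + 1  # speech -> silence
    return (start, len(frames)) if start is not None else None
```

With eight frames shaped like the diagram (two quiet, three loud, three quiet), this returns (2, 5): speech spans frames 2 through 4.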
Configuring VAD
You can tune VAD sensitivity through the pipeline configuration:
```python
from agents.voice import VoicePipeline, VoicePipelineConfig

config = VoicePipelineConfig(
    # Minimum speech duration to trigger processing (milliseconds)
    min_speech_duration_ms=250,
    # How long silence must last before ending a turn (milliseconds)
    silence_duration_ms=700,
    # VAD sensitivity threshold (0.0 to 1.0)
    # Lower = more sensitive (catches quieter speech, more false positives)
    # Higher = less sensitive (misses quiet speech, fewer false positives)
    vad_threshold=0.5,
    # Audio padding around detected speech (milliseconds)
    prefix_padding_ms=300,
)

pipeline = VoicePipeline(
    workflow=workflow,
    config=config,
)
```
silence_duration_ms is the most important parameter. Set it too low (200ms) and the agent interrupts natural pauses. Set it too high (2000ms) and the user waits awkwardly after finishing their sentence. 500-800ms works well for most English conversations.
vad_threshold controls sensitivity. In a quiet office, 0.5 works well. In a noisy call center, you might increase it to 0.7 to avoid triggering on background chatter.
prefix_padding_ms adds a buffer before detected speech starts. This prevents clipping the first syllable, which VAD sometimes misses because the energy ramp-up at the start of speech can be gradual.
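To reason about these millisecond settings, it helps to convert them to sample counts at the pipeline's audio rate. A small sketch (24 kHz is the rate this tutorial uses, not an SDK requirement):

```python
SAMPLE_RATE = 24000  # matches the rate used throughout this tutorial

def ms_to_samples(ms: int) -> int:
    """Convert a millisecond setting to a sample count at 24 kHz."""
    return SAMPLE_RATE * ms // 1000

def padded_speech_start(detected_start: int, prefix_padding_ms: int = 300) -> int:
    """Back up the detected speech start by the prefix padding (clamped to 0)."""
    return max(0, detected_start - ms_to_samples(prefix_padding_ms))

# silence_duration_ms=700 means 16,800 samples of silence end a turn
print(ms_to_samples(700))          # 16800
# If VAD flags speech at sample 48,000, transcription starts 7,200 samples
# (300 ms) earlier, recovering the gradual energy ramp-up of the first syllable
print(padded_speech_start(48000))  # 40800
```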
Building a Real-Time Voice Agent
Here is a complete implementation that streams microphone audio continuously and handles turn-taking automatically:
```python
import asyncio

import numpy as np
import sounddevice as sd

from agents import Agent
from agents.voice import (
    VoicePipeline,
    VoicePipelineConfig,
    SingleAgentVoiceWorkflow,
    StreamedAudioInput,
)

SAMPLE_RATE = 24000
CHANNELS = 1
CHUNK_SIZE = 2400  # 100ms at 24kHz

agent = Agent(
    name="RealtimeAssistant",
    instructions="""You are a real-time voice assistant.
    Keep all responses under 2 sentences.
    Be conversational and natural.""",
)

workflow = SingleAgentVoiceWorkflow(agent)

config = VoicePipelineConfig(
    silence_duration_ms=700,
    vad_threshold=0.5,
    prefix_padding_ms=300,
)

pipeline = VoicePipeline(workflow=workflow, config=config)


async def microphone_stream(streamed_input: StreamedAudioInput):
    """Continuously capture microphone audio and push to the stream."""
    loop = asyncio.get_running_loop()

    def audio_callback(indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        # indata is a numpy array — push a copy to avoid buffer reuse issues.
        # add_audio is a coroutine, so schedule it on the event loop from
        # this non-async PortAudio callback thread.
        chunk = indata[:, 0].copy()
        asyncio.run_coroutine_threadsafe(streamed_input.add_audio(chunk), loop)

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
        blocksize=CHUNK_SIZE,
        callback=audio_callback,
    ):
        print("Microphone active — speak naturally")
        # Keep the stream open indefinitely
        while True:
            await asyncio.sleep(0.1)


async def handle_pipeline_output(result):
    """Process pipeline output events and play audio."""
    audio_chunks = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            audio_chunks.append(event.data)
        elif event.type == "voice_stream_event_turn_started":
            print("[Turn started — processing speech]")
        elif event.type == "voice_stream_event_turn_ended":
            print("[Turn ended]")
            # Play accumulated audio for this turn
            if audio_chunks:
                full_audio = np.concatenate(audio_chunks)
                sd.play(full_audio, samplerate=SAMPLE_RATE)
                sd.wait()
                audio_chunks = []


async def main():
    print("Real-Time Voice Agent")
    print("=" * 40)
    print("Speak naturally. The agent will respond when you pause.")
    print("Press Ctrl+C to quit.")

    streamed_input = StreamedAudioInput()

    # Start the pipeline with streamed input
    result = await pipeline.run(streamed_input)

    # Run microphone capture and output handling concurrently
    await asyncio.gather(
        microphone_stream(streamed_input),
        handle_pipeline_output(result),
    )


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutting down...")
```
The critical pattern here is asyncio.gather. The microphone capture and the output handler run as concurrent tasks. Audio flows into the stream from the microphone callback, VAD detects speech boundaries, the pipeline processes each utterance, and the output handler plays the response — all running simultaneously.
Lifecycle Events
The pipeline emits lifecycle events that let you track the conversation state:
```python
async for event in result.stream():
    if event.type == "voice_stream_event_turn_started":
        # VAD detected speech — a new user turn is beginning
        # Use this to show a "listening" indicator in the UI
        print("User is speaking...")
    elif event.type == "voice_stream_event_turn_ended":
        # The turn has been fully processed
        # STT + Agent + TTS are complete for this utterance
        print("Agent responded.")
    elif event.type == "voice_stream_event_audio":
        # A chunk of TTS audio is ready for playback
        play_chunk(event.data)
    elif event.type == "voice_stream_event_transcript":
        # The STT transcript for the user's speech
        print(f"User said: {event.data}")
    elif event.type == "voice_stream_event_error":
        # Something went wrong in the pipeline
        print(f"Pipeline error: {event.data}")
```
These events are essential for building UI feedback. In a web application, you would use turn_started to show a pulsing microphone icon, transcript to display what the user said, and audio events to stream the response.
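One sketch of that wiring is to map each event to a JSON message a browser client could consume. FakeEvent and the message schema here are illustrative assumptions, not part of the SDK:

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FakeEvent:
    """Stand-in for the SDK's stream events (a type string plus payload)."""
    type: str
    data: Any = None

def event_to_ui_message(event) -> Optional[str]:
    """Translate a pipeline event into a JSON message for the web UI."""
    if event.type == "voice_stream_event_turn_started":
        return json.dumps({"kind": "listening"})  # pulse the mic icon
    if event.type == "voice_stream_event_transcript":
        return json.dumps({"kind": "transcript", "text": event.data})
    if event.type == "voice_stream_event_turn_ended":
        return json.dumps({"kind": "idle"})
    return None  # audio chunks are better sent over a separate binary channel
```

In a WebSocket handler you would call this inside the `async for event in result.stream()` loop and send any non-None result to the client.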
Handling Interruptions
One of the advantages of streaming is that users can interrupt the agent mid-response. When the VAD detects new speech while audio is playing, the pipeline can cancel the current TTS output and start processing the new utterance.
```python
import asyncio
import threading

playback_done = threading.Event()
playback_done.set()  # Not playing initially

async def handle_pipeline_output(result):
    async for event in result.stream():
        if event.type == "voice_stream_event_turn_started":
            # New turn — stop any current playback
            sd.stop()
            playback_done.set()
        elif event.type == "voice_stream_event_audio":
            # Wait for the previous chunk in a worker thread so the event
            # loop stays free to process turn_started interruptions
            await asyncio.to_thread(playback_done.wait)
            playback_done.clear()

            def play_and_signal(audio):
                sd.play(audio, samplerate=SAMPLE_RATE)
                sd.wait()
                playback_done.set()

            threading.Thread(
                target=play_and_signal,
                args=(event.data,),
                daemon=True,
            ).start()
```
When a new turn starts (the user speaks again), sd.stop() immediately halts any audio playback. The pipeline processes the new speech, and the previous response is abandoned. This creates a natural interruption flow — exactly how human conversations work.
Turn Detection Strategies
The default VAD-based turn detection works well for most scenarios, but you can implement custom strategies for specific use cases:
Push-to-talk: Disable VAD entirely and use a button press to start and stop recording. Useful for noisy environments or hands-free devices with a physical button.
```python
# Push-to-talk: manually signal when the user is done
streamed_input = StreamedAudioInput()
result = await pipeline.run(streamed_input)

# Start recording on button press
for chunk in microphone_chunks():
    await streamed_input.add_audio(chunk)

# Signal end of speech on button release
streamed_input.close()
```
Keyword-based: Use a wake word detector before activating the full pipeline. The VAD only processes audio after the wake word is detected.
Hybrid: Use VAD for the initial turn detection but switch to keyword-based detection when the agent asks a yes/no question, reducing false triggers from background noise during short responses.
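A minimal gate for the keyword-based strategy forwards nothing until a detector fires, then passes everything through. Here `wake_detector` is a hypothetical callback; in practice it would wrap a wake-word engine such as openWakeWord or Porcupine:

```python
from typing import Callable, Iterable

def gate_on_wake_word(
    chunks: Iterable,
    wake_detector: Callable[[object], bool],
    forward: Callable[[object], None],
) -> bool:
    """Forward audio to the pipeline only after the wake word is heard.

    Returns True if the wake word was detected in the stream.
    """
    awake = False
    for chunk in chunks:
        if not awake and wake_detector(chunk):
            awake = True  # from here on, every chunk reaches the pipeline
        if awake:
            forward(chunk)
    return awake
```

In the streaming setup above, `forward` would schedule `streamed_input.add_audio(chunk)`; whether to include the wake-word chunk itself is a design choice (this sketch includes it).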
Production Considerations
When deploying streamed voice agents to production, keep these factors in mind:
Memory management. Each active stream consumes memory for audio buffering. Monitor buffer sizes and implement cleanup for abandoned streams (user disconnects without closing the stream).
Concurrent sessions. Each pipeline.run() creates a processing pipeline. With 100 concurrent users, you have 100 STT and TTS sessions. Plan your API rate limits and budget accordingly.
Network jitter. In WebSocket-based deployments, audio chunks may arrive with irregular timing. Buffer at least 200ms of audio before forwarding to VAD to smooth out jitter.
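The 200ms smoothing buffer can be sketched as an accumulator that only releases audio once enough samples have arrived (sizes assume the 24 kHz mono int16 format used above):

```python
import numpy as np

SAMPLE_RATE = 24000
MIN_BUFFER_SAMPLES = SAMPLE_RATE // 5  # 200 ms of audio

class JitterBuffer:
    """Accumulate irregular network chunks; release only >=200 ms batches."""

    def __init__(self):
        self._pending = []
        self._count = 0

    def push(self, chunk: np.ndarray):
        """Add a chunk; return a smoothed batch once 200 ms is buffered."""
        self._pending.append(chunk)
        self._count += len(chunk)
        if self._count >= MIN_BUFFER_SAMPLES:
            batch = np.concatenate(self._pending)
            self._pending, self._count = [], 0
            return batch
        return None
```

On the server, each WebSocket message would go through `push()`, and only non-None batches would be forwarded to `streamed_input.add_audio()`.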
Echo cancellation. If the user's microphone picks up the agent's audio output, the VAD may detect it as new speech, creating a feedback loop. Implement acoustic echo cancellation (AEC) on the client side, or mute the microphone during agent playback as a simpler alternative.
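The mute-during-playback alternative is a one-flag gate that drops microphone chunks whenever the agent is speaking. A sketch; note that this trades away barge-in, since interruptions require a live microphone, which is why client-side AEC is the better long-term fix:

```python
class MicGate:
    """Drop microphone audio while the agent's response is playing."""

    def __init__(self):
        self.agent_speaking = False

    def filter(self, chunk):
        """Return the chunk if the mic is live, or None to drop it."""
        return None if self.agent_speaking else chunk
```

In the output handler, set `gate.agent_speaking = True` before `sd.play()` and back to False after `sd.wait()` returns; the microphone callback then only forwards chunks that `filter()` lets through.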
Written by
CallSphere Team