
StreamedAudioInput: Real-Time Voice Interaction with Activity Detection

Build real-time voice agents using StreamedAudioInput with continuous microphone streaming, voice activity detection (VAD), turn detection, and lifecycle events for natural conversational flow.

The Problem with Fixed-Duration Recording

In the previous tutorial, we recorded audio for a fixed duration (5 seconds), sent it through the pipeline, and played the response. This works for demos, but it fails in real conversations for several reasons:

  • The user might finish speaking in 1 second but then has to wait 4 seconds for the recording to end
  • The user might need more than 5 seconds for a complex question and gets cut off
  • There is no way to interrupt the agent while it is speaking
  • The interaction feels robotic — speak, wait, listen, repeat

Real voice conversations are fluid. People start and stop speaking naturally. They pause mid-thought. They interrupt when they already know the answer. A production voice agent needs to handle all of this.

StreamedAudioInput solves these problems by accepting a continuous stream of audio rather than a fixed buffer. Combined with voice activity detection (VAD), it automatically detects when the user starts and stops speaking, enabling natural turn-taking.

StreamedAudioInput vs AudioInput

The key difference between the two input types:

AudioInput — a complete audio buffer. You record everything, then send it all at once. The pipeline processes it as a single utterance.

StreamedAudioInput — a live audio stream. You push audio chunks as they arrive from the microphone. The pipeline uses VAD to detect speech boundaries and processes each utterance as it completes.

from agents.voice import AudioInput, StreamedAudioInput

# AudioInput — fixed buffer, all at once
audio = AudioInput(buffer=numpy_array)
result = await pipeline.run(audio)

# StreamedAudioInput — continuous stream
streamed = StreamedAudioInput()
result = await pipeline.run(streamed)

# Push chunks as they arrive from the microphone
streamed.add_audio(chunk_1)
streamed.add_audio(chunk_2)
streamed.add_audio(chunk_3)

With StreamedAudioInput, you call pipeline.run() first and then push audio into the stream. The pipeline runs concurrently, processing audio as it arrives.

Voice Activity Detection (VAD)

VAD is the technology that determines when the user is speaking versus when there is silence or background noise. The Agents SDK includes a built-in VAD implementation that runs locally (no API call required).

The VAD works by analyzing audio energy levels and spectral characteristics in real time:

Audio stream: [silence][silence][SPEECH][SPEECH][SPEECH][silence][silence]
VAD output:   [  off  ][  off  ][  on  ][  on  ][  on  ][  off  ][  off  ]
                                ^                      ^
                          speech start             speech end
                       (start listening)          (trigger STT)

When VAD detects the transition from silence to speech, the pipeline starts buffering audio for transcription. When it detects the transition from speech to silence (with a configurable delay), it sends the buffered audio to STT and triggers the agent.
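
The boundary logic above can be sketched with a toy energy-based detector. This is an illustration only: the SDK's built-in VAD also uses spectral features, and the `frame_is_speech` / `detect_turns` names and the 10x score scaling are assumptions for the sketch. The `silence_frames` hangover plays the role of `silence_duration_ms`:

```python
import numpy as np

def frame_is_speech(frame: np.ndarray, threshold: float = 0.5) -> bool:
    """Classify one int16 audio frame as speech using normalized RMS energy."""
    rms = np.sqrt(np.mean((frame.astype(np.float64) / 32768.0) ** 2))
    # Map RMS (roughly 0..0.7 for typical speech) onto a 0..1 score
    score = min(rms * 10.0, 1.0)
    return score >= threshold

def detect_turns(frames, threshold=0.5, silence_frames=7):
    """Yield (start, end) frame indices of detected utterances.

    The turn ends only after silence_frames consecutive non-speech
    frames — the hangover that silence_duration_ms configures.
    """
    start, silent = None, 0
    for i, frame in enumerate(frames):
        if frame_is_speech(frame, threshold):
            if start is None:
                start = i  # transition: silence -> speech
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= silence_frames:
                yield (start, i - silent + 1)  # transition: speech -> silence
                start, silent = None, 0
    if start is not None:
        yield (start, len(frames))
```

With 100ms frames, `silence_frames=7` corresponds to the 700ms `silence_duration_ms` used later in this tutorial.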

Configuring VAD

You can tune VAD sensitivity through the pipeline configuration:

from agents.voice import VoicePipeline, VoicePipelineConfig

config = VoicePipelineConfig(
    # Minimum speech duration to trigger processing (milliseconds)
    min_speech_duration_ms=250,
    # How long silence must last before ending a turn (milliseconds)
    silence_duration_ms=700,
    # VAD sensitivity threshold (0.0 to 1.0)
    # Lower = more sensitive (catches quieter speech, more false positives)
    # Higher = less sensitive (misses quiet speech, fewer false positives)
    vad_threshold=0.5,
    # Audio padding around detected speech (milliseconds)
    prefix_padding_ms=300,
)

pipeline = VoicePipeline(
    workflow=workflow,
    config=config,
)

silence_duration_ms is the most important parameter. Set it too low (200ms) and the agent interrupts natural pauses. Set it too high (2000ms) and the user waits awkwardly after finishing their sentence. 500-800ms works well for most English conversations.

vad_threshold controls sensitivity. In a quiet office, 0.5 works well. In a noisy call center, you might increase it to 0.7 to avoid triggering on background chatter.

prefix_padding_ms adds a buffer before detected speech starts. This prevents clipping the first syllable, which VAD sometimes misses because the energy ramp-up at the start of speech can be gradual.
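
A minimal sketch of how prefix padding can work, assuming 100ms chunks at 24kHz (the `PrefixPaddedBuffer` class and its method names are illustrative, not SDK API): keep a short ring buffer of pre-speech chunks and prepend it when VAD flips on.

```python
from collections import deque

import numpy as np

class PrefixPaddedBuffer:
    """Keep the last few 'silent' chunks so they can be prepended when
    VAD flips on, avoiding a clipped first syllable."""

    def __init__(self, padding_ms: int = 300, chunk_ms: int = 100):
        # Ring buffer holding padding_ms worth of pre-speech audio
        self._pad = deque(maxlen=max(1, padding_ms // chunk_ms))
        self._utterance = []
        self._speaking = False

    def push(self, chunk: np.ndarray, is_speech: bool):
        if is_speech:
            if not self._speaking:
                # Speech just started: include the buffered pre-roll
                self._utterance = list(self._pad)
                self._speaking = True
            self._utterance.append(chunk)
        else:
            # Brief silences inside a turn are handled by the VAD hangover
            # upstream; here non-speech chunks just refresh the pre-roll
            self._pad.append(chunk)

    def finish(self) -> np.ndarray:
        """Return the padded utterance and reset for the next turn."""
        out = (np.concatenate(self._utterance)
               if self._utterance else np.empty(0, np.int16))
        self._utterance, self._speaking = [], False
        return out
```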

Building a Real-Time Voice Agent

Here is a complete implementation that streams microphone audio continuously and handles turn-taking automatically:

import asyncio
import numpy as np
import sounddevice as sd
from agents import Agent
from agents.voice import (
    VoicePipeline,
    VoicePipelineConfig,
    SingleAgentVoiceWorkflow,
    StreamedAudioInput,
)

SAMPLE_RATE = 24000
CHANNELS = 1
CHUNK_SIZE = 2400  # 100ms at 24kHz

agent = Agent(
    name="RealtimeAssistant",
    instructions="""You are a real-time voice assistant.
    Keep all responses under 2 sentences.
    Be conversational and natural.""",
)

workflow = SingleAgentVoiceWorkflow(agent)
config = VoicePipelineConfig(
    silence_duration_ms=700,
    vad_threshold=0.5,
    prefix_padding_ms=300,
)
pipeline = VoicePipeline(workflow=workflow, config=config)


async def microphone_stream(streamed_input: StreamedAudioInput):
    """Continuously capture microphone audio and push to the stream."""
    loop = asyncio.get_running_loop()

    def audio_callback(indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        # indata is a numpy array — push a copy to avoid buffer reuse issues
        chunk = indata[:, 0].copy().astype(np.int16)
        loop.call_soon_threadsafe(streamed_input.add_audio, chunk)

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
        blocksize=CHUNK_SIZE,
        callback=audio_callback,
    ):
        print("Microphone active — speak naturally")
        # Keep the stream open indefinitely
        while True:
            await asyncio.sleep(0.1)


async def handle_pipeline_output(result):
    """Process pipeline output events and play audio."""
    audio_chunks = []

    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            audio_chunks.append(event.data)
        elif event.type == "voice_stream_event_turn_started":
            print("[Turn started — processing speech]")
        elif event.type == "voice_stream_event_turn_ended":
            print("[Turn ended]")
            # Play accumulated audio for this turn
            if audio_chunks:
                full_audio = np.concatenate(audio_chunks)
                sd.play(full_audio, samplerate=SAMPLE_RATE)
                sd.wait()
                audio_chunks = []


async def main():
    print("Real-Time Voice Agent")
    print("=" * 40)
    print("Speak naturally. The agent will respond when you pause.")
    print("Press Ctrl+C to quit.")

    streamed_input = StreamedAudioInput()

    # Start the pipeline with streamed input
    result = await pipeline.run(streamed_input)

    # Run microphone capture and output handling concurrently
    await asyncio.gather(
        microphone_stream(streamed_input),
        handle_pipeline_output(result),
    )


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutting down...")

The critical pattern here is asyncio.gather. The microphone capture and the output handler run as concurrent tasks. Audio flows into the stream from the microphone callback, VAD detects speech boundaries, the pipeline processes each utterance, and the output handler plays the response — all running simultaneously.

Lifecycle Events

The pipeline emits lifecycle events that let you track the conversation state:

async for event in result.stream():
    if event.type == "voice_stream_event_turn_started":
        # VAD detected speech — a new user turn is beginning
        # Use this to show a "listening" indicator in the UI
        print("User is speaking...")

    elif event.type == "voice_stream_event_turn_ended":
        # The turn has been fully processed
        # STT + Agent + TTS are complete for this utterance
        print("Agent responded.")

    elif event.type == "voice_stream_event_audio":
        # A chunk of TTS audio is ready for playback
        play_chunk(event.data)

    elif event.type == "voice_stream_event_transcript":
        # The STT transcript for the user's speech
        print(f"User said: {event.data}")

    elif event.type == "voice_stream_event_error":
        # Something went wrong in the pipeline
        print(f"Pipeline error: {event.data}")

These events are essential for building UI feedback. In a web application, you would use turn_started to show a pulsing microphone icon, transcript to display what the user said, and audio events to stream the response.
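
One way to keep that UI wiring tidy is a small dispatch table mapping event types to handlers. This is a sketch, not SDK API — `make_event_router` is a hypothetical helper, and the event-type strings follow the ones used above:

```python
import asyncio

def make_event_router(handlers: dict):
    """Build an async consumer that dispatches stream events by type.

    handlers maps an event-type string to a callable taking event.data.
    Unknown event types are ignored, so new events don't break the UI.
    """
    async def route(result):
        async for event in result.stream():
            handler = handlers.get(event.type)
            if handler is not None:
                handler(event.data)
    return route
```

Usage mirrors the `async for` loop above, but each concern (transcript display, audio playback, error logging) becomes one entry in the dict instead of one `elif` branch.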

Handling Interruptions

One of the advantages of streaming is that users can interrupt the agent mid-response. When the VAD detects new speech while audio is playing, the pipeline can cancel the current TTS output and start processing the new utterance.

import asyncio
import threading

playback_lock = threading.Event()
playback_lock.set()  # Not playing initially

async def handle_pipeline_output(result):
    async for event in result.stream():
        if event.type == "voice_stream_event_turn_started":
            # New turn — stop any current playback
            sd.stop()
            playback_lock.set()

        elif event.type == "voice_stream_event_audio":
            # Wait for in-flight playback in a worker thread so the
            # blocking Event.wait() does not stall the event loop
            await asyncio.to_thread(playback_lock.wait)
            playback_lock.clear()

            def play_and_signal(audio):
                sd.play(audio, samplerate=SAMPLE_RATE)
                sd.wait()
                playback_lock.set()

            threading.Thread(
                target=play_and_signal,
                args=(event.data,),
                daemon=True,
            ).start()

When a new turn starts (the user speaks again), sd.stop() immediately halts any audio playback. The pipeline processes the new speech, and the previous response is abandoned. This creates a natural interruption flow — exactly how human conversations work.

Turn Detection Strategies

The default VAD-based turn detection works well for most scenarios, but you can implement custom strategies for specific use cases:

Push-to-talk: Disable VAD entirely and use a button press to start and stop recording. Useful for noisy environments or hands-free devices with a physical button.

# Push-to-talk: manually signal when the user is done
streamed_input = StreamedAudioInput()
result = await pipeline.run(streamed_input)

# Start recording on button press
for chunk in microphone_chunks():  # microphone_chunks() stands in for your capture loop
    streamed_input.add_audio(chunk)

# Signal end of speech on button release
streamed_input.close()

Keyword-based: Use a wake word detector before activating the full pipeline. The VAD only processes audio after the wake word is detected.

Hybrid: Use VAD for the initial turn detection but switch to keyword-based detection when the agent asks a yes/no question, reducing false triggers from background noise during short responses.

Production Considerations

When deploying streamed voice agents to production, keep these factors in mind:

Memory management. Each active stream consumes memory for audio buffering. Monitor buffer sizes and implement cleanup for abandoned streams (user disconnects without closing the stream).
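
A sketch of one cleanup approach, assuming you track last-activity time per session (the `SessionWatchdog` class is illustrative, not SDK API): cancel the session task if no audio has arrived within an idle timeout.

```python
import asyncio
import time

class SessionWatchdog:
    """Cancel a session task if no audio arrives within idle_timeout seconds."""

    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self._last_audio = time.monotonic()

    def touch(self):
        # Call this from the audio-ingest path on every received chunk
        self._last_audio = time.monotonic()

    async def watch(self, session_task: asyncio.Task, poll: float = 0.5):
        while not session_task.done():
            if time.monotonic() - self._last_audio > self.idle_timeout:
                session_task.cancel()  # frees the stream's audio buffers
                return
            await asyncio.sleep(poll)
```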

Concurrent sessions. Each pipeline.run() creates a processing pipeline. With 100 concurrent users, you have 100 STT and TTS sessions. Plan your API rate limits and budget accordingly.
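
A simple way to enforce that cap is a semaphore sized against your STT/TTS rate limits; extra callers queue for a slot instead of blowing past the limit. The `SessionLimiter` wrapper below is a sketch under that assumption:

```python
import asyncio

class SessionLimiter:
    """Cap concurrent pipeline sessions; extra callers wait for a slot."""

    def __init__(self, max_sessions: int):
        self._slots = asyncio.Semaphore(max_sessions)
        self._active = 0
        self.peak = 0  # high-water mark, useful for capacity monitoring

    async def run(self, session_coro_fn):
        async with self._slots:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await session_coro_fn()
            finally:
                self._active -= 1
```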

Network jitter. In WebSocket-based deployments, audio chunks may arrive with irregular timing. Buffer at least 200ms of audio before forwarding to VAD to smooth out jitter.
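
A jitter buffer with a 200ms low-water mark can look like this (a sketch; the `JitterBuffer` class name and sizing are assumptions matching this tutorial's 24kHz sample rate):

```python
import numpy as np

class JitterBuffer:
    """Accumulate incoming chunks and release them only once at least
    min_ms of audio is queued, smoothing irregular arrival times."""

    def __init__(self, min_ms: int = 200, sample_rate: int = 24000):
        self._min_samples = sample_rate * min_ms // 1000
        self._pending = []
        self._queued = 0

    def push(self, chunk: np.ndarray):
        self._pending.append(chunk)
        self._queued += len(chunk)

    def drain(self):
        """Return buffered audio if the low-water mark is met, else None."""
        if self._queued < self._min_samples:
            return None
        out = np.concatenate(self._pending)
        self._pending, self._queued = [], 0
        return out
```

Each WebSocket message goes through `push()`, and only non-None `drain()` results are forwarded to VAD.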

Echo cancellation. If the user's microphone picks up the agent's audio output, the VAD may detect it as new speech, creating a feedback loop. Implement acoustic echo cancellation (AEC) on the client side, or mute the microphone during agent playback as a simpler alternative.
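
The simpler mute-during-playback option can be sketched as a gate called from inside the microphone callback (the `agent_speaking` flag and `gated_audio_callback` helper are illustrative names, not SDK API):

```python
import threading

import numpy as np

agent_speaking = threading.Event()  # set while TTS audio is playing

def gated_audio_callback(indata: np.ndarray, push_chunk):
    """Drop microphone chunks while the agent is speaking — a crude
    alternative to acoustic echo cancellation."""
    if agent_speaking.is_set():
        return  # discard: likely our own TTS output leaking into the mic
    push_chunk(indata[:, 0].copy().astype(np.int16))
```

The playback path sets `agent_speaking` before `sd.play()` and clears it after `sd.wait()`; the capture path never sees the agent's own voice, so VAD cannot feed back on it. The trade-off is that true barge-in interruption is lost while muted.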

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Technical Guides

How AI Voice Agents Actually Work: Technical Deep Dive (2026 Edition)

A full technical walkthrough of how modern AI voice agents work — speech-to-text, LLM orchestration, TTS, tool calling, and sub-second latency.

Technical Guides

Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)

A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times.

Technical Guides

Building Voice Agents with the OpenAI Realtime API: Full Tutorial

Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling.

AI Interview Prep

8 AI System Design Interview Questions Actually Asked at FAANG in 2026

Real AI system design interview questions from Google, Meta, OpenAI, and Anthropic. Covers LLM serving, RAG pipelines, recommendation systems, AI agents, and more — with detailed answer frameworks.

AI Interview Prep

8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask

Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.

AI Interview Prep

7 ML Fundamentals Questions That Top AI Companies Still Ask in 2026

Real machine learning fundamentals interview questions from OpenAI, Google DeepMind, Meta, and xAI in 2026. Covers attention mechanisms, KV cache, distributed training, MoE, speculative decoding, and emerging architectures.