
StreamedAudioInput: Real-Time Voice Interaction with Activity Detection

Build real-time voice agents using StreamedAudioInput with continuous microphone streaming, voice activity detection (VAD), turn detection, and lifecycle events for natural conversational flow.

The Problem with Fixed-Duration Recording

In the previous tutorial, we recorded audio for a fixed duration (5 seconds), sent it through the pipeline, and played the response. This works for demos, but it fails in real conversations for several reasons:

  • The user might finish speaking in 1 second but then has to wait 4 seconds for the recording to end
  • The user might need more than 5 seconds for a complex question and gets cut off
  • There is no way to interrupt the agent while it is speaking
  • The interaction feels robotic — speak, wait, listen, repeat

Real voice conversations are fluid. People start and stop speaking naturally. They pause mid-thought. They interrupt when they already know the answer. A production voice agent needs to handle all of this.

StreamedAudioInput solves these problems by accepting a continuous stream of audio rather than a fixed buffer. Combined with voice activity detection (VAD), it automatically detects when the user starts and stops speaking, enabling natural turn-taking.

StreamedAudioInput vs AudioInput

The key difference between the two input types:

AudioInput — a complete audio buffer. You record everything, then send it all at once. The pipeline processes it as a single utterance.

StreamedAudioInput — a live audio stream. You push audio chunks as they arrive from the microphone. The pipeline uses VAD to detect speech boundaries and processes each utterance as it completes.

from agents.voice import AudioInput, StreamedAudioInput

# AudioInput — fixed buffer, all at once
audio = AudioInput(buffer=numpy_array)
result = await pipeline.run(audio)

# StreamedAudioInput — continuous stream
streamed = StreamedAudioInput()
result = await pipeline.run(streamed)

# Push chunks as they arrive from the microphone
streamed.add_audio(chunk_1)
streamed.add_audio(chunk_2)
streamed.add_audio(chunk_3)

With StreamedAudioInput, you call pipeline.run() first and then push audio into the stream. The pipeline runs concurrently, processing audio as it arrives.

Voice Activity Detection (VAD)

VAD is the technology that determines when the user is speaking versus when there is silence or background noise. The Agents SDK includes a built-in VAD implementation that runs locally (no API call required).

The VAD works by analyzing audio energy levels and spectral characteristics in real time:

Audio stream: [silence][silence][SPEECH][SPEECH][SPEECH][silence][silence]
VAD output:   [  off  ][  off  ][  on  ][  on  ][  on  ][  off  ][  off  ]
                                ^                      ^
                          speech start             speech end
                       (start listening)          (trigger STT)

When VAD detects the transition from silence to speech, the pipeline starts buffering audio for transcription. When it detects the transition from speech to silence (with a configurable delay), it sends the buffered audio to STT and triggers the agent.
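
The boundary logic above can be sketched with a toy energy-based detector. This is an illustration only: the SDK's built-in VAD also uses spectral features, and the `frame_is_speech` / `detect_turns` names and the 10x score scaling are assumptions for the sketch. The `silence_frames` hangover plays the role of `silence_duration_ms`:

```python
import numpy as np

def frame_is_speech(frame: np.ndarray, threshold: float = 0.5) -> bool:
    """Classify one int16 audio frame as speech using normalized RMS energy."""
    rms = np.sqrt(np.mean((frame.astype(np.float64) / 32768.0) ** 2))
    # Map RMS (roughly 0..0.7 for typical speech) onto a 0..1 score
    score = min(rms * 10.0, 1.0)
    return score >= threshold

def detect_turns(frames, threshold=0.5, silence_frames=7):
    """Yield (start, end) frame indices of detected utterances.

    The turn ends only after silence_frames consecutive non-speech
    frames — the hangover that silence_duration_ms configures.
    """
    start, silent = None, 0
    for i, frame in enumerate(frames):
        if frame_is_speech(frame, threshold):
            if start is None:
                start = i  # transition: silence -> speech
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= silence_frames:
                yield (start, i - silent + 1)  # transition: speech -> silence
                start, silent = None, 0
    if start is not None:
        yield (start, len(frames))
```

With 100ms frames, `silence_frames=7` corresponds to the 700ms `silence_duration_ms` used later in this tutorial.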

Configuring VAD

You can tune VAD sensitivity through the pipeline configuration:

from agents.voice import VoicePipeline, VoicePipelineConfig

config = VoicePipelineConfig(
    # Minimum speech duration to trigger processing (milliseconds)
    min_speech_duration_ms=250,
    # How long silence must last before ending a turn (milliseconds)
    silence_duration_ms=700,
    # VAD sensitivity threshold (0.0 to 1.0)
    # Lower = more sensitive (catches quieter speech, more false positives)
    # Higher = less sensitive (misses quiet speech, fewer false positives)
    vad_threshold=0.5,
    # Audio padding around detected speech (milliseconds)
    prefix_padding_ms=300,
)

pipeline = VoicePipeline(
    workflow=workflow,
    config=config,
)

silence_duration_ms is the most important parameter. Set it too low (200ms) and the agent interrupts natural pauses. Set it too high (2000ms) and the user waits awkwardly after finishing their sentence. 500-800ms works well for most English conversations.

vad_threshold controls sensitivity. In a quiet office, 0.5 works well. In a noisy call center, you might increase it to 0.7 to avoid triggering on background chatter.

prefix_padding_ms adds a buffer before detected speech starts. This prevents clipping the first syllable, which VAD sometimes misses because the energy ramp-up at the start of speech can be gradual.
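
A minimal sketch of how prefix padding can work, assuming 100ms chunks at 24kHz (the `PrefixPaddedBuffer` class and its method names are illustrative, not SDK API): keep a short ring buffer of pre-speech chunks and prepend it when VAD flips on.

```python
from collections import deque

import numpy as np

class PrefixPaddedBuffer:
    """Keep the last few 'silent' chunks so they can be prepended when
    VAD flips on, avoiding a clipped first syllable."""

    def __init__(self, padding_ms: int = 300, chunk_ms: int = 100):
        # Ring buffer holding padding_ms worth of pre-speech audio
        self._pad = deque(maxlen=max(1, padding_ms // chunk_ms))
        self._utterance = []
        self._speaking = False

    def push(self, chunk: np.ndarray, is_speech: bool):
        if is_speech:
            if not self._speaking:
                # Speech just started: include the buffered pre-roll
                self._utterance = list(self._pad)
                self._speaking = True
            self._utterance.append(chunk)
        else:
            # Brief silences inside a turn are handled by the VAD hangover
            # upstream; here non-speech chunks just refresh the pre-roll
            self._pad.append(chunk)

    def finish(self) -> np.ndarray:
        """Return the padded utterance and reset for the next turn."""
        out = (np.concatenate(self._utterance)
               if self._utterance else np.empty(0, np.int16))
        self._utterance, self._speaking = [], False
        return out
```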

Building a Real-Time Voice Agent

Here is a complete implementation that streams microphone audio continuously and handles turn-taking automatically:

import asyncio
import numpy as np
import sounddevice as sd
from agents import Agent
from agents.voice import (
    VoicePipeline,
    VoicePipelineConfig,
    SingleAgentVoiceWorkflow,
    StreamedAudioInput,
)

SAMPLE_RATE = 24000
CHANNELS = 1
CHUNK_SIZE = 2400  # 100ms at 24kHz

agent = Agent(
    name="RealtimeAssistant",
    instructions="""You are a real-time voice assistant.
    Keep all responses under 2 sentences.
    Be conversational and natural.""",
)

workflow = SingleAgentVoiceWorkflow(agent)
config = VoicePipelineConfig(
    silence_duration_ms=700,
    vad_threshold=0.5,
    prefix_padding_ms=300,
)
pipeline = VoicePipeline(workflow=workflow, config=config)


async def microphone_stream(streamed_input: StreamedAudioInput):
    """Continuously capture microphone audio and push to the stream."""
    loop = asyncio.get_running_loop()

    def audio_callback(indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        # indata is a numpy array — push a copy to avoid buffer reuse issues
        chunk = indata[:, 0].copy().astype(np.int16)
        loop.call_soon_threadsafe(streamed_input.add_audio, chunk)

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
        blocksize=CHUNK_SIZE,
        callback=audio_callback,
    ):
        print("Microphone active — speak naturally")
        # Keep the stream open indefinitely
        while True:
            await asyncio.sleep(0.1)


async def handle_pipeline_output(result):
    """Process pipeline output events and play audio."""
    audio_chunks = []

    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            audio_chunks.append(event.data)
        elif event.type == "voice_stream_event_turn_started":
            print("[Turn started — processing speech]")
        elif event.type == "voice_stream_event_turn_ended":
            print("[Turn ended]")
            # Play accumulated audio for this turn
            if audio_chunks:
                full_audio = np.concatenate(audio_chunks)
                sd.play(full_audio, samplerate=SAMPLE_RATE)
                sd.wait()
                audio_chunks = []


async def main():
    print("Real-Time Voice Agent")
    print("=" * 40)
    print("Speak naturally. The agent will respond when you pause.")
    print("Press Ctrl+C to quit.")

    streamed_input = StreamedAudioInput()

    # Start the pipeline with streamed input
    result = await pipeline.run(streamed_input)

    # Run microphone capture and output handling concurrently
    await asyncio.gather(
        microphone_stream(streamed_input),
        handle_pipeline_output(result),
    )


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutting down...")

The critical pattern here is asyncio.gather. The microphone capture and the output handler run as concurrent tasks. Audio flows into the stream from the microphone callback, VAD detects speech boundaries, the pipeline processes each utterance, and the output handler plays the response — all running simultaneously.

Lifecycle Events

The pipeline emits lifecycle events that let you track the conversation state:

async for event in result.stream():
    if event.type == "voice_stream_event_turn_started":
        # VAD detected speech — a new user turn is beginning
        # Use this to show a "listening" indicator in the UI
        print("User is speaking...")

    elif event.type == "voice_stream_event_turn_ended":
        # The turn has been fully processed
        # STT + Agent + TTS are complete for this utterance
        print("Agent responded.")

    elif event.type == "voice_stream_event_audio":
        # A chunk of TTS audio is ready for playback
        play_chunk(event.data)

    elif event.type == "voice_stream_event_transcript":
        # The STT transcript for the user's speech
        print(f"User said: {event.data}")

    elif event.type == "voice_stream_event_error":
        # Something went wrong in the pipeline
        print(f"Pipeline error: {event.data}")

These events are essential for building UI feedback. In a web application, you would use turn_started to show a pulsing microphone icon, transcript to display what the user said, and audio events to stream the response.
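
One way to keep that UI wiring tidy is a small dispatch table mapping event types to handlers. This is a sketch, not SDK API — `make_event_router` is a hypothetical helper, and the event-type strings follow the ones used above:

```python
import asyncio

def make_event_router(handlers: dict):
    """Build an async consumer that dispatches stream events by type.

    handlers maps an event-type string to a callable taking event.data.
    Unknown event types are ignored, so new events don't break the UI.
    """
    async def route(result):
        async for event in result.stream():
            handler = handlers.get(event.type)
            if handler is not None:
                handler(event.data)
    return route
```

Usage mirrors the `async for` loop above, but each concern (transcript display, audio playback, error logging) becomes one entry in the dict instead of one `elif` branch.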

Handling Interruptions

One of the advantages of streaming is that users can interrupt the agent mid-response. When the VAD detects new speech while audio is playing, the pipeline can cancel the current TTS output and start processing the new utterance.

import asyncio
import threading

playback_lock = threading.Event()
playback_lock.set()  # Not playing initially

async def handle_pipeline_output(result):
    async for event in result.stream():
        if event.type == "voice_stream_event_turn_started":
            # New turn — stop any current playback
            sd.stop()
            playback_lock.set()

        elif event.type == "voice_stream_event_audio":
            # Wait for in-flight playback in a worker thread so the
            # blocking Event.wait() does not stall the event loop
            await asyncio.to_thread(playback_lock.wait)
            playback_lock.clear()

            def play_and_signal(audio):
                sd.play(audio, samplerate=SAMPLE_RATE)
                sd.wait()
                playback_lock.set()

            threading.Thread(
                target=play_and_signal,
                args=(event.data,),
                daemon=True,
            ).start()

When a new turn starts (the user speaks again), sd.stop() immediately halts any audio playback. The pipeline processes the new speech, and the previous response is abandoned. This creates a natural interruption flow — exactly how human conversations work.

Turn Detection Strategies

The default VAD-based turn detection works well for most scenarios, but you can implement custom strategies for specific use cases:

Push-to-talk: Disable VAD entirely and use a button press to start and stop recording. Useful for noisy environments or hands-free devices with a physical button.

# Push-to-talk: manually signal when the user is done
streamed_input = StreamedAudioInput()
result = await pipeline.run(streamed_input)

# Start recording on button press
for chunk in microphone_chunks():  # microphone_chunks() stands in for your capture loop
    streamed_input.add_audio(chunk)

# Signal end of speech on button release
streamed_input.close()

Keyword-based: Use a wake word detector before activating the full pipeline. The VAD only processes audio after the wake word is detected.

Hybrid: Use VAD for the initial turn detection but switch to keyword-based detection when the agent asks a yes/no question, reducing false triggers from background noise during short responses.

Production Considerations

When deploying streamed voice agents to production, keep these factors in mind:

Memory management. Each active stream consumes memory for audio buffering. Monitor buffer sizes and implement cleanup for abandoned streams (user disconnects without closing the stream).
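
A sketch of one cleanup approach, assuming you track last-activity time per session (the `SessionWatchdog` class is illustrative, not SDK API): cancel the session task if no audio has arrived within an idle timeout.

```python
import asyncio
import time

class SessionWatchdog:
    """Cancel a session task if no audio arrives within idle_timeout seconds."""

    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self._last_audio = time.monotonic()

    def touch(self):
        # Call this from the audio-ingest path on every received chunk
        self._last_audio = time.monotonic()

    async def watch(self, session_task: asyncio.Task, poll: float = 0.5):
        while not session_task.done():
            if time.monotonic() - self._last_audio > self.idle_timeout:
                session_task.cancel()  # frees the stream's audio buffers
                return
            await asyncio.sleep(poll)
```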

Concurrent sessions. Each pipeline.run() creates a processing pipeline. With 100 concurrent users, you have 100 STT and TTS sessions. Plan your API rate limits and budget accordingly.
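
A simple way to enforce that cap is a semaphore sized against your STT/TTS rate limits; extra callers queue for a slot instead of blowing past the limit. The `SessionLimiter` wrapper below is a sketch under that assumption:

```python
import asyncio

class SessionLimiter:
    """Cap concurrent pipeline sessions; extra callers wait for a slot."""

    def __init__(self, max_sessions: int):
        self._slots = asyncio.Semaphore(max_sessions)
        self._active = 0
        self.peak = 0  # high-water mark, useful for capacity monitoring

    async def run(self, session_coro_fn):
        async with self._slots:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await session_coro_fn()
            finally:
                self._active -= 1
```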

Network jitter. In WebSocket-based deployments, audio chunks may arrive with irregular timing. Buffer at least 200ms of audio before forwarding to VAD to smooth out jitter.
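
A jitter buffer with a 200ms low-water mark can look like this (a sketch; the `JitterBuffer` class name and sizing are assumptions matching this tutorial's 24kHz sample rate):

```python
import numpy as np

class JitterBuffer:
    """Accumulate incoming chunks and release them only once at least
    min_ms of audio is queued, smoothing irregular arrival times."""

    def __init__(self, min_ms: int = 200, sample_rate: int = 24000):
        self._min_samples = sample_rate * min_ms // 1000
        self._pending = []
        self._queued = 0

    def push(self, chunk: np.ndarray):
        self._pending.append(chunk)
        self._queued += len(chunk)

    def drain(self):
        """Return buffered audio if the low-water mark is met, else None."""
        if self._queued < self._min_samples:
            return None
        out = np.concatenate(self._pending)
        self._pending, self._queued = [], 0
        return out
```

Each WebSocket message goes through `push()`, and only non-None `drain()` results are forwarded to VAD.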

Echo cancellation. If the user's microphone picks up the agent's audio output, the VAD may detect it as new speech, creating a feedback loop. Implement acoustic echo cancellation (AEC) on the client side, or mute the microphone during agent playback as a simpler alternative.
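
The simpler mute-during-playback option can be sketched as a gate called from inside the microphone callback (the `agent_speaking` flag and `gated_audio_callback` helper are illustrative names, not SDK API):

```python
import threading

import numpy as np

agent_speaking = threading.Event()  # set while TTS audio is playing

def gated_audio_callback(indata: np.ndarray, push_chunk):
    """Drop microphone chunks while the agent is speaking — a crude
    alternative to acoustic echo cancellation."""
    if agent_speaking.is_set():
        return  # discard: likely our own TTS output leaking into the mic
    push_chunk(indata[:, 0].copy().astype(np.int16))
```

The playback path sets `agent_speaking` before `sd.play()` and clears it after `sd.wait()`; the capture path never sees the agent's own voice, so VAD cannot feed back on it. The trade-off is that true barge-in interruption is lost while muted.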

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Technical Guides

How AI Voice Agents Actually Work: Technical Deep Dive (2026 Edition)

A full technical walkthrough of how modern AI voice agents work — speech-to-text, LLM orchestration, TTS, tool calling, and sub-second latency.

Technical Guides

Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)

A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times.

Technical Guides

Building Voice Agents with the OpenAI Realtime API: Full Tutorial

Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling.

AI Interview Prep

8 AI System Design Interview Questions Actually Asked at FAANG in 2026

Real AI system design interview questions from Google, Meta, OpenAI, and Anthropic. Covers LLM serving, RAG pipelines, recommendation systems, AI agents, and more — with detailed answer frameworks.

AI Interview Prep

8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask

Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.

AI Interview Prep

7 ML Fundamentals Questions That Top AI Companies Still Ask in 2026

Real machine learning fundamentals interview questions from OpenAI, Google DeepMind, Meta, and xAI in 2026. Covers attention mechanisms, KV cache, distributed training, MoE, speculative decoding, and emerging architectures.