
Building Your First Voice Agent with VoicePipeline

Step-by-step tutorial to build a working voice agent using OpenAI's VoicePipeline — from installing dependencies and capturing microphone audio to streaming agent responses through your speakers.

From Text Agent to Voice Agent

If you have built a text-based agent with the OpenAI Agents SDK, you already have 80% of what you need for a voice agent. The VoicePipeline wraps your existing agent in an audio processing layer — speech goes in, speech comes out, and your agent logic stays exactly the same.

In this tutorial, we will build a complete voice agent from scratch. By the end, you will have a Python script that listens to your microphone, processes your speech through an AI agent, and plays the response through your speakers.

Installation

The voice capabilities are packaged as an optional extra in the Agents SDK:

pip install 'openai-agents[voice]'

This installs the core SDK plus the voice module dependencies, including numpy, websockets, and the audio processing utilities. You also need a library for microphone access and audio playback:

pip install sounddevice numpy

sounddevice provides cross-platform access to your system's audio devices. It works on macOS, Linux, and Windows without additional drivers.

Make sure your OPENAI_API_KEY environment variable is set:

export OPENAI_API_KEY="sk-..."
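It helps to verify the key is actually visible to your script before the pipeline makes its first API call. A small stdlib-only sketch (the `require_api_key` helper is ours, not part of the SDK):

```python
import os

def require_api_key(env=os.environ) -> str:
    """Return the OpenAI API key, failing fast with a clear message."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```

Calling this at startup turns an opaque authentication error mid-pipeline into an immediate, readable failure.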

Defining the Agent

Start by defining a simple agent. This is identical to any text-based agent — the voice pipeline does not change how you define agent behavior:

from agents import Agent

agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant. Follow these rules:
    - Keep responses to 2-3 sentences maximum
    - Use natural, conversational language
    - Avoid bullet points, markdown, or formatted text
    - Never say "as an AI" or "I'm a language model"
    - If you don't understand something, ask for clarification
    """,
)

Notice the instructions emphasize concise, conversational responses. This is critical for voice agents. A 500-word response that reads well on screen becomes a 3-minute monologue when spoken aloud. Voice agent instructions should always bias toward brevity.
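You can sanity-check response length with a quick back-of-the-envelope estimate. TTS output typically lands near 150 words per minute (that figure is a rule of thumb, not an SDK constant):

```python
def spoken_duration_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Estimate how long a response takes to speak aloud."""
    word_count = len(text.split())
    return word_count / words_per_minute * 60.0

# At 150 wpm, a 500-word answer runs about 200 seconds of audio,
# which is why voice instructions should bias hard toward brevity.
```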

Setting Up the VoicePipeline

The pipeline connects your agent to audio input and output:

from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

SingleAgentVoiceWorkflow is the simplest workflow type — one agent handles the entire conversation. The SDK also supports multi-agent workflows with handoffs, but we will start simple.

Capturing Microphone Audio

The pipeline expects audio as a numpy array of 16-bit signed integers at 24kHz mono. Here is how to capture a single utterance from the microphone:

import numpy as np
import sounddevice as sd
from agents.voice import AudioInput

SAMPLE_RATE = 24000
CHANNELS = 1
RECORD_SECONDS = 5

def record_audio(duration: float = RECORD_SECONDS) -> AudioInput:
    """Record audio from the default microphone."""
    print(f"Recording for {duration} seconds...")

    # Record raw audio
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()  # Block until recording is complete

    print("Recording complete.")

    # Flatten to 1D array and wrap in AudioInput
    buffer = audio_data.flatten()
    return AudioInput(buffer=buffer)

The AudioInput class wraps a numpy buffer and tells the pipeline the audio format. The default expectation is 24kHz, mono, int16 — which is exactly what we record.
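If you want to exercise the pipeline without a microphone, you can synthesize a buffer in that exact format with numpy alone. The `synth_tone` helper below is our own sketch, not part of the SDK:

```python
import numpy as np

SAMPLE_RATE = 24000

def synth_tone(freq_hz: float = 440.0, seconds: float = 1.0) -> np.ndarray:
    """Generate a sine tone as 24kHz mono int16, the format AudioInput expects."""
    t = np.arange(int(seconds * SAMPLE_RATE)) / SAMPLE_RATE
    wave = 0.3 * np.sin(2 * np.pi * freq_hz * t)  # keep headroom below full scale
    return (wave * 32767).astype(np.int16)

# Feed it to the pipeline with: AudioInput(buffer=synth_tone())
```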

Running the Pipeline

With audio captured, running the pipeline is a single async call:

import asyncio

async def process_voice(audio: AudioInput):
    """Send audio through the pipeline and collect response audio."""
    result = await pipeline.run(audio)

    response_audio = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            response_audio.append(event.data)

    if response_audio:
        # Concatenate all audio chunks
        full_audio = np.concatenate(response_audio)
        return full_audio
    return None

The pipeline returns a stream of events. The key event type is voice_stream_event_audio, which carries numpy arrays of synthesized speech. Other event types include lifecycle events (stream start, stream end) and transcript events, but the audio events are what we need for playback.

Playing the Response

Once we have the response audio as a numpy array, playing it through the speakers is straightforward:

def play_audio(audio_data: np.ndarray):
    """Play audio through the default speakers."""
    print("Playing response...")
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()  # Block until playback is complete
    print("Playback complete.")

Putting It All Together

Here is the complete script that records your voice, processes it through the agent, and plays the response:

import asyncio
import numpy as np
import sounddevice as sd
from agents import Agent
from agents.voice import AudioInput, VoicePipeline, SingleAgentVoiceWorkflow

SAMPLE_RATE = 24000
CHANNELS = 1

# Define the agent
agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant.
    Keep responses to 2-3 sentences maximum.
    Use natural, conversational language.""",
)

# Create the pipeline
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

def record_audio(duration: float = 5.0) -> AudioInput:
    print(f"Listening for {duration} seconds...")
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()
    return AudioInput(buffer=audio_data.flatten())

def play_audio(audio_data: np.ndarray):
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()

async def main():
    print("Voice Assistant Ready")
    print("=" * 40)

    while True:
        input("Press Enter to speak (Ctrl+C to quit)...")

        # Record
        audio = record_audio(duration=5.0)

        # Process through agent
        print("Thinking...")
        result = await pipeline.run(audio)

        # Collect response audio
        chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                chunks.append(event.data)

        if chunks:
            full_audio = np.concatenate(chunks)
            play_audio(full_audio)
        else:
            print("No audio response received.")

if __name__ == "__main__":
    asyncio.run(main())

Save this as voice_agent.py and run it:

python voice_agent.py

Press Enter, speak for up to 5 seconds, and the assistant will respond through your speakers.

Adding Tools to the Voice Agent

The agent definition supports tools just like text-based agents. Here is an example with a weather lookup tool:

from agents import Agent, function_tool
import random

@function_tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # Simulated — replace with a real API call
    temp = random.randint(15, 35)
    conditions = random.choice(["sunny", "cloudy", "rainy", "windy"])
    return f"The weather in {city} is {temp} degrees and {conditions}."

agent = Agent(
    name="WeatherVoiceAssistant",
    instructions="""You are a weather assistant. When asked about weather,
    use the get_weather tool. Keep responses brief and conversational.""",
    tools=[get_weather],
)

# The rest of the pipeline code stays exactly the same
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

Tools work transparently in the voice pipeline. The STT transcribes "What is the weather in Tokyo?", the agent calls get_weather("Tokyo"), receives the result, formulates a spoken response, and the TTS synthesizes it. The user never knows a tool call happened — they just hear the answer.

Understanding Pipeline Events

The result stream emits several event types beyond audio:

async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        # Audio chunk — numpy array of int16 samples
        chunks.append(event.data)
    elif event.type == "voice_stream_event_lifecycle":
        # Pipeline stage transitions
        print(f"Lifecycle: {event.data}")
    elif event.type == "voice_stream_event_transcript":
        # Agent's text response (before TTS)
        print(f"Transcript: {event.data}")
    elif event.type == "voice_stream_event_error":
        # Something went wrong
        print(f"Error: {event.data}")

The transcript event is especially useful for logging and debugging. It gives you the text that the TTS model will synthesize, so you can verify the agent produced the right response without listening to the audio.

Common Pitfalls

Audio format mismatch. If your microphone records at 48kHz (common default), you need to resample to 24kHz before creating the AudioInput. Use scipy.signal.resample or specify the sample rate in sd.rec.
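A minimal numpy-only resampler looks like the sketch below; it uses linear interpolation, so for production-quality anti-aliased resampling prefer scipy.signal.resample_poly. The `resample_to_24k` name is ours:

```python
import numpy as np

def resample_to_24k(audio: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample int16 mono audio to 24kHz via linear interpolation."""
    target_rate = 24000
    if source_rate == target_rate:
        return audio
    duration = len(audio) / source_rate
    target_len = int(duration * target_rate)
    # Interpolate the waveform onto the new time grid
    src_times = np.arange(len(audio)) / source_rate
    dst_times = np.arange(target_len) / target_rate
    resampled = np.interp(dst_times, src_times, audio.astype(np.float32))
    return resampled.astype(np.int16)
```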

Long recording durations. Recording a fixed 5-second window is simple but wasteful. If the user speaks for 1 second, the pipeline processes 4 seconds of silence. The next post covers StreamedAudioInput with voice activity detection to solve this.

Blocking the event loop. sd.rec and sd.play are blocking calls. In a production system, run them in a thread executor to keep the async event loop responsive:

import asyncio

async def capture_async() -> AudioInput:
    # get_running_loop() is the modern API; get_event_loop() is
    # deprecated inside coroutines. Run the blocking call off-loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, record_audio, 5.0)

Response length. If your agent generates a paragraph, the TTS produces a long audio clip with noticeable generation delay. Optimize your agent instructions to produce concise responses. For voice, 1-3 sentences is ideal.


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
