Building Your First Voice Agent with VoicePipeline
Step-by-step tutorial to build a working voice agent using OpenAI's VoicePipeline — from installing dependencies and capturing microphone audio to streaming agent responses through your speakers.
From Text Agent to Voice Agent
If you have built a text-based agent with the OpenAI Agents SDK, you already have 80% of what you need for a voice agent. The VoicePipeline wraps your existing agent in an audio processing layer — speech goes in, speech comes out, and your agent logic stays exactly the same.
In this tutorial, we will build a complete voice agent from scratch. By the end, you will have a Python script that listens to your microphone, processes your speech through an AI agent, and plays the response through your speakers.
Installation
The voice capabilities are packaged as an optional extra in the Agents SDK:
pip install 'openai-agents[voice]'
This installs the core SDK plus the voice module dependencies including numpy, websockets, and the audio processing utilities. You also need a library for microphone access and audio playback:
pip install sounddevice numpy
sounddevice provides cross-platform access to your system's audio devices. It works on macOS, Linux, and Windows without additional drivers.
Make sure your OPENAI_API_KEY environment variable is set:
export OPENAI_API_KEY="sk-..."
Defining the Agent
Start by defining a simple agent. This is identical to any text-based agent — the voice pipeline does not change how you define agent behavior:
from agents import Agent
agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant. Follow these rules:
    - Keep responses to 2-3 sentences maximum
    - Use natural, conversational language
    - Avoid bullet points, markdown, or formatted text
    - Never say "as an AI" or "I'm a language model"
    - If you don't understand something, ask for clarification
    """,
)
Notice the instructions emphasize concise, conversational responses. This is critical for voice agents. A 500-word response that reads well on screen becomes a 3-minute monologue when spoken aloud. Voice agent instructions should always bias toward brevity.
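The brevity guidance is easy to sanity-check with back-of-the-envelope math. Assuming a typical conversational TTS rate of roughly 150 words per minute (an assumption; actual rates vary by voice and speed setting), a quick calculation shows why long responses hurt:

```python
# Rough speech-duration estimate, assuming ~150 words per minute
# (typical conversational TTS rate; actual rates vary by voice).
WORDS_PER_MINUTE = 150

def spoken_duration_seconds(word_count: int) -> float:
    """Estimate how long a response takes to speak aloud."""
    return word_count / WORDS_PER_MINUTE * 60

print(spoken_duration_seconds(40))   # a 2-3 sentence reply: ~16 seconds
print(spoken_duration_seconds(500))  # a 500-word reply: 200 seconds, over 3 minutes
```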
Setting Up the VoicePipeline
The pipeline connects your agent to audio input and output:
from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)
SingleAgentVoiceWorkflow is the simplest workflow type — one agent handles the entire conversation. The SDK also supports multi-agent workflows with handoffs, but we will start simple.
Capturing Microphone Audio
The pipeline expects audio as a numpy array of 16-bit signed integers at 24kHz mono. Here is how to capture a single utterance from the microphone:
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput
SAMPLE_RATE = 24000
CHANNELS = 1
RECORD_SECONDS = 5
def record_audio(duration: float = RECORD_SECONDS) -> AudioInput:
    """Record audio from the default microphone."""
    print(f"Recording for {duration} seconds...")
    # Record raw audio
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()  # Block until recording is complete
    print("Recording complete.")
    # Flatten to 1-D array and wrap in AudioInput
    buffer = audio_data.flatten()
    return AudioInput(buffer=buffer)
The AudioInput class wraps a numpy buffer and tells the pipeline the audio format. The default expectation is 24kHz, mono, int16 — which is exactly what we record.
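Because format mismatches fail in confusing ways downstream, it can be worth validating the buffer before wrapping it. A minimal sketch (the check_buffer helper is ours, not part of the SDK):

```python
import numpy as np

SAMPLE_RATE = 24000  # samples per second, mono

def check_buffer(buffer: np.ndarray) -> float:
    """Validate dtype/shape and return the clip duration in seconds."""
    assert buffer.dtype == np.int16, f"expected int16, got {buffer.dtype}"
    assert buffer.ndim == 1, f"expected mono 1-D array, got shape {buffer.shape}"
    return len(buffer) / SAMPLE_RATE

# A 5-second silent clip: 5 * 24000 = 120000 samples
silent = np.zeros(5 * SAMPLE_RATE, dtype=np.int16)
print(check_buffer(silent))  # 5.0
```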
Running the Pipeline
With audio captured, running the pipeline is a single async call:
import asyncio
async def process_voice(audio: AudioInput):
    """Send audio through the pipeline and collect response audio."""
    result = await pipeline.run(audio)
    response_audio = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            response_audio.append(event.data)
    if response_audio:
        # Concatenate all audio chunks
        full_audio = np.concatenate(response_audio)
        return full_audio
    return None
The pipeline returns a stream of events. The key event type is voice_stream_event_audio, which carries numpy arrays of synthesized speech. Other event types include lifecycle events (stream start, stream end) and transcript events, but the audio events are what we need for playback.
Playing the Response
Once we have the response audio as a numpy array, playing it through the speakers is straightforward:
def play_audio(audio_data: np.ndarray):
    """Play audio through the default speakers."""
    print("Playing response...")
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()  # Block until playback is complete
    print("Playback complete.")
Putting It All Together
Here is the complete script that records your voice, processes it through the agent, and plays the response:
import asyncio
import numpy as np
import sounddevice as sd

from agents import Agent
from agents.voice import AudioInput, VoicePipeline, SingleAgentVoiceWorkflow

SAMPLE_RATE = 24000
CHANNELS = 1

# Define the agent
agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant.
    Keep responses to 2-3 sentences maximum.
    Use natural, conversational language.""",
)

# Create the pipeline
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

def record_audio(duration: float = 5.0) -> AudioInput:
    print(f"Listening for {duration} seconds...")
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()
    return AudioInput(buffer=audio_data.flatten())

def play_audio(audio_data: np.ndarray):
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()

async def main():
    print("Voice Assistant Ready")
    print("=" * 40)
    while True:
        input("Press Enter to speak (Ctrl+C to quit)...")
        # Record
        audio = record_audio(duration=5.0)
        # Process through agent
        print("Thinking...")
        result = await pipeline.run(audio)
        # Collect response audio
        chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                chunks.append(event.data)
        if chunks:
            full_audio = np.concatenate(chunks)
            play_audio(full_audio)
        else:
            print("No audio response received.")

if __name__ == "__main__":
    asyncio.run(main())
Save this as voice_agent.py and run it:
python voice_agent.py
Press Enter, speak for up to 5 seconds, and the assistant will respond through your speakers.
Adding Tools to the Voice Agent
The agent definition supports tools just like text-based agents. Here is an example with a weather lookup tool:
from agents import Agent, function_tool
import random

@function_tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # Simulated — replace with a real API call
    temp = random.randint(15, 35)
    conditions = random.choice(["sunny", "cloudy", "rainy", "windy"])
    return f"The weather in {city} is {temp} degrees and {conditions}."

agent = Agent(
    name="WeatherVoiceAssistant",
    instructions="""You are a weather assistant. When asked about weather,
    use the get_weather tool. Keep responses brief and conversational.""",
    tools=[get_weather],
)

# The rest of the pipeline code stays exactly the same
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)
Tools work transparently in the voice pipeline. The STT transcribes "What is the weather in Tokyo?", the agent calls get_weather("Tokyo"), receives the result, formulates a spoken response, and the TTS synthesizes it. The user never knows a tool call happened — they just hear the answer.
Understanding Pipeline Events
The result stream emits several event types beyond audio:
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        # Audio chunk — numpy array of int16 samples
        chunks.append(event.data)
    elif event.type == "voice_stream_event_lifecycle":
        # Pipeline stage transitions
        print(f"Lifecycle: {event.data}")
    elif event.type == "voice_stream_event_transcript":
        # Agent's text response (before TTS)
        print(f"Transcript: {event.data}")
    elif event.type == "voice_stream_event_error":
        # Something went wrong
        print(f"Error: {event.data}")
The transcript event is especially useful for logging and debugging. It gives you the text that the TTS model will synthesize, so you can verify the agent produced the right response without listening to the audio.
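One way to use transcript events is to append them to a log as they arrive. The sketch below fakes the event objects with a small dataclass so it runs standalone; VoiceEvent and log_transcript are our illustrative names, not SDK API — in the real loop you would call log_transcript from the async for body:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("voice_agent")

@dataclass
class VoiceEvent:
    """Stand-in for the pipeline's stream events, for illustration only."""
    type: str
    data: str

def log_transcript(event) -> Optional[str]:
    """Log and return transcript text; ignore other event types."""
    if event.type == "voice_stream_event_transcript":
        logger.info("agent said: %s", event.data)
        return event.data
    return None

events = [
    VoiceEvent("voice_stream_event_lifecycle", "turn_started"),
    VoiceEvent("voice_stream_event_transcript", "It is sunny in Tokyo."),
]
transcripts = [t for e in events if (t := log_transcript(e)) is not None]
print(transcripts)  # ['It is sunny in Tokyo.']
```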
Common Pitfalls
Audio format mismatch. If your microphone records at 48kHz (common default), you need to resample to 24kHz before creating the AudioInput. Use scipy.signal.resample or specify the sample rate in sd.rec.
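For the common 48 kHz case, halving the rate is a clean rational ratio, so scipy's polyphase resampler fits well (resample_poly avoids the FFT edge artifacts of plain resample; note scipy is an extra dependency here, and the to_24k helper is our own sketch):

```python
import numpy as np
from scipy.signal import resample_poly

def to_24k(audio_48k: np.ndarray) -> np.ndarray:
    """Downsample 48 kHz int16 audio to the 24 kHz the pipeline expects."""
    # resample_poly works in float; convert back to int16 afterwards
    resampled = resample_poly(audio_48k.astype(np.float32), up=1, down=2)
    return np.clip(resampled, -32768, 32767).astype(np.int16)

one_second_48k = np.zeros(48000, dtype=np.int16)
print(len(to_24k(one_second_48k)))  # 24000
```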
Long recording durations. Recording a fixed 5-second window is simple but wasteful. If the user speaks for 1 second, the pipeline processes 4 seconds of silence. The next post covers StreamedAudioInput with voice activity detection to solve this.
Blocking the event loop. Strictly speaking, sd.rec and sd.play return immediately; it is the sd.wait() call that blocks, which makes our record_audio and play_audio functions blocking. In a production system, run them in a thread executor to keep the async event loop responsive:
import asyncio

# Inside a coroutine, get the running loop (get_event_loop is deprecated there)
loop = asyncio.get_running_loop()
audio = await loop.run_in_executor(None, record_audio, 5.0)
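On Python 3.9+, asyncio.to_thread is a tidier spelling of the same executor pattern. The sketch below substitutes a fake recorder for record_audio (time.sleep stands in for the blocking sd.wait) so it runs without an audio device:

```python
import asyncio
import time

def fake_record(duration: float) -> str:
    """Stand-in for record_audio: blocks the way sd.wait() would."""
    time.sleep(0.1)  # simulate the blocking capture
    return f"recorded {duration}s"

async def main() -> str:
    # Runs the blocking call in a worker thread; the event loop stays free
    return await asyncio.to_thread(fake_record, 5.0)

result = asyncio.run(main())
print(result)  # recorded 5.0s
```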
Response length. If your agent generates a paragraph, the TTS produces a long audio clip with noticeable generation delay. Optimize your agent instructions to produce concise responses. For voice, 1-3 sentences is ideal.
Written by
CallSphere Team