Building Your First Voice Agent with VoicePipeline
Step-by-step tutorial to build a working voice agent using OpenAI's VoicePipeline — from installing dependencies and capturing microphone audio to streaming agent responses through your speakers.
From Text Agent to Voice Agent
If you have built a text-based agent with the OpenAI Agents SDK, you already have 80% of what you need for a voice agent. The VoicePipeline wraps your existing agent in an audio processing layer — speech goes in, speech comes out, and your agent logic stays exactly the same.
In this tutorial, we will build a complete voice agent from scratch. By the end, you will have a Python script that listens to your microphone, processes your speech through an AI agent, and plays the response through your speakers.
Installation
The voice capabilities are packaged as an optional extra in the Agents SDK:
pip install 'openai-agents[voice]'
This installs the core SDK plus the voice module dependencies including numpy, websockets, and the audio processing utilities. You also need a library for microphone access and audio playback:
pip install sounddevice numpy
sounddevice provides cross-platform access to your system's audio devices. It works on macOS, Linux, and Windows without additional drivers.
Make sure your OPENAI_API_KEY environment variable is set:
export OPENAI_API_KEY="sk-..."
Defining the Agent
Start by defining a simple agent. This is identical to any text-based agent — the voice pipeline does not change how you define agent behavior:
from agents import Agent
agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant. Follow these rules:
    - Keep responses to 2-3 sentences maximum
    - Use natural, conversational language
    - Avoid bullet points, markdown, or formatted text
    - Never say "as an AI" or "I'm a language model"
    - If you don't understand something, ask for clarification
    """,
)
Notice the instructions emphasize concise, conversational responses. This is critical for voice agents. A 500-word response that reads well on screen becomes a 3-minute monologue when spoken aloud. Voice agent instructions should always bias toward brevity.
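The brevity guidance is easy to sanity-check with back-of-the-envelope math. Assuming a typical conversational TTS rate of roughly 150 words per minute (an assumption; actual rates vary by voice and speed setting), a quick calculation shows why long responses hurt:

```python
# Rough speech-duration estimate, assuming ~150 words per minute
# (typical conversational TTS rate; actual rates vary by voice).
WORDS_PER_MINUTE = 150

def spoken_duration_seconds(word_count: int) -> float:
    """Estimate how long a response takes to speak aloud."""
    return word_count / WORDS_PER_MINUTE * 60

print(spoken_duration_seconds(40))   # a 2-3 sentence reply: ~16 seconds
print(spoken_duration_seconds(500))  # a 500-word reply: 200 seconds, over 3 minutes
```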
Setting Up the VoicePipeline
The pipeline connects your agent to audio input and output:
from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)
SingleAgentVoiceWorkflow is the simplest workflow type — one agent handles the entire conversation. The SDK also supports multi-agent workflows with handoffs, but we will start simple.
Capturing Microphone Audio
The pipeline expects audio as a numpy array of 16-bit signed integers at 24kHz mono. Here is how to capture a single utterance from the microphone:
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput
SAMPLE_RATE = 24000
CHANNELS = 1
RECORD_SECONDS = 5
def record_audio(duration: float = RECORD_SECONDS) -> AudioInput:
    """Record audio from the default microphone."""
    print(f"Recording for {duration} seconds...")
    # Record raw audio
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()  # Block until recording is complete
    print("Recording complete.")
    # Flatten to 1-D array and wrap in AudioInput
    buffer = audio_data.flatten()
    return AudioInput(buffer=buffer)
The AudioInput class wraps a numpy buffer and tells the pipeline the audio format. The default expectation is 24kHz, mono, int16 — which is exactly what we record.
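Because format mismatches fail in confusing ways downstream, it can be worth validating the buffer before wrapping it. A minimal sketch (the check_buffer helper is ours, not part of the SDK):

```python
import numpy as np

SAMPLE_RATE = 24000  # samples per second, mono

def check_buffer(buffer: np.ndarray) -> float:
    """Validate dtype/shape and return the clip duration in seconds."""
    assert buffer.dtype == np.int16, f"expected int16, got {buffer.dtype}"
    assert buffer.ndim == 1, f"expected mono 1-D array, got shape {buffer.shape}"
    return len(buffer) / SAMPLE_RATE

# A 5-second silent clip: 5 * 24000 = 120000 samples
silent = np.zeros(5 * SAMPLE_RATE, dtype=np.int16)
print(check_buffer(silent))  # 5.0
```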
Running the Pipeline
With audio captured, running the pipeline is a single async call:
import asyncio
async def process_voice(audio: AudioInput):
    """Send audio through the pipeline and collect response audio."""
    result = await pipeline.run(audio)
    response_audio = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            response_audio.append(event.data)
    if response_audio:
        # Concatenate all audio chunks
        full_audio = np.concatenate(response_audio)
        return full_audio
    return None
The pipeline returns a stream of events. The key event type is voice_stream_event_audio, which carries numpy arrays of synthesized speech. Other event types include lifecycle events (stream start, stream end) and transcript events, but the audio events are what we need for playback.
Playing the Response
Once we have the response audio as a numpy array, playing it through the speakers is straightforward:
def play_audio(audio_data: np.ndarray):
    """Play audio through the default speakers."""
    print("Playing response...")
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()  # Block until playback is complete
    print("Playback complete.")
Putting It All Together
Here is the complete script that records your voice, processes it through the agent, and plays the response:
import asyncio
import numpy as np
import sounddevice as sd

from agents import Agent
from agents.voice import AudioInput, VoicePipeline, SingleAgentVoiceWorkflow

SAMPLE_RATE = 24000
CHANNELS = 1

# Define the agent
agent = Agent(
    name="VoiceAssistant",
    instructions="""You are a friendly voice assistant.
    Keep responses to 2-3 sentences maximum.
    Use natural, conversational language.""",
)

# Create the pipeline
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

def record_audio(duration: float = 5.0) -> AudioInput:
    print(f"Listening for {duration} seconds...")
    audio_data = sd.rec(
        int(duration * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype=np.int16,
    )
    sd.wait()
    return AudioInput(buffer=audio_data.flatten())

def play_audio(audio_data: np.ndarray):
    sd.play(audio_data, samplerate=SAMPLE_RATE)
    sd.wait()

async def main():
    print("Voice Assistant Ready")
    print("=" * 40)
    while True:
        input("Press Enter to speak (Ctrl+C to quit)...")
        # Record
        audio = record_audio(duration=5.0)
        # Process through agent
        print("Thinking...")
        result = await pipeline.run(audio)
        # Collect response audio
        chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                chunks.append(event.data)
        if chunks:
            full_audio = np.concatenate(chunks)
            play_audio(full_audio)
        else:
            print("No audio response received.")

if __name__ == "__main__":
    asyncio.run(main())
Save this as voice_agent.py and run it:
python voice_agent.py
Press Enter, speak for up to 5 seconds, and the assistant will respond through your speakers.
Adding Tools to the Voice Agent
The agent definition supports tools just like text-based agents. Here is an example with a weather lookup tool:
from agents import Agent, function_tool
import random

@function_tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # Simulated — replace with a real API call
    temp = random.randint(15, 35)
    conditions = random.choice(["sunny", "cloudy", "rainy", "windy"])
    return f"The weather in {city} is {temp} degrees and {conditions}."

agent = Agent(
    name="WeatherVoiceAssistant",
    instructions="""You are a weather assistant. When asked about weather,
    use the get_weather tool. Keep responses brief and conversational.""",
    tools=[get_weather],
)

# The rest of the pipeline code stays exactly the same
workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)
Tools work transparently in the voice pipeline. The STT transcribes "What is the weather in Tokyo?", the agent calls get_weather("Tokyo"), receives the result, formulates a spoken response, and the TTS synthesizes it. The user never knows a tool call happened — they just hear the answer.
Understanding Pipeline Events
The result stream emits several event types beyond audio:
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        # Audio chunk — numpy array of int16 samples
        chunks.append(event.data)
    elif event.type == "voice_stream_event_lifecycle":
        # Pipeline stage transitions
        print(f"Lifecycle: {event.data}")
    elif event.type == "voice_stream_event_transcript":
        # Agent's text response (before TTS)
        print(f"Transcript: {event.data}")
    elif event.type == "voice_stream_event_error":
        # Something went wrong
        print(f"Error: {event.data}")
The transcript event is especially useful for logging and debugging. It gives you the text that the TTS model will synthesize, so you can verify the agent produced the right response without listening to the audio.
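One way to use transcript events is to append them to a log as they arrive. The sketch below fakes the event objects with a small dataclass so it runs standalone; VoiceEvent and log_transcript are our illustrative names, not SDK API — in the real loop you would call log_transcript from the async for body:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("voice_agent")

@dataclass
class VoiceEvent:
    """Stand-in for the pipeline's stream events, for illustration only."""
    type: str
    data: str

def log_transcript(event) -> Optional[str]:
    """Log and return transcript text; ignore other event types."""
    if event.type == "voice_stream_event_transcript":
        logger.info("agent said: %s", event.data)
        return event.data
    return None

events = [
    VoiceEvent("voice_stream_event_lifecycle", "turn_started"),
    VoiceEvent("voice_stream_event_transcript", "It is sunny in Tokyo."),
]
transcripts = [t for e in events if (t := log_transcript(e)) is not None]
print(transcripts)  # ['It is sunny in Tokyo.']
```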
Common Pitfalls
Audio format mismatch. If your microphone records at 48kHz (common default), you need to resample to 24kHz before creating the AudioInput. Use scipy.signal.resample or specify the sample rate in sd.rec.
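For the common 48 kHz case, halving the rate is a clean rational ratio, so scipy's polyphase resampler fits well (resample_poly avoids the FFT edge artifacts of plain resample; note scipy is an extra dependency here, and the to_24k helper is our own sketch):

```python
import numpy as np
from scipy.signal import resample_poly

def to_24k(audio_48k: np.ndarray) -> np.ndarray:
    """Downsample 48 kHz int16 audio to the 24 kHz the pipeline expects."""
    # resample_poly works in float; convert back to int16 afterwards
    resampled = resample_poly(audio_48k.astype(np.float32), up=1, down=2)
    return np.clip(resampled, -32768, 32767).astype(np.int16)

one_second_48k = np.zeros(48000, dtype=np.int16)
print(len(to_24k(one_second_48k)))  # 24000
```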
Long recording durations. Recording a fixed 5-second window is simple but wasteful. If the user speaks for 1 second, the pipeline processes 4 seconds of silence. The next post covers StreamedAudioInput with voice activity detection to solve this.
Blocking the event loop. Strictly speaking, sd.rec and sd.play return immediately; it is the sd.wait() call that blocks, which makes our record_audio and play_audio functions blocking. In a production system, run them in a thread executor to keep the async event loop responsive:
import asyncio

# Inside a coroutine, get the running loop (get_event_loop is deprecated there)
loop = asyncio.get_running_loop()
audio = await loop.run_in_executor(None, record_audio, 5.0)
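On Python 3.9+, asyncio.to_thread is a tidier spelling of the same executor pattern. The sketch below substitutes a fake recorder for record_audio (time.sleep stands in for the blocking sd.wait) so it runs without an audio device:

```python
import asyncio
import time

def fake_record(duration: float) -> str:
    """Stand-in for record_audio: blocks the way sd.wait() would."""
    time.sleep(0.1)  # simulate the blocking capture
    return f"recorded {duration}s"

async def main() -> str:
    # Runs the blocking call in a worker thread; the event loop stays free
    return await asyncio.to_thread(fake_record, 5.0)

result = asyncio.run(main())
print(result)  # recorded 5.0s
```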
Response length. If your agent generates a paragraph, the TTS produces a long audio clip with noticeable generation delay. Optimize your agent instructions to produce concise responses. For voice, 1-3 sentences is ideal.
Written by
CallSphere Team