Learn Agentic AI

Speech-to-Text and Text-to-Speech for Voice Agent Pipelines

Configure STT and TTS models for OpenAI voice agent pipelines — Whisper integration, language and prompt settings, voice selection, streaming TTS, and custom model implementations.

The Two Sides of Voice

A voice agent pipeline has two audio boundaries: the point where human speech enters the system (STT) and the point where machine-generated speech exits (TTS). How you configure these boundaries determines the voice agent's accuracy, latency, naturalness, and overall user experience.

The OpenAI Agents SDK provides default STT and TTS models that work out of the box, but production voice agents almost always need customization. You may need to support specific languages, reduce transcription errors for domain-specific vocabulary, choose a voice that matches your brand, or optimize for streaming latency.

This post covers the full configuration surface for both STT and TTS in the VoicePipeline.

STT Configuration: Turning Speech into Text

The default STT model in the Agents SDK uses OpenAI's Whisper. You can customize it by creating an OpenAISTTModel instance and passing it to the pipeline:

from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow, OpenAISTTModel
from agents import Agent

agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
)

stt_model = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="CallSphere, VoicePipeline, WebRTC, agentic AI",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt_model,
)

The Model Parameter

whisper-1 is currently the primary model for the OpenAI transcription API. It handles a wide range of languages and accents with strong accuracy. For most applications, the default is sufficient.

Language Hints

Setting language="en" tells Whisper to expect English audio. This is not a hard filter — Whisper will still transcribe other languages if it detects them — but it biases the model toward English, which reduces errors when the audio is ambiguous or noisy.

For multilingual voice agents, you can omit the language parameter and let Whisper auto-detect. Auto-detection works well for clear audio but can misidentify languages in noisy environments or with code-switching speakers.
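As a quick illustration using the constructor from the example above, the two configurations look like this:

```python
from agents.voice import OpenAISTTModel

# Explicit hint — biases Whisper toward English when audio is ambiguous
stt_english = OpenAISTTModel(model="whisper-1", language="en")

# No hint — Whisper auto-detects the spoken language per utterance
stt_multilingual = OpenAISTTModel(model="whisper-1")
```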

Prompt-Based Vocabulary Hints

The prompt parameter is one of the most powerful STT tuning tools available. Whisper uses it as a conditioning prefix that biases the transcription toward specific vocabulary, spelling conventions, and formatting patterns.

# Medical domain — guide Whisper toward medical terminology
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="metformin, lisinopril, hydrochlorothiazide, A1C, systolic, diastolic",
)

# Customer service — guide toward product names and common queries
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="CallSphere, Pro Plan, Enterprise Plan, API credits, webhook, dashboard",
)

Without the prompt, Whisper might transcribe "CallSphere" as "call sphere" or "coal sphere." With the prompt, the model knows the correct spelling and capitalization. This is especially important for proper nouns, brand names, and technical jargon.

Custom STT Models

If you need to use a different STT provider (Deepgram, AssemblyAI, a self-hosted model), you can implement the STTModel protocol:

from agents.voice import STTModel, STTModelSettings
from dataclasses import dataclass

@dataclass
class DeepgramSTTModel:
    api_key: str
    model: str = "nova-2"

    async def transcribe(
        self,
        audio_input,
        settings: STTModelSettings | None = None,
        trace_include_sensitive_data: bool = False,
        trace_include_sensitive_audio_data: bool = False,
    ) -> str:
        """Transcribe audio using Deepgram's API."""
        import httpx

        audio_bytes = audio_input.buffer.tobytes()

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.deepgram.com/v1/listen",
                headers={
                    "Authorization": f"Token {self.api_key}",
                    "Content-Type": "audio/raw;encoding=linear16;sample_rate=24000;channels=1",
                },
                params={"model": self.model},
                content=audio_bytes,
            )
            response.raise_for_status()
            result = response.json()
            return result["results"]["channels"][0]["alternatives"][0]["transcript"]

Then pass it to the pipeline:


stt = DeepgramSTTModel(api_key="your-deepgram-key")
pipeline = VoicePipeline(workflow=workflow, stt_model=stt)

TTS Configuration: Turning Text into Speech

The TTS side controls how your agent sounds. OpenAI offers multiple voices, and the Agents SDK lets you configure the model, voice, and streaming behavior:

from agents.voice import OpenAITTSModel

tts_model = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

pipeline = VoicePipeline(
    workflow=workflow,
    tts_model=tts_model,
)

Model Selection

OpenAI provides two TTS models:

  • tts-1: Optimized for low latency. Slightly lower audio quality but faster generation. Best for real-time voice agents where responsiveness matters.
  • tts-1-hd: Higher audio quality with more natural intonation. Slower generation. Best for pre-recorded content or applications where latency is less critical.

For voice agents, tts-1 is almost always the right choice. The quality difference is subtle, but the latency difference is noticeable in conversation.

Voice Selection

Each model supports multiple voices with distinct characteristics:

Voice     Character
alloy     Neutral, balanced — good default
echo      Warm, conversational
fable     Expressive, storytelling quality
nova      Friendly, upbeat — popular for assistants
onyx      Deep, authoritative
shimmer   Clear, professional

Choose a voice that matches your application's personality. A medical triage agent might use onyx for its authoritative tone. A casual customer service bot might use nova for its friendly energy.

# Professional customer service
tts_professional = OpenAITTSModel(model="tts-1", voice="shimmer")

# Friendly assistant
tts_friendly = OpenAITTSModel(model="tts-1", voice="nova")

# Authoritative medical advisor
tts_medical = OpenAITTSModel(model="tts-1", voice="onyx")

Streaming TTS

The VoicePipeline streams TTS by default. As the agent generates text, the pipeline sends completed sentences or phrases to the TTS model and starts receiving audio before the full response is generated. This significantly reduces perceived latency.

The streaming flow looks like this:

Agent generates: "The weather in Tokyo is"  --> [buffered]
Agent generates: "25 degrees and sunny."    --> [sent to TTS]
                                              --> [audio chunk 1 received]
Agent generates: "Perfect for a walk"       --> [sent to TTS]
                                              --> [audio chunk 2 received]
Agent generates: "in the park."             --> [sent to TTS]
                                              --> [audio chunk 3 received]

The pipeline buffers text until it hits a sentence boundary (period, exclamation mark, question mark) and then sends that sentence to TTS. This means the user starts hearing audio after the first complete sentence, not after the entire response is generated.
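That buffering behavior can be sketched in plain Python. This is a simplified illustration of the idea, not the pipeline's actual implementation; `sentence_chunks` is a hypothetical helper:

```python
import re

def sentence_chunks(text_stream):
    """Buffer streamed text fragments and yield complete sentences.

    Simplified sketch: accumulate fragments until a sentence
    terminator (., !, ?) appears, then emit the buffered sentence.
    """
    buffer = ""
    for piece in text_stream:
        buffer += piece
        while True:
            # A terminator followed by whitespace (or end of buffer)
            match = re.search(r"[.!?](\s+|$)", buffer)
            if not match:
                break
            sentence = buffer[:match.end()].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Feeding it the fragments from the diagram above yields two sentences, each emitted as soon as its terminator arrives rather than at the end of the stream.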

Custom TTS Models

Like STT, you can plug in alternative TTS providers by implementing the TTSModel protocol:

from dataclasses import dataclass

from agents.voice import TTSModel, TTSModelSettings

@dataclass
class ElevenLabsTTSModel:
    api_key: str
    voice_id: str = "21m00Tcm4TlvDq8ikWAM"  # Rachel

    async def run(
        self,
        text: str,
        settings: TTSModelSettings | None = None,
    ):
        """Stream audio from ElevenLabs API."""
        import httpx

        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream",
                headers={"xi-api-key": self.api_key},
                json={
                    "text": text,
                    "model_id": "eleven_monolingual_v1",
                    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
                },
            ) as response:
                async for chunk in response.aiter_bytes(chunk_size=4096):
                    yield chunk
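Wiring in the custom TTS model mirrors the STT case, assuming the `VoicePipeline` constructor shown earlier:

```python
tts = ElevenLabsTTSModel(api_key="your-elevenlabs-key")
pipeline = VoicePipeline(workflow=workflow, tts_model=tts)
```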

Combining STT and TTS Configuration

Here is a complete pipeline with both STT and TTS customized for a customer service voice agent:

from agents import Agent, function_tool
from agents.voice import (
    VoicePipeline,
    SingleAgentVoiceWorkflow,
    OpenAISTTModel,
    OpenAITTSModel,
)

@function_tool
def lookup_order(order_id: str) -> str:
    """Look up order status by ID."""
    return f"Order {order_id} shipped on March 12, expected delivery March 16."

agent = Agent(
    name="CustomerService",
    instructions="""You are a customer service voice agent for an online store.
    Keep responses under 3 sentences. Use a warm, helpful tone.
    Always confirm the order ID before looking it up.""",
    tools=[lookup_order],
)

stt = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="order number, tracking, refund, exchange, shipping",
)

tts = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt,
    tts_model=tts,
)

Measuring and Optimizing Latency

STT and TTS are the two largest contributors to pipeline latency outside of the LLM itself. Here are practical optimizations:

For STT:

  • Set the language explicitly rather than relying on auto-detection (saves 50-100ms)
  • Trim silence from the beginning and end of audio before transcription
  • Use shorter audio chunks — 5 seconds of audio transcribes faster than 30 seconds
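Edge-silence trimming is easy to do yourself before handing audio to the pipeline. Here is a minimal amplitude-threshold sketch; the `threshold` value is an assumption you would tune for your microphone and noise floor, not an SDK parameter:

```python
def trim_silence(samples, threshold=500):
    """Trim leading and trailing silence from a sequence of PCM samples.

    Any sample whose absolute amplitude stays below `threshold` at
    either edge is treated as silence and dropped.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # nothing but silence
    return samples[loud[0] : loud[-1] + 1]
```

The same idea applies to a NumPy buffer of int16 samples; only the index scan changes.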

For TTS:

  • Use tts-1 instead of tts-1-hd for real-time applications
  • Keep agent responses short — TTS generation time scales linearly with text length
  • Take advantage of streaming — the pipeline sends sentences to TTS as they complete

Measuring latency:

import time

async def timed_pipeline_run(audio):
    t0 = time.perf_counter()
    result = await pipeline.run(audio)

    first_audio_time = None
    chunks = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            if first_audio_time is None:
                first_audio_time = time.perf_counter()
            chunks.append(event.data)

    total_time = time.perf_counter() - t0
    time_to_first_audio = first_audio_time - t0 if first_audio_time else None

    if time_to_first_audio is not None:
        print(f"Time to first audio: {time_to_first_audio:.3f}s")
    else:
        print("No audio events received")
    print(f"Total pipeline time: {total_time:.3f}s")
    return chunks

Time to first audio is the metric that matters most for perceived responsiveness. Total pipeline time matters for overall throughput.

Written by CallSphere Team