Learn Agentic AI

Speech-to-Text and Text-to-Speech for Voice Agent Pipelines

Configure STT and TTS models for OpenAI voice agent pipelines — Whisper integration, language and prompt settings, voice selection, streaming TTS, and custom model implementations.

The Two Sides of Voice

A voice agent pipeline has two audio boundaries: the point where human speech enters the system (STT) and the point where machine-generated speech exits (TTS). How you configure these boundaries determines the voice agent's accuracy, latency, naturalness, and overall user experience.

The OpenAI Agents SDK provides default STT and TTS models that work out of the box, but production voice agents almost always need customization. You may need to support specific languages, reduce transcription errors for domain-specific vocabulary, choose a voice that matches your brand, or optimize for streaming latency.

This post covers the full configuration surface for both STT and TTS in the VoicePipeline.

STT Configuration: Turning Speech into Text

The default STT model in the Agents SDK uses OpenAI's Whisper. You can customize it by creating an OpenAISTTModel instance and passing it to the pipeline:

from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow, OpenAISTTModel
from agents import Agent

agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
)

stt_model = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="CallSphere, VoicePipeline, WebRTC, agentic AI",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt_model,
)

The Model Parameter

whisper-1 is currently the primary model for the OpenAI transcription API. It handles a wide range of languages and accents with strong accuracy. For most applications, the default is sufficient.

Language Hints

Setting language="en" tells Whisper to expect English audio. This is not a hard filter — Whisper will still transcribe other languages if it detects them — but it biases the model toward English, which reduces errors when the audio is ambiguous or noisy.

For multilingual voice agents, you can omit the language parameter and let Whisper auto-detect. Auto-detection works well for clear audio but can misidentify languages in noisy environments or with code-switching speakers.
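As a quick illustration using the constructor from the example above, the two configurations look like this:

```python
from agents.voice import OpenAISTTModel

# Explicit hint — biases Whisper toward English when audio is ambiguous
stt_english = OpenAISTTModel(model="whisper-1", language="en")

# No hint — Whisper auto-detects the spoken language per utterance
stt_multilingual = OpenAISTTModel(model="whisper-1")
```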

Prompt-Based Vocabulary Hints

The prompt parameter is one of the most powerful STT tuning tools available. Whisper uses it as a conditioning prefix that biases the transcription toward specific vocabulary, spelling conventions, and formatting patterns.

# Medical domain — guide Whisper toward medical terminology
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="metformin, lisinopril, hydrochlorothiazide, A1C, systolic, diastolic",
)

# Customer service — guide toward product names and common queries
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="CallSphere, Pro Plan, Enterprise Plan, API credits, webhook, dashboard",
)

Without the prompt, Whisper might transcribe "CallSphere" as "call sphere" or "coal sphere." With the prompt, the model knows the correct spelling and capitalization. This is especially important for proper nouns, brand names, and technical jargon.

Custom STT Models

If you need to use a different STT provider (Deepgram, AssemblyAI, a self-hosted model), you can implement the STTModel protocol:

from agents.voice import STTModel, STTModelSettings
from dataclasses import dataclass

@dataclass
class DeepgramSTTModel:
    api_key: str
    model: str = "nova-2"

    async def transcribe(
        self,
        audio_input,
        settings: STTModelSettings | None = None,
        trace_include_sensitive_data: bool = False,
        trace_include_sensitive_audio_data: bool = False,
    ) -> str:
        """Transcribe audio using Deepgram's API."""
        import httpx

        audio_bytes = audio_input.buffer.tobytes()

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.deepgram.com/v1/listen",
                headers={
                    "Authorization": f"Token {self.api_key}",
                    "Content-Type": "audio/raw;encoding=linear16;sample_rate=24000;channels=1",
                },
                params={"model": self.model},
                content=audio_bytes,
            )
            response.raise_for_status()
            result = response.json()
            return result["results"]["channels"][0]["alternatives"][0]["transcript"]

Then pass it to the pipeline:


stt = DeepgramSTTModel(api_key="your-deepgram-key")
pipeline = VoicePipeline(workflow=workflow, stt_model=stt)

TTS Configuration: Turning Text into Speech

The TTS side controls how your agent sounds. OpenAI offers multiple voices, and the Agents SDK lets you configure the model, voice, and streaming behavior:

from agents.voice import OpenAITTSModel

tts_model = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

pipeline = VoicePipeline(
    workflow=workflow,
    tts_model=tts_model,
)

Model Selection

OpenAI provides two TTS models:

  • tts-1: Optimized for low latency. Slightly lower audio quality but faster generation. Best for real-time voice agents where responsiveness matters.
  • tts-1-hd: Higher audio quality with more natural intonation. Slower generation. Best for pre-recorded content or applications where latency is less critical.

For voice agents, tts-1 is almost always the right choice. The quality difference is subtle, but the latency difference is noticeable in conversation.

Voice Selection

Each model supports multiple voices with distinct characteristics:

Voice     Character
alloy     Neutral, balanced — good default
echo      Warm, conversational
fable     Expressive, storytelling quality
nova      Friendly, upbeat — popular for assistants
onyx      Deep, authoritative
shimmer   Clear, professional

Choose a voice that matches your application's personality. A medical triage agent might use onyx for its authoritative tone. A casual customer service bot might use nova for its friendly energy.

# Professional customer service
tts_professional = OpenAITTSModel(model="tts-1", voice="shimmer")

# Friendly assistant
tts_friendly = OpenAITTSModel(model="tts-1", voice="nova")

# Authoritative medical advisor
tts_medical = OpenAITTSModel(model="tts-1", voice="onyx")

Streaming TTS

The VoicePipeline streams TTS by default. As the agent generates text, the pipeline sends completed sentences or phrases to the TTS model and starts receiving audio before the full response is generated. This significantly reduces perceived latency.

The streaming flow looks like this:

Agent generates: "The weather in Tokyo is"  --> [buffered]
Agent generates: "25 degrees and sunny."    --> [sent to TTS]
                                              --> [audio chunk 1 received]
Agent generates: "Perfect for a walk"       --> [sent to TTS]
                                              --> [audio chunk 2 received]
Agent generates: "in the park."             --> [sent to TTS]
                                              --> [audio chunk 3 received]

The pipeline buffers text until it hits a sentence boundary (period, exclamation mark, question mark) and then sends that sentence to TTS. This means the user starts hearing audio after the first complete sentence, not after the entire response is generated.
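That buffering behavior can be sketched in plain Python. This is a simplified illustration of the idea, not the pipeline's actual implementation; `sentence_chunks` is a hypothetical helper:

```python
import re

def sentence_chunks(text_stream):
    """Buffer streamed text fragments and yield complete sentences.

    Simplified sketch: accumulate fragments until a sentence
    terminator (., !, ?) appears, then emit the buffered sentence.
    """
    buffer = ""
    for piece in text_stream:
        buffer += piece
        while True:
            # A terminator followed by whitespace (or end of buffer)
            match = re.search(r"[.!?](\s+|$)", buffer)
            if not match:
                break
            sentence = buffer[:match.end()].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Feeding it the fragments from the diagram above yields two sentences, each emitted as soon as its terminator arrives rather than at the end of the stream.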

Custom TTS Models

Like STT, you can plug in alternative TTS providers by implementing the TTSModel protocol:

from dataclasses import dataclass

from agents.voice import TTSModel, TTSModelSettings

@dataclass
class ElevenLabsTTSModel:
    api_key: str
    voice_id: str = "21m00Tcm4TlvDq8ikWAM"  # Rachel

    async def run(
        self,
        text: str,
        settings: TTSModelSettings | None = None,
    ):
        """Stream audio from ElevenLabs API."""
        import httpx

        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream",
                headers={"xi-api-key": self.api_key},
                json={
                    "text": text,
                    "model_id": "eleven_monolingual_v1",
                    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
                },
            ) as response:
                async for chunk in response.aiter_bytes(chunk_size=4096):
                    yield chunk
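Wiring in the custom TTS model mirrors the STT case, assuming the `VoicePipeline` constructor shown earlier:

```python
tts = ElevenLabsTTSModel(api_key="your-elevenlabs-key")
pipeline = VoicePipeline(workflow=workflow, tts_model=tts)
```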

Combining STT and TTS Configuration

Here is a complete pipeline with both STT and TTS customized for a customer service voice agent:

from agents import Agent, function_tool
from agents.voice import (
    VoicePipeline,
    SingleAgentVoiceWorkflow,
    OpenAISTTModel,
    OpenAITTSModel,
)

@function_tool
def lookup_order(order_id: str) -> str:
    """Look up order status by ID."""
    return f"Order {order_id} shipped on March 12, expected delivery March 16."

agent = Agent(
    name="CustomerService",
    instructions="""You are a customer service voice agent for an online store.
    Keep responses under 3 sentences. Use a warm, helpful tone.
    Always confirm the order ID before looking it up.""",
    tools=[lookup_order],
)

stt = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="order number, tracking, refund, exchange, shipping",
)

tts = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt,
    tts_model=tts,
)

Measuring and Optimizing Latency

STT and TTS are the two largest contributors to pipeline latency outside of the LLM itself. Here are practical optimizations:

For STT:

  • Set the language explicitly rather than relying on auto-detection (saves 50-100ms)
  • Trim silence from the beginning and end of audio before transcription
  • Use shorter audio chunks — 5 seconds of audio transcribes faster than 30 seconds
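Edge-silence trimming is easy to do yourself before handing audio to the pipeline. Here is a minimal amplitude-threshold sketch; the `threshold` value is an assumption you would tune for your microphone and noise floor, not an SDK parameter:

```python
def trim_silence(samples, threshold=500):
    """Trim leading and trailing silence from a sequence of PCM samples.

    Any sample whose absolute amplitude stays below `threshold` at
    either edge is treated as silence and dropped.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # nothing but silence
    return samples[loud[0] : loud[-1] + 1]
```

The same idea applies to a NumPy buffer of int16 samples; only the index scan changes.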

For TTS:

  • Use tts-1 instead of tts-1-hd for real-time applications
  • Keep agent responses short — TTS generation time scales linearly with text length
  • Take advantage of streaming — the pipeline sends sentences to TTS as they complete

Measuring latency:

import time

async def timed_pipeline_run(audio):
    t0 = time.perf_counter()
    result = await pipeline.run(audio)

    first_audio_time = None
    chunks = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            if first_audio_time is None:
                first_audio_time = time.perf_counter()
            chunks.append(event.data)

    total_time = time.perf_counter() - t0
    time_to_first_audio = first_audio_time - t0 if first_audio_time else None

    if time_to_first_audio is not None:
        print(f"Time to first audio: {time_to_first_audio:.3f}s")
    else:
        print("No audio events received")
    print(f"Total pipeline time: {total_time:.3f}s")
    return chunks

Time to first audio is the metric that matters most for perceived responsiveness. Total pipeline time matters for overall throughput.

Written by CallSphere Team