Speech-to-Text and Text-to-Speech for Voice Agent Pipelines
Configure STT and TTS models for OpenAI voice agent pipelines — Whisper integration, language and prompt settings, voice selection, streaming TTS, and custom model implementations.
The Two Sides of Voice
A voice agent pipeline has two audio boundaries: the point where human speech enters the system (STT) and the point where machine-generated speech exits (TTS). How you configure these boundaries determines the voice agent's accuracy, latency, naturalness, and overall user experience.
The OpenAI Agents SDK provides default STT and TTS models that work out of the box, but production voice agents almost always need customization. You may need to support specific languages, reduce transcription errors for domain-specific vocabulary, choose a voice that matches your brand, or optimize for streaming latency.
This post covers the full configuration surface for both STT and TTS in the VoicePipeline.
STT Configuration: Turning Speech into Text
The default STT model in the Agents SDK uses OpenAI's Whisper. You can customize it by creating an OpenAISTTModel instance and passing it to the pipeline:
```python
from agents import Agent
from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow, OpenAISTTModel

agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
)

stt_model = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="CallSphere, VoicePipeline, WebRTC, agentic AI",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt_model,
)
```
The Model Parameter
whisper-1 is currently the primary model for the OpenAI transcription API. It handles a wide range of languages and accents with strong accuracy. For most applications, the default is sufficient.
Language Hints
Setting language="en" tells Whisper to expect English audio. This is not a hard filter — Whisper will still transcribe other languages if it detects them — but it biases the model toward English, which reduces errors when the audio is ambiguous or noisy.
For multilingual voice agents, you can omit the language parameter and let Whisper auto-detect. Auto-detection works well for clear audio but can misidentify languages in noisy environments or with code-switching speakers.
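The difference is a single parameter. A sketch, assuming the same OpenAISTTModel constructor shown above:

```python
from agents.voice import OpenAISTTModel

# Multilingual agent: omit the language hint and let Whisper auto-detect
# the spoken language on each utterance.
stt_multilingual = OpenAISTTModel(model="whisper-1")

# English-only agent: the explicit hint biases transcription toward English.
stt_english = OpenAISTTModel(model="whisper-1", language="en")
```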
Prompt-Based Vocabulary Hints
The prompt parameter is one of the most powerful STT tuning tools available. Whisper uses it as a conditioning prefix that biases the transcription toward specific vocabulary, spelling conventions, and formatting patterns.
```python
# Medical domain — guide Whisper toward medical terminology
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="metformin, lisinopril, hydrochlorothiazide, A1C, systolic, diastolic",
)

# Customer service — guide toward product names and common queries
stt_model = OpenAISTTModel(
    model="whisper-1",
    prompt="CallSphere, Pro Plan, Enterprise Plan, API credits, webhook, dashboard",
)
```
Without the prompt, Whisper might transcribe "CallSphere" as "call sphere" or "coal sphere." With the prompt, the model knows the correct spelling and capitalization. This is especially important for proper nouns, brand names, and technical jargon.
Custom STT Models
If you need to use a different STT provider (Deepgram, AssemblyAI, a self-hosted model), you can implement the STTModel protocol:
```python
from dataclasses import dataclass

import httpx

from agents.voice import STTModel, STTModelSettings


@dataclass
class DeepgramSTTModel:
    api_key: str
    model: str = "nova-2"

    async def transcribe(
        self,
        audio_input,
        settings: STTModelSettings | None = None,
        trace_include: dict | None = None,
        trace_exclude: dict | None = None,
    ) -> str:
        """Transcribe audio using Deepgram's API."""
        audio_bytes = audio_input.buffer.tobytes()
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.deepgram.com/v1/listen",
                headers={
                    "Authorization": f"Token {self.api_key}",
                    "Content-Type": "audio/raw;encoding=linear16;sample_rate=24000;channels=1",
                },
                params={"model": self.model},
                content=audio_bytes,
            )
        result = response.json()
        return result["results"]["channels"][0]["alternatives"][0]["transcript"]
```
Then pass it to the pipeline:
```python
stt = DeepgramSTTModel(api_key="your-deepgram-key")
pipeline = VoicePipeline(workflow=workflow, stt_model=stt)
```
TTS Configuration: Turning Text into Speech
The TTS side controls how your agent sounds. OpenAI offers multiple voices, and the Agents SDK lets you configure the model, voice, and streaming behavior:
```python
from agents.voice import OpenAITTSModel

tts_model = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

pipeline = VoicePipeline(
    workflow=workflow,
    tts_model=tts_model,
)
```
Model Selection
OpenAI provides two TTS models:
- tts-1: Optimized for low latency. Slightly lower audio quality but faster generation. Best for real-time voice agents where responsiveness matters.
- tts-1-hd: Higher audio quality with more natural intonation. Slower generation. Best for pre-recorded content or applications where latency is less critical.
For voice agents, tts-1 is almost always the right choice. The quality difference is subtle, but the latency difference is noticeable in conversation.
Voice Selection
Each model supports multiple voices with distinct characteristics:
| Voice | Character |
|---|---|
| alloy | Neutral, balanced — good default |
| echo | Warm, conversational |
| fable | Expressive, storytelling quality |
| nova | Friendly, upbeat — popular for assistants |
| onyx | Deep, authoritative |
| shimmer | Clear, professional |
Choose a voice that matches your application's personality. A medical triage agent might use onyx for its authoritative tone. A casual customer service bot might use nova for its friendly energy.
```python
# Professional customer service
tts_professional = OpenAITTSModel(model="tts-1", voice="shimmer")

# Friendly assistant
tts_friendly = OpenAITTSModel(model="tts-1", voice="nova")

# Authoritative medical advisor
tts_medical = OpenAITTSModel(model="tts-1", voice="onyx")
```
Streaming TTS
The VoicePipeline streams TTS by default. As the agent generates text, the pipeline sends completed sentences or phrases to the TTS model and starts receiving audio before the full response is generated. This significantly reduces perceived latency.
The streaming flow looks like this:
```text
Agent generates: "The weather in Tokyo is"  --> [buffered]
Agent generates: "25 degrees and sunny."    --> [sent to TTS]
                                            --> [audio chunk 1 received]
Agent generates: "Perfect for a walk"       --> [sent to TTS]
                                            --> [audio chunk 2 received]
Agent generates: "in the park."             --> [sent to TTS]
                                            --> [audio chunk 3 received]
```
The pipeline buffers text until it hits a sentence boundary (period, exclamation mark, question mark) and then sends that sentence to TTS. This means the user starts hearing audio after the first complete sentence, not after the entire response is generated.
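This sentence-boundary buffering can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the SDK's actual implementation, and it flushes only at full sentence boundaries:

```python
import re


def buffer_sentences(text_deltas):
    """Accumulate streamed text deltas and yield complete sentences.

    A sentence boundary is a ., !, or ? followed by whitespace (or the end
    of the buffer), so decimal points like "3.5" do not trigger a flush.
    """
    buffer = ""
    for delta in text_deltas:
        buffer += delta
        # Flush every complete sentence currently sitting in the buffer.
        while True:
            match = re.search(r"[.!?]\s+|[.!?]$", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():  # flush any trailing fragment at end of stream
        yield buffer.strip()


deltas = ["The weather in Tokyo is ", "25 degrees and sunny. ",
          "Perfect for a walk ", "in the park."]
sentences = list(buffer_sentences(deltas))
# sentences == ["The weather in Tokyo is 25 degrees and sunny.",
#               "Perfect for a walk in the park."]
```

With this strategy, the first TTS request goes out as soon as "sunny." arrives, while the model is still generating the rest of the response.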
Custom TTS Models
Like STT, you can plug in alternative TTS providers by implementing the TTSModel protocol:
```python
from dataclasses import dataclass

import httpx

from agents.voice import TTSModel, TTSModelSettings


@dataclass
class ElevenLabsTTSModel:
    api_key: str
    voice_id: str = "21m00Tcm4TlvDq8ikWAM"  # Rachel

    async def run(
        self,
        text: str,
        settings: TTSModelSettings | None = None,
    ):
        """Stream audio from the ElevenLabs API."""
        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                f"https://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream",
                headers={"xi-api-key": self.api_key},
                json={
                    "text": text,
                    "model_id": "eleven_monolingual_v1",
                    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
                },
            ) as response:
                async for chunk in response.aiter_bytes(chunk_size=4096):
                    yield chunk
```
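As with the custom STT model, you then pass an instance to the pipeline. A sketch, assuming the workflow object from the earlier examples:

```python
tts = ElevenLabsTTSModel(api_key="your-elevenlabs-key")
pipeline = VoicePipeline(workflow=workflow, tts_model=tts)
```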
Combining STT and TTS Configuration
Here is a complete pipeline with both STT and TTS customized for a customer service voice agent:
```python
from agents import Agent, function_tool
from agents.voice import (
    VoicePipeline,
    SingleAgentVoiceWorkflow,
    OpenAISTTModel,
    OpenAITTSModel,
)


@function_tool
def lookup_order(order_id: str) -> str:
    """Look up order status by ID."""
    return f"Order {order_id} shipped on March 12, expected delivery March 16."


agent = Agent(
    name="CustomerService",
    instructions="""You are a customer service voice agent for an online store.
    Keep responses under 3 sentences. Use a warm, helpful tone.
    Always confirm the order ID before looking it up.""",
    tools=[lookup_order],
)

stt = OpenAISTTModel(
    model="whisper-1",
    language="en",
    prompt="order number, tracking, refund, exchange, shipping",
)

tts = OpenAITTSModel(
    model="tts-1",
    voice="nova",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(
    workflow=workflow,
    stt_model=stt,
    tts_model=tts,
)
```
Measuring and Optimizing Latency
STT and TTS are the two largest contributors to pipeline latency outside of the LLM itself. Here are practical optimizations:
For STT:
- Set the language explicitly rather than relying on auto-detection (saves 50-100ms)
- Trim silence from the beginning and end of audio before transcription
- Use shorter audio chunks — 5 seconds of audio transcribes faster than 30 seconds
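The silence-trimming suggestion can be sketched as a simple amplitude gate over raw PCM samples. This is an illustration, not production voice activity detection, and the threshold value is an assumption you would tune for your microphone and noise floor:

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose amplitude is below threshold.

    `samples` is a sequence of signed 16-bit PCM values. Anything quieter
    than `threshold` at the edges is treated as silence.
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


# Near-silence padding around a short burst of "speech"
audio = [0] * 12000 + [8000, -7500, 9000] + [3] * 12000
trimmed = trim_silence(audio)
# trimmed == [8000, -7500, 9000]
```

Trimming a second of silence from each end removes a second of audio Whisper no longer has to process, which directly shortens transcription time.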
For TTS:
- Use tts-1 instead of tts-1-hd for real-time applications
- Keep agent responses short — TTS generation time scales linearly with text length
- Take advantage of streaming — the pipeline sends sentences to TTS as they complete
Measuring latency:
```python
import time


async def timed_pipeline_run(audio):
    t0 = time.perf_counter()
    result = await pipeline.run(audio)

    first_audio_time = None
    chunks = []
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            if first_audio_time is None:
                first_audio_time = time.perf_counter()
            chunks.append(event.data)

    total_time = time.perf_counter() - t0
    # Guard against runs that produced no audio events
    if first_audio_time is not None:
        print(f"Time to first audio: {first_audio_time - t0:.3f}s")
    print(f"Total pipeline time: {total_time:.3f}s")
    return chunks
```
Time to first audio is the metric that matters most for perceived responsiveness. Total pipeline time matters for overall throughput.
Written by
CallSphere Team