Build a Voice Agent with Soniox v4 Real-Time STT (60+ Languages, 2026)
Soniox v4 ships speaker-native accuracy across 60+ languages with sub-200ms latency. Build a multilingual voice agent — Python WebSocket, code-switching, pitfalls.
TL;DR: Soniox `stt-rt-v4` (GA Feb 2026) sets the new bar for multilingual real-time STT, with native accuracy across 60+ languages, mid-sentence code-switching, real-time translation, and full backward compatibility with v3 (just change the model name). Best fit when your callers don't all speak English.
What you'll build
A Python voice agent that auto-detects language per turn, transcribes English/Spanish/Hindi mid-sentence ("I'm gonna recoger los niños"), and feeds GPT-4o while showing live English translation alongside the original.
Architecture
flowchart LR
MIC[Caller mic] -- PCM 16k --> WS[Soniox WS API]
WS -- final tokens --> SV[Server]
WS -- translation tokens --> SV
SV --> LLM[GPT-4o]
LLM --> TTS[ElevenLabs Multilingual]
TTS --> MIC
Step 1 — Install
```bash
pip install soniox websockets sounddevice openai
export SONIOX_API_KEY=...
export OPENAI_API_KEY=...   # read by AsyncOpenAI() in Step 4
```
Step 2 — Open the v4 WebSocket
```python
import asyncio, json, os, websockets

URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def stream(audio_iter):
    async with websockets.connect(URL) as ws:
        # The first message is the JSON config; audio follows as binary frames.
        await ws.send(json.dumps({
            "api_key": os.environ["SONIOX_API_KEY"],
            "model": "stt-rt-v4",
            "audio_format": "pcm_s16le",
            "sample_rate": 16000,
            "num_channels": 1,
            "language_hints": ["en", "es", "hi"],
            "enable_language_identification": True,
            "translation": {"type": "one_way", "target_language": "en"},
            "enable_endpoint_detection": True,
        }))

        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
            await ws.send(b"")  # EOF

        async def recv():
            async for msg in ws:
                yield json.loads(msg)

        # Keep a reference so the sender task is not garbage-collected mid-stream.
        sender = asyncio.create_task(send_audio())
        try:
            async for evt in recv():
                yield evt
        finally:
            sender.cancel()
```
Step 3 — Mic capture
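```python
import asyncio, queue
import sounddevice as sd

q: queue.Queue[bytes] = queue.Queue()

def cb(indata, *_):
    q.put(bytes(indata))  # runs on the audio thread

# Keep a reference to the stream so it is not closed by garbage collection.
mic_stream = sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                               callback=cb, blocksize=1600)
mic_stream.start()

async def mic():
    while True:
        # q.get() blocks, so run it in a worker thread to keep the event loop free.
        yield await asyncio.to_thread(q.get)
```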
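The `blocksize=1600` above determines how much audio each WebSocket frame carries. As a quick sanity check, the arithmetic can be written out. The function names here are illustrative, not part of `sounddevice`.

```python
def chunk_duration_ms(frames: int, sample_rate: int) -> float:
    """Duration of one audio block in milliseconds."""
    return 1000.0 * frames / sample_rate

def chunk_size_bytes(frames: int, channels: int = 1, sample_width: int = 2) -> int:
    """Size of one block in bytes; sample_width=2 corresponds to 16-bit PCM."""
    return frames * channels * sample_width

print(chunk_duration_ms(1600, 16000))  # 100.0
print(chunk_size_bytes(1600))          # 3200
```

So each callback delivers 100 ms of audio in a 3200-byte frame. A smaller block lowers capture latency at the cost of more messages per second.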
Step 4 — Drive the LLM on endpoints
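```python
from openai import AsyncOpenAI

oa = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system",
            "content": "You speak whichever language the caller used last."}]

async def main():
    final_text = ""
    async for evt in stream(mic()):
        if evt.get("error"):
            print("err", evt)
            break
        for tok in evt.get("tokens", []):
            # Skip partial tokens and translated tokens (field name assumed from the translation docs).
            if not tok.get("is_final") or tok.get("translation_status") == "translation":
                continue
            if tok.get("text") != "<end>":
                final_text += tok["text"]
                continue
            # The source listing is cut off at this comparison; the rest of the
            # turn handling below is an assumed completion, not the original code.
            history.append({"role": "user", "content": final_text.strip()})
            resp = await oa.chat.completions.create(model="gpt-4o", messages=history)
            history.append({"role": "assistant", "content": resp.choices[0].message.content})
            print("agent:", history[-1]["content"])
            final_text = ""
```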
Step 5 — Code-switching diagnostics
Each token in v4 carries a language field. Log tok["language"] to verify mid-sentence switches:
```python
print(f"{tok['text']:>10} lang={tok.get('language')} fin={tok.get('is_final')}")
```
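Beyond eyeballing the log, you can count the switches directly. The following is a minimal sketch that assumes each token is a dict with `text` and `language` keys as described above. The function name `language_switches` and the sample tokens are illustrative.

```python
def language_switches(tokens):
    """Return (from_lang, to_lang, token_text) for each change of language."""
    switches = []
    prev = None
    for tok in tokens:
        lang = tok.get("language")
        if not lang:
            continue  # skip tokens that carry no language tag
        if prev is not None and lang != prev:
            switches.append((prev, lang, tok.get("text", "")))
        prev = lang
    return switches

# Hand-written sample shaped like the code-switching sentence from the intro.
sample = [
    {"text": "I'm", "language": "en"},
    {"text": " gonna", "language": "en"},
    {"text": " recoger", "language": "es"},
    {"text": " los", "language": "es"},
    {"text": " niños", "language": "es"},
]
print(language_switches(sample))
```

For the sample this reports a single English-to-Spanish switch at " recoger". A sentence with zero reported switches that you know was mixed is a hint to revisit `language_hints`.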
Step 6 — Translation channel
When translation.type == "one_way", v4 emits separate translated tokens with source_language + text. Render both columns to the agent UI for live caption + English gloss.
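One way to build the two columns is to partition tokens by whether they are translations. This sketch assumes translated tokens are marked with `translation_status == "translation"`. That field name is taken from my reading of the Soniox translation docs and should be verified against the current API reference. `split_columns` and the sample tokens are hypothetical.

```python
def split_columns(tokens):
    """Return (original_text, translated_text) from a list of token dicts."""
    original, translated = [], []
    for tok in tokens:
        text = tok.get("text", "")
        if text == "<end>":
            continue  # endpoint marker, not spoken content
        # Assumption: translated tokens carry translation_status == "translation".
        if tok.get("translation_status") == "translation":
            translated.append(text)
        else:
            original.append(text)
    return "".join(original).strip(), "".join(translated).strip()

sample = [
    {"text": "Voy", "language": "es", "translation_status": "original"},
    {"text": " a casa", "language": "es", "translation_status": "original"},
    {"text": "I'm going", "language": "en", "source_language": "es",
     "translation_status": "translation"},
    {"text": " home", "language": "en", "source_language": "es",
     "translation_status": "translation"},
    {"text": "<end>"},
]
print(split_columns(sample))
```

The first element feeds the live caption and the second the English gloss.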
Step 7 — Migrate from v3
```diff
- "model": "stt-rt-v3"
+ "model": "stt-rt-v4"
```
That's it: v4 is fully backward-compatible. After Feb 28, 2026, v3 requests are automatically routed to v4.
Pitfalls
- Endpoint detection: the `<end>` token is the real turn boundary; don't VAD on silence alone or you'll cut Hindi vowels.
- Language hints: Always pass the top-3 expected languages even with auto-ID; it improves accuracy 5-15%.
- Punctuation: v4 punctuates aggressively; for downstream NLP that hates commas, set `enable_punctuation: false`.
- Audio format: `pcm_s16le` 16kHz mono; Opus and 8kHz get re-encoded server-side and add ~30ms.
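The endpoint-detection pitfall comes down to closing a turn only when the `<end>` token arrives. A small pure function makes that rule easy to unit-test without a live socket. This is a sketch under the token shape used in Step 4 (`text`, `is_final`). `split_turns` and `END_TOKEN` are illustrative names.

```python
END_TOKEN = "<end>"

def split_turns(tokens):
    """Group final token texts into turns, closing a turn at each <end> token.

    Returns (completed_turns, pending_text), where pending_text is final text
    that has not yet been closed by an <end> token.
    """
    turns, buf = [], []
    for tok in tokens:
        if not tok.get("is_final"):
            continue  # non-final tokens may still be revised by the server
        if tok.get("text") == END_TOKEN:
            text = "".join(buf).strip()
            if text:
                turns.append(text)
            buf = []
        else:
            buf.append(tok.get("text", ""))
    return turns, "".join(buf)

sample = [
    {"text": "Hola", "is_final": True},
    {"text": " amigo", "is_final": True},
    {"text": "<end>", "is_final": True},
    {"text": " que", "is_final": False},
    {"text": " bye", "is_final": True},
]
print(split_turns(sample))
```

Only completed turns should be sent to the LLM. The pending text stays buffered until its own `<end>` arrives.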
How CallSphere does this
CallSphere routes the OneRoof real-estate vertical (US/Mex/India lines) through Soniox v4 to handle Spanglish + Hinglish callers, feeding 37 agents · 90+ tools · 115+ DB tables · 6 verticals. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Pricing? $0.0033/min for stt-rt-v4 — among the cheapest premium STT in 2026.
Diarization? Yes via enable_speaker_diarization: true + num_speakers.
SDKs? Web, Node, Python, Go, .NET — all wrap the same WS contract.
EU residency? Pin stt-rt-eu.soniox.com for EU-hosted endpoints.
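For budgeting, the per-minute price quoted in the Pricing answer turns into a monthly estimate with simple arithmetic. The call volume and average duration below are made-up inputs, and the rate should be checked against current Soniox pricing before relying on it.

```python
PRICE_PER_MIN_USD = 0.0033  # figure quoted in the FAQ; verify before budgeting

def monthly_stt_cost(calls_per_day: int, avg_minutes: float, days: int = 30) -> float:
    """Estimated monthly STT spend in USD, rounded to cents."""
    total_minutes = calls_per_day * avg_minutes * days
    return round(total_minutes * PRICE_PER_MIN_USD, 2)

# Hypothetical workload: 500 calls a day averaging 4 minutes each.
print(monthly_stt_cost(500, 4))  # 198.0
```

This covers transcription only. LLM and TTS usage are billed separately by their providers.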
Sources
- Soniox Blog - v4 Real-Time: New Standard - https://soniox.com/blog/2026-02-05-soniox-v4-real-time
- Soniox Docs - Real-time Transcription - https://soniox.com/docs/stt/rt/real-time-transcription
- Soniox Docs - WebSocket API - https://soniox.com/docs/stt/api-reference/websocket-api
- Soniox Docs - Real-time Translation - https://soniox.com/docs/stt/rt/real-time-translation