By Sagar Shankaran, Founder of CallSphere
Soniox v4 ships speaker-native accuracy across 60+ languages with sub-200ms latency. Build a multilingual voice agent — Python WebSocket, code-switching, pitfalls.
Key takeaways
TL;DR: Soniox stt-rt-v4 (GA Feb 2026) is the new bar for multilingual real-time STT: native accuracy across 60+ languages, mid-sentence code-switching, real-time translation, and full backward compatibility with v3 (just change the model name). It fits best when your callers don't all speak English.
This guide builds a Python voice agent that auto-detects language per turn, transcribes English/Spanish/Hindi mid-sentence ("I'm gonna recoger los niños"), and feeds GPT-4o while showing a live English translation alongside the original.
```mermaid
flowchart LR
    MIC[Caller mic] -- PCM 16k --> WS[Soniox WS API]
    WS -- final tokens --> SV[Server]
    WS -- translation tokens --> SV
    SV --> LLM[GPT-4o]
    LLM --> TTS[ElevenLabs Multilingual]
    TTS --> MIC
```
```bash
pip install soniox websockets sounddevice openai
export SONIOX_API_KEY=...
```
```python
import asyncio, json, os, websockets

URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def stream(audio_iter):
    async with websockets.connect(URL) as ws:
        # The first message is the JSON session config.
        await ws.send(json.dumps({
            "api_key": os.environ["SONIOX_API_KEY"],
            "model": "stt-rt-v4",
            "audio_format": "pcm_s16le",
            "sample_rate": 16000,
            "num_channels": 1,
            "language_hints": ["en", "es", "hi"],
            "enable_language_identification": True,
            "translation": {"type": "one_way", "target_language": "en"},
            "enable_endpoint_detection": True,
        }))

        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
            await ws.send(b"")  # EOF

        # Keep a reference so the sender task is not garbage-collected mid-call.
        sender = asyncio.create_task(send_audio())
        try:
            # Each server message is a JSON event carrying a list of tokens.
            async for msg in ws:
                yield json.loads(msg)
        finally:
            sender.cancel()
```
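```python
import asyncio, queue
import sounddevice as sd

q: queue.Queue[bytes] = queue.Queue()

def cb(indata, *_):
    # Runs on the audio thread; copy the buffer before queueing it.
    q.put(bytes(indata))

# 1600 frames at 16 kHz is 100 ms per chunk.
# Keep a reference so the stream is not closed by garbage collection.
mic_stream = sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                               callback=cb, blocksize=1600)
mic_stream.start()

async def mic():
    while True:
        # q.get() blocks, so run it in a worker thread instead of the event loop.
        yield await asyncio.to_thread(q.get)
```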
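For testing without a microphone, a stand-in for `mic()` can slice a raw PCM buffer into frames of the same size. The sketch below is a hypothetical helper, not part of any Soniox SDK; `pcm_chunks`, `collect`, and the constants are names introduced here, and the buffer is assumed to already be 16 kHz, 16-bit mono.

```python
import asyncio

SAMPLE_RATE = 16000    # Hz, matches the session config
BYTES_PER_SAMPLE = 2   # pcm_s16le is 16-bit
CHUNK_MS = 100         # same cadence as blocksize=1600 in the mic capture

async def pcm_chunks(pcm: bytes, realtime: bool = False):
    """Yield fixed-size frames from a raw PCM buffer (hypothetical helper)."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200 bytes
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
        if realtime:
            # Pace the frames roughly the way a live microphone would.
            await asyncio.sleep(CHUNK_MS / 1000)

async def collect(pcm: bytes):
    return [chunk async for chunk in pcm_chunks(pcm)]

# 8000 bytes of silence is 250 ms of audio: two full frames plus a remainder.
chunks = asyncio.run(collect(bytes(8000)))
print([len(c) for c in chunks])
```

Passing `pcm_chunks(data, realtime=True)` where the agent would otherwise use `mic()` should replay a recording at roughly live speed.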
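```python
import asyncio
from openai import AsyncOpenAI

oa = AsyncOpenAI()
history = [{"role": "system", "content": "You speak whichever language the caller used last."}]

async def main():
    final_text = ""
    async for evt in stream(mic()):
        if evt.get("error"):
            print("err", evt)
            break
        for tok in evt.get("tokens", []):
            # Skip interim tokens and translated tokens (those carry source_language).
            if not tok.get("is_final") or tok.get("source_language"):
                continue
            if tok.get("text") == "<end>":
                # Turn boundary. The rest of this block was truncated in the source;
                # everything below is an assumed minimal completion.
                history.append({"role": "user", "content": final_text.strip()})
                resp = await oa.chat.completions.create(model="gpt-4o", messages=history)
                reply = resp.choices[0].message.content
                history.append({"role": "assistant", "content": reply})
                print("agent:", reply)  # hand the reply to TTS here
                final_text = ""
            else:
                final_text += tok["text"]

if __name__ == "__main__":
    asyncio.run(main())
```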
Each token in v4 carries a language field. Log tok["language"] to verify mid-sentence switches:
```python
print(f"{tok['text']:>10} lang={tok.get('language')} fin={tok.get('is_final')}")
```
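To see where a caller switches language inside one utterance, consecutive final tokens can be grouped by their `language` tag. This is a minimal sketch: `language_runs` is a hypothetical helper, and the sample tokens are hand-written for illustration rather than captured API output.

```python
from itertools import groupby

def language_runs(tokens):
    """Group consecutive final tokens by language tag, in order of arrival."""
    finals = [t for t in tokens if t.get("is_final")]
    return [
        (lang, "".join(t["text"] for t in group))
        for lang, group in groupby(finals, key=lambda t: t.get("language"))
    ]

# Hand-written sample tokens (illustrative only).
sample = [
    {"text": "I'm ", "language": "en", "is_final": True},
    {"text": "gonna ", "language": "en", "is_final": True},
    {"text": "recoger ", "language": "es", "is_final": True},
    {"text": "los ", "language": "es", "is_final": True},
    {"text": "niños", "language": "es", "is_final": True},
    {"text": "ni", "language": "es", "is_final": False},  # interim, ignored
]
runs = language_runs(sample)
print(runs)
```

More than one run in a single turn means the caller code-switched, which is a quick check that language identification is doing its job.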
When translation.type == "one_way", v4 emits separate translated tokens with source_language + text. Render both columns to the agent UI for live caption + English gloss.
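One way to build the two columns is to route each final token by whether it carries `source_language`. The sketch below assumes, as described above, that only translated tokens have that field; `split_columns` is a hypothetical helper and the sample tokens are hand-written.

```python
def split_columns(tokens):
    """Return (original_text, translated_text) built from final tokens."""
    original, translated = [], []
    for tok in tokens:
        if not tok.get("is_final"):
            continue
        # Assumption: translated tokens are the ones carrying source_language.
        if tok.get("source_language"):
            translated.append(tok["text"])
        else:
            original.append(tok["text"])
    return "".join(original), "".join(translated)

# Hand-written sample tokens (illustrative only).
sample = [
    {"text": "Voy a ", "language": "es", "is_final": True},
    {"text": "recoger los niños", "language": "es", "is_final": True},
    {"text": "I'm going to ", "language": "en", "source_language": "es", "is_final": True},
    {"text": "pick up the kids", "language": "en", "source_language": "es", "is_final": True},
]
caption, gloss = split_columns(sample)
print(caption, "|", gloss)
```

Keeping the two streams separate also helps when only the original text should reach the LLM, so the model replies in the caller's language.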
Migrating from v3 means changing only the model name to stt-rt-v4; v4 is fully backward-compatible. After Feb 28, 2026, v3 requests are automatically routed to v4.
The <end> token is the real turn boundary: don't VAD on silence alone or you'll cut Hindi vowels.
enable_punctuation: false.
pcm_s16le 16kHz mono: Opus and 8kHz get re-encoded server-side and add ~30ms.
CallSphere routes the OneRoof real-estate vertical (US/Mex/India lines) through Soniox v4 to handle Spanglish + Hinglish callers, feeding 37 agents · 90+ tools · 115+ DB tables · 6 verticals. $149/$499/$1,499 · 14-day trial · 22% affiliate.
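On the audio-format pitfall: if samples arrive as floats from another source, they can be packed into pcm_s16le before sending. This is a minimal sketch using only the standard library; `to_pcm_s16le` is a hypothetical helper, it assumes samples are already mono at 16 kHz, and it does no resampling.

```python
import struct

def to_pcm_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    # Clamp out-of-range values first so they do not overflow int16.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

frame = to_pcm_s16le([0.0, 1.0, -1.0, 2.0])  # 2.0 is clamped to 1.0
print(len(frame))  # 2 bytes per sample
```

Each sample becomes two bytes, so a 100 ms frame at 16 kHz is 3200 bytes.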
Pricing? $0.0033/min for stt-rt-v4 — among the cheapest premium STT in 2026.
Diarization? Yes via enable_speaker_diarization: true + num_speakers.
SDKs? Web, Node, Python, Go, .NET — all wrap the same WS contract.
EU residency? Pin stt-rt-eu.soniox.com for EU-hosted endpoints.
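The pricing and diarization answers reduce to a little arithmetic and two config keys. The sketch below takes the rate and the key names (`enable_speaker_diarization`, `num_speakers`) from the answers above and has not been checked against the API reference; `estimate_cost` and `with_diarization` are hypothetical helpers.

```python
RATE_PER_MIN = 0.0033  # USD per minute, from the pricing answer above

def estimate_cost(minutes: float, rate: float = RATE_PER_MIN) -> float:
    """Estimated STT spend in USD, rounded to cents."""
    return round(minutes * rate, 2)

def with_diarization(config: dict, num_speakers: int) -> dict:
    """Return a copy of a session config with diarization switched on."""
    return {**config, "enable_speaker_diarization": True, "num_speakers": num_speakers}

base = {"model": "stt-rt-v4", "sample_rate": 16000}
monthly = estimate_cost(10_000)  # 10,000 call minutes
diarized = with_diarization(base, 2)
print(monthly, diarized)
```

Returning a copy keeps the base config reusable for calls that do not need speaker labels.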
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.