
Build a Voice Agent with Soniox v4 Real-Time STT (60+ Languages, 2026)

Soniox v4 ships native-level accuracy across 60+ languages with sub-200ms latency. Build a multilingual voice agent — Python WebSocket, code-switching, pitfalls.

TL;DR — Soniox stt-rt-v4 (GA Feb 2026) is the new bar for multilingual real-time STT: native accuracy across 60+ languages, code-switching mid-sentence, real-time translation, and full backward compat with v3 (just change the model name). Best fit when your callers don't all speak English.

What you'll build

A Python voice agent that auto-detects language per turn, transcribes English/Spanish/Hindi mid-sentence ("I'm gonna recoger los niños"), and feeds GPT-4o while showing live English translation alongside the original.

Architecture

```mermaid
flowchart LR
  MIC[Caller mic] -- PCM 16k --> WS[Soniox WS API]
  WS -- final tokens --> SV[Server]
  WS -- translation tokens --> SV
  SV --> LLM[GPT-4o]
  LLM --> TTS[ElevenLabs Multilingual]
  TTS --> MIC
```

Step 1 — Install

```bash
pip install soniox websockets sounddevice openai
export SONIOX_API_KEY=...
```

Step 2 — Open the v4 WebSocket

```python
import asyncio, json, os, websockets

URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def stream(audio_iter):
    async with websockets.connect(URL) as ws:
        # First message configures the session; binary audio frames follow.
        await ws.send(json.dumps({
            "api_key": os.environ["SONIOX_API_KEY"],
            "model": "stt-rt-v4",
            "audio_format": "pcm_s16le",
            "sample_rate": 16000,
            "num_channels": 1,
            "language_hints": ["en", "es", "hi"],
            "enable_language_identification": True,
            "translation": {"type": "one_way", "target_language": "en"},
            "enable_endpoint_detection": True,
        }))

        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
            await ws.send(b"")  # empty frame signals end of audio

        sender = asyncio.create_task(send_audio())
        try:
            async for msg in ws:
                yield json.loads(msg)
        finally:
            sender.cancel()
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Mic capture

```python
import asyncio, queue
import sounddevice as sd

q: queue.Queue[bytes] = queue.Queue()

def cb(indata, *_):
    q.put(bytes(indata))

# 16 kHz mono int16; blocksize=1600 frames = 100 ms chunks
sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                  callback=cb, blocksize=1600).start()

async def mic():
    while True:
        # q.get() blocks, so run it in a thread to keep the event loop alive
        yield await asyncio.to_thread(q.get)
```

Step 4 — Drive the LLM on endpoints

```python
from openai import AsyncOpenAI

oa = AsyncOpenAI()
history = [{"role": "system",
            "content": "You speak whichever language the caller used last."}]

async def main():
    final_text = ""
    async for evt in stream(mic()):
        if evt.get("error"):
            print("err", evt)
            break
        for tok in evt.get("tokens", []):
            if tok.get("text") == "<end>":  # endpoint token = turn boundary
                if final_text.strip():
                    history.append({"role": "user", "content": final_text})
                    r = await oa.chat.completions.create(
                        model="gpt-4o", messages=history)
                    reply = r.choices[0].message.content
                    history.append({"role": "assistant", "content": reply})
                    print("AGENT:", reply)
                final_text = ""
            elif tok.get("is_final"):
                final_text += tok["text"]
```
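To run the loop end to end (the Step 3 mic feeding the Step 2 WebSocket), a standard entry point — nothing Soniox-specific here:

```python
if __name__ == "__main__":
    asyncio.run(main())  # mic -> STT -> LLM, until interrupted
```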

Step 5 — Code-switching diagnostics

Each token in v4 carries a `language` field. Log `tok["language"]` to verify mid-sentence switches:

```python
print(f"{tok['text']:>10} lang={tok.get('language')} fin={tok.get('is_final')}")
```
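With the intro's Spanglish example, the log could look roughly like this (illustrative, not captured from a real session; actual token boundaries will differ):

```
 I'm gonna lang=en fin=True
   recoger lang=es fin=True
       los lang=es fin=True
     niños lang=es fin=True
```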

Step 6 — Translation channel

When `translation.type == "one_way"`, v4 emits separate translated tokens with `source_language` + `text`. Render both columns in the agent UI for live caption + English gloss.
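A minimal sketch of that two-column render. It assumes translation-channel tokens are the ones carrying `source_language` and that original tokens lack it; that discriminator is an assumption, so verify it against your actual payloads:

```python
original, gloss = [], []

def render(tok: dict) -> None:
    # Assumption: translated tokens carry "source_language" (per the
    # section above); original transcription tokens do not.
    if "source_language" in tok:
        gloss.append(tok["text"])
    else:
        original.append(tok["text"])
    # Left column: live caption in the caller's language.
    # Right column: running English gloss.
    print(f"{''.join(original):<50} | {''.join(gloss)}")
```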

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 7 — Migrate from v3

```diff
- "model": "stt-rt-v3"
+ "model": "stt-rt-v4"
```

That's it — v4 is fully backward-compatible. After Feb 28, 2026, v3 requests are automatically routed to v4.

Pitfalls

  • Endpoint detection: the `<end>` token is the real turn boundary — don't VAD on silence alone or you'll cut Hindi vowels.
  • Language hints: always pass your top-3 expected languages even with auto-ID — it improves accuracy by 5-15%.
  • Punctuation: v4 punctuates aggressively; for downstream NLP that hates commas, set `enable_punctuation: false` (see the sketch after this list).
  • Audio format: send `pcm_s16le` 16 kHz mono — Opus and 8 kHz get re-encoded server-side and add ~30ms.
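A sketch of a start message with those knobs applied; `enable_punctuation` comes from the pitfall above, the rest mirrors Step 2 (confirm field names against the current API reference):

```python
import os

config = {
    "api_key": os.environ["SONIOX_API_KEY"],
    "model": "stt-rt-v4",
    "audio_format": "pcm_s16le",           # 16 kHz mono PCM: no server-side re-encode
    "sample_rate": 16000,
    "num_channels": 1,
    "language_hints": ["en", "es", "hi"],  # top-3 expected languages
    "enable_language_identification": True,
    "enable_endpoint_detection": True,     # emits the <end> turn-boundary token
    "enable_punctuation": False,           # raw tokens for punctuation-averse NLP
}
```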

How CallSphere does this

CallSphere routes the OneRoof real-estate vertical (US/Mexico/India lines) through Soniox v4 to handle Spanglish and Hinglish callers, feeding a stack of 37 agents, 90+ tools, and 115+ DB tables across 6 verticals. Plans run $149/$499/$1,499 with a 14-day trial and a 22% affiliate program.

FAQ

Pricing? $0.0033/min for stt-rt-v4 — among the cheapest premium STT in 2026.

Diarization? Yes, via `enable_speaker_diarization: true` plus `num_speakers`.
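A sketch extending the config dict from the pitfalls section with the fields this answer names (the speaker count is illustrative):

```python
config.update({
    "enable_speaker_diarization": True,
    "num_speakers": 2,  # illustrative; set to your expected speaker count
})
```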

SDKs? Web, Node, Python, Go, .NET — all wrap the same WS contract.

EU residency? Pin stt-rt-eu.soniox.com for EU-hosted endpoints.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.