---
title: "Build a Voice Agent with Soniox v4 Real-Time STT (60+ Languages, 2026)"
description: "Soniox v4 ships speaker-native accuracy across 60+ languages with sub-200ms latency. Build a multilingual voice agent — Python WebSocket, code-switching, pitfalls."
canonical: https://callsphere.ai/blog/vw9h-build-voice-agent-soniox-v4-real-time-stt-2026
category: "AI Voice Agents"
tags: ["Soniox", "STT", "Real-Time", "Multilingual", "Voice Agent"]
author: "CallSphere Team"
published: 2026-04-06T00:00:00.000Z
updated: 2026-05-08T03:13:54.746Z
---

# Build a Voice Agent with Soniox v4 Real-Time STT (60+ Languages, 2026)

> Soniox v4 ships speaker-native accuracy across 60+ languages with sub-200ms latency. Build a multilingual voice agent — Python WebSocket, code-switching, pitfalls.

> **TL;DR** — Soniox `stt-rt-v4` (GA Feb 2026) is the new bar for multilingual real-time STT: native accuracy across 60+ languages, code-switching mid-sentence, real-time translation, and full backward compat with v3 (just change the model name). Best fit when your callers don't all speak English.

## What you'll build

A Python voice agent that auto-detects language per turn, transcribes English/Spanish/Hindi mid-sentence ("I'm gonna recoger los niños"), and feeds GPT-4o while showing live English translation alongside the original.

## Architecture

```mermaid
flowchart LR
  MIC[Caller mic] -- PCM 16k --> WS[Soniox WS API]
  WS -- final tokens --> SV[Server]
  WS -- translation tokens --> SV
  SV --> LLM[GPT-4o]
  LLM --> TTS[ElevenLabs Multilingual]
  TTS --> MIC
```

## Step 1 — Install

```bash
pip install soniox websockets sounddevice openai
export SONIOX_API_KEY=...
```

## Step 2 — Open the v4 WebSocket

```python
import asyncio, json, os, websockets
URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def stream(audio_iter):
    """Yield decoded Soniox responses while audio is sent in the background."""
    async with websockets.connect(URL) as ws:
        # The first message is the JSON config; raw audio frames follow.
        await ws.send(json.dumps({
            "api_key": os.environ["SONIOX_API_KEY"],
            "model": "stt-rt-v4",
            "audio_format": "pcm_s16le",
            "sample_rate": 16000,
            "num_channels": 1,
            "language_hints": ["en", "es", "hi"],
            "enable_language_identification": True,
            "translation": {"type": "one_way", "target_language": "en"},
            "enable_endpoint_detection": True,
        }))
        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
            await ws.send(b"")  # empty frame signals end of audio
        # Keep a reference so the task is not garbage-collected mid-stream.
        sender = asyncio.create_task(send_audio())
        try:
            async for msg in ws:
                yield json.loads(msg)
        finally:
            sender.cancel()  # stop sending if the consumer exits early
```

## Step 3 — Mic capture

```python
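import asyncio, queue
import sounddevice as sd
q: queue.Queue[bytes] = queue.Queue()
def cb(indata, *_): q.put(bytes(indata))  # runs on the audio thread
# Keep a reference so the stream is not garbage-collected while recording.
mic_stream = sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                               callback=cb, blocksize=1600)
mic_stream.start()
async def mic():
    while True:
        # q.get() blocks; run it in a worker thread so the event loop stays free.
        yield await asyncio.to_thread(q.get)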
```
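To sanity-check the `blocksize=1600` choice above, the chunk duration and size follow directly from the sample rate and sample width. The helper below is a small illustrative sketch (the name `chunk_stats` is hypothetical, not part of any SDK):

```python
def chunk_stats(blocksize: int, sample_rate: int = 16000,
                channels: int = 1, bytes_per_sample: int = 2):
    """Return (duration in ms, size in bytes) for one capture block."""
    duration_ms = 1000 * blocksize / sample_rate
    size_bytes = blocksize * channels * bytes_per_sample
    return duration_ms, size_bytes

print(chunk_stats(1600))  # (100.0, 3200)
```

So each WebSocket frame carries 100 ms of 16-bit mono audio. A smaller block lowers capture delay at the cost of more frames per second.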

## Step 4 — Drive the LLM on endpoints

```python
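import asyncio
from openai import AsyncOpenAI
oa = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content":
            "You speak whichever language the caller used last."}]

async def main():
    final_text = ""
    async for evt in stream(mic()):
        # Soniox reports failures via error_code / error_message.
        if evt.get("error_code") or evt.get("error"):
            print("err", evt); break
        for tok in evt.get("tokens", []):
            if tok.get("text") == "<end>":  # endpoint token: the turn is over
                if final_text.strip():
                    history.append({"role": "user", "content": final_text})
                    r = await oa.chat.completions.create(
                            model="gpt-4o", messages=history)
                    reply = r.choices[0].message.content
                    history.append({"role": "assistant", "content": reply})
                    print("AGENT:", reply)
                final_text = ""
            elif (tok.get("is_final")
                  and tok.get("translation_status") != "translation"):
                # Keep only original-language tokens; translated tokens
                # would otherwise duplicate the turn text.
                final_text += tok["text"]

asyncio.run(main())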
```
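The turn logic is easier to test when it is separated from the network and the LLM call. The following is a minimal sketch, assuming the endpoint token's text is `<end>` and that events are already-decoded dicts with a `tokens` list; `split_turns` and `END_TOKEN` are illustrative names, not Soniox API:

```python
from typing import Iterable, Iterator

END_TOKEN = "<end>"  # assumed text of the endpoint-detection token

def split_turns(events: Iterable[dict]) -> Iterator[str]:
    """Yield one string per caller turn, built from final tokens only."""
    buf = []
    for evt in events:
        for tok in evt.get("tokens", []):
            if tok.get("text") == END_TOKEN:
                turn = "".join(buf).strip()
                buf.clear()
                if turn:  # skip empty turns (e.g. back-to-back endpoints)
                    yield turn
            elif tok.get("is_final"):
                buf.append(tok["text"])

events = [
    {"tokens": [{"text": "Hola", "is_final": True},
                {"text": " am", "is_final": False}]},  # non-final: ignored
    {"tokens": [{"text": " amigo", "is_final": True},
                {"text": "<end>", "is_final": True}]},
]
print(list(split_turns(events)))  # ['Hola amigo']
```

Feeding recorded events through a function like this lets you replay tricky calls offline before touching the live agent.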

## Step 5 — Code-switching diagnostics

Each token in v4 carries a `language` field. Log `tok["language"]` to verify mid-sentence switches:

```python
print(f"{tok['text']:>10}  lang={tok.get('language')}  fin={tok.get('is_final')}")
```
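For a quicker read than per-token logs, consecutive tokens can be grouped into language runs. This is an illustrative sketch that assumes each final token carries the `language` and `text` fields described above; `language_runs` is a hypothetical helper name:

```python
from itertools import groupby

def language_runs(tokens):
    """Collapse consecutive final tokens with the same language into runs."""
    finals = [t for t in tokens if t.get("is_final") and t.get("language")]
    return [(lang, "".join(t["text"] for t in grp).strip())
            for lang, grp in groupby(finals, key=lambda t: t["language"])]

sample = [
    {"text": "I'm", "language": "en", "is_final": True},
    {"text": " gonna", "language": "en", "is_final": True},
    {"text": " recoger", "language": "es", "is_final": True},
    {"text": " los", "language": "es", "is_final": True},
    {"text": " niños", "language": "es", "is_final": True},
]
print(language_runs(sample))
# [('en', "I'm gonna"), ('es', 'recoger los niños')]
```

More than one run inside a single turn means the caller code-switched mid-sentence.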

## Step 6 — Translation channel

When `translation.type == "one_way"`, v4 emits separate translated tokens with `source_language` + `text`. Render both columns to the agent UI for live caption + English gloss.
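A two-column view can be built by separating original tokens from translated ones. The sketch below assumes translated tokens are marked with a `translation_status` value of `"translation"`; treat that field name as an assumption to verify against the Soniox real-time translation docs. `split_columns` is a hypothetical helper:

```python
def split_columns(tokens):
    """Return (original_text, translated_text) from a list of final tokens."""
    original, translated = [], []
    for tok in tokens:
        if not tok.get("is_final"):
            continue  # skip provisional tokens; they may still change
        # Assumption: translated tokens carry translation_status == "translation".
        if tok.get("translation_status") == "translation":
            translated.append(tok["text"])
        else:
            original.append(tok["text"])
    return "".join(original).strip(), "".join(translated).strip()

tokens = [
    {"text": "Recoger", "is_final": True, "translation_status": "original"},
    {"text": " los niños", "is_final": True, "translation_status": "original"},
    {"text": "Pick up", "is_final": True, "translation_status": "translation"},
    {"text": " the kids", "is_final": True, "translation_status": "translation"},
]
print(split_columns(tokens))  # ('Recoger los niños', 'Pick up the kids')
```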

## Step 7 — Migrate from v3

```diff
- "model": "stt-rt-v3"
+ "model": "stt-rt-v4"
```

That's it: v4 is fully backward-compatible. After Feb 28, 2026, v3 requests are automatically routed to v4.

## Pitfalls

- **Endpoint detection**: The `<end>` token is the real turn boundary. Don't rely on silence-based VAD alone or you'll cut Hindi vowels.
- **Language hints**: Always pass top-3 expected languages even with auto-ID — improves accuracy 5-15%.
- **Punctuation**: v4 punctuates aggressively; for downstream NLP that hates commas, set `enable_punctuation: false`.
- **Audio format**: `pcm_s16le` 16kHz mono — Opus and 8kHz get re-encoded server-side and add ~30ms.
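
On the audio-format pitfall: if your capture path yields floating-point samples instead of 16-bit integers, convert them before sending. The following is a minimal standard-library sketch that assumes mono float samples in the range [-1.0, 1.0]; `float_to_pcm_s16le` is a hypothetical helper name:

```python
import struct

def float_to_pcm_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    # Clip first so out-of-range input cannot overflow the int16 range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

pcm = float_to_pcm_s16le([0.0, 1.0, -1.0, 2.0])
print(len(pcm))  # 8
```

Each sample becomes two bytes, so the output matches the `pcm_s16le` format declared in the Step 2 config.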

## How CallSphere does this

CallSphere routes the OneRoof real-estate vertical (US/Mex/India lines) through Soniox v4 to handle Spanglish + Hinglish callers, feeding **37 agents · 90+ tools · 115+ DB tables · 6 verticals**. **$149/$499/$1,499 · 14-day trial · 22% affiliate**.

## FAQ

**Pricing?** $0.0033/min for stt-rt-v4 — among the cheapest premium STT in 2026.
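
As a rough worked example of the per-minute rate quoted above (the rate is taken from this article; check current Soniox pricing before budgeting), with `stt_cost` as an illustrative helper:

```python
RATE_PER_MIN = 0.0033  # USD per audio minute, as quoted in this article

def stt_cost(minutes: float) -> float:
    """Estimated STT spend in USD, rounded to cents."""
    return round(minutes * RATE_PER_MIN, 2)

print(stt_cost(10_000))  # 33.0
```

That is about $33 for 10,000 minutes of transcribed audio, before LLM and TTS costs.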

**Diarization?** Yes via `enable_speaker_diarization: true` + `num_speakers`.

**SDKs?** Web, Node, Python, Go, .NET — all wrap the same WS contract.

**EU residency?** Pin `stt-rt-eu.soniox.com` for EU-hosted endpoints.

## Sources

- Soniox Blog - v4 Real-Time: New Standard - [https://soniox.com/blog/2026-02-05-soniox-v4-real-time](https://soniox.com/blog/2026-02-05-soniox-v4-real-time)
- Soniox Docs - Real-time Transcription - [https://soniox.com/docs/stt/rt/real-time-transcription](https://soniox.com/docs/stt/rt/real-time-transcription)
- Soniox Docs - WebSocket API - [https://soniox.com/docs/stt/api-reference/websocket-api](https://soniox.com/docs/stt/api-reference/websocket-api)
- Soniox Docs - Real-time Translation - [https://soniox.com/docs/stt/rt/real-time-translation](https://soniox.com/docs/stt/rt/real-time-translation)
