Build a Voice Agent with Mistral Voxtral (Local, 2026 Release)
Mistral's Voxtral 4B-TTS (March 26, 2026) is open-weights, quantizes to ~3 GB, and fits on a single 16 GB GPU. Here's the full local voice agent build, using Voxtral Mini for STT and Voxtral TTS for synthesis.
TL;DR — Mistral shipped Voxtral TTS on March 26, 2026 — open-weights, 4B params, 9 languages, function-calling from voice. Pair it with Voxtral-Mini-3B for STT/understanding and you get an end-to-end open-source voice stack from one vendor.
What you'll build
A FastAPI voice agent: client streams audio over WebSocket, server runs Voxtral-Mini-3B for STT + intent extraction, calls a Llama-style chat model, then synthesizes the reply with Voxtral-4B-TTS-2603. Total VRAM: ~16 GB.
Prerequisites
- NVIDIA GPU with 16 GB+ VRAM (3090, 4080, A4000, etc.).
- Python 3.11, with `pip install transformers torch fastapi uvicorn websockets soundfile`.
- Hugging Face token with access to the gated `mistralai` repos: `huggingface-cli login`, then `huggingface-cli download mistralai/Voxtral-Mini-3B-2507` and `huggingface-cli download mistralai/Voxtral-4B-TTS-2603`.
Architecture
```mermaid
flowchart LR
    CL[Client WSS] --> API[FastAPI Bridge]
    API --> VOXSTT[Voxtral-Mini-3B STT+Intent]
    VOXSTT -->|tool call| LLM[Llama 3.1 8B]
    LLM --> VOXTTS[Voxtral-4B-TTS-2603]
    VOXTTS --> CL
```
Step 1 — Load Voxtral-Mini for STT
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
proc_stt = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
stt = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507", torch_dtype=torch.bfloat16).to(device)

def transcribe(wav_path):
    # Build a transcription prompt for the audio file.
    # ("apply_transcrition_request" is the method name as shipped in transformers.)
    inputs = proc_stt.apply_transcrition_request(
        language="en", audio=wav_path, model_id="mistralai/Voxtral-Mini-3B-2507")
    inputs = inputs.to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, skipping the prompt.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
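A quick smoke test before wiring anything else up. The file name is a placeholder; the Step 4 pipeline feeds 16 kHz mono WAVs, so match that here:

```python
import soundfile as sf

# "sample.wav" is a placeholder path, not shipped with the models.
audio, sr = sf.read("sample.wav")
print(f"{len(audio) / sr:.1f}s at {sr} Hz")
print(transcribe("sample.wav"))
```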
Voxtral's killer feature is function calling from voice — the model can emit tool-call JSON directly from audio without a separate LLM hop, when prompted accordingly.
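Prompted that way (Step 5 shows the full prompt), the decoded output is ordinary text you parse yourself. A hypothetical emission, matching the TOOLS schema defined in Step 5:

```python
import json

# Hypothetical output for "book me a demo next Tuesday"; the exact
# shape depends entirely on how the prompt describes the tools.
raw = '{"name": "book_demo", "parameters": {"date": "2026-04-07", "email": "jane@acme.com"}}'
call = json.loads(raw)
print(call["name"], call["parameters"])
```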
Step 2 — Load Voxtral-4B-TTS
```python
from transformers import AutoModelForCausalLM, AutoProcessor

proc_tts = AutoProcessor.from_pretrained("mistralai/Voxtral-4B-TTS-2603")
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", torch_dtype=torch.bfloat16).to(device)
def synthesize(text, voice_ref="voices/amy_15s.wav"):
    # Condition generation on the reference voice clip, then decode
    # the generated audio tokens back to a waveform.
    inputs = proc_tts(text=text, audio=voice_ref, return_tensors="pt").to(device)
    audio = tts.generate(**inputs, max_new_tokens=2048)
    return proc_tts.decode_audio(audio)
```
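To check the output, write it to disk with soundfile. Two assumptions here: that `decode_audio` returns a NumPy-compatible waveform, and that the output rate is 24 kHz; verify both against the model card.

```python
import soundfile as sf

reply = synthesize("Thanks for calling. How can I help?")
# Assumes decode_audio returns a NumPy float waveform at 24 kHz;
# check the Voxtral-4B-TTS model card for the actual output rate.
sf.write("reply.wav", reply, 24000)
```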
Voxtral TTS supports zero-shot cross-lingual cloning: pass an English reference and synthesize French in the same voice.
Step 3 — Quantize for 12 GB cards
```python
from transformers import BitsAndBytesConfig

# Load weights in 4-bit; compute still runs in bfloat16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", quantization_config=bnb)
```
Quantized footprint drops to ~3 GB; quality loss is small for English.
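To verify the footprint on your own card, a minimal check with PyTorch's allocator stats after loading:

```python
import torch

# Report GPU memory actually held by tensors (not the driver's reserve).
torch.cuda.synchronize()
used = torch.cuda.memory_allocated() / 1024**3
peak = torch.cuda.max_memory_allocated() / 1024**3
print(f"allocated: {used:.1f} GiB, peak: {peak:.1f} GiB")
```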
Step 4 — Wire the FastAPI WebSocket
```python
from fastapi import FastAPI, WebSocket
import soundfile as sf, tempfile, ollama

app = FastAPI()

@app.websocket("/voxtral")
async def voxtral_ws(ws: WebSocket):
    await ws.accept()
    history = [{"role": "system", "content": "Be concise."}]
    while True:
        # One turn per message: raw audio in, synthesized audio out.
        data = await ws.receive_bytes()
        f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
        sf.write(f, _bytes_to_pcm(data), 16000)
        text = transcribe(f)
        history.append({"role": "user", "content": text})
        r = ollama.chat(model="llama3.1:8b", messages=history)
        reply = r["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        audio = synthesize(reply)
        # _bytes_to_pcm/_pcm_to_bytes are your framing helpers (not shown).
        await ws.send_bytes(_pcm_to_bytes(audio))
```
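A minimal test client, assuming you run the app with `uvicorn` on the default port 8000 and that your `_bytes_to_pcm` helper expects raw 16-bit PCM at 16 kHz (the file names are placeholders):

```python
import asyncio
import websockets

async def main():
    # Send one utterance as raw bytes, save the synthesized reply.
    async with websockets.connect("ws://localhost:8000/voxtral") as ws:
        with open("question_16k_s16le.raw", "rb") as f:
            await ws.send(f.read())
        reply = await ws.recv()
        with open("reply.raw", "wb") as f:
            f.write(reply)

asyncio.run(main())
```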
Step 5 — Use Voxtral's voice tool-calling
```python
TOOLS = [{"name": "book_demo",
          "description": "Book a sales demo",
          "parameters": {"date": "string", "email": "string"}}]

def voice_to_tool(wav_path):
    # Ask the STT model to answer with a tool call instead of a transcript.
    inputs = proc_stt.apply_chat_template([
        {"role": "system", "content": f"Tools: {TOOLS}. Output JSON tool call only."},
        {"role": "user", "content": [{"type": "audio", "path": wav_path}]}],
        return_tensors="pt").to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=180)
    # Decode only the new tokens so the prompt doesn't leak into the JSON.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
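The model's output is plain text, so parse defensively and fall back to the full pipeline when it isn't valid JSON. A sketch; `book_demo` is a stand-in handler, and the `parameters` key assumes the model mirrors the TOOLS schema:

```python
import json

def book_demo(date: str, email: str) -> str:
    # Stand-in handler; replace with your real booking logic.
    return f"Demo booked for {date}, confirmation sent to {email}."

HANDLERS = {"book_demo": book_demo}

def dispatch(wav_path: str) -> str | None:
    raw = voice_to_tool(wav_path)
    try:
        call = json.loads(raw)
        return HANDLERS[call["name"]](**call["parameters"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # fall back to the Step 4 transcribe -> LLM path
```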
This skips the STT→LLM hop entirely for routine intents and cuts latency by 200–400 ms.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Multi-language demo
```python synthesize("Bonjour, comment puis-je vous aider ?", voice_ref="voices/amy_en.wav")
Voxtral renders French in Amy's English voice — zero-shot cross-lingual.
```
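The same call covers the other supported languages; a quick sweep (the 24 kHz write rate is an assumption, as above):

```python
import soundfile as sf

# Same English reference voice, three of the nine supported languages.
samples = {
    "de": "Hallo, wie kann ich Ihnen helfen?",
    "es": "Hola, ¿en qué puedo ayudarle?",
    "it": "Ciao, come posso aiutarti?",
}
for lang, text in samples.items():
    sf.write(f"demo_{lang}.wav", synthesize(text, voice_ref="voices/amy_en.wav"), 24000)
```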
Common pitfalls
- Gated repo. Without an HF token and an approved access request, downloads fail with a silent 401.
- bfloat16 only. Voxtral does not support fp16 cleanly — expect NaNs.
- Voice reference quality. Use a clean 10–15 s clip; Voxtral is sensitive to room reverb.
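A cheap preflight check on the reference clip catches the third pitfall before it shows up as artifacts. The thresholds are rules of thumb from the guidance above, not hard Voxtral requirements:

```python
import soundfile as sf

def check_voice_ref(path: str) -> None:
    # Rule-of-thumb checks, not hard Voxtral requirements.
    audio, sr = sf.read(path)
    dur = len(audio) / sr
    if not 10 <= dur <= 15:
        print(f"warning: clip is {dur:.1f}s; 10-15s works best")
    if audio.ndim > 1:
        print("warning: clip is not mono")
```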
How CallSphere does this in production
CallSphere uses cloud TTS for live calls (latency, reliability) but is evaluating Voxtral for self-serve and on-prem deployments. Our 37 agents across 6 verticals (Healthcare: 14 tools, FastAPI on :8084, OpenAI Realtime; OneRoof: 10 specialists; plus Salon, Dental, F&B, and Behavioral) sit behind 90+ tools and 115+ Postgres tables. Flat pricing: $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.
FAQ
License? Apache 2.0 — full commercial use.
Languages? 9 (EN/FR/DE/ES/NL/PT/IT/HI/AR).
Latency target? ~600 ms on a 4090 with quantization.
Voice cloning ethics? Always get consent; Mistral provides a content-provenance watermark by default.
Realtime barge-in? Not built-in — wrap with WebRTC + a VAD like Silero.
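A barge-in sketch using Silero VAD loaded via torch.hub; the chunk size matches Silero's 16 kHz streaming API, while the playback-cut policy is an assumption to tune:

```python
import torch

# Silero VAD via torch.hub; VADIterator processes streaming 16 kHz chunks.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(_, _, _, VADIterator, _) = utils
vad = VADIterator(model, sampling_rate=16000)

def barge_in(chunk: torch.Tensor) -> bool:
    # chunk: 512 samples of 16 kHz float32 audio.
    # VADIterator returns a dict on speech start/end, else None;
    # a "start" event while TTS is playing means cut playback.
    event = vad(chunk, return_seconds=True)
    return bool(event and "start" in event)
```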
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.