Build a Voice Agent with Mistral Voxtral (Local, 2026 Release)
Mistral's Voxtral 4B-TTS (March 26, 2026) is open-weights, quantizes to ~3 GB, and fits on a single 16 GB GPU. Here's the full local voice agent build, using Voxtral Mini for STT and Voxtral TTS for synthesis.
TL;DR — Mistral shipped Voxtral TTS on March 26, 2026 — open-weights, 4B params, 9 languages, function-calling from voice. Pair it with Voxtral-Mini-3B for STT/understanding and you get an end-to-end open-source voice stack from one vendor.
What you'll build
A FastAPI voice agent: client streams audio over WebSocket, server runs Voxtral-Mini-3B for STT + intent extraction, calls a Llama-style chat model, then synthesizes the reply with Voxtral-4B-TTS-2603. Total VRAM: ~16 GB.
Prerequisites
- NVIDIA GPU with 16 GB+ VRAM (3090, 4080, A4000, etc.).
- Python 3.11, with `pip install transformers torch fastapi uvicorn websockets soundfile`.
- Hugging Face token with access to the gated `mistralai` repos: `huggingface-cli login`, then `huggingface-cli download mistralai/Voxtral-Mini-3B-2507` and `huggingface-cli download mistralai/Voxtral-4B-TTS-2603`.
Architecture
```mermaid
flowchart LR
    CL[Client WSS] --> API[FastAPI Bridge]
    API --> VOXSTT[Voxtral-Mini-3B STT+Intent]
    VOXSTT -->|tool call| LLM[Llama 3.1 8B]
    LLM --> VOXTTS[Voxtral-4B-TTS-2603]
    VOXTTS --> CL
```
Step 1 — Load Voxtral-Mini for STT
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
proc_stt = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
stt = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507", torch_dtype=torch.bfloat16).to(device)

def transcribe(wav_path):
    # Build a transcription prompt for the audio file.
    # ("apply_transcrition_request" is the method name as shipped in transformers.)
    inputs = proc_stt.apply_transcrition_request(
        language="en", audio=wav_path, model_id="mistralai/Voxtral-Mini-3B-2507")
    inputs = inputs.to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, skipping the prompt.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
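A quick smoke test before wiring anything else up. The file name is a placeholder; the Step 4 pipeline feeds 16 kHz mono WAVs, so match that here:

```python
import soundfile as sf

# "sample.wav" is a placeholder path, not shipped with the models.
audio, sr = sf.read("sample.wav")
print(f"{len(audio) / sr:.1f}s at {sr} Hz")
print(transcribe("sample.wav"))
```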
Voxtral's killer feature is function calling from voice — the model can emit tool-call JSON directly from audio without a separate LLM hop, when prompted accordingly.
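Prompted that way (Step 5 shows the full prompt), the decoded output is ordinary text you parse yourself. A hypothetical emission, matching the TOOLS schema defined in Step 5:

```python
import json

# Hypothetical output for "book me a demo next Tuesday"; the exact
# shape depends entirely on how the prompt describes the tools.
raw = '{"name": "book_demo", "parameters": {"date": "2026-04-07", "email": "jane@acme.com"}}'
call = json.loads(raw)
print(call["name"], call["parameters"])
```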
Step 2 — Load Voxtral-4B-TTS
```python
from transformers import AutoModelForCausalLM, AutoProcessor

proc_tts = AutoProcessor.from_pretrained("mistralai/Voxtral-4B-TTS-2603")
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", torch_dtype=torch.bfloat16).to(device)
def synthesize(text, voice_ref="voices/amy_15s.wav"):
    # Condition generation on the reference voice clip, then decode
    # the generated audio tokens back to a waveform.
    inputs = proc_tts(text=text, audio=voice_ref, return_tensors="pt").to(device)
    audio = tts.generate(**inputs, max_new_tokens=2048)
    return proc_tts.decode_audio(audio)
```
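To check the output, write it to disk with soundfile. Two assumptions here: that `decode_audio` returns a NumPy-compatible waveform, and that the output rate is 24 kHz; verify both against the model card.

```python
import soundfile as sf

reply = synthesize("Thanks for calling. How can I help?")
# Assumes decode_audio returns a NumPy float waveform at 24 kHz;
# check the Voxtral-4B-TTS model card for the actual output rate.
sf.write("reply.wav", reply, 24000)
```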
Voxtral TTS supports zero-shot cross-lingual cloning: pass an English reference and synthesize French in the same voice.
Step 3 — Quantize for 12 GB cards
```python
from transformers import BitsAndBytesConfig

# Load weights in 4-bit; compute still runs in bfloat16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", quantization_config=bnb)
```
Quantized footprint drops to ~3 GB; quality loss is small for English.
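To verify the footprint on your own card, a minimal check with PyTorch's allocator stats after loading:

```python
import torch

# Report GPU memory actually held by tensors (not the driver's reserve).
torch.cuda.synchronize()
used = torch.cuda.memory_allocated() / 1024**3
peak = torch.cuda.max_memory_allocated() / 1024**3
print(f"allocated: {used:.1f} GiB, peak: {peak:.1f} GiB")
```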
Step 4 — Wire the FastAPI WebSocket
```python
from fastapi import FastAPI, WebSocket
import soundfile as sf, tempfile, ollama

app = FastAPI()

@app.websocket("/voxtral")
async def voxtral_ws(ws: WebSocket):
    await ws.accept()
    history = [{"role": "system", "content": "Be concise."}]
    while True:
        # One turn per message: raw audio in, synthesized audio out.
        data = await ws.receive_bytes()
        f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
        sf.write(f, _bytes_to_pcm(data), 16000)
        text = transcribe(f)
        history.append({"role": "user", "content": text})
        r = ollama.chat(model="llama3.1:8b", messages=history)
        reply = r["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        audio = synthesize(reply)
        # _bytes_to_pcm/_pcm_to_bytes are your framing helpers (not shown).
        await ws.send_bytes(_pcm_to_bytes(audio))
```
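A minimal test client, assuming you run the app with `uvicorn` on the default port 8000 and that your `_bytes_to_pcm` helper expects raw 16-bit PCM at 16 kHz (the file names are placeholders):

```python
import asyncio
import websockets

async def main():
    # Send one utterance as raw bytes, save the synthesized reply.
    async with websockets.connect("ws://localhost:8000/voxtral") as ws:
        with open("question_16k_s16le.raw", "rb") as f:
            await ws.send(f.read())
        reply = await ws.recv()
        with open("reply.raw", "wb") as f:
            f.write(reply)

asyncio.run(main())
```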
Step 5 — Use Voxtral's voice tool-calling
```python
TOOLS = [{"name": "book_demo",
          "description": "Book a sales demo",
          "parameters": {"date": "string", "email": "string"}}]

def voice_to_tool(wav_path):
    # Ask the STT model to answer with a tool call instead of a transcript.
    inputs = proc_stt.apply_chat_template([
        {"role": "system", "content": f"Tools: {TOOLS}. Output JSON tool call only."},
        {"role": "user", "content": [{"type": "audio", "path": wav_path}]}],
        return_tensors="pt").to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=180)
    # Decode only the new tokens so the prompt doesn't leak into the JSON.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
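The model's output is plain text, so parse defensively and fall back to the full pipeline when it isn't valid JSON. A sketch; `book_demo` is a stand-in handler, and the `parameters` key assumes the model mirrors the TOOLS schema:

```python
import json

def book_demo(date: str, email: str) -> str:
    # Stand-in handler; replace with your real booking logic.
    return f"Demo booked for {date}, confirmation sent to {email}."

HANDLERS = {"book_demo": book_demo}

def dispatch(wav_path: str) -> str | None:
    raw = voice_to_tool(wav_path)
    try:
        call = json.loads(raw)
        return HANDLERS[call["name"]](**call["parameters"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # fall back to the Step 4 transcribe -> LLM path
```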
This skips the STT→LLM hop entirely for routine intents and cuts latency by 200–400 ms.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Multi-language demo
```python synthesize("Bonjour, comment puis-je vous aider ?", voice_ref="voices/amy_en.wav")
Voxtral renders French in Amy's English voice — zero-shot cross-lingual.
```
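The same call covers the other supported languages; a quick sweep (the 24 kHz write rate is an assumption, as above):

```python
import soundfile as sf

# Same English reference voice, three of the nine supported languages.
samples = {
    "de": "Hallo, wie kann ich Ihnen helfen?",
    "es": "Hola, ¿en qué puedo ayudarle?",
    "it": "Ciao, come posso aiutarti?",
}
for lang, text in samples.items():
    sf.write(f"demo_{lang}.wav", synthesize(text, voice_ref="voices/amy_en.wav"), 24000)
```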
Common pitfalls
- Gated repo. Without an HF token and an approved access request, downloads fail with a silent 401.
- bfloat16 only. Voxtral does not support fp16 cleanly — expect NaNs.
- Voice reference quality. Use a clean 10–15 s clip; Voxtral is sensitive to room reverb.
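A cheap preflight check on the reference clip catches the third pitfall before it shows up as artifacts. The thresholds are rules of thumb from the guidance above, not hard Voxtral requirements:

```python
import soundfile as sf

def check_voice_ref(path: str) -> None:
    # Rule-of-thumb checks, not hard Voxtral requirements.
    audio, sr = sf.read(path)
    dur = len(audio) / sr
    if not 10 <= dur <= 15:
        print(f"warning: clip is {dur:.1f}s; 10-15s works best")
    if audio.ndim > 1:
        print("warning: clip is not mono")
```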
How CallSphere does this in production
CallSphere uses cloud TTS for live calls (latency, reliability) but is evaluating Voxtral for self-serve and on-prem deployments. Our 37 agents across 6 verticals (Healthcare: 14 tools, FastAPI on :8084, OpenAI Realtime; OneRoof: 10 specialists; plus Salon, Dental, F&B, and Behavioral) sit behind 90+ tools and 115+ Postgres tables. Flat pricing: $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.
FAQ
License? Apache 2.0 — full commercial use.
Languages? 9 (EN/FR/DE/ES/NL/PT/IT/HI/AR).
Latency target? ~600 ms on a 4090 with quantization.
Voice cloning ethics? Always get consent; Mistral provides a content-provenance watermark by default.
Realtime barge-in? Not built-in — wrap with WebRTC + a VAD like Silero.
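A barge-in sketch using Silero VAD loaded via torch.hub; the chunk size matches Silero's 16 kHz streaming API, while the playback-cut policy is an assumption to tune:

```python
import torch

# Silero VAD via torch.hub; VADIterator processes streaming 16 kHz chunks.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(_, _, _, VADIterator, _) = utils
vad = VADIterator(model, sampling_rate=16000)

def barge_in(chunk: torch.Tensor) -> bool:
    # chunk: 512 samples of 16 kHz float32 audio.
    # VADIterator returns a dict on speech start/end, else None;
    # a "start" event while TTS is playing means cut playback.
    event = vad(chunk, return_seconds=True)
    return bool(event and "start" in event)
```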
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.