---
title: "Build a Voice Agent with Mistral Voxtral (Local, 2026 Release)"
description: "Mistral's Voxtral 4B-TTS (March 26 2026) is open-weights, 3 GB quantized, and runs on a single 16 GB GPU. Here's the full local voice agent build using Voxtral Mini for STT and Voxtral TTS."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-mistral-voxtral-local
category: "AI Voice Agents"
tags: ["Voxtral", "Mistral", "Local AI", "Voice Agent", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-28T00:00:00.000Z
updated: 2026-05-07T16:13:45.171Z
---

# Build a Voice Agent with Mistral Voxtral (Local, 2026 Release)

> **TL;DR** — Mistral shipped Voxtral TTS on March 26, 2026 — open-weights, 4B params, 9 languages, function-calling from voice. Pair it with Voxtral-Mini-3B for STT/understanding and you get an end-to-end open-source voice stack from one vendor.

## What you'll build

A FastAPI voice agent: client streams audio over WebSocket, server runs Voxtral-Mini-3B for STT + intent extraction, calls a Llama-style chat model, then synthesizes the reply with Voxtral-4B-TTS-2603. Total VRAM: ~16 GB.

## Prerequisites

1. NVIDIA GPU with 16 GB+ VRAM (3090, 4080, A4000, etc.).
2. Python 3.11, `pip install transformers torch fastapi uvicorn websockets soundfile`.
3. Hugging Face token with access to `mistralai` gated repos.
4. `huggingface-cli login` then `huggingface-cli download mistralai/Voxtral-Mini-3B-2507` and `mistralai/Voxtral-4B-TTS-2603`.

## Architecture

```mermaid
flowchart LR
  CL[Client WSS] --> API[FastAPI Bridge]
  API --> VOXSTT[Voxtral-Mini-3B STT+Intent]
  VOXSTT -->|tool call| LLM[Llama 3.1 8B]
  LLM --> VOXTTS[Voxtral-4B-TTS-2603]
  VOXTTS --> CL
```

## Step 1 — Load Voxtral-Mini for STT

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
proc_stt = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
stt = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507", torch_dtype=torch.bfloat16).to(device)

def transcribe(wav_path):
    # Build the transcription prompt for the given audio file.
    inputs = proc_stt.apply_transcription_request(language="en", audio=wav_path,
                                                  model_id="mistralai/Voxtral-Mini-3B-2507")
    inputs = inputs.to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=200)
    return proc_stt.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True)[0]
```
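
A quick smoke test (any 16 kHz mono WAV will do; the path here is a placeholder):

```python
print(transcribe("samples/hello.wav"))  # placeholder clip; prints the transcript
```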

Voxtral's killer feature is *function calling from voice* — the model can emit tool-call JSON directly from audio without a separate LLM hop, when prompted accordingly.

## Step 2 — Load Voxtral-4B-TTS

```python
from transformers import AutoModelForCausalLM, AutoProcessor
proc_tts = AutoProcessor.from_pretrained("mistralai/Voxtral-4B-TTS-2603")
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", torch_dtype=torch.bfloat16).to(device)

def synthesize(text, voice_ref="voices/amy_15s.wav"):
    inputs = proc_tts(text=text, audio=voice_ref, return_tensors="pt").to(device)
    audio = tts.generate(**inputs, max_new_tokens=2048)
    return proc_tts.decode_audio(audio)
```

Voxtral TTS supports zero-shot cross-lingual cloning: pass an English reference and synthesize French in the same voice.
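
To audition output on disk, a minimal sketch. The 24 kHz rate here is an assumption about what `decode_audio` returns; check the model card for the actual output rate:

```python
import soundfile as sf

wav = synthesize("Thanks for calling. How can I help?")
sf.write("reply.wav", wav, 24000)  # assumed output rate; verify on the model card
```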

## Step 3 — Quantize for 12 GB cards

```python
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", quantization_config=bnb, device_map="auto")
```

Quantized footprint drops to ~3 GB; quality loss is small for English.
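
To verify the footprint on your own card (numbers vary with driver and kernel versions):

```python
import torch

torch.cuda.reset_peak_memory_stats()
_ = synthesize("Warmup sentence for memory profiling.")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```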

## Step 4 — Wire the FastAPI WebSocket

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import soundfile as sf, tempfile, ollama
app = FastAPI()

@app.websocket("/voxtral")
async def voxtral_ws(ws: WebSocket):
    await ws.accept()
    history = [{"role":"system","content":"Be concise."}]
    try:
        while True:
            data = await ws.receive_bytes()
            # Stage the incoming PCM as a temp WAV for the Voxtral processor.
            f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
            sf.write(f, _bytes_to_pcm(data), 16000)
            text = transcribe(f)
            history.append({"role":"user","content":text})
            r = ollama.chat(model="llama3.1:8b", messages=history)
            reply = r["message"]["content"]
            history.append({"role":"assistant","content":reply})
            audio = synthesize(reply)
            await ws.send_bytes(_pcm_to_bytes(audio))
    except WebSocketDisconnect:
        pass  # client hung up; exit the loop cleanly
```
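
The `_bytes_to_pcm` / `_pcm_to_bytes` helpers are referenced but not defined above. Here's one minimal sketch, assuming the wire format is raw little-endian 16-bit PCM and that `synthesize` returns a float waveform in [-1, 1]:

```python
import numpy as np

def _bytes_to_pcm(data: bytes) -> np.ndarray:
    # int16 wire format -> float32 in [-1, 1] for soundfile.
    return np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0

def _pcm_to_bytes(audio) -> bytes:
    # float waveform -> clipped int16 bytes for the WebSocket.
    wav = np.asarray(audio, dtype=np.float32)
    return (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
```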

## Step 5 — Use Voxtral's voice tool-calling

```python
TOOLS = [{"name":"book_demo",
  "description":"Book a sales demo",
  "parameters":{"date":"string","email":"string"}}]

def voice_to_tool(wav_path):
    inputs = proc_stt.apply_chat_template([
      {"role":"system","content":f"Tools: {TOOLS}. Output JSON tool call only."},
      {"role":"user","content":[{"type":"audio","path":wav_path}]}],
      return_tensors="pt").to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=180)
    # Strip the prompt tokens so only the generated JSON remains.
    return proc_stt.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True)[0]
```

This skips the STT→LLM hop entirely for routine intents and cuts latency by 200–400 ms.
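
Downstream, parse the emitted JSON and dispatch it yourself. A sketch, with `book_demo` as the hypothetical handler (add validation and a fallback for malformed JSON in production):

```python
import json

HANDLERS = {"book_demo": lambda date, email: f"Booked {date} for {email}"}

def dispatch(raw: str) -> str:
    try:
        call = json.loads(raw)                   # {"name": ..., "parameters": {...}}
    except json.JSONDecodeError:
        return "Sorry, could you rephrase that?" # model didn't emit clean JSON
    handler = HANDLERS.get(call.get("name"))
    return handler(**call["parameters"]) if handler else "Unknown tool."
```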

## Step 6 — Multi-language demo

```python
synthesize("Bonjour, comment puis-je vous aider ?", voice_ref="voices/amy_en.wav")

# Voxtral renders French in Amy's English voice — zero-shot cross-lingual.

```
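
To smoke-test the whole loop end to end, a minimal client sketch using the `websockets` package from the prerequisites. It assumes a local `samples/hello.wav` clip (hypothetical path) recorded at 16 kHz mono, since the server's `_bytes_to_pcm` helper expects raw 16-bit PCM at that rate:

```python
# test_client.py -- send one utterance, save the synthesized reply.
import asyncio
import soundfile as sf
import websockets

async def main():
    # Hypothetical sample clip; must be 16 kHz mono to match the server.
    audio, sr = sf.read("samples/hello.wav", dtype="int16")
    assert sr == 16000, "server expects 16 kHz PCM"
    async with websockets.connect("ws://localhost:8000/voxtral") as ws:
        await ws.send(audio.tobytes())   # raw little-endian 16-bit PCM
        reply = await ws.recv()          # PCM bytes synthesized by Voxtral
        with open("reply.pcm", "wb") as out:
            out.write(reply)

asyncio.run(main())
```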

## Common pitfalls

- **Gated repo.** Without an HF token and an approved access request, downloads fail with a 401.
- **bfloat16 only.** Voxtral does not support fp16 cleanly — expect NaNs.
- **Voice reference quality.** Use a clean 10–15 s clip; Voxtral is sensitive to room reverb (see the prep sketch below).
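
A small helper for prepping a reference clip, trimming to 15 s and peak-normalizing (it won't fix reverb, only level and length):

```python
import numpy as np
import soundfile as sf

def prep_voice_ref(src, dst="voices/ref_15s.wav", seconds=15):
    audio, sr = sf.read(src)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)          # downmix to mono
    audio = audio[: seconds * sr]           # keep the first 15 s
    peak = float(np.abs(audio).max()) or 1.0
    sf.write(dst, 0.9 * audio / peak, sr)   # peak-normalize with a little headroom
    return dst
```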

## How CallSphere does this in production

CallSphere uses cloud TTS for live calls (latency, reliability) but is evaluating Voxtral for self-serve and on-prem deployments. Our 37 agents across 6 verticals (Healthcare with 14 tools / FastAPI :8084 / OpenAI Realtime, OneRoof with 10 specialists, plus Salon, Dental, F&B, Behavioral) sit behind 90+ tools and 115+ Postgres tables. Pricing is flat: $149 / $499 / $1,499. [14-day trial](/trial) · [22% affiliate](/affiliate) · [/demo](/demo).

## FAQ

**License?** Apache 2.0 — full commercial use.

**Languages?** 9 (EN/FR/DE/ES/NL/PT/IT/HI/AR).

**Latency target?** ~600 ms on a 4090 with quantization.

**Voice cloning ethics?** Always get consent; Mistral provides a content-provenance watermark by default.

**Realtime barge-in?** Not built-in — wrap with WebRTC + a VAD like Silero.
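
If you need it, a starting point for the VAD half, using Silero's published torch.hub entry point (the WebRTC transport and interrupt logic are up to you; the clip path is a placeholder):

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("samples/hello.wav", sampling_rate=16000)  # placeholder clip
print(get_speech_timestamps(wav, model, sampling_rate=16000))
```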

## Sources

- [Voxtral TTS announcement](https://mistral.ai/news/voxtral-tts)
- [Voxtral-4B-TTS-2603 model card](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
- [Voxtral TTS guide on DataCamp](https://www.datacamp.com/blog/voxtral-tts)
- [TechCrunch: Mistral Voxtral release](https://techcrunch.com/2026/03/26/mistral-releases-a-new-open-source-model-for-speech-generation/)

