By Sagar Shankaran, Founder of CallSphere
Mistral's Voxtral 4B-TTS (March 26, 2026) is open-weights, about 3 GB when quantized, and runs on a single 16 GB GPU. Here's the full local voice agent build using Voxtral Mini for STT and Voxtral TTS.
Key takeaways
TL;DR — Mistral shipped Voxtral TTS on March 26, 2026 — open-weights, 4B params, 9 languages, function-calling from voice. Pair it with Voxtral-Mini-3B for STT/understanding and you get an end-to-end open-source voice stack from one vendor.
A FastAPI voice agent: client streams audio over WebSocket, server runs Voxtral-Mini-3B for STT + intent extraction, calls a Llama-style chat model, then synthesizes the reply with Voxtral-4B-TTS-2603. Total VRAM: ~16 GB.
pip install transformers torch fastapi uvicorn websockets soundfile ollama bitsandbytes
Access to the mistralai gated repos.
huggingface-cli login, then huggingface-cli download mistralai/Voxtral-Mini-3B-2507 and mistralai/Voxtral-4B-TTS-2603.
flowchart LR
CL[Client WSS] --> API[FastAPI Bridge]
API --> VOXSTT[Voxtral-Mini-3B STT+Intent]
VOXSTT -->|tool call| LLM[Llama 3.1 8B]
LLM --> VOXTTS[Voxtral-4B-TTS-2603]
VOXTTS --> CL
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
proc_stt = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
stt = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507", torch_dtype=torch.bfloat16).to(device)

def transcribe(wav_path):
    # Build a transcription-only request. The method name is kept as spelled
    # in the original (`apply_transcrition_request`); newer transformers
    # releases may expose a corrected spelling, so check your installed version.
    inputs = proc_stt.apply_transcrition_request(
        language="en", audio=wav_path,
        model_id="mistralai/Voxtral-Mini-3B-2507")
    inputs = inputs.to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, not the prompt.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
Voxtral's killer feature is function calling from voice — the model can emit tool-call JSON directly from audio without a separate LLM hop, when prompted accordingly.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

proc_tts = AutoProcessor.from_pretrained("mistralai/Voxtral-4B-TTS-2603")
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", torch_dtype=torch.bfloat16).to(device)

def synthesize(text, voice_ref="voices/amy_15s.wav"):
    # voice_ref is a short reference clip of the voice to clone.
    inputs = proc_tts(text=text, audio=voice_ref, return_tensors="pt").to(device)
    audio = tts.generate(**inputs, max_new_tokens=2048)
    return proc_tts.decode_audio(audio)
```
Voxtral TTS supports zero-shot cross-lingual cloning: pass an English reference and synthesize French in the same voice.
```python
from transformers import BitsAndBytesConfig

# Load the TTS model with 4-bit weights; compute still runs in bfloat16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tts = AutoModelForCausalLM.from_pretrained(
    "mistralai/Voxtral-4B-TTS-2603", quantization_config=bnb)
```
Quantized footprint drops to ~3 GB; quality loss is small for English.
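As a rough sanity check on that figure, the weight memory of a model can be estimated from its parameter count and bits per weight. The sketch below is back-of-envelope arithmetic only: the helper name `weight_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and anything beyond raw weights (quantization constants, layers left unquantized, activations) is an assumption not measured here.

```python
def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Estimate raw weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 4e9  # 4B parameters, per the article

bf16_gb = weight_gb(PARAMS, 16)  # unquantized bfloat16 weights
int4_gb = weight_gb(PARAMS, 4)   # 4-bit quantized weights

print(f"bf16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

The 4-bit estimate covers weights alone, so a reported footprint of around 3 GB would leave roughly 1 GB for everything else.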
```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import soundfile as sf, tempfile, ollama

app = FastAPI()

@app.websocket("/voxtral")
async def voxtral_ws(ws: WebSocket):
    await ws.accept()
    history = [{"role": "system", "content": "Be concise."}]
    try:
        while True:
            data = await ws.receive_bytes()
            # _bytes_to_pcm / _pcm_to_bytes are helper functions not shown here.
            f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
            sf.write(f, _bytes_to_pcm(data), 16000)
            text = transcribe(f)
            history.append({"role": "user", "content": text})
            r = ollama.chat(model="llama3.1:8b", messages=history)
            reply = r["message"]["content"]
            history.append({"role": "assistant", "content": reply})
            audio = synthesize(reply)
            await ws.send_bytes(_pcm_to_bytes(audio))
    except WebSocketDisconnect:
        # Client hung up; end the session quietly.
        pass
```
```python
import json

TOOLS = [{"name": "book_demo", "description": "Book a sales demo",
          "parameters": {"date": "string", "email": "string"}}]

def voice_to_tool(wav_path):
    # Serialize the tool list as JSON so the prompt matches the requested output format.
    inputs = proc_stt.apply_chat_template([
        {"role": "system",
         "content": f"Tools: {json.dumps(TOOLS)}. Output JSON tool call only."},
        {"role": "user", "content": [{"type": "audio", "path": wav_path}]}],
        return_tensors="pt").to(device, dtype=torch.bfloat16)
    out = stt.generate(**inputs, max_new_tokens=180)
    # Decode only the generated tokens so the prompt is not echoed back.
    return proc_stt.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
```
This skips the STT→LLM hop entirely for routine intents and cuts latency by 200–400 ms.
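Before acting on a tool call produced from audio, it is worth validating the model's text output. The sketch below assumes the model emits an object shaped like `{"name": ..., "arguments": {...}}`. That shape and the helper `parse_tool_call` are illustrative assumptions, not a documented Voxtral output format.

```python
import json

TOOLS = [{"name": "book_demo", "description": "Book a sales demo",
          "parameters": {"date": "string", "email": "string"}}]

def parse_tool_call(raw: str, tools=TOOLS):
    """Return (name, args) for a valid call to a known tool, else None."""
    # Tolerate chatter around the JSON by taking the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    args = call.get("arguments")
    if spec is None or not isinstance(args, dict):
        return None
    # Require exactly the declared parameters: no missing or extra keys.
    if set(args) != set(spec["parameters"]):
        return None
    return call["name"], args
```

When the function returns `None`, the agent can fall back to the regular STT-then-LLM path instead of executing a malformed call.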
```python
synthesize("Bonjour, comment puis-je vous aider ?", voice_ref="voices/amy_en.wav")
```
CallSphere uses cloud TTS for live calls (latency, reliability) but evaluates Voxtral for self-serve and on-prem deployments. Our 37 agents across 6 verticals (Healthcare 14 tools / FastAPI :8084 / OpenAI Realtime, OneRoof 10 specialists, plus Salon, Dental, F&B, Behavioral) sit behind 90+ tools and 115+ Postgres tables. Pricing flat $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.
License? Apache 2.0 — full commercial use.
Languages? 9 (EN/FR/DE/ES/NL/PT/IT/HI/AR).
Latency target? ~600 ms on a 4090 with quantization.
Voice cloning ethics? Always get consent; Mistral provides a content-provenance watermark by default.
Realtime barge-in? Not built-in — wrap with WebRTC + a VAD like Silero.
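To illustrate the barge-in idea, here is a minimal sketch that flags an interruption once several consecutive microphone frames contain speech-like energy. It uses a plain RMS energy threshold as a stand-in for a model-based VAD such as Silero. The class name, threshold, and frame count are hypothetical values chosen for illustration, and frames are assumed to be 16-bit little-endian mono PCM.

```python
import math
import struct

def frame_rms(frame: bytes) -> float:
    """Normalized RMS energy (0 to 1) of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[:2 * n])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0

class BargeInDetector:
    """Signals barge-in after `min_frames` consecutive loud frames."""

    def __init__(self, threshold: float = 0.02, min_frames: int = 3):
        self.threshold = threshold
        self.min_frames = min_frames
        self._run = 0  # current streak of loud frames

    def update(self, frame: bytes) -> bool:
        if frame_rms(frame) >= self.threshold:
            self._run += 1
        else:
            self._run = 0  # any quiet frame resets the streak
        return self._run >= self.min_frames
```

In a server loop, `update` would be called on each incoming microphone frame while TTS audio is playing; when it returns `True`, playback stops and the agent starts listening again.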
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.