
Build a Voice Agent with Coqui TTS XTTS-v2 (Voice Cloning, Local)

XTTS-v2 clones a voice from 6 seconds of audio and speaks 17 languages. Here's how to wire it into a real voice agent with faster-whisper STT and a local LLM — no API keys.

TL;DR — XTTS-v2 is the open voice-cloning model worth running. The original Coqui org wound down, but coqui-tts (a community fork) is on 0.28 with prebuilt wheels for macOS and Windows. Six seconds of clean audio gives you a usable clone.

What you'll build

A voice agent that answers in your voice. Mic in → faster-whisper → Ollama → XTTS-v2 (cloning your reference clip) → speaker out. Useful for accessibility, language tutoring, and on-brand IVR demos.

Prerequisites

  1. Python 3.11 (XTTS pinned wheels do not yet build cleanly on 3.13).
  2. pip install coqui-tts faster-whisper sounddevice numpy ollama.
  3. NVIDIA GPU with 6 GB+ VRAM strongly recommended (on CPU, XTTS runs roughly 12x slower than real time).
  4. A 6–15 second WAV of the voice you want to clone (clean, mono, 22050 Hz, no music); if your clip isn't in that format, see the conversion sketch after this list.
  5. Ollama running with a small model (ollama pull llama3.2:3b).
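
If your recording isn't already mono at 22050 Hz, here's a minimal conversion sketch using torchaudio (installed in Step 1 below); the input filename is just a placeholder:

```python
# Minimal sketch: convert an arbitrary recording into the clean, mono,
# 22050 Hz WAV that XTTS expects. "raw_recording.wav" is a placeholder.
import torchaudio

wav, sr = torchaudio.load("raw_recording.wav")
wav = wav.mean(dim=0, keepdim=True)                    # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 22050)   # resample to 22050 Hz
torchaudio.save("my_voice_6s.wav", wav, 22050)
```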

Architecture

```mermaid
flowchart LR
  MIC[Microphone] --> STT[faster-whisper]
  STT --> LLM[Ollama llama3.2:3b]
  LLM --> XTTS[XTTS-v2 + speaker.wav]
  XTTS --> SPK[Speaker]
```

Step 1 — Install the maintained fork

```bash
python3.11 -m venv .venv && source .venv/bin/activate
pip install -U coqui-tts torch torchaudio
```

Avoid the abandoned TTS package on PyPI — it pins old transformers and numpy versions that conflict with everything in 2026.
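
Since the fork installs under the same TTS import name as the old package, a quick sanity check against the installed distribution name is more reliable than importing; a sketch using only the standard library:

```python
# Confirm the maintained coqui-tts fork is installed,
# not the abandoned "TTS" distribution (both expose the `TTS` module).
from importlib.metadata import version, PackageNotFoundError

try:
    print("coqui-tts", version("coqui-tts"))
except PackageNotFoundError:
    print("coqui-tts not found; you may have the abandoned 'TTS' package instead")
```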

Step 2 — Verify the clone with a 30-second test

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello, this is the cloned voice running fully locally.",
    speaker_wav="my_voice_6s.wav",
    language="en",
    file_path="clone_test.wav")
```

If clone_test.wav sounds recognisable as you (not a generic narrator), the clone took.
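
If you'd rather hear it straight from the same script, a small playback sketch (sounddevice and torchaudio are both installed by now):

```python
# Play the test file from the script instead of opening it in a media player.
import sounddevice as sd
import torchaudio

wav, sr = torchaudio.load("clone_test.wav")
sd.play(wav.squeeze().numpy(), sr)
sd.wait()
```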

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Cache the speaker embedding (critical for latency)

XTTS computes a speaker embedding on every call by default. Pre-compute and reuse it:

```python
gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav"])
```

Now each subsequent call drops the conditioning step (~1.2 s saved per utterance).
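
The latents are ordinary torch tensors, so you can also cache them to disk and skip conditioning across restarts; a minimal sketch, with an arbitrary filename:

```python
import torch

# Persist the conditioning latents so a process restart doesn't
# pay the ~1.2 s conditioning cost again.
torch.save({"gpt_cond": gpt_cond, "speaker_emb": speaker_emb}, "speaker_latents.pt")

# On startup:
cached = torch.load("speaker_latents.pt")
gpt_cond, speaker_emb = cached["gpt_cond"], cached["speaker_emb"]
```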

Step 4 — Stream synthesis with inference_stream

```python
import sounddevice as sd
import numpy as np

def speak(text):
    chunks = tts.synthesizer.tts_model.inference_stream(
        text, "en", gpt_cond, speaker_emb,
        stream_chunk_size=20)  # smaller = lower TTFB
    with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as out:
        for chunk in chunks:
            out.write(chunk.cpu().numpy().astype(np.float32))
```

stream_chunk_size=20 gives ~250 ms time-to-first-audio on an RTX 4090.
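
To measure time-to-first-audio on your own hardware, a quick timing sketch that reuses the objects above:

```python
import time

# Time-to-first-audio: how long until the first streamed chunk arrives.
t0 = time.perf_counter()
chunks = tts.synthesizer.tts_model.inference_stream(
    "Quick latency check.", "en", gpt_cond, speaker_emb, stream_chunk_size=20)
first_chunk = next(iter(chunks))
print(f"time to first audio chunk: {time.perf_counter() - t0:.3f} s")
```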

Step 5 — STT + LLM glue

```python
from faster_whisper import WhisperModel
import ollama

stt = WhisperModel("small.en", device="cuda", compute_type="float16")
history = [{"role": "system", "content": "You are a friendly, brief voice assistant."}]

def turn(audio_int16):
    audio = audio_int16.astype(np.float32) / 32768
    segs, _ = stt.transcribe(audio, language="en", vad_filter=True)
    user = " ".join(s.text for s in segs).strip()
    if not user:
        return
    history.append({"role": "user", "content": user})
    r = ollama.chat(model="llama3.2:3b", messages=history,
                    options={"num_predict": 140})
    history.append(r["message"])
    speak(r["message"]["content"])
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Mic loop with VAD

```python
def record(threshold=0.012, max_s=8):
    # Energy-based VAD: stop after ~0.56 s of sustained silence (9000 samples
    # at 16 kHz) or after max_s seconds of audio, whichever comes first.
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

while True:
    turn(record())
```

Common pitfalls

  • CPU is too slow. XTTS runs at roughly 1x real time on an RTX 4090 and around 0.25x real time on an M2 Max, so a discrete NVIDIA GPU is effectively required for live agents. You can benchmark your own machine with the sketch after this list.
  • License. XTTS-v2 weights are CPML (non-commercial). Use ElevenLabs or Voxtral TTS for commercial production.
  • Speaker embedding drift. Cache and reuse — recomputing per turn destroys latency.
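
Here's a rough way to check where your hardware lands before committing to a live agent; a sketch that reuses the tts object from Step 2 and assumes XTTS-v2's 24 kHz output:

```python
import time

# Real-time factor check: synthesis time divided by audio duration.
# Under 1.0 means faster than real time; well above 1.0 means too slow for live use.
t0 = time.perf_counter()
wav = tts.tts(
    text="Measuring how fast this machine can synthesise speech.",
    speaker_wav="my_voice_6s.wav",
    language="en",
)
elapsed = time.perf_counter() - t0
rtf = elapsed / (len(wav) / 24000)  # XTTS-v2 outputs 24000 Hz audio
print(f"real-time factor: {rtf:.2f}")
```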

How CallSphere does this in production

CallSphere's 37 agents across 6 verticals use commercial voice models (ElevenLabs, OpenAI) for production calls because XTTS's licence excludes commercial use. We use XTTS for internal demo personas and offline UX research only. Healthcare's 14-tool FastAPI :8084 stack uses OpenAI Realtime; OneRoof's 10 specialists use ElevenLabs over WebRTC. Pricing $149/$499/$1499 flat — 14-day trial · 22% affiliate · /pricing.

FAQ

Is XTTS-v2 commercially usable? No — Coqui Public Model License is non-commercial. For paid SaaS, switch to Voxtral or ElevenLabs.

How much reference audio do I need? 6 seconds works; 15+ is better.
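
get_conditioning_latents takes a list of clips, so one way to use more audio is to pass several references; the extra filename below is a placeholder:

```python
# Feed several reference clips; more clean audio generally tightens the clone.
# "my_voice_extra_10s.wav" stands in for any additional clip you have.
gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav", "my_voice_extra_10s.wav"])
```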

Can it do emotion? Limited — it tracks the reference's tone. For real emotion control, use prompt-driven prosody.

Languages? 17 (EN/ES/FR/DE/IT/PT/PL/TR/RU/NL/CS/AR/ZH/JA/HU/KO/HI).
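
Output language is just the language argument; a minimal sketch reusing the same reference clip, with Spanish as the example:

```python
# Same cloned voice, different output language.
tts.tts_to_file(
    text="Hola, esta es la voz clonada hablando en español.",
    speaker_wav="my_voice_6s.wav",
    language="es",
    file_path="clone_es.wav")
```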

Streaming TTS? Yes — inference_stream since 0.22.



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.