---
title: "Build a Voice Agent with Coqui TTS XTTS-v2 (Voice Cloning, Local)"
description: "XTTS-v2 clones a voice from 6 seconds of audio and speaks 17 languages. Here's how to wire it into a real voice agent with faster-whisper STT and a local LLM — no API keys."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-coqui-tts-xtts-v2
category: "AI Voice Agents"
tags: ["Coqui TTS", "XTTS-v2", "Voice Cloning", "Local AI", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-24T00:00:00.000Z
updated: 2026-05-07T16:13:44.930Z
---

# Build a Voice Agent with Coqui TTS XTTS-v2 (Voice Cloning, Local)

> XTTS-v2 clones a voice from 6 seconds of audio and speaks 17 languages. Here's how to wire it into a real voice agent with faster-whisper STT and a local LLM — no API keys.

> **TL;DR** — XTTS-v2 is the open voice-cloning model worth running. The original Coqui org wound down, but `coqui-tts` (a community fork) is on 0.28 with prebuilt wheels for macOS and Windows. Six seconds of clean audio gives you a usable clone.

## What you'll build

A voice agent that answers in *your* voice. Mic in → faster-whisper → Ollama → XTTS-v2 (cloning your reference clip) → speaker out. Useful for accessibility, language tutoring, and on-brand IVR demos.

## Prerequisites

1. Python 3.11 (XTTS pinned wheels do not yet build cleanly on 3.13).
2. `pip install coqui-tts faster-whisper sounddevice numpy ollama`.
3. NVIDIA GPU with 6 GB+ VRAM strongly recommended (on CPU, synthesis runs far slower than realtime, roughly 12x the audio duration).
4. A 6–15 second WAV of the voice you want to clone (clean, mono, 22050 Hz, no music); a prep snippet follows this list.
5. Ollama running with a small model (`ollama pull llama3.2:3b`).
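
If the clip you recorded isn't already clean mono 22050 Hz, resample it first; a minimal prep sketch using torchaudio (installed in Step 1), where `raw_clip.wav` stands in for whatever you recorded:

```python
# Hypothetical prep script: convert any recording into the reference format
# XTTS expects (mono, 22050 Hz WAV). "raw_clip.wav" is a placeholder name.
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("raw_clip.wav")            # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                  # downmix to mono
wav = F.resample(wav, orig_freq=sr, new_freq=22050)  # resample to 22050 Hz
torchaudio.save("my_voice_6s.wav", wav, 22050)
print(f"saved {wav.shape[1] / 22050:.1f}s reference clip")
```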

## Architecture

```mermaid
flowchart LR
  MIC[Microphone] --> STT[faster-whisper]
  STT --> LLM[Ollama llama3.2:3b]
  LLM --> XTTS[XTTS-v2 + speaker.wav]
  XTTS --> SPK[Speaker]
```

## Step 1 — Install the maintained fork

```bash
python3.11 -m venv .venv && source .venv/bin/activate
pip install -U coqui-tts torch torchaudio
```

Avoid the abandoned `TTS` package on PyPI — it pins old `transformers` and `numpy` versions that conflict with everything in 2026.
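
A quick sanity check that the right package landed (the fork still ships the `TTS` module namespace, so the imports in the rest of this post are unchanged):

```python
# Confirm the coqui-tts fork imported cleanly and the GPU is visible.
import TTS
import torch

print("coqui-tts version:", TTS.__version__)
print("CUDA available:", torch.cuda.is_available())
```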

## Step 2 — Verify the clone with a 30-second test

```python
import torch
from TTS.api import TTS  # the coqui-tts fork still installs the `TTS` namespace

device = "cuda" if torch.cuda.is_available() else "cpu"
# First run downloads the XTTS-v2 weights (~2 GB) to the local model cache.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="Hello, this is the cloned voice running fully locally.",
    speaker_wav="my_voice_6s.wav",  # your 6-15 s reference clip
    language="en",
    file_path="clone_test.wav")
```

If `clone_test.wav` sounds recognisable as you (not a generic narrator), the clone took.

## Step 3 — Cache the speaker embedding (critical for latency)

XTTS computes a speaker embedding on every call by default. Pre-compute and reuse it:

```python
# Returns (gpt_cond_latent, speaker_embedding) as torch tensors; reuse them
# for every utterance instead of re-reading the reference clip each call.
gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav"])
```

Now each subsequent call drops the conditioning step (~1.2 s saved per utterance).
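
To prove the cached latents work, drive the model directly with them instead of going through `tts_to_file`. This mirrors the XTTS low-level API; `torchaudio` is only used to write the result:

```python
import torch
import torchaudio

# Drive XTTS with the cached latents; no reconditioning happens on this call.
out = tts.synthesizer.tts_model.inference(
    "Same voice, but the conditioning step is already paid for.",
    "en", gpt_cond, speaker_emb)
torchaudio.save("cached_clone.wav",
                torch.tensor(out["wav"]).unsqueeze(0), 24000)
```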

## Step 4 — Stream synthesis with `inference_stream`

```python
import sounddevice as sd, numpy as np

def speak(text):
    # inference_stream yields 24 kHz mono float tensors as audio is generated
    chunks = tts.synthesizer.tts_model.inference_stream(
        text, "en", gpt_cond, speaker_emb,
        stream_chunk_size=20)  # smaller = lower TTFB
    with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as out:
        for chunk in chunks:
            out.write(chunk.cpu().numpy().astype(np.float32))
```

`stream_chunk_size=20` gives ~250 ms time-to-first-audio on an RTX 4090.
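
To check that figure on your own hardware, time the gap to the first yielded chunk; a rough measurement sketch reusing the cached latents:

```python
import time

# Rough time-to-first-audio measurement for the streaming path.
start = time.perf_counter()
stream = tts.synthesizer.tts_model.inference_stream(
    "Measuring time to first audio.", "en", gpt_cond, speaker_emb,
    stream_chunk_size=20)
first = next(iter(stream))
print(f"TTFB: {(time.perf_counter() - start) * 1000:.0f} ms, "
      f"first chunk: {first.shape[-1]} samples")
```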

## Step 5 — STT + LLM glue

```python
from faster_whisper import WhisperModel
import ollama
stt = WhisperModel("small.en", device="cuda", compute_type="float16")
history = [{"role": "system", "content": "You are a friendly, brief voice assistant."}]

def turn(audio_int16):
    # faster-whisper accepts a float32 waveform in [-1, 1] at 16 kHz
    audio = audio_int16.astype(np.float32) / 32768
    segs, _ = stt.transcribe(audio, language="en", vad_filter=True)
    user = " ".join(s.text for s in segs).strip()
    if not user:
        return  # silence or noise only; skip the turn
    history.append({"role": "user", "content": user})
    r = ollama.chat(model="llama3.2:3b", messages=history,
                    options={"num_predict": 140})  # cap reply length for voice
    history.append(r["message"])
    speak(r["message"]["content"])
```
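
An optional refinement, not required for the loop to work: stream tokens from Ollama and hand each completed sentence to `speak()` as soon as it appears, so long replies start playing earlier. A sketch assuming sentence punctuation is a good enough split point:

```python
import re

def turn_streaming(audio_int16):
    # Variant of turn() that speaks each sentence as soon as the LLM finishes it.
    audio = audio_int16.astype(np.float32) / 32768
    segs, _ = stt.transcribe(audio, language="en", vad_filter=True)
    user = " ".join(s.text for s in segs).strip()
    if not user:
        return
    history.append({"role": "user", "content": user})
    reply, buffer = "", ""
    for part in ollama.chat(model="llama3.2:3b", messages=history,
                            options={"num_predict": 140}, stream=True):
        buffer += part["message"]["content"]
        # Speak every complete sentence as soon as it is available.
        while (m := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:m.end()], buffer[m.end():]
            reply += sentence
            speak(sentence.strip())
    if buffer.strip():
        reply += buffer
        speak(buffer.strip())
    history.append({"role": "assistant", "content": reply})
```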

## Step 6 — Mic loop with VAD

```python
def record(threshold=0.012, max_s=8):
    """Record one utterance: stop after ~0.6 s of silence once speech has started."""
    frames, silent, heard_speech = [], 0, False
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)  # 100 ms blocks
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            if rms >= threshold:
                heard_speech, silent = True, 0
            elif heard_speech:
                silent += 1600
                if silent >= 9000:  # ~0.6 s of trailing silence ends the turn
                    break
    return np.concatenate(frames).flatten()

while True:
    turn(record())
```

## Common pitfalls

- **CPU is too slow.** XTTS runs at roughly 1x realtime on an RTX 4090 and ~0.25x realtime on an M2 Max; a live agent needs a discrete GPU.
- **License.** XTTS-v2 weights are CPML (non-commercial). Use ElevenLabs or Voxtral TTS for commercial production.
- **Speaker embedding drift.** Cache and reuse — recomputing per turn destroys latency; a save/load sketch follows below.
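
For that last point, persisting the latents to disk means restarts don't pay the conditioning cost either; a small sketch where `speaker_latents.pt` is an arbitrary cache filename:

```python
import os
import torch

LATENT_CACHE = "speaker_latents.pt"  # arbitrary cache filename

if os.path.exists(LATENT_CACHE):
    # Reload the latents computed on a previous run.
    gpt_cond, speaker_emb = torch.load(LATENT_CACHE, map_location=device)
else:
    gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
        audio_path=["my_voice_6s.wav"])
    torch.save((gpt_cond, speaker_emb), LATENT_CACHE)
```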

## How CallSphere does this in production

CallSphere's 37 agents across 6 verticals use commercial voice models (ElevenLabs, OpenAI) for production calls because XTTS's licence excludes commercial use. We use XTTS for internal demo personas and offline UX research only. Healthcare's 14-tool FastAPI :8084 stack uses OpenAI Realtime; OneRoof's 10 specialists use ElevenLabs over WebRTC. Pricing $149/$499/$1499 flat — [14-day trial](/trial) · [22% affiliate](/affiliate) · [/pricing](/pricing).

## FAQ

**Is XTTS-v2 commercially usable?** No — Coqui Public Model License is non-commercial. For paid SaaS, switch to Voxtral or ElevenLabs.

**How much reference audio do I need?** 6 seconds works; 15+ is better.

**Can it do emotion?** Limited — it tracks the reference's tone. For real emotion control, use prompt-driven prosody.

**Languages?** 17 (EN/ES/FR/DE/IT/PT/PL/TR/RU/NL/CS/AR/ZH/JA/HU/KO/HI).

**Streaming TTS?** Yes — `inference_stream` since 0.22.

## Sources

- [coqui-ai/TTS on GitHub](https://github.com/coqui-ai/TTS)
- [coqui-tts on PyPI](https://pypi.org/project/coqui-tts/)
- [XTTS-v2 model card](https://huggingface.co/coqui/XTTS-v2)
- [coqui-tts docs](https://coqui-tts.readthedocs.io/)

