By Sagar Shankaran, Founder of CallSphere
XTTS-v2 clones a voice from 6 seconds of audio and speaks 17 languages. Here's how to wire it into a real voice agent with faster-whisper STT and a local LLM — no API keys.
Key takeaways
TL;DR: XTTS-v2 is the open voice-cloning model worth running. The original Coqui org wound down, but `coqui-tts` (a community fork) is on 0.28 with prebuilt wheels for macOS and Windows. Six seconds of clean audio gives you a usable clone.
A voice agent that answers in your voice. Mic in → faster-whisper → Ollama → XTTS-v2 (cloning your reference clip) → speaker out. Useful for accessibility, language tutoring, and on-brand IVR demos.
Prerequisites: `pip install coqui-tts faster-whisper sounddevice numpy ollama`, plus a locally pulled Ollama model (`ollama pull llama3.2:3b`).
flowchart LR
MIC[Microphone] --> STT[faster-whisper]
STT --> LLM[Ollama llama3.2:3b]
LLM --> XTTS[XTTS-v2 + speaker.wav]
XTTS --> SPK[Speaker]
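The flow above can be sketched as one function that takes each stage as a plain callable. This is a minimal illustration, not the article's implementation: `run_turn` and its parameter names are hypothetical, and the lambdas stand in for faster-whisper, Ollama, and XTTS-v2 so the sketch runs without any models installed.

```python
from typing import Callable, Optional


def run_turn(
    audio,
    transcribe: Callable[[object], str],
    generate: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one mic -> STT -> LLM -> TTS turn; return the reply, or None on silence."""
    user_text = transcribe(audio).strip()
    if not user_text:  # nothing intelligible was said, so skip the LLM and TTS
        return None
    reply = generate(user_text)
    speak(reply)
    return reply


# Stand-in stages (hypothetical) so the wiring can be checked in isolation.
spoken = []
reply = run_turn(
    audio=b"fake-pcm",
    transcribe=lambda a: " hello there ",
    generate=lambda t: f"You said: {t}",
    speak=spoken.append,
)
```

Keeping the stages injectable like this makes it possible to swap a real model in one stage at a time while the others stay stubbed.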
```bash
python3.11 -m venv .venv && source .venv/bin/activate
pip install -U coqui-tts torch torchaudio
```
Avoid the abandoned TTS package on PyPI — it pins old transformers and numpy versions that conflict with everything in 2026.
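Both packages expose the same `TTS` import name, so it is easy to lose track of which one an environment has. A small check with the standard library's `importlib.metadata` can report the installed distribution. The helper name `installed_tts_distribution` is hypothetical; the distribution names `coqui-tts` and `TTS` are the PyPI names mentioned above.

```python
from importlib import metadata


def installed_tts_distribution(candidates=("coqui-tts", "TTS")):
    """Return (name, version) for the first installed candidate, else None."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


found = installed_tts_distribution()
if found is None:
    print("No TTS distribution installed")
elif found[0] == "TTS":
    print(f"Legacy TTS {found[1]} found; consider replacing it with coqui-tts")
else:
    print(f"coqui-tts {found[1]} found")
```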
```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello, this is the cloned voice running fully locally.",
    speaker_wav="my_voice_6s.wav",
    language="en",
    file_path="clone_test.wav")
```
If clone_test.wav sounds recognisable as you (not a generic narrator), the clone took.
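If the clone sounds generic, one thing worth ruling out is a reference clip that is shorter than it seems. The sketch below uses only the standard library's `wave` module to measure a clip and classify it against the 6-second and 15-second figures quoted in this article. The function names and the three labels are illustrative assumptions, and `wave` only reads uncompressed PCM WAV files.

```python
import wave


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path, minimum_s=6.0, preferred_s=15.0):
    """Classify a reference clip by length: 'too_short', 'usable', or 'good'."""
    seconds = clip_seconds(path)
    if seconds < minimum_s:
        return "too_short"
    return "good" if seconds >= preferred_s else "usable"


# Example call (assumes the clip from the step above exists on disk):
# check_reference("my_voice_6s.wav")
```

Length is only one factor; background noise and clipping in the reference are not detected by this check.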
XTTS computes a speaker embedding on every call by default. Pre-compute and reuse it:
```python
gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav"])
```
Now each subsequent call drops the conditioning step (~1.2 s saved per utterance).
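If an agent switches between several reference voices, the same idea extends to a small in-memory cache keyed by file path. The sketch below is an assumption-laden illustration: `LatentCache` is a hypothetical helper, and the `compute` callable is whatever you pass in (for example a wrapper around `get_conditioning_latents`). It recomputes only when the file's modification time changes.

```python
import os


class LatentCache:
    """Memoize per-file conditioning results, recomputing if the file changes."""

    def __init__(self, compute):
        self._compute = compute  # callable taking a file path, returning latents
        self._store = {}         # absolute path -> (mtime, result)

    def get(self, path):
        key = os.path.abspath(path)
        mtime = os.path.getmtime(key)
        cached = self._store.get(key)
        if cached is None or cached[0] != mtime:
            cached = (mtime, self._compute(key))
            self._store[key] = cached
        return cached[1]


# Possible wiring (assumes the `tts` object from the earlier step):
# cache = LatentCache(
#     lambda p: tts.synthesizer.tts_model.get_conditioning_latents(audio_path=[p]))
# gpt_cond, speaker_emb = cache.get("my_voice_6s.wav")
```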
Stream audio with `inference_stream`:

```python
import sounddevice as sd
import numpy as np

def speak(text):
    chunks = tts.synthesizer.tts_model.inference_stream(
        text, "en", gpt_cond, speaker_emb,
        stream_chunk_size=20)  # smaller = lower TTFB
    # The output stream is opened at 24 kHz mono to match the model's audio
    with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as out:
        for chunk in chunks:
            # Each chunk is a torch tensor; move it to CPU before playback
            out.write(chunk.cpu().numpy().astype(np.float32))
```
stream_chunk_size=20 gives ~250 ms time-to-first-audio on an RTX 4090.
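To compare `stream_chunk_size` settings on your own hardware, you can time how long the first chunk takes to arrive. The helper below is a generic sketch that works on any iterator; `time_to_first_chunk` and `fake_stream` are hypothetical names, and the fake generator stands in for `inference_stream`. It measures time until the first chunk is produced, not until sound leaves the speaker, so device buffering is excluded.

```python
import time


def time_to_first_chunk(chunks):
    """Return (seconds until the first item, the first item, the remaining iterator)."""
    iterator = iter(chunks)
    start = time.perf_counter()
    first = next(iterator)  # generators do their work lazily, so this is the wait
    return time.perf_counter() - start, first, iterator


def fake_stream():
    """Stand-in for a streaming TTS generator."""
    time.sleep(0.05)  # simulate model latency before the first audio chunk
    yield "chunk-0"
    yield "chunk-1"


ttfa, first, rest = time_to_first_chunk(fake_stream())
print(f"time to first chunk: {ttfa * 1000:.0f} ms")
```

The returned iterator still holds the remaining chunks, so playback can continue after the measurement.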
```python
import numpy as np
import ollama
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")
history = [{"role": "system", "content": "You are a friendly, brief voice assistant."}]

def turn(audio_int16):
    # Convert int16 PCM to float32 in [-1, 1] before transcription
    audio = audio_int16.astype(np.float32) / 32768
    segs, _ = stt.transcribe(audio, language="en", vad_filter=True)
    user = " ".join(s.text for s in segs).strip()
    if not user:  # nothing was said, so skip the LLM call
        return
    history.append({"role": "user", "content": user})
    r = ollama.chat(model="llama3.2:3b", messages=history,
                    options={"num_predict": 140})
    history.append(r["message"])
    speak(r["message"]["content"])
```
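The `history` list in `turn` grows with every exchange, which in a long session can slow down a small local model. One option, sketched here as an assumption rather than part of the original loop, is to keep the system prompt plus only the most recent turns. `trim_history` and `max_turns` are hypothetical names.

```python
def trim_history(history, max_turns=6):
    """Keep system messages plus the last `max_turns` user/assistant pairs."""
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + dialogue[-2 * max_turns:]


# Example: a system prompt followed by ten user/assistant exchanges.
example = [{"role": "system", "content": "You are a friendly, brief voice assistant."}]
for i in range(10):
    example.append({"role": "user", "content": f"question {i}"})
    example.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(example, max_turns=2)
```

Calling something like `history[:] = trim_history(history)` before each `ollama.chat` call would apply it in place.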
```python
def record(threshold=0.012, max_s=8):
    """Record 16 kHz mono audio until a short silence or max_s seconds."""
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        # Stop after 9000 quiet samples in a row, or once max_s is reached
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)  # 100 ms per read
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            # Any loud block resets the silence counter
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

while True:
    turn(record())
```
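The constants in `record` are easier to tune once the arithmetic is spelled out. The sketch below restates it with the standard library only: because the counter moves in 1600-sample steps, the `silent < 9000` condition actually ends the recording after six quiet blocks, which is 600 ms. The helper names are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # matches the InputStream in record()
BLOCK = 1600         # samples per read, i.e. 100 ms


def rms(samples):
    """Root-mean-square level of float samples in [-1, 1]."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def silence_ms(silent_samples, sample_rate=SAMPLE_RATE):
    """Convert a count of silent samples to milliseconds."""
    return 1000 * silent_samples / sample_rate


def blocks_until_stop(limit=9000, block=BLOCK):
    """Consecutive quiet blocks needed before `silent < limit` becomes false."""
    return math.ceil(limit / block)


quiet_blocks = blocks_until_stop()               # 6
effective_ms = silence_ms(quiet_blocks * BLOCK)  # 600.0
```

A noisy room may need a higher `threshold` than 0.012; printing `rms` for a few seconds of background noise is a quick way to pick one.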
CallSphere's 37 agents across 6 verticals use commercial voice models (ElevenLabs, OpenAI) for production calls because XTTS's licence excludes commercial use. We use XTTS only for internal demo personas and offline UX research. The healthcare stack (14 tools, FastAPI on port 8084) uses OpenAI Realtime; OneRoof's 10 specialists use ElevenLabs over WebRTC. Pricing is flat at $149/$499/$1499, with a 14-day trial and a 22% affiliate rate (see /pricing).
Is XTTS-v2 commercially usable? No — Coqui Public Model License is non-commercial. For paid SaaS, switch to Voxtral or ElevenLabs.
How much reference audio do I need? 6 seconds works; 15+ is better.
Can it do emotion? Limited — it tracks the reference's tone. For real emotion control, use prompt-driven prosody.
Languages? 17 (EN/ES/FR/DE/IT/PT/PL/TR/RU/NL/CS/AR/ZH/JA/HU/KO/HI).
Streaming TTS? Yes — inference_stream since 0.22.
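Since the `language` argument is a short code, a small validator can catch typos before a synthesis call. The set below mirrors the 17 languages listed above. One assumption to verify against your installed version: Chinese is written here as `zh-cn`, which is how XTTS-v2 is commonly documented, rather than plain `zh`. `normalize_language` is a hypothetical helper.

```python
# Assumed XTTS-v2 language codes; check against your installed model's config.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def normalize_language(code):
    """Lower-case a language code, map 'zh' to 'zh-cn', and validate it."""
    code = code.strip().lower().replace("_", "-")
    if code == "zh":
        code = "zh-cn"
    if code not in XTTS_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code


lang = normalize_language("EN")
```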
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.