Build a Voice Agent with Piper TTS — Local, Free, and Fast (2026)
Piper 1.4.2 ships ONNX voices that synthesize on a Raspberry Pi 5 in real time. Here's a full Python voice agent with Piper, faster-whisper, and Ollama — no GPU required.
TL;DR — Piper is the fastest open neural TTS that still sounds human. Version 1.4.2 (April 2, 2026) added CUDA inference via ONNX Runtime and a unified `piper` CLI that streams raw PCM. Pair it with faster-whisper for STT and Ollama for the LLM and you have a CPU-friendly voice agent.
What you'll build
A Python service that listens for an utterance, transcribes it with faster-whisper (small.en), gets a reply from a local llama3.1:8b via Ollama, and speaks it through Piper using `en_US-amy-medium`. End-to-end latency comes in under 2 s on a Raspberry Pi 5 (8 GB).
Prerequisites
- Python 3.11+ with `pip install piper-tts faster-whisper sounddevice numpy ollama`.
- Ollama installed and running (`ollama serve`).
- Pull a model: `ollama pull llama3.1:8b`.
- Download a Piper voice: `python -m piper.download_voices en_US-amy-medium`.
Architecture
```mermaid
flowchart LR
  MIC[Microphone] --> FW[faster-whisper small.en]
  FW -->|text| OLL[Ollama HTTP :11434]
  OLL -->|text| PIPER[piper en_US-amy-medium]
  PIPER -->|PCM 22050| SPK[Speaker]
```
Step 1 — Verify Piper from the CLI
```bash
echo "Hello from a fully local voice agent." | \
  piper --model en_US-amy-medium --output-raw | \
  aplay -r 22050 -f S16_LE -t raw -
```
If you hear audio, you're done with the hard part.
Step 2 — Wrap Piper as a streaming Python class
```python
import subprocess

import numpy as np
import sounddevice as sd


class Piper:
    def __init__(self, model="en_US-amy-medium.onnx"):
        self.model = model
        self.sr = 22050

    def speak(self, text: str):
        proc = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        proc.stdin.write(text.encode())
        proc.stdin.close()
        # Play raw PCM chunks as Piper produces them.
        with sd.OutputStream(samplerate=self.sr, channels=1, dtype="int16") as out:
            while True:
                chunk = proc.stdout.read(4096)
                if not chunk:
                    break
                out.write(np.frombuffer(chunk, dtype=np.int16))
        proc.wait()
```
The trick is --output-raw plus a streaming OutputStream: you start hearing audio while later phonemes are still being synthesized.
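Rather than hardcoding `self.sr = 22050`, you can read the rate from the config shipped next to the voice. A minimal sketch, assuming Piper's usual layout of a `<voice>.onnx.json` file with an `audio.sample_rate` key (`voice_sample_rate` is a hypothetical helper, not part of piper-tts):

```python
import json
from pathlib import Path


def voice_sample_rate(model_path: str, default: int = 22050) -> int:
    """Read the sample rate from the JSON config next to the .onnx voice.

    Assumes the `<voice>.onnx.json` convention with an
    "audio": {"sample_rate": ...} section; falls back to `default`.
    """
    cfg = Path(model_path + ".json")
    if not cfg.exists():
        return default
    data = json.loads(cfg.read_text())
    return int(data.get("audio", {}).get("sample_rate", default))
```

You could call this once in `Piper.__init__` so switching to a 16 kHz or 24 kHz voice doesn't silently play at the wrong pitch.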
Step 3 — STT with faster-whisper
```python
import numpy as np
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cpu", compute_type="int8")


def transcribe(pcm_int16, sr=16000):
    # faster-whisper expects 16 kHz mono float32 in [-1, 1].
    audio = pcm_int16.astype(np.float32) / 32768.0
    segments, _ = stt.transcribe(audio, language="en", vad_filter=True)
    return " ".join(s.text.strip() for s in segments)
```
Step 4 — LLM via Ollama
```python
import ollama

SYSTEM = "You are Amy, a helpful voice assistant. Keep replies under 2 sentences."


def reply(history, user_text):
    history.append({"role": "user", "content": user_text})
    r = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "system", "content": SYSTEM}, *history],
        options={"temperature": 0.4, "num_predict": 160})
    msg = r["message"]["content"]
    history.append({"role": "assistant", "content": msg})
    return msg
```
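One thing `reply()` doesn't handle is unbounded history: every turn grows the prompt, and an 8B model on CPU slows down noticeably with long contexts. A minimal trimming sketch you could apply before calling `ollama.chat` (`trim_history` is a hypothetical helper, not part of the ollama client):

```python
def trim_history(history: list[dict], max_msgs: int = 8) -> list[dict]:
    """Keep only the most recent messages so the prompt stays small.

    The system prompt is passed separately in reply(), so it is never
    trimmed here. The window is nudged so it always starts on a user
    turn, keeping user/assistant pairs intact.
    """
    kept = history[-max_msgs:]
    while kept and kept[0]["role"] == "assistant":
        kept = kept[1:]
    return kept
```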
Step 5 — Mic capture loop with simple VAD
```python
def record(threshold=0.012, max_s=8):
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        # Read 100 ms chunks; stop after ~0.56 s of trailing silence
        # (9000 samples at 16 kHz) or max_s seconds total.
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()
```
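The silence test in `record()` is nothing more than RMS energy per 100 ms chunk. Pulled out as a pure function (a hypothetical helper using the same math as the loop above), it becomes easy to sanity-check with synthetic audio and to tune `threshold` for your microphone:

```python
import numpy as np


def is_silent(chunk_int16: np.ndarray, threshold: float = 0.012) -> bool:
    """Classify an int16 audio chunk as silence by normalized RMS energy."""
    x = chunk_int16.astype(np.float32) / 32768.0
    return float(np.sqrt(np.mean(x * x))) < threshold
```

A 440 Hz tone at half full-scale has an RMS around 0.35, far above the 0.012 default, while an all-zero buffer is trivially below it.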
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Tie it together
```python
piper, history = Piper(), []
piper.speak("Hi, I'm Amy. How can I help?")
while True:
    pcm = record()
    text = transcribe(pcm)
    if not text.strip():
        continue
    print("USER:", text)
    out = reply(history, text)
    print("BOT :", out)
    piper.speak(out)
```
Common pitfalls
- Wrong sample rate. Piper voices ship at 16, 22.05, or 24 kHz — read the `config.json` next to the `.onnx` file.
- Pi 5 thermal throttling. Add a fan or drop Whisper to `tiny.en`.
- `piper-tts` vs `piper` packages. Use `pip install piper-tts` (the actively maintained 2026 fork).
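A related sample-rate pitfall: some output devices only accept 48 kHz. If that's the case, you can bridge the gap in software. A naive linear-interpolation sketch (`resample_linear` is illustrative only; prefer a proper polyphase resampler such as `scipy.signal.resample_poly` when quality matters):

```python
import numpy as np


def resample_linear(pcm: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample int16 mono PCM by linear interpolation.

    Good enough to bridge a 22 050 Hz Piper voice and a 48 kHz-only
    output device for a demo; not suitable where fidelity matters.
    """
    if sr_in == sr_out:
        return pcm
    n_out = int(round(len(pcm) * sr_out / sr_in))
    x_out = np.linspace(0, len(pcm) - 1, n_out)
    return np.interp(x_out, np.arange(len(pcm)),
                     pcm.astype(np.float32)).astype(np.int16)
```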
How CallSphere does this in production
CallSphere serves 37 specialist agents across 6 verticals (Healthcare 14 tools / OpenAI Realtime / FastAPI :8084, OneRoof Property 10 specialists, Salon, Dental, F&B, Behavioral) with cloud TTS for emotional warmth and a Pion-based WebRTC mesh. We use Piper internally for cost-sensitive workflows and offline kiosk demos. Pricing is flat $149 / $499 / $1499, 14-day trial, 22% affiliate. 115+ DB tables back 90+ tools. See it on /demo.
FAQ
How does Piper compare to ElevenLabs? ElevenLabs wins on emotion; Piper wins on cost and privacy.
Can Piper clone voices? Not directly — train a new voice with Piper's training pipeline (~3 hours of clean audio).
Does Piper run on Android? Yes, via piper-android ONNX Runtime build.
Latency target? Sub-300 ms first audio on a desktop CPU.
Real-time on a Pi Zero 2? Use x_low quality voices only.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.