
Build a Voice Agent with Piper TTS — Local, Free, and Fast (2026)

Piper 1.4.2 ships ONNX voices that synthesize on a Raspberry Pi 5 in real time. Here's a full Python voice agent with Piper, faster-whisper, and Ollama — no GPU required.

TL;DR — Piper is the fastest open neural TTS that still sounds human. Version 1.4.2 (April 2, 2026) added CUDA inference via ONNX Runtime and a unified piper CLI that streams raw PCM. Pair it with faster-whisper for STT and Ollama for the LLM, and you have a CPU-friendly voice agent.

What you'll build

A Python service that listens for an utterance, transcribes it with faster-whisper (`small.en`), gets a reply from a local `llama3.1:8b` via Ollama, and speaks it through Piper using `en_US-amy-medium`. End-to-end latency stays under 2 s on a Raspberry Pi 5 (8 GB).

Prerequisites

  1. Python 3.11+ with `pip install piper-tts faster-whisper sounddevice numpy ollama`.
  2. Ollama installed and running (`ollama serve`).
  3. Pull a model: `ollama pull llama3.1:8b`.
  4. Download a Piper voice: `python -m piper.download_voices en_US-amy-medium`.

Architecture

```mermaid
flowchart LR
  MIC[Microphone] --> FW[faster-whisper small.en]
  FW -->|text| OLL[Ollama HTTP :11434]
  OLL -->|text| PIPER[piper en_US-amy-medium]
  PIPER -->|PCM 22050 Hz| SPK[Speaker]
```

Step 1 — Verify Piper from the CLI

```bash
echo "Hello from a fully local voice agent." | \
  piper --model en_US-amy-medium --output-raw | \
  aplay -r 22050 -f S16_LE -t raw -
```

If you hear audio, you're done with the hard part.

Step 2 — Wrap Piper as a streaming Python class

```python
import subprocess

import numpy as np
import sounddevice as sd

class Piper:
    def __init__(self, model="en_US-amy-medium.onnx"):
        self.model = model
        self.sr = 22050  # rate of the medium Amy voice; see "Common pitfalls"

    def speak(self, text: str):
        # Stream raw PCM from the piper CLI straight into the audio device.
        proc = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        proc.stdin.write(text.encode())
        proc.stdin.close()
        with sd.OutputStream(samplerate=self.sr, channels=1, dtype="int16") as out:
            while True:
                chunk = proc.stdout.read(4096)
                if not chunk:
                    break
                out.write(np.frombuffer(chunk, dtype=np.int16))
```

The trick is `--output-raw` plus a streaming `OutputStream`: you start hearing audio while later phonemes are still being synthesized.
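
A quick usage check (assuming the voice's `.onnx` and its config sit in the working directory):

```python
tts = Piper()  # defaults to en_US-amy-medium.onnx
tts.speak("Playback starts while the rest of the sentence is still synthesizing.")
```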

Step 3 — STT with faster-whisper

```python
from faster_whisper import WhisperModel

# int8 quantization keeps small.en fast enough for CPU-only boxes
stt = WhisperModel("small.en", device="cpu", compute_type="int8")

def transcribe(pcm_int16, sr=16000):
    # faster-whisper expects float32 audio in [-1, 1]
    audio = pcm_int16.astype(np.float32) / 32768.0
    segments, _ = stt.transcribe(audio, language="en", vad_filter=True)
    return " ".join(s.text.strip() for s in segments)
```
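
To sanity-check STT without a microphone, feed it any 16 kHz mono WAV (`test.wav` below is a placeholder name, not a file this post ships):

```python
import wave

with wave.open("test.wav", "rb") as w:  # placeholder: 16 kHz, mono, 16-bit
    pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
print(transcribe(pcm))
```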

Step 4 — LLM via Ollama

```python
import ollama

SYSTEM = "You are Amy, a helpful voice assistant. Keep replies under 2 sentences."

def reply(history, user_text):
    history.append({"role": "user", "content": user_text})
    r = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "system", "content": SYSTEM}, *history],
        options={"temperature": 0.4, "num_predict": 160})
    msg = r["message"]["content"]
    history.append({"role": "assistant", "content": msg})
    return msg
```
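
For faster first audio on long replies, you can stream tokens from Ollama and hand each completed sentence to Piper as it arrives. A minimal sketch, assuming the Ollama Python client's `stream=True` iterator; `stream_reply` is a hypothetical helper, not part of the pipeline above:

```python
import re

def stream_reply(history, user_text, tts):
    # Hypothetical: speak each sentence as soon as the LLM finishes it.
    history.append({"role": "user", "content": user_text})
    buf, full = "", []
    for part in ollama.chat(model="llama3.1:8b",
                            messages=[{"role": "system", "content": SYSTEM}, *history],
                            stream=True):
        buf += part["message"]["content"]
        while (m := re.search(r"[.!?]\s", buf)):  # naive sentence boundary
            sentence, buf = buf[:m.end()], buf[m.end():]
            tts.speak(sentence.strip())
            full.append(sentence)
    if buf.strip():
        tts.speak(buf.strip())
        full.append(buf)
    msg = "".join(full)
    history.append({"role": "assistant", "content": msg})
    return msg
```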

Step 5 — Mic capture loop with simple VAD

```python
def record(threshold=0.012, max_s=8):
    # Capture mic audio until ~0.56 s of sustained silence or max_s seconds.
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)  # 100 ms at 16 kHz
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()
```
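
The fixed `threshold=0.012` assumes a quiet room. A hypothetical `calibrate()` helper that samples a second of ambient noise and scales it gives a more robust starting point:

```python
def calibrate(seconds=1.0, factor=3.0):
    # Measure ambient RMS and derive a VAD threshold from it.
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        chunk, _ = s.read(int(16000 * seconds))
    ambient = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
    return max(0.012, factor * ambient)  # never go below the quiet-room default
```

Then call `record(threshold=calibrate())` at startup.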

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Tie it together

```python
piper, history = Piper(), []
piper.speak("Hi, I'm Amy. How can I help?")

while True:
    pcm = record()
    text = transcribe(pcm)
    if not text.strip():
        continue
    print("USER:", text)
    out = reply(history, text)
    print("BOT :", out)
    piper.speak(out)
```
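
To confirm the sub-2 s target from the intro on your own hardware, wrap one turn with timers (a sketch using only the standard library):

```python
import time

pcm = record()
t0 = time.perf_counter()
text = transcribe(pcm)
t1 = time.perf_counter()
out = reply(history, text)
t2 = time.perf_counter()
piper.speak(out)  # note: this figure includes playback, not just time-to-first-audio
t3 = time.perf_counter()
print(f"stt={t1 - t0:.2f}s  llm={t2 - t1:.2f}s  tts={t3 - t2:.2f}s  total={t3 - t0:.2f}s")
```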

Common pitfalls

  • Wrong sample rate. Piper voices ship at 16, 22.05, or 24 kHz; read the JSON config that sits next to the `.onnx` file (see the sketch after this list).
  • Pi 5 thermal throttling. Add a fan, or drop Whisper from `small.en` to `tiny.en`.
  • `piper-tts` vs `piper` packages. Use `pip install piper-tts` (the actively maintained 2026 fork).
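
For the first pitfall, read the rate from the voice's JSON sidecar instead of hardcoding 22050. The exact filename varies by release (the `.onnx.json` suffix and the `audio.sample_rate` key below match the voices we've used, but treat both as assumptions):

```python
import json
from pathlib import Path

def voice_sample_rate(model_path: str) -> int:
    # Piper voices ship a JSON config next to the .onnx file.
    cfg = Path(model_path + ".json")  # e.g. en_US-amy-medium.onnx.json
    return json.loads(cfg.read_text())["audio"]["sample_rate"]
```

In the `Piper` class, set `self.sr = voice_sample_rate(self.model)` instead of the constant.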

How CallSphere does this in production

CallSphere serves 37 specialist agents across 6 verticals (Healthcare with 14 tools on OpenAI Realtime and FastAPI :8084, OneRoof Property with 10 specialists, plus Salon, Dental, F&B, and Behavioral), using cloud TTS for emotional warmth and a Pion-based WebRTC mesh. We use Piper internally for cost-sensitive workflows and offline kiosk demos. Pricing is a flat $149 / $499 / $1,499 with a 14-day trial and a 22% affiliate commission; 115+ database tables back 90+ tools. See it in action on /demo.

FAQ

How does Piper compare to ElevenLabs? ElevenLabs wins on emotion; Piper wins on cost and privacy.

Can Piper clone voices? Not directly — train a new voice with Piper's training pipeline (~3 hours of clean audio).

Does Piper run on Android? Yes, via piper-android ONNX Runtime build.

Latency target? Sub-300 ms first audio on a desktop CPU.

Real-time on a Pi Zero 2? Use x_low quality voices only.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.