AI Engineering

Build a Fully Local Voice Agent with whisper.cpp + llama.cpp (2026)

Zero-cloud voice agent on a single laptop: whisper.cpp for STT, llama.cpp for the LLM, and Piper for TTS. No telemetry, no API keys, no per-minute bill — full working code.

TL;DR — whisper.cpp and llama.cpp are pure-C++ runtimes that run Whisper and Llama-family models on CPU/Metal/CUDA with no Python in the hot path. Glue them with a small Python loop and Piper, and you get a working voice agent on a 2024 MacBook Air with zero outbound traffic.

What you'll build

A single Python process that captures microphone audio in 100 ms chunks until it detects silence, transcribes the utterance with whisper.cpp (quantized base.en), routes the text to an 8B Llama model served by llama-server over its OpenAI-compatible /v1/chat/completions endpoint, and speaks the reply through Piper. Total RAM: ~6 GB. Total network calls: 0.

Prerequisites

  1. macOS (Apple Silicon) or Linux with 8 GB+ RAM.
  2. cmake, make, and a recent gcc/clang.
  3. Python 3.11 with sounddevice, numpy, requests, and piper-tts (pip install sounddevice numpy requests piper-tts).
  4. ~6 GB free disk for models (Whisper base.en + Llama 3.1 8B Q4_K_M + Piper en_US-amy-medium).

Architecture

```mermaid
flowchart LR
  MIC[Microphone] -->|PCM 16 kHz| WCPP[whisper-cli]
  WCPP -->|text| LOOP[Python loop]
  LOOP -->|HTTP /v1/chat/completions| LSRV[llama-server :8080]
  LSRV -->|text| LOOP
  LOOP -->|text| PIPER[piper-tts]
  PIPER -->|PCM| SPK[Speaker]
```

Step 1 — Build whisper.cpp and llama.cpp

```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
bash ./models/download-ggml-model.sh base.en
cd .. && git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
```

On Linux + NVIDIA, swap -DGGML_METAL=1 for -DGGML_CUDA=1. The binaries you care about are build/bin/whisper-cli and build/bin/llama-server.

Step 2 — Start llama-server with an OpenAI-compatible API

```bash
# Download a Q4_K_M GGUF of Llama 3.1 8B Instruct from HF first
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --n-gpu-layers 99 \
  --chat-template llama3
```


llama-server exposes /v1/chat/completions, /v1/completions and /v1/embeddings on the same port — drop-in for the OpenAI Python SDK.
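As a quick smoke test, here's a minimal sketch that points the official OpenAI Python SDK at llama-server. The "local" model name and "sk-local" key are placeholders; llama-server serves whatever model it was launched with and ignores the key unless you started it with --api-key.

```python
# Minimal sketch: the OpenAI Python SDK talking to llama-server.
# "local" and "sk-local" are placeholders, not real identifiers.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="local",  # llama-server serves the model it was started with
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```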

Step 3 — Capture audio and run whisper.cpp on each utterance

```python
import sounddevice as sd, numpy as np, subprocess, tempfile, wave, requests

SAMPLE_RATE = 16000

def record_until_silence(threshold=0.01, max_seconds=8):
    frames, silent = [], 0  # silent counts consecutive low-RMS samples
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as s:
        # stop after 0.6 s of silence or max_seconds of audio;
        # frames holds 1600-sample chunks, so compare samples, not chunks
        while silent < int(SAMPLE_RATE * 0.6) and len(frames) * 1600 < SAMPLE_RATE * max_seconds:
            chunk, _ = s.read(1600)  # 100 ms at 16 kHz
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames)

def transcribe(pcm):
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm.tobytes())
    # whisper-cli prints the transcript to stdout; -nt drops timestamps
    out = subprocess.check_output([
        "whisper.cpp/build/bin/whisper-cli",
        "-m", "whisper.cpp/models/ggml-base.en.bin",
        "-f", f, "-nt"], text=True)
    return out.strip()
```

Step 4 — Talk to llama-server and stream the reply through Piper

```python
SYSTEM = "You are a concise local voice assistant. Reply in 1-2 sentences."

def chat(history, user):
    history.append({"role": "user", "content": user})
    r = requests.post("http://127.0.0.1:8080/v1/chat/completions", json={
        "model": "local",
        "messages": [{"role": "system", "content": SYSTEM}, *history],
        "temperature": 0.4,
        "max_tokens": 160}).json()
    reply = r["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

def speak(text):
    p = subprocess.Popen(
        ["piper", "--model", "en_US-amy-medium.onnx", "--output-raw"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(text.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050)  # Piper medium voices run at 22.05 kHz
    sd.wait()
```


Step 5 — Glue it into a loop

```python
history = []
while True:
    pcm = record_until_silence()
    text = transcribe(pcm)
    if not text or len(text) < 3:
        continue
    print("USER:", text)
    reply = chat(history, text)
    print("BOT :", reply)
    speak(reply)
```

Step 6 — Quantize aggressively for older laptops

If you're on an Intel Mac or an 8 GB box, swap to Q4_0 or IQ3_M Llama quants and use tiny.en for Whisper. End-to-end latency is roughly 1.4 s on a base M1 and 4.8 s on a 2018 Intel i7.
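To see where your machine spends that time, here's a rough per-stage probe, assuming the record_until_silence, transcribe, and chat functions from Steps 3 and 4 are already defined in the same file:

```python
# Rough latency probe; numbers will vary with hardware and quant choice.
import time

t0 = time.perf_counter()
pcm = record_until_silence()
t1 = time.perf_counter()
text = transcribe(pcm)
t2 = time.perf_counter()
reply = chat([], text)  # empty history: measure a single cold turn
t3 = time.perf_counter()
print(f"record {t1 - t0:.2f}s | stt {t2 - t1:.2f}s | llm {t3 - t2:.2f}s")
```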

Common pitfalls

  • Sample-rate mismatch. Whisper expects 16 kHz mono; Piper outputs 22.05 kHz. Don't share the audio device handle between capture and playback (a crude resampling sketch follows this list).
  • llama-server chat template. Llama 3 needs --chat-template llama3; without it, you'll see role-leakage.
  • Metal OOM. --n-gpu-layers 99 offloads everything; reduce to 24 if your unified memory pressure spikes.
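If you'd rather keep a single output rate than juggle two streams, here's a crude sketch that linearly resamples Piper's output down to 16 kHz. Assumptions: mono int16 PCM, and np.interp's linear interpolation, which is rough but passable for speech.

```python
import numpy as np

def resample(pcm: np.ndarray, src_hz: int = 22050, dst_hz: int = 16000) -> np.ndarray:
    """Linearly resample mono int16 PCM from src_hz to dst_hz."""
    pcm = pcm.ravel()  # expect 1-D mono samples
    n_dst = int(len(pcm) * dst_hz / src_hz)
    x_dst = np.linspace(0, len(pcm) - 1, n_dst)
    return np.interp(x_dst, np.arange(len(pcm)), pcm.astype(np.float32)).astype(np.int16)
```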

How CallSphere does this in production

CallSphere runs 37 specialist agents across 6 verticals. Healthcare uses 14 HIPAA-aligned tools on a FastAPI service at port 8084 backed by OpenAI Realtime; OneRoof Property routes 10 specialists over WebRTC; Salon, Dental, F&B and Behavioral round out the suite. Pricing is flat $149 / $499 / $1499 with a 14-day trial, a 22% affiliate program, and 115+ Postgres tables behind 90+ tools. Local stacks like the one above are great for prototyping — but voice quality, barge-in, and SOC 2 logging are what separate a demo from production. See it live on /demo.

FAQ

Why not just use the OpenAI Realtime API? You will when you go to production. Local is for privacy-sensitive prototyping, on-prem POCs, and offline kiosks.

Can I add tools / function calling? Yes — Llama 3.1 + llama-server supports OpenAI-style tools and tool_choice since b3982.
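A hedged sketch of what such a request looks like; the get_weather schema below is invented for illustration, and your build and model's chat template must actually support tool calling:

```python
# OpenAI-style tool call against llama-server; get_weather is a
# hypothetical schema for this example, not something the server provides.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

r = requests.post("http://127.0.0.1:8080/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",
}).json()
print(r["choices"][0]["message"].get("tool_calls"))
```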

What's the cheapest TTS that doesn't sound robotic? Piper en_US-amy-medium for English; en_GB-alan-medium is also surprisingly good.

Can this scale? Single-user, yes; multi-tenant, no — out of the box, llama-server serializes generations.

Is it really HIPAA-able? Closer than cloud, but you still need a BAA-grade audit trail. CallSphere's Healthcare stack handles that for you.
