---
title: "Build a Fully Local Voice Agent with whisper.cpp + llama.cpp (2026)"
description: "Zero-cloud voice agent on a single laptop: whisper.cpp for STT, llama.cpp for the LLM, and Piper for TTS. No telemetry, no API keys, no per-minute bill — full working code."
canonical: https://callsphere.ai/blog/vw4h-build-fully-local-voice-agent-whisper-cpp-llama-cpp
category: "AI Engineering"
tags: ["whisper.cpp", "llama.cpp", "Local AI", "Voice Agent", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-15T00:00:00.000Z
updated: 2026-05-07T16:13:44.370Z
---

# Build a Fully Local Voice Agent with whisper.cpp + llama.cpp (2026)

> Zero-cloud voice agent on a single laptop: whisper.cpp for STT, llama.cpp for the LLM, and Piper for TTS. No telemetry, no API keys, no per-minute bill — full working code.

> **TL;DR** — whisper.cpp and llama.cpp are pure-C++ runtimes that run Whisper and Llama-family models on CPU/Metal/CUDA with no Python in the hot path. Glue them with a small Python loop and Piper, and you get a working voice agent on a 2024 MacBook Air with zero outbound traffic.

## What you'll build

A single Python process that captures microphone audio in 100 ms chunks until it hears ~0.6 s of silence, transcribes the utterance with `whisper.cpp` (`base.en`), routes the text to an 8B Llama model served by `llama-server` over its OpenAI-compatible `/v1/chat/completions` endpoint, and speaks the reply through Piper. Total RAM: ~6 GB. Total network calls: 0.

## Prerequisites

1. macOS (Apple Silicon) or Linux with 8 GB+ RAM.
2. `cmake`, `make`, and a recent `gcc`/`clang`.
3. Python 3.11 with `sounddevice`, `numpy`, `requests`, and `piper-tts` (install command below).
4. ~6 GB free disk for models (Whisper base.en + Llama 3.1 8B Q4_K_M + Piper en_US-amy-medium).
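
The Python pieces install straight from PyPI; the `piper-tts` package also provides the `piper` CLI used in Step 4. Voice files (`en_US-amy-medium.onnx` plus its `.json` config) come separately from the Piper voices repository; keep both next to your script.

```bash
python3 -m pip install sounddevice numpy requests piper-tts
```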

## Architecture

```mermaid
flowchart LR
  MIC[Microphone] -->|PCM 16kHz| WCPP[whisper.cpp main]
  WCPP -->|text| LOOP[Python loop]
  LOOP -->|HTTP /v1/chat/completions| LSRV[llama-server :8080]
  LSRV -->|text| LOOP
  LOOP -->|text| PIPER[piper-tts]
  PIPER -->|PCM| SPK[Speaker]
```

## Step 1 — Build whisper.cpp and llama.cpp

```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
bash ./models/download-ggml-model.sh base.en
cd .. && git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
```

On Linux + NVIDIA, swap `-DGGML_METAL=1` for `-DGGML_CUDA=1`. The binaries you care about are `build/bin/whisper-cli` and `build/bin/llama-server`.
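
Before wiring anything together, sanity-check the Whisper build with the sample clip that ships in the repo:

```bash
cd whisper.cpp
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
# should print (roughly): "And so my fellow Americans, ask not what your
# country can do for you, ask what you can do for your country."
```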

## Step 2 — Start llama-server with an OpenAI-compatible API

```bash
# Download a Q4_K_M GGUF of Llama 3.1 8B Instruct from Hugging Face first
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --n-gpu-layers 99 \
  --chat-template llama3
```

`llama-server` exposes `/v1/chat/completions`, `/v1/completions` and `/v1/embeddings` on the same port — drop-in for the OpenAI Python SDK.
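
"Drop-in" means you can point the official OpenAI Python SDK (`pip install openai`) at the local server by overriding `base_url`; the key just has to be a non-empty string:

```python
from openai import OpenAI

# llama-server ignores the API key, but the SDK insists on one
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="local",  # llama-server serves whatever model it was launched with
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```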

## Step 3 — Capture audio and run whisper.cpp on each utterance

```python
import os, subprocess, tempfile, wave
import numpy as np
import requests
import sounddevice as sd

SAMPLE_RATE = 16000  # whisper.cpp expects 16 kHz mono input
CHUNK = 1600         # 100 ms of audio per read

def record_until_silence(threshold=0.01, max_seconds=8):
    # Reads 100 ms chunks until ~0.6 s of RMS silence or max_seconds.
    # Note: it also returns after 0.6 s if nothing was said at all;
    # the loop in Step 5 filters out empty transcripts.
    frames, silent, total = [], 0, 0
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as s:
        while silent < int(SAMPLE_RATE * 0.6) and total < SAMPLE_RATE * max_seconds:
            chunk, _ = s.read(CHUNK)
            frames.append(chunk)
            total += CHUNK
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + CHUNK if rms < threshold else 0
    return np.concatenate(frames)

def transcribe(pcm):
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm.tobytes())
    # -nt drops timestamps and -np suppresses the banner, so stdout is the
    # bare transcript (-otxt would write it to a side file instead)
    out = subprocess.check_output([
        "whisper.cpp/build/bin/whisper-cli", "-m",
        "whisper.cpp/models/ggml-base.en.bin", "-f", f, "-nt", "-np"], text=True)
    os.unlink(f)
    return out.strip()
```
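
To check Step 3 in isolation before the LLM is involved, record one utterance and print the transcript:

```python
if __name__ == "__main__":
    print("Speak now...")
    print("Heard:", transcribe(record_until_silence()))
```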

## Step 4 — Talk to llama-server and stream the reply through Piper

```python
import json
SYSTEM = "You are a concise local voice assistant. Reply in 1-2 sentences."

def chat(history, user):
    history.append({"role": "user", "content": user})
    r = requests.post("[http://127.0.0.1:8080/v1/chat/completions](http://127.0.0.1:8080/v1/chat/completions)",
        json={"model": "local", "messages": [{"role":"system","content":SYSTEM}, *history],
              "temperature": 0.4, "max_tokens": 160}).json()
    reply = r["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

def speak(text):
    # piper reads text on stdin; --output-raw emits 16-bit mono PCM
    # at the voice's native rate (22 050 Hz for the medium voices)
    p = subprocess.Popen(["piper", "--model", "en_US-amy-medium.onnx",
                          "--output-raw"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(text.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050); sd.wait()
```
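
The `chat` helper above blocks until the whole reply arrives. The same endpoint also accepts `"stream": true` (OpenAI-style server-sent events), which lets Piper start speaking as soon as the first sentence is complete. A minimal sketch; the sentence splitting is deliberately naive:

```python
def chat_stream(history, user):
    """Yield complete sentences while llama-server streams tokens (SSE)."""
    history.append({"role": "user", "content": user})
    buf, reply = "", ""
    with requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"model": "local",
              "messages": [{"role": "system", "content": SYSTEM}, *history],
              "temperature": 0.4, "max_tokens": 160, "stream": True},
        stream=True,
    ) as r:
        for line in r.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip keep-alives and blank lines
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            delta = json.loads(payload)["choices"][0]["delta"].get("content") or ""
            buf += delta
            reply += delta
            # flush everything up to the earliest sentence terminator
            while (cuts := [buf.index(p) for p in ".!?" if p in buf]):
                cut = min(cuts) + 1
                yield buf[:cut].strip()
                buf = buf[cut:]
    if buf.strip():
        yield buf.strip()
    history.append({"role": "assistant", "content": reply})
```

To use it, replace the `chat`/`speak` pair in the Step 5 loop with `for sentence in chat_stream(history, text): speak(sentence)`.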

## Step 5 — Glue it into a loop

```python
history = []
while True:
    pcm = record_until_silence()
    text = transcribe(pcm)
    if not text or len(text) < 3 or text.startswith("["): continue  # skips "[BLANK_AUDIO]" markers
    print("USER:", text)
    reply = chat(history, text)
    print("BOT :", reply)
    speak(reply)
```

## Step 6 — Quantize aggressively for older laptops

If you're on an Intel Mac or an 8 GB machine, switch to `Q4_0` or `IQ3_M` Llama quants and `tiny.en` for Whisper. End-to-end latency: ~1.4 s on a base M1, ~4.8 s on a 2018 Intel i7.
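
If the quant you want isn't published, llama.cpp's bundled `llama-quantize` tool converts a higher-precision GGUF locally (file names here are illustrative):

```bash
# requantize an F16 GGUF down to Q4_0 for low-RAM machines
./build/bin/llama-quantize \
  models/llama-3.1-8b-instruct-f16.gguf \
  models/llama-3.1-8b-instruct-q4_0.gguf \
  Q4_0
```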

## Common pitfalls

- **Sample-rate mismatch.** Whisper expects 16 kHz mono; Piper outputs 22.05 kHz. Don't share one audio device handle across both; if you must mix rates, see the resampling sketch after this list.
- **`llama-server` chat template.** Llama 3 needs `--chat-template llama3`; without it, you'll see role-leakage.
- **Metal OOM.** `--n-gpu-layers 99` offloads everything; reduce to 24 if your unified memory pressure spikes.
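
If you do need to mix rates on one device, a dependency-free linear-interpolation resampler is enough for speech (a sketch; use `scipy.signal.resample_poly` if fidelity matters):

```python
import numpy as np

def resample(pcm: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Linear-interpolation resample for int16 mono PCM."""
    n_out = int(len(pcm) * dst_rate / src_rate)
    xp = np.arange(len(pcm))
    x = np.linspace(0, len(pcm) - 1, n_out)
    return np.interp(x, xp, pcm.astype(np.float32)).astype(np.int16)

# e.g. play Piper's 22.05 kHz output on a 16 kHz-only device:
# sd.play(resample(raw_pcm, 22050, 16000), 16000)
```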

## How CallSphere does this in production

CallSphere runs 37 specialist agents across 6 verticals. Healthcare uses 14 HIPAA-aligned tools on a FastAPI service at port 8084 backed by OpenAI Realtime; OneRoof Property routes 10 specialists over WebRTC; Salon, Dental, F&B and Behavioral round out the suite. Pricing is flat $149 / $499 / $1499 with a [14-day trial](/trial), a [22% affiliate program](/affiliate), and 115+ Postgres tables behind 90+ tools. Local stacks like the one above are great for prototyping — but voice quality, barge-in, and SOC 2 logging are what separate a demo from production. See it live on [/demo](/demo).

## FAQ

**Why not just use the OpenAI Realtime API?** You will when you go to production. Local is for privacy-sensitive prototyping, on-prem POCs, and offline kiosks.

**Can I add tools / function calling?** Yes — Llama 3.1 + llama-server supports OpenAI-style `tools` and `tool_choice` since b3982.
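
For shape, here's what a tool-calling request looks like against the Step 2 server; `get_weather` is a made-up tool for illustration:

```python
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, not a real API
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
r = requests.post("http://127.0.0.1:8080/v1/chat/completions",
    json={"model": "local",
          "messages": [{"role": "user", "content": "Weather in Oslo?"}],
          "tools": tools, "tool_choice": "auto"}).json()
print(r["choices"][0]["message"].get("tool_calls"))  # arguments arrive as JSON strings
```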

**What's the cheapest TTS that doesn't sound robotic?** Piper `en_US-amy-medium` for English; `en_GB-alan-medium` is also surprisingly good.

**Can this scale?** Single-user yes; multi-tenant not really — by default `llama-server` handles one generation at a time, and `--parallel N` only adds slots by splitting the context window among them.

**Is it really HIPAA-able?** Closer than cloud, but you still need a BAA-grade audit trail. CallSphere's [Healthcare](/industries/healthcare) stack handles that for you.

## Sources

- [whisper.cpp on GitHub](https://github.com/ggml-org/whisper.cpp)
- [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
- [Piper TTS](https://github.com/rhasspy/piper)
- [llama.cpp 2026 guide](https://weavai.app/blog/en/2026/04/24/llama-cpp-2026-guide-local-ai-inference-setup/)

