By Sagar Shankaran, Founder of CallSphere
Zero-cloud voice agent on a single laptop: whisper.cpp for STT, llama.cpp for the LLM, and Piper for TTS. No telemetry, no API keys, no per-minute bill — full working code.
Key takeaways
TL;DR — whisper.cpp and llama.cpp are pure-C++ runtimes that run Whisper and Llama-family models on CPU/Metal/CUDA with no Python in the hot path. Glue them with a small Python loop and Piper, and you get a working voice agent on a 2024 MacBook Air with zero outbound traffic.
A single Python process that captures microphone audio in 100 ms chunks until it hears roughly 0.6 s of silence, transcribes with whisper.cpp (base.en quantized), routes the text to a Llama 3.1 8B model (Q4_K_M) served by llama-server over its OpenAI-compatible /v1/chat/completions endpoint, and speaks the reply through Piper. Total RAM: ~6 GB. Total network calls: 0.
cmake, make, and a recent gcc/clang.
sounddevice, numpy, requests, piper-tts.

flowchart LR
MIC[Microphone] -->|PCM 16kHz| WCPP[whisper.cpp main]
WCPP -->|text| LOOP[Python loop]
LOOP -->|HTTP /v1/chat/completions| LSRV[llama-server :8080]
LSRV -->|text| LOOP
LOOP -->|text| PIPER[piper-tts]
PIPER -->|PCM| SPK[Speaker]
```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
bash ./models/download-ggml-model.sh base.en
cd .. && git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=1 && cmake --build build -j
```
On Linux + NVIDIA, swap -DGGML_METAL=1 for -DGGML_CUDA=1. The binaries you care about are build/bin/whisper-cli and build/bin/llama-server.
```bash
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --n-gpu-layers 99 \
  --chat-template llama3
```
llama-server exposes /v1/chat/completions, /v1/completions and /v1/embeddings on the same port — drop-in for the OpenAI Python SDK.
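Loading an 8B model takes a few seconds, so it can help to wait for the server before starting the voice loop. The following is a minimal sketch, assuming the server's `/health` endpoint returns HTTP 200 once the model is ready. The function name `wait_for_server` and its `timeout` and `interval` parameters are illustrative, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://127.0.0.1:8080", timeout=60.0, interval=0.5):
    """Poll GET {base_url}/health until it answers 200 or `timeout` seconds pass.

    Returns True if the server became ready, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused (server not started) or a non-2xx status
            # (for example while the model is still loading): retry.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` once before the main loop turns a confusing connection error on the first utterance into a clear startup failure.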
```python
import sounddevice as sd, numpy as np, subprocess, tempfile, wave, requests

SAMPLE_RATE = 16000
CHUNK = 1600  # 100 ms of audio at 16 kHz

def record_until_silence(threshold=0.01, max_seconds=8):
    """Record mono int16 audio until ~0.6 s of silence or max_seconds."""
    frames, silent = [], 0
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as s:
        # frames holds chunks, so compare recorded samples (not chunk count) to the cap
        while silent < int(SAMPLE_RATE * 0.6) and len(frames) * CHUNK < SAMPLE_RATE * max_seconds:
            chunk, _ = s.read(CHUNK)
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + CHUNK if rms < threshold else 0
    return np.concatenate(frames)

def transcribe(pcm):
    """Write the PCM to a temporary WAV file and return whisper-cli's transcript."""
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm.tobytes())
    # -nt: no timestamps; -otxt: also write the transcript to a .txt file
    out = subprocess.check_output([
        "whisper.cpp/build/bin/whisper-cli",
        "-m", "whisper.cpp/models/ggml-base.en.bin",
        "-f", f, "-nt", "-otxt"], text=True)
    return out.strip()
```
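One thing to watch in `record_until_silence`: it counts silence from the very first chunk, so 0.6 s of quiet before the user starts talking ends the recording. A possible variation, sketched here as a pure function over per-chunk RMS values so it can be checked without a microphone, only starts the silence timer after speech has been heard. The name `chunks_to_keep` and its parameters are hypothetical; the thresholds mirror the ones in the code above.

```python
def chunks_to_keep(rms_values, threshold=0.01, chunk_ms=100,
                   trailing_silence_ms=600, max_ms=8000):
    """Return how many chunks to record before stopping.

    Stops when `trailing_silence_ms` of silence follows some speech,
    or when `max_ms` of audio has been consumed, whichever comes first.
    """
    heard_speech = False
    silent_ms = 0
    count = 0
    for rms in rms_values:
        count += 1
        if rms >= threshold:
            # Loud chunk: mark speech as started and reset the silence timer.
            heard_speech, silent_ms = True, 0
        elif heard_speech:
            # Quiet chunk after speech: accumulate trailing silence.
            silent_ms += chunk_ms
        if (heard_speech and silent_ms >= trailing_silence_ms) or count * chunk_ms >= max_ms:
            break
    return count
```

In the live loop the same state (`heard_speech`, `silent_ms`) would be updated once per `s.read(...)` call instead of iterating over a precomputed list.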
```python
import json

SYSTEM = "You are a concise local voice assistant. Reply in 1-2 sentences."

def chat(history, user):
    """Send the conversation to llama-server and return the assistant's reply."""
    history.append({"role": "user", "content": user})
    r = requests.post("http://127.0.0.1:8080/v1/chat/completions", json={
        "model": "local",
        "messages": [{"role": "system", "content": SYSTEM}, *history],
        "temperature": 0.4, "max_tokens": 160}).json()
    reply = r["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

def speak(text):
    """Synthesize text with Piper and play the raw 16-bit PCM it writes to stdout."""
    p = subprocess.Popen(["piper", "--model", "en_US-amy-medium.onnx", "--output-raw"],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(text.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050); sd.wait()
```
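Because `chat` appends to `history` on every turn, a long session will eventually outgrow the 8192-token context passed to llama-server. A simple mitigation, sketched here under the assumption that keeping only the last few turns is acceptable, is to trim the list before building the request. The helper `trim_history` and its `max_turns` parameter are hypothetical, and it counts messages rather than tokens.

```python
def trim_history(history, max_turns=8):
    """Return the most recent messages, at most `max_turns` user/assistant pairs.

    The result always starts with a user message so the conversation
    sent to the model never opens with an orphaned assistant reply.
    """
    keep = max_turns * 2
    trimmed = history[-keep:] if keep > 0 else []
    # Drop leading assistant messages left over from slicing an odd-length list.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Inside `chat`, this would mean sending `[{"role": "system", "content": SYSTEM}, *trim_history(history)]` instead of the full list. A token-based budget would be more precise, but needs a tokenizer call per turn.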
```python
history = []
while True:
    pcm = record_until_silence()
    text = transcribe(pcm)
    if not text or len(text) < 3:
        continue
    print("USER:", text)
    reply = chat(history, text)
    print("BOT :", reply)
    speak(reply)
```
If you're on an Intel Mac or an 8 GB box, swap to Q4_0 or IQ3_M Llama quants and use tiny.en for Whisper. End-to-end latency is 1.4 s on a base M1 and 4.8 s on a 2018 Intel i7.
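Latency figures like these depend heavily on hardware and model choice, so it is worth measuring each stage on your own machine. Here is a small sketch of a timing helper. The name `timed` and the `store` dictionary are illustrative and not part of any library used above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, store):
    """Record the wall-clock duration of the enclosed block in store[stage] (seconds)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed stages are still timed.
        store[stage] = time.perf_counter() - start


def format_timings(store):
    """Render per-stage timings plus their sum, e.g. 'stt=0.31s llm=0.80s total=1.11s'."""
    parts = [f"{name}={seconds:.2f}s" for name, seconds in store.items()]
    parts.append(f"total={sum(store.values()):.2f}s")
    return " ".join(parts)
```

In the main loop, wrapping `transcribe(pcm)`, `chat(history, text)` and `speak(reply)` each in `with timed("stt", t):`, `with timed("llm", t):` and `with timed("tts", t):` and printing `format_timings(t)` shows which stage dominates before you decide which model to shrink.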
llama-server chat template. Llama 3 needs --chat-template llama3; without it, you'll see role leakage.
--n-gpu-layers 99 offloads everything; reduce to 24 if your unified memory pressure spikes.

CallSphere runs 37 specialist agents across 6 verticals. Healthcare uses 14 HIPAA-aligned tools on a FastAPI service at port 8084 backed by OpenAI Realtime; OneRoof Property routes 10 specialists over WebRTC; Salon, Dental, F&B and Behavioral round out the suite. Pricing is flat $149 / $499 / $1499 with a 14-day trial, a 22% affiliate program, and 115+ Postgres tables behind 90+ tools. Local stacks like the one above are great for prototyping, but voice quality, barge-in and SOC 2 logging are what separate a demo from production. See it live on /demo.
Why not just use the OpenAI Realtime API? You will when you go to production. Local is for privacy-sensitive prototyping, on-prem POCs, and offline kiosks.
Can I add tools / function calling? Yes — Llama 3.1 + llama-server supports OpenAI-style tools and tool_choice since b3982.
What's the cheapest TTS that doesn't sound robotic? Piper en_US-amy-medium for English; en_GB-alan-medium is also surprisingly good.
Can this scale? Single-user yes; multi-tenant no — llama-server serializes generations.
Is it really HIPAA-able? Closer than cloud, but you still need a BAA-grade audit trail. CallSphere's Healthcare stack handles that for you.
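On the function-calling question above: the request and response shapes follow the OpenAI chat-completions format. Below is a sketch of building such a request and reading tool calls out of a response. The `get_time` tool and both helper names are hypothetical, and whether your llama-server build needs extra flags to enable tool calling is something to check against its documentation.

```python
import json

# Hypothetical tool definition in OpenAI "tools" format.
GET_TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current time in a given IANA timezone.",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}


def build_tool_request(messages, tools):
    """Build a /v1/chat/completions payload that lets the model pick a tool."""
    return {"model": "local", "messages": messages,
            "tools": tools, "tool_choice": "auto"}


def extract_tool_calls(response):
    """Return [(tool_name, arguments_dict), ...] from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

The payload from `build_tool_request` can be posted with the same `requests.post(...)` call that `chat` uses. When `extract_tool_calls` returns an empty list, the model answered in plain text and the reply can go straight to `speak`.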
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.