---
title: "Build a Voice Agent with OpenAI Whisper + Llama 3.x via Ollama"
description: "Pure-Python voice agent: openai-whisper for STT, Ollama serving Llama 3.3 70B for the LLM, edge-tts for TTS. Zero API keys, runs on a single workstation."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-whisper-llama-3-ollama
category: "AI Engineering"
tags: ["Whisper", "Ollama", "Llama 3", "Voice Agent", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-03T00:00:00.000Z
updated: 2026-05-07T16:13:45.627Z
---

# Build a Voice Agent with OpenAI Whisper + Llama 3.x via Ollama

> Pure-Python voice agent: openai-whisper for STT, Ollama serving Llama 3.3 70B for the LLM, edge-tts for TTS. Zero API keys, runs on a single workstation.

> **TL;DR** — If you want a voice agent today and don't care about the absolute lowest latency, OpenAI's reference `whisper` + Ollama (Llama 3.3 70B if you have the VRAM, 3.2 3B if you don't) + `edge-tts` (free Microsoft voices) is the most boring, most reliable build. Seven pip packages, ~120 lines of code.

## What you'll build

A console voice agent: hold the spacebar to talk, release to send. Whisper transcribes, Ollama replies, `edge-tts` speaks. Works on Windows, macOS, and Linux with one Python venv.

## Prerequisites

1. Python 3.11+, `pip install openai-whisper sounddevice numpy keyboard ollama edge-tts pydub`.
2. ffmpeg in PATH (Whisper needs it).
3. Ollama installed; `ollama pull llama3.2:3b` (or `llama3.3:70b-instruct-q4_K_M` if you have ~48 GB of VRAM).
4. Speakers and a microphone.

## Architecture

```mermaid
flowchart LR
  KEY[Spacebar] --> REC[sounddevice]
  REC --> W[openai-whisper base.en]
  W -->|text| O[Ollama HTTP :11434]
  O --> ETTS[edge-tts MS Aria]
  ETTS --> SPK[Speaker]
```

## Step 1 — Smoke-test the pieces

```bash
ollama run llama3.2:3b "Say hi in five words"
edge-tts --voice en-US-AriaNeural --text "Hi" --write-media hi.mp3 && \
  ffplay -nodisp -autoexit hi.mp3
python -c "import whisper; whisper.load_model('base.en')"
```

Three independent green lights = you're good.
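
If the mic or speakers are in question too, `sounddevice` can show you what PortAudio sees before you wire anything together:

```python
import sounddevice as sd

# lists every audio device; ">" marks the default input, "<" the default output
print(sd.query_devices())
```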

## Step 2 — Push-to-talk recorder

```python
import sounddevice as sd, numpy as np, keyboard

SR = 16000  # Whisper expects 16 kHz mono float32

def push_to_talk():
    print("Hold SPACE to speak..."); keyboard.wait("space")
    frames = []
    with sd.InputStream(samplerate=SR, channels=1, dtype="float32") as s:
        while keyboard.is_pressed("space"):
            ck, _ = s.read(1600)  # 100 ms blocks; second value is an overflow flag
            frames.append(ck)
    print("Got", len(frames), "frames")
    if not frames:  # space tapped too fast to capture a single block
        return np.zeros(0, dtype="float32")
    return np.concatenate(frames).flatten()
```

PTT avoids VAD tuning entirely — perfect for desktop assistants.

## Step 3 — Whisper transcription

```python
import whisper
model = whisper.load_model("base.en")
def transcribe(audio_f32):
    return model.transcribe(audio_f32, fp16=False, language="en")["text"].strip()
```

Use `base.en` for English-only speech; in OpenAI's published benchmarks it runs roughly twice as fast as `small`, with similar quality on short utterances.
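
If you have a CUDA GPU, load the model there and flip `fp16` on. A small variant of the loader (PyTorch ships as a dependency of `openai-whisper`, so nothing extra to install):

```python
import torch, whisper

# run on GPU when available; fp16 only helps on CUDA and merely
# triggers a "FP16 is not supported on CPU" warning otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base.en", device=device)

def transcribe(audio_f32):
    return model.transcribe(audio_f32, fp16=(device == "cuda"),
                            language="en")["text"].strip()
```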

## Step 4 — Ollama chat

```python
import ollama
SYSTEM = "You are a friendly, concise desktop voice assistant. Reply in 1-2 sentences."

def reply(history, text):
    history.append({"role":"user","content":text})
    r = ollama.chat(model="llama3.2:3b",
        messages=[{"role":"system","content":SYSTEM}, *history],
        options={"temperature":0.4, "num_predict":160})
    history.append(r["message"])
    return r["message"]["content"]
```
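
Waiting for the full completion adds dead air. A streaming variant of `reply` using the same client (prints tokens as they arrive, then returns the joined text):

```python
def reply_streaming(history, text):
    history.append({"role": "user", "content": text})
    parts = []
    # stream=True turns the call into a generator of partial messages
    for chunk in ollama.chat(model="llama3.2:3b",
            messages=[{"role": "system", "content": SYSTEM}, *history],
            options={"temperature": 0.4, "num_predict": 160},
            stream=True):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    out = "".join(parts)
    history.append({"role": "assistant", "content": out})
    return out
```

From there you could hand sentence-sized pieces to `edge-tts` as they complete instead of synthesizing the whole reply at once.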

## Step 5 — edge-tts streaming

```python
import asyncio, edge_tts, io
from pydub import AudioSegment, playback

async def speak_async(text, voice="en-US-AriaNeural"):
    comm = edge_tts.Communicate(text, voice)
    buf = io.BytesIO()
    async for chunk in comm.stream():
        if chunk["type"] == "audio": buf.write(chunk["data"])
    buf.seek(0)
    playback.play(AudioSegment.from_file(buf, format="mp3"))

def speak(text): asyncio.run(speak_async(text))
```

`edge-tts` is unofficial but stable since 2022; it uses Microsoft's free Edge browser voices.
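
Aria isn't the only option. `edge_tts.list_voices()` enumerates every available voice (the CLI equivalent is `edge-tts --list-voices`); the `ShortName` field is what `Communicate` expects:

```python
import asyncio, edge_tts

async def voices_for(locale="en-GB"):
    # one dict per voice; filter on the locale prefix of ShortName
    return [v["ShortName"] for v in await edge_tts.list_voices()
            if v["ShortName"].startswith(locale)]

print(asyncio.run(voices_for()))  # e.g. ['en-GB-LibbyNeural', ...]
```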

## Step 6 — Glue + main loop

```python
history = []
speak("Hi, I'm Aria. Hold the spacebar to talk.")
while True:
    audio = push_to_talk()
    text = transcribe(audio)
    if not text: continue
    print("YOU:", text)
    out = reply(history, text)
    print("BOT:", out)
    speak(out)
```
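
As written, Ctrl+C kills the loop with a traceback. One way to exit cleanly, plus a spoken escape hatch (the "goodbye" phrase is just a convention, nothing the stack requires):

```python
try:
    while True:
        audio = push_to_talk()
        text = transcribe(audio)
        if not text:
            continue
        print("YOU:", text)
        if "goodbye" in text.lower():  # spoken exit phrase
            speak("Bye!")
            break
        out = reply(history, text)
        print("BOT:", out)
        speak(out)
except KeyboardInterrupt:
    print("\nExiting.")
```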

## Common pitfalls

- **ffmpeg missing.** `whisper` dies with a cryptic `FileNotFoundError: 'ffmpeg'` on macOS without `brew install ffmpeg`.
- **Ollama not running.** The desktop app auto-starts the server on macOS; on some Linux setups you have to run `ollama serve` (or enable the systemd unit) yourself.
- **`keyboard` permissions.** The `keyboard` package hooks global key events; that needs root on Linux, and macOS support is experimental.
- **edge-tts rate limit.** Don't hammer synthesis in a tight loop or Microsoft will throttle you; a minimal backoff wrapper is sketched below.
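
A minimal backoff sketch for that last pitfall, wrapping `speak_async` from Step 5 (catching `Exception` broadly because the exact error edge-tts raises when throttled isn't guaranteed):

```python
import asyncio

async def speak_with_backoff(text, tries=4):
    for attempt in range(tries):
        try:
            await speak_async(text)  # from Step 5
            return
        except Exception:
            if attempt == tries - 1:
                raise  # out of retries, surface the error
            await asyncio.sleep(2 ** attempt)  # wait 1s, 2s, 4s between tries
```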

## How CallSphere does this in production

CallSphere's production path uses OpenAI Realtime + ElevenLabs for sub-500ms voice; this OSS stack is the right call for desktop assistants and offline demos. We run 37 specialists across 6 verticals — Healthcare's 14 HIPAA tools on FastAPI :8084, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, Behavioral — backed by 90+ tools and 115+ Postgres tables. Flat $149/$499/$1499. [14-day trial](/trial) · [22% affiliate](/affiliate) · [/pricing](/pricing).

## FAQ

**Llama 3.3 70B on consumer hardware?** Q4_K_M fits in 48 GB; on 24 GB use Q3_K_S or Llama 3.1 8B.

**Whisper accuracy?** `base.en` ~7% WER on noisy speech; `large-v3` ~3.5%.

**Push-to-talk vs VAD?** PTT for desktop, VAD for telephony.

**Can I use OpenAI's hosted Whisper API?** Yes — but then you're back on cloud egress.
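
If you do go hosted, only `transcribe()` changes. A sketch assuming the official `openai` package with `OPENAI_API_KEY` set in the environment:

```python
import io, wave
import numpy as np
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY

def transcribe_hosted(audio_f32, sr=16000):
    # pack the raw float32 capture into an in-memory 16-bit WAV for upload
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(sr)
        w.writeframes((audio_f32 * 32767).astype(np.int16).tobytes())
    buf.seek(0)
    resp = client.audio.transcriptions.create(model="whisper-1",
                                              file=("speech.wav", buf))
    return resp.text.strip()
```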

**Tools / function-calling?** Yes; recent Ollama releases support OpenAI-style tool calls with Llama 3.x, and the Python client (0.4+) can build tool schemas from plain functions.
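
A minimal round trip with the Python client (0.4+), using a hypothetical `get_time` tool; the client derives the JSON schema from the function signature and docstring:

```python
import datetime, ollama

def get_time() -> str:
    """Return the current local time as HH:MM."""
    return datetime.datetime.now().strftime("%H:%M")

r = ollama.chat(model="llama3.2:3b",
                messages=[{"role": "user", "content": "What time is it?"}],
                tools=[get_time])
# when the model decides to call the tool, it returns tool_calls, not text
for call in r.message.tool_calls or []:
    print(call.function.name, "->", get_time())
```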

## Sources

- [Ollama on GitHub](https://github.com/ollama/ollama)
- [DEV: Voice agent with Whisper + Ollama](https://dev.to/nayana_shaji_m/building-a-voice-controlled-local-ai-agent-using-whisper-and-ollama-3mca)
- [Real-time voice agent in Python](https://medium.com/@TechSnazAI/building-a-real-time-voice-agent-in-python-whisper-ollama-vad-streamlit-3a19c5e91b15)
- [Ollama tutorial 2026](https://tech-insider.org/ollama-tutorial-run-llm-locally-2026/)

