AI Engineering

Build a Voice Agent with OpenAI Whisper + Llama 3.x via Ollama

Pure-Python voice agent: openai-whisper for STT, Ollama serving Llama 3.3 70B for the LLM, edge-tts for TTS. Zero API keys, runs on a single workstation.

TL;DR — If you want a voice agent today and don't care about the absolute lowest latency, OpenAI's reference whisper + Ollama (Llama 3.3 70B if you have the VRAM, 3.2 3B if you don't) + edge-tts (free Microsoft voices) is the most boring, most reliable build. Six imports, ~120 lines of code.

What you'll build

A console voice agent: hold the spacebar to talk, release to send. Whisper transcribes, Ollama replies, edge-tts speaks. Works on Windows, macOS, and Linux with one Python venv.

Prerequisites

  1. Python 3.11+, pip install openai-whisper sounddevice numpy keyboard ollama edge-tts pydub.
  2. ffmpeg in PATH (Whisper needs it).
  3. Ollama installed; ollama pull llama3.2:3b (or llama3.3:70b-instruct-q4_K_M if you have 48 GB of VRAM).
  4. Speakers and a microphone.
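
Before Step 1, a quick programmatic preflight can save a debugging round trip. A minimal sketch, assuming Ollama's default port and its /api/tags endpoint (which lists pulled models):

```python
# Preflight: confirm ffmpeg is on PATH and the Ollama server answers.
import shutil
import urllib.request

assert shutil.which("ffmpeg"), "ffmpeg not in PATH -- Whisper needs it"

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as r:
        print("Ollama is up, HTTP", r.status)
except OSError:
    print("Ollama not reachable -- start it with `ollama serve`")
```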

Architecture

```mermaid
flowchart LR
  KEY[Spacebar] --> REC[sounddevice]
  REC --> W[openai-whisper base.en]
  W -->|text| O[Ollama HTTP :11434]
  O --> ETTS[edge-tts MS Aria]
  ETTS --> SPK[Speaker]
```

Step 1 — Smoke-test the pieces

```bash
ollama run llama3.2:3b "Say hi in five words"
edge-tts --voice en-US-AriaNeural --text "Hi" --write-media hi.mp3 && \
  ffplay -nodisp -autoexit hi.mp3
python -c "import whisper; whisper.load_model('base.en')"
```

Three independent green lights = you're good.

Step 2 — Push-to-talk recorder

```python
import sounddevice as sd
import numpy as np
import keyboard

SR = 16000  # Whisper expects 16 kHz mono input

def push_to_talk():
    print("Hold SPACE to speak...")
    keyboard.wait("space")
    frames = []
    with sd.InputStream(samplerate=SR, channels=1, dtype="float32") as s:
        while keyboard.is_pressed("space"):
            ck, _ = s.read(1600)  # 1600 samples = 100 ms per chunk
            frames.append(ck)
    print("Got", len(frames), "frames")
    return np.concatenate(frames).flatten()
```

PTT avoids VAD tuning entirely — perfect for desktop assistants.
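
If you later want hands-free capture instead, a crude energy-gate VAD is only a few lines. A sketch under stated assumptions: THRESH and the one-second silence window are made-up starting values you would tune per microphone and room:

```python
# Hypothetical energy-gate VAD: record until ~1 s of sustained silence.
import numpy as np
import sounddevice as sd

SR = 16000
THRESH = 0.01    # RMS silence threshold -- tune per mic and room
MAX_SILENT = 10  # 10 chunks x 100 ms = 1 s of silence ends the utterance

def listen_until_silence():
    frames, silent = [], 0
    with sd.InputStream(samplerate=SR, channels=1, dtype="float32") as s:
        while silent < MAX_SILENT:
            ck, _ = s.read(1600)  # 100 ms chunk
            frames.append(ck)
            rms = float(np.sqrt(np.mean(ck ** 2)))
            silent = silent + 1 if rms < THRESH else 0
    return np.concatenate(frames).flatten()
```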

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Whisper transcription

```python
import whisper

model = whisper.load_model("base.en")

def transcribe(audio_f32):
    # Whisper accepts a float32 NumPy array at 16 kHz directly
    return model.transcribe(audio_f32, fp16=False, language="en")["text"].strip()
```

Use base.en for English-only; it's 4x faster than small for similar quality on short utterances.
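
If you're unsure which size clears your latency budget, a quick micro-benchmark settles it. A minimal sketch; sample.wav is a placeholder for any short test clip you record yourself:

```python
# Hypothetical micro-benchmark: transcription time per Whisper model size.
import time
import whisper

audio = whisper.load_audio("sample.wav")  # placeholder clip, resampled to 16 kHz

for name in ("tiny.en", "base.en", "small.en"):
    model = whisper.load_model(name)
    t0 = time.perf_counter()
    text = model.transcribe(audio, fp16=False, language="en")["text"].strip()
    print(f"{name}: {time.perf_counter() - t0:.2f}s -> {text[:60]}")
```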

Step 4 — Ollama chat

```python
import ollama

SYSTEM = "You are a friendly, concise desktop voice assistant. Reply in 1-2 sentences."

def reply(history, text):
    history.append({"role": "user", "content": text})
    r = ollama.chat(
        model="llama3.2:3b",
        messages=[{"role": "system", "content": SYSTEM}, *history],
        options={"temperature": 0.4, "num_predict": 160},  # cap reply length for TTS
    )
    history.append(r["message"])
    return r["message"]["content"]
```
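
One thing the snippet above doesn't handle: history grows without bound, and a long session will eventually overflow the model's context window. A minimal trim sketch; MAX_TURNS is an arbitrary budget, not a measured limit:

```python
# Hypothetical history cap: keep only the last N user/assistant pairs.
MAX_TURNS = 8  # arbitrary budget -- raise it if your context window allows

def trim(history):
    return history[-2 * MAX_TURNS:]  # each turn is two messages
```

Call history[:] = trim(history) after each reply; the system prompt is prepended fresh on every request, so it never gets trimmed away.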

Step 5 — edge-tts streaming

```python
import asyncio
import io

import edge_tts
from pydub import AudioSegment, playback

async def speak_async(text, voice="en-US-AriaNeural"):
    comm = edge_tts.Communicate(text, voice)
    buf = io.BytesIO()
    async for chunk in comm.stream():
        if chunk["type"] == "audio":
            buf.write(chunk["data"])
    buf.seek(0)
    playback.play(AudioSegment.from_file(buf, format="mp3"))

def speak(text):
    asyncio.run(speak_async(text))
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

edge-tts is an unofficial client, but it has been stable since 2022; it uses Microsoft's free Edge browser voices.

Step 6 — Glue + main loop

```python
history = []
speak("Hi, I'm Aria. Hold the spacebar to talk.")

while True:
    audio = push_to_talk()
    text = transcribe(audio)
    if not text:
        continue  # ignore empty or silent captures
    print("YOU:", text)
    out = reply(history, text)
    print("BOT:", out)
    speak(out)
```
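
If responses feel slow, it helps to know which stage is eating the budget. A minimal timing sketch; the timed helper is hypothetical, not part of any library used here:

```python
# Hypothetical latency probe: time each stage of the loop.
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    t0 = time.perf_counter()
    yield
    print(f"  [{label}: {time.perf_counter() - t0:.2f}s]")

# Usage inside the main loop:
#   with timed("stt"): text = transcribe(audio)
#   with timed("llm"): out = reply(history, text)
#   with timed("tts"): speak(out)
```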

Common pitfalls

  • ffmpeg missing. Whisper fails silently on macOS without brew install ffmpeg.
  • Ollama not running. ollama serve starts automatically on macOS but not on every Linux distro.
  • edge-tts rate limit. Don't loop synthesis without a backoff or Microsoft will throttle you; a retry sketch follows this list.
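
For the last pitfall, a retry wrapper is enough for a desktop assistant. A minimal sketch; catching a bare Exception is a deliberate simplification, since edge-tts raises library-specific errors in practice:

```python
# Hypothetical backoff wrapper around speak() from Step 5.
import time

def speak_with_backoff(text, retries=3, base_delay=1.0):
    for attempt in range(retries):
        try:
            speak(text)
            return
        except Exception as e:                  # simplification: catch-all
            wait = base_delay * (2 ** attempt)  # 1 s, 2 s, 4 s
            print(f"TTS failed ({e}); retrying in {wait:.0f}s")
            time.sleep(wait)
    print("TTS gave up; continuing without audio")
```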

How CallSphere does this in production

CallSphere's production path uses OpenAI Realtime + ElevenLabs for sub-500ms voice; this OSS stack is the right call for desktop assistants and offline demos. We run 37 specialists across 6 verticals — Healthcare's 14 HIPAA tools on FastAPI :8084, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, Behavioral — backed by 90+ tools and 115+ Postgres tables. Flat $149/$499/$1499. 14-day trial · 22% affiliate · /pricing.

FAQ

Llama 3.3 70B on consumer hardware? Q4_K_M fits in 48 GB; on 24 GB use Q3_K_S or Llama 3.1 8B.

Whisper accuracy? base.en ~7% WER on noisy speech; large-v3 ~3.5%.

Push-to-talk vs VAD? PTT for desktop, VAD for telephony.

Can I use OpenAI's hosted Whisper API? Yes — but then you're back on cloud egress.

Tools / function-calling? Llama 3.x in Ollama supports OpenAI-style tools since 0.4.
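
To make that concrete, a minimal tool-calling sketch, assuming the 0.4+ Python client, which accepts plain functions as tools and builds the JSON schema from the signature; get_time is a made-up example tool:

```python
# Hypothetical tool call with the ollama Python client; get_time is made up.
import datetime
import ollama

def get_time() -> str:
    """Return the current local time as HH:MM."""
    return datetime.datetime.now().strftime("%H:%M")

r = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "What time is it?"}],
    tools=[get_time],
)
for call in r["message"]["tool_calls"] or []:
    if call["function"]["name"] == "get_time":
        print("Tool result:", get_time())
```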


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.