By Sagar Shankaran, Founder of CallSphere
Pure-Python voice agent: openai-whisper for STT, Ollama serving Llama 3.3 70B for the LLM, edge-tts for TTS. Zero API keys, runs on a single workstation.
Key takeaways
TL;DR: If you want a voice agent today and don't care about the absolute lowest latency, OpenAI's reference whisper + Ollama (Llama 3.3 70B if you have the VRAM, 3.2 3B if you don't) + edge-tts (free Microsoft voices) is the most boring, most reliable build. Six imports, ~120 lines of code.
A console voice agent: hold the spacebar to talk, release to send. Whisper transcribes, Ollama replies, edge-tts speaks. Works on Windows, macOS, and Linux with one Python venv (on Linux the keyboard package needs root to read key events).
Install the Python packages: pip install openai-whisper sounddevice numpy keyboard ollama edge-tts pydub
Pull a model: ollama pull llama3.2:3b (or llama3.3:70b-instruct-q4_K_M if you have 48 GB of VRAM).
flowchart LR
KEY[Spacebar] --> REC[sounddevice]
REC --> W[openai-whisper base.en]
W -->|text| O[Ollama HTTP :11434]
O --> ETTS[edge-tts MS Aria]
ETTS --> SPK[Speaker]
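The 48 GB figure for the 70B model can be sanity-checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below assumes about 70.6 billion parameters and average bits-per-weight of 4.8 for Q4_K_M and 3.5 for Q3_K_S; those numbers are approximations supplied for illustration, not figures from this article, and the result excludes the KV cache and runtime overhead.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed values for illustration: ~70.6e9 parameters, and average
# bits-per-weight of ~4.8 (Q4_K_M) and ~3.5 (Q3_K_S).
for name, bpw in [("Q4_K_M", 4.8), ("Q3_K_S", 3.5)]:
    size = weights_gb(70.6e9, bpw)
    print(f"70B {name}: about {size:.0f} GB of weights, before KV cache")
```

Whatever headroom remains after the weights is what the context window and the rest of the runtime have to share.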
```bash
ollama run llama3.2:3b "Say hi in five words"
edge-tts --voice en-US-AriaNeural --text "Hi" --write-media hi.mp3 && \
  ffplay -nodisp -autoexit hi.mp3
python -c "import whisper; whisper.load_model('base.en')"
```
Three independent green lights = you're good.
```python
import sounddevice as sd, numpy as np, keyboard

SR = 16000  # Whisper expects 16 kHz mono audio

def push_to_talk():
    print("Hold SPACE to speak...")
    keyboard.wait("space")
    frames = []
    with sd.InputStream(samplerate=SR, channels=1, dtype="float32") as s:
        while keyboard.is_pressed("space"):
            ck, _ = s.read(1600)  # 1600 samples = 100 ms per chunk
            frames.append(ck)
    print("Got", len(frames), "frames")
    if not frames:  # key released before the first chunk was read
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(frames).flatten()
```
PTT avoids VAD tuning entirely — perfect for desktop assistants.
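For contrast, here is a minimal sketch of the kind of energy-based voice activity detection that push-to-talk sidesteps. It is not part of this build. The `threshold` and `hang_chunks` values are illustrative assumptions, and they are exactly the knobs that need tuning per microphone and room.

```python
import math

def rms(chunk):
    """Root-mean-square level of one chunk of float samples in [-1, 1]."""
    return math.sqrt(sum(x * x for x in chunk) / len(chunk)) if chunk else 0.0

def find_utterance(chunks, threshold=0.02, hang_chunks=5):
    """Return (start, end) chunk indices of the first utterance, or None.

    end is exclusive. threshold and hang_chunks are illustrative values.
    """
    start = None
    quiet = 0  # consecutive quiet chunks seen since the last loud one
    for i, chunk in enumerate(chunks):
        loud = rms(chunk) >= threshold
        if start is None:
            if loud:
                start = i
        elif loud:
            quiet = 0
        else:
            quiet += 1
            if quiet >= hang_chunks:
                return start, i - quiet + 1
    if start is None:
        return None
    return start, len(chunks) - quiet
```

Set the threshold too low and fan noise triggers a recording; set it too high and quiet speakers get cut off. A held spacebar has neither failure mode.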
```python
import whisper

model = whisper.load_model("base.en")

def transcribe(audio_f32):
    return model.transcribe(audio_f32, fp16=False, language="en")["text"].strip()
```
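Use base.en for English-only input; it is noticeably faster than small (the Whisper README lists relative speeds of roughly 7x for base and 4x for small, measured against large) with similar quality on short utterances.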
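Quality comparisons between Whisper models are usually stated as word error rate (WER), the metric quoted in the FAQ further down. As a rough sketch of how that number is computed, the function below counts word-level substitutions, insertions, and deletions against a reference transcript. The lowercase-and-split normalization is a simplifying assumption; published benchmarks normalize text more carefully.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1] / len(ref)

# One dropped word plus one wrong word out of five reference words.
print(wer("turn on the kitchen lights", "turn on kitchen light"))
```

Recording a handful of your own commands and scoring base.en against small this way is a quick check on whether the larger model is worth the extra latency on your microphone.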
```python
import ollama

SYSTEM = "You are a friendly, concise desktop voice assistant. Reply in 1-2 sentences."

def reply(history, text):
    history.append({"role": "user", "content": text})
    r = ollama.chat(
        model="llama3.2:3b",
        messages=[{"role": "system", "content": SYSTEM}, *history],
        options={"temperature": 0.4, "num_predict": 160},
    )
    history.append(r["message"])
    return r["message"]["content"]
```
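The ollama package is a thin client over the local HTTP server on port 11434, so the same call can be made with only the standard library. The sketch below assumes the default endpoint `http://localhost:11434/api/chat` and a non-streaming request. The helper names `build_chat_payload` and `chat_http` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint

def build_chat_payload(history, system, model="llama3.2:3b"):
    """Build the JSON body for a non-streaming /api/chat request."""
    return {
        "model": model,
        "messages": [{"role": "system", "content": system}, *history],
        "stream": False,  # ask for one JSON object instead of a chunk stream
        "options": {"temperature": 0.4, "num_predict": 160},
    }

def chat_http(history, system, timeout=60):
    """POST the payload and return the assistant text. Needs a running Ollama server."""
    body = json.dumps(build_chat_payload(history, system)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is sent when a reply looks wrong.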
```python
import asyncio, edge_tts, io
from pydub import AudioSegment, playback

async def speak_async(text, voice="en-US-AriaNeural"):
    comm = edge_tts.Communicate(text, voice)
    buf = io.BytesIO()
    async for chunk in comm.stream():
        if chunk["type"] == "audio":
            buf.write(chunk["data"])
    buf.seek(0)
    playback.play(AudioSegment.from_file(buf, format="mp3"))

def speak(text):
    asyncio.run(speak_async(text))
```
edge-tts is unofficial but stable since 2022; it uses Microsoft's free Edge browser voices.
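Because the speak function buffers the whole MP3 before playing it, a long reply delays the first audible word. One possible mitigation, sketched here as an assumption rather than something this build does, is to split the reply into sentences and speak them one at a time. The splitter is deliberately naive and will also break on abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Sure. It is 5 pm! Anything else?"))
```

In the main loop this would look like `for s in split_sentences(out): speak(s)`, trading a short gap between sentences for a quicker start.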
```python
history = []
speak("Hi, I'm Aria. Hold the spacebar to talk.")
while True:
    audio = push_to_talk()
    text = transcribe(audio)
    if not text:
        continue
    print("YOU:", text)
    out = reply(history, text)
    print("BOT:", out)
    speak(out)
```
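The loop body is a three-stage pipeline, and passing the stages in as arguments makes the wiring testable without a microphone, a model, or speakers. This is a sketch of that refactor, not the article's code: `run_turn` and the stand-in functions are names invented for illustration.

```python
def run_turn(audio, history, transcribe, reply, speak):
    """Run one STT -> LLM -> TTS turn; return the reply, or None if nothing was heard."""
    text = transcribe(audio)
    if not text:
        return None
    out = reply(history, text)
    speak(out)
    return out

# Stand-ins for the real stages, for illustration only.
spoken = []

def fake_reply(history, text):
    history.append({"role": "user", "content": text})
    return f"You said: {text}"

demo_history = []
result = run_turn(None, demo_history, lambda audio: "hello", fake_reply, spoken.append)
print(result, spoken, demo_history)
```

With the real functions, one turn becomes `run_turn(push_to_talk(), history, transcribe, reply, speak)`.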
whisper silently fails on macOS without brew install ffmpeg.
ollama serve is auto-started on macOS but not on all Linux distros.
CallSphere's production path uses OpenAI Realtime + ElevenLabs for sub-500ms voice; this OSS stack is the right call for desktop assistants and offline demos. We run 37 specialists across 6 verticals (Healthcare's 14 HIPAA tools on FastAPI :8084, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, and Behavioral), backed by 90+ tools and 115+ Postgres tables. Flat $149/$499/$1499 pricing with a 14-day trial.
Llama 3.3 70B on consumer hardware? Q4_K_M fits in 48 GB. On 24 GB, a 70B Q3_K_S file is still roughly 31 GB and needs partial CPU offload, so Llama 3.1 8B is the simpler choice.
Whisper accuracy? base.en ~7% WER on noisy speech; large-v3 ~3.5%.
Push-to-talk vs VAD? PTT for desktop, VAD for telephony.
Can I use OpenAI's hosted Whisper API? Yes — but then you're back on cloud egress.
Tools / function-calling? Llama 3.x in Ollama supports OpenAI-style tools since 0.4.
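As a rough sketch of what OpenAI-style tools look like in practice: a tool is described with a JSON schema, passed to the chat call via `tools=`, and any `tool_calls` in the assistant message are dispatched to local functions. The `get_time` tool and its canned reply are hypothetical, and the dispatcher assumes the message is a plain dict shaped like Ollama's chat response, with `arguments` already parsed into a dict.

```python
def get_time(city: str) -> str:
    """Hypothetical tool; a real one would look up the city's timezone."""
    return f"It is 12:00 in {city}"  # placeholder value for illustration only

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Get the current local time in a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_time": get_time}  # maps tool names to local callables

def run_tool_calls(message):
    """Execute each tool call in an assistant message; return 'tool' role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = REGISTRY.get(fn["name"])
        if handler is None:
            content = f"unknown tool {fn['name']}"
        else:
            content = handler(**fn["arguments"])
        results.append({"role": "tool", "content": content})
    return results
```

The returned messages would be appended to the history and sent back in a second chat call so the model can phrase the final spoken answer.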
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.