Build a Voice Agent with WhisperX + RAG via Haystack (2026)
WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what.
TL;DR — WhisperX = Whisper + word-level timestamps + Pyannote 3.1 speaker diarization. Haystack 2.x is the OSS pipeline framework that does retrieval cleanly. Combine them and your voice agent can answer "what did the customer agree to in the May 7 call?" with citations down to the second.
What you'll build
A FastAPI service that ingests call recordings, transcribes with WhisperX (diarized), indexes turn-level chunks into a Haystack InMemoryDocumentStore, then exposes a voice-driven RAG endpoint. Spoken question → faster-whisper STT → Haystack retrieve → Ollama generate → Piper TTS.
Prerequisites
- Python 3.11, then `pip install whisperx haystack-ai ollama-haystack faster-whisper piper-tts fastapi uvicorn`.
- NVIDIA GPU with 8 GB+ VRAM for WhisperX.
- Hugging Face token, with the Pyannote 3.1 license accepted on the model page.
- Ollama running with `llama3.1:8b` pulled.
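Before the first run, it's worth sanity-checking the GPU and token setup. A minimal check, assuming the token is exported as the `HF_TOKEN` environment variable (any variable name works; adjust to taste):

```python
# Quick environment sanity check before loading any models.
import os

import torch

assert torch.cuda.is_available(), "WhisperX large-v3 needs a CUDA GPU here"
print("GPU:", torch.cuda.get_device_name(0))

# Assumption: you keep your Hugging Face token in HF_TOKEN.
assert os.environ.get("HF_TOKEN"), "set HF_TOKEN after accepting the Pyannote 3.1 license"
```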
Architecture
```mermaid
flowchart LR
    REC[Call WAV] --> WX[WhisperX + Pyannote]
    WX -->|turns w/ speaker| HS[Haystack Pipeline]
    HS --> DS[(InMemoryDocStore)]
    Q[Voice Question] --> FW[faster-whisper]
    FW --> HS2[Haystack RAG]
    HS2 --> LLM[Ollama llama3.1:8b]
    LLM --> P[Piper TTS]
```
Step 1 — Transcribe and diarize a call
```python
import torch
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

def transcribe_diarize(wav_path, hf_token):
    audio = whisperx.load_audio(wav_path)
    # 1) Batched transcription
    result = model.transcribe(audio, batch_size=16, language="en")
    # 2) Forced alignment for word-level timestamps
    align_model, meta = whisperx.load_align_model(language_code="en", device=device)
    result = whisperx.align(result["segments"], align_model, meta, audio,
                            device, return_char_alignments=False)
    # 3) Pyannote diarization, then attach speaker labels to words/segments
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diar = diarize(audio)
    return whisperx.assign_word_speakers(diar, result)
```
Each entry in `segments` now carries `start`, `end`, `text`, and a `speaker` label like `SPEAKER_01`.
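A quick way to see the shape (file name, timestamps, and text below are illustrative):

```python
result = transcribe_diarize("call.wav", hf_token="hf_xxx")
print(result["segments"][0])
# {'start': 12.3, 'end': 15.0,
#  'text': ' We can commit to the June rollout.',
#  'speaker': 'SPEAKER_01',
#  'words': [{'word': 'We', 'start': 12.3, 'end': 12.4, ...}, ...]}
```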
Step 2 — Index turns into Haystack
```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

store = InMemoryDocumentStore(embedding_similarity_function="cosine")

def index_segments(segments, call_id):
    docs = [
        Document(
            content=s["text"],
            meta={"call_id": call_id, "speaker": s["speaker"],
                  "start": s["start"], "end": s["end"]},
        )
        for s in segments if s.get("text")
    ]
    p = Pipeline()
    p.add_component("emb", SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2"))
    # The store has no .writer() helper; writing goes through DocumentWriter.
    p.add_component("write", DocumentWriter(document_store=store))
    p.connect("emb.documents", "write.documents")
    p.run({"emb": {"documents": docs}})
```
Step 3 — Build the retrieval-augmented chat pipeline
```python
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

# The citation instruction is literal text; only the loop is Jinja-templated.
PROMPT = """Answer using only the call excerpts below.
Cite as [speaker @ start s], for example [SPEAKER_01 @ 42.5s].

Context:
{% for doc in documents %}
- {{ doc.content }} ({{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s)
{% endfor %}

Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("qemb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retr", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
rag.add_component("prompt", PromptBuilder(template=PROMPT))
rag.add_component("llm", OllamaGenerator(model="llama3.1:8b",
                                         url="http://127.0.0.1:11434"))
rag.connect("qemb.embedding", "retr.query_embedding")
rag.connect("retr.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
```
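Before wiring up audio, a text-only smoke test confirms the pipeline runs end to end (the question is just an example):

```python
q = "What did the customer agree to in the May 7 call?"
out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
print(out["llm"]["replies"][0])  # answer with [SPEAKER_XX @ ...s] citations
```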
Step 4 — Voice-driven query
```python
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")

def voice_query(audio_f32, history):
    # history is unused here; kept for a future multi-turn extension
    segs, _ = stt.transcribe(audio_f32, language="en", vad_filter=True)
    q = " ".join(s.text for s in segs).strip()
    # The retriever only accepts query_embedding, which arrives via qemb;
    # a raw "retr": {"query": q} input would be rejected by the pipeline.
    out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
    return q, out["llm"]["replies"][0]
```
Step 5 — FastAPI ingest + ask endpoints
```python
import io
import uuid

import numpy as np
import soundfile as sf
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
HF = "hf_xxx"  # your Hugging Face token

@app.post("/ingest")
async def ingest(f: UploadFile = File(...)):
    p = f"/tmp/{uuid.uuid4()}.wav"
    with open(p, "wb") as out:
        out.write(await f.read())
    segs = transcribe_diarize(p, HF)
    index_segments(segs["segments"], call_id=f.filename)
    return {"ok": True, "turns": len(segs["segments"])}

@app.post("/ask") async def ask(f: UploadFile = File(...)): pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32") if sr != 16000: pcm = _resample(pcm, sr, 16000) q, a = voice_query(pcm, []) return {"question":q, "answer":a} ```
Step 6 — Speak the answer
```python
import subprocess

def speak(text):
    # Pass text on stdin and args as a list to avoid shell injection.
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", "/tmp/o.wav"],
                   input=text.encode("utf-8"), check=True)
    subprocess.run(["aplay", "/tmp/o.wav"], check=True)
```
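If the client should hear the answer rather than read JSON, a hypothetical `/ask_voice` variant can return the Piper WAV directly. A sketch reusing names from the earlier blocks; the endpoint name and temp-file handling are illustrative:

```python
from fastapi.responses import FileResponse

@app.post("/ask_voice")
async def ask_voice(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)
    _, answer = voice_query(pcm, [])
    wav = f"/tmp/{uuid.uuid4()}.wav"
    # Synthesize only; skip server-side aplay playback
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", wav],
                   input=answer.encode("utf-8"), check=True)
    return FileResponse(wav, media_type="audio/wav")
```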
Common pitfalls
- Pyannote license. Skipping the HF accept step throws a cryptic 401 on first run.
- Speaker labels reset per call. Don't assume `SPEAKER_00` is the same person across calls; cluster speaker embeddings yourself (a sketch follows this list).
- MiniLM dimension mismatch. If you change embedders, re-index everything; stored vectors from a different model won't match.
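One way to get stable identities, sketched under heavy assumptions: embed each speaker's concatenated audio with an ECAPA model (speechbrain here; the import moved to `speechbrain.inference` in 1.0) and greedily match against known centroids. The 0.6 cosine threshold is illustrative and needs tuning on your data.

```python
import numpy as np
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in >=1.0

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
known: dict[str, np.ndarray] = {}  # global id -> unit-norm centroid

def global_speaker(wav_16k: np.ndarray, threshold: float = 0.6) -> str:
    """Map one speaker's float32 16 kHz audio from a call to a global identity."""
    with torch.no_grad():
        e = encoder.encode_batch(torch.from_numpy(wav_16k).unsqueeze(0))
    e = e.squeeze().numpy()
    e /= np.linalg.norm(e)
    # Greedy nearest-centroid match; mint a new identity if nothing clears threshold
    best, best_sim = None, threshold
    for gid, c in known.items():
        sim = float(e @ c)
        if sim > best_sim:
            best, best_sim = gid, sim
    if best is None:
        best = f"GLOBAL_{len(known):02d}"
        known[best] = e
    else:
        c = known[best] + e
        known[best] = c / np.linalg.norm(c)  # running centroid update
    return best
```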
How CallSphere does this in production
CallSphere ingests every call into a HIPAA-aware transcription and diarization pipeline. Downstream, 37 agents across 6 verticals (Healthcare with 14 tools on FastAPI :8084 over OpenAI Realtime, OneRoof real estate with 10 specialists over WebRTC, plus Salon, Dental, F&B, and Behavioral) use call history for context, tied together by 90+ tools and 115+ Postgres tables.
FAQ
WhisperX speed? ~70x realtime on a 4090 with batched inference.
Speaker count? Pyannote estimates it automatically, and you can constrain it by passing min_speakers/max_speakers to the diarization call; quality degrades past roughly 10 speakers.
Can I use Postgres + pgvector instead of InMemory? Yes: swap InMemoryDocumentStore for PgvectorDocumentStore and InMemoryEmbeddingRetriever for PgvectorEmbeddingRetriever (sketch after this FAQ).
Real-time diarization? Still maturing; pyannote-based streaming approaches are viable for short windows but less accurate than offline diarization, which benefits from clustering over the whole recording.
Best Haystack version? 2.7 or later for the stable Ollama integration via ollama-haystack.
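A minimal sketch of the pgvector swap, assuming the pgvector-haystack integration is installed and PG_CONN_STR points at your database; the table name is an example, and 384 matches all-MiniLM-L6-v2's output size:

```python
# pip install pgvector-haystack; the store reads PG_CONN_STR for the connection
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

store = PgvectorDocumentStore(
    table_name="call_turns",             # example name
    embedding_dimension=384,             # all-MiniLM-L6-v2 output size
    vector_function="cosine_similarity",
    recreate_table=True,
)
retriever = PgvectorEmbeddingRetriever(document_store=store, top_k=5)
```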