AI Engineering · 13 min read

Build a Voice Agent with WhisperX + RAG via Haystack (2026)

WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what.

TL;DR — WhisperX = Whisper + word-level timestamps + Pyannote 3.1 speaker diarization. Haystack 2.x is the OSS pipeline framework that does retrieval cleanly. Combine them and your voice agent can answer "what did the customer agree to in the May 7 call?" with citations down to the second.

What you'll build

A FastAPI service that ingests call recordings, transcribes with WhisperX (diarized), indexes turn-level chunks into a Haystack InMemoryDocumentStore, then exposes a voice-driven RAG endpoint. Spoken question → faster-whisper STT → Haystack retrieve → Ollama generate → Piper TTS.

Prerequisites

  1. Python 3.11; `pip install whisperx haystack-ai ollama-haystack faster-whisper piper-tts fastapi uvicorn`.
  2. NVIDIA GPU with 8 GB+ VRAM for WhisperX.
  3. Hugging Face token + accept the Pyannote 3.1 license.
  4. Ollama running with llama3.1:8b.

Architecture

```mermaid
flowchart LR
  REC[Call WAV] --> WX[WhisperX + Pyannote]
  WX -->|turns w/ speaker| HS[Haystack Pipeline]
  HS --> DS[(InMemoryDocStore)]
  Q[Voice Question] --> FW[faster-whisper]
  FW --> HS2[Haystack RAG]
  HS2 --> LLM[Ollama llama3.1:8b]
  LLM --> P[Piper TTS]
```

Step 1 — Transcribe and diarize a call

```python
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

def transcribe_diarize(wav_path, hf_token):
    audio = whisperx.load_audio(wav_path)
    result = model.transcribe(audio, batch_size=16, language="en")
    # word-level alignment
    align_model, meta = whisperx.load_align_model(language_code="en", device=device)
    result = whisperx.align(result["segments"], align_model, meta, audio, device,
                            return_char_alignments=False)
    # speaker diarization (requires accepted Pyannote license + HF token)
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diar = diarize(audio)
    return whisperx.assign_word_speakers(diar, result)
```

Each entry in `result["segments"]` now carries `{start, end, text, speaker: "SPEAKER_01"}`.
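WhisperX emits one segment per utterance, which can be choppy. Since Step 2 indexes "turn-level chunks", you may want to merge consecutive same-speaker segments first. A minimal sketch (the `merge_turns` helper and its `max_gap` threshold are assumptions, not part of WhisperX):

```python
def merge_turns(segments, max_gap=1.0):
    """Merge consecutive segments from the same speaker into one turn.

    Segments closer than `max_gap` seconds are treated as one continuous turn.
    """
    turns = []
    for s in segments:
        if not s.get("text"):
            continue
        last = turns[-1] if turns else None
        if (last and last["speaker"] == s.get("speaker")
                and s["start"] - last["end"] <= max_gap):
            last["text"] += " " + s["text"].strip()
            last["end"] = s["end"]
        else:
            turns.append({"speaker": s.get("speaker", "UNKNOWN"),
                          "start": s["start"], "end": s["end"],
                          "text": s["text"].strip()})
    return turns
```

Feed the merged turns into `index_segments` instead of the raw segments to keep retrieval chunks coherent.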

Step 2 — Index turns into Haystack

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

store = InMemoryDocumentStore(embedding_similarity_function="cosine")

def index_segments(segments, call_id):
    docs = [
        Document(
            content=s["text"],
            meta={
                "call_id": call_id,
                "speaker": s.get("speaker", "UNKNOWN"),
                "start": s["start"],
                "end": s["end"],
            },
        )
        for s in segments
        if s.get("text")
    ]
    p = Pipeline()
    p.add_component("emb", SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2"))
    # the document store has no .writer() method; use the DocumentWriter component
    p.add_component("write", DocumentWriter(document_store=store))
    p.connect("emb.documents", "write.documents")
    p.run({"emb": {"documents": docs}})
```

Step 3 — Build the retrieval-augmented chat pipeline

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

# The citation format is literal text in the instruction line; only the
# for-loop is Jinja. (A {{ doc.meta.speaker }} outside the loop would be
# undefined and render as an empty string.)
PROMPT = """Answer using only the call excerpts below.
Cite each claim as [SPEAKER @ start s], e.g. [SPEAKER_01 @ 42.5s].

Context:
{% for doc in documents %}
- {{ doc.content }} ({{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s)
{% endfor %}

Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("qemb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retr", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
rag.add_component("prompt", PromptBuilder(template=PROMPT))
rag.add_component("llm", OllamaGenerator(model="llama3.1:8b",
                                         url="http://127.0.0.1:11434"))
rag.connect("qemb.embedding", "retr.query_embedding")
rag.connect("retr.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
```

Step 4 — Voice-driven query

```python
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")

def voice_query(audio_f32, history):
    segs, _ = stt.transcribe(audio_f32, language="en", vad_filter=True)
    q = " ".join(s.text for s in segs).strip()
    # the embedding retriever only takes a query embedding, wired from "qemb"
    out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
    return q, out["llm"]["replies"][0]
```

Step 5 — FastAPI ingest + ask endpoints

```python
import io
import uuid

import numpy as np
import soundfile as sf
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
HF = "hf_xxx"  # your Hugging Face token

def _resample(pcm, sr_in, sr_out):
    # naive linear-interpolation resample; swap in soxr/librosa for quality
    n_out = int(len(pcm) * sr_out / sr_in)
    return np.interp(np.linspace(0, len(pcm) - 1, n_out),
                     np.arange(len(pcm)), pcm).astype(np.float32)

@app.post("/ingest")
async def ingest(f: UploadFile = File(...)):
    p = f"/tmp/{uuid.uuid4()}.wav"
    with open(p, "wb") as out:
        out.write(await f.read())
    segs = transcribe_diarize(p, HF)
    index_segments(segs["segments"], call_id=f.filename)
    return {"ok": True, "turns": len(segs["segments"])}

@app.post("/ask")
async def ask(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)
    q, a = voice_query(pcm, [])
    return {"question": q, "answer": a}
```

Step 6 — Speak the answer

```python
import subprocess

def speak(text):
    # pass the text on stdin instead of interpolating it into a shell string,
    # which would break on quotes and invite command injection
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium", "--output_file", "/tmp/o.wav"],
        input=text.encode(), check=True,
    )
    subprocess.run(["aplay", "/tmp/o.wav"], check=True)
```

Common pitfalls

  • Pyannote license. Skipping the HF accept step throws a cryptic 401 on first run.
  • Speaker labels reset per call. Don't assume SPEAKER_00 is the same person across calls; cluster embeddings yourself.
  • MiniLM dimension mismatch. If you change embedders, re-index everything.
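On the second pitfall: to track a speaker across calls, you would extract one embedding per diarized speaker (e.g. with a speaker-verification model; not shown here) and match it against a registry. A minimal greedy cosine-matching sketch, where `match_speakers` and its `threshold` are illustrative assumptions:

```python
import numpy as np

def match_speakers(call_embs, known, threshold=0.75):
    """Map per-call labels like SPEAKER_00 to global identities.

    call_embs: {"SPEAKER_00": np.ndarray}  embeddings from this call
    known:     {"alice": np.ndarray}       registry of known speakers
    Labels whose best cosine similarity stays below `threshold` keep
    their per-call label (treated as an unknown speaker).
    """
    mapping = {}
    for label, e in call_embs.items():
        e = e / np.linalg.norm(e)
        best, best_sim = None, threshold
        for name, k in known.items():
            sim = float(e @ (k / np.linalg.norm(k)))
            if sim > best_sim:
                best, best_sim = name, sim
        mapping[label] = best or label
    return mapping
```

Production systems usually do proper clustering with enrollment, but this shows the shape of the problem: diarization labels are call-local, identity is not.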

How CallSphere does this in production

CallSphere ingests every call into a HIPAA-aware transcription and diarization pipeline. Downstream, 37 agents across 6 verticals (Healthcare: 14 tools, FastAPI on :8084, OpenAI Realtime; OneRoof real estate: 10 specialists over WebRTC; plus Salon, Dental, F&B, and Behavioral) use call history for context, tied together by 90+ tools and 115+ Postgres tables. Flat pricing at $149/$499/$1499, 14-day trial, 22% affiliate program, live demo at /demo.

FAQ

WhisperX speed? ~70x realtime on a 4090 with batched inference.

Speaker count? WhisperX auto-detects up to ~10 cleanly.

Can I use Postgres + pgvector instead of InMemory? Yes — swap InMemoryDocumentStore for PgvectorDocumentStore.
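The swap is mostly a one-line change at the store. A sketch, assuming the `pgvector-haystack` integration package and its default `PG_CONN_STR` environment variable for the connection string (config fragment, not run here):

```python
# pip install pgvector-haystack; export PG_CONN_STR="postgresql://..."
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

store = PgvectorDocumentStore(
    table_name="call_turns",
    embedding_dimension=384,  # must match all-MiniLM-L6-v2
    vector_function="cosine_similarity",
    recreate_table=False,
)
```

Swap `InMemoryEmbeddingRetriever` for the matching pgvector retriever and the rest of the pipeline is unchanged.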

Real-time diarization? Pyannote real-time module is in beta — viable for short windows.

Best Haystack version? Haystack 2.7 or later for stable Ollama integration.
