Build a Voice Agent with WhisperX + RAG via Haystack (2026)
WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what.
TL;DR — WhisperX = Whisper + word-level timestamps + Pyannote 3.1 speaker diarization. Haystack 2.x is the OSS pipeline framework that does retrieval cleanly. Combine them and your voice agent can answer "what did the customer agree to in the May 7 call?" with citations down to the second.
What you'll build
A FastAPI service that ingests call recordings, transcribes with WhisperX (diarized), indexes turn-level chunks into a Haystack InMemoryDocumentStore, then exposes a voice-driven RAG endpoint. Spoken question → faster-whisper STT → Haystack retrieve → Ollama generate → Piper TTS.
Prerequisites
- Python 3.11, then `pip install whisperx haystack-ai ollama-haystack faster-whisper piper-tts fastapi uvicorn`.
- NVIDIA GPU with 8 GB+ VRAM for WhisperX.
- Hugging Face token, with the Pyannote 3.1 license accepted on the model page.
- Ollama running with `llama3.1:8b` pulled.
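Before the first run, it's worth sanity-checking the GPU and token setup. A minimal check, assuming the token is exported as the `HF_TOKEN` environment variable (any variable name works; adjust to taste):

```python
# Quick environment sanity check before loading any models.
import os

import torch

assert torch.cuda.is_available(), "WhisperX large-v3 needs a CUDA GPU here"
print("GPU:", torch.cuda.get_device_name(0))

# Assumption: you keep your Hugging Face token in HF_TOKEN.
assert os.environ.get("HF_TOKEN"), "set HF_TOKEN after accepting the Pyannote 3.1 license"
```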
Architecture
```mermaid
flowchart LR
    REC[Call WAV] --> WX[WhisperX + Pyannote]
    WX -->|turns w/ speaker| HS[Haystack Pipeline]
    HS --> DS[(InMemoryDocStore)]
    Q[Voice Question] --> FW[faster-whisper]
    FW --> HS2[Haystack RAG]
    HS2 --> LLM[Ollama llama3.1:8b]
    LLM --> P[Piper TTS]
```
Step 1 — Transcribe and diarize a call
```python
import torch
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

def transcribe_diarize(wav_path, hf_token):
    audio = whisperx.load_audio(wav_path)
    # 1) Batched transcription
    result = model.transcribe(audio, batch_size=16, language="en")
    # 2) Forced alignment for word-level timestamps
    align_model, meta = whisperx.load_align_model(language_code="en", device=device)
    result = whisperx.align(result["segments"], align_model, meta, audio,
                            device, return_char_alignments=False)
    # 3) Pyannote diarization, then attach speaker labels to words/segments
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diar = diarize(audio)
    return whisperx.assign_word_speakers(diar, result)
```
Each entry in `segments` now carries `start`, `end`, `text`, and a `speaker` label like `SPEAKER_01`.
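A quick way to see the shape (file name, timestamps, and text below are illustrative):

```python
result = transcribe_diarize("call.wav", hf_token="hf_xxx")
print(result["segments"][0])
# {'start': 12.3, 'end': 15.0,
#  'text': ' We can commit to the June rollout.',
#  'speaker': 'SPEAKER_01',
#  'words': [{'word': 'We', 'start': 12.3, 'end': 12.4, ...}, ...]}
```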
Step 2 — Index turns into Haystack
```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

store = InMemoryDocumentStore(embedding_similarity_function="cosine")

def index_segments(segments, call_id):
    docs = [
        Document(
            content=s["text"],
            meta={"call_id": call_id, "speaker": s["speaker"],
                  "start": s["start"], "end": s["end"]},
        )
        for s in segments if s.get("text")
    ]
    p = Pipeline()
    p.add_component("emb", SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2"))
    # The store has no .writer() helper; writing goes through DocumentWriter.
    p.add_component("write", DocumentWriter(document_store=store))
    p.connect("emb.documents", "write.documents")
    p.run({"emb": {"documents": docs}})
```
Step 3 — Build the retrieval-augmented chat pipeline
```python
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

# The citation instruction is literal text; only the loop is Jinja-templated.
PROMPT = """Answer using only the call excerpts below.
Cite as [speaker @ start s], for example [SPEAKER_01 @ 42.5s].

Context:
{% for doc in documents %}
- {{ doc.content }} ({{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s)
{% endfor %}

Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("qemb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retr", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
rag.add_component("prompt", PromptBuilder(template=PROMPT))
rag.add_component("llm", OllamaGenerator(model="llama3.1:8b",
                                         url="http://127.0.0.1:11434"))
rag.connect("qemb.embedding", "retr.query_embedding")
rag.connect("retr.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
```
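Before wiring up audio, a text-only smoke test confirms the pipeline runs end to end (the question is just an example):

```python
q = "What did the customer agree to in the May 7 call?"
out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
print(out["llm"]["replies"][0])  # answer with [SPEAKER_XX @ ...s] citations
```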
Step 4 — Voice-driven query
```python
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")

def voice_query(audio_f32, history):
    # history is unused here; kept for a future multi-turn extension
    segs, _ = stt.transcribe(audio_f32, language="en", vad_filter=True)
    q = " ".join(s.text for s in segs).strip()
    # The retriever only accepts query_embedding, which arrives via qemb;
    # a raw "retr": {"query": q} input would be rejected by the pipeline.
    out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
    return q, out["llm"]["replies"][0]
```
Step 5 — FastAPI ingest + ask endpoints
```python
import io
import uuid

import numpy as np
import soundfile as sf
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
HF = "hf_xxx"  # your Hugging Face token

@app.post("/ingest")
async def ingest(f: UploadFile = File(...)):
    p = f"/tmp/{uuid.uuid4()}.wav"
    with open(p, "wb") as out:
        out.write(await f.read())
    segs = transcribe_diarize(p, HF)
    index_segments(segs["segments"], call_id=f.filename)
    return {"ok": True, "turns": len(segs["segments"])}

@app.post("/ask") async def ask(f: UploadFile = File(...)): pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32") if sr != 16000: pcm = _resample(pcm, sr, 16000) q, a = voice_query(pcm, []) return {"question":q, "answer":a} ```
Step 6 — Speak the answer
```python
import subprocess

def speak(text):
    # Pass text on stdin and args as a list to avoid shell injection.
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", "/tmp/o.wav"],
                   input=text.encode("utf-8"), check=True)
    subprocess.run(["aplay", "/tmp/o.wav"], check=True)
```
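If the client should hear the answer rather than read JSON, a hypothetical `/ask_voice` variant can return the Piper WAV directly. A sketch reusing names from the earlier blocks; the endpoint name and temp-file handling are illustrative:

```python
from fastapi.responses import FileResponse

@app.post("/ask_voice")
async def ask_voice(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)
    _, answer = voice_query(pcm, [])
    wav = f"/tmp/{uuid.uuid4()}.wav"
    # Synthesize only; skip server-side aplay playback
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", wav],
                   input=answer.encode("utf-8"), check=True)
    return FileResponse(wav, media_type="audio/wav")
```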
Common pitfalls
- Pyannote license. Skipping the HF accept step throws a cryptic 401 on first run.
- Speaker labels reset per call. Don't assume `SPEAKER_00` is the same person across calls; cluster speaker embeddings yourself (a sketch follows this list).
- MiniLM dimension mismatch. If you change embedders, re-index everything; stored vectors from a different model won't match.
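One way to get stable identities, sketched under heavy assumptions: embed each speaker's concatenated audio with an ECAPA model (speechbrain here; the import moved to `speechbrain.inference` in 1.0) and greedily match against known centroids. The 0.6 cosine threshold is illustrative and needs tuning on your data.

```python
import numpy as np
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in >=1.0

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
known: dict[str, np.ndarray] = {}  # global id -> unit-norm centroid

def global_speaker(wav_16k: np.ndarray, threshold: float = 0.6) -> str:
    """Map one speaker's float32 16 kHz audio from a call to a global identity."""
    with torch.no_grad():
        e = encoder.encode_batch(torch.from_numpy(wav_16k).unsqueeze(0))
    e = e.squeeze().numpy()
    e /= np.linalg.norm(e)
    # Greedy nearest-centroid match; mint a new identity if nothing clears threshold
    best, best_sim = None, threshold
    for gid, c in known.items():
        sim = float(e @ c)
        if sim > best_sim:
            best, best_sim = gid, sim
    if best is None:
        best = f"GLOBAL_{len(known):02d}"
        known[best] = e
    else:
        c = known[best] + e
        known[best] = c / np.linalg.norm(c)  # running centroid update
    return best
```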
How CallSphere does this in production
CallSphere ingests every call into a HIPAA-aware transcription and diarization pipeline. Downstream, 37 agents across 6 verticals (Healthcare with 14 tools on FastAPI :8084 over OpenAI Realtime, OneRoof real estate with 10 specialists over WebRTC, plus Salon, Dental, F&B, and Behavioral) use call history for context, tied together by 90+ tools and 115+ Postgres tables.
FAQ
WhisperX speed? ~70x realtime on a 4090 with batched inference.
Speaker count? Pyannote estimates it automatically, and you can constrain it by passing min_speakers/max_speakers to the diarization call; quality degrades past roughly 10 speakers.
Can I use Postgres + pgvector instead of InMemory? Yes: swap InMemoryDocumentStore for PgvectorDocumentStore and InMemoryEmbeddingRetriever for PgvectorEmbeddingRetriever (sketch after this FAQ).
Real-time diarization? Still maturing; pyannote-based streaming approaches are viable for short windows but less accurate than offline diarization, which benefits from clustering over the whole recording.
Best Haystack version? 2.7 or later for the stable Ollama integration via ollama-haystack.
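A minimal sketch of the pgvector swap, assuming the pgvector-haystack integration is installed and PG_CONN_STR points at your database; the table name is an example, and 384 matches all-MiniLM-L6-v2's output size:

```python
# pip install pgvector-haystack; the store reads PG_CONN_STR for the connection
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

store = PgvectorDocumentStore(
    table_name="call_turns",             # example name
    embedding_dimension=384,             # all-MiniLM-L6-v2 output size
    vector_function="cosine_similarity",
    recreate_table=True,
)
retriever = PgvectorEmbeddingRetriever(document_store=store, top_k=5)
```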