---
title: "Build a Voice Agent with WhisperX + RAG via Haystack (2026)"
description: "WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-whisperx-haystack-rag
category: "AI Engineering"
tags: ["WhisperX", "Haystack", "RAG", "Diarization", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-06T00:00:00.000Z
updated: 2026-05-07T16:13:45.860Z
---

# Build a Voice Agent with WhisperX + RAG via Haystack (2026)

> WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what.

> **TL;DR** — WhisperX = Whisper + word-level timestamps + Pyannote 3.1 speaker diarization. Haystack 2.x is the OSS pipeline framework that does retrieval cleanly. Combine them and your voice agent can answer "what did the customer agree to in the May 7 call?" with citations down to the second.

## What you'll build

A FastAPI service that ingests call recordings, transcribes with WhisperX (diarized), indexes turn-level chunks into a Haystack `InMemoryDocumentStore`, then exposes a voice-driven RAG endpoint. Spoken question → faster-whisper STT → Haystack retrieve → Ollama generate → Piper TTS.

## Prerequisites

1. Python 3.11, `pip install whisperx haystack-ai ollama-haystack sentence-transformers faster-whisper piper-tts soundfile fastapi uvicorn` (the Haystack sentence-transformers embedders and the Step 5 audio decoding need `sentence-transformers` and `soundfile`).
2. NVIDIA GPU with 8 GB+ VRAM for WhisperX.
3. Hugging Face token + accept the Pyannote 3.1 license.
4. Ollama running with `llama3.1:8b`.
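
Before wiring anything together, it's worth a ten-second sanity check that the GPU and Ollama are actually reachable. A minimal sketch; the URL assumes Ollama's default port:

```python
import torch, urllib.request

assert torch.cuda.is_available(), "WhisperX needs a CUDA-capable GPU"
print(torch.cuda.get_device_name(0))

# Ollama's root endpoint answers plain GET requests when the daemon is up.
with urllib.request.urlopen("http://127.0.0.1:11434") as r:
    print(r.read().decode())  # "Ollama is running"
```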

## Architecture

```mermaid
flowchart LR
  REC[Call WAV] --> WX[WhisperX + Pyannote]
  WX -->|turns w/ speaker| HS[Haystack Pipeline]
  HS --> DS[(InMemoryDocStore)]
  Q[Voice Question] --> FW[faster-whisper]
  FW --> HS2[Haystack RAG]
  HS2 --> LLM[Ollama llama3.1:8b]
  LLM --> P[Piper TTS]
```

## Step 1 — Transcribe and diarize a call

```python
import whisperx, torch

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

def transcribe_diarize(wav_path, hf_token):
    audio = whisperx.load_audio(wav_path)
    # 1. Batched transcription (fast, but timestamps are segment-level only).
    result = model.transcribe(audio, batch_size=16, language="en")
    # 2. Forced alignment adds word-level timestamps.
    align_model, meta = whisperx.load_align_model(language_code="en", device=device)
    result = whisperx.align(result["segments"], align_model, meta, audio, device,
                            return_char_alignments=False)
    # 3. Pyannote diarization, then map speaker labels onto the aligned words.
    #    (Newer whisperx releases expose this as whisperx.diarize.DiarizationPipeline.)
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diar = diarize(audio)
    return whisperx.assign_word_speakers(diar, result)
```

Each entry in the returned `result["segments"]` now carries `{start, end, text, speaker: "SPEAKER_01"}`.
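
WhisperX often splits one speaker's turn across several segments, and the goal here is turn-level chunks. A small merge pass fixes that; a minimal sketch, where the 1.0 s `max_gap` is an arbitrary default to tune:

```python
def merge_turns(segments, max_gap=1.0):
    """Merge consecutive segments from the same speaker into single turns.

    max_gap: maximum silence (seconds) tolerated inside one turn.
    """
    turns = []
    for s in segments:
        if not s.get("text"):
            continue
        prev = turns[-1] if turns else None
        same_speaker = prev and prev["speaker"] == s.get("speaker")
        if same_speaker and s["start"] - prev["end"] <= max_gap:
            prev["text"] += " " + s["text"].strip()
            prev["end"] = s["end"]
        else:
            turns.append({"speaker": s.get("speaker", "UNKNOWN"),
                          "start": s["start"], "end": s["end"],
                          "text": s["text"].strip()})
    return turns
```

Feed `merge_turns(result["segments"])` into the indexer below instead of the raw segments.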

## Step 2 — Index turns into Haystack

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

store = InMemoryDocumentStore(embedding_similarity_function="cosine")

def index_segments(segments, call_id):
    docs = [Document(content=s["text"],
                     meta={"call_id": call_id, "speaker": s["speaker"],
                           "start": s["start"], "end": s["end"]})
            for s in segments if s.get("text")]
    p = Pipeline()
    p.add_component("emb", SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2"))
    # Document stores have no .writer() in Haystack 2.x; wrap the store
    # in a DocumentWriter component instead.
    p.add_component("write", DocumentWriter(document_store=store))
    p.connect("emb.documents", "write.documents")
    p.run({"emb": {"documents": docs}})
```
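
End to end, ingestion is then two calls (the path and token are placeholders):

```python
result = transcribe_diarize("/tmp/call.wav", hf_token="hf_xxx")
index_segments(merge_turns(result["segments"]), call_id="call-2026-05-07")
```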

## Step 3 — Build the retrieval-augmented chat pipeline

```python
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

PROMPT = """Answer using only the call excerpts below. Cite as [{{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s].
Context: {% for doc in documents %}- {{ doc.content }} ({{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s)
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("qemb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retr", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
rag.add_component("prompt", PromptBuilder(template=PROMPT))
rag.add_component("llm", OllamaGenerator(model="llama3.1:8b", url="[http://127.0.0.1:11434](http://127.0.0.1:11434)"))
rag.connect("qemb.embedding", "retr.query_embedding")
rag.connect("retr.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
```
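
A quick text-only smoke test before wiring up voice (the question is just an example). Note the retriever takes no direct input; it receives `query_embedding` from `qemb` through the pipeline connection:

```python
q = "What did the customer agree to in the May 7 call?"
out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
print(out["llm"]["replies"][0])
```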

## Step 4 — Voice-driven query

```python
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")

def voice_query(audio_f32):
    # faster-whisper accepts a mono float32 numpy array at 16 kHz directly.
    segs, _ = stt.transcribe(audio_f32, language="en", vad_filter=True)
    q = " ".join(s.text for s in segs).strip()
    # Only qemb and prompt need inputs; InMemoryEmbeddingRetriever has no
    # "query" input, so passing one would raise an error.
    out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
    return q, out["llm"]["replies"][0]
```

## Step 5 — FastAPI ingest + ask endpoints

```python
from fastapi import FastAPI, UploadFile, File
import io, uuid
import numpy as np
import soundfile as sf

app = FastAPI()
HF = "hf_xxx"  # your Hugging Face token

@app.post("/ingest")
async def ingest(f: UploadFile = File(...)):
    p = f"/tmp/{uuid.uuid4()}.wav"
    with open(p, "wb") as fh:
        fh.write(await f.read())
    result = transcribe_diarize(p, HF)
    turns = merge_turns(result["segments"])
    index_segments(turns, call_id=f.filename)
    return {"ok": True, "turns": len(turns)}

@app.post("/ask")
async def ask(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if pcm.ndim > 1:                     # downmix stereo to mono
        pcm = pcm.mean(axis=1)
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)  # helper defined below
    q, a = voice_query(pcm)
    return {"question": q, "answer": a}
```
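
The `/ask` handler leans on a `_resample` helper the snippet never defines. A minimal linear-interpolation sketch; fine for speech queries, but reach for `soxr` or `torchaudio` if quality matters:

```python
def _resample(pcm: np.ndarray, sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampler for mono float32 audio."""
    n_out = int(round(len(pcm) * target_sr / sr))
    x_old = np.linspace(0.0, 1.0, num=len(pcm), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, pcm).astype(np.float32)
```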

## Step 6 — Speak the answer

```python
import subprocess

def speak(text):
    # Feed the text to Piper on stdin rather than interpolating it into a
    # shell string, which breaks on quotes and invites injection.
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", "/tmp/o.wav"],
                   input=text.encode(), check=True)
    subprocess.run(["aplay", "/tmp/o.wav"], check=True)
```
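
If the caller should hear the answer rather than the server, a sketch of a voice-out endpoint; the `/ask_voice` route name is ours, not part of the API above:

```python
from fastapi.responses import FileResponse

@app.post("/ask_voice")
async def ask_voice(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if pcm.ndim > 1:
        pcm = pcm.mean(axis=1)
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)
    q, a = voice_query(pcm)
    wav = "/tmp/answer.wav"
    subprocess.run(["piper", "--model", "en_US-amy-medium",
                    "--output_file", wav], input=a.encode(), check=True)
    return FileResponse(wav, media_type="audio/wav")
```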

## Common pitfalls

- **Pyannote license.** Skipping the HF accept step throws a cryptic 401 on first run.
- **Speaker labels reset per call.** Don't assume `SPEAKER_00` is the same person across calls; cluster speaker embeddings across calls yourself (see the sketch after this list).
- **MiniLM dimension mismatch.** If you change embedders, re-index everything.
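
A minimal sketch of cross-call speaker linking. It assumes you've already extracted one embedding vector per speaker per call (e.g. with an ECAPA or pyannote speaker-embedding model); the 0.3 distance threshold is arbitrary and needs tuning on your own data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def link_speakers(per_call_embeddings):
    """per_call_embeddings: list of (call_id, local_label, vector) tuples.

    Returns {(call_id, local_label): global_speaker_id} by clustering the
    speaker vectors across calls with average-linkage cosine distance.
    """
    X = np.stack([vec for _, _, vec in per_call_embeddings])
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=0.3,
    ).fit_predict(X)
    return {(cid, lbl): int(g)
            for (cid, lbl, _), g in zip(per_call_embeddings, labels)}
```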

## How CallSphere does this in production

CallSphere ingests every call into a HIPAA-aware transcription + diarization pipeline; downstream, our 37 agents across 6 verticals (Healthcare 14 tools / FastAPI :8084 / OpenAI Realtime, OneRoof 10 specialists / WebRTC, plus Salon, Dental, F&B, Behavioral) use call history for context. 90+ tools and 115+ Postgres tables tie everything together. Flat $149/$499/$1499 · [14-day trial](/trial) · [22% affiliate](/affiliate) · [/demo](/demo).

## FAQ

**WhisperX speed?** ~70x realtime on a 4090 with batched inference.

**Speaker count?** Pyannote detects it automatically; up to ~10 speakers works cleanly, and you can pass `min_speakers`/`max_speakers` to the diarization call when you know the count.

**Can I use Postgres + pgvector instead of InMemory?** Yes — swap `InMemoryDocumentStore` for `PgvectorDocumentStore`.
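
A minimal sketch of that swap (install `pgvector-haystack`; the connection string is an example, and double-check parameter names against your installed integration version):

```python
# pip install pgvector-haystack
import os
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

os.environ["PG_CONN_STR"] = "postgresql://user:pass@localhost:5432/calls"
store = PgvectorDocumentStore(
    embedding_dimension=384,             # all-MiniLM-L6-v2 output size
    vector_function="cosine_similarity",
)
retr = PgvectorEmbeddingRetriever(document_store=store, top_k=5)
```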

**Real-time diarization?** Pyannote real-time module is in beta — viable for short windows.

**Best Haystack version?** 2.7 or later for stable Ollama integration.

## Sources

- [WhisperX on GitHub](https://github.com/m-bain/whisperx)
- [Haystack speaker diarization blog](https://haystack.deepset.ai/blog/level-up-rag-with-speaker-diarization)
- [Ollama Haystack integration](https://haystack.deepset.ai/integrations/ollama)
- [WhisperX 2026 transcription guide](https://johal.in/whisperx-transcription-diarization-and-alignment-for-audio-processing-2026/)

