By Sagar Shankaran, Founder of CallSphere
WhisperX gives you word-level timestamps and speaker diarization. Haystack lets you wire a RAG pipeline behind it. Together: a meeting voice agent that knows who said what.
Key takeaways
TL;DR — WhisperX = Whisper + word-level timestamps + Pyannote 3.1 speaker diarization. Haystack 2.x is the OSS pipeline framework that does retrieval cleanly. Combine them and your voice agent can answer "what did the customer agree to in the May 7 call?" with citations down to the second.
What you'll build: a FastAPI service that ingests call recordings, transcribes them with WhisperX (diarized), indexes turn-level chunks into a Haystack InMemoryDocumentStore, and exposes a voice-driven RAG endpoint. The query path is: spoken question → faster-whisper STT → Haystack retrieval → Ollama generation → Piper TTS.
Install the dependencies: pip install whisperx haystack-ai ollama-haystack faster-whisper piper-tts fastapi uvicorn soundfile python-multipart
Run a local Ollama server with llama3.1:8b pulled.
flowchart LR
REC[Call WAV] --> WX[WhisperX + Pyannote]
WX -->|turns w/ speaker| HS[Haystack Pipeline]
HS --> DS[(InMemoryDocStore)]
Q[Voice Question] --> FW[faster-whisper]
FW --> HS2[Haystack RAG]
HS2 --> LLM[Ollama llama3.1:8b]
LLM --> P[Piper TTS]
```python
import whisperx

device = "cuda"
# Load the ASR model once; float16 assumes a CUDA GPU.
model = whisperx.load_model("large-v3", device, compute_type="float16")

def transcribe_diarize(wav_path, hf_token):
    audio = whisperx.load_audio(wav_path)
    # 1. Batched transcription.
    result = model.transcribe(audio, batch_size=16, language="en")
    # 2. Forced alignment for word-level timestamps.
    align_model, meta = whisperx.load_align_model(language_code="en", device=device)
    result = whisperx.align(result["segments"], align_model, meta, audio, device,
                            return_char_alignments=False)
    # 3. Pyannote diarization (the Hugging Face token gates the model download).
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diar = diarize(audio)
    # 4. Attach a speaker label to each word and segment.
    return whisperx.assign_word_speakers(diar, result)
```
The returned dict's "segments" list now contains entries shaped like {start, end, text, speaker: "SPEAKER_01"}.
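WhisperX segments are often shorter than a conversational turn, so if you want turn-level chunks it can help to merge consecutive segments from the same speaker before indexing. The sketch below is one way to do that in plain Python. The `merge_turns` name and the `max_gap` threshold are assumptions for illustration, not part of WhisperX, and the `UNKNOWN` fallback covers segments that come back without a speaker label.

```python
def merge_turns(segments, max_gap=1.0):
    """Merge consecutive same-speaker segments into one turn.

    max_gap is a hypothetical threshold in seconds: a longer pause starts a new turn.
    """
    turns = []
    for seg in segments:
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        speaker = seg.get("speaker", "UNKNOWN")  # label may be missing
        last = turns[-1] if turns else None
        if last and last["speaker"] == speaker and seg["start"] - last["end"] <= max_gap:
            # Same speaker, short pause: extend the current turn.
            last["text"] += " " + text
            last["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


# Toy segments in the shape described above.
segments = [
    {"start": 0.0, "end": 1.2, "text": " Hi, thanks for calling.", "speaker": "SPEAKER_00"},
    {"start": 1.4, "end": 2.9, "text": " How can I help?", "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 5.0, "text": " I want to renew my plan.", "speaker": "SPEAKER_01"},
]
turns = merge_turns(segments)
for t in turns:
    print(t["speaker"], t["start"], t["end"], t["text"])
```

The merged turns keep the same keys as the raw segments, so they can be passed to the indexing step unchanged.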
```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore(embedding_similarity_function="cosine")

def index_segments(segments, call_id):
    # One Document per diarized segment; meta carries what the citations need.
    docs = [
        Document(
            content=s["text"],
            meta={"call_id": call_id, "speaker": s.get("speaker", "UNKNOWN"),
                  "start": s["start"], "end": s["end"]},
        )
        for s in segments if s.get("text")
    ]
    p = Pipeline()
    p.add_component("emb", SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2"))
    # DocumentWriter is the Haystack 2.x component that writes into a store.
    p.add_component("write", DocumentWriter(document_store=store))
    p.connect("emb.documents", "write.documents")
    p.run({"emb": {"documents": docs}})
```
```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack_integrations.components.generators.ollama import OllamaGenerator

# The citation format is spelled out literally: `doc` only exists inside the loop.
PROMPT = """Answer using only the call excerpts below.
Cite each claim as [SPEAKER_XX @ 12.3s], using the speaker and time shown after the excerpt.
Context:
{% for doc in documents %}- {{ doc.content }} ({{ doc.meta.speaker }} @ {{ doc.meta.start | round(1) }}s)
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("qemb", SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))
rag.add_component("retr", InMemoryEmbeddingRetriever(document_store=store, top_k=5))
rag.add_component("prompt", PromptBuilder(template=PROMPT))
rag.add_component("llm", OllamaGenerator(model="llama3.1:8b", url="http://127.0.0.1:11434"))
rag.connect("qemb.embedding", "retr.query_embedding")
rag.connect("retr.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
```
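The prompt cites excerpts by raw second offsets. On longer calls a clock-style timestamp may be easier to read back or speak aloud. Here is a small sketch of a formatter you could apply to each document's meta before building the prompt or when post-processing the answer. Both helper names are hypothetical and not part of Haystack.

```python
def format_timestamp(seconds):
    """Render a second offset as MM:SS, or H:MM:SS once the call passes an hour."""
    total = int(seconds)  # drop the fractional part
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def format_citation(meta):
    """Build a citation string from a meta dict with 'speaker' and 'start' keys."""
    return f"[{meta['speaker']} @ {format_timestamp(meta['start'])}]"


# Example with the meta fields stored at indexing time.
citation = format_citation({"speaker": "SPEAKER_01", "start": 65.4, "end": 70.2})
print(citation)
```

Keeping the raw `start` float in the store and formatting only at display time means you can still seek the recording precisely.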
```python
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")

def voice_query(audio_f32, history):
    # audio_f32: mono float32 samples at 16 kHz. `history` is accepted but not used yet.
    segs, _ = stt.transcribe(audio_f32, language="en", vad_filter=True)
    q = " ".join(s.text.strip() for s in segs).strip()
    # The embedding retriever only takes the query embedding, wired in from "qemb".
    out = rag.run({"qemb": {"text": q}, "prompt": {"question": q}})
    return q, out["llm"]["replies"][0]
```
```python
import io
import uuid

import numpy as np
import soundfile as sf
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
HF = "hf_xxx"  # placeholder: your Hugging Face token for Pyannote

def _resample(pcm, sr_in, sr_out):
    # Minimal stand-in for the resampling helper (an assumption): linear
    # interpolation, adequate for speech but not a high-quality resampler.
    n_out = int(round(len(pcm) * sr_out / sr_in))
    x_out = np.linspace(0, len(pcm) - 1, n_out)
    return np.interp(x_out, np.arange(len(pcm)), pcm).astype(np.float32)

@app.post("/ingest")
async def ingest(f: UploadFile = File(...)):
    p = f"/tmp/{uuid.uuid4()}.wav"
    with open(p, "wb") as out:
        out.write(await f.read())
    segs = transcribe_diarize(p, HF)
    index_segments(segs["segments"], call_id=f.filename)
    return {"ok": True, "turns": len(segs["segments"])}

@app.post("/ask")
async def ask(f: UploadFile = File(...)):
    pcm, sr = sf.read(io.BytesIO(await f.read()), dtype="float32")
    if pcm.ndim > 1:
        pcm = pcm.mean(axis=1)  # downmix stereo to mono
    if sr != 16000:
        pcm = _resample(pcm, sr, 16000)
    q, a = voice_query(pcm, [])
    return {"question": q, "answer": a}
```
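```python
import subprocess

def speak(text):
    # Feed the text to Piper on stdin instead of interpolating it into a shell
    # string, so quotes or $ in the LLM's answer cannot break or inject commands.
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium", "--output_file", "/tmp/o.wav"],
        input=text.encode("utf-8"), check=True)
    subprocess.run(["aplay", "/tmp/o.wav"], check=True)
```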
Don't assume SPEAKER_00 is the same person across calls: diarization labels are assigned per recording, so cluster the speaker embeddings yourself if you need stable identities.

CallSphere ingests every call into a HIPAA-aware transcription + diarization pipeline; downstream, our 37 agents across 6 verticals (Healthcare 14 tools / FastAPI :8084 / OpenAI Realtime, OneRoof 10 specialists / WebRTC, plus Salon, Dental, F&B, Behavioral) use call history for context. 90+ tools and 115+ Postgres tables tie everything together. Flat $149/$499/$1499 · 14-day trial · 22% affiliate · /demo.
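On the point that diarization labels are local to each recording: one simple way to get stable identities is to compare each call's speaker embedding against reference embeddings you have already stored. The sketch below uses nearest-reference matching by cosine similarity, which is a simpler stand-in for full clustering. It assumes you already extract one embedding per speaker with a speaker-embedding model. The toy 3-dimensional vectors, the `match_speaker` name, and the 0.7 threshold are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, known, threshold=0.7):
    """Return the best-matching known speaker ID, or None if none clears the threshold."""
    best_id, best_score = None, threshold
    for speaker_id, ref in known.items():
        score = cosine(embedding, ref)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


# Toy reference embeddings; real ones would have hundreds of dimensions.
known = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(match_speaker([0.9, 0.1, 0.0], known))  # closest to the "alice" reference
print(match_speaker([0.0, 0.0, 1.0], known))  # nothing clears the threshold
```

When `match_speaker` returns None you would enroll the embedding as a new identity, then store that identity in each document's meta alongside the per-call label.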
WhisperX speed? ~70x realtime on a 4090 with batched inference.
Speaker count? WhisperX auto-detects up to ~10 cleanly.
Can I use Postgres + pgvector instead of InMemory? Yes: swap InMemoryDocumentStore for PgvectorDocumentStore (from the pgvector-haystack package) and InMemoryEmbeddingRetriever for PgvectorEmbeddingRetriever.
Real-time diarization? Pyannote real-time module is in beta — viable for short windows.
Best Haystack version? 2.7 or later, for a stable Ollama integration.