How to Add RAG to a Voice Agent with ChromaDB and OpenAI Embeddings
Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial.
TL;DR — A voice agent without RAG hallucinates pricing, hours, and policy. Add a single `retrieve` tool backed by ChromaDB and `text-embedding-3-large`, and accuracy on factual questions jumps from ~70% to >95%.
What you'll build
A Python voice agent that answers questions strictly from your indexed knowledge base. Caller asks "what's your refund policy?" — the agent calls the retrieve tool, fetches top-3 chunks, and reads back the grounded answer with no hallucination.
Prerequisites
- Python 3.11+, `pip install chromadb openai`, `OPENAI_API_KEY` exported.
- A folder of source docs (markdown, PDF, transcripts).
- Working voice agent loop (see post 2).
- ~5 minutes to seed the index.
Architecture
```mermaid
flowchart LR
    Q[Caller question] --> R[retrieve tool]
    R --> E[OpenAI embed-3-large]
    E --> C[ChromaDB query]
    C --> T[Top-3 chunks]
    T --> M[Realtime model]
    M --> A[Grounded answer]
```
Step 1 — Chunk and embed your docs
```python
# index.py
import os

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_or_create_collection(name="callsphere_kb")

def chunk(text, size=800, overlap=100):
    # Fixed-size character windows; the overlap keeps context that
    # straddles a boundary retrievable from both sides.
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i+size])
    return out
def embed(texts):
    r = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=1024
    )
    return [d.embedding for d in r.data]

for fname in os.listdir("docs"):
    text = open(f"docs/{fname}").read()
    chunks = chunk(text)
    col.add(
        ids=[f"{fname}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": fname} for _ in chunks],
    )

print("Indexed", col.count(), "chunks")
```
Running this once embeds your KB; ChromaDB persists vectors locally.
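To sanity-check the index before wiring it into the agent, run one query by hand. Note that we embed the query with the same helper; passing `query_texts` instead would make ChromaDB fall back to its default embedding function, which doesn't match the 1024-dim vectors we stored:

```python
# Smoke test: reuses `col` and `embed` from index.py.
res = col.query(
    query_embeddings=embed(["what is the refund policy?"]),
    n_results=3,
)
for doc, meta in zip(res["documents"][0], res["metadatas"][0]):
    print(meta["source"], "->", doc[:80])
```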
Step 2 — Define the retrieve tool
```python
def retrieve(query: str, k: int = 3) -> str:
    qvec = embed([query])
    res = col.query(query_embeddings=qvec, n_results=k)
    chunks = res["documents"][0]
    sources = [m["source"] for m in res["metadatas"][0]]
    return "\n---\n".join(f"[{s}]\n{c}" for s, c in zip(sources, chunks))
```
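The tool returns one flat string the model can read verbatim, each chunk prefixed with its source file. A hypothetical call (file names and output invented for illustration):

```python
print(retrieve("what's your refund policy?"))
# [refund-policy.md]
# Refunds are issued within 14 days of purchase when...
# ---
# [pricing.md]
# ...
```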
Step 3 — Register it with the voice agent
For OpenAI Realtime over WebSocket, you declare tools in `session.update` and intercept `response.function_call_arguments.done`:
```python
TOOLS = [{
    "type": "function",
    "name": "retrieve",
    "description": "Retrieve documents from the CallSphere knowledge base. "
                   "Use ALWAYS for policy, pricing, hours.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": (
            "Answer ONLY using retrieve tool output. "
            "If retrieve returns nothing relevant, say you don't know."
        ),
        "tools": TOOLS,
        "tool_choice": "auto",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}
```
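For context, here is a minimal sketch of opening the Realtime WebSocket and sending `SESSION`. The endpoint URL, model name, and beta header are assumptions to verify against the current OpenAI Realtime docs:

```python
import json
import os

import websockets  # pip install websockets

async def connect():
    # Assumed endpoint and model name -- check the current Realtime docs.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    oai = await websockets.connect(
        url,
        additional_headers={  # `extra_headers` on websockets < 14
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    )
    await oai.send(json.dumps(SESSION))  # register tools + instructions
    return oai
```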
Step 4 — Handle the function call event
```python
import json

# Inside the agent's event loop (see post 2).
async for raw in oai:
    ev = json.loads(raw)
    if ev["type"] == "response.function_call_arguments.done":
        args = json.loads(ev["arguments"])
        result = retrieve(args["query"])
        await oai.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": result,
            },
        }))
        # Ask the model to respond now that the tool output is in context.
        await oai.send(json.dumps({"type": "response.create"}))
```
Step 5 — Tune retrieval quality
- Use `text-embedding-3-large` with `dimensions=1024` (cheaper, almost as good as 3072).
- Chunk at 600–1000 chars with 100 overlap. Smaller chunks = sharper retrieval, more API calls.
- Add metadata filters (`where={"source": "pricing.md"}`) when the agent already knows the topic.
- Run an offline eval: 100 Q/A pairs, measure top-3 hit rate, target >90% (a sketch follows this list).
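A minimal version of that offline eval, assuming a hypothetical `eval.jsonl` where each line holds a `question` and the `source` file that should answer it (hit = expected source appears in the top 3):

```python
import json

pairs = [json.loads(line) for line in open("eval.jsonl")]  # hypothetical file
hits = 0
for pair in pairs:
    res = col.query(query_embeddings=embed([pair["question"]]), n_results=3)
    retrieved = {m["source"] for m in res["metadatas"][0]}
    hits += pair["source"] in retrieved
print(f"top-3 hit rate: {hits / len(pairs):.1%}")
```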
Step 6 — Add reranking (optional, recommended at scale)
```python
import json

from openai import OpenAI

client = OpenAI()

def rerank(query, chunks):
    prompt = (
        f"Score each passage 0-10 for how well it answers: {query}\n"
        'Reply as JSON: {"0": score, "1": score, ...}\n\n'
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # JSON object keys arrive as strings; map them back to list indices.
    scores = {int(k): v for k, v in json.loads(r.choices[0].message.content).items()}
    return [chunks[i] for i in sorted(scores, key=scores.get, reverse=True)[:3]]
```
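To wire it in, over-fetch from ChromaDB and let the reranker pick the final three. A sketch (it drops the source tags for brevity; keep them if your instructions cite sources):

```python
def retrieve_reranked(query: str) -> str:
    res = col.query(query_embeddings=embed([query]), n_results=10)  # over-fetch
    best = rerank(query, res["documents"][0])
    return "\n---\n".join(best)
```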
Common pitfalls
- No tool-use enforcement: without "Answer ONLY using retrieve" in the instructions, the model still hallucinates. Be explicit.
- Chunks too big: 2,000+ chars dilutes retrieval. Split.
- Embedding query and docs with different models: always use the same model (and the same `dimensions`) on both sides.
- Cold-start latency: the ChromaDB query takes ~50ms but embedding the query takes ~200ms. Cache embeddings of common queries (sketch below).
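A minimal cache for that last point, keyed on the normalized query string; it trades a little memory for ~200ms saved on repeat questions:

```python
_embed_cache: dict[str, list[float]] = {}

def embed_query(query: str) -> list[float]:
    key = query.strip().lower()
    if key not in _embed_cache:
        _embed_cache[key] = embed([query])[0]  # embed once, reuse thereafter
    return _embed_cache[key]
```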
How CallSphere does this in production
CallSphere's Healthcare agent retrieves from a 5,000-chunk KB (clinic protocols, insurance acceptance, hours per location) before every factual answer. We use text-embedding-3-large at 1024 dims, ChromaDB self-hosted on k3s, and re-index nightly. Hit rate on 200 eval Q/A pairs: 96.5%. Lead score and sentiment are appended post-call to Postgres.
FAQ
ChromaDB vs Pinecone? Chroma is great for self-hosted and <10M vectors. Pinecone is managed and scales to billions.
Embedding cost? text-embedding-3-large is $0.13 per 1M tokens. Indexing 100k docs is usually under $20.
How do I refresh the index? Re-run index.py nightly via cron, or add an upsert endpoint that re-embeds changed files only.
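A sketch of the changed-files-only path, using a content hash per file and ChromaDB's `upsert`; the `manifest.json` sidecar is a hypothetical bookkeeping file, and note that a file that shrinks leaves stale trailing chunk ids behind:

```python
import hashlib, json, os

manifest = json.load(open("manifest.json")) if os.path.exists("manifest.json") else {}
for fname in os.listdir("docs"):
    raw = open(f"docs/{fname}", "rb").read()
    digest = hashlib.sha256(raw).hexdigest()
    if manifest.get(fname) != digest:  # file is new or changed
        chunks = chunk(raw.decode())
        col.upsert(
            ids=[f"{fname}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embed(chunks),
            metadatas=[{"source": fname} for _ in chunks],
        )
        manifest[fname] = digest
json.dump(manifest, open("manifest.json", "w"))
```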
Why not stuff all docs in the system prompt? Beyond ~10k tokens, latency tanks and accuracy drops. RAG keeps prompts tight.