How to Add RAG to a Voice Agent with ChromaDB and OpenAI Embeddings
By Sagar Shankaran, Founder of CallSphere
Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial.
Key takeaways
TL;DR: A voice agent without RAG hallucinates pricing, hours, and policy. Add a single `retrieve` tool backed by ChromaDB and `text-embedding-3-large`, and accuracy on factual questions jumps from ~70% to >95%.
What you'll build
A Python voice agent that answers questions strictly from your indexed knowledge base. Caller asks "what's your refund policy?" — the agent calls the retrieve tool, fetches top-3 chunks, and reads back the grounded answer with no hallucination.
Prerequisites
- Python 3.11+, `pip install chromadb openai`.
- `OPENAI_API_KEY` exported.
- A folder of source docs (markdown, PDF, transcripts).
- Working voice agent loop (see post 2).
- ~5 minutes to seed the index.
Architecture
flowchart LR
Q[Caller question] --> R[retrieve tool]
R --> E[OpenAI embed-3-large]
E --> C[ChromaDB query]
C --> T[Top-3 chunks]
T --> M[Realtime model]
M --> A[Grounded answer]
Step 1 — Chunk and embed your docs
```python
# index.py
import os

import chromadb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_or_create_collection(name="callsphere_kb")


def chunk(text, size=800, overlap=100):
    # Sliding character window: each chunk starts (size - overlap) after the last.
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i + size])
    return out


def embed(texts):
    r = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=1024
    )
    return [d.embedding for d in r.data]


if __name__ == "__main__":  # index only when run as a script, not on import
    for fname in sorted(os.listdir("docs")):
        # Assumes plain-text files; extract text from PDFs before indexing.
        with open(os.path.join("docs", fname), encoding="utf-8") as f:
            chunks = chunk(f.read())
        if not chunks:
            continue  # skip empty files
        # upsert rather than add, so re-running overwrites existing IDs
        col.upsert(
            ids=[f"{fname}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embed(chunks),
            metadatas=[{"source": fname} for _ in chunks],
        )
    print("Indexed", col.count(), "chunks")
```
Running this once embeds your KB; ChromaDB persists vectors locally.
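To see what the fixed-size chunker produces, here is a small self-contained sketch. It copies the `chunk` logic from `index.py` and runs it on a synthetic 2,000-character string; the sample text is an assumption made up for illustration, not real knowledge-base content.

```python
def chunk(text, size=800, overlap=100):
    """Same sliding-window splitter as in index.py."""
    step = size - overlap  # each window starts 700 chars after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]


# Synthetic document (illustrative only): 2,000 characters of digits.
sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk(sample)

print(len(chunks))                # number of windows
print([len(c) for c in chunks])   # window lengths; the last one is shorter
# Adjacent windows share their boundary text:
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some text twice.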
Step 2 — Define the retrieve tool
```python
def retrieve(query: str, k: int = 3) -> str:
    # Embed the query with the same model used for the documents.
    qvec = embed([query])
    res = col.query(query_embeddings=qvec, n_results=k)
    chunks = res["documents"][0]
    sources = [m["source"] for m in res["metadatas"][0]]
    # Prefix each chunk with its source file so the model can attribute it.
    return "\n---\n".join(f"[{s}]\n{c}" for s, c in zip(sources, chunks))
```
Step 3 — Register it with the voice agent
For OpenAI Realtime over WebSocket, you declare tools in `session.update` and intercept `response.function_call_arguments.done`:
```python
TOOLS = [{
    "type": "function",
    "name": "retrieve",
    "description": "Retrieve documents from the CallSphere knowledge base. Use ALWAYS for policy, pricing, hours.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": "Answer ONLY using retrieve tool output. If retrieve returns nothing relevant, say you don't know.",
        "tools": TOOLS,
        "tool_choice": "auto",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}
```
Step 4 — Handle the function call event
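```python
import asyncio
import json

# Runs inside the agent's async loop; `oai` is the open Realtime WebSocket.
async for raw in oai:
    ev = json.loads(raw)
    if ev["type"] == "response.function_call_arguments.done":
        args = json.loads(ev["arguments"])
        # retrieve() blocks on network calls, so run it off the event loop.
        result = await asyncio.to_thread(retrieve, args["query"])
        # Hand the tool output back to the model, tied to this call_id.
        await oai.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": result,
            },
        }))
        # Ask the model to produce the spoken answer from the tool output.
        await oai.send(json.dumps({"type": "response.create"}))
```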
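One way to make this handler testable without a live socket is to separate building the reply messages from sending them. The sketch below is an assumption about structure, not part of the original loop: `build_tool_reply` is a hypothetical helper that takes the parsed event plus any retrieval callable and returns the same two JSON payloads used in Step 4.

```python
import json


def build_tool_reply(ev, retrieve_fn):
    """Return the JSON messages to send after a completed function call.

    ev:          parsed `response.function_call_arguments.done` event (dict).
    retrieve_fn: any callable mapping a query string to result text.
    """
    args = json.loads(ev["arguments"])
    output = retrieve_fn(args["query"])
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": output,
            },
        }),
        json.dumps({"type": "response.create"}),
    ]


# Usage with a stub retriever, so no network or index is needed.
fake_event = {
    "type": "response.function_call_arguments.done",
    "call_id": "call_123",  # made-up ID for illustration
    "arguments": json.dumps({"query": "refund policy"}),
}
messages = build_tool_reply(fake_event, lambda q: f"[stub] results for {q}")
for m in messages:
    print(m)
```

In the real loop each returned string would be passed to `oai.send(...)` in order.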
Step 5 — Tune retrieval quality
- Use `text-embedding-3-large` with `dimensions=1024` (smaller vectors are cheaper to store and search, and almost as good as 3072).
- Chunk at 600–1000 chars with 100 overlap. Smaller chunks = sharper retrieval, more API calls.
- Add metadata filters (`where={"source": "pricing.md"}`) when the agent already knows the topic.
- Run an offline eval: 100 Q/A pairs, measure top-3 hit rate. Target >90%.
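The offline eval in the last bullet can be sketched as a small function. This is a minimal illustration under stated assumptions: `hit_rate`, `fake_search`, and the sample questions are hypothetical, and a "hit" is defined here as the expected source file appearing among the top-k results, which is only one of several reasonable definitions.

```python
def hit_rate(eval_set, search, k=3):
    """Fraction of questions whose expected source appears in the top-k results.

    eval_set: list of (question, expected_source) pairs.
    search:   callable(question, k) -> list of source names, best first.
    """
    if not eval_set:
        return 0.0
    hits = sum(1 for q, expected in eval_set if expected in search(q, k)[:k])
    return hits / len(eval_set)


# Stub search results standing in for a ChromaDB query (illustrative data).
FAKE_RESULTS = {
    "what is your refund policy?": ["refunds.md", "pricing.md", "faq.md"],
    "when are you open?": ["faq.md", "hours.md", "pricing.md"],
    "how much is the pro plan?": ["faq.md", "hours.md", "refunds.md"],
}


def fake_search(question, k):
    return FAKE_RESULTS[question][:k]


eval_set = [
    ("what is your refund policy?", "refunds.md"),
    ("when are you open?", "hours.md"),
    ("how much is the pro plan?", "pricing.md"),
]
print(f"top-3 hit rate: {hit_rate(eval_set, fake_search, k=3):.2f}")
```

Against the real index, `search` would wrap `col.query(...)` and return `[m["source"] for m in res["metadatas"][0]]`, as `retrieve` does in Step 2.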
Step 6 — Add reranking (optional, recommended at scale)
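```python
import json

from openai import OpenAI

client = OpenAI()


def rerank(query, chunks):
    # The json_object response format requires the prompt to mention JSON.
    prompt = (
        f"Score each passage 0-10 for answering: {query}\n"
        'Return a JSON object mapping each passage index to its score, e.g. {"0": 7, "1": 2}.\n\n'
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(r.choices[0].message.content)
    # JSON keys are strings, so convert them back to list indices.
    ranked = sorted(scores, key=lambda i: float(scores[i]), reverse=True)
    return [chunks[int(i)] for i in ranked[:3] if int(i) < len(chunks)]
```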
Common pitfalls
- No tool-use enforcement: without `Answer ONLY using retrieve` in the prompt, the model still hallucinates. Be explicit.
- Chunks too big: 2000+ chars dilutes retrieval. Split.
- Embedding queries and documents with different models: always use the same model and dimensions on both sides.
- Cold-start latency: ChromaDB query is ~50ms but embedding the query is ~200ms. Cache embeddings of common queries.
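For the last pitfall, a query-embedding cache could look like the sketch below. It is an assumption, not the production setup: `make_cached_embedder` is a hypothetical wrapper around any batch embedder with the same shape as `embed()` in `index.py`, and it only helps when the same query text recurs after case and whitespace normalization.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap a batch embedder with an in-memory LRU cache for single queries.

    embed_fn: callable(list[str]) -> list[list[float]], like embed() in index.py.
    """
    @lru_cache(maxsize=maxsize)
    def _embed_one(normalized_query):
        # Store a tuple so a cached vector cannot be mutated by callers.
        return tuple(embed_fn([normalized_query])[0])

    def embed_query(query):
        # Normalize case and whitespace so trivial variants share one entry.
        key = " ".join(query.lower().split())
        return list(_embed_one(key))

    return embed_query


# Demo with a stub embedder that counts how often it is called.
calls = {"n": 0}


def fake_embed(texts):
    calls["n"] += 1
    return [[float(len(t)), 0.0] for t in texts]


embed_query = make_cached_embedder(fake_embed)
v1 = embed_query("What are your hours?")
v2 = embed_query("  what are   your hours? ")
print(v1 == v2, "embed calls:", calls["n"])
```

Inside `retrieve`, the cached function would replace `embed([query])`, with the result wrapped in a list before being passed as `query_embeddings`.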
How CallSphere does this in production
CallSphere's Healthcare agent retrieves from a 5,000-chunk KB (clinic protocols, insurance acceptance, hours per location) before every factual answer. We use text-embedding-3-large at 1024 dims, ChromaDB self-hosted on k3s, and re-index nightly. Hit rate on 200 eval Q/A: 96.5%. Lead score and sentiment are appended post-call to Postgres.
FAQ
ChromaDB vs Pinecone? Chroma is great for self-hosted and <10M vectors. Pinecone is managed and scales to billions.
Embedding cost? text-embedding-3-large is $0.13 per 1M tokens. Indexing 100k docs is usually under $20.
How do I refresh the index? Re-run index.py nightly via cron, or add an upsert endpoint that re-embeds changed files only.
Why not stuff all docs in the system prompt? Beyond ~10k tokens, latency tanks and accuracy drops. RAG keeps prompts tight.
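For the index-refresh question above, re-embedding only changed files needs some way to detect changes. The sketch below assumes a content-hash manifest stored as JSON next to the index; `file_digest`, `changed_files`, `save_manifest`, and the manifest file itself are hypothetical names for illustration, not part of ChromaDB or the original tutorial.

```python
import hashlib
import json
import os


def file_digest(path):
    """SHA-256 of a file's bytes, used to detect content changes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def changed_files(docs_dir, manifest_path):
    """Return (changed, current).

    changed: sorted filenames whose hash differs from the saved manifest.
    current: fresh {filename: digest} mapping for every file in docs_dir.
    """
    try:
        with open(manifest_path, encoding="utf-8") as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}  # first run: every file counts as changed
    current = {
        name: file_digest(os.path.join(docs_dir, name))
        for name in sorted(os.listdir(docs_dir))
        if os.path.isfile(os.path.join(docs_dir, name))
    }
    changed = [n for n, digest in current.items() if previous.get(n) != digest]
    return changed, current


def save_manifest(manifest_path, current):
    """Persist the digests after the changed files have been re-indexed."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(current, f, indent=2)
```

A refresh job would re-chunk and re-embed only the names in `changed`, then call `save_manifest`. Deleted files and files that shrink are not handled here: their old chunks stay in the collection unless they are removed first, for example with `col.delete(where={"source": name})` before re-adding.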
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.