How to Add RAG to a Voice Agent with ChromaDB and OpenAI Embeddings
Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial.
TL;DR — A voice agent without RAG hallucinates pricing, hours, and policy. Add a single `retrieve` tool backed by ChromaDB and `text-embedding-3-large`, and accuracy on factual questions jumps from ~70% to >95%.
What you'll build
A Python voice agent that answers questions strictly from your indexed knowledge base. Caller asks "what's your refund policy?" — the agent calls the retrieve tool, fetches top-3 chunks, and reads back the grounded answer with no hallucination.
Prerequisites
- Python 3.11+, `pip install chromadb openai`, `OPENAI_API_KEY` exported.
- A folder of source docs (markdown, PDF, transcripts).
- Working voice agent loop (see post 2).
- ~5 minutes to seed the index.
Architecture
```mermaid
flowchart LR
    Q[Caller question] --> R[retrieve tool]
    R --> E[OpenAI embed-3-large]
    E --> C[ChromaDB query]
    C --> T[Top-3 chunks]
    T --> M[Realtime model]
    M --> A[Grounded answer]
```
Step 1 — Chunk and embed your docs
```python
# index.py
import os

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_or_create_collection(name="callsphere_kb")

def chunk(text, size=800, overlap=100):
    # Fixed-size character windows; the overlap keeps context that
    # straddles a boundary retrievable from both sides.
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i+size])
    return out
def embed(texts):
    r = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=1024
    )
    return [d.embedding for d in r.data]

for fname in os.listdir("docs"):
    text = open(f"docs/{fname}").read()
    chunks = chunk(text)
    col.add(
        ids=[f"{fname}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": fname} for _ in chunks],
    )

print("Indexed", col.count(), "chunks")
```
Running this once embeds your KB; ChromaDB persists vectors locally.
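To sanity-check the index before wiring it into the agent, run one query by hand. Note that we embed the query with the same helper; passing `query_texts` instead would make ChromaDB fall back to its default embedding function, which doesn't match the 1024-dim vectors we stored:

```python
# Smoke test: reuses `col` and `embed` from index.py.
res = col.query(
    query_embeddings=embed(["what is the refund policy?"]),
    n_results=3,
)
for doc, meta in zip(res["documents"][0], res["metadatas"][0]):
    print(meta["source"], "->", doc[:80])
```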
Step 2 — Define the retrieve tool
```python
def retrieve(query: str, k: int = 3) -> str:
    qvec = embed([query])
    res = col.query(query_embeddings=qvec, n_results=k)
    chunks = res["documents"][0]
    sources = [m["source"] for m in res["metadatas"][0]]
    return "\n---\n".join(f"[{s}]\n{c}" for s, c in zip(sources, chunks))
```
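The tool returns one flat string the model can read verbatim, each chunk prefixed with its source file. A hypothetical call (file names and output invented for illustration):

```python
print(retrieve("what's your refund policy?"))
# [refund-policy.md]
# Refunds are issued within 14 days of purchase when...
# ---
# [pricing.md]
# ...
```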
Step 3 — Register it with the voice agent
For OpenAI Realtime over WebSocket, you declare tools in `session.update` and intercept `response.function_call_arguments.done`:
```python
TOOLS = [{
    "type": "function",
    "name": "retrieve",
    "description": "Retrieve documents from the CallSphere knowledge base. "
                   "Use ALWAYS for policy, pricing, hours.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": (
            "Answer ONLY using retrieve tool output. "
            "If retrieve returns nothing relevant, say you don't know."
        ),
        "tools": TOOLS,
        "tool_choice": "auto",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}
```
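For context, here is a minimal sketch of opening the Realtime WebSocket and sending `SESSION`. The endpoint URL, model name, and beta header are assumptions to verify against the current OpenAI Realtime docs:

```python
import json
import os

import websockets  # pip install websockets

async def connect():
    # Assumed endpoint and model name -- check the current Realtime docs.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    oai = await websockets.connect(
        url,
        additional_headers={  # `extra_headers` on websockets < 14
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    )
    await oai.send(json.dumps(SESSION))  # register tools + instructions
    return oai
```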
Step 4 — Handle the function call event
```python
import json

# Inside the agent's event loop (see post 2).
async for raw in oai:
    ev = json.loads(raw)
    if ev["type"] == "response.function_call_arguments.done":
        args = json.loads(ev["arguments"])
        result = retrieve(args["query"])
        await oai.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": result,
            },
        }))
        # Ask the model to respond now that the tool output is in context.
        await oai.send(json.dumps({"type": "response.create"}))
```
Step 5 — Tune retrieval quality
- Use `text-embedding-3-large` with `dimensions=1024` (cheaper, almost as good as 3072).
- Chunk at 600–1000 chars with 100 overlap. Smaller chunks = sharper retrieval, more API calls.
- Add metadata filters (`where={"source": "pricing.md"}`) when the agent already knows the topic.
- Run an offline eval: 100 Q/A pairs, measure top-3 hit rate, target >90% (a sketch follows this list).
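A minimal version of that offline eval, assuming a hypothetical `eval.jsonl` where each line holds a `question` and the `source` file that should answer it (hit = expected source appears in the top 3):

```python
import json

pairs = [json.loads(line) for line in open("eval.jsonl")]  # hypothetical file
hits = 0
for pair in pairs:
    res = col.query(query_embeddings=embed([pair["question"]]), n_results=3)
    retrieved = {m["source"] for m in res["metadatas"][0]}
    hits += pair["source"] in retrieved
print(f"top-3 hit rate: {hits / len(pairs):.1%}")
```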
Step 6 — Add reranking (optional, recommended at scale)
```python
import json

from openai import OpenAI

client = OpenAI()

def rerank(query, chunks):
    prompt = (
        f"Score each passage 0-10 for how well it answers: {query}\n"
        'Reply as JSON: {"0": score, "1": score, ...}\n\n'
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # JSON object keys arrive as strings; map them back to list indices.
    scores = {int(k): v for k, v in json.loads(r.choices[0].message.content).items()}
    return [chunks[i] for i in sorted(scores, key=scores.get, reverse=True)[:3]]
```
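To wire it in, over-fetch from ChromaDB and let the reranker pick the final three. A sketch (it drops the source tags for brevity; keep them if your instructions cite sources):

```python
def retrieve_reranked(query: str) -> str:
    res = col.query(query_embeddings=embed([query]), n_results=10)  # over-fetch
    best = rerank(query, res["documents"][0])
    return "\n---\n".join(best)
```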
Common pitfalls
- No tool-use enforcement: without "Answer ONLY using retrieve" in the instructions, the model still hallucinates. Be explicit.
- Chunks too big: 2,000+ chars dilutes retrieval. Split.
- Embedding query and docs with different models: always use the same model (and the same `dimensions`) on both sides.
- Cold-start latency: the ChromaDB query takes ~50ms but embedding the query takes ~200ms. Cache embeddings of common queries (sketch below).
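A minimal cache for that last point, keyed on the normalized query string; it trades a little memory for ~200ms saved on repeat questions:

```python
_embed_cache: dict[str, list[float]] = {}

def embed_query(query: str) -> list[float]:
    key = query.strip().lower()
    if key not in _embed_cache:
        _embed_cache[key] = embed([query])[0]  # embed once, reuse thereafter
    return _embed_cache[key]
```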
How CallSphere does this in production
CallSphere's Healthcare agent retrieves from a 5,000-chunk KB (clinic protocols, insurance acceptance, hours per location) before every factual answer. We use text-embedding-3-large at 1024 dims, ChromaDB self-hosted on k3s, and re-index nightly. Hit rate on 200 eval Q/A pairs: 96.5%. Lead score and sentiment are appended post-call to Postgres.
FAQ
ChromaDB vs Pinecone? Chroma is great for self-hosted and <10M vectors. Pinecone is managed and scales to billions.
Embedding cost? text-embedding-3-large is $0.13 per 1M tokens. Indexing 100k docs is usually under $20.
How do I refresh the index? Re-run index.py nightly via cron, or add an upsert endpoint that re-embeds changed files only.
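A sketch of the changed-files-only path, using a content hash per file and ChromaDB's `upsert`; the `manifest.json` sidecar is a hypothetical bookkeeping file, and note that a file that shrinks leaves stale trailing chunk ids behind:

```python
import hashlib, json, os

manifest = json.load(open("manifest.json")) if os.path.exists("manifest.json") else {}
for fname in os.listdir("docs"):
    raw = open(f"docs/{fname}", "rb").read()
    digest = hashlib.sha256(raw).hexdigest()
    if manifest.get(fname) != digest:  # file is new or changed
        chunks = chunk(raw.decode())
        col.upsert(
            ids=[f"{fname}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embed(chunks),
            metadatas=[{"source": fname} for _ in chunks],
        )
        manifest[fname] = digest
json.dump(manifest, open("manifest.json", "w"))
```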
Why not stuff all docs in the system prompt? Beyond ~10k tokens, latency tanks and accuracy drops. RAG keeps prompts tight.