---
title: "How to Add RAG to a Voice Agent with ChromaDB and OpenAI Embeddings"
description: "Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial."
canonical: https://callsphere.ai/blog/vw1h-build-voice-agent-rag-chromadb-openai-embeddings-tutorial
category: "AI Engineering"
tags: ["Tutorial", "Build", "RAG", "ChromaDB", "Python"]
author: "CallSphere Team"
published: 2026-03-30T00:00:00.000Z
updated: 2026-05-07T06:45:01.336Z
---

# How to Add RAG to a Voice Agent with ChromaDB and OpenAI Embeddings

> Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial.

> **TL;DR** — A voice agent without RAG hallucinates pricing, hours, and policy. Add a single `retrieve` tool backed by ChromaDB and `text-embedding-3-large`, and accuracy on factual questions jumps from ~70% to >95%.

## What you'll build

A Python voice agent that answers questions strictly from your indexed knowledge base. Caller asks "what's your refund policy?" — the agent calls the `retrieve` tool, fetches top-3 chunks, and reads back the grounded answer with no hallucination.

## Prerequisites

1. Python 3.11+, `pip install chromadb openai`.
2. `OPENAI_API_KEY` exported.
3. A folder of source docs (markdown, PDF, transcripts).
4. Working voice agent loop (see post 2).
5. ~5 minutes to seed the index.

## Architecture

```mermaid
flowchart LR
  Q[Caller question] --> R[retrieve tool]
  R --> E[OpenAI embed-3-large]
  E --> C[ChromaDB query]
  C --> T[Top-3 chunks]
  T --> M[Realtime model]
  M --> A[Grounded answer]
```

## Step 1 — Chunk and embed your docs

```python
# index.py
import os, chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_or_create_collection(name="callsphere_kb")

def chunk(text, size=800, overlap=100):
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i+size])
    return out

def embed(texts):
    # 1024 dims cost less to store and query, and stay close to full 3072-dim quality (see Step 5)
    r = client.embeddings.create(model="text-embedding-3-large", input=texts, dimensions=1024)
    return [d.embedding for d in r.data]

for fname in os.listdir("docs"):
    with open(f"docs/{fname}", encoding="utf-8") as f:
        text = f.read()
    chunks = chunk(text)
    col.add(
        ids=[f"{fname}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": fname} for _ in chunks],
    )
print("Indexed", col.count(), "chunks")
```

Running this once embeds your KB; ChromaDB persists vectors locally.
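
Before wiring the tool in, it's worth a quick smoke test against the persisted store. A minimal sketch (the query string is just an example; the collection name matches Step 1):

```python
# verify.py: one-off sanity check of the persisted index
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_collection(name="callsphere_kb")

qvec = client.embeddings.create(
    model="text-embedding-3-large", input=["refund policy"], dimensions=1024
).data[0].embedding
res = col.query(query_embeddings=[qvec], n_results=3)
for doc, meta in zip(res["documents"][0], res["metadatas"][0]):
    print(meta["source"], "->", doc[:80].replace("\n", " "))
```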

## Step 2 — Define the retrieve tool

```python
def retrieve(query: str, k: int = 3) -> str:
    qvec = embed([query])
    res = col.query(query_embeddings=qvec, n_results=k)
    chunks = res["documents"][0]
    sources = [m["source"] for m in res["metadatas"][0]]
    return "\n---\n".join(f"[{s}]\n{c}" for s, c in zip(sources, chunks))
```
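
Calling it directly shows the shape the model will see; the filenames in the commented output are placeholders for whatever sits in your `docs/` folder:

```python
context = retrieve("what is your refund policy?")
print(context)
# [refund-policy.md]
# <chunk text>
# ---
# [terms.md]
# <chunk text>
# ---
# [faq.md]
# <chunk text>
```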

## Step 3 — Register it with the voice agent

For OpenAI Realtime over WebSocket, you declare tools in `session.update` and intercept `response.function_call_arguments.done`:

```python
TOOLS = [{
    "type": "function",
    "name": "retrieve",
    "description": "Retrieve documents from the CallSphere knowledge base. Use ALWAYS for policy, pricing, hours.",
    "parameters": {
        "type": "object",
        "properties": { "query": { "type": "string" } },
        "required": ["query"],
    },
}]

SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": "Answer ONLY using retrieve tool output. If retrieve returns nothing relevant, say you don't know.",
        "tools": TOOLS,
        "tool_choice": "auto",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    }
}
```
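
The session config only takes effect once it's sent over the open WebSocket. A sketch of the connection step, assuming the `websockets` library and the Realtime beta endpoint (check the Realtime docs for the current model name):

```python
import os, json
import websockets  # pip install websockets

async def connect_and_configure():
    # NOTE: the header kwarg is `additional_headers` on websockets >= 14
    # (`extra_headers` on older releases); adjust for your installed version.
    oai = await websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        additional_headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    )
    await oai.send(json.dumps(SESSION))  # register the retrieve tool before streaming audio
    return oai
```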

## Step 4 — Handle the function call event

```python
# Runs inside the agent loop; assumes `json` is imported and `oai` is the open Realtime WebSocket
async for raw in oai:
    ev = json.loads(raw)
    if ev["type"] == "response.function_call_arguments.done":
        args = json.loads(ev["arguments"])
        result = retrieve(args["query"])
        await oai.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": result,
            },
        }))
        await oai.send(json.dumps({"type": "response.create"}))
```

## Step 5 — Tune retrieval quality

- Use `text-embedding-3-large` with `dimensions=1024` (cheaper, almost as good as 3072).
- Chunk at 600–1000 chars with 100 overlap. Smaller chunks = sharper retrieval, more API calls.
- Add metadata filters (`where={"source": "pricing.md"}`) when the agent already knows the topic.
- Run an offline eval: 100 Q/A pairs, measure top-3 hit rate. Target >90% (a minimal harness follows this list).
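
A minimal eval harness, assuming you maintain an `eval.jsonl` file of question/expected-source pairs (the file and its schema are this sketch's invention; `col` and `embed` come from Step 1):

```python
import json

# eval.jsonl, one pair per line: {"q": "what is the refund window?", "source": "refunds.md"}
hits = 0
pairs = [json.loads(line) for line in open("eval.jsonl", encoding="utf-8")]
for pair in pairs:
    res = col.query(query_embeddings=embed([pair["q"]]), n_results=3)
    if pair["source"] in {m["source"] for m in res["metadatas"][0]}:
        hits += 1
print(f"top-3 hit rate: {hits / len(pairs):.1%}")  # target >90%
```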

## Step 6 — Add reranking (optional, recommended at scale)

```python
import json
from openai import OpenAI

client = OpenAI()

def rerank(query, chunks, keep=3):
    # The prompt must mention JSON when using response_format={"type": "json_object"}
    prompt = (
        f"Score each passage 0-10 for how well it answers: {query}\n"
        'Reply with a JSON object mapping passage index to score, e.g. {"0": 7}.\n\n'
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(r.choices[0].message.content)  # keys come back as strings: "0", "1", ...
    return [chunks[int(i)] for i in sorted(scores, key=scores.get, reverse=True)[:keep]]
```
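
To plug it in, over-fetch from ChromaDB and let the reranker cut the list back down. A sketch reusing `col` and `embed` from earlier steps (it drops the source labels for brevity):

```python
def retrieve_reranked(query: str) -> str:
    # Pull 10 candidates cheaply from Chroma, then spend one LLM call picking the best 3
    res = col.query(query_embeddings=embed([query]), n_results=10)
    return "\n---\n".join(rerank(query, res["documents"][0]))
```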

## Common pitfalls

- **No tool-use enforcement**: without `Answer ONLY using retrieve` in the prompt, the model still hallucinates. Be explicit.
- **Chunks too big**: 2000+ chars dilutes retrieval. Split.
- **Mismatched embedding models**: always embed queries and documents with the same model and the same `dimensions` setting.
- **Cold-start latency**: a ChromaDB query takes ~50ms, but embedding the query takes ~200ms. Cache embeddings of common queries (sketch below).
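
A minimal in-process cache for that last pitfall (a sketch; a multi-worker deployment would want Redis or similar instead of `lru_cache`):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    # lru_cache keys on the query string; repeated questions skip the embeddings API
    return tuple(embed([query])[0])

def retrieve_cached(query: str, k: int = 3):
    return col.query(query_embeddings=[list(embed_query(query))], n_results=k)
```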

## How CallSphere does this in production

CallSphere's Healthcare agent retrieves from a 5,000-chunk KB (clinic protocols, insurance acceptance, hours per location) before every factual answer. We use `text-embedding-3-large` at 1024 dims, ChromaDB self-hosted on k3s, and re-index nightly. Hit rate on 200 eval Q/A: 96.5%. Lead score and sentiment are appended post-call to Postgres. [Learn more](/industries/healthcare).

## FAQ

**ChromaDB vs Pinecone?** Chroma is great for self-hosted and <10M vectors. Pinecone is managed and scales to billions.

**Embedding cost?** `text-embedding-3-large` is $0.13 per 1M tokens. Indexing 100k docs is usually under $20.

**How do I refresh the index?** Re-run `index.py` nightly via cron, or add an `upsert` endpoint that re-embeds changed files only.
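
A sketch of the changed-files-only refresh, keyed on a content hash stored in chunk metadata (the `file_hash` field is this sketch's convention, not a Chroma built-in; `col`, `chunk`, and `embed` come from Step 1):

```python
import hashlib, os

def refresh(docs_dir: str = "docs") -> None:
    for fname in os.listdir(docs_dir):
        with open(os.path.join(docs_dir, fname), encoding="utf-8") as f:
            text = f.read()
        digest = hashlib.sha256(text.encode()).hexdigest()
        prev = col.get(ids=[f"{fname}-0"])  # probe the file's first chunk
        if prev["metadatas"] and prev["metadatas"][0].get("file_hash") == digest:
            continue  # unchanged; skip re-embedding
        chunks = chunk(text)
        col.upsert(
            ids=[f"{fname}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embed(chunks),
            metadatas=[{"source": fname, "file_hash": digest} for _ in chunks],
        )
```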

**Why not stuff all docs in the system prompt?** Beyond ~10k tokens, latency tanks and accuracy drops. RAG keeps prompts tight.

## Sources

- [ChromaDB docs](https://docs.trychroma.com/)
- [OpenAI embeddings guide](https://platform.openai.com/docs/guides/embeddings)
- [OpenAI Realtime function calling](https://platform.openai.com/docs/guides/realtime)
- [text-embedding-3-large model card](https://platform.openai.com/docs/models/text-embedding-3-large)

