---
title: "ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base"
description: "How CallSphere uses ChromaDB embeddings + a Lookup specialist agent for voice RAG vs Vapi PDF Knowledge Base. Retrieval quality, indexing, costs."
canonical: https://callsphere.ai/blog/chromadb-rag-voice-agents-callsphere-vs-vapi-knowledge-base
category: "Technical Guides"
tags: ["RAG", "ChromaDB", "Voice AI", "CallSphere", "Vapi", "Knowledge Base", "Embeddings", "IT Helpdesk"]
author: "CallSphere Team"
published: 2026-04-19T00:00:00.000Z
updated: 2026-05-06T13:57:12.954Z
---

# ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base

> How CallSphere uses ChromaDB embeddings + a Lookup specialist agent for voice RAG vs Vapi PDF Knowledge Base. Retrieval quality, indexing, costs.

## TL;DR

**Vapi Knowledge Base** lets you upload PDFs and documents that the assistant can cite during a call — managed embedding, managed retrieval, opaque chunking. **CallSphere** runs **ChromaDB** as a self-hosted vector store with a dedicated **Lookup specialist agent** in IT Helpdesk that performs explicit retrieve-then-answer. Both work for FAQ-style queries; CallSphere's approach gives you tunable chunking, custom retrievers (BM25 hybrid, MMR), and the ability to inspect every retrieval that influenced an answer.

If you can ship one PDF and never look back, Vapi is fine. If you need to know why the agent answered "30-day return policy" instead of "60-day," you need an inspectable RAG pipeline.

## Voice RAG Is Different From Chat RAG

Voice agents have constraints chat does not:

- **Latency budget** — you have ~250ms before the user notices a gap
- **Token cost** — every retrieved chunk lives in the Realtime context across turns
- **Truncation** — the LLM has to summarize, not quote, because a listener cannot follow citation footnotes in audio
- **Failure handling** — when retrieval misses, the model must say "I'm not sure" rather than hallucinate

These constraints push you toward smaller chunks, fewer of them, and explicit confidence thresholds.
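The token constraint compounds across turns because retrieved chunks stay resident in the Realtime context. A back-of-envelope sketch (all numbers are illustrative assumptions, not measured values):

```python
# Back-of-envelope context cost when retrieved chunks persist across turns.
# Chunk sizes and turn counts below are illustrative, not measured.

def context_tokens(chunks: int, tokens_per_chunk: int, turns: int) -> int:
    """Tokens the retrieved context occupies over a multi-turn call."""
    return chunks * tokens_per_chunk * turns

# 3 small chunks for 10 turns vs 8 large chunks for 10 turns
lean = context_tokens(chunks=3, tokens_per_chunk=180, turns=10)
heavy = context_tokens(chunks=8, tokens_per_chunk=400, turns=10)

print(lean, heavy)  # the heavy config costs ~6x more tokens per call
```

Roughly a 6x difference per call, before you even count re-billing the same context on every turn.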

## Vapi Knowledge Base Approach

Vapi exposes a Knowledge Base as a per-assistant resource:

```json
{
  "knowledgeBase": {
    "provider": "trieve",
    "topK": 5,
    "fileIds": ["file_abc123", "file_def456"]
  }
}
```

Behind the scenes, documents are chunked, embedded, and indexed in a managed vector store (Trieve, per the `provider` field). At call time, every user query triggers a retrieval and the top-K chunks are injected into the LLM context.
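Conceptually, the managed retrieval step amounts to formatting the top-K chunks into the prompt. A simplified sketch of that injection (not Vapi's actual implementation):

```python
def inject_context(system_prompt: str, chunks: list[str]) -> str:
    """Append retrieved chunks to the LLM context. Managed services do
    roughly this on every turn; the point is you cannot see which chunks
    were chosen or how they were ordered."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system_prompt}\n\nRelevant knowledge:\n{context}"

prompt = inject_context(
    "You are a support agent.",
    ["Returns accepted within 30 days."],
)
```

With a managed KB this step is a black box; the rest of this post is about making it inspectable.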

**Strengths:** zero infra, drop a PDF, done.

**Weaknesses:**

- Chunking strategy is fixed
- Hybrid retrieval (BM25 + dense) is not exposed
- You cannot inspect which chunks were retrieved for a given turn
- Re-indexing on document update is manual
- No metadata filtering (e.g., "only retrieve from Q1 2026 docs")
- Citations in voice responses are vague

## CallSphere ChromaDB Approach

CallSphere ships with ChromaDB embedded in the IT Helpdesk vertical. The architecture is:

```
User question
   ↓
Orchestrator (IT Triage)
   ↓ hand_off if knowledge query
Lookup Specialist Agent
   ↓ tool: retrieve_kb(query, filters, k=8)
ChromaDB (sentence-transformers/all-MiniLM-L6-v2 embeddings)
   ↓ top-K chunks with metadata
Re-rank (Cohere rerank-3 optional, BM25 hybrid)
   ↓ top-3 chunks
LLM (gpt-4o-realtime) generates audio response
   ↓
Postgres call_logs.retrievals[] for audit
```

### Indexing Pipeline

The IT Helpdesk ingestion script:

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="/data/chroma")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = client.get_or_create_collection(
    name="it_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

def index_doc(doc_path: str, doc_meta: dict):
    chunks = semantic_chunker(doc_path, target_tokens=180, overlap=30)
    for i, chunk in enumerate(chunks):
        col.upsert(
            ids=[f"{doc_meta['id']}::{i}"],
            documents=[chunk.text],
            metadatas=[{
                **doc_meta,
                "chunk_index": i,
                "section": chunk.section,
                "updated_at": chunk.updated_at,
            }],
        )
```
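The `semantic_chunker` helper is internal to the ingestion script. A minimal stand-in that approximates its contract (sliding token windows with overlap, using whitespace tokens as a crude proxy for real tokenization) might look like:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section: str
    updated_at: str

def naive_chunker(text: str, target_tokens: int = 180, overlap: int = 30,
                  section: str = "", updated_at: str = "") -> list[Chunk]:
    """Whitespace-token sliding window; a stand-in for a real semantic
    chunker, which would also respect sentence and section boundaries."""
    words = text.split()
    step = target_tokens - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + target_tokens]
        if not window:
            break
        chunks.append(Chunk(" ".join(window), section, updated_at))
        if start + target_tokens >= len(words):
            break
    return chunks
```

A real semantic chunker splits on headings and sentence boundaries rather than raw word counts, but the window/overlap arithmetic is the same.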

Three deliberate choices:

- **180-token chunks** — small enough for voice context, large enough for semantic coherence
- **30-token overlap** — preserves cross-boundary entities
- **Metadata-rich** — enables filters like `{"section": "returns", "updated_at": {"$gte": "2026-01-01"}}`

### Retrieval Tool

The Lookup specialist exposes a single tool:

```python
import uuid

@tool
async def retrieve_kb(
    query: str,
    section_filter: str | None = None,
    k: int = 8,
) -> RetrievalResult:
    # Chroma rejects an empty where dict; pass None to skip filtering
    where = {"section": section_filter} if section_filter else None
    raw = col.query(
        query_texts=[query],
        n_results=k,
        where=where,
    )

    # Hybrid: blend dense similarities with BM25 scores from a parallel index
    bm25_scores = bm25_index.get_scores(query, raw["ids"][0])
    blended = blend(raw["distances"][0], bm25_scores, alpha=0.7)

    # Re-order candidates by blended score, then rerank top-8 down to top-3
    ranked = [doc for _, doc in
              sorted(zip(blended, raw["documents"][0]), reverse=True)]
    top3 = cohere_rerank(query, ranked, top_n=3)

    return RetrievalResult(
        chunks=top3,
        confidence=max(blended),
        retrieval_id=str(uuid.uuid4()),
    )
```
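The `blend` helper is referenced but not shown. One common formulation min-max normalizes both score lists and takes a weighted sum; converting cosine distance to similarity as `1 - d` is an assumption here, not confirmed from the source:

```python
def blend(distances: list[float], bm25: list[float],
          alpha: float = 0.7) -> list[float]:
    """Blend dense and lexical scores into one ranking signal.
    Cosine *distances* become similarities (1 - d); each list is then
    min-max normalized to [0, 1] so the weighted sum is scale-free."""
    def norm(xs: list[float]) -> list[float]:
        lo, hi = min(xs), max(xs)
        return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

    dense_sim = norm([1.0 - d for d in distances])
    lexical = norm(bm25)
    return [alpha * d + (1 - alpha) * b for d, b in zip(dense_sim, lexical)]

scores = blend([0.1, 0.4, 0.9], [2.0, 8.0, 4.0], alpha=0.7)
```

With `alpha=0.7` the dense signal dominates, but a strong BM25 hit can still promote an exact-keyword match past a merely semantically-close chunk, as the middle candidate shows.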

### Confidence Threshold

If `confidence < 0.55`, the Lookup agent tells the caller it is not sure and escalates to a human instead of guessing. Because every retrieval is written to the Postgres audit log, low-confidence turns from the last day can be pulled with a single query:

```sql
SELECT r.retrieval_id, r.query, r.confidence
FROM retrievals r
WHERE r.created_at > NOW() - INTERVAL '24 hours'
  AND r.confidence < 0.55;
```

## End-to-End Flow

```mermaid
flowchart TD
    User[User question] --> Orch[Orchestrator]
    Orch -->|hand_off| Lookup[Lookup Specialist]
    Lookup -->|retrieve_kb| Embed[Embed query<br/>MiniLM-L6-v2]
    Embed --> Chroma[(ChromaDB<br/>cosine)]
    Lookup --> BM25[BM25 index]
    Chroma --> Blend[Blend α=0.7]
    BM25 --> Blend
    Blend --> Rerank[Cohere rerank-3]
    Rerank --> Conf{conf > 0.55?}
    Conf -->|yes| LLM[gpt-4o-realtime]
    Conf -->|no| Escalate[Escalate to human]
    LLM --> Audio[PCM16 response]
    LLM --> Log[(retrievals log)]
```

## Practical Tips

- **Cap context at 3 chunks.** More chunks = more tokens = more first-token latency.
- **Embed FAQ-shaped paraphrases of source docs.** Customers ask in question form; docs are written in declarative form.
- **Re-embed when you change chunkers.** Mixing chunkers in one collection is a silent quality-killer.
- **Always log query + chunk IDs.** Without this you cannot debug RAG failures.
- **Use metadata filters aggressively.** "Only retrieve from active products" beats relevance ranking on stale data.
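For the logging tip, a stdlib-only sketch of a retrieval log line (field names are illustrative, not CallSphere's actual schema):

```python
import json
import uuid
import datetime

def log_retrieval(query: str, chunk_ids: list[str], confidence: float) -> str:
    """Serialize one retrieval event as a JSON line. Query + chunk IDs +
    confidence is the minimum you need to debug a bad answer after the call."""
    record = {
        "retrieval_id": str(uuid.uuid4()),
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "query": query,
        "chunk_ids": chunk_ids,
        "confidence": round(confidence, 3),
    }
    return json.dumps(record)

line = log_retrieval("what is the return window?", ["policy-v3::4"], 0.8123)
```

Append these lines to a table or file keyed by call ID and "why did it say 30 days?" becomes a lookup instead of a reproduction hunt.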

## FAQ

### Why ChromaDB and not pgvector?

Both work. ChromaDB has lighter operational overhead for the IT Helpdesk scale (50K-500K chunks). At 5M+ chunks, pgvector or a hosted vector DB wins.

### Can I use my own embeddings?

Yes — the embedding function is a config knob. We have run OpenAI text-embedding-3-small and bge-large-en-v1.5 in production.

### Does the voice latency budget kill RAG?

Only if you skip rerank or retrieve too many chunks. With k=8 → rerank → top-3, total retrieval round-trip is 80-150ms.

### How do you keep the KB fresh?

GitHub repo of source docs → CI/CD pipeline re-chunks and upserts on push. ChromaDB upsert is idempotent on chunk ID.
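To keep re-index runs cheap, the CI job can also skip chunks whose content is unchanged. A sketch, assuming a content hash is stored alongside each chunk (e.g. in its metadata):

```python
import hashlib

def chunk_id(doc_id: str, index: int) -> str:
    """Stable ID: upserting the same doc twice overwrites, never duplicates."""
    return f"{doc_id}::{index}"

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def needs_upsert(existing: dict[str, str], doc_id: str,
                 index: int, text: str) -> bool:
    """Compare the stored hash before paying for a re-embed."""
    return existing.get(chunk_id(doc_id, index)) != content_hash(text)

store = {"kb-returns::0": content_hash("30-day return policy.")}
```

Only chunks whose text actually changed get re-embedded; everything else is a no-op on push.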

### Can the agent cite sources verbally?

Yes — each chunk carries a `source_title` metadata field, and the system prompt asks the agent to say "according to our returns policy, ..." when relevant.

## Try the IT Helpdesk Demo

The [/demo](/demo) flow includes the IT Helpdesk RAG path; ask it a policy question and inspect the retrieval log. [/industries/it-helpdesk](/industries/it-helpdesk) has full architecture diagrams.

---

Source: https://callsphere.ai/blog/chromadb-rag-voice-agents-callsphere-vs-vapi-knowledge-base
