ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base
How CallSphere uses ChromaDB embeddings + a Lookup specialist agent for voice RAG vs Vapi PDF Knowledge Base. Retrieval quality, indexing, costs.
TL;DR
Vapi Knowledge Base lets you upload PDFs and documents that the assistant can cite during a call — managed embedding, managed retrieval, opaque chunking. CallSphere runs ChromaDB as a self-hosted vector store with a dedicated Lookup specialist agent in IT Helpdesk that performs explicit retrieve-then-answer. Both work for FAQ-style queries; CallSphere's approach gives you tunable chunking, custom retrievers (BM25 hybrid, MMR), and the ability to inspect every retrieval that influenced an answer.
If you can ship one PDF and never look back, Vapi is fine. If you need to know why the agent answered "30-day return policy" instead of "60-day," you need an inspectable RAG pipeline.
Voice RAG Is Different From Chat RAG
Voice agents have constraints chat does not:
- Latency budget — you have ~250ms before the user notices a gap
- Token cost — every retrieved chunk lives in the Realtime context across turns
- Truncation — the LLM has to summarize, not quote, because a spoken answer cannot carry citation footnotes
- Failure handling — when retrieval misses, the model must say "I'm not sure" rather than hallucinate
These constraints push you toward smaller chunks, fewer of them, and explicit confidence thresholds.
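A rough sketch of what those constraints look like once turned into settings (the class and field names below are illustrative, not CallSphere's actual config; the values echo the numbers used later in this post):

from dataclasses import dataclass

@dataclass
class VoiceRagConfig:
    # Small chunks keep spoken answers short and the Realtime context lean.
    chunk_tokens: int = 180
    chunk_overlap: int = 30
    # Retrieve a handful of candidates, but inject only a few into the prompt.
    retrieve_k: int = 8
    inject_top_n: int = 3
    # Below this score the agent admits uncertainty instead of guessing.
    min_confidence: float = 0.55
    # Ceiling on retrieval round-trip so the voice pipeline never stalls.
    retrieval_timeout_ms: int = 150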
Vapi Knowledge Base Approach
Vapi exposes a Knowledge Base as a per-assistant resource:
{
  "knowledgeBase": {
    "provider": "trieve",
    "topK": 5,
    "fileIds": ["file_abc123", "file_def456"]
  }
}
Behind the scenes, documents are chunked, embedded, and indexed in Vapi's managed vector store. At call time, every user query triggers a retrieval and the top-K chunks are injected into the LLM context.
Strengths: zero infra, drop a PDF, done.
Weaknesses:
- Chunking strategy is fixed
- Hybrid retrieval (BM25 + dense) is not exposed
- You cannot inspect which chunks were retrieved for a given turn
- Re-indexing on document update is manual
- No metadata filtering (e.g., "only retrieve from Q1 2026 docs")
- Citations in voice responses are vague
CallSphere ChromaDB Approach
CallSphere ships with ChromaDB embedded in the IT Helpdesk vertical. The architecture is:
User question
↓
Orchestrator (IT Triage)
↓ hand_off if knowledge query
Lookup Specialist Agent
↓ tool: retrieve_kb(query, filters, k=8)
ChromaDB (sentence-transformers/all-MiniLM-L6-v2 embeddings)
↓ top-K chunks with metadata
Re-rank (Cohere rerank-3 optional, BM25 hybrid)
↓ top-3 chunks
LLM (gpt-4o-realtime) generates audio response
↓
Postgres call_logs.retrievals[] for audit
Indexing Pipeline
The IT Helpdesk ingestion script:
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="/data/chroma")

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

col = client.get_or_create_collection(
    name="it_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

def index_doc(doc_path: str, doc_meta: dict):
    # semantic_chunker is a project-local helper that splits a document into
    # ~180-token chunks with 30-token overlap (not shown here).
    chunks = semantic_chunker(doc_path, target_tokens=180, overlap=30)
    for i, chunk in enumerate(chunks):
        col.upsert(
            ids=[f"{doc_meta['id']}::{i}"],
            documents=[chunk.text],
            metadatas=[{
                **doc_meta,
                "chunk_index": i,
                "section": chunk.section,
                "updated_at": chunk.updated_at,
            }],
        )
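Ingesting a document is then one call per file; a hypothetical invocation (the path and metadata values are illustrative):

index_doc(
    "docs/policies/returns.md",
    {
        "id": "returns-policy",
        "source_title": "Returns Policy",
        "section": "returns",
    },
)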
Three deliberate choices:
- 180-token chunks — small enough for voice context, large enough for semantic coherence
- 30-token overlap — preserves cross-boundary entities
- Metadata-rich — enables filters like
{"section": "returns", "updated_at": {"$gte": "2026-01-01"}}
Retrieval Tool
The Lookup specialist exposes a single tool:
import uuid

@tool
async def retrieve_kb(
    query: str,
    section_filter: str | None = None,
    k: int = 8,
) -> RetrievalResult:
    # Use None when there is no filter (an empty where dict can trip ChromaDB's validation).
    where = {"section": section_filter} if section_filter else None
    raw = col.query(
        query_texts=[query],
        n_results=k,
        where=where,
    )
    # Hybrid: blend dense scores with BM25 from a parallel index
    bm25_scores = bm25_index.get_scores(query, raw["ids"][0])
    blended = blend(raw["distances"][0], bm25_scores, alpha=0.7)
    # Rerank the k candidates down to the 3 that get injected into the prompt
    top3 = cohere_rerank(query, raw["documents"][0], top_n=3)
    return RetrievalResult(
        chunks=top3,
        confidence=max(blended),
        retrieval_id=str(uuid.uuid4()),
    )
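bm25_index, cohere_rerank, and blend are project-side helpers rather than ChromaDB APIs. A minimal sketch of what the blending step could look like, assuming ChromaDB returns cosine distances (lower is closer) and the BM25 scores need normalizing onto the same scale:

def blend(distances: list[float], bm25_scores: list[float], alpha: float = 0.7) -> list[float]:
    # Cosine distance -> similarity (distance 0 means an exact match).
    dense = [1.0 - d for d in distances]
    # Min-max normalize BM25 so both signals live on a comparable 0-1 scale.
    lo, hi = min(bm25_scores), max(bm25_scores)
    sparse = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in bm25_scores]
    # Weighted mix: alpha toward dense, (1 - alpha) toward lexical.
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]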
Confidence Threshold
If confidence < 0.55, the specialist tells the user "I am not sure — let me transfer you to a human agent" rather than hallucinate an answer. This is the single most important RAG pattern for voice.
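In code this is nothing more than a guard in front of answer generation. A sketch built on the retrieve_kb tool above; speak, generate_answer, and escalate_to_human are placeholders for whatever TTS, LLM, and transfer plumbing your stack uses:

async def answer_or_escalate(user_question: str, call_id: str) -> None:
    result = await retrieve_kb(query=user_question)
    if result.confidence < 0.55:
        # Below threshold: admit uncertainty and hand off instead of guessing.
        await speak("I'm not sure about that. Let me transfer you to a human agent.")
        await escalate_to_human(call_id)
        return
    answer = await generate_answer(user_question, result.chunks)
    await speak(answer)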
Inspectability
Every retrieval gets a retrieval_id written to Postgres:
SELECT
    cl.call_id,
    r.retrieval_id,
    r.query,
    r.chunks_returned,
    r.confidence,
    r.influenced_response_id
FROM call_logs cl
JOIN retrievals r ON r.call_id = cl.call_id
WHERE cl.created_at > NOW() - INTERVAL '24 hours'
  AND r.confidence < 0.7;
This query surfaces low-confidence retrievals from the last day, which feeds the weekly content gap report — "we kept failing to answer X, write a doc."
Vapi vs CallSphere RAG Comparison
| Dimension | Vapi Knowledge Base | CallSphere ChromaDB |
|---|---|---|
| Vector store | Managed (Trieve) | ChromaDB self-hosted |
| Embedding model | Provider default | all-MiniLM-L6-v2 (swappable) |
| Chunking | Fixed | Configurable, semantic |
| Hybrid retrieval | Not exposed | BM25 + dense blend |
| Reranking | Built-in (opaque) | Cohere rerank-3 optional |
| Metadata filter | Limited | Full where-clause |
| Confidence threshold | Implicit | Explicit, configurable |
| Inspect retrieval logs | No | Per-turn in Postgres |
| Re-indexing | Manual upload | CI/CD pipeline |
| Cost | Bundled in Vapi pricing | Compute + embedding |
RAG Retrieval Pipeline
graph LR
    Q[User voice query] --> Orch[Orchestrator]
    Orch -->|hand_off| Lookup[Lookup Specialist]
    Lookup -->|retrieve_kb| Embed[Embed query<br/>MiniLM-L6-v2]
    Embed --> Chroma[(ChromaDB<br/>cosine)]
    Lookup --> BM25[BM25 index]
    Chroma --> Blend[Blend α=0.7]
    BM25 --> Blend
    Blend --> Rerank[Cohere rerank-3]
    Rerank --> Conf{conf > 0.55?}
    Conf -->|yes| LLM[gpt-4o-realtime]
    Conf -->|no| Escalate[Escalate to human]
    LLM --> Audio[PCM16 response]
    LLM --> Log[(retrievals log)]
Practical Tips
- Cap context at 3 chunks. More chunks = more tokens = more first-token latency.
- Embed FAQ-shaped paraphrases of source docs. Customers ask in question form; docs are written in declarative form (see the sketch after this list).
- Re-embed when you change chunkers. Mixing chunkers in one collection is a silent quality-killer.
- Always log query + chunk IDs. Without this you cannot debug RAG failures.
- Use metadata filters aggressively. "Only retrieve from active products" beats relevance ranking on stale data.
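For the paraphrase tip, the pattern is to index question-shaped variants alongside the canonical chunk, all carrying metadata that points back at the same source document. A sketch against the it_kb collection from earlier (the paraphrases themselves are illustrative):

paraphrases = [
    "How long do I have to return an item?",
    "Can I send a product back after two weeks?",
    "What's your refund window?",
]

col.upsert(
    ids=[f"returns-policy::faq::{i}" for i in range(len(paraphrases))],
    documents=paraphrases,
    metadatas=[
        {"id": "returns-policy", "section": "returns", "kind": "faq_paraphrase"}
        for _ in paraphrases
    ],
)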
FAQ
Why ChromaDB and not pgvector?
Both work. ChromaDB has lighter operational overhead for the IT Helpdesk scale (50K-500K chunks). At 5M+ chunks, pgvector or a hosted vector DB wins.
Can I use my own embeddings?
Yes — the embedding function is a config knob. We have run OpenAI text-embedding-3-small and bge-large-en-v1.5 in production.
Does the voice latency budget kill RAG?
Only if you skip rerank or retrieve too many chunks. With k=8 → rerank → top-3, total retrieval round-trip is 80-150ms.
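If you want to check that number on your own deployment, a thin timing wrapper around the retrieval tool is enough; retrieve_kb is the tool defined earlier, and the 150 ms ceiling is the upper end of the range above:

import time

async def timed_retrieve_kb(query: str, **kwargs):
    start = time.perf_counter()
    result = await retrieve_kb(query, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > 150:
        # Flag retrievals that threaten the voice latency budget.
        print(f"slow retrieval {result.retrieval_id}: {elapsed_ms:.0f} ms")
    return result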
How do you keep the KB fresh?
GitHub repo of source docs → CI/CD pipeline re-chunks and upserts on push. ChromaDB upsert is idempotent on chunk ID.
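A minimal version of that CI step, reusing index_doc from the indexing pipeline above; the docs/ layout and metadata derivation are assumptions for illustration:

from pathlib import Path

def reindex_repo(docs_root: str = "docs") -> None:
    for path in sorted(Path(docs_root).rglob("*.md")):
        # Upsert ids are "<doc_id>::<chunk_index>", so re-running is idempotent.
        index_doc(str(path), {"id": path.stem, "section": path.parent.name})

if __name__ == "__main__":
    reindex_repo()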
Can the agent cite sources verbally?
Yes — each chunk carries a source_title metadata field, and the system prompt asks the agent to say "according to our returns policy, ..." when relevant.
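A sketch of how retrieved chunks might be laid out for the model so it can cite titles naturally; the chunk dict shape and the prompt wording here are assumptions, with source_title following the metadata convention from the indexing pipeline:

def format_chunks_for_prompt(chunks: list[dict]) -> str:
    # Prefix each excerpt with its source title so the model can cite it aloud.
    lines = []
    for chunk in chunks:
        title = chunk["metadata"].get("source_title", "our documentation")
        lines.append(f'[{title}] {chunk["text"]}')
    instruction = (
        "Answer using only the excerpts below. When you rely on one, "
        'mention its title naturally, e.g. "according to our returns policy".'
    )
    return instruction + "\n\n" + "\n\n".join(lines)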
Try the IT Helpdesk Demo
The /demo flow includes the IT Helpdesk RAG path; ask it a policy question and inspect the retrieval log. /industries/it-helpdesk has full architecture diagrams.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.