ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base
How CallSphere uses ChromaDB embeddings + a Lookup specialist agent for voice RAG vs Vapi PDF Knowledge Base. Retrieval quality, indexing, costs.
TL;DR
Vapi Knowledge Base lets you upload PDFs and documents that the assistant can cite during a call — managed embedding, managed retrieval, opaque chunking. CallSphere runs ChromaDB as a self-hosted vector store with a dedicated Lookup specialist agent in IT Helpdesk that performs explicit retrieve-then-answer. Both work for FAQ-style queries; CallSphere's approach gives you tunable chunking, custom retrievers (BM25 hybrid, MMR), and the ability to inspect every retrieval that influenced an answer.
If you can ship one PDF and never look back, Vapi is fine. If you need to know why the agent answered "30-day return policy" instead of "60-day," you need an inspectable RAG pipeline.
Voice RAG Is Different From Chat RAG
Voice agents have constraints chat does not:
- Latency budget — you have ~250ms before the user notices a gap
- Token cost — every retrieved chunk lives in the Realtime context across turns
- Truncation — the LLM has to summarize, not quote, because a spoken answer cannot carry citation footnotes
- Failure handling — when retrieval misses, the model must say "I'm not sure" rather than hallucinate
These constraints push you toward smaller chunks, fewer of them, and explicit confidence thresholds.
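A rough sketch of what those constraints look like once turned into settings (the class and field names below are illustrative, not CallSphere's actual config; the values echo the numbers used later in this post):

from dataclasses import dataclass

@dataclass
class VoiceRagConfig:
    # Small chunks keep spoken answers short and the Realtime context lean.
    chunk_tokens: int = 180
    chunk_overlap: int = 30
    # Retrieve a handful of candidates, but inject only a few into the prompt.
    retrieve_k: int = 8
    inject_top_n: int = 3
    # Below this score the agent admits uncertainty instead of guessing.
    min_confidence: float = 0.55
    # Ceiling on retrieval round-trip so the voice pipeline never stalls.
    retrieval_timeout_ms: int = 150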
Vapi Knowledge Base Approach
Vapi exposes a Knowledge Base as a per-assistant resource:
{
  "knowledgeBase": {
    "provider": "trieve",
    "topK": 5,
    "fileIds": ["file_abc123", "file_def456"]
  }
}
Behind the scenes, documents are chunked, embedded, and indexed in Vapi's managed vector store. At call time, every user query triggers a retrieval and the top-K chunks are injected into the LLM context.
Strengths: zero infra, drop a PDF, done.
Weaknesses:
- Chunking strategy is fixed
- Hybrid retrieval (BM25 + dense) is not exposed
- You cannot inspect which chunks were retrieved for a given turn
- Re-indexing on document update is manual
- No metadata filtering (e.g., "only retrieve from Q1 2026 docs")
- Citations in voice responses are vague
CallSphere ChromaDB Approach
CallSphere ships with ChromaDB embedded in the IT Helpdesk vertical. The architecture is:
User question
↓
Orchestrator (IT Triage)
↓ hand_off if knowledge query
Lookup Specialist Agent
↓ tool: retrieve_kb(query, filters, k=8)
ChromaDB (sentence-transformers/all-MiniLM-L6-v2 embeddings)
↓ top-K chunks with metadata
Re-rank (Cohere rerank-3 optional, BM25 hybrid)
↓ top-3 chunks
LLM (gpt-4o-realtime) generates audio response
↓
Postgres call_logs.retrievals[] for audit
Indexing Pipeline
The IT Helpdesk ingestion script:
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="/data/chroma")

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

col = client.get_or_create_collection(
    name="it_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

def index_doc(doc_path: str, doc_meta: dict):
    # semantic_chunker is a project-local helper that splits a document into
    # ~180-token chunks with 30-token overlap (not shown here).
    chunks = semantic_chunker(doc_path, target_tokens=180, overlap=30)
    for i, chunk in enumerate(chunks):
        col.upsert(
            ids=[f"{doc_meta['id']}::{i}"],
            documents=[chunk.text],
            metadatas=[{
                **doc_meta,
                "chunk_index": i,
                "section": chunk.section,
                "updated_at": chunk.updated_at,
            }],
        )
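Ingesting a document is then one call per file; a hypothetical invocation (the path and metadata values are illustrative):

index_doc(
    "docs/policies/returns.md",
    {
        "id": "returns-policy",
        "source_title": "Returns Policy",
        "section": "returns",
    },
)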
Three deliberate choices:
- 180-token chunks — small enough for voice context, large enough for semantic coherence
- 30-token overlap — preserves cross-boundary entities
- Metadata-rich — enables filters like
{"section": "returns", "updated_at": {"$gte": "2026-01-01"}}
Retrieval Tool
The Lookup specialist exposes a single tool:
import uuid

@tool
async def retrieve_kb(
    query: str,
    section_filter: str | None = None,
    k: int = 8,
) -> RetrievalResult:
    # Use None when there is no filter (an empty where dict can trip ChromaDB's validation).
    where = {"section": section_filter} if section_filter else None
    raw = col.query(
        query_texts=[query],
        n_results=k,
        where=where,
    )
    # Hybrid: blend dense scores with BM25 from a parallel index
    bm25_scores = bm25_index.get_scores(query, raw["ids"][0])
    blended = blend(raw["distances"][0], bm25_scores, alpha=0.7)
    # Rerank the k candidates down to the 3 that get injected into the prompt
    top3 = cohere_rerank(query, raw["documents"][0], top_n=3)
    return RetrievalResult(
        chunks=top3,
        confidence=max(blended),
        retrieval_id=str(uuid.uuid4()),
    )
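bm25_index, cohere_rerank, and blend are project-side helpers rather than ChromaDB APIs. A minimal sketch of what the blending step could look like, assuming ChromaDB returns cosine distances (lower is closer) and the BM25 scores need normalizing onto the same scale:

def blend(distances: list[float], bm25_scores: list[float], alpha: float = 0.7) -> list[float]:
    # Cosine distance -> similarity (distance 0 means an exact match).
    dense = [1.0 - d for d in distances]
    # Min-max normalize BM25 so both signals live on a comparable 0-1 scale.
    lo, hi = min(bm25_scores), max(bm25_scores)
    sparse = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in bm25_scores]
    # Weighted mix: alpha toward dense, (1 - alpha) toward lexical.
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]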
Confidence Threshold
If confidence < 0.55, the specialist tells the user "I am not sure — let me transfer you to a human agent" rather than hallucinate an answer. This is the single most important RAG pattern for voice.
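In code this is nothing more than a guard in front of answer generation. A sketch built on the retrieve_kb tool above; speak, generate_answer, and escalate_to_human are placeholders for whatever TTS, LLM, and transfer plumbing your stack uses:

async def answer_or_escalate(user_question: str, call_id: str) -> None:
    result = await retrieve_kb(query=user_question)
    if result.confidence < 0.55:
        # Below threshold: admit uncertainty and hand off instead of guessing.
        await speak("I'm not sure about that. Let me transfer you to a human agent.")
        await escalate_to_human(call_id)
        return
    answer = await generate_answer(user_question, result.chunks)
    await speak(answer)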
Inspectability
Every retrieval gets a retrieval_id written to Postgres:
SELECT
    cl.call_id,
    r.retrieval_id,
    r.query,
    r.chunks_returned,
    r.confidence,
    r.influenced_response_id
FROM call_logs cl
JOIN retrievals r ON r.call_id = cl.call_id
WHERE cl.created_at > NOW() - INTERVAL '24 hours'
  AND r.confidence < 0.7;
This query surfaces low-confidence retrievals from the last day, which feeds the weekly content gap report — "we kept failing to answer X, write a doc."
Vapi vs CallSphere RAG Comparison
| Dimension | Vapi Knowledge Base | CallSphere ChromaDB |
|---|---|---|
| Vector store | Managed (Trieve) | ChromaDB self-hosted |
| Embedding model | Provider default | all-MiniLM-L6-v2 (swappable) |
| Chunking | Fixed | Configurable, semantic |
| Hybrid retrieval | Not exposed | BM25 + dense blend |
| Reranking | Built-in (opaque) | Cohere rerank-3 optional |
| Metadata filter | Limited | Full where-clause |
| Confidence threshold | Implicit | Explicit, configurable |
| Inspect retrieval logs | No | Per-turn in Postgres |
| Re-indexing | Manual upload | CI/CD pipeline |
| Cost | Bundled in Vapi pricing | Compute + embedding |
RAG Retrieval Pipeline
graph LR
    Q[User voice query] --> Orch[Orchestrator]
    Orch -->|hand_off| Lookup[Lookup Specialist]
    Lookup -->|retrieve_kb| Embed[Embed query<br/>MiniLM-L6-v2]
    Embed --> Chroma[(ChromaDB<br/>cosine)]
    Lookup --> BM25[BM25 index]
    Chroma --> Blend[Blend α=0.7]
    BM25 --> Blend
    Blend --> Rerank[Cohere rerank-3]
    Rerank --> Conf{conf > 0.55?}
    Conf -->|yes| LLM[gpt-4o-realtime]
    Conf -->|no| Escalate[Escalate to human]
    LLM --> Audio[PCM16 response]
    LLM --> Log[(retrievals log)]
Practical Tips
- Cap context at 3 chunks. More chunks = more tokens = more first-token latency.
- Embed FAQ-shaped paraphrases of source docs. Customers ask in question form; docs are written in declarative form (see the sketch after this list).
- Re-embed when you change chunkers. Mixing chunkers in one collection is a silent quality-killer.
- Always log query + chunk IDs. Without this you cannot debug RAG failures.
- Use metadata filters aggressively. "Only retrieve from active products" beats relevance ranking on stale data.
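For the paraphrase tip, the pattern is to index question-shaped variants alongside the canonical chunk, all carrying metadata that points back at the same source document. A sketch against the it_kb collection from earlier (the paraphrases themselves are illustrative):

paraphrases = [
    "How long do I have to return an item?",
    "Can I send a product back after two weeks?",
    "What's your refund window?",
]

col.upsert(
    ids=[f"returns-policy::faq::{i}" for i in range(len(paraphrases))],
    documents=paraphrases,
    metadatas=[
        {"id": "returns-policy", "section": "returns", "kind": "faq_paraphrase"}
        for _ in paraphrases
    ],
)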
FAQ
Why ChromaDB and not pgvector?
Both work. ChromaDB has lighter operational overhead for the IT Helpdesk scale (50K-500K chunks). At 5M+ chunks, pgvector or a hosted vector DB wins.
Can I use my own embeddings?
Yes — the embedding function is a config knob. We have run OpenAI text-embedding-3-small and bge-large-en-v1.5 in production.
Does the voice latency budget kill RAG?
Only if you skip rerank or retrieve too many chunks. With k=8 → rerank → top-3, total retrieval round-trip is 80-150ms.
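If you want to check that number on your own deployment, a thin timing wrapper around the retrieval tool is enough; retrieve_kb is the tool defined earlier, and the 150 ms ceiling is the upper end of the range above:

import time

async def timed_retrieve_kb(query: str, **kwargs):
    start = time.perf_counter()
    result = await retrieve_kb(query, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > 150:
        # Flag retrievals that threaten the voice latency budget.
        print(f"slow retrieval {result.retrieval_id}: {elapsed_ms:.0f} ms")
    return result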
How do you keep the KB fresh?
GitHub repo of source docs → CI/CD pipeline re-chunks and upserts on push. ChromaDB upsert is idempotent on chunk ID.
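A minimal version of that CI step, reusing index_doc from the indexing pipeline above; the docs/ layout and metadata derivation are assumptions for illustration:

from pathlib import Path

def reindex_repo(docs_root: str = "docs") -> None:
    for path in sorted(Path(docs_root).rglob("*.md")):
        # Upsert ids are "<doc_id>::<chunk_index>", so re-running is idempotent.
        index_doc(str(path), {"id": path.stem, "section": path.parent.name})

if __name__ == "__main__":
    reindex_repo()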
Can the agent cite sources verbally?
Yes — each chunk carries a source_title metadata field, and the system prompt asks the agent to say "according to our returns policy, ..." when relevant.
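A sketch of how retrieved chunks might be laid out for the model so it can cite titles naturally; the chunk dict shape and the prompt wording here are assumptions, with source_title following the metadata convention from the indexing pipeline:

def format_chunks_for_prompt(chunks: list[dict]) -> str:
    # Prefix each excerpt with its source title so the model can cite it aloud.
    lines = []
    for chunk in chunks:
        title = chunk["metadata"].get("source_title", "our documentation")
        lines.append(f'[{title}] {chunk["text"]}')
    instruction = (
        "Answer using only the excerpts below. When you rely on one, "
        'mention its title naturally, e.g. "according to our returns policy".'
    )
    return instruction + "\n\n" + "\n\n".join(lines)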
Try the IT Helpdesk Demo
The /demo flow includes the IT Helpdesk RAG path; ask it a policy question and inspect the retrieval log. /industries/it-helpdesk has full architecture diagrams.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.