Technical Guides

ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base

How CallSphere uses ChromaDB embeddings and a Lookup specialist agent for voice RAG, versus Vapi's PDF Knowledge Base: retrieval quality, indexing, and costs.

TL;DR

Vapi Knowledge Base lets you upload PDFs and documents that the assistant can cite during a call — managed embedding, managed retrieval, opaque chunking. CallSphere runs ChromaDB as a self-hosted vector store with a dedicated Lookup specialist agent in IT Helpdesk that performs explicit retrieve-then-answer. Both work for FAQ-style queries; CallSphere's approach gives you tunable chunking, custom retrievers (BM25 hybrid, MMR), and the ability to inspect every retrieval that influenced an answer.

If you can ship one PDF and never look back, Vapi is fine. If you need to know why the agent answered "30-day return policy" instead of "60-day," you need an inspectable RAG pipeline.

Voice RAG Is Different From Chat RAG

Voice agents have constraints chat does not:

  • Latency budget — you have ~250ms before the user notices a gap
  • Token cost — every retrieved chunk lives in the Realtime context across turns
  • Truncation — the LLM has to summarize, not quote, because a listener cannot follow citation footnotes in audio
  • Failure handling — when retrieval misses, the model must say "I'm not sure" rather than hallucinate

These constraints push you toward smaller chunks, fewer of them, and explicit confidence thresholds.
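A quick back-of-envelope shows why. Assuming the 180-token chunks and top-3 injection described below, and that every turn triggers a retrieval whose chunks persist in the Realtime context (the 40-token framing overhead is an illustrative assumption):

```python
# Rough per-turn retrieval overhead in a voice agent's context window.
# Assumed numbers: 180-token chunks, top-3 injected per turn, ~40 tokens
# of framing ("Here is relevant context: ...") around each injection.
CHUNK_TOKENS = 180
TOP_K = 3
FRAMING_TOKENS = 40

def retrieval_overhead(turns: int) -> int:
    """Tokens added to the Realtime context after `turns` turns,
    assuming every turn retrieves and injected chunks persist."""
    return turns * (TOP_K * CHUNK_TOKENS + FRAMING_TOKENS)

print(retrieval_overhead(1))   # 580 tokens after one turn
print(retrieval_overhead(10))  # 5800 tokens after ten turns
```

Ten turns of naive top-3 injection costs several thousand tokens of context, which is why trimming to fewer, smaller chunks matters more in voice than in chat.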

Vapi Knowledge Base Approach

Vapi exposes a Knowledge Base as a per-assistant resource:

{
  "knowledgeBase": {
    "provider": "trieve",
    "topK": 5,
    "fileIds": ["file_abc123", "file_def456"]
  }
}

Behind the scenes: documents are chunked, embedded, indexed in their managed vector store. At call time, every user query triggers a retrieval and the top-K chunks are injected into the LLM context.

Strengths: zero infra, drop a PDF, done.

Weaknesses:

  • Chunking strategy is fixed
  • Hybrid retrieval (BM25 + dense) is not exposed
  • You cannot inspect which chunks were retrieved for a given turn
  • Re-indexing on document update is manual
  • No metadata filtering (e.g., "only retrieve from Q1 2026 docs")
  • Citations in voice responses are vague

CallSphere ChromaDB Approach

CallSphere ships with ChromaDB embedded in the IT Helpdesk vertical. The architecture is:


User question
   ↓
Orchestrator (IT Triage)
   ↓ hand_off if knowledge query
Lookup Specialist Agent
   ↓ tool: retrieve_kb(query, filters, k=8)
ChromaDB (sentence-transformers/all-MiniLM-L6-v2 embeddings)
   ↓ top-K chunks with metadata
Re-rank (Cohere rerank-3 optional, BM25 hybrid)
   ↓ top-3 chunks
LLM (gpt-4o-realtime) generates audio response
   ↓
Postgres call_logs.retrievals[] for audit

Indexing Pipeline

The IT Helpdesk ingestion script:

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="/data/chroma")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = client.get_or_create_collection(
    name="it_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

def index_doc(doc_path: str, doc_meta: dict):
    chunks = semantic_chunker(doc_path, target_tokens=180, overlap=30)
    for i, chunk in enumerate(chunks):
        col.upsert(
            ids=[f"{doc_meta['id']}::{i}"],
            documents=[chunk.text],
            metadatas=[{
                **doc_meta,
                "chunk_index": i,
                "section": chunk.section,
                "updated_at": chunk.updated_at,
            }],
        )

Three deliberate choices:

  • 180-token chunks — small enough for voice context, large enough for semantic coherence
  • 30-token overlap — preserves cross-boundary entities
  • Metadata-rich — enables filters like {"section": "returns", "updated_at": {"$gte": "2026-01-01"}}
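`semantic_chunker` is a CallSphere-internal helper and isn't shown above. A minimal sliding-window approximation of the first two choices (ignoring the sentence- and section-boundary snapping a real semantic chunker would do) might look like:

```python
def chunk_tokens(tokens: list[str], target: int = 180, overlap: int = 30) -> list[list[str]]:
    """Minimal sliding-window chunker: `target`-token windows that step
    forward by (target - overlap), so `overlap` tokens repeat across each
    boundary. A real semantic chunker would also snap window edges to
    sentence and section boundaries."""
    step = target - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + target])
    return chunks

words = [f"w{i}" for i in range(400)]
pieces = chunk_tokens(words)
# 400 tokens -> windows starting at token 0, 150, 300;
# the last 30 tokens of each window repeat at the start of the next
```

The overlap is what preserves cross-boundary entities: a product name split across a window edge still appears whole in the neighboring chunk.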

Retrieval Tool

The Lookup specialist exposes a single tool:

import uuid

@tool
async def retrieve_kb(
    query: str,
    section_filter: str | None = None,
    k: int = 8,
) -> RetrievalResult:
    # ChromaDB rejects an empty where-dict; pass None for "no filter"
    where = {"section": section_filter} if section_filter else None
    raw = col.query(
        query_texts=[query],
        n_results=k,
        where=where,
    )

    # Hybrid: cosine distances are lower-is-better, so convert to
    # similarities before blending with BM25 scores from a parallel index
    dense_sims = [1.0 - d for d in raw["distances"][0]]
    bm25_scores = bm25_index.get_scores(query, raw["ids"][0])
    blended = blend(dense_sims, bm25_scores, alpha=0.7)

    # Order candidates by blended score, then rerank top-8 down to top-3
    ranked = [doc for _, doc in sorted(
        zip(blended, raw["documents"][0]), reverse=True)]
    top3 = cohere_rerank(query, ranked, top_n=3)

    return RetrievalResult(
        chunks=top3,
        confidence=max(blended),
        retrieval_id=str(uuid.uuid4()),
    )
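The `blend` helper is CallSphere-internal and not shown; a minimal sketch, assuming min-max normalization of each score list before the weighted sum (`alpha` weights the dense signal, `1 - alpha` the BM25 signal):

```python
def blend(dense: list[float], bm25: list[float], alpha: float = 0.7) -> list[float]:
    """Min-max normalize each score list to [0, 1], then take a weighted
    sum per candidate. Normalizing first matters because BM25 scores and
    cosine similarities live on different scales."""
    def norm(xs: list[float]) -> list[float]:
        lo, hi = min(xs), max(xs)
        if hi == lo:  # degenerate case: all scores identical
            return [0.5] * len(xs)
        return [(x - lo) / (hi - lo) for x in xs]

    d, b = norm(dense), norm(bm25)
    return [alpha * di + (1 - alpha) * bi for di, bi in zip(d, b)]

print(blend([1.0, 0.0], [0.0, 1.0], alpha=0.7))  # [0.7, 0.3]
```

With `alpha=0.7` the semantic signal dominates, but an exact keyword match can still pull a candidate up — useful for product codes and error strings that embeddings handle poorly.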

Confidence Threshold

If confidence < 0.55, the specialist tells the user "I am not sure — let me transfer you to a human agent" rather than hallucinate an answer. This is the single most important RAG pattern for voice.
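That branch can be sketched as follows (the 0.55 floor matches the tool above; the `RetrievalResult` shape here is a simplified stand-in and the escalation wording is illustrative):

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.55  # tuned per vertical; illustrative default here

@dataclass
class RetrievalResult:
    chunks: list[str]
    confidence: float

def route(result: RetrievalResult) -> str:
    """Decide which path the Lookup specialist takes for this retrieval."""
    if result.confidence < CONFIDENCE_FLOOR:
        return "escalate"  # say "I'm not sure" and hand off to a human
    return "answer"        # ground the spoken reply in result.chunks

print(route(RetrievalResult(chunks=[], confidence=0.42)))  # escalate
```

The key property: the floor is an explicit, logged constant, so you can tune it per deployment and see exactly which calls it diverted.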

Inspectability

Every retrieval gets a retrieval_id written to Postgres:

SELECT
  cl.call_id,
  r.retrieval_id,
  r.query,
  r.chunks_returned,
  r.confidence,
  r.influenced_response_id
FROM call_logs cl
JOIN retrievals r ON r.call_id = cl.call_id
WHERE cl.created_at > NOW() - INTERVAL '24 hours'
  AND r.confidence < 0.7;

This query surfaces low-confidence retrievals from the last day, which feeds the weekly content gap report — "we kept failing to answer X, write a doc."

Vapi vs CallSphere RAG Comparison

| Dimension | Vapi Knowledge Base | CallSphere ChromaDB |
| --- | --- | --- |
| Vector store | Managed (Trieve) | Self-hosted ChromaDB |
| Embedding model | Provider default | all-MiniLM-L6-v2 (swappable) |
| Chunking | Fixed | Configurable, semantic |
| Hybrid retrieval | Not exposed | BM25 + dense blend |
| Reranking | Built-in (opaque) | Cohere rerank-3 (optional) |
| Metadata filter | Limited | Full where-clause |
| Confidence threshold | Implicit | Explicit, configurable |
| Inspect retrieval logs | No | Per-turn in Postgres |
| Re-indexing | Manual upload | CI/CD pipeline |
| Cost | Bundled in Vapi pricing | Compute + embedding |

RAG Retrieval Pipeline

graph LR
    Q[User voice query] --> Orch[Orchestrator]
    Orch -->|hand_off| Lookup[Lookup Specialist]
    Lookup -->|retrieve_kb| Embed[Embed query<br/>MiniLM-L6-v2]
    Embed --> Chroma[(ChromaDB<br/>cosine)]
    Lookup --> BM25[BM25 index]
    Chroma --> Blend[Blend α=0.7]
    BM25 --> Blend
    Blend --> Rerank[Cohere rerank-3]
    Rerank --> Conf{conf > 0.55?}
    Conf -->|yes| LLM[gpt-4o-realtime]
    Conf -->|no| Escalate[Escalate to human]
    LLM --> Audio[PCM16 response]
    LLM --> Log[(retrievals log)]

Practical Tips

  • Cap context at 3 chunks. More chunks = more tokens = more first-token latency.
  • Embed FAQ-shaped paraphrases of source docs. Customers ask in question form; docs are written in declarative form.
  • Re-embed when you change chunkers. Mixing chunkers in one collection is a silent quality-killer.
  • Always log query + chunk IDs. Without this you cannot debug RAG failures.
  • Use metadata filters aggressively. "Only retrieve from active products" beats relevance ranking on stale data.

FAQ

Why ChromaDB and not pgvector?

Both work. ChromaDB has lighter operational overhead for the IT Helpdesk scale (50K-500K chunks). At 5M+ chunks, pgvector or a hosted vector DB wins.

Can I use my own embeddings?

Yes — the embedding function is a config knob. We have run OpenAI text-embedding-3-small and bge-large-en-v1.5 in production.

Does the voice latency budget kill RAG?

Only if you skip rerank or retrieve too many chunks. With k=8 → rerank → top-3, total retrieval round-trip is 80-150ms.

How do you keep the KB fresh?

GitHub repo of source docs → CI/CD pipeline re-chunks and upserts on push. ChromaDB upsert is idempotent on chunk ID.
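A sketch of what makes that idempotent: IDs derived from doc ID plus chunk index (as in the ingestion script), with a content hash as an illustrative addition for letting CI skip unchanged chunks:

```python
import hashlib

def ids_for(doc_id: str, chunks: list[str]) -> list[str]:
    """Deterministic per-chunk IDs: re-running the pipeline on the same
    doc upserts the same IDs, so repeated pushes rewrite rows instead of
    duplicating them."""
    return [f"{doc_id}::{i}" for i in range(len(chunks))]

def content_hash(chunk: str) -> str:
    """Short content fingerprint; store it in chunk metadata so CI can
    compare and skip upserting chunks whose text did not change."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:12]

print(ids_for("kb1", ["chunk a", "chunk b"]))  # ['kb1::0', 'kb1::1']
```

One caveat with index-based IDs: if an edit shrinks a doc from N chunks to fewer, the pipeline also needs to delete the now-orphaned trailing IDs.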

Can the agent cite sources verbally?

Yes — each chunk carries a source_title metadata field, and the system prompt asks the agent to say "according to our returns policy, ..." when relevant.

Try the IT Helpdesk Demo

The /demo flow includes the IT Helpdesk RAG path; ask it a policy question and inspect the retrieval log. /industries/it-helpdesk has full architecture diagrams.
