By Sagar Shankaran, Founder of CallSphere
Index a knowledge base with text-embedding-3-large into ChromaDB, expose a retrieve tool to your voice agent, and ground every answer in real documents — full Python tutorial.
Key takeaways
TL;DR: A voice agent without RAG hallucinates pricing, hours, and policy. Add a single retrieve tool backed by ChromaDB and text-embedding-3-large, and accuracy on factual questions jumps from ~70% to >95%.
A Python voice agent that answers questions strictly from your indexed knowledge base. Caller asks "what's your refund policy?" — the agent calls the retrieve tool, fetches top-3 chunks, and reads back the grounded answer with no hallucination.
pip install chromadb openai.
OPENAI_API_KEY exported.
flowchart LR
Q[Caller question] --> R[retrieve tool]
R --> E[OpenAI embed-3-large]
E --> C[ChromaDB query]
C --> T[Top-3 chunks]
T --> M[Realtime model]
M --> A[Grounded answer]
```python
import os

import chromadb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./kb")
col = chroma.get_or_create_collection(name="callsphere_kb")


def chunk(text, size=800, overlap=100):
    # Slide a window of `size` characters, stepping by size - overlap.
    out = []
    for i in range(0, len(text), size - overlap):
        out.append(text[i:i + size])
    return out


def embed(texts):
    # 1024 dimensions instead of the default 3072 keeps storage and query cost down.
    r = client.embeddings.create(
        model="text-embedding-3-large", input=texts, dimensions=1024
    )
    return [d.embedding for d in r.data]


for fname in sorted(os.listdir("docs")):
    with open(os.path.join("docs", fname), encoding="utf-8") as f:
        text = f.read()
    chunks = chunk(text)
    if not chunks:
        continue  # skip empty files: there is nothing to embed
    col.add(
        ids=[f"{fname}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": fname} for _ in chunks],
    )

print("Indexed", col.count(), "chunks")
```
Running this once embeds your KB; ChromaDB persists vectors locally.
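To see what the overlap in chunk() does, here is a minimal sketch that runs the same sliding-window logic on a short string. The sample text and the small size/overlap values are made up for illustration (the indexing script uses 800 and 100), and the overlap guard is an addition that the original function does not have.

```python
def chunk(text, size=800, overlap=100):
    """Split text into windows of `size` chars; neighbours share `overlap` chars."""
    if overlap >= size:
        # Added guard: a non-positive step would make range() fail or loop badly.
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical sample: 25 characters, windows of 10 with 4 shared characters.
sample = "abcdefghijklmnopqrstuvwxy"
pieces = chunk(sample, size=10, overlap=4)
for p in pieces:
    print(repr(p))
```

Each window starts with the last `overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in one of the two chunks. Note that the final window can be a very short tail; you may want to drop or merge such tails before embedding.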
```python
def retrieve(query: str, k: int = 3) -> str:
    qvec = embed([query])
    res = col.query(query_embeddings=qvec, n_results=k)
    chunks = res["documents"][0]
    sources = [m["source"] for m in res["metadatas"][0]]
    return "\n---\n".join(f"[{s}]\n{c}" for s, c in zip(sources, chunks))
```
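One thing retrieve does not do is decide whether the top-k chunks are actually relevant: a vector query always returns its nearest neighbours, however far away they are. The sketch below shows one possible way to add a distance cutoff. It assumes the query result also carries a "distances" list alongside "documents" and "metadatas"; the function name format_hits, the threshold value, and the NO_RELEVANT_DOCUMENTS marker are all hypothetical, and a sensible threshold depends on your collection's distance metric and data.

```python
def format_hits(res, max_distance=1.0):
    """Keep only hits whose distance is at or below max_distance.

    `res` is shaped like a ChromaDB query result for a single query:
    parallel lists under "documents", "metadatas", and "distances".
    max_distance is a hypothetical cutoff; tune it on your own data.
    """
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    dists = res["distances"][0]
    kept = [
        f"[{m['source']}]\n{d}"
        for d, m, dist in zip(docs, metas, dists)
        if dist <= max_distance
    ]
    # An explicit marker gives the model something concrete to react to.
    return "\n---\n".join(kept) if kept else "NO_RELEVANT_DOCUMENTS"


# Hand-written stand-in for a query result, so the sketch runs without a database.
fake = {
    "documents": [["Refunds within 30 days.", "Office party photos."]],
    "metadatas": [[{"source": "refunds.md"}, {"source": "blog.md"}]],
    "distances": [[0.42, 1.73]],
}
print(format_hits(fake))
```

With a cutoff like this, the agent has a clear "nothing found" signal to act on instead of being handed weak matches that it may read back as fact.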
For OpenAI Realtime over WebSocket, you declare tools in session.update and intercept response.function_call_arguments.done:
```python
TOOLS = [{
    "type": "function",
    "name": "retrieve",
    "description": (
        "Retrieve documents from the CallSphere knowledge base. "
        "Use ALWAYS for policy, pricing, hours."
    ),
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": (
            "Answer ONLY using retrieve tool output. "
            "If retrieve returns nothing relevant, say you don't know."
        ),
        "tools": TOOLS,
        "tool_choice": "auto",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}
```
```python
import json


async def handle_events(oai):
    # `oai` is the open WebSocket connection to the Realtime API.
    async for raw in oai:
        ev = json.loads(raw)
        if ev["type"] == "response.function_call_arguments.done":
            args = json.loads(ev["arguments"])
            result = retrieve(args["query"])
            # Return the tool output, then ask the model to continue speaking.
            await oai.send(json.dumps({
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": ev["call_id"],
                    "output": result,
                },
            }))
            await oai.send(json.dumps({"type": "response.create"}))
```
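The event loop above mixes parsing, retrieval, and socket I/O, which makes it hard to test without a live connection. A minimal sketch of one way to pull the payload building into a pure function follows. The names build_tool_reply and fake_retrieve are hypothetical, the event shapes simply mirror the ones used in the loop, and the error message text is an assumption.

```python
import json


def build_tool_reply(ev, retrieve_fn):
    """Return the messages to send back for one server event, or [] to ignore it."""
    if ev.get("type") != "response.function_call_arguments.done":
        return []
    try:
        args = json.loads(ev["arguments"])
        output = retrieve_fn(args["query"])
    except (json.JSONDecodeError, KeyError):
        # Malformed arguments: tell the model instead of crashing the call.
        output = "Tool error: missing or invalid 'query' argument."
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": output,
            },
        },
        {"type": "response.create"},
    ]


def fake_retrieve(query):
    # Hypothetical stand-in for the real retrieve() tool.
    return f"[refunds.md]\nAnswer for: {query}"


event = {
    "type": "response.function_call_arguments.done",
    "call_id": "call_1",
    "arguments": json.dumps({"query": "refund policy"}),
}
for msg in build_tool_reply(event, fake_retrieve):
    print(json.dumps(msg))
```

Inside the loop you would then send each returned message with `await oai.send(json.dumps(msg))`, keeping the network code to a couple of lines.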
text-embedding-3-large with dimensions=1024 (cheaper, almost as good as 3072).
Metadata filter (where={"source": "pricing.md"}) when the agent already knows the topic.
```python
import json

from openai import OpenAI

client = OpenAI()


def rerank(query, chunks):
    # json_object mode needs the prompt to ask for JSON explicitly.
    prompt = (
        f"Score each passage 0-10 for answering: {query}\n"
        'Reply with a JSON object mapping each passage index to its score, e.g. {"0": 7}.\n\n'
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(r.choices[0].message.content)
    # JSON keys are strings, so convert back to int before indexing.
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    return [chunks[int(i)] for i in top]
```
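Going back to the metadata filter tip above, here is a small sketch of how the where clause could be added only when a topic is known. The topic-to-file mapping and the helper name query_kwargs are hypothetical; the only thing taken from the tutorial is the where={"source": ...} shape and the query_embeddings/n_results arguments already used in retrieve.

```python
# Hypothetical topic -> file mapping; the file names are placeholders.
TOPIC_SOURCES = {
    "pricing": "pricing.md",
    "refunds": "refunds.md",
}


def query_kwargs(qvec, topic=None, k=3):
    """Build keyword arguments for col.query, adding a where filter if the topic is known."""
    kwargs = {"query_embeddings": qvec, "n_results": k}
    source = TOPIC_SOURCES.get(topic)
    if source is not None:
        kwargs["where"] = {"source": source}
    return kwargs


print(query_kwargs([[0.1, 0.2]], topic="pricing"))
print(query_kwargs([[0.1, 0.2]], topic="weather"))
```

In retrieve, the call would become something like `col.query(**query_kwargs(embed([query]), topic=topic))`. Unknown topics fall back to an unfiltered search rather than returning nothing.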
Without "Answer ONLY using retrieve" in the prompt, the model still hallucinates. Be explicit.
CallSphere's Healthcare agent retrieves from a 5,000-chunk KB (clinic protocols, insurance acceptance, hours per location) before every factual answer. We use text-embedding-3-large at 1024 dims, ChromaDB self-hosted on k3s, and re-index nightly. Hit rate on 200 eval Q/A: 96.5%. Lead score and sentiment are appended post-call to Postgres.
ChromaDB vs Pinecone? Chroma is great for self-hosted and <10M vectors. Pinecone is managed and scales to billions.
Embedding cost? text-embedding-3-large is $0.13 per 1M tokens. Indexing 100k docs is usually under $20.
How do I refresh the index? Re-run index.py nightly via cron, or add an upsert endpoint that re-embeds changed files only.
Why not stuff all docs in the system prompt? Beyond ~10k tokens, latency tanks and accuracy drops. RAG keeps prompts tight.
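For the refresh question above, re-embedding only changed files needs some way to detect changes. A minimal sketch using content hashes follows; it is one possible approach, not the author's index.py. The function names, the idea of a hash manifest, and the sample files are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, used as a cheap change fingerprint."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def scan(docs_dir):
    """Map each file name in docs_dir to its content digest."""
    return {
        p.name: file_digest(p)
        for p in sorted(Path(docs_dir).iterdir())
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Names that are new or whose content differs from the previous scan."""
    return [n for n, d in new_manifest.items() if old_manifest.get(n) != d]


# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "pricing.md").write_text("Plan A: $10")
    (docs / "hours.md").write_text("9 to 5")
    first = scan(docs)
    (docs / "pricing.md").write_text("Plan A: $12")
    second = scan(docs)

print(changed_files(first, second))
```

In a nightly job you would persist the manifest (for example as JSON) only after the changed files have been re-embedded successfully, so a failed run is retried next time. Files that disappear from the folder are not handled here; their chunks would need to be removed from the collection separately.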
Written by Sagar Shankaran, Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.