By Sagar Shankaran, Founder of CallSphere
Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store.
Key takeaways
TL;DR: Haystack 2.7 introduced a stable `Agent` component with native tool-calling and exit conditions. Pair it with Ollama's Llama 3.2 3B and `pgvector` and you have a fully local, citing, agentic RAG service in ~150 lines.
A FastAPI /chat endpoint that retrieves from a Postgres+pgvector store, calls Llama 3.2 via Ollama, and returns answers with document citations. The agent decides when to call retrieval (vs answer from memory) — agentic, not naive RAG.
- pip install "haystack-ai>=2.7" ollama-haystack pgvector-haystack fastapi uvicorn psycopg2-binary sentence-transformers
- ollama pull llama3.2:3b and ollama pull nomic-embed-text
- .txt/.md documents to index

flowchart LR
Q[User] --> AGT[Haystack Agent]
AGT --> LLM[OllamaChatGenerator llama3.2:3b]
AGT --> RT[retrieve_docs tool]
RT --> EMB[OllamaEmbedder nomic-embed-text]
RT --> PG[(pgvector store)]
AGT -->|cited answer| Q
```bash
docker run -d --name pgv -e POSTGRES_PASSWORD=pw -p 5432:5432 \
  ankane/pgvector:latest
psql postgresql://postgres:pw@127.0.0.1/postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
```python
from pathlib import Path

from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

# embedding_dimension must match the embedding model (nomic-embed-text -> 768).
# recreate_table=True drops and rebuilds the table each time this script runs.
store = PgvectorDocumentStore(
    # Recent pgvector-haystack releases expect a Secret here, not a plain string.
    connection_string=Secret.from_token("postgresql://postgres:pw@127.0.0.1/postgres"),
    table_name="docs",
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

# One Document per Markdown file; the file name becomes the citation source.
raw = [Document(content=p.read_text(), meta={"source": p.name})
       for p in Path("./corpus").glob("*.md")]

# Indexing pipeline: split -> embed -> write.
ix = Pipeline()
ix.add_component("split", DocumentSplitter(split_by="word", split_length=200, split_overlap=30))
ix.add_component("emb", OllamaDocumentEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434"))
ix.add_component("write", DocumentWriter(document_store=store))
ix.connect("split.documents", "emb.documents")
ix.connect("emb.documents", "write.documents")
ix.run({"split": {"documents": raw}})
print("Indexed:", store.count_documents())
```
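To make the `split_length=200, split_overlap=30` settings concrete, here is a minimal pure-Python sketch of word-window chunking with overlap. It is an illustration only, not Haystack's implementation: `split_words` is a hypothetical helper, and `DocumentSplitter` may treat whitespace and the final short chunk differently.

```python
def split_words(text: str, split_length: int = 200, split_overlap: int = 30) -> list:
    """Cut text into windows of `split_length` words; consecutive windows
    share `split_overlap` words. Hypothetical helper for illustration."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(500))
for i, chunk in enumerate(split_words(sample)):
    print(i, len(chunk.split()), "words")
```

With these settings each window advances 170 words, so a sentence cut by one chunk boundary is usually intact in the neighbouring chunk.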
```python
from haystack.tools import Tool
from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# `store` is the PgvectorDocumentStore from the indexing step. If you build it
# again in this file, pass recreate_table=False so the index is not wiped.
emb = OllamaTextEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434")
retr = PgvectorEmbeddingRetriever(document_store=store, top_k=5)


def retrieve_docs(query: str) -> list[dict]:
    """Retrieve top documents matching the query."""
    e = emb.run(text=query)["embedding"]
    docs = retr.run(query_embedding=e)["documents"]
    # Return only what the model needs to cite: file name plus a short snippet.
    return [{"source": d.meta.get("source"), "snippet": d.content[:400]} for d in docs]


retrieve_tool = Tool(
    name="retrieve_docs",
    description="Search the knowledge base for documents relevant to a query.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    function=retrieve_docs,
)
```
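The `parameters` schema tells the model which arguments the tool accepts. As a rough sketch of what happens on the receiving side, the snippet below checks a tool call against that schema before invoking the function. Everything here is a stand-in for illustration (`SCHEMA`, `fake_retrieve_docs`, `dispatch`, and the shape of the tool-call dict); Haystack handles this internally and may do it differently.

```python
import json

# Hypothetical copy of the schema passed to Tool(parameters=...).
SCHEMA = {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}


def fake_retrieve_docs(query: str) -> list:
    """Stub for retrieve_docs so the sketch runs without Postgres or Ollama."""
    return [{"source": "onboarding.md", "snippet": f"(stub result for: {query})"}]


def dispatch(tool_call: dict, schema: dict, function) -> str:
    """Validate required arguments and string types, then call the function.
    Returns a JSON string, which is what gets handed back to the model."""
    args = tool_call.get("arguments", {})
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            return json.dumps({"error": f"unknown argument: {name}"})
        if spec.get("type") == "string" and not isinstance(value, str):
            return json.dumps({"error": f"{name} must be a string"})
    return json.dumps(function(**args))


print(dispatch({"name": "retrieve_docs", "arguments": {"query": "MFA"}}, SCHEMA, fake_retrieve_docs))
print(dispatch({"name": "retrieve_docs", "arguments": {}}, SCHEMA, fake_retrieve_docs))
```

Returning an error payload instead of raising gives a small model like Llama 3.2 3B a chance to retry with corrected arguments.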
```python
from haystack_integrations.components.generators.ollama import OllamaChatGenerator
from haystack.components.agents import Agent
from haystack.dataclasses import ChatMessage

llm = OllamaChatGenerator(
    model="llama3.2:3b",
    url="http://127.0.0.1:11434",
    generation_kwargs={"temperature": 0.3},
)

agent = Agent(
    chat_generator=llm,
    tools=[retrieve_tool],
    system_prompt=(
        "You are a helpful assistant. When you don't know an answer or when the user asks "
        "about company-specific topics, call retrieve_docs to find context. "
        "Always cite sources by file name in [brackets]."
    ),
    exit_conditions=["text"],  # stop when LLM produces text (not a tool call)
    max_agent_steps=5,
)
```
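The interplay of `exit_conditions=["text"]` and `max_agent_steps` is easier to see in a stripped-down loop. The sketch below is not Haystack's `Agent` code; `Msg`, `run_agent`, and `scripted_llm` are hypothetical stand-ins, and the "LLM" is a scripted function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:
    role: str
    text: Optional[str] = None
    tool_call: Optional[dict] = None


def run_agent(llm, tools: dict, messages: list, max_agent_steps: int = 5) -> list:
    """Call the LLM until it answers in plain text or the step budget runs out."""
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_agent_steps):
        reply = llm(messages)
        messages.append(reply)
        if reply.tool_call is None:  # the "text" exit condition
            break
        result = tools[reply.tool_call["name"]](**reply.tool_call["arguments"])
        messages.append(Msg(role="tool", text=str(result)))
    return messages


def scripted_llm(messages: list) -> Msg:
    """Fake model: request retrieval once, then answer with a citation."""
    if not any(m.role == "tool" for m in messages):
        return Msg("assistant", tool_call={"name": "retrieve_docs",
                                           "arguments": {"query": messages[-1].text}})
    return Msg("assistant", text="MFA is required for all staff [onboarding.md]")


tools = {"retrieve_docs": lambda query: [{"source": "onboarding.md", "snippet": "MFA is required."}]}
for m in run_agent(scripted_llm, tools, [Msg("user", "What about MFA?")]):
    print(m.role, m.text or m.tool_call)
```

The step cap matters with small models: if the model keeps emitting tool calls, the loop still terminates after `max_agent_steps` iterations.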
```python
from fastapi import FastAPI
from pydantic import BaseModel
from haystack.dataclasses import ChatMessage

from agent import agent

app = FastAPI()
SESSIONS = {}  # in-memory chat history, keyed by session_id


class Q(BaseModel):
    session_id: str
    message: str


@app.post("/chat")
def chat(q: Q):
    history = SESSIONS.setdefault(q.session_id, [])
    history.append(ChatMessage.from_user(q.message))
    out = agent.run(messages=history)
    reply = out["messages"][-1]
    history.append(reply)
    return {"answer": reply.text}
```
```bash
uvicorn server:app --port 8002 &
curl -s -XPOST http://127.0.0.1:8002/chat \
  -H 'content-type: application/json' \
  -d '{"session_id":"u1","message":"What does the onboarding doc say about MFA?"}'
```
The agent decides whether to call retrieve_docs, weaves snippets into the answer, and cites sources by filename.
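Because the citations are produced by the model, it can be worth checking them against what retrieval actually returned. The following is a minimal sketch under the assumption that citations appear as `[filename]` in the answer, as the system prompt requests; `cited_sources` and `unsupported_citations` are hypothetical helpers, not part of Haystack.

```python
import re


def cited_sources(answer: str) -> set:
    """Collect every [bracketed] token in the answer."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))


def unsupported_citations(answer: str, retrieved: list) -> set:
    """Return citations that do not match any retrieved document's source."""
    known = {d["source"] for d in retrieved}
    return cited_sources(answer) - known


retrieved = [{"source": "onboarding.md", "snippet": "MFA is mandatory."}]
answer = "MFA is mandatory [onboarding.md] and keys rotate quarterly [security.md]."
print(unsupported_citations(answer, retrieved))
```

A non-empty result suggests the model cited a file it never saw, which is a reasonable signal to log or to trigger a retry.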
nomic-embed-text produces 768-dimensional embeddings; bge-m3 produces 1024. The store's embedding_dimension must match the model, so pick one and reindex if you switch.

CallSphere uses a similar agentic RAG pattern across our 37 voice and chat specialists in 6 verticals: Healthcare's 14 HIPAA tools on FastAPI :8084 with OpenAI Realtime, OneRoof Property's 10 specialists, plus Salon, Dental, F&B, and Behavioral. That covers 90+ tools and 115+ Postgres tables, all citation-aware.
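Returning to the embedding-dimension caveat: a mismatch between the model and the table usually only surfaces at write or query time, so a small guard can fail earlier with a clearer message. This is a sketch only; `MODEL_DIMS` and `check_embedding` are hypothetical names, and the two dimensions are the ones stated in this article.

```python
# Dimensions as stated in this article; extend for other models you use.
MODEL_DIMS = {"nomic-embed-text": 768, "bge-m3": 1024}


def check_embedding(model: str, embedding: list, store_dimension: int) -> None:
    """Raise ValueError if the vector cannot be stored in a table of `store_dimension`."""
    expected = MODEL_DIMS.get(model)
    if expected is not None and expected != store_dimension:
        raise ValueError(
            f"{model} produces {expected}-d vectors but the store expects "
            f"{store_dimension}; reindex with a matching embedding_dimension"
        )
    if len(embedding) != store_dimension:
        raise ValueError(f"got a {len(embedding)}-d vector, store expects {store_dimension}")


check_embedding("nomic-embed-text", [0.0] * 768, store_dimension=768)  # passes silently
```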
Why Haystack over LangChain? Cleaner pipeline graph + first-class document store integrations.
Local embeddings vs OpenAI? nomic-embed-text is good enough for English/code; bge-m3 for multilingual.
Streaming? OllamaChatGenerator supports streaming since 1.7.
Multi-tenant doc stores? Use a tenant_id meta filter on the retriever.
Add reranking? SentenceTransformersDiversityRanker between retriever and prompt.
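For the multi-tenant question above, Haystack 2.x expresses metadata filters as nested dictionaries with `field`, `operator`, and `value` keys. The sketch below builds such a filter; `tenant_filter` is a hypothetical helper, and it assumes each indexed document carries a `tenant_id` entry in its `meta`, which the indexing code in this article does not add.

```python
from typing import Optional


def tenant_filter(tenant_id: str, extra: Optional[dict] = None) -> dict:
    """Build a filter restricting retrieval to one tenant.
    If `extra` is given, both conditions must hold (logical AND)."""
    condition = {"field": "meta.tenant_id", "operator": "==", "value": tenant_id}
    if extra is None:
        return condition
    return {"operator": "AND", "conditions": [condition, extra]}


print(tenant_filter("acme"))
print(tenant_filter("acme", {"field": "meta.source", "operator": "==", "value": "onboarding.md"}))

# Assumed usage inside retrieve_docs, with `retr` being the retriever from the
# tool step:
#   docs = retr.run(query_embedding=e, filters=tenant_filter(tenant_id))["documents"]
```

Since the tenant must come from the authenticated session rather than from the model, it is safer to bind `tenant_id` in server code than to expose it as a tool parameter.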