
Build a Chat Agent with Haystack RAG + Open LLM (Llama 3.2, 2026)

Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store.

TL;DR — Haystack 2.7 introduced a stable Agent component with native tool-calling and exit conditions. Pair it with Ollama's Llama 3.2 3B and pgvector and you have a fully local, citing, agentic RAG service in ~150 lines.

What you'll build

A FastAPI /chat endpoint that retrieves from a Postgres+pgvector store, calls Llama 3.2 via Ollama, and returns answers with document citations. The agent decides when to call retrieval (vs answer from memory) — agentic, not naive RAG.

Prerequisites

  1. Python 3.11, pip install "haystack-ai>=2.7" ollama-haystack pgvector-haystack fastapi uvicorn psycopg2-binary sentence-transformers.
  2. Postgres 16 with pgvector extension.
  3. Ollama running: ollama pull llama3.2:3b and ollama pull nomic-embed-text.
  4. A folder of .txt/.md documents to index.

Architecture

```mermaid
flowchart LR
  Q[User] --> AGT[Haystack Agent]
  AGT --> LLM[OllamaChatGenerator llama3.2:3b]
  AGT --> RT[retrieve_docs tool]
  RT --> EMB[OllamaEmbedder nomic-embed-text]
  RT --> PG[(pgvector store)]
  AGT -->|cited answer| Q
```

Step 1 — Spin up Postgres + pgvector

```bash
docker run -d --name pgv -e POSTGRES_PASSWORD=pw -p 5432:5432 \
  ankane/pgvector:latest
psql postgresql://postgres:[email protected]/postgres \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

Step 2 — Index documents

```python
# index.py
from pathlib import Path

from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

store = PgvectorDocumentStore(
    table_name="docs",
    embedding_dimension=768,  # must match nomic-embed-text's output size
    vector_function="cosine_similarity",
    connection_string="postgresql://postgres:[email protected]/postgres",
    recreate_table=True,
)

raw = [
    Document(content=p.read_text(), meta={"source": p.name})
    for p in Path("./corpus").glob("*.md")
]

ix = Pipeline()
ix.add_component("split", DocumentSplitter(split_by="word", split_length=200, split_overlap=30))
ix.add_component("emb", OllamaDocumentEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434"))
ix.add_component("write", DocumentWriter(document_store=store))
ix.connect("split.documents", "emb.documents")
ix.connect("emb.documents", "write.documents")
ix.run({"split": {"documents": raw}})
print("Indexed:", store.count_documents())
```
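With `split_length=200` and `split_overlap=30`, consecutive chunks advance by 170 words. A rough back-of-envelope helper (pure Python, no Haystack needed; an approximation — Haystack's splitter handles document boundaries slightly differently) for predicting how many chunks a document will yield:

```python
import math


def estimate_chunks(n_words: int, length: int = 200, overlap: int = 30) -> int:
    """Approximate chunk count for a word-based splitter with overlap."""
    if n_words <= length:
        return 1
    stride = length - overlap  # each new chunk advances by length - overlap words
    return 1 + math.ceil((n_words - length) / stride)


print(estimate_chunks(1000))  # a 1,000-word doc -> 6 chunks
```

Useful for sizing the store before indexing a large corpus: total rows ≈ sum of `estimate_chunks` over your files.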


Step 3 — Define the retrieval tool

```python
# retrieval_tool.py
from haystack.tools import Tool
from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

from index import store  # the PgvectorDocumentStore from step 2

emb = OllamaTextEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434")
retr = PgvectorEmbeddingRetriever(document_store=store, top_k=5)


def retrieve_docs(query: str) -> list[dict]:
    """Retrieve top documents matching the query."""
    e = emb.run(text=query)["embedding"]
    docs = retr.run(query_embedding=e)["documents"]
    return [{"source": d.meta.get("source"), "snippet": d.content[:400]} for d in docs]


retrieve_tool = Tool(
    name="retrieve_docs",
    description="Search the knowledge base for documents relevant to a query.",
    parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    function=retrieve_docs,
)
```
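The `parameters` field is an OpenAI-style JSON Schema the LLM must satisfy when it emits a tool call. A minimal stdlib-only sketch of the check an agent framework performs before invoking your function (the `validate_args` helper is illustrative, not a Haystack API):

```python
import json

schema = {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}


def validate_args(args: dict, schema: dict) -> bool:
    """Minimal check: required keys present, declared primitive types respected."""
    py_types = {"string": str, "number": (int, float), "object": dict}
    for key in schema.get("required", []):
        if key not in args:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], py_types[spec["type"]]):
            return False
    return True


# The model emits tool arguments as a JSON string; parse then validate.
call = json.loads('{"query": "What does the onboarding doc say about MFA?"}')
print(validate_args(call, schema))  # True
```

If a small model emits malformed arguments, this is the layer where the call gets rejected — worth knowing when debugging "the agent never uses my tool".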

Step 4 — Wire the Agent

```python
# agent.py
from haystack.components.agents import Agent
from haystack_integrations.components.generators.ollama import OllamaChatGenerator

from retrieval_tool import retrieve_tool

llm = OllamaChatGenerator(
    model="llama3.2:3b",
    url="http://127.0.0.1:11434",
    generation_kwargs={"temperature": 0.3},
)

agent = Agent(
    chat_generator=llm,
    tools=[retrieve_tool],
    system_prompt=(
        "You are a helpful assistant. When you don't know an answer or when the user asks "
        "about company-specific topics, call retrieve_docs to find context. "
        "Always cite sources by file name in [brackets]."
    ),
    exit_conditions=["text"],  # stop when the LLM produces text (not a tool call)
    max_agent_steps=5,
)
```
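To make the `exit_conditions`/`max_agent_steps` behavior concrete, here's a toy sketch of the loop an agent runs internally — `fake_llm` is a hypothetical stand-in for Llama 3.2 (first turn requests the tool, second turn answers), not real Haystack code:

```python
def fake_llm(messages: list[dict]) -> dict:
    """Hypothetical model: ask for retrieval first, then answer from the tool result."""
    if not any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "tool_call": {"name": "retrieve_docs", "args": {"query": "MFA"}}}
    return {"role": "assistant", "text": "MFA is required for all staff [onboarding.md]."}


def run_agent(user_msg: str, tools: dict, max_steps: int = 5) -> str:
    messages = [{"role": "user", "text": user_msg}]
    for _ in range(max_steps):            # max_agent_steps bounds the loop
        reply = fake_llm(messages)
        messages.append(reply)
        if "text" in reply:               # exit condition: plain text produced
            return reply["text"]
        tc = reply["tool_call"]           # otherwise execute the requested tool
        result = tools[tc["name"]](**tc["args"])
        messages.append({"role": "tool", "text": str(result)})
    return messages[-1].get("text", "")


tools = {"retrieve_docs": lambda query: [{"source": "onboarding.md", "snippet": "MFA required"}]}
print(run_agent("What does the onboarding doc say about MFA?", tools))
```

The real `Agent` does the same dance with `ChatMessage` objects and the tool's JSON schema, but the control flow — loop, dispatch, exit on text — is the part worth internalizing.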

Step 5 — FastAPI chat endpoint

```python
# server.py
from fastapi import FastAPI
from pydantic import BaseModel

from agent import agent
from haystack.dataclasses import ChatMessage

app = FastAPI()
SESSIONS: dict[str, list[ChatMessage]] = {}


class Q(BaseModel):
    session_id: str
    message: str


@app.post("/chat")
def chat(q: Q):
    history = SESSIONS.setdefault(q.session_id, [])
    history.append(ChatMessage.from_user(q.message))
    out = agent.run(messages=history)
    reply = out["messages"][-1]
    history.append(reply)
    return {"answer": reply.text}
```
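Note that the in-memory `SESSIONS` dict grows without bound and long histories will blow past the model's context window. A minimal mitigation sketch (the `MAX_TURNS` value is an assumption — tune it to your context budget, or swap in Redis for multi-process deployments):

```python
MAX_TURNS = 20  # assumption: roughly how many messages fit your context budget


def trim_history(history: list, max_turns: int = MAX_TURNS) -> list:
    """Keep only the most recent turns so the prompt stays inside the model's context."""
    return history[-max_turns:]
```

Call it right before `agent.run(messages=trim_history(history))`; for production you'd also want summarization of the dropped turns rather than plain truncation.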

Step 6 — Try it

```bash
uvicorn server:app --port 8002 &
curl -s -XPOST http://127.0.0.1:8002/chat \
  -H 'content-type: application/json' \
  -d '{"session_id":"u1","message":"What does the onboarding doc say about MFA?"}'
```

The agent decides whether to call retrieve_docs, weaves snippets into the answer, and cites sources by filename.

Common pitfalls

  • Embedding dim mismatch. nomic-embed-text outputs 768-dim vectors; bge-m3 outputs 1024. Pick one, match the store's embedding_dimension, and reindex if you switch.
  • Context window. Llama 3.2 supports a long context, but Ollama serves a much smaller default num_ctx — chunk your RAG context; don't dump 50 docs into the prompt.
  • Tool-call format. Llama 3.x in Ollama emits an OpenAI-compatible JSON schema for tool calls; older models such as Mistral 7B sometimes drop tool calls entirely.
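The dimension-mismatch pitfall is cheap to guard against at startup. A small sketch (the `EMBED_DIMS` table is an assumption — verify each value against the model card before trusting it):

```python
# Assumed output dimensions per embedding model -- check the model cards.
EMBED_DIMS = {"nomic-embed-text": 768, "bge-m3": 1024}


def dims_match(model: str, store_dim: int) -> bool:
    """Return True when the store's embedding_dimension matches the model's output size."""
    if model not in EMBED_DIMS:
        raise ValueError(f"unknown embedding model: {model!r}")
    return EMBED_DIMS[model] == store_dim


assert dims_match("nomic-embed-text", 768), "reindex: store dimension != embedder output"
```

Running this guard before serving turns a silent retrieval-quality bug (or a pgvector insert error) into an immediate, explicit failure.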

How CallSphere does this in production

CallSphere uses a similar agentic RAG pattern across our 37 voice + chat specialists in 6 verticals — Healthcare's 14 HIPAA tools on FastAPI :8084 with OpenAI Realtime, OneRoof Property's 10 specialists, plus Salon, Dental, F&B, and Behavioral. 90+ tools, 115+ Postgres tables, all citation-aware. Flat pricing $149/$499/$1499 — 14-day trial · 22% affiliate · /pricing · /demo.

FAQ

Why Haystack over LangChain? Cleaner pipeline graph + first-class document store integrations.

Local embeddings vs OpenAI? nomic-embed-text is good enough for English/code; bge-m3 for multilingual.

Streaming? OllamaChatGenerator supports streaming since 1.7.

Multi-tenant doc stores? Use a tenant_id meta filter on the retriever.
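A sketch of what that filter looks like in Haystack 2.x's comparison-filter shape (verify the exact syntax against your Haystack version's filter docs; the "acme" tenant is illustrative):

```python
def tenant_filter(tenant_id: str) -> dict:
    """Build a Haystack 2.x comparison filter scoping retrieval to one tenant."""
    return {"field": "meta.tenant_id", "operator": "==", "value": tenant_id}


# usage sketch: retr.run(query_embedding=e, filters=tenant_filter("acme"))
print(tenant_filter("acme"))
```

Remember to stamp `tenant_id` into each Document's `meta` at indexing time, or the filter will match nothing.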

Add reranking? SentenceTransformersDiversityRanker between retriever and prompt.

