
Build a Chat Agent with Haystack RAG + Open LLM (Llama 3.2, 2026)

Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store.

TL;DR — Haystack 2.7 introduced a stable Agent component with native tool-calling and exit conditions. Pair it with Ollama's Llama 3.2 3B and pgvector and you have a fully local, citing, agentic RAG service in ~150 lines.

What you'll build

A FastAPI /chat endpoint that retrieves from a Postgres+pgvector store, calls Llama 3.2 via Ollama, and returns answers with document citations. The agent decides when to call retrieval (vs answer from memory) — agentic, not naive RAG.

Prerequisites

  1. Python 3.11, pip install "haystack-ai>=2.7" ollama-haystack pgvector-haystack fastapi uvicorn psycopg2-binary sentence-transformers.
  2. Postgres 16 with pgvector extension.
  3. Ollama running: ollama pull llama3.2:3b and ollama pull nomic-embed-text.
  4. A folder of .txt/.md documents to index.

Architecture

```mermaid
flowchart LR
  Q[User] --> AGT[Haystack Agent]
  AGT --> LLM[OllamaChatGenerator llama3.2:3b]
  AGT --> RT[retrieve_docs tool]
  RT --> EMB[OllamaEmbedder nomic-embed-text]
  RT --> PG[(pgvector store)]
  AGT -->|cited answer| Q
```

Step 1 — Spin up Postgres + pgvector

```bash
docker run -d --name pgv -e POSTGRES_PASSWORD=pw -p 5432:5432 \
  ankane/pgvector:latest
psql postgresql://postgres:[email protected]/postgres \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

Step 2 — Index documents

```python
# index.py
from pathlib import Path

from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

store = PgvectorDocumentStore(
    table_name="docs",
    embedding_dimension=768,  # must match nomic-embed-text's output size
    vector_function="cosine_similarity",
    connection_string="postgresql://postgres:[email protected]/postgres",
    recreate_table=True,
)

raw = [
    Document(content=p.read_text(), meta={"source": p.name})
    for p in Path("./corpus").glob("*.md")
]

ix = Pipeline()
ix.add_component("split", DocumentSplitter(split_by="word", split_length=200, split_overlap=30))
ix.add_component("emb", OllamaDocumentEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434"))
ix.add_component("write", DocumentWriter(document_store=store))
ix.connect("split.documents", "emb.documents")
ix.connect("emb.documents", "write.documents")
ix.run({"split": {"documents": raw}})
print("Indexed:", store.count_documents())
```
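With `split_length=200` and `split_overlap=30`, consecutive chunks advance by 170 words. A rough back-of-envelope helper (pure Python, no Haystack needed; an approximation — Haystack's splitter handles document boundaries slightly differently) for predicting how many chunks a document will yield:

```python
import math


def estimate_chunks(n_words: int, length: int = 200, overlap: int = 30) -> int:
    """Approximate chunk count for a word-based splitter with overlap."""
    if n_words <= length:
        return 1
    stride = length - overlap  # each new chunk advances by length - overlap words
    return 1 + math.ceil((n_words - length) / stride)


print(estimate_chunks(1000))  # a 1,000-word doc -> 6 chunks
```

Useful for sizing the store before indexing a large corpus: total rows ≈ sum of `estimate_chunks` over your files.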


Step 3 — Define the retrieval tool

```python
# retrieval_tool.py
from haystack.tools import Tool
from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

from index import store  # the PgvectorDocumentStore from step 2

emb = OllamaTextEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434")
retr = PgvectorEmbeddingRetriever(document_store=store, top_k=5)


def retrieve_docs(query: str) -> list[dict]:
    """Retrieve top documents matching the query."""
    e = emb.run(text=query)["embedding"]
    docs = retr.run(query_embedding=e)["documents"]
    return [{"source": d.meta.get("source"), "snippet": d.content[:400]} for d in docs]


retrieve_tool = Tool(
    name="retrieve_docs",
    description="Search the knowledge base for documents relevant to a query.",
    parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    function=retrieve_docs,
)
```
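The `parameters` field is an OpenAI-style JSON Schema the LLM must satisfy when it emits a tool call. A minimal stdlib-only sketch of the check an agent framework performs before invoking your function (the `validate_args` helper is illustrative, not a Haystack API):

```python
import json

schema = {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}


def validate_args(args: dict, schema: dict) -> bool:
    """Minimal check: required keys present, declared primitive types respected."""
    py_types = {"string": str, "number": (int, float), "object": dict}
    for key in schema.get("required", []):
        if key not in args:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], py_types[spec["type"]]):
            return False
    return True


# The model emits tool arguments as a JSON string; parse then validate.
call = json.loads('{"query": "What does the onboarding doc say about MFA?"}')
print(validate_args(call, schema))  # True
```

If a small model emits malformed arguments, this is the layer where the call gets rejected — worth knowing when debugging "the agent never uses my tool".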

Step 4 — Wire the Agent

```python
# agent.py
from haystack.components.agents import Agent
from haystack_integrations.components.generators.ollama import OllamaChatGenerator

from retrieval_tool import retrieve_tool

llm = OllamaChatGenerator(
    model="llama3.2:3b",
    url="http://127.0.0.1:11434",
    generation_kwargs={"temperature": 0.3},
)

agent = Agent(
    chat_generator=llm,
    tools=[retrieve_tool],
    system_prompt=(
        "You are a helpful assistant. When you don't know an answer or when the user asks "
        "about company-specific topics, call retrieve_docs to find context. "
        "Always cite sources by file name in [brackets]."
    ),
    exit_conditions=["text"],  # stop when the LLM produces text (not a tool call)
    max_agent_steps=5,
)
```
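To make the `exit_conditions`/`max_agent_steps` behavior concrete, here's a toy sketch of the loop an agent runs internally — `fake_llm` is a hypothetical stand-in for Llama 3.2 (first turn requests the tool, second turn answers), not real Haystack code:

```python
def fake_llm(messages: list[dict]) -> dict:
    """Hypothetical model: ask for retrieval first, then answer from the tool result."""
    if not any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "tool_call": {"name": "retrieve_docs", "args": {"query": "MFA"}}}
    return {"role": "assistant", "text": "MFA is required for all staff [onboarding.md]."}


def run_agent(user_msg: str, tools: dict, max_steps: int = 5) -> str:
    messages = [{"role": "user", "text": user_msg}]
    for _ in range(max_steps):            # max_agent_steps bounds the loop
        reply = fake_llm(messages)
        messages.append(reply)
        if "text" in reply:               # exit condition: plain text produced
            return reply["text"]
        tc = reply["tool_call"]           # otherwise execute the requested tool
        result = tools[tc["name"]](**tc["args"])
        messages.append({"role": "tool", "text": str(result)})
    return messages[-1].get("text", "")


tools = {"retrieve_docs": lambda query: [{"source": "onboarding.md", "snippet": "MFA required"}]}
print(run_agent("What does the onboarding doc say about MFA?", tools))
```

The real `Agent` does the same dance with `ChatMessage` objects and the tool's JSON schema, but the control flow — loop, dispatch, exit on text — is the part worth internalizing.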

Step 5 — FastAPI chat endpoint

```python
# server.py
from fastapi import FastAPI
from pydantic import BaseModel

from agent import agent
from haystack.dataclasses import ChatMessage

app = FastAPI()
SESSIONS: dict[str, list[ChatMessage]] = {}


class Q(BaseModel):
    session_id: str
    message: str


@app.post("/chat")
def chat(q: Q):
    history = SESSIONS.setdefault(q.session_id, [])
    history.append(ChatMessage.from_user(q.message))
    out = agent.run(messages=history)
    reply = out["messages"][-1]
    history.append(reply)
    return {"answer": reply.text}
```
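Note that the in-memory `SESSIONS` dict grows without bound and long histories will blow past the model's context window. A minimal mitigation sketch (the `MAX_TURNS` value is an assumption — tune it to your context budget, or swap in Redis for multi-process deployments):

```python
MAX_TURNS = 20  # assumption: roughly how many messages fit your context budget


def trim_history(history: list, max_turns: int = MAX_TURNS) -> list:
    """Keep only the most recent turns so the prompt stays inside the model's context."""
    return history[-max_turns:]
```

Call it right before `agent.run(messages=trim_history(history))`; for production you'd also want summarization of the dropped turns rather than plain truncation.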

Step 6 — Try it

```bash
uvicorn server:app --port 8002 &
curl -s -XPOST http://127.0.0.1:8002/chat \
  -H 'content-type: application/json' \
  -d '{"session_id":"u1","message":"What does the onboarding doc say about MFA?"}'
```

The agent decides whether to call retrieve_docs, weaves snippets into the answer, and cites sources by filename.

Common pitfalls

  • Embedding dim mismatch. nomic-embed-text outputs 768-dim vectors; bge-m3 outputs 1024. Pick one, match the store's embedding_dimension, and reindex if you switch.
  • Context window. Llama 3.2 supports a long context, but Ollama serves a much smaller default num_ctx — chunk your RAG context; don't dump 50 docs into the prompt.
  • Tool-call format. Llama 3.x in Ollama emits an OpenAI-compatible JSON schema for tool calls; older models such as Mistral 7B sometimes drop tool calls entirely.
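The dimension-mismatch pitfall is cheap to guard against at startup. A small sketch (the `EMBED_DIMS` table is an assumption — verify each value against the model card before trusting it):

```python
# Assumed output dimensions per embedding model -- check the model cards.
EMBED_DIMS = {"nomic-embed-text": 768, "bge-m3": 1024}


def dims_match(model: str, store_dim: int) -> bool:
    """Return True when the store's embedding_dimension matches the model's output size."""
    if model not in EMBED_DIMS:
        raise ValueError(f"unknown embedding model: {model!r}")
    return EMBED_DIMS[model] == store_dim


assert dims_match("nomic-embed-text", 768), "reindex: store dimension != embedder output"
```

Running this guard before serving turns a silent retrieval-quality bug (or a pgvector insert error) into an immediate, explicit failure.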

How CallSphere does this in production

CallSphere uses a similar agentic RAG pattern across our 37 voice + chat specialists in 6 verticals — Healthcare's 14 HIPAA tools on FastAPI :8084 with OpenAI Realtime, OneRoof Property's 10 specialists, plus Salon, Dental, F&B, and Behavioral. 90+ tools, 115+ Postgres tables, all citation-aware. Flat pricing $149/$499/$1499 — 14-day trial · 22% affiliate · /pricing · /demo.

FAQ

Why Haystack over LangChain? Cleaner pipeline graph + first-class document store integrations.

Local embeddings vs OpenAI? nomic-embed-text is good enough for English/code; bge-m3 for multilingual.

Streaming? OllamaChatGenerator supports streaming since 1.7.

Multi-tenant doc stores? Use a tenant_id meta filter on the retriever.
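A sketch of what that filter looks like in Haystack 2.x's comparison-filter shape (verify the exact syntax against your Haystack version's filter docs; the "acme" tenant is illustrative):

```python
def tenant_filter(tenant_id: str) -> dict:
    """Build a Haystack 2.x comparison filter scoping retrieval to one tenant."""
    return {"field": "meta.tenant_id", "operator": "==", "value": tenant_id}


# usage sketch: retr.run(query_embedding=e, filters=tenant_filter("acme"))
print(tenant_filter("acme"))
```

Remember to stamp `tenant_id` into each Document's `meta` at indexing time, or the filter will match nothing.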

Add reranking? SentenceTransformersDiversityRanker between retriever and prompt.

