---
title: "Build a Chat Agent with Haystack RAG + Open LLM (Llama 3.2, 2026)"
description: "Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store."
canonical: https://callsphere.ai/blog/vw4h-build-chat-agent-haystack-rag-open-llm
category: "AI Engineering"
tags: ["Haystack", "RAG", "Llama 3.2", "Open LLM", "Tutorial"]
author: "CallSphere Team"
published: 2026-05-07T00:00:00.000Z
updated: 2026-05-07T16:13:47.554Z
---

# Build a Chat Agent with Haystack RAG + Open LLM (Llama 3.2, 2026)

> Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store.

> **TL;DR** — Haystack 2.7 introduced a stable `Agent` component with native tool-calling and exit conditions. Pair it with Ollama's Llama 3.2 3B and `pgvector` and you have a fully local, citation-aware, agentic RAG service in ~150 lines.

## What you'll build

A FastAPI `/chat` endpoint that retrieves from a Postgres+pgvector store, calls Llama 3.2 via Ollama, and returns answers with document citations. The agent decides when to call retrieval and when to answer from its own weights, which makes this agentic rather than naive RAG.

## Prerequisites

1. Python 3.11, `pip install "haystack-ai>=2.7" ollama-haystack pgvector-haystack fastapi uvicorn psycopg2-binary sentence-transformers`.
2. Postgres 16 with pgvector extension.
3. Ollama running: `ollama pull llama3.2:3b` and `ollama pull nomic-embed-text` (a quick check script follows this list).
4. A folder of `.txt`/`.md` documents to index.
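
Before going further, you can confirm Ollama actually has both models with a stdlib-only probe. This hits Ollama's standard `/api/tags` endpoint; the `:latest` suffix is how Ollama names an un-versioned pull:

```python
# check_ollama.py: verify both models are pulled via Ollama's /api/tags endpoint
import json
from urllib.request import urlopen

tags = json.load(urlopen("http://127.0.0.1:11434/api/tags"))
names = {m["name"] for m in tags.get("models", [])}
for want in ("llama3.2:3b", "nomic-embed-text:latest"):
    print(want, "OK" if want in names else "MISSING")
```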

## Architecture

```mermaid
flowchart LR
  Q[User] --> AGT[Haystack Agent]
  AGT --> LLM[OllamaChatGenerator llama3.2:3b]
  AGT --> RT[retrieve_docs tool]
  RT --> EMB[OllamaEmbedder nomic-embed-text]
  RT --> PG[(pgvector store)]
  AGT -->|cited answer| Q
```

## Step 1 — Spin up Postgres + pgvector

```bash
docker run -d --name pgv -e POSTGRES_PASSWORD=pw -p 5432:5432 \
  ankane/pgvector:latest
psql postgresql://postgres:pw@127.0.0.1/postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
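
If you'd rather sanity-check from Python, a minimal probe (assuming the same `postgres:pw` credentials as above):

```python
# check_pgvector.py: confirm the vector extension is installed
import psycopg2

conn = psycopg2.connect("postgresql://postgres:pw@127.0.0.1:5432/postgres")
with conn.cursor() as cur:
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    row = cur.fetchone()
    print("pgvector version:", row[0] if row else "NOT INSTALLED")
conn.close()
```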

## Step 2 — Index documents

```python
# index.py

from pathlib import Path
from haystack import Document, Pipeline
from haystack.utils import Secret
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

store = PgvectorDocumentStore(
    table_name="docs", embedding_dimension=768, vector_function="cosine_similarity",
    connection_string=Secret.from_token("postgresql://postgres:pw@127.0.0.1/postgres"),
    recreate_table=True)

raw = [Document(content=p.read_text(), meta={"source": p.name})
       for p in Path("./corpus").glob("*.md")]

ix = Pipeline()
ix.add_component("split", DocumentSplitter(split_by="word", split_length=200, split_overlap=30))
ix.add_component("emb", OllamaDocumentEmbedder(model="nomic-embed-text",
                  url="[http://127.0.0.1:11434](http://127.0.0.1:11434)"))
ix.add_component("write", DocumentWriter(document_store=store))
ix.connect("split.documents", "emb.documents")
ix.connect("emb.documents", "write.documents")
ix.run({"split": {"documents": raw}})
print("Indexed:", store.count_documents())
```
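
`recreate_table=True` wipes the table on every run, which is fine for a first pass. For incremental re-indexing, a sketch using Haystack's duplicate policy (document IDs are content hashes by default, so re-split chunks dedupe cleanly):

```python
# Incremental variant: keep the existing table and overwrite duplicate chunks.
# Pair this with recreate_table=False on the store.
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter

writer = DocumentWriter(document_store=store, policy=DuplicatePolicy.OVERWRITE)
```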

## Step 3 — Define the retrieval tool

```python
# retrieval_tool.py

from haystack.tools import Tool
from haystack.utils import Secret
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Reconnect to the table created by index.py; don't recreate it here.
store = PgvectorDocumentStore(
    table_name="docs", embedding_dimension=768, vector_function="cosine_similarity",
    connection_string=Secret.from_token("postgresql://postgres:pw@127.0.0.1/postgres"),
    recreate_table=False)

emb = OllamaTextEmbedder(model="nomic-embed-text", url="http://127.0.0.1:11434")
retr = PgvectorEmbeddingRetriever(document_store=store, top_k=5)

def retrieve_docs(query: str) -> list[dict]:
    """Retrieve top documents matching the query."""
    e = emb.run(text=query)["embedding"]
    docs = retr.run(query_embedding=e)["documents"]
    return [{"source": d.meta.get("source"), "snippet": d.content[:400]} for d in docs]

retrieve_tool = Tool(
    name="retrieve_docs",
    description="Search the knowledge base for documents relevant to a query.",
    parameters={"type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]},
    function=retrieve_docs)
```
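
Worth a quick smoke test before wiring the agent (the query string is just an example):

```python
# smoke_test_tool.py: call the tool function directly, outside the agent loop
from retrieval_tool import retrieve_docs

for hit in retrieve_docs("MFA policy"):
    print(hit["source"], "->", hit["snippet"][:80])
```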

## Step 4 — Wire the Agent

```python
# agent.py

from haystack.components.agents import Agent
from haystack_integrations.components.generators.ollama import OllamaChatGenerator
from retrieval_tool import retrieve_tool

llm = OllamaChatGenerator(model="llama3.2:3b", url="http://127.0.0.1:11434",
                          generation_kwargs={"temperature": 0.3})

agent = Agent(
  chat_generator=llm,
  tools=[retrieve_tool],
  system_prompt=(
    "You are a helpful assistant. When you don't know an answer or when the user asks "
    "about company-specific topics, call retrieve_docs to find context. "
    "Always cite sources by file name in [brackets]."),
  exit_conditions=["text"],   # stop when LLM produces text (not a tool call)
  max_agent_steps=5)
```
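
A one-shot run checks the loop end-to-end before adding HTTP (throwaway script; the question is illustrative):

```python
# smoke_test_agent.py: single agent turn, no server
from haystack.dataclasses import ChatMessage
from agent import agent

out = agent.run(messages=[ChatMessage.from_user("What does the onboarding doc say about MFA?")])
print(out["messages"][-1].text)
```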

## Step 5 — FastAPI chat endpoint

```python
# server.py

from fastapi import FastAPI
from pydantic import BaseModel
from agent import agent
from haystack.dataclasses import ChatMessage
app = FastAPI()
SESSIONS: dict[str, list] = {}  # in-memory, per-process history; fine for a demo, not production

class Q(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
def chat(q: Q):
    history = SESSIONS.setdefault(q.session_id, [])
    history.append(ChatMessage.from_user(q.message))
    out = agent.run(messages=history)
    reply = out["messages"][-1]
    history.append(reply)
    return {"answer": reply.text}
```
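
As written, session history grows without bound. A minimal cap before each run keeps the 3B model's window from flooding (the `MAX_TURNS` value is arbitrary; tune it to your context size):

```python
# Keep only the most recent messages per session.
MAX_TURNS = 20  # arbitrary; tune to your num_ctx

def trimmed(history: list) -> list:
    return history[-MAX_TURNS:]

# in chat(): out = agent.run(messages=trimmed(history))
```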

## Step 6 — Try it

```bash
uvicorn server:app --port 8002 &
curl -s -XPOST http://127.0.0.1:8002/chat \
  -H 'content-type: application/json' \
  -d '{"session_id":"u1","message":"What does the onboarding doc say about MFA?"}'
```

The agent decides whether to call `retrieve_docs`, weaves snippets into the answer, and cites sources by filename.

## Common pitfalls

- **Embedding dim mismatch.** `nomic-embed-text` is 768; bge-m3 is 1024. Pick one and reindex if you switch.
- **Llama 3.2 3B context.** The model advertises a 128k context, but Ollama serves a much smaller default window; raise `num_ctx` if you need more (sketch after this list), and still chunk RAG context rather than dumping 50 docs.
- **Tool-call format.** Llama 3.x in Ollama uses an OpenAI-compatible JSON schema; older Mistral 7B sometimes drops tool calls.
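
Raising the window at generation time looks like this; `num_ctx` is a standard Ollama option passed through `generation_kwargs`, and the 8192 value is an example, not a recommendation:

```python
# Pass num_ctx through to Ollama alongside temperature.
llm = OllamaChatGenerator(model="llama3.2:3b", url="http://127.0.0.1:11434",
                          generation_kwargs={"temperature": 0.3, "num_ctx": 8192})
```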

## How CallSphere does this in production

CallSphere runs a similar agentic RAG pattern in production across our 37 voice and chat specialists in 6 verticals: Healthcare's 14 HIPAA tools on FastAPI :8084 with OpenAI Realtime, OneRoof Property's 10 specialists, plus Salon, Dental, F&B, and Behavioral. That adds up to 90+ tools over 115+ Postgres tables, all citation-aware. Flat pricing at $149/$499/$1499: [14-day trial](/trial) · [22% affiliate](/affiliate) · [/pricing](/pricing) · [/demo](/demo).

## FAQ

**Why Haystack over LangChain?** Cleaner pipeline graph + first-class document store integrations.

**Local embeddings vs OpenAI?** `nomic-embed-text` is good enough for English/code; `bge-m3` for multilingual.

**Streaming?** Yes; `OllamaChatGenerator` has supported streaming via a `streaming_callback` since ollama-haystack 1.7.

**Multi-tenant doc stores?** Filter on a `tenant_id` meta field at retrieval time; a sketch follows.
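
A minimal sketch in Haystack 2.x filter syntax (the `tenant_id` meta field is whatever you attached to documents at indexing time):

```python
# Per-request tenant isolation: filter on document metadata at query time.
docs = retr.run(
    query_embedding=e,
    filters={"field": "meta.tenant_id", "operator": "==", "value": "acme"},
)["documents"]
```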

**Add reranking?** `SentenceTransformersDiversityRanker` between retriever and prompt.

## Sources

- [Haystack on GitHub](https://github.com/deepset-ai/haystack)
- [Haystack Ollama integration](https://haystack.deepset.ai/integrations/ollama)
- [Agentic RAG with Llama 3.2 3B](https://haystack.deepset.ai/cookbook/llama32_agentic_rag)
- [OllamaChatGenerator docs](https://docs.haystack.deepset.ai/docs/ollamachatgenerator)

