---
title: "Agentic RAG with LangGraph: Iterative Retrieval, Self-Correction, and Eval Pipelines"
description: "Beyond single-shot RAG — agentic RAG with LangGraph that re-retrieves, self-grades, and rewrites queries. With evals that catch silent retrieval drift."
canonical: https://callsphere.ai/blog/agentic-rag-langgraph-iterative-retrieval-2026
category: "Agentic AI"
tags: ["RAG", "LangGraph", "Agent Evaluation", "Production AI", "Vector Search", "LangChain"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.608Z
---

# Agentic RAG with LangGraph: Iterative Retrieval, Self-Correction, and Eval Pipelines

> Beyond single-shot RAG — agentic RAG with LangGraph that re-retrieves, self-grades, and rewrites queries. With evals that catch silent retrieval drift.

## TL;DR

Single-shot RAG (retrieve once, generate once) handles the easy 70% of questions. The remaining 30% — multi-hop, ambiguous phrasing, vocabulary mismatch, missing context on first pull — is where production RAG either gets quiet about its failures or gets agentic. **Agentic RAG** is RAG wired as a LangGraph state graph that can re-retrieve, grade its own retrievals, rewrite the query when retrieval is bad, and grade its own answer before returning. Self-RAG, CRAG (Corrective RAG), and adaptive RAG are all variations on the same loop. They cost 2–4x more in tokens and latency than single-shot RAG. They are worth that price on a measurable subset of questions and absolutely not worth it on the rest. This post is the working LangGraph implementation, the per-iteration eval pattern, and the cost analysis we use to decide when to send a question to the agentic path versus the cheap path on CallSphere [healthcare](/industries) and [finance](/industries) RAG agents.

## Why Single-Shot RAG Quietly Fails

The single-shot pattern from the [companion post on production RAG with RAGAS](/blog/production-rag-agents-langchain-ragas-eval-2026) looks like this:

`question → retrieve(k=12) → rerank → generate → answer`

It fails silently when:

- The user's phrasing does not match the source vocabulary ("can I refill my prescription early?" vs. document language about "early refill authorization policy"). Retrieval misses; the LLM either hallucinates or punts.
- The question is multi-hop. ("What is the copay for a 90-day supply of the generic equivalent?") The first retrieval pulls one of the two needed chunks; the answer is wrong-but-plausible.
- The retrieved chunks contradict each other. Single-shot has no mechanism to re-retrieve to disambiguate.
- The retrieved chunks are off-topic but the LLM tries to answer anyway.

Each of these is a *retrieval problem masquerading as a generation problem.* The agentic loop fixes them by giving the system the ability to look again.

## The Three Patterns

The literature uses three names that are mostly the same idea with different tuning:

| Pattern | Origin | Distinguishing move |
| --- | --- | --- |
| Self-RAG | Asai et al. 2023 | Generation produces reflection tokens that decide whether to retrieve |
| CRAG (Corrective RAG) | Yan et al. 2024 | A retrieval grader fires before generation; bad retrievals trigger query rewriting and web search fallback |
| Adaptive RAG | Jeong et al. 2024 | A classifier routes simple questions to single-shot, complex ones to iterative |

In production, the practical synthesis is: **use a retrieval grader, rewrite the query when retrieval is bad, optionally re-retrieve, grade the final answer, and cap iterations.** That is what LangGraph models cleanly.

## The State Graph

```mermaid
flowchart TD
  S[start] --> R1[retrieve]
  R1 --> GD{grade documents}
  GD -->|relevant| GEN[generate answer]
  GD -->|not relevant + iters < 3| RW[rewrite query]
  GD -->|not relevant + iters >= 3| FAIL[fallback: I cannot answer]
  RW --> R1
  GEN --> GA{grade answer}
  GA -->|grounded + addresses question| END[return answer]
  GA -->|hallucination + gens < 3| GEN
  GA -->|hallucination + gens >= 3| FAIL
  GA -->|not addressing question + iters < 3| RW
  GA -->|not addressing + iters >= 3| FAIL
  style GD fill:#ffd
  style GA fill:#ffd
  style FAIL fill:#fcc
  style END fill:#cfc
```

*Figure 1 — The agentic RAG loop as a LangGraph state graph. Two grader nodes feed conditional edges; the rewrite and regenerate self-loops are each bounded by an iteration cap to prevent runaway cost.*

## The LangGraph Implementation

Pinned to LangGraph 0.2.x with the modern `StateGraph` API. Same chat model and embeddings as the single-shot pattern (`gpt-4o-2024-08-06`, `text-embedding-3-large`).

```python
from typing import TypedDict, List
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

CHAT_MODEL = "gpt-4o-2024-08-06"
GRADER_MODEL = "gpt-4o-mini-2024-07-18"  # cheaper for graders
MAX_ITERATIONS = 3

class AgentState(TypedDict):
    question: str
    rewritten_question: str
    documents: List[Document]
    answer: str
    iterations: int
    trace: List[dict]

class DocGrade(BaseModel):
    relevant: bool = Field(description="Whether docs answer the question")
    reason: str

class AnswerGrade(BaseModel):
    grounded: bool = Field(description="Answer supported by docs")
    addresses_question: bool

grader_llm = ChatOpenAI(model=GRADER_MODEL, temperature=0)
gen_llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)

def retrieve(state: AgentState) -> AgentState:
    q = state.get("rewritten_question") or state["question"]
    docs = retrieve_and_rerank(q)  # same retriever as single-shot
    return {**state, "documents": docs,
            "trace": state["trace"] + [{"node": "retrieve", "k": len(docs)}]}

def grade_documents(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Grade whether the documents are relevant to the question. "
                   "Be strict: if no chunk directly addresses the question, mark not relevant."),
        ("human", "Q: {q}\n\nDocs:\n{docs}"),
    ])
    chain = prompt | grader_llm.with_structured_output(DocGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
    })
    return {**state, "trace": state["trace"] + [
        {"node": "grade_documents", "relevant": grade.relevant, "reason": grade.reason}
    ]}

def rewrite_query(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Rewrite the question to better match formal documentation vocabulary. "
                   "Expand abbreviations, use canonical terminology, keep meaning identical."),
        ("human", "Original: {q}\nPrevious docs (irrelevant):\n{docs}"),
    ])
    chain = prompt | gen_llm
    rewritten = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content[:200] for d in state["documents"]),
    }).content
    return {**state,
            "rewritten_question": rewritten,
            "iterations": state["iterations"] + 1,
            "trace": state["trace"] + [{"node": "rewrite", "new_q": rewritten}]}

def generate(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer ONLY using the context. Cite source_id in brackets. "
                   "If context is insufficient, say so."),
        ("human", "Q: {q}\n\nContext:\n{ctx}"),
    ])
    chain = prompt | gen_llm
    answer = chain.invoke({
        "q": state["question"],
        "ctx": "\n\n".join(f"[{d.metadata['source_id']}] {d.page_content}"
                              for d in state["documents"]),
    }).content
    return {**state, "answer": answer,
            "trace": state["trace"] + [{"node": "generate"}]}

def grade_answer(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Two checks: (1) is the answer supported by the docs (grounded)? "
                   "(2) does it actually address the original question?"),
        ("human", "Q: {q}\nDocs:\n{docs}\nAnswer: {a}"),
    ])
    chain = prompt | grader_llm.with_structured_output(AnswerGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
        "a": state["answer"],
    })
    return {**state, "trace": state["trace"] + [
        {"node": "grade_answer",
         "grounded": grade.grounded,
         "addresses": grade.addresses_question}
    ]}

# Conditional edges
def docs_route(state: AgentState):
    last = state["trace"][-1]
    if last["relevant"]:
        return "generate"
    if state["iterations"] >= MAX_ITERATIONS:
        return "fallback"
    return "rewrite"

def answer_route(state: AgentState):
    last = state["trace"][-1]
    if last["grounded"] and last["addresses"]:
        return "end"
    if not last["grounded"]:
        # Bound the regenerate self-loop too: `iterations` only increments on
        # query rewrites, so count generate attempts in the trace instead.
        gens = sum(1 for t in state["trace"] if t["node"] == "generate")
        return "generate" if gens < MAX_ITERATIONS else "fallback"  # retry with same docs
    if state["iterations"] >= MAX_ITERATIONS:
        return "fallback"
    return "rewrite"

def fallback(state: AgentState) -> AgentState:
    return {**state, "answer": "I do not have sufficient information to answer that."}

g = StateGraph(AgentState)
g.add_node("retrieve", retrieve)
g.add_node("grade_documents", grade_documents)
g.add_node("rewrite", rewrite_query)
g.add_node("generate", generate)
g.add_node("grade_answer", grade_answer)
g.add_node("fallback", fallback)

g.add_edge(START, "retrieve")
g.add_edge("retrieve", "grade_documents")
g.add_conditional_edges("grade_documents", docs_route,
                        {"generate": "generate", "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("rewrite", "retrieve")
g.add_edge("generate", "grade_answer")
g.add_conditional_edges("grade_answer", answer_route,
                        {"end": END, "generate": "generate",
                         "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("fallback", END)

app = g.compile()
```

A few production details:

- **Use the cheap model for graders.** `gpt-4o-mini-2024-07-18` runs the two grader nodes for roughly 1/15th the cost of `gpt-4o`, and agrees with the larger model on more than 92% of our internal grader-eval set.
- **Cap `MAX_ITERATIONS`.** Three is the sweet spot; four does not noticeably help and five regularly causes runaway cost on adversarial inputs.
- **Trace every node.** The `trace` field is your post-hoc debug primitive. Without it, agentic loops are unobservable.
- **Structured output for graders is non-optional.** Free-form grader output is parser-fragile; `with_structured_output` is the only reliable pattern.
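The `trace` field pays off fastest with a small helper that renders a run as a readable transcript. A minimal sketch (the helper name is ours, not part of the graph above; it assumes the trace-entry shapes emitted by the nodes):

```python
def explain_trace(trace: list[dict]) -> str:
    """Render a per-node trace as a readable transcript for post-hoc debugging."""
    lines = []
    for step in trace:
        # Everything except the node name is node-specific detail.
        detail = {k: v for k, v in step.items() if k != "node"}
        lines.append(f"{step['node']}: {detail}" if detail else step["node"])
    return "\n".join(lines)
```

One line per node execution makes the rewrite loop legible at a glance: you can see the irrelevant-retrieval verdict, the rewritten query, and the retry order without replaying the run.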

## Evaluating Agentic RAG

The single-shot RAGAS metrics still apply, but they are no longer enough. You also need to measure *the loop itself*. Four additional things matter:

| Metric | What it answers | Why it matters |
| --- | --- | --- |
| Mean iterations per question | How often does the loop re-try? | If most questions hit 3 iterations, your retriever or chunking is broken |
| Per-iteration retrieval quality | Does context_precision improve from iter 1 → iter 2? | If not, query rewriting is not helping |
| Tool calls per question | Total node executions (retrieve + grade_docs + rewrite + generate + grade_answer) | Direct cost driver |
| Iteration delta on terminal answer quality | Does faithfulness on questions that took 2+ iterations beat single-shot on the same questions? | The whole point — does the loop earn its keep? |

The eval pattern, run on the same held-out QA set as single-shot:

```python
def run_with_metrics(question: str, ground_truth: str):
    final = app.invoke({
        "question": question, "rewritten_question": "",
        "documents": [], "answer": "", "iterations": 0, "trace": [],
    })

    # Per-iteration retrieval quality
    retrieval_iters = [t for t in final["trace"] if t["node"] == "retrieve"]
    grade_iters = [t for t in final["trace"] if t["node"] == "grade_documents"]

    return {
        "question": question,
        "answer": final["answer"],
        "iterations": final["iterations"],
        "tool_calls": len(final["trace"]),
        "retrieval_attempts": len(retrieval_iters),
        "retrieval_grade_trace": [g["relevant"] for g in grade_iters],
        "ground_truth": ground_truth,
    }

# Then run RAGAS faithfulness + answer_relevancy on the terminal answers
# alongside the loop-specific metrics above.
```
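Aggregating those per-question records into the table's loop-level metrics is a few more lines. A sketch over the `run_with_metrics` output shape (the aggregate names are ours; the cap default assumes `MAX_ITERATIONS = 3`):

```python
def loop_metrics(results: list[dict], max_iterations: int = 3) -> dict:
    """Loop-level aggregates over a list of run_with_metrics() records."""
    iters = [r["iterations"] for r in results]
    # Questions whose first retrieval was graded irrelevant...
    missed_first = [r for r in results
                    if r["retrieval_grade_trace"] and not r["retrieval_grade_trace"][0]]
    # ...and where a later (rewritten) retrieval was graded relevant.
    recovered = [r for r in missed_first if any(r["retrieval_grade_trace"][1:])]
    return {
        "mean_iterations": sum(iters) / len(iters),
        "capped_fraction": sum(i >= max_iterations for i in iters) / len(iters),
        "rewrite_recovery_rate": (len(recovered) / len(missed_first)
                                  if missed_first else None),
    }
```

`capped_fraction` is the early-warning signal: if it climbs, your retriever or chunking regressed and the loop is burning tokens to compensate. `rewrite_recovery_rate` answers the table's second row directly: it is the fraction of bad first retrievals that query rewriting actually fixed.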

## Cost Analysis: When the Loop Is Worth It

Real numbers from one of our production [healthcare](/industries) FAQ agents, 1,000 questions, May 2026:

| Mode | Avg LLM calls/q | Avg cost/q | Avg latency p50 | Faithfulness | Recall on hard subset |
| --- | --- | --- | --- | --- | --- |
| Single-shot RAG | 1.0 | $0.0021 | 1.4s | 0.93 | 0.71 |
| Agentic RAG (always on) | 3.6 | $0.0078 | 4.9s | 0.95 | 0.88 |
| Adaptive routing | 1.7 | $0.0041 | 2.3s | 0.95 | 0.87 |

Agentic-always is 3.7x the cost and 3.5x the latency of single-shot for a 2-point bump in faithfulness and a 17-point bump in recall on hard questions. **On easy questions, the loop adds latency and cost for zero quality gain** — graders confirm the first retrieval was fine and the loop exits in one iteration anyway.

The economic answer is **adaptive routing**: a small classifier (or a cheap LLM call) decides whether the question is "easy" (likely to succeed single-shot) or "hard" (likely to need the loop). Easy goes to single-shot; hard goes to agentic. Adaptive routing on this dataset matches agentic-always quality at roughly half the cost.

A simple-but-effective router: run the question through single-shot, run `grade_answer` on the result, and if it fails, escalate to the agentic loop. That gives you single-shot cost on easy questions and one extra grader call as the routing tax.

## Honest Tradeoffs

- **Latency tail is the killer.** Average latency is misleading; p99 on agentic RAG is often 3x p50 because of the worst-case 3-iteration path. For voice agents (where the [CallSphere voice product](/products) operates at 200ms response targets), agentic RAG is incompatible with the live turn unless wrapped in a "hold on" filler phrase. Use it for chat, async, and slow-path features.
- **Graders are LLM-as-judge in production.** Calibrate them quarterly against humans. We caught a grader that started returning `relevant=true` on 100% of inputs after a model alias drift; the only reason we noticed was the iteration-count metric flatlining at 1.0.
- **Loop cost can spiral with bad chunking.** If your retriever is bad, the loop will grind through 3 iterations on most questions and 4x your bill. Fix retrieval first, then add the loop.
- **The agentic loop is harder to evaluate.** Per-iteration metrics are not standard in RAGAS; you build them yourself. Budget engineering time for it.

The right model is: *agentic RAG is a tool you reach for after single-shot RAG with proper evals exists and you have measured which subset of your traffic needs it.* Building agentic RAG before you have RAGAS in CI is putting the engine in before the chassis.

## Frequently Asked Questions

### Should I always use agentic RAG?

No. Use single-shot for easy questions and adaptive routing to send only hard questions to the loop. Always-on agentic RAG is paying full price for 70% of queries that did not need the upgrade.

### How is this different from a ReAct agent doing search?

ReAct treats search as one tool among many and gives the model open-ended control over when to call it. Agentic RAG is a structured loop with explicit grader nodes and bounded iterations. ReAct is more flexible but harder to evaluate; agentic RAG is more constrained but produces deterministic eval traces.

### What about caching?

Cache embeddings for rewritten queries (saves the embed call on retry) and cache grader outputs keyed on `(question, doc_ids)`. On our traffic these two caches together cut iterative-path cost by ~22%.
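The grader cache in particular is just a map keyed on the question plus the sorted retrieved doc IDs. A minimal in-process sketch (helper names are ours; production would back this with Redis or similar and a TTL so grader-prompt changes invalidate it):

```python
import hashlib

_grader_cache: dict[str, bool] = {}

def cached_doc_grade(question: str, doc_ids: tuple[str, ...], grade_fn) -> bool:
    """Memoize grade_documents verdicts keyed on (question, sorted doc IDs)."""
    key = hashlib.sha256(
        (question + "|" + "|".join(sorted(doc_ids))).encode()
    ).hexdigest()
    if key not in _grader_cache:
        _grader_cache[key] = grade_fn(question, doc_ids)  # LLM call only on a miss
    return _grader_cache[key]
```

Sorting the doc IDs makes the key order-insensitive, so a re-ranked retrieval of the same chunks still hits the cache.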

### Does this work with Anthropic models?

Yes. Swap `gpt-4o-2024-08-06` for `claude-sonnet-4-5-20250929` and `gpt-4o-mini-2024-07-18` for `claude-haiku-4-5-20250515`. The graphs are identical; LangGraph is provider-neutral.

### How do I debug a loop that won't terminate?

The `trace` field plus a hard `MAX_ITERATIONS` cap. Read the [companion piece on the trace-to-fix workflow](/blog/trace-to-production-fix-agent-observability-workflow) for the muscle memory. Most non-terminating loops we have seen were graders giving inconsistent verdicts on the same docs — fix the grader prompt or pin a stronger judge model.

---

Source: https://callsphere.ai/blog/agentic-rag-langgraph-iterative-retrieval-2026
