Agentic RAG with LangGraph: Iterative Retrieval, Self-Correction, and Eval Pipelines

TL;DR

Single-shot RAG (retrieve once, generate once) handles the easy 70% of questions. The remaining 30% — multi-hop, ambiguous phrasing, vocabulary mismatch, missing context on first pull — is where production RAG either gets quiet about its failures or gets agentic. Agentic RAG is RAG wired as a LangGraph state graph that can re-retrieve, grade its own retrievals, rewrite the query when retrieval is bad, and grade its own answer before returning. Self-RAG, CRAG (Corrective RAG), and adaptive RAG are all variations on the same loop. They cost 2–4x more in tokens and latency than single-shot RAG. They are worth that price on a measurable subset of questions and absolutely not worth it on the rest. This post is the working LangGraph implementation, the per-iteration eval pattern, and the cost analysis we use to decide when to send a question to the agentic path versus the cheap path on CallSphere healthcare and finance RAG agents.

Why Single-Shot RAG Quietly Fails

The single-shot pattern from the companion post on production RAG with RAGAS looks like this:

question → retrieve(k=12) → rerank → generate → answer

It fails silently when:

The user's phrasing does not match the source vocabulary ("can I refill my prescription early?" vs. document language about "early refill authorization policy"). Retrieval misses; the LLM either hallucinates or punts.
The question is multi-hop. ("What is the copay for a 90-day supply of the generic equivalent?") The first retrieval pulls one of the two needed chunks; the answer is wrong-but-plausible.
The retrieved chunks contradict each other. Single-shot has no mechanism to re-retrieve to disambiguate.
The retrieved chunks are off-topic but the LLM tries to answer anyway.

Each of these is a retrieval problem masquerading as a generation problem. The agentic loop fixes them by giving the system the ability to look again.

The Three Patterns

The literature uses three names that are mostly the same idea with different tuning:

Pattern	Origin	Distinguishing move
Self-RAG	Asai et al. 2023	Generation produces reflection tokens that decide whether to retrieve
CRAG (Corrective RAG)	Yan et al. 2024	A retrieval grader fires before generation; bad retrievals trigger query rewriting and web search fallback
Adaptive RAG	Jeong et al. 2024	A classifier routes simple questions to single-shot, complex ones to iterative

In production, the practical synthesis is: use a retrieval grader, rewrite the query when retrieval is bad, optionally re-retrieve, grade the final answer, and cap iterations. That is what LangGraph models cleanly.
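
Before wiring this into a graph, it can help to see the same control flow as a plain function. The following is a minimal, framework-free sketch, not the LangGraph implementation: `corrective_loop`, its callable parameters, and the toy index are all hypothetical names introduced for illustration. It also simplifies one thing, in that any failed answer grade triggers a rewrite rather than a regeneration over the same documents.

```python
from typing import Callable, List

FALLBACK = "I do not have sufficient information to answer that."

def corrective_loop(
    question: str,
    retrieve: Callable[[str], List[str]],
    docs_relevant: Callable[[str, List[str]], bool],
    rewrite: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    answer_ok: Callable[[str, List[str], str], bool],
    max_iterations: int = 3,
) -> str:
    """Retrieve, grade, generate, grade again; rewrite and retry until the cap."""
    query = question
    # One initial attempt plus at most max_iterations rewrites.
    for attempt in range(max_iterations + 1):
        docs = retrieve(query)
        if docs_relevant(question, docs):
            answer = generate(question, docs)
            if answer_ok(question, docs, answer):
                return answer
        if attempt < max_iterations:
            query = rewrite(question, docs)
    return FALLBACK

# Toy stand-ins (hypothetical): an exact-match "retriever" and trivial graders.
toy_index = {"early refill authorization policy": ["<toy policy chunk>"]}

result = corrective_loop(
    "can I refill my prescription early?",
    retrieve=lambda q: toy_index.get(q, []),
    docs_relevant=lambda q, docs: len(docs) > 0,
    rewrite=lambda q, docs: "early refill authorization policy",
    generate=lambda q, docs: f"Based on: {docs[0]}",
    answer_ok=lambda q, docs, a: True,
)
print(result)
```

In the toy run the first retrieval misses on vocabulary, the rewrite lands on the document's phrasing, and the second retrieval succeeds. The graders always judge against the original question, never the rewritten one, so a rewrite cannot quietly change what is being asked.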

The State Graph

flowchart TD
  S[start] --> R1[retrieve]
  R1 --> GD{grade documents}
  GD -->|relevant| GEN[generate answer]
  GD -->|not relevant + iters &lt; 3| RW[rewrite query]
  GD -->|not relevant + iters >= 3| FAIL[fallback: I cannot answer]
  RW --> R1
  GEN --> GA{grade answer}
  GA -->|grounded + addresses question| END[return answer]
  GA -->|hallucination + iters &lt; 3| GEN
  GA -->|not addressing question + iters &lt; 3| RW
  GA -->|any failure + iters >= 3| FAIL
  style GD fill:#ffd
  style GA fill:#ffd
  style FAIL fill:#fcc
  style END fill:#cfc

Figure 1 — The agentic RAG loop as a LangGraph state graph. Two grader nodes act as conditional edges; the rewrite/regenerate self-loops are bounded by an iteration counter to prevent runaway cost.

The LangGraph Implementation

Pinned to LangGraph 0.2.x with the modern StateGraph API. Same chat model and embeddings as the single-shot pattern (gpt-4o-2024-08-06, text-embedding-3-large).

from typing import TypedDict, List
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

CHAT_MODEL = "gpt-4o-2024-08-06"
GRADER_MODEL = "gpt-4o-mini-2024-07-18"  # cheaper for graders
MAX_ITERATIONS = 3

class AgentState(TypedDict):
    question: str
    rewritten_question: str
    documents: List[Document]
    answer: str
    iterations: int
    trace: List[dict]

class DocGrade(BaseModel):
    relevant: bool = Field(description="Whether docs answer the question")
    reason: str

class AnswerGrade(BaseModel):
    grounded: bool = Field(description="Answer supported by docs")
    addresses_question: bool

grader_llm = ChatOpenAI(model=GRADER_MODEL, temperature=0)
gen_llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)

def retrieve(state: AgentState) -> AgentState:
    q = state.get("rewritten_question") or state["question"]
    docs = retrieve_and_rerank(q)  # same retriever as single-shot
    return {**state, "documents": docs,
            "trace": state["trace"] + [{"node": "retrieve", "k": len(docs)}]}

def grade_documents(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Grade whether the documents are relevant to the question. "
                   "Be strict: if no chunk directly addresses the question, mark not relevant."),
        ("human", "Q: {q}\n\nDocs:\n{docs}"),
    ])
    chain = prompt | grader_llm.with_structured_output(DocGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
    })
    return {**state, "trace": state["trace"] + [
        {"node": "grade_documents", "relevant": grade.relevant, "reason": grade.reason}
    ]}

def rewrite_query(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Rewrite the question to better match formal documentation vocabulary. "
                   "Expand abbreviations, use canonical terminology, keep meaning identical."),
        ("human", "Original: {q}\nPrevious docs (irrelevant):\n{docs}"),
    ])
    chain = prompt | gen_llm
    rewritten = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content[:200] for d in state["documents"]),
    }).content
    return {**state,
            "rewritten_question": rewritten,
            "iterations": state["iterations"] + 1,
            "trace": state["trace"] + [{"node": "rewrite", "new_q": rewritten}]}

def generate(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer ONLY using the context. Cite source_id in brackets. "
                   "If context is insufficient, say so."),
        ("human", "Q: {q}\n\nContext:\n{ctx}"),
    ])
    chain = prompt | gen_llm
    answer = chain.invoke({
        "q": state["question"],
        "ctx": "\n\n".join(f"[{d.metadata['source_id']}] {d.page_content}"
                              for d in state["documents"]),
    }).content
    return {**state, "answer": answer,
            "trace": state["trace"] + [{"node": "generate"}]}

def grade_answer(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Two checks: (1) is the answer supported by the docs (grounded)? "
                   "(2) does it actually address the original question?"),
        ("human", "Q: {q}\nDocs:\n{docs}\nAnswer: {a}"),
    ])
    chain = prompt | grader_llm.with_structured_output(AnswerGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
        "a": state["answer"],
    })
    return {**state, "trace": state["trace"] + [
        {"node": "grade_answer",
         "grounded": grade.grounded,
         "addresses": grade.addresses_question}
    ]}

# Conditional edges
def docs_route(state: AgentState):
    last = state["trace"][-1]
    if last["relevant"]:
        return "generate"
    if state["iterations"] >= MAX_ITERATIONS:
        return "fallback"
    return "rewrite"

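def answer_route(state: AgentState):
    last = state["trace"][-1]
    if last["grounded"] and last["addresses"]:
        return "end"
    # Bound both retry paths. Rewrites are counted in `iterations`; regenerations
    # are counted from the trace, because generate() does not bump the counter
    # and an ungrounded answer at temperature=0 would otherwise loop forever.
    generations = sum(1 for t in state["trace"] if t["node"] == "generate")
    if state["iterations"] >= MAX_ITERATIONS or generations >= MAX_ITERATIONS:
        return "fallback"
    if not last["grounded"]:
        return "generate"  # try again with same docs
    return "rewrite"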

def fallback(state: AgentState) -> AgentState:
    return {**state, "answer": "I do not have sufficient information to answer that."}

g = StateGraph(AgentState)
g.add_node("retrieve", retrieve)
g.add_node("grade_documents", grade_documents)
g.add_node("rewrite", rewrite_query)
g.add_node("generate", generate)
g.add_node("grade_answer", grade_answer)
g.add_node("fallback", fallback)

g.add_edge(START, "retrieve")
g.add_edge("retrieve", "grade_documents")
g.add_conditional_edges("grade_documents", docs_route,
                        {"generate": "generate", "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("rewrite", "retrieve")
g.add_edge("generate", "grade_answer")
g.add_conditional_edges("grade_answer", answer_route,
                        {"end": END, "generate": "generate",
                         "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("fallback", END)

app = g.compile()

A few production details:

Use the cheap model for graders. gpt-4o-mini-2024-07-18 runs the two grader nodes for roughly 1/15th the cost of gpt-4o with measured agreement above 92% with the bigger model on our internal grader-eval set.
Cap MAX_ITERATIONS. Three is the sweet spot; four does not noticeably help and five regularly causes runaway cost on adversarial inputs.
Trace every node. The trace field is your post-hoc debug primitive. Without it, agentic loops are unobservable.
Structured output for graders is non-optional. Free-form grader output is parser-fragile; with_structured_output is the only reliable pattern.
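
To make the trace useful at a glance, a small formatter goes a long way. This is a sketch that assumes the trace entries have exactly the shape written by the nodes above; `summarize_trace` and the example trace are hypothetical, not part of LangGraph.

```python
from typing import List

def summarize_trace(trace: List[dict]) -> str:
    """Render a trace as a one-line path, annotating grader verdicts."""
    parts = []
    for step in trace:
        node = step["node"]
        if node == "grade_documents":
            node += "(relevant)" if step["relevant"] else "(not relevant)"
        elif node == "grade_answer":
            passed = step["grounded"] and step["addresses"]
            node += "(pass)" if passed else "(fail)"
        parts.append(node)
    return " -> ".join(parts)

# Hand-written example trace: one failed retrieval, one rewrite, then success.
example_trace = [
    {"node": "retrieve", "k": 12},
    {"node": "grade_documents", "relevant": False, "reason": "off-topic"},
    {"node": "rewrite", "new_q": "early refill authorization policy"},
    {"node": "retrieve", "k": 12},
    {"node": "grade_documents", "relevant": True, "reason": "direct match"},
    {"node": "generate"},
    {"node": "grade_answer", "grounded": True, "addresses": True},
]
print(summarize_trace(example_trace))
```

Logging this one-line path next to the question makes it easy to grep for the paths that matter, such as every run that ended without a passing answer grade.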

Evaluating Agentic RAG

The single-shot RAGAS metrics still apply, but they are no longer enough. You also need to measure the loop itself. Four additional things matter:

Metric	What it answers	Why it matters
Mean iterations per question	How often does the loop re-try?	If most questions hit 3 iterations, your retriever or chunking is broken
Per-iteration retrieval quality	Does context_precision improve from iter 1 → iter 2?	If not, query rewriting is not helping
Tool calls per question	Total node invocations recorded in the trace (retrieve + grade_docs + rewrite + generate + grade_answer)	Direct cost driver
Iteration delta on terminal answer quality	Does faithfulness on questions that took 2+ iterations beat single-shot on the same questions?	The whole point — does the loop earn its keep?

The eval pattern, run on the same held-out QA set as single-shot:

def run_with_metrics(question: str, ground_truth: str):
    final = app.invoke({
        "question": question, "rewritten_question": "",
        "documents": [], "answer": "", "iterations": 0, "trace": [],
    })

    # Per-iteration views of the trace: chunk counts and document-grader verdicts
    retrieval_iters = [t for t in final["trace"] if t["node"] == "retrieve"]
    grade_iters = [t for t in final["trace"] if t["node"] == "grade_documents"]

    return {
        "question": question,
        "answer": final["answer"],
        "iterations": final["iterations"],
        "tool_calls": len(final["trace"]),
        "retrieved_k_trace": [r["k"] for r in retrieval_iters],
        "retrieval_grade_trace": [t["relevant"] for t in grade_iters],
        "ground_truth": ground_truth,
    }

# Then run RAGAS faithfulness + answer_relevancy on the terminal answers
# alongside the loop-specific metrics above.
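
The per-question records then need rolling up into the loop-level numbers from the table. A minimal sketch, assuming each record has the `iterations`, `tool_calls`, and `retrieval_grade_trace` keys returned above; `summarize_runs` and the sample records are hypothetical.

```python
from statistics import mean
from typing import List

MAX_ITERATIONS = 3  # same cap as the graph

def summarize_runs(results: List[dict]) -> dict:
    """Aggregate loop metrics over a batch of per-question records."""
    iterations = [r["iterations"] for r in results]
    # Questions whose first retrieval was graded not relevant.
    first_miss = [r for r in results
                  if r["retrieval_grade_trace"] and not r["retrieval_grade_trace"][0]]
    # Of those, the ones whose second retrieval (after one rewrite) was graded relevant.
    recovered = [r for r in first_miss
                 if len(r["retrieval_grade_trace"]) >= 2 and r["retrieval_grade_trace"][1]]
    return {
        "mean_iterations": mean(iterations),
        "share_at_cap": sum(i >= MAX_ITERATIONS for i in iterations) / len(results),
        "mean_tool_calls": mean(r["tool_calls"] for r in results),
        "rewrite_recovery_rate": len(recovered) / len(first_miss) if first_miss else 0.0,
    }

# Hand-written sample records, not production data.
sample = [
    {"iterations": 0, "tool_calls": 4, "retrieval_grade_trace": [True]},
    {"iterations": 1, "tool_calls": 7, "retrieval_grade_trace": [False, True]},
    {"iterations": 3, "tool_calls": 11, "retrieval_grade_trace": [False, False, False, False]},
    {"iterations": 0, "tool_calls": 4, "retrieval_grade_trace": [True]},
]
summary = summarize_runs(sample)
print(summary)
```

`rewrite_recovery_rate` is a cheap proxy for the "does rewriting help" question: it uses grader verdicts rather than context_precision, so treat it as a smoke signal and confirm with RAGAS on the per-iteration contexts.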

Cost Analysis: When the Loop Is Worth It

Real numbers from one of our production healthcare FAQ agents, 1,000 questions, May 2026:

Mode	Avg LLM calls/q	Avg cost/q	Avg latency p50	Faithfulness	Recall on hard subset
Single-shot RAG	1.0	$0.0021	1.4s	0.93	0.71
Agentic RAG (always on)	3.6	$0.0078	4.9s	0.95	0.88
Adaptive routing	1.7	$0.0041	2.3s	0.95	0.87

Agentic-always is 3.7x the cost and 3.5x the latency of single-shot for a 2-point bump in faithfulness and a 17-point bump in recall on hard questions. On easy questions, the loop adds latency and cost for zero quality gain — graders confirm the first retrieval was fine and the loop exits in one iteration anyway.
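
The ratios are easy to recompute from the table, which is worth doing whenever the numbers are refreshed. The sketch below only restates the table's figures; the `rows` dictionary and `ratio` helper are illustrative names.

```python
# Figures copied from the cost table above (per question).
rows = {
    "single_shot": {"cost": 0.0021, "p50_s": 1.4, "faithfulness": 0.93, "hard_recall": 0.71},
    "agentic":     {"cost": 0.0078, "p50_s": 4.9, "faithfulness": 0.95, "hard_recall": 0.88},
    "adaptive":    {"cost": 0.0041, "p50_s": 2.3, "faithfulness": 0.95, "hard_recall": 0.87},
}

def ratio(mode: str, baseline: str, key: str) -> float:
    """How many times larger `mode` is than `baseline` on one column."""
    return rows[mode][key] / rows[baseline][key]

cost_x = round(ratio("agentic", "single_shot", "cost"), 1)
latency_x = round(ratio("agentic", "single_shot", "p50_s"), 1)
adaptive_vs_agentic_cost = round(ratio("adaptive", "agentic", "cost"), 2)
recall_gain_pts = round(
    (rows["agentic"]["hard_recall"] - rows["single_shot"]["hard_recall"]) * 100
)
print(cost_x, latency_x, adaptive_vs_agentic_cost, recall_gain_pts)
```

The same helper shows adaptive routing costing about 0.53x what agentic-always costs per question on this dataset.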

The economic answer is adaptive routing: a small classifier (or a cheap LLM call) decides whether the question is "easy" (likely to succeed single-shot) or "hard" (likely to need the loop). Easy goes to single-shot; hard goes to agentic. Adaptive routing on this dataset matches agentic-always on faithfulness (0.95) and comes within one point on hard-subset recall (0.87 vs. 0.88) at roughly half the cost ($0.0041 vs. $0.0078 per question).

A simple-but-effective router: run the question through single-shot, run grade_answer on the result, and if it fails, escalate to the agentic loop. That gives you single-shot cost on easy questions and one extra grader call as the routing tax.
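
That escalate-on-failure router is only a few lines. The sketch below injects the three pieces as callables so it stays independent of any framework; `answer_with_escalation` and the toy stand-ins are hypothetical, and in practice `passes_grader` would wrap the same checks as grade_answer and `agentic` would wrap the compiled graph.

```python
from typing import Callable, List, Tuple

def answer_with_escalation(
    question: str,
    single_shot: Callable[[str], Tuple[str, List[str]]],  # returns (answer, docs)
    passes_grader: Callable[[str, List[str], str], bool],
    agentic: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the cheap path first; escalate only when the answer grade fails."""
    answer, docs = single_shot(question)
    if passes_grader(question, docs, answer):
        return answer, "single_shot"
    return agentic(question), "agentic"

# Toy stand-ins: the grader rejects any answer that punts.
def toy_single_shot(q: str) -> Tuple[str, List[str]]:
    if "copay" in q:
        return "I am not sure.", ["doc-1"]
    return "cheap answer", ["doc-1"]

def toy_grader(q: str, docs: List[str], answer: str) -> bool:
    return answer != "I am not sure."

easy = answer_with_escalation("opening hours?", toy_single_shot, toy_grader,
                              lambda q: "loop answer")
hard = answer_with_escalation("copay for the generic?", toy_single_shot, toy_grader,
                              lambda q: "loop answer")
print(easy, hard)
```

Returning the route alongside the answer keeps the routing decision visible in logs. Escalated questions pay for the single-shot attempt and the loop, so the pattern only wins when most traffic passes on the first try.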

Honest Tradeoffs

Latency tail is the killer. Average latency is misleading; p99 on agentic RAG is often 3x p50 because of the worst-case 3-iteration path. For voice agents (where the CallSphere voice product operates at 200ms response targets), agentic RAG is incompatible with the live turn unless wrapped in a "hold on" filler phrase. Use it for chat, async, and slow-path features.
Graders are LLM-as-judge in production. Calibrate them quarterly against humans. We caught a grader that started returning relevant=true on 100% of inputs after a model alias drift; the only reason we noticed was the iteration-count metric flatlining at 1.0.
Loop cost can spiral with bad chunking. If your retriever is bad, the loop will grind through 3 iterations on most questions and 4x your bill. Fix retrieval first, then add the loop.
The agentic loop is harder to evaluate. Per-iteration metrics are not standard in RAGAS; you build them yourself. Budget engineering time for it.
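
The stuck-grader failure described above is cheap to alert on. A minimal sketch, assuming you can collect the recent `relevant` verdicts from production traces; the function name and both thresholds are placeholders to tune, not recommended values.

```python
from typing import List

def grader_looks_stuck(
    relevant_flags: List[bool],
    min_samples: int = 200,
    tolerance: float = 0.01,
) -> bool:
    """True when the document grader returns (almost) the same verdict on every input."""
    if len(relevant_flags) < min_samples:
        return False  # too little data to call it drift
    rate = sum(relevant_flags) / len(relevant_flags)
    return rate <= tolerance or rate >= 1 - tolerance

print(grader_looks_stuck([True] * 500))                  # always relevant
print(grader_looks_stuck([True] * 300 + [False] * 200))  # healthy mix
```

This complements the periodic human calibration rather than replacing it: it catches a grader that collapsed to a constant, not one that is subtly miscalibrated.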

The right model is: agentic RAG is a tool you reach for after single-shot RAG with proper evals exists and you have measured which subset of your traffic needs it. Building agentic RAG before you have RAGAS in CI is putting the engine in before the chassis.

Frequently Asked Questions

Should I always use agentic RAG?

No. Use single-shot for easy questions and adaptive routing to send only hard questions to the loop. Always-on agentic RAG is paying full price for 70% of queries that did not need the upgrade.

How is this different from a ReAct agent doing search?

ReAct treats search as one tool among many and gives the model open-ended control over when to call it. Agentic RAG is a structured loop with explicit grader nodes and bounded iterations. ReAct is more flexible but harder to evaluate; agentic RAG is more constrained, and every run produces a trace built from the same fixed set of nodes, which makes runs directly comparable in evals.

What about caching?

Cache the embeddings of rewritten queries (saves the embed call on retry) and cache grader outputs keyed on (question, doc_ids). On our traffic these two caches together cut iterative-path cost by ~22%.
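
A grader cache keyed on (question, doc_ids) can be sketched with an in-memory dict. `GraderCache` is a hypothetical helper, and two choices in it are assumptions rather than anything the pipeline requires: the question is lowercased and stripped before hashing, and doc IDs are sorted so retrieval order does not change the key.

```python
import hashlib
from typing import Callable, Dict, Sequence

class GraderCache:
    """Memoize grader verdicts on a hash of (normalized question, sorted doc IDs)."""

    def __init__(self, grade_fn: Callable[[str, Sequence[str]], bool]):
        self._grade_fn = grade_fn
        self._store: Dict[str, bool] = {}
        self.hits = 0

    @staticmethod
    def key(question: str, doc_ids: Sequence[str]) -> str:
        normalized = question.strip().lower() + "|" + ",".join(sorted(doc_ids))
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def grade(self, question: str, doc_ids: Sequence[str]) -> bool:
        k = self.key(question, doc_ids)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        verdict = self._grade_fn(question, doc_ids)
        self._store[k] = verdict
        return verdict

# Toy grader that records how often it is actually called.
calls = []
def toy_grader(question: str, doc_ids: Sequence[str]) -> bool:
    calls.append(question)
    return True

cache = GraderCache(toy_grader)
first = cache.grade("Can I refill early?", ["doc-2", "doc-1"])
second = cache.grade("can i refill early? ", ["doc-1", "doc-2"])
print(len(calls), cache.hits)
```

Keying on doc IDs assumes a chunk's content never changes under the same ID, so the cache should be flushed on re-indexing and whenever the grader prompt or model changes.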

Does this work with Anthropic models?

Yes. Swap gpt-4o-2024-08-06 for claude-sonnet-4-5-20250929 and gpt-4o-mini-2024-07-18 for claude-haiku-4-5-20251001, and construct the models with ChatAnthropic from langchain_anthropic instead of ChatOpenAI. The graph is identical; LangGraph is provider-neutral.

How do I debug a loop that won't terminate?

The trace field plus a hard MAX_ITERATIONS cap. Read the companion piece on the trace-to-fix workflow for the muscle memory. Most non-terminating loops we have seen were graders giving inconsistent verdicts on the same docs — fix the grader prompt or pin a stronger judge model.
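
One way to confirm the inconsistent-grader diagnosis is to replay the grader several times on the same inputs and measure how often it agrees with itself. A minimal sketch: `verdict_agreement` is a hypothetical helper, and `grade_fn` stands in for whatever wraps your grader chain.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List

def verdict_agreement(
    grade_fn: Callable[[str, List[str]], bool],
    question: str,
    docs: List[str],
    runs: int = 5,
) -> float:
    """Fraction of runs that return the majority verdict on identical inputs."""
    verdicts = Counter(grade_fn(question, docs) for _ in range(runs))
    return verdicts.most_common(1)[0][1] / runs

# Toy graders: one flips its verdict on every call, one is stable.
flip = cycle([True, False])
flaky_score = verdict_agreement(lambda q, d: next(flip), "q", ["doc"], runs=5)
steady_score = verdict_agreement(lambda q, d: True, "q", ["doc"], runs=5)
print(flaky_score, steady_score)
```

Agreement well below 1.0 on identical inputs points at the grader prompt or model rather than at retrieval, which is the case where tightening the prompt or pinning a stronger judge is the right fix.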
