Agentic RAG with LangGraph: Iterative Retrieval, Self-Correction, and Eval Pipelines
Beyond single-shot RAG — agentic RAG with LangGraph that re-retrieves, self-grades, and rewrites queries. With evals that catch silent retrieval drift.
TL;DR
Single-shot RAG (retrieve once, generate once) handles the easy 70% of questions. The remaining 30% — multi-hop, ambiguous phrasing, vocabulary mismatch, missing context on first pull — is where production RAG either gets quiet about its failures or gets agentic. Agentic RAG is RAG wired as a LangGraph state graph that can re-retrieve, grade its own retrievals, rewrite the query when retrieval is bad, and grade its own answer before returning. Self-RAG, CRAG (Corrective RAG), and adaptive RAG are all variations on the same loop. They cost 2–4x more in tokens and latency than single-shot RAG. They are worth that price on a measurable subset of questions and absolutely not worth it on the rest. This post is the working LangGraph implementation, the per-iteration eval pattern, and the cost analysis we use to decide when to send a question to the agentic path versus the cheap path on CallSphere healthcare and finance RAG agents.
Why Single-Shot RAG Quietly Fails
The single-shot pattern from the companion post on production RAG with RAGAS looks like this:
question → retrieve(k=12) → rerank → generate → answer
It fails silently when:
- The user's phrasing does not match the source vocabulary ("can I refill my prescription early?" vs. document language about "early refill authorization policy"). Retrieval misses; the LLM either hallucinates or punts.
- The question is multi-hop. ("What is the copay for a 90-day supply of the generic equivalent?") The first retrieval pulls one of the two needed chunks; the answer is wrong-but-plausible.
- The retrieved chunks contradict each other. Single-shot has no mechanism to re-retrieve to disambiguate.
- The retrieved chunks are off-topic but the LLM tries to answer anyway.
Each of these is a retrieval problem masquerading as a generation problem. The agentic loop fixes them by giving the system the ability to look again.
The Three Patterns
The literature uses three names for what is mostly the same idea with different tuning:
| Pattern | Origin | Distinguishing move |
|---|---|---|
| Self-RAG | Asai et al. 2023 | Generation produces reflection tokens that decide whether to retrieve |
| CRAG (Corrective RAG) | Yan et al. 2024 | A retrieval grader fires before generation; bad retrievals trigger query rewriting and web search fallback |
| Adaptive RAG | Jeong et al. 2024 | A classifier routes simple questions to single-shot, complex ones to iterative |
In production, the practical synthesis is: use a retrieval grader, rewrite the query when retrieval is bad, optionally re-retrieve, grade the final answer, and cap iterations. That is what LangGraph models cleanly.
The State Graph
flowchart TD
S[start] --> R1[retrieve]
R1 --> GD{grade documents}
GD -->|relevant| GEN[generate answer]
GD -->|not relevant + iters < 3| RW[rewrite query]
GD -->|not relevant + iters >= 3| FAIL[fallback: I cannot answer]
RW --> R1
GEN --> GA{grade answer}
GA -->|grounded + addresses question| END[return answer]
GA -->|hallucination + iters < 3| GEN
GA -->|hallucination + iters >= 3| FAIL
GA -->|not addressing question + iters < 3| RW
GA -->|not addressing + iters >= 3| FAIL
style GD fill:#ffd
style GA fill:#ffd
style FAIL fill:#fcc
style END fill:#cfc
Figure 1 — The agentic RAG loop as a LangGraph state graph. Two grader nodes act as conditional edges; the rewrite/regenerate self-loops are bounded by an iteration counter to prevent runaway cost.
The LangGraph Implementation
Pinned to LangGraph 0.2.x with the modern StateGraph API. Same chat model and embeddings as the single-shot pattern (gpt-4o-2024-08-06, text-embedding-3-large).
from typing import TypedDict, List
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
CHAT_MODEL = "gpt-4o-2024-08-06"
GRADER_MODEL = "gpt-4o-mini-2024-07-18" # cheaper for graders
MAX_ITERATIONS = 3
class AgentState(TypedDict):
question: str
rewritten_question: str
documents: List[Document]
answer: str
iterations: int
trace: List[dict]
class DocGrade(BaseModel):
relevant: bool = Field(description="Whether docs answer the question")
reason: str
class AnswerGrade(BaseModel):
grounded: bool = Field(description="Answer supported by docs")
addresses_question: bool
grader_llm = ChatOpenAI(model=GRADER_MODEL, temperature=0)
gen_llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)
def retrieve(state: AgentState) -> AgentState:
q = state.get("rewritten_question") or state["question"]
docs = retrieve_and_rerank(q) # same retriever as single-shot
return {**state, "documents": docs,
"trace": state["trace"] + [{"node": "retrieve", "k": len(docs)}]}
def grade_documents(state: AgentState) -> AgentState:
prompt = ChatPromptTemplate.from_messages([
("system", "Grade whether the documents are relevant to the question. "
"Be strict: if no chunk directly addresses the question, mark not relevant."),
("human", "Q: {q}\n\nDocs:\n{docs}"),
])
chain = prompt | grader_llm.with_structured_output(DocGrade)
grade = chain.invoke({
"q": state["question"],
"docs": "\n---\n".join(d.page_content for d in state["documents"]),
})
return {**state, "trace": state["trace"] + [
{"node": "grade_documents", "relevant": grade.relevant, "reason": grade.reason}
]}
def rewrite_query(state: AgentState) -> AgentState:
prompt = ChatPromptTemplate.from_messages([
("system", "Rewrite the question to better match formal documentation vocabulary. "
"Expand abbreviations, use canonical terminology, keep meaning identical."),
("human", "Original: {q}\nPrevious docs (irrelevant):\n{docs}"),
])
chain = prompt | gen_llm
rewritten = chain.invoke({
"q": state["question"],
"docs": "\n---\n".join(d.page_content[:200] for d in state["documents"]),
}).content
return {**state,
"rewritten_question": rewritten,
"iterations": state["iterations"] + 1,
"trace": state["trace"] + [{"node": "rewrite", "new_q": rewritten}]}
def generate(state: AgentState) -> AgentState:
prompt = ChatPromptTemplate.from_messages([
("system", "Answer ONLY using the context. Cite source_id in brackets. "
"If context is insufficient, say so."),
("human", "Q: {q}\n\nContext:\n{ctx}"),
])
chain = prompt | gen_llm
answer = chain.invoke({
"q": state["question"],
"ctx": "\n\n".join(f"[{d.metadata['source_id']}] {d.page_content}"
for d in state["documents"]),
}).content
return {**state, "answer": answer,
"trace": state["trace"] + [{"node": "generate"}]}
def grade_answer(state: AgentState) -> AgentState:
prompt = ChatPromptTemplate.from_messages([
("system", "Two checks: (1) is the answer supported by the docs (grounded)? "
"(2) does it actually address the original question?"),
("human", "Q: {q}\nDocs:\n{docs}\nAnswer: {a}"),
])
chain = prompt | grader_llm.with_structured_output(AnswerGrade)
grade = chain.invoke({
"q": state["question"],
"docs": "\n---\n".join(d.page_content for d in state["documents"]),
"a": state["answer"],
})
return {**state, "trace": state["trace"] + [
{"node": "grade_answer",
"grounded": grade.grounded,
"addresses": grade.addresses_question}
]}
# Conditional edges
def docs_route(state: AgentState):
last = state["trace"][-1]
if last["relevant"]:
return "generate"
if state["iterations"] >= MAX_ITERATIONS:
return "fallback"
return "rewrite"
def answer_route(state: AgentState):
    last = state["trace"][-1]
    if last["grounded"] and last["addresses"]:
        return "end"
    # Check the cap before either retry path; otherwise the regenerate
    # self-loop never terminates when the grader keeps failing the answer.
    if state["iterations"] >= MAX_ITERATIONS:
        return "fallback"
    if not last["grounded"]:
        return "generate"  # try again with the same docs
    return "rewrite"
def fallback(state: AgentState) -> AgentState:
return {**state, "answer": "I do not have sufficient information to answer that."}
g = StateGraph(AgentState)
g.add_node("retrieve", retrieve)
g.add_node("grade_documents", grade_documents)
g.add_node("rewrite", rewrite_query)
g.add_node("generate", generate)
g.add_node("grade_answer", grade_answer)
g.add_node("fallback", fallback)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "grade_documents")
g.add_conditional_edges("grade_documents", docs_route,
{"generate": "generate", "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("rewrite", "retrieve")
g.add_edge("generate", "grade_answer")
g.add_conditional_edges("grade_answer", answer_route,
{"end": END, "generate": "generate",
"rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("fallback", END)
app = g.compile()
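The graph leans on the retrieve_and_rerank helper carried over from the single-shot pipeline. A minimal stand-in, assuming a Chroma collection embedded with the same text-embedding-3-large model; the vector store and collection name here are illustrative, not from the original pipeline:

# Illustrative stand-in for the single-shot retriever. Assumes a Chroma
# collection named "docs"; swap in whatever store the real pipeline uses.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vector_store = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)

def retrieve_and_rerank(question: str, k: int = 12) -> List[Document]:
    # Dense retrieval only; insert your reranker (e.g. a cross-encoder)
    # between this call and the return if the real pipeline has one.
    return vector_store.similarity_search(question, k=k)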
A few production details:
- Use the cheap model for graders. `gpt-4o-mini-2024-07-18` runs the two grader nodes for roughly 1/15th the cost of `gpt-4o`, with measured agreement above 92% with the bigger model on our internal grader-eval set.
- Cap `MAX_ITERATIONS`. Three is the sweet spot; four does not noticeably help and five regularly causes runaway cost on adversarial inputs.
- Trace every node. The `trace` field is your post-hoc debug primitive; without it, agentic loops are unobservable. (A pretty-printer sketch follows this list.)
- Structured output for graders is non-optional. Free-form grader output is parser-fragile; `with_structured_output` is the only reliable pattern.
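A small helper makes the trace readable when a question goes sideways; a sketch, not part of the compiled graph:

# Sketch: pretty-print the per-node trace after a run, so a failed
# question can be replayed node by node.
def print_trace(trace: List[dict]) -> None:
    for i, event in enumerate(trace):
        detail = {k: v for k, v in event.items() if k != "node"}
        print(f"{i:02d} {event['node']:<16} {detail}")

# Usage after a run:
#   final = app.invoke({...initial state...})
#   print_trace(final["trace"])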
Evaluating Agentic RAG
The single-shot RAGAS metrics still apply, but they are no longer enough. You also need to measure the loop itself. Four additional things matter:
| Metric | What it answers | Why it matters |
|---|---|---|
| Mean iterations per question | How often does the loop re-try? | If most questions hit 3 iterations, your retriever or chunking is broken |
| Per-iteration retrieval quality | Does context_precision improve from iter 1 → iter 2? | If not, query rewriting is not helping |
| Tool calls per question | Total LLM calls (retrieve + grade_docs + rewrite + generate + grade_answer) | Direct cost driver |
| Iteration delta on terminal answer quality | Does faithfulness on questions that took 2+ iterations beat single-shot on the same questions? | The whole point — does the loop earn its keep? |
The eval pattern, run on the same held-out QA set as single-shot:
def run_with_metrics(question: str, ground_truth: str):
final = app.invoke({
"question": question, "rewritten_question": "",
"documents": [], "answer": "", "iterations": 0, "trace": [],
})
# Per-iteration retrieval quality
retrieval_iters = [t for t in final["trace"] if t["node"] == "retrieve"]
grade_iters = [t for t in final["trace"] if t["node"] == "grade_documents"]
    return {
        "question": question,
        "answer": final["answer"],
        "iterations": final["iterations"],
        "tool_calls": len(final["trace"]),
        "retrieval_passes": len(retrieval_iters),
        "retrieval_grade_trace": [g["relevant"] for g in grade_iters],
        "ground_truth": ground_truth,
    }
# Then run RAGAS faithfulness + answer_relevancy on the terminal answers
# alongside the loop-specific metrics above.
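Aggregating those per-question records yields the loop-level metrics from the table above. A sketch, assuming `results` is a list of `run_with_metrics` outputs over the eval set:

# Sketch: loop-level aggregates over the eval set.
def loop_metrics(results: List[dict]) -> dict:
    n = len(results)
    grades = [r["retrieval_grade_trace"] for r in results]
    return {
        "mean_iterations": sum(r["iterations"] for r in results) / n,
        "mean_tool_calls": sum(r["tool_calls"] for r in results) / n,
        # First retrieval already graded relevant: the loop added nothing.
        "first_pass_rate": sum(1 for g in grades if g and g[0]) / n,
        # First retrieval bad, a later one good: the rewrite earned its keep.
        "loop_rescue_rate": sum(1 for g in grades
                                if g and not g[0] and any(g[1:])) / n,
    }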
Cost Analysis: When the Loop Is Worth It
Real numbers from one of our production healthcare FAQ agents, 1,000 questions, May 2026:
| Mode | Avg LLM calls/q | Avg cost/q | Avg latency p50 | Faithfulness | Recall on hard subset |
|---|---|---|---|---|---|
| Single-shot RAG | 1.0 | $0.0021 | 1.4s | 0.93 | 0.71 |
| Agentic RAG (always on) | 3.6 | $0.0078 | 4.9s | 0.95 | 0.88 |
| Adaptive routing | 1.7 | $0.0041 | 2.3s | 0.95 | 0.87 |
Agentic-always is 3.7x the cost and 3.5x the latency of single-shot for a 2-point bump in faithfulness and a 17-point bump in recall on hard questions. On easy questions, the loop adds latency and cost for zero quality gain — graders confirm the first retrieval was fine and the loop exits in one iteration anyway.
The economic answer is adaptive routing: a small classifier (or a cheap LLM call) decides whether the question is "easy" (likely to succeed single-shot) or "hard" (likely to need the loop). Easy goes to single-shot; hard goes to agentic. Adaptive routing on this dataset matches agentic-always quality at roughly half the cost.
A simple-but-effective router: run the question through single-shot, run grade_answer on the result, and if it fails, escalate to the agentic loop. That gives you single-shot cost on easy questions and one extra grader call as the routing tax.
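A sketch of that router; `single_shot_answer` stands in for the companion post's pipeline and is an assumption here:

# Sketch: cheap path first, escalate to the agentic graph on grader failure.
def answer_with_routing(question: str) -> str:
    docs = retrieve_and_rerank(question)
    draft = single_shot_answer(question, docs)  # assumed single-shot generator
    verdict = grade_answer({"question": question, "rewritten_question": "",
                            "documents": docs, "answer": draft,
                            "iterations": 0, "trace": []})["trace"][-1]
    if verdict["grounded"] and verdict["addresses"]:
        return draft  # easy path: one generation plus one grader call
    # Hard path: hand the question to the full agentic loop.
    final = app.invoke({"question": question, "rewritten_question": "",
                        "documents": [], "answer": "", "iterations": 0,
                        "trace": []})
    return final["answer"]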
Honest Tradeoffs
- Latency tail is the killer. Average latency is misleading; p99 on agentic RAG is often 3x p50 because of the worst-case 3-iteration path. For voice agents (where the CallSphere voice product operates at 200ms response targets), agentic RAG is incompatible with the live turn unless wrapped in a "hold on" filler phrase. Use it for chat, async, and slow-path features.
- Graders are LLM-as-judge in production. Calibrate them quarterly against humans. We caught a grader that started returning `relevant=true` on 100% of inputs after a model alias drift; the only reason we noticed was the iteration-count metric flatlining at 1.0.
- Loop cost can spiral with bad chunking. If your retriever is bad, the loop will grind through 3 iterations on most questions and 4x your bill. Fix retrieval first, then add the loop.
- The agentic loop is harder to evaluate. Per-iteration metrics are not standard in RAGAS; you build them yourself. Budget engineering time for it.
The right model is: agentic RAG is a tool you reach for after single-shot RAG with proper evals exists and you have measured which subset of your traffic needs it. Building agentic RAG before you have RAGAS in CI is putting the engine in before the chassis.
Frequently Asked Questions
Should I always use agentic RAG?
No. Use single-shot for easy questions and adaptive routing to send only hard questions to the loop. Always-on agentic RAG is paying full price for 70% of queries that did not need the upgrade.
How is this different from a ReAct agent doing search?
ReAct treats search as one tool among many and gives the model open-ended control over when to call it. Agentic RAG is a structured loop with explicit grader nodes and bounded iterations. ReAct is more flexible but harder to evaluate; agentic RAG is more constrained but produces deterministic eval traces.
What about caching?
Cache rewritten queries to embeddings (saves the embed call on retry) and cache grader outputs keyed on (question, doc_ids). On our traffic these two caches together cut iterative-path cost by ~22%.
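A sketch of the grader-output half of that, keyed on the question plus sorted document IDs. An in-process dict is shown for illustration; shared traffic wants Redis or similar:

# Sketch: memoize document-grader verdicts on (question, doc_ids).
_grade_cache: dict = {}

def cached_doc_grade(question: str, docs: List[Document]) -> dict:
    key = (question, tuple(sorted(d.metadata["source_id"] for d in docs)))
    if key not in _grade_cache:
        state = {"question": question, "rewritten_question": "",
                 "documents": docs, "answer": "", "iterations": 0, "trace": []}
        _grade_cache[key] = grade_documents(state)["trace"][-1]
    return _grade_cache[key]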
Does this work with Anthropic models?
Yes. Swap gpt-4o-2024-08-06 for claude-sonnet-4-5-20250929 and gpt-4o-mini-2024-07-18 for claude-haiku-4-5-20250515. The graphs are identical; LangGraph is provider-neutral.
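Assuming langchain-anthropic is installed, the swap touches only the two model handles (model IDs as named above):

# Same graph, Anthropic models; nothing else in the graph changes.
from langchain_anthropic import ChatAnthropic

gen_llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)
grader_llm = ChatAnthropic(model="claude-haiku-4-5-20250515", temperature=0)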
How do I debug a loop that won't terminate?
The trace field plus a hard MAX_ITERATIONS cap. Read the companion piece on the trace-to-fix workflow for the muscle memory. Most non-terminating loops we have seen were graders giving inconsistent verdicts on the same docs — fix the grader prompt or pin a stronger judge model.