By Sagar Shankaran, Founder of CallSphere
Beyond single-shot RAG — agentic RAG with LangGraph that re-retrieves, self-grades, and rewrites queries. With evals that catch silent retrieval drift.
Key takeaways
Single-shot RAG (retrieve once, generate once) handles the easy 70% of questions. The remaining 30% — multi-hop, ambiguous phrasing, vocabulary mismatch, missing context on first pull — is where production RAG either gets quiet about its failures or gets agentic. Agentic RAG is RAG wired as a LangGraph state graph that can re-retrieve, grade its own retrievals, rewrite the query when retrieval is bad, and grade its own answer before returning. Self-RAG, CRAG (Corrective RAG), and adaptive RAG are all variations on the same loop. They cost 2–4x more in tokens and latency than single-shot RAG. They are worth that price on a measurable subset of questions and absolutely not worth it on the rest. This post is the working LangGraph implementation, the per-iteration eval pattern, and the cost analysis we use to decide when to send a question to the agentic path versus the cheap path on CallSphere healthcare and finance RAG agents.
The single-shot pattern from the companion post on production RAG with RAGAS looks like this:
question → retrieve(k=12) → rerank → generate → answer
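It fails silently when the question is multi-hop, ambiguously phrased, worded in vocabulary that does not match the documents, or dependent on context the first pull missed.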
Each of these is a retrieval problem masquerading as a generation problem. The agentic loop fixes them by giving the system the ability to look again.
The literature uses three names that are mostly the same idea with different tuning:
| Pattern | Origin | Distinguishing move |
|---|---|---|
| Self-RAG | Asai et al. 2023 | Generation produces reflection tokens that decide whether to retrieve |
| CRAG (Corrective RAG) | Yan et al. 2024 | A retrieval grader fires before generation; bad retrievals trigger query rewriting and web search fallback |
| Adaptive RAG | Jeong et al. 2024 | A classifier routes simple questions to single-shot, complex ones to iterative |
In production, the practical synthesis is: use a retrieval grader, rewrite the query when retrieval is bad, optionally re-retrieve, grade the final answer, and cap iterations. That is what LangGraph models cleanly.
flowchart TD
S[start] --> R1[retrieve]
R1 --> GD{grade documents}
GD -->|relevant| GEN[generate answer]
GD -->|not relevant + iters < 3| RW[rewrite query]
GD -->|not relevant + iters >= 3| FAIL[fallback: I cannot answer]
RW --> R1
GEN --> GA{grade answer}
GA -->|grounded + addresses question| END[return answer]
GA -->|hallucination| GEN
GA -->|not addressing question + iters < 3| RW
GA -->|not addressing + iters >= 3| FAIL
style GD fill:#ffd
style GA fill:#ffd
style FAIL fill:#fcc
style END fill:#cfc
Figure 1 — The agentic RAG loop as a LangGraph state graph. Two grader nodes act as conditional edges; the rewrite/regenerate self-loops are bounded by an iteration counter to prevent runaway cost.
Pinned to LangGraph 0.2.x with the modern StateGraph API. Same chat model and embeddings as the single-shot pattern (gpt-4o-2024-08-06, text-embedding-3-large).
from typing import TypedDict, List

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

CHAT_MODEL = "gpt-4o-2024-08-06"
GRADER_MODEL = "gpt-4o-mini-2024-07-18"  # cheaper for graders
MAX_ITERATIONS = 3


class AgentState(TypedDict):
    question: str
    rewritten_question: str
    documents: List[Document]
    answer: str
    iterations: int
    trace: List[dict]


class DocGrade(BaseModel):
    relevant: bool = Field(description="Whether docs answer the question")
    reason: str


class AnswerGrade(BaseModel):
    grounded: bool = Field(description="Answer supported by docs")
    addresses_question: bool


grader_llm = ChatOpenAI(model=GRADER_MODEL, temperature=0)
gen_llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)
def retrieve(state: AgentState) -> AgentState:
    # Use the rewritten query once a rewrite has happened.
    q = state.get("rewritten_question") or state["question"]
    docs = retrieve_and_rerank(q)  # same retriever as single-shot
    return {**state, "documents": docs,
            "trace": state["trace"] + [{"node": "retrieve", "k": len(docs)}]}


def grade_documents(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Grade whether the documents are relevant to the question. "
                   "Be strict: if no chunk directly addresses the question, mark not relevant."),
        ("human", "Q: {q}\n\nDocs:\n{docs}"),
    ])
    chain = prompt | grader_llm.with_structured_output(DocGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
    })
    # The verdict is stored in the trace; docs_route reads it from there.
    return {**state, "trace": state["trace"] + [
        {"node": "grade_documents", "relevant": grade.relevant, "reason": grade.reason}
    ]}


def rewrite_query(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Rewrite the question to better match formal documentation vocabulary. "
                   "Expand abbreviations, use canonical terminology, keep meaning identical."),
        ("human", "Original: {q}\nPrevious docs (irrelevant):\n{docs}"),
    ])
    chain = prompt | gen_llm
    rewritten = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content[:200] for d in state["documents"]),
    }).content
    # Each rewrite counts as one iteration toward MAX_ITERATIONS.
    return {**state,
            "rewritten_question": rewritten,
            "iterations": state["iterations"] + 1,
            "trace": state["trace"] + [{"node": "rewrite", "new_q": rewritten}]}


def generate(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer ONLY using the context. Cite source_id in brackets. "
                   "If context is insufficient, say so."),
        ("human", "Q: {q}\n\nContext:\n{ctx}"),
    ])
    chain = prompt | gen_llm
    answer = chain.invoke({
        "q": state["question"],
        "ctx": "\n\n".join(f"[{d.metadata['source_id']}] {d.page_content}"
                           for d in state["documents"]),
    }).content
    return {**state, "answer": answer,
            "trace": state["trace"] + [{"node": "generate"}]}


def grade_answer(state: AgentState) -> AgentState:
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Two checks: (1) is the answer supported by the docs (grounded)? "
                   "(2) does it actually address the original question?"),
        ("human", "Q: {q}\nDocs:\n{docs}\nAnswer: {a}"),
    ])
    chain = prompt | grader_llm.with_structured_output(AnswerGrade)
    grade = chain.invoke({
        "q": state["question"],
        "docs": "\n---\n".join(d.page_content for d in state["documents"]),
        "a": state["answer"],
    })
    # The verdict is stored in the trace; answer_route reads it from there.
    return {**state, "trace": state["trace"] + [
        {"node": "grade_answer",
         "grounded": grade.grounded,
         "addresses": grade.addresses_question}
    ]}
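# Conditional edges
def docs_route(state: AgentState):
    last = state["trace"][-1]  # the grade_documents entry
    if last["relevant"]:
        return "generate"
    if state["iterations"] >= MAX_ITERATIONS:
        return "fallback"
    return "rewrite"


def answer_route(state: AgentState):
    last = state["trace"][-1]  # the grade_answer entry
    if last["grounded"] and last["addresses"]:
        return "end"
    # `iterations` only advances on rewrites, so count generate attempts from
    # the trace as well; otherwise the regenerate loop is unbounded.
    generations = sum(1 for t in state["trace"] if t["node"] == "generate")
    if state["iterations"] >= MAX_ITERATIONS or generations >= MAX_ITERATIONS:
        return "fallback"
    if not last["grounded"]:
        return "generate"  # try again with same docs
    return "rewrite"


def fallback(state: AgentState) -> AgentState:
    return {**state, "answer": "I do not have sufficient information to answer that."}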
g = StateGraph(AgentState)
g.add_node("retrieve", retrieve)
g.add_node("grade_documents", grade_documents)
g.add_node("rewrite", rewrite_query)
g.add_node("generate", generate)
g.add_node("grade_answer", grade_answer)
g.add_node("fallback", fallback)

g.add_edge(START, "retrieve")
g.add_edge("retrieve", "grade_documents")
g.add_conditional_edges(
    "grade_documents", docs_route,
    {"generate": "generate", "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("rewrite", "retrieve")
g.add_edge("generate", "grade_answer")
g.add_conditional_edges(
    "grade_answer", answer_route,
    {"end": END, "generate": "generate",
     "rewrite": "rewrite", "fallback": "fallback"})
g.add_edge("fallback", END)

app = g.compile()
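Running the compiled graph needs a fully initialised state, and the resulting trace is easiest to scan as a node path. The sketch below shows one way to do both. `initial_state` and `trace_path` are hypothetical helper names, not LangGraph APIs, and the commented-out `app.invoke` call assumes the `app` compiled above plus valid API credentials. `recursion_limit` is LangGraph's own per-run step ceiling passed through the config dict; the value 30 is an arbitrary illustration.

```python
from typing import Any, Dict, List


def initial_state(question: str) -> Dict[str, Any]:
    """Empty starting state with every key the graph nodes read (hypothetical helper)."""
    return {
        "question": question,
        "rewritten_question": "",
        "documents": [],
        "answer": "",
        "iterations": 0,
        "trace": [],
    }


def trace_path(trace: List[Dict[str, Any]]) -> str:
    """Render the trace as a readable node path for post-hoc debugging."""
    return " -> ".join(step["node"] for step in trace)


# Hypothetical usage against the compiled graph (requires API keys):
# final = app.invoke(initial_state("What is the refund window?"),
#                    config={"recursion_limit": 30})
# print(trace_path(final["trace"]))

# Stand-in trace for a question that needed one rewrite.
example_trace = [
    {"node": "retrieve", "k": 12},
    {"node": "grade_documents", "relevant": False, "reason": "off-topic"},
    {"node": "rewrite", "new_q": "refund eligibility period"},
    {"node": "retrieve", "k": 12},
    {"node": "grade_documents", "relevant": True, "reason": "policy chunk"},
    {"node": "generate"},
    {"node": "grade_answer", "grounded": True, "addresses": True},
]
print(trace_path(example_trace))
```

Keeping the hard stop in two places (the iteration counter in the routing functions and the recursion limit at invoke time) means a routing bug surfaces as an error rather than as an unbounded bill.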
A few production details:
- gpt-4o-mini-2024-07-18 runs the two grader nodes for roughly 1/15th the cost of gpt-4o, with measured agreement above 92% with the bigger model on our internal grader-eval set.
- Cap the loop with MAX_ITERATIONS. Three is the sweet spot; four does not noticeably help, and five regularly causes runaway cost on adversarial inputs.
- The trace field is your post-hoc debug primitive. Without it, agentic loops are unobservable.
- For grader verdicts, with_structured_output is the only reliable pattern.

The single-shot RAGAS metrics still apply, but they are no longer enough. You also need to measure the loop itself. Four additional things matter:
| Metric | What it answers | Why it matters |
|---|---|---|
| Mean iterations per question | How often does the loop re-try? | If most questions hit 3 iterations, your retriever or chunking is broken |
| Per-iteration retrieval quality | Does context_precision improve from iter 1 → iter 2? | If not, query rewriting is not helping |
| Tool calls per question | Total node executions (retrieve + grade_docs + rewrite + generate + grade_answer); every node except retrieve is an LLM call | Direct cost driver |
| Iteration delta on terminal answer quality | Does faithfulness on questions that took 2+ iterations beat single-shot on the same questions? | The whole point — does the loop earn its keep? |
The eval pattern, run on the same held-out QA set as single-shot:
def run_with_metrics(question: str, ground_truth: str):
    final = app.invoke({
        "question": question, "rewritten_question": "",
        "documents": [], "answer": "", "iterations": 0, "trace": [],
    })
    # Per-iteration retrieval quality
    retrieval_iters = [t for t in final["trace"] if t["node"] == "retrieve"]
    grade_iters = [t for t in final["trace"] if t["node"] == "grade_documents"]
    return {
        "question": question,
        "answer": final["answer"],
        "iterations": final["iterations"],
        "tool_calls": len(final["trace"]),
        "retrieval_grade_trace": [g["relevant"] for g in grade_iters],
        "ground_truth": ground_truth,
    }

# Then run RAGAS faithfulness + answer_relevancy on the terminal answers
# alongside the loop-specific metrics above.
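The per-question dictionaries returned by `run_with_metrics` still need to be rolled up into the loop-level numbers from the metrics table. A minimal sketch of that aggregation follows. `summarize_runs` and the sample rows are hypothetical, and the field names simply mirror the keys returned above.

```python
from statistics import mean
from typing import Any, Dict, List

MAX_ITERATIONS = 3  # assumed to match the cap used by the graph


def summarize_runs(runs: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate loop metrics over a list of run_with_metrics-style results."""
    if not runs:
        raise ValueError("no runs to summarize")
    # First grader verdict per question: did the initial retrieval pass?
    first_pass = [r["retrieval_grade_trace"][0]
                  for r in runs if r["retrieval_grade_trace"]]
    return {
        "mean_iterations": mean(r["iterations"] for r in runs),
        "mean_tool_calls": mean(r["tool_calls"] for r in runs),
        # Share of questions that exhausted the rewrite budget.
        "hit_cap_rate": mean(
            1.0 if r["iterations"] >= MAX_ITERATIONS else 0.0 for r in runs),
        "first_pass_relevant_rate": (
            mean(1.0 if ok else 0.0 for ok in first_pass) if first_pass else 0.0),
    }


# Made-up sample rows, only to show the shape of the input.
sample_runs = [
    {"iterations": 0, "tool_calls": 4, "retrieval_grade_trace": [True]},
    {"iterations": 1, "tool_calls": 7, "retrieval_grade_trace": [False, True]},
    {"iterations": 3, "tool_calls": 11,
     "retrieval_grade_trace": [False, False, False, False]},
    {"iterations": 0, "tool_calls": 4, "retrieval_grade_trace": [True]},
]
summary = summarize_runs(sample_runs)
print(summary)
```

A rising `hit_cap_rate` points at the retriever or chunking, while a `first_pass_relevant_rate` pinned near 1.0 is worth checking against the grader itself.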
Real numbers from one of our production healthcare FAQ agents, 1,000 questions, May 2026:
| Mode | Avg LLM calls/q | Avg cost/q | Avg latency p50 | Faithfulness | Recall on hard subset |
|---|---|---|---|---|---|
| Single-shot RAG | 1.0 | $0.0021 | 1.4s | 0.93 | 0.71 |
| Agentic RAG (always on) | 3.6 | $0.0078 | 4.9s | 0.95 | 0.88 |
| Adaptive routing | 1.7 | $0.0041 | 2.3s | 0.95 | 0.87 |
Agentic-always is 3.7x the cost and 3.5x the latency of single-shot for a 2-point bump in faithfulness and a 17-point bump in recall on hard questions. On easy questions, the loop adds latency and cost for zero quality gain — graders confirm the first retrieval was fine and the loop exits in one iteration anyway.
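The ratios quoted above follow directly from the table, and the same figures give a rough estimate of how much traffic the adaptive router sends down the agentic path. The sketch below reproduces that arithmetic. The blended-cost model is an assumption, not something the measurements establish: it treats adaptive cost as a pure mix of the two paths and ignores any routing overhead.

```python
# Per-question averages copied from the table above.
MODES = {
    "single_shot": {"cost": 0.0021, "p50_s": 1.4},
    "agentic": {"cost": 0.0078, "p50_s": 4.9},
    "adaptive": {"cost": 0.0041, "p50_s": 2.3},
}


def ratio(metric: str, mode: str, baseline: str = "single_shot") -> float:
    """How many times larger `metric` is for `mode` than for `baseline`."""
    return MODES[mode][metric] / MODES[baseline][metric]


def implied_agentic_share() -> float:
    """Share p of traffic on the agentic path, ASSUMING
    adaptive_cost = (1 - p) * single_shot_cost + p * agentic_cost
    (router overhead ignored, so treat the result as a rough estimate)."""
    single = MODES["single_shot"]["cost"]
    agentic = MODES["agentic"]["cost"]
    adaptive = MODES["adaptive"]["cost"]
    return (adaptive - single) / (agentic - single)


print(f"agentic cost vs single-shot:    {ratio('cost', 'agentic'):.1f}x")
print(f"agentic latency vs single-shot: {ratio('p50_s', 'agentic'):.1f}x")
print(f"adaptive cost vs agentic:       {ratio('cost', 'adaptive', 'agentic'):.2f}x")
print(f"implied agentic share:          {implied_agentic_share():.2f}")
```

Under that assumption the implied share lands near a third of traffic, in the same range as the "remaining 30%" of hard questions described in the takeaways.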
The economic answer is adaptive routing: a small classifier (or a cheap LLM call) decides whether the question is "easy" (likely to succeed single-shot) or "hard" (likely to need the loop). Easy goes to single-shot; hard goes to agentic. On this dataset, adaptive routing holds faithfulness at 0.95 and gives up one point of hard-subset recall (0.87 vs 0.88) at roughly half the cost of agentic-always ($0.0041 vs $0.0078 per question).
A simple-but-effective router: run the question through single-shot, run grade_answer on the result, and if it fails, escalate to the agentic loop. That gives you single-shot cost on easy questions and one extra grader call as the routing tax.
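The escalate-on-failure router described above can be sketched with the three steps passed in as plain callables. Everything named here is hypothetical: `route_with_escalation`, the stub functions, and the simplified `answer_ok(question, answer)` signature are stand-ins. A real grader would also see the retrieved documents, as `grade_answer` does.

```python
from typing import Callable, Tuple


def route_with_escalation(
    question: str,
    single_shot: Callable[[str], str],
    answer_ok: Callable[[str, str], bool],
    agentic: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the cheap path first; escalate only when the grader rejects it.

    Returns (answer, path) so the chosen path can be logged per question.
    """
    answer = single_shot(question)
    if answer_ok(question, answer):
        return answer, "single_shot"
    return agentic(question), "agentic"


# Stubs standing in for the real pipelines and grader.
def fake_single_shot(q: str) -> str:
    return "30 days [doc-1]" if "refund" in q else ""


def fake_answer_ok(q: str, a: str) -> bool:
    return bool(a)  # treat any non-empty answer as acceptable


def fake_agentic(q: str) -> str:
    return "Answer found after query rewrite [doc-7]"


print(route_with_escalation("What is the refund window?",
                            fake_single_shot, fake_answer_ok, fake_agentic))
print(route_with_escalation("Does plan B cover dependents abroad?",
                            fake_single_shot, fake_answer_ok, fake_agentic))
```

Logging the returned path per question gives the escalation rate for free, which is the number to watch when deciding whether the router is too strict or too lenient.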
A grader returned relevant=true on 100% of inputs after a model alias drift; the only reason we noticed was the iteration-count metric flatlining at 1.0.

The right model is: agentic RAG is a tool you reach for after single-shot RAG with proper evals exists and you have measured which subset of your traffic needs it. Building agentic RAG before you have RAGAS in CI is putting the engine in before the chassis.
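The drift incident mentioned above (relevant=true on every input) suggests a cheap alarm on the grader's own pass rate. The following is a minimal sketch under stated assumptions: `grader_looks_stuck` is a hypothetical helper, and both thresholds are illustrative defaults that would need tuning against real traffic.

```python
from typing import Sequence


def grader_looks_stuck(
    verdicts: Sequence[bool],
    min_samples: int = 200,       # assumed: below this, the rate is too noisy
    max_pass_rate: float = 0.99,  # assumed: a healthy grader rejects something
) -> bool:
    """Flag a window of document-grader verdicts that is suspiciously uniform."""
    if len(verdicts) < min_samples:
        return False
    pass_rate = sum(1 for v in verdicts if v) / len(verdicts)
    return pass_rate >= max_pass_rate


healthy_window = [True] * 450 + [False] * 50   # 90% pass rate
stuck_window = [True] * 500                    # grader approves everything
print(grader_looks_stuck(healthy_window), grader_looks_stuck(stuck_window))
```

The same check can be mirrored for the opposite failure (a grader that rejects everything), which shows up as most questions hitting the iteration cap.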
No. Use single-shot for easy questions and adaptive routing to send only hard questions to the loop. Always-on agentic RAG is paying full price for 70% of queries that did not need the upgrade.
ReAct treats search as one tool among many and gives the model open-ended control over when to call it. Agentic RAG is a structured loop with explicit grader nodes and bounded iterations. ReAct is more flexible but harder to evaluate; agentic RAG is more constrained but produces deterministic eval traces.
Cache rewritten queries to embeddings (saves the embed call on retry) and cache grader outputs keyed on (question, doc_ids). On our traffic these two caches together cut iterative-path cost by ~22%.
Yes. Swap gpt-4o-2024-08-06 for claude-sonnet-4-5-20250929 and gpt-4o-mini-2024-07-18 for claude-haiku-4-5-20251001, and replace ChatOpenAI with ChatAnthropic from langchain_anthropic. The graph is identical; LangGraph is provider-neutral.
The trace field plus a hard MAX_ITERATIONS cap. Read the companion piece on the trace-to-fix workflow for the muscle memory. Most non-terminating loops we have seen were graders giving inconsistent verdicts on the same docs — fix the grader prompt or pin a stronger judge model.
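The grader-output cache from the caching answer above, keyed on (question, doc_ids), might look like the sketch below. `GraderCache` and `fake_grader` are hypothetical. Lowercasing the question and sorting the document ids are assumptions about what counts as "the same" grading request, and an in-memory dict stands in for whatever store production would use.

```python
from typing import Callable, Dict, Iterable, List, Tuple

CacheKey = Tuple[str, Tuple[str, ...]]


class GraderCache:
    """Memoise grader verdicts keyed on (normalised question, sorted doc ids)."""

    def __init__(self, grade_fn: Callable[[str, Tuple[str, ...]], bool]) -> None:
        self._grade_fn = grade_fn
        self._store: Dict[CacheKey, bool] = {}
        self.hits = 0
        self.misses = 0

    def grade(self, question: str, doc_ids: Iterable[str]) -> bool:
        ids = tuple(sorted(doc_ids))  # assumption: doc order does not matter
        key = (question.strip().lower(), ids)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = self._grade_fn(question, ids)
        self._store[key] = verdict
        return verdict


calls: List[Tuple[str, Tuple[str, ...]]] = []


def fake_grader(question: str, doc_ids: Tuple[str, ...]) -> bool:
    """Stub for the LLM grader call; records each invocation."""
    calls.append((question, doc_ids))
    return "refund" in question.lower()


cache = GraderCache(fake_grader)
cache.grade("What is the refund window?", ["d2", "d1"])
cache.grade("what is the refund window? ", ["d1", "d2"])  # same key after normalisation
print(cache.hits, cache.misses, len(calls))
```

If the grader prompt or model changes, cached verdicts go stale, so including a prompt or model version in the key is a sensible extension.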
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.