By Sagar Shankaran, Founder of CallSphere
Build a production RAG agent with LangChain, then measure faithfulness, answer relevance, and context precision with RAGAS. The four metrics that matter and how to wire them up.
Key takeaways
Most "production RAG" systems I review in 2026 are still graded by vibes. The team chunks some PDFs, drops them into a vector store, hooks up a retriever in LangChain, and ships when answers "look right." Then a quarter later they discover the bot has been quietly hallucinating policy numbers and citing the wrong section of the formulary, and nobody can tell whether it has been getting worse because nobody has ever measured it. The fix is not exotic. Build the RAG agent properly with the modern LangChain LCEL pattern, pin your embedding and chat models, and wire ragas.evaluate() into your CI on a held-out QA set. Four metrics carry almost all the diagnostic weight: faithfulness, answer_relevancy, context_precision, and context_recall. This post shows the exact LCEL chain, the exact RAGAS call, the metric-to-failure-mode mapping, and the tradeoffs we hit running this on FAQ agents in healthcare and finance at CallSphere.
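When a domain expert tells you "the answer is wrong," there are at least four very different bugs hiding behind that sentence:
- The retriever missed the right chunk entirely.
- The retriever pulled noise alongside the signal.
- The LLM hallucinated on top of correct retrieval.
- The answer is true but does not address the question.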
Patching the system prompt fixes none of these reliably because you do not yet know which one you have. Each maps cleanly to a different RAGAS metric, and that is the entire point of structured eval — turning "the answer is wrong" into a measurement that points at the right layer.
Models drift. Pin everything that influences output and write the version into git. Our 2026 default for general-purpose RAG is:
| Layer | Choice | Why |
|---|---|---|
| Embeddings | text-embedding-3-large (3072-d) | Best cost/quality on retrieval benchmarks; supports dimensions param |
| Chat model | gpt-4o-2024-08-06 | Pinned date stamp; floating gpt-4o will silently change |
| Vector store | pgvector on Postgres 16 with HNSW | Same DB as app data; no extra infra |
| Reranker (optional) | cohere-rerank-3.5 | +6–9 points on context_precision when chunk count > 8 |
| Eval framework | RAGAS 0.2.x | Native LangChain integration, four core metrics stable |
| Orchestration | LangChain LCEL + LangGraph 0.2 | LCEL for the chain, LangGraph for any iterative flow |
If you are running on Anthropic, swap the chat model for claude-sonnet-4-5-20250929; the rest stands. If you are on Bedrock, swap embeddings for amazon.titan-embed-text-v2 and accept a 4–7 point drop on retrieval recall in exchange for staying in-VPC.
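One way to make the pinning concrete is to keep every output-affecting choice in a single immutable object and commit a fingerprint of it next to each eval run. The sketch below is an assumption about how that could look, not CallSphere's actual config: `PinnedStack` and `fingerprint` are hypothetical names, and the values simply mirror the ones used in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PinnedStack:
    """Hypothetical container: every setting that can change model output."""
    embed_model: str = "text-embedding-3-large"
    embed_dimensions: int = 1536
    chat_model: str = "gpt-4o-2024-08-06"
    temperature: float = 0.0
    retrieve_k: int = 12
    rerank_top_n: int = 4

    def fingerprint(self) -> str:
        """Stable short hash of the config, suitable for tagging eval runs."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


stack = PinnedStack()
floating = PinnedStack(chat_model="gpt-4o")  # unpinned alias, for contrast
print(stack.fingerprint(), floating.fingerprint())
```

Because the hash covers every field, any change to a model name, k, or temperature produces a different fingerprint, so two eval runs with different fingerprints are not directly comparable.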
Most RAG quality issues are chunking issues wearing a costume. Defaults that hold up across the production deployments I have reviewed:
- Use tiktoken for the same tokenizer family as your embedding model.
- Store source_id and section_path in metadata. Without them, faithfulness debugging is guesswork and citation rendering is impossible.

The modern pattern is small, composable, and explicit about every step. No agent abstractions, no implicit memory, no surprises.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_cohere import CohereRerank

EMBED_MODEL = "text-embedding-3-large"
CHAT_MODEL = "gpt-4o-2024-08-06"

# dimensions=1536 shortens the model's native 3072-d vectors
embeddings = OpenAIEmbeddings(model=EMBED_MODEL, dimensions=1536)
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="healthcare_faq_v3",
    connection="postgresql+psycopg://...",
    use_jsonb=True,
)

# Retrieve 12, rerank to 4: this single change moves context_precision the most
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 12})
# "rerank-v3.5" is the Cohere API model ID for Rerank 3.5
reranker = CohereRerank(model="rerank-v3.5", top_n=4)

def retrieve_and_rerank(question: str):
    docs = base_retriever.invoke(question)
    return reranker.compress_documents(docs, question)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a healthcare FAQ assistant. Answer ONLY using the provided context. "
     "If the context does not contain the answer, say 'I do not have that information.' "
     "Cite source_id for every factual claim in square brackets."),
    ("human",
     "Question: {question}\n\nContext:\n{context}"),
])

llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)

def format_context(docs):
    return "\n\n".join(
        f"[{d.metadata['source_id']}] {d.page_content}" for d in docs
    )

rag_chain = (
    RunnableParallel({
        # The first step must be a Runnable: piping two plain functions
        # with | raises TypeError before LCEL can coerce them.
        "context": RunnableLambda(lambda x: x["question"]) | retrieve_and_rerank | format_context,
        "question": (lambda x: x["question"]),
    })
    | prompt
    | llm
    | StrOutputParser()
)
A few notes that matter in production:
- temperature=0 is non-negotiable for FAQ-style RAG. Save creativity for ideation agents.
- The format_context function inlines source_id so the model can cite. Without inline citations, faithfulness scoring is harder and end-user trust is lower.
- RunnableParallel is what makes LCEL fast: retrieval and the (trivial) question pass-through compose cleanly without await spaghetti.

The whole point of building the chain cleanly is that you can now measure it. RAGAS 0.2 takes a dataset of {question, answer, contexts, ground_truth} rows and returns numeric metrics per row plus an aggregate.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_similarity,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Build the eval dataset by running the chain on a held-out QA set
def build_eval_row(q: str, gt: str):
    docs = retrieve_and_rerank(q)
    answer = rag_chain.invoke({"question": q})
    return {
        "question": q,
        "answer": answer,
        "contexts": [d.page_content for d in docs],
        "ground_truth": gt,
    }

qa_pairs = load_held_out_set()  # ~150 human-curated rows
rows = [build_eval_row(q, gt) for q, gt in qa_pairs]
ds = Dataset.from_list(rows)

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0))
judge_emb = LangchainEmbeddingsWrapper(embeddings)

result = evaluate(
    dataset=ds,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_similarity,
    ],
    llm=judge_llm,
    embeddings=judge_emb,
    raise_exceptions=False,
)
print(result)

df = result.to_pandas()
df.to_parquet("eval_runs/2026-05-06.parquet")
flowchart LR
A[Source docs] --> B[Chunk + embed]
B --> C[pgvector index]
D[User question] --> E[Retrieve k=12]
C --> E
E --> F[Rerank to top 4]
F --> G[Format context + prompt]
G --> H[gpt-4o-2024-08-06]
H --> I[Answer + citations]
I --> J["RAGAS evaluate()"]
D --> J
F --> J
K[Ground truth] --> J
J --> L{Metrics}
L --> M[faithfulness]
L --> N[answer_relevancy]
L --> O[context_precision]
L --> P[context_recall]
style J fill:#ffd
style L fill:#cfc
Figure 1 — The full pipeline. Indexing happens once per doc revision; retrieval, generation, and eval happen per question. RAGAS observes three of the four arrows it needs without modifying the chain.
Here is the metric-to-failure-mode mapping I stand behind. These are the four bugs from the introduction, each with the metric that catches it and the layer where you actually fix it.
| Metric | Range | What it measures | Catches | Fix layer |
|---|---|---|---|---|
| faithfulness | 0–1 | Are the answer's claims supported by the retrieved context? | LLM hallucination on top of correct retrieval | Prompt + chat model |
| answer_relevancy | 0–1 | Does the answer actually address the question? | True-but-irrelevant answers | Prompt + question rewriting |
| context_precision | 0–1 | Of the retrieved chunks, how many are actually relevant? | Retriever pulls noise alongside signal | Retriever k, rerank, chunk size |
| context_recall | 0–1 | Did we retrieve all chunks needed to answer? | Retriever misses the right chunk entirely | Embedding model, chunking, query expansion |
| answer_similarity | 0–1 | Semantic similarity to ground truth | Phrasing drift, regression vs. previous version | Use as canary, not gate |
A typical production target floor on a healthy FAQ corpus: faithfulness ≥ 0.92, answer_relevancy ≥ 0.88, context_precision ≥ 0.75, context_recall ≥ 0.85. Numbers below those should fail CI for any PR that touches the chain, the prompt, or the index.
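The floors above translate directly into a small check. This is a minimal sketch under stated assumptions: `failing_floors` is a hypothetical helper, the input is a plain dict of aggregate scores, and the example scores are invented. A missing or NaN score counts as a failure, on the assumption that a metric which errored out during evaluation should not pass silently.

```python
import math

FLOORS = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.75,
    "context_recall": 0.85,
}


def failing_floors(aggregate, floors=FLOORS):
    """Return {metric: shortfall} for every metric below its floor."""
    failures = {}
    for metric, floor in floors.items():
        score = aggregate.get(metric)
        missing = score is None or math.isnan(score)
        if missing or score < floor:
            # A missing score is reported as a shortfall of the whole floor.
            failures[metric] = floor if missing else round(floor - score, 4)
    return failures


# Invented aggregate scores, for illustration only.
run = {"faithfulness": 0.95, "answer_relevancy": 0.90,
       "context_precision": 0.71, "context_recall": 0.86}
print(failing_floors(run))  # {'context_precision': 0.04}
```

A CI script could exit non-zero whenever the returned dict is non-empty.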
The most common failure pattern I see in first eval runs is high faithfulness but low context_recall. Translation: the model is being honest with the chunks it sees, but the retriever is missing the relevant chunk a quarter of the time. The fix is never "tweak the prompt." It is upstream: switch to a stronger embedding model, raise k, add a reranker, or fix chunking.
The second most common: high context_precision but low faithfulness. The retriever is doing its job, the LLM is not. Almost always the prompt is too permissive ("answer using context where possible" instead of "answer ONLY using context"). Tighten the prompt, drop temperature to 0, and consider switching to a stronger reasoning model.
The third pattern: everything looks great in aggregate, but there is a long tail of zero-faithfulness rows. Inspect those rows by hand. They are usually questions whose answer requires combining information across two non-adjacent chunks, and the model picked one chunk's framing while contradicting the other. The fix is structural: multi-hop or agentic RAG (covered in the companion piece on agentic RAG with LangGraph).
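The three patterns above can be turned into a rough triage function over per-row scores. Treat this as a sketch, not a rule set from RAGAS: `diagnose` is a hypothetical helper, the `low` and `high` cutoffs are assumptions chosen for illustration, and the rows are invented.

```python
from statistics import mean

METRICS = ("faithfulness", "context_precision", "context_recall")


def diagnose(rows, low=0.8, high=0.9):
    """Return plain-text findings for the three patterns described above."""
    avg = {m: mean(r[m] for r in rows) for m in METRICS}
    findings = []
    # Pattern 1: honest model, but the retriever misses the needed chunk.
    if avg["faithfulness"] >= high and avg["context_recall"] < low:
        findings.append("retrieval miss: fix embeddings, k, reranking, or chunking")
    # Pattern 2: clean retrieval, but the answer goes beyond the context.
    if avg["context_precision"] >= high and avg["faithfulness"] < low:
        findings.append("unsupported claims: tighten the prompt or change the chat model")
    # Pattern 3: a healthy mean hiding individual zero-faithfulness rows.
    zero_rows = [i for i, r in enumerate(rows) if r["faithfulness"] == 0.0]
    if avg["faithfulness"] >= high and zero_rows:
        findings.append(f"long tail: inspect rows {zero_rows} by hand")
    return findings


# Invented scores: 19 well-grounded rows plus one row with zero faithfulness.
rows = [{"faithfulness": 1.0, "context_precision": 0.8, "context_recall": 0.7}
        for _ in range(19)]
rows.append({"faithfulness": 0.0, "context_precision": 0.8, "context_recall": 0.7})
for finding in diagnose(rows):
    print(finding)
```

Here the mean faithfulness is 0.95, so the aggregate looks fine, yet the function still surfaces both the weak recall and the single zero-faithfulness row.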
The eval is only as good as the gate. We run RAGAS as a GitHub Actions job on every PR that touches the RAG chain, the index config, the prompt, or the eval dataset itself.
- name: Run RAGAS eval
  run: python scripts/run_ragas.py --dataset held_out_v4 --out result.json
- name: Compare to baseline
  run: |
    python scripts/compare_metrics.py \
      --new result.json \
      --baseline s3://evals/main/latest.json \
      --thresholds faithfulness=0.92,answer_relevancy=0.88,context_precision=0.75,context_recall=0.85 \
      --max_drop 0.02
The --max_drop flag is what prevents slow rot. Even if a PR is above the absolute floor, dropping any metric by more than 2 points vs. the main baseline fails the gate. This is the single rule that has saved us from "death by a thousand 0.5-point regressions" more than once.
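The comparison script itself is not shown in this post, so the following is only a guess at the core of the max-drop rule: `regressions` is a hypothetical function, and the scores are invented. It flags any metric that fell by more than `max_drop` relative to the baseline, independent of the absolute floors.

```python
def regressions(new, baseline, max_drop=0.02):
    """Return {metric: drop} for metrics that fell more than max_drop."""
    flagged = {}
    for metric, base in baseline.items():
        # A metric absent from the new run is treated as a score of 0.0.
        drop = base - new.get(metric, 0.0)
        if drop > max_drop + 1e-9:  # small tolerance for float noise
            flagged[metric] = round(drop, 4)
    return flagged


# Invented scores: faithfulness stays above the 0.92 floor but drops 3 points.
baseline = {"faithfulness": 0.96, "context_recall": 0.88}
new = {"faithfulness": 0.93, "context_recall": 0.87}
print(regressions(new, baseline))  # {'faithfulness': 0.03}
```

In this example a floor-only gate would pass the PR, while the drop check fails it, which is the slow-rot case the flag exists to catch.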
gpt-4o-2024-08-06 and takes ~4 minutes. CI runs it on every relevant PR.
The whole loop costs us under $400/month in eval inference for an FAQ agent that handles tens of thousands of queries. The first time we shipped a "small" prompt change without it, we silently dropped faithfulness from 0.94 to 0.81 and did not notice for two weeks. That is the bill that matters.
We see useful signal at 80 rows, sharp gates at 150, and diminishing returns past 400. Below 80 the LLM-judge variance overwhelms real differences. Stratify by topic (every major doc category gets at least 5 rows) before you scale up.
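A quick coverage check can confirm the size and stratification targets before spending money on a run. This sketch assumes each eval row carries a `topic` label, which the post does not specify; `coverage_gaps` is a hypothetical helper and the row counts are invented.

```python
from collections import Counter


def coverage_gaps(rows, topics, min_rows=80, min_per_topic=5):
    """Report how far an eval set is from the size and per-topic targets."""
    counts = Counter(row["topic"] for row in rows)
    return {
        "rows_missing": max(0, min_rows - len(rows)),
        # Counter returns 0 for topics with no rows at all.
        "thin_topics": {t: counts[t] for t in topics if counts[t] < min_per_topic},
    }


# Invented example: 81 rows in total, but two categories are under-covered.
rows = ([{"topic": "billing"}] * 70
        + [{"topic": "coverage"}] * 8
        + [{"topic": "pharmacy"}] * 3)
topics = ["billing", "coverage", "pharmacy", "appointments"]
print(coverage_gaps(rows, topics))
```

Passing the expected topic list explicitly matters: a category with zero rows would otherwise never show up in the counts.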
faithfulness and answer_relevancy work without ground truth. context_recall and answer_similarity need it, and so does RAGAS's default context_precision, which judges the retrieved chunks against the reference answer. We always invest in the ground truth: without it, you cannot detect retrieval misses, which is the most common failure mode.
Yes, but pin the judge LLM to a model that handles your target language well. We use gpt-4o-2024-08-06 for English, Spanish, and French; for Hindi and Arabic, we evaluate on a translated subset and accept the loss in directness.
Run RAGAS in two tiers: a 50-row smoke set on every PR (~$0.30, ~90s) and the full 150–400-row set on PRs that touch the chain or weekly on main. CallSphere healthcare FAQ agents follow exactly this pattern.
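The post does not say how the 50 smoke rows are chosen. One option, sketched here as an assumption, is to pick them deterministically by hashing the question text, so the smoke set stays the same from run to run and scores remain comparable. `smoke_subset` is a hypothetical helper and the questions are placeholders.

```python
import hashlib


def smoke_subset(questions, size=50):
    """Pick a stable subset: order by a hash of the text, keep the first `size`."""
    def stable_key(question):
        return hashlib.sha256(question.encode("utf-8")).hexdigest()
    return sorted(questions, key=stable_key)[:size]


# Placeholder questions standing in for a 150-row held-out set.
questions = [f"question {i}" for i in range(150)]
smoke = smoke_subset(questions)
print(len(smoke))  # 50
```

Because the order depends only on the text, shuffling the source file does not change which rows land in the smoke tier.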
Long-document QA (legal contracts, full medical records) where the relevant span is hundreds of tokens deep into one chunk. RAGAS metrics still work, but you will need to pair them with a span-level evaluator. That is when you graduate to the agentic RAG patterns in the follow-up post.