Production RAG Agents with LangChain and RAGAS Evaluation in 2026

TL;DR

Most "production RAG" systems I review in 2026 are still graded by vibes. The team chunks some PDFs, drops them into a vector store, hooks up a retriever in LangChain, and ships when answers "look right." Then a quarter later they discover the bot has been quietly hallucinating policy numbers and citing the wrong section of the formulary, and nobody can tell whether it has been getting worse because nobody has ever measured it. The fix is not exotic. Build the RAG agent properly with the modern LangChain LCEL pattern, pin your embedding and chat models, and wire ragas.evaluate() into your CI on a held-out QA set. Four metrics carry almost all the diagnostic weight: faithfulness, answer_relevancy, context_precision, and context_recall. This post shows the exact LCEL chain, the exact RAGAS call, the metric-to-failure-mode mapping, and the tradeoffs we hit running this on FAQ agents in healthcare and finance at CallSphere.

Why Generic RAG Quality Conversations Go Nowhere

When a domain expert tells you "the answer is wrong," there are at least four very different bugs hiding behind that sentence:

The retriever pulled the right chunk but the LLM ignored it.
The retriever pulled a related-but-wrong chunk; the LLM faithfully restated the wrong source.
The retriever pulled nothing relevant; the LLM made it up.
The retriever pulled the right chunk; the LLM stated something true but unrelated to the question.

Patching the system prompt fixes none of these reliably because you do not yet know which one you have. Each maps cleanly to a different RAGAS metric, and that is the entire point of structured eval — turning "the answer is wrong" into a measurement that points at the right layer.

The Production Stack We Pin

Models drift. Pin everything that influences output and write the version into git. Our 2026 default for general-purpose RAG is:

Layer	Choice	Why
Embeddings	`text-embedding-3-large` (3072-d)	Best cost/quality on retrieval benchmarks; supports dimensions param
Chat model	`gpt-4o-2024-08-06`	Pinned date stamp; floating `gpt-4o` will silently change
Vector store	pgvector on Postgres 16 with HNSW	Same DB as app data; no extra infra
Reranker (optional)	Cohere Rerank 3.5 (`rerank-v3.5`)	+6–9 points on context_precision when chunk count > 8
Eval framework	RAGAS 0.2.x	Native LangChain integration, four core metrics stable
Orchestration	LangChain LCEL + LangGraph 0.2	LCEL for the chain, LangGraph for any iterative flow

If you are running on Anthropic, swap the chat model for claude-sonnet-4-5-20250929; the rest stands. If you are on Bedrock, swap embeddings for amazon.titan-embed-text-v2:0 and accept a 4–7 point drop on retrieval recall in exchange for staying in-VPC.
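
One way to make "pin everything and write the version into git" concrete is a single frozen config object whose fingerprint is stored next to every eval run. The following is a minimal sketch, not our actual code: the class name, field names, and the fingerprint helper are illustrative assumptions, and the values mirror the chain shown later in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagStackConfig:
    """Hypothetical container for every version that influences output."""
    embed_model: str = "text-embedding-3-large"
    embed_dimensions: int = 1536
    chat_model: str = "gpt-4o-2024-08-06"
    rerank_model: str = "rerank-v3.5"  # Cohere's API id for Rerank 3.5
    collection_name: str = "healthcare_faq_v3"
    retrieve_k: int = 12
    rerank_top_n: int = 4

    def fingerprint(self) -> str:
        """Stable short hash of the config, suitable for tagging eval runs."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


config = RagStackConfig()
print(config.fingerprint())
```

Because the dataclass is frozen and the hash is computed over sorted keys, any change to a model name or retrieval parameter produces a different fingerprint, which makes it easy to tell two eval runs apart in a diff.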

Chunking Strategy: The Boring Decision That Determines Everything

Most RAG quality issues are chunking issues wearing a costume. Defaults that hold up across the production deployments I have reviewed:

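Token-based, not character-based. Use tiktoken with the encoding that matches your embedding model (cl100k_base for text-embedding-3-large) so chunk sizes are measured in the tokens the model actually sees.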
512 tokens with 64-token overlap for FAQ and policy docs.
256 tokens with 32-token overlap for highly structured reference material (drug formularies, code documentation).
Always preserve a stable source_id and section_path in metadata. Without it, faithfulness debugging is guesswork and citation rendering is impossible.
Never chunk across semantic boundaries if you can avoid it. Splitting mid-paragraph crushes context_precision because the relevant span is now half-in, half-out of two chunks.
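
The windowing and metadata defaults above can be sketched roughly as follows. This is an illustration under assumptions: the Chunk class and both function names are hypothetical, a whitespace tokenizer stands in for tiktoken so the block runs without extra packages, and the sketch does not attempt the semantic-boundary rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Chunk:
    text: str
    source_id: str      # stable document id, used later for citations
    section_path: str   # e.g. "billing/refunds"
    start_token: int    # token offset of the chunk within the document


def window_tokens(tokens: Sequence, size: int = 512,
                  overlap: int = 64) -> List[Tuple[int, Sequence]]:
    """Slide a fixed-size window over a token sequence with the given overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return windows


def chunk_text(text: str, encode: Callable, decode: Callable,
               source_id: str, section_path: str,
               size: int = 512, overlap: int = 64) -> List[Chunk]:
    """Tokenize, window, and decode back to text while keeping metadata."""
    tokens = encode(text)
    return [
        Chunk(decode(window), source_id, section_path, start)
        for start, window in window_tokens(tokens, size, overlap)
    ]


# Stand-in tokenizer (whitespace split). With tiktoken installed you would pass
# enc.encode and enc.decode, where enc = tiktoken.get_encoding("cl100k_base").
demo = chunk_text("a b c d e f", str.split, " ".join,
                  "faq-001", "billing/refunds", size=4, overlap=2)
for chunk in demo:
    print(chunk.start_token, repr(chunk.text))
```

Keeping the windowing separate from the tokenizer means the same function serves both the 512/64 and 256/32 settings, and the start_token offset gives you something stable to debug against when a citation points at the wrong span.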

The LangChain LCEL Chain

The modern pattern is small, composable, and explicit about every step. No agent abstractions, no implicit memory, no surprises.

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_cohere import CohereRerank

EMBED_MODEL = "text-embedding-3-large"
CHAT_MODEL = "gpt-4o-2024-08-06"

embeddings = OpenAIEmbeddings(model=EMBED_MODEL, dimensions=1536)

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="healthcare_faq_v3",
    connection="postgresql+psycopg://...",
    use_jsonb=True,
)

# Retrieve 12, rerank to 4 — this single change moves context_precision the most
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 12})
reranker = CohereRerank(model="rerank-v3.5", top_n=4)

def retrieve_and_rerank(question: str):
    docs = base_retriever.invoke(question)
    return reranker.compress_documents(docs, question)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a healthcare FAQ assistant. Answer ONLY using the provided context. "
     "If the context does not contain the answer, say 'I do not have that information.' "
     "Cite source_id for every factual claim in square brackets."),
    ("human",
     "Question: {question}\n\nContext:\n{context}"),
])

llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)

def format_context(docs):
    return "\n\n".join(
        f"[{d.metadata['source_id']}] {d.page_content}" for d in docs
    )

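def build_context(inputs: dict) -> str:
    # Two plain Python functions cannot be joined with "|", so retrieve, rerank,
    # and format inside one callable; RunnableParallel coerces it to a Runnable.
    docs = retrieve_and_rerank(inputs["question"])
    return format_context(docs)

rag_chain = (
    RunnableParallel({
        "context": build_context,
        "question": lambda x: x["question"],
    })
    | prompt
    | llm
    | StrOutputParser()
)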

A few notes that matter in production:

temperature=0 is non-negotiable for FAQ-style RAG. Save creativity for ideation agents.
The format_context function inlines source_id so the model can cite. Without inline citations, faithfulness scoring is harder and end-user trust is lower.
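RunnableParallel runs its branches concurrently and returns a dict. Here one branch is a trivial question pass-through, so the gain is mostly structural: retrieval and the pass-through compose cleanly without await spaghetti. The speedup shows up once more than one branch does real work.
Pinning embedding dimensions to 1536 (instead of the full 3072) cuts pgvector index size in half with a measured 1.2-point recall drop on our internal eval. It also keeps the vectors under pgvector's 2,000-dimension limit for HNSW indexes on the vector type. Worth it for the storage and latency savings at our scale.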
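
Because the prompt asks for source_id citations in square brackets, a cheap post-check can confirm that every cited id was actually in the retrieved context. The sketch below is an assumption about how one might do that; the function names are hypothetical, and it treats any bracketed text as a citation, which will over-flag answers that use brackets for other purposes.

```python
import re
from typing import Iterable, Set

# Matches the contents of [ ... ] with no nested brackets.
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def cited_source_ids(answer: str) -> Set[str]:
    """Collect every bracketed id the model cited in its answer."""
    return {match.strip() for match in CITATION_PATTERN.findall(answer)}


def unknown_citations(answer: str, retrieved_ids: Iterable[str]) -> Set[str]:
    """Return cited ids that were not among the retrieved chunks."""
    return cited_source_ids(answer) - set(retrieved_ids)


answer = (
    "Prior authorization is required [policy-12]. "
    "Refills need an office visit [policy-99]."
)
retrieved = ["policy-12", "policy-07"]
print(unknown_citations(answer, retrieved))
```

A non-empty result is a strong hint of a fabricated citation and is worth logging even before the full RAGAS run scores the row.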

The RAGAS Evaluation Loop

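The whole point of building the chain cleanly is that you can now measure it. RAGAS 0.2 takes a dataset of {question, answer, contexts, ground_truth} rows and returns numeric metrics per row plus an aggregate. (RAGAS 0.2 names these fields user_input, response, retrieved_contexts, and reference; the older column names used here are still accepted on a Hugging Face Dataset in 0.2.x and remapped internally.)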

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_similarity,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

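# Build the eval dataset by running the chain on a held-out QA set.
# Note: retrieval runs twice per row (once here, once inside rag_chain), so the
# contexts scored below match what the model saw only if retrieval is deterministic.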
def build_eval_row(q: str, gt: str):
    docs = retrieve_and_rerank(q)
    answer = rag_chain.invoke({"question": q})
    return {
        "question": q,
        "answer": answer,
        "contexts": [d.page_content for d in docs],
        "ground_truth": gt,
    }

qa_pairs = load_held_out_set()  # ~150 human-curated rows
rows = [build_eval_row(q, gt) for q, gt in qa_pairs]
ds = Dataset.from_list(rows)

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0))
judge_emb = LangchainEmbeddingsWrapper(embeddings)

result = evaluate(
    dataset=ds,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_similarity,
    ],
    llm=judge_llm,
    embeddings=judge_emb,
    raise_exceptions=False,
)

print(result)
df = result.to_pandas()
df.to_parquet("eval_runs/2026-05-06.parquet")

flowchart LR
  A[Source docs] --> B[Chunk + embed]
  B --> C[pgvector index]
  D[User question] --> E[Retrieve k=12]
  C --> E
  E --> F[Rerank to top 4]
  F --> G[Format context + prompt]
  G --> H[gpt-4o-2024-08-06]
  H --> I[Answer + citations]
  I --> J["RAGAS evaluate()"]
  D --> J
  F --> J
  K[Ground truth] --> J
  J --> L{Metrics}
  L --> M[faithfulness]
  L --> N[answer_relevancy]
  L --> O[context_precision]
  L --> P[context_recall]
  style J fill:#ffd
  style L fill:#cfc

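Figure 1: The full pipeline. Indexing happens once per doc revision; retrieval, generation, and eval happen per question. RAGAS reads three of its four inputs (question, retrieved contexts, answer) straight from the pipeline without modifying the chain; the fourth, ground truth, comes from the held-out set.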

The Four Metrics That Carry the Diagnostic Weight

Here is the metric-to-failure-mode mapping I stand behind. The first four rows are the four bugs from the introduction, each with the metric that catches it and the layer where you actually fix it; answer_similarity is a fifth, supporting metric.

Metric	Range	What it measures	Catches	Fix layer
`faithfulness`	0–1	Are the answer's claims supported by the retrieved context?	LLM hallucination on top of correct retrieval	Prompt + chat model
`answer_relevancy`	0–1	Does the answer actually address the question?	True-but-irrelevant answers	Prompt + question rewriting
`context_precision`	0–1	Of the retrieved chunks, how many are actually relevant, and are they ranked first?	Retriever pulls noise alongside signal	Retriever k, rerank, chunk size
`context_recall`	0–1	Did we retrieve all chunks needed to answer?	Retriever misses the right chunk entirely	Embedding model, chunking, query expansion
`answer_similarity`	0–1	Semantic similarity to ground truth	Phrasing drift, regression vs. previous version	Use as canary, not gate

A typical production target floor on a healthy FAQ corpus: faithfulness ≥ 0.92, answer_relevancy ≥ 0.88, context_precision ≥ 0.75, context_recall ≥ 0.85. Numbers below those should fail CI for any PR that touches the chain, the prompt, or the index.
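
To make the retrieval metrics less abstract, here is the arithmetic behind context_precision as the RAGAS documentation describes it: the mean of precision@k over the ranks that hold a relevant chunk. In a real run the relevance verdicts come from the LLM judge; the verdict lists below are hand-written assumptions, and the function names are illustrative.

```python
from typing import Sequence


def context_precision_at_k(verdicts: Sequence[int]) -> float:
    """verdicts[i] is 1 if the chunk at rank i+1 was judged relevant, else 0."""
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


def supported_ratio(supported: int, total: int) -> float:
    """Shared shape of faithfulness (answer claims supported by context)
    and context_recall (reference claims attributable to context)."""
    return supported / total if total else 0.0


print(round(context_precision_at_k([1, 0, 1, 0]), 4))  # relevant at ranks 1 and 3
print(round(context_precision_at_k([0, 1, 1, 0]), 4))  # relevant at ranks 2 and 3
print(supported_ratio(3, 4))                           # 3 of 4 claims supported
```

With the same two relevant chunks, the score falls from about 0.83 to about 0.58 when they slip from ranks 1 and 3 to ranks 2 and 3, which is why a reranker moves this metric even when it retrieves nothing new.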

Reading the Results

The most common failure pattern I see in first eval runs is high faithfulness but low context_recall. Translation: the model is being honest with the chunks it sees, but the retriever is missing the relevant chunk a quarter of the time. The fix is never "tweak the prompt." It is upstream: switch to a stronger embedding model, raise the base retriever's k so the right chunk reaches the reranker, add query expansion, or fix chunking. A reranker on its own only reorders what was retrieved; it cannot recover a chunk the base retriever never returned.

The second most common: high context_precision but low faithfulness. The retriever is doing its job, the LLM is not. Almost always the prompt is too permissive ("answer using context where possible" instead of "answer ONLY using context"). Tighten the prompt, drop temperature to 0, and consider switching to a stronger reasoning model.

The third pattern: everything looks great in aggregate but a long tail of zero-faithfulness rows. Inspect those rows by hand. They are usually questions whose answer requires combining information across two non-adjacent chunks, and the model picked one chunk's framing while contradicting the other. The fix is structural — multi-hop or agentic RAG (covered in the companion piece on agentic RAG with LangGraph).
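
The three patterns above suggest a simple per-row triage that checks retrieval before generation. The sketch below is one possible way to label rows before inspecting them by hand. It is an assumption, not part of RAGAS: the function name is hypothetical, the aggregate floors are reused as per-row cutoffs purely for illustration, and rows are plain dicts such as those from result.to_pandas().to_dict("records").

```python
import math
from typing import Dict

# Aggregate floors from this post, reused here as per-row cutoffs (assumption).
FLOORS = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.75,
    "context_recall": 0.85,
}


def triage(row: Dict[str, float]) -> str:
    """Label a row with the most upstream layer that looks broken."""
    scores = {metric: row.get(metric) for metric in FLOORS}
    if any(s is None or math.isnan(s) for s in scores.values()):
        return "judge error: re-run this row"
    low = [m for m, floor in FLOORS.items() if scores[m] < floor]
    if not low:
        return "ok"
    if "context_recall" in low:
        return "retrieval miss: check embeddings, chunking, query expansion"
    if "context_precision" in low:
        return "noisy retrieval: check k, reranker, chunk size"
    if "faithfulness" in low:
        return "unsupported answer: check prompt and chat model"
    return "off-topic answer: check prompt and question rewriting"


rows = [
    {"faithfulness": 0.98, "answer_relevancy": 0.93,
     "context_precision": 0.90, "context_recall": 0.50},
    {"faithfulness": 0.00, "answer_relevancy": 0.91,
     "context_precision": 0.88, "context_recall": 0.95},
]
for row in rows:
    print(triage(row))
```

Checking recall and precision first reflects the point made above: when retrieval is the problem, prompt changes will not fix it, so the label should point upstream.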

Wiring RAGAS Into CI as a Real Gate

The eval is only as good as the gate. We run RAGAS as a GitHub Actions job on every PR that touches the RAG chain, the index config, the prompt, or the eval dataset itself.

- name: Run RAGAS eval
  run: python scripts/run_ragas.py --dataset held_out_v4 --out result.json

- name: Compare to baseline
  run: |
    python scripts/compare_metrics.py \
      --new result.json \
      --baseline s3://evals/main/latest.json \
      --thresholds faithfulness=0.92,answer_relevancy=0.88,context_precision=0.75,context_recall=0.85 \
      --max_drop 0.02

The --max_drop flag is what prevents slow rot. Even if a PR is above the absolute floor, dropping any metric by more than 0.02 (2 points on a 0–100 scale) vs. the main baseline fails the gate. This is the single rule that has saved us from "death by a thousand 0.5-point regressions" more than once.
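
The contents of scripts/compare_metrics.py are not shown here, so the following is only a guess at the core logic such a script would need: an absolute floor check plus a maximum-drop check against the baseline. The function name and the small floating-point tolerance are assumptions.

```python
from typing import Dict, List

FLOORS = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.75,
    "context_recall": 0.85,
}


def gate(new: Dict[str, float], baseline: Dict[str, float],
         floors: Dict[str, float], max_drop: float = 0.02) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, floor in floors.items():
        score = new[metric]
        if score < floor:
            failures.append(f"{metric}: {score:.3f} is below floor {floor:.2f}")
        drop = baseline[metric] - score
        # Small tolerance so float rounding at exactly max_drop does not trip the gate.
        if drop > max_drop + 1e-9:
            failures.append(f"{metric}: dropped {drop:.3f} vs baseline")
    return failures


baseline = {"faithfulness": 0.94, "answer_relevancy": 0.91,
            "context_precision": 0.80, "context_recall": 0.87}
new = {"faithfulness": 0.93, "answer_relevancy": 0.90,
       "context_precision": 0.74, "context_recall": 0.86}

failures = gate(new, baseline, FLOORS)
for line in failures:
    print(line)
```

In a CI script the natural last step is to exit non-zero when the list is non-empty, so the job fails and the failure lines show up in the PR log.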

Honest Tradeoffs

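RAGAS is LLM-as-judge for three of the four core metrics (faithfulness, context_precision, context_recall); the fourth, answer_relevancy, still calls the LLM to generate candidate questions and then scores them with embeddings. It is not free. Our 150-row eval costs about $0.85 per run on gpt-4o-2024-08-06 and takes ~4 minutes. CI runs it on every relevant PR.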
Ground truth answers must be written by domain experts. A clinician for healthcare, a licensed financial professional for finance. Engineers writing reference answers is the most common silent quality killer I see.
The eval set drifts. Quarterly review where domain experts walk through the lowest-scoring rows is mandatory. About 8% of our rows turn out to be eval bugs (ambiguous question, outdated ground truth) per quarter.
answer_similarity is a noisy metric. Use it as a canary for "did the answer style change radically" but never as a hard gate.

The whole loop costs us under $400/month in eval inference for an FAQ agent that handles tens of thousands of queries. The first time we shipped a "small" prompt change without it, we silently dropped faithfulness from 0.94 to 0.81 and did not notice for two weeks. That is the bill that matters.

Frequently Asked Questions

How big should the held-out QA set be?

We see useful signal at 80 rows, sharp gates at 150, and diminishing returns past 400. Below 80 the LLM-judge variance overwhelms real differences. Stratify by topic (every major doc category gets at least 5 rows) before you scale up.
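
The row counts above are empirical, but a back-of-envelope calculation shows why small sets are noisy. The sketch below computes the 95% confidence half-width of a mean metric under two loud assumptions: per-row scores have a standard deviation of 0.20 (an illustrative figure, not a measurement from our runs) and rows are independent. It ignores the extra variance from re-running the judge.

```python
import math


def ci_half_width(std_dev: float, n_rows: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * std_dev / math.sqrt(n_rows)


ASSUMED_STD_DEV = 0.20  # assumption for illustration only

for n_rows in (80, 150, 400):
    print(n_rows, round(ci_half_width(ASSUMED_STD_DEV, n_rows), 3))
```

The half-width shrinks with the square root of the row count, so going from 80 to 400 rows cuts the noise by a bit more than half. Comparing the same rows before and after a change (a paired comparison) is usually tighter than this unpaired bound, which is part of why a fixed held-out set works as a gate.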

Can I use RAGAS without ground truth?

faithfulness and answer_relevancy work without ground truth. context_recall and answer_similarity need it, and so does context_precision in its default, reference-based form. We always invest in the ground truth. Without it, you cannot detect retrieval misses, which is the most common failure mode.

Does RAGAS work for multilingual RAG?

Yes, but pin the judge LLM to a model that handles your target language well. We use gpt-4o-2024-08-06 for English, Spanish, and French; for Hindi and Arabic, we evaluate on a translated subset and accept the loss in directness.

What about cost at scale?

Run RAGAS in two tiers: a 50-row smoke set on every PR (~$0.30, ~90s) and the full 150–400-row set on PRs that touch the chain or weekly on main. CallSphere healthcare FAQ agents follow exactly this pattern.
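
A smoke set is only useful if it still covers every topic. One way to draw it from the full set is a seeded, stratified sample, sketched below. The function name and the "topic" field on each row are assumptions; if the per-topic minimum alone exceeds the requested total, the function returns the larger set rather than dropping a topic.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(rows: List[Dict], total: int,
                      min_per_topic: int = 5, seed: int = 0) -> List[Dict]:
    """Pick min_per_topic rows from every topic, then fill up to total at random."""
    rng = random.Random(seed)  # fixed seed keeps the smoke set stable across runs
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row["topic"]].append(row)

    picked, leftover = [], []
    for topic in sorted(by_topic):
        members = by_topic[topic][:]
        rng.shuffle(members)
        picked.extend(members[:min_per_topic])
        leftover.extend(members[min_per_topic:])

    rng.shuffle(leftover)
    remaining = max(0, total - len(picked))
    return picked + leftover[:remaining]


full_set = [
    {"topic": topic, "question": f"{topic} question {i}"}
    for topic in ("billing", "coverage", "pharmacy")
    for i in range(40)
]
smoke_set = stratified_sample(full_set, total=50)
print(len(smoke_set))
```

Keeping the seed fixed matters here: if the smoke set changed on every PR, a metric drop could come from the sample rather than from the code change.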

Where does this break down?

Long-document QA (legal contracts, full medical records) where the relevant span is hundreds of tokens deep into one chunk. RAGAS metrics still work, but you will need to pair them with a span-level evaluator. That is when you graduate to the agentic RAG patterns in the follow-up post.
