
Evaluating RAG in Production: Building Automated Quality Monitoring for Retrieval Systems

Learn how to build comprehensive RAG evaluation systems with online metrics, user feedback loops, automated quality scoring, A/B testing, and degradation detection for production retrieval pipelines.

Why Offline Evaluation Is Not Enough

Most teams evaluate their RAG system once during development using a curated test set, declare the results acceptable, and ship to production. Then reality hits. Documents get updated, new content is added, user query patterns shift, and embedding model behavior drifts on edge cases. The system that scored 85% on your test set six weeks ago might be producing incorrect answers 30% of the time today, and nobody knows until users complain.

Production RAG evaluation must be continuous, automated, and multi-dimensional. You need to monitor retrieval quality, generation faithfulness, and user satisfaction — all in real time.

The Four Pillars of RAG Evaluation

1. Retrieval Quality

Are the right documents being retrieved? Measured by context relevance and recall (a minimal recall@k sketch follows this list).


2. Generation Faithfulness

Is the LLM's answer actually supported by the retrieved documents? Measured by groundedness.

3. Answer Correctness

Does the answer actually address the user's question? Measured by answer relevance.

4. User Satisfaction

Do users find the answers helpful? Measured by explicit feedback and behavioral signals.
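
Context relevance is scored by the LLM judge shown later; recall needs labeled data. As a minimal sketch, assuming you maintain a small set of relevant document IDs per test query (the relevant_ids input is an assumption for illustration, not part of any library), recall@k can be computed like this:

def recall_at_k(
    retrieved_ids: list[str],
    relevant_ids: set[str],
    k: int = 5,
) -> float:
    """Fraction of known-relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Example: 2 of the 3 labeled-relevant docs appear in the top 5
print(recall_at_k(["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d5"}))  # ~0.67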


Building an Automated Quality Scorer

from openai import OpenAI
from dataclasses import dataclass
from datetime import datetime
import json

client = OpenAI()

@dataclass
class RAGEvaluation:
    query: str
    retrieved_docs: list[str]
    generated_answer: str
    context_relevance: float
    faithfulness: float
    answer_relevance: float
    timestamp: datetime

def evaluate_context_relevance(
    query: str, documents: list[str]
) -> float:
    """Score how relevant retrieved documents are to the query.
    Returns 0.0 to 1.0."""
    scores = []
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": """Rate the relevance of this document
                to the query on a scale of 0.0 to 1.0.
                Return JSON: {"score": 0.X, "reason": "..."}"""
            }, {
                "role": "user",
                "content": f"Query: {query}\nDocument: {doc}"
            }],
            response_format={"type": "json_object"}
        )
        result = json.loads(
            response.choices[0].message.content
        )
        scores.append(result["score"])

    return sum(scores) / len(scores) if scores else 0.0

def evaluate_faithfulness(
    answer: str, documents: list[str]
) -> float:
    """Score whether the answer is grounded in the documents.
    Returns 0.0 to 1.0."""
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Evaluate if each claim in the answer
            is supported by the provided documents.
            Return JSON:
            {
              "claims": [
                {"claim": "...", "supported": true/false}
              ],
              "faithfulness_score": 0.0-1.0
            }"""
        }, {
            "role": "user",
            "content": (
                f"Documents:\n{context}\n\n"
                f"Answer:\n{answer}"
            )
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["faithfulness_score"]

def evaluate_answer_relevance(
    query: str, answer: str
) -> float:
    """Score whether the answer addresses the question.
    Returns 0.0 to 1.0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate how well the answer addresses
            the user's question on a scale of 0.0 to 1.0.
            Return JSON: {"score": 0.X, "reason": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {query}\nAnswer: {answer}"
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"]
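
A quick smoke test of the three scorers on a single hand-written interaction looks like this (the query, documents, and answer below are illustrative only, and an OpenAI API key must be configured):

# Illustrative smoke test for the three scorers
query = "What is the refund window for annual plans?"
docs = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Monthly plans renew automatically each billing cycle.",
]
answer = "Annual plans are refundable within 30 days of purchase."

evaluation = RAGEvaluation(
    query=query,
    retrieved_docs=docs,
    generated_answer=answer,
    context_relevance=evaluate_context_relevance(query, docs),
    faithfulness=evaluate_faithfulness(answer, docs),
    answer_relevance=evaluate_answer_relevance(query, answer),
    timestamp=datetime.now(),
)
print(evaluation)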

Integrating Evaluation into Your RAG Pipeline

import logging

logger = logging.getLogger("rag_eval")

class MonitoredRAGPipeline:
    def __init__(self, retriever, eval_sample_rate: float = 0.1):
        self.retriever = retriever
        self.sample_rate = eval_sample_rate
        self.evaluations: list[RAGEvaluation] = []

    def answer(self, query: str) -> str:
        """Answer with optional quality evaluation."""
        import random

        # Retrieve and generate as normal
        docs = self.retriever.search(query, k=5)
        doc_texts = [d.page_content for d in docs]
        context = "\n\n".join(doc_texts)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Answer using the provided context."
            }, {
                "role": "user",
                "content": (
                    f"Context:\n{context}"
                    f"\n\nQuestion: {query}"
                )
            }],
        )
        answer = response.choices[0].message.content

        # Evaluate a sample of responses
        if random.random() < self.sample_rate:
            self._async_evaluate(query, doc_texts, answer)

        return answer

    def _async_evaluate(
        self, query: str, docs: list[str], answer: str
    ):
        """Run evaluation asynchronously to avoid
        adding latency to the response."""
        import threading

        def evaluate():
            try:
                eval_result = RAGEvaluation(
                    query=query,
                    retrieved_docs=docs,
                    generated_answer=answer,
                    context_relevance=evaluate_context_relevance(
                        query, docs
                    ),
                    faithfulness=evaluate_faithfulness(
                        answer, docs
                    ),
                    answer_relevance=evaluate_answer_relevance(
                        query, answer
                    ),
                    timestamp=datetime.now(),
                )
                self.evaluations.append(eval_result)
                self._check_degradation(eval_result)
            except Exception as e:
                logger.error(f"Evaluation failed: {e}")

        thread = threading.Thread(target=evaluate)
        thread.start()

    def _check_degradation(self, evaluation: RAGEvaluation):
        """Alert if quality drops below thresholds."""
        thresholds = {
            "context_relevance": 0.6,
            "faithfulness": 0.7,
            "answer_relevance": 0.6,
        }

        for metric, threshold in thresholds.items():
            value = getattr(evaluation, metric)
            if value < threshold:
                logger.warning(
                    f"Quality degradation detected: "
                    f"{metric}={value:.2f} < {threshold} "
                    f"for query: {evaluation.query[:100]}"
                )

Building a Degradation Detection System

Track rolling averages to detect systemic quality drops, not just individual bad answers:

from collections import deque

class DegradationDetector:
    def __init__(self, window_size: int = 100):
        self.window_size = window_size
        self.context_scores = deque(maxlen=window_size)
        self.faith_scores = deque(maxlen=window_size)
        self.relevance_scores = deque(maxlen=window_size)
        self.alert_threshold = 0.1  # 10% drop triggers alert

    def add_evaluation(self, evaluation: RAGEvaluation):
        self.context_scores.append(
            evaluation.context_relevance
        )
        self.faith_scores.append(evaluation.faithfulness)
        self.relevance_scores.append(
            evaluation.answer_relevance
        )

    def check_trends(self) -> list[str]:
        """Compare recent scores to historical baseline."""
        alerts = []
        if len(self.context_scores) < self.window_size:
            return alerts

        for name, scores in [
            ("context_relevance", self.context_scores),
            ("faithfulness", self.faith_scores),
            ("answer_relevance", self.relevance_scores),
        ]:
            scores_list = list(scores)
            midpoint = len(scores_list) // 2
            first_half_avg = (
                sum(scores_list[:midpoint]) / midpoint
            )
            second_half_avg = (
                sum(scores_list[midpoint:])
                / (len(scores_list) - midpoint)
            )

            drop = first_half_avg - second_half_avg
            if drop > self.alert_threshold:
                alerts.append(
                    f"{name} dropped by {drop:.2%}: "
                    f"{first_half_avg:.2f} -> "
                    f"{second_half_avg:.2f}"
                )

        return alerts
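
The detector is not wired into the pipeline above, but the connection is small. One option (a sketch, not part of the original classes) is to feed each completed evaluation into a shared detector and log any alerts, for example from _async_evaluate right after the evaluation is appended:

detector = DegradationDetector(window_size=100)

def record_evaluation(evaluation: RAGEvaluation):
    """Feed a finished evaluation into the rolling window
    and log any systemic-drop alerts."""
    detector.add_evaluation(evaluation)
    for alert in detector.check_trends():
        logger.warning(f"Degradation alert: {alert}")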

Incorporating User Feedback

Automated evaluation catches technical quality issues, but user feedback captures real-world usefulness. Implement thumbs-up/thumbs-down on every response, track which answers get follow-up questions (indicating the first answer was insufficient), and correlate user feedback with automated scores to calibrate your thresholds.

The combination of automated scoring and user signals gives you a complete picture. Automated scoring runs on every sampled response with consistent criteria. User feedback provides ground truth on actual helpfulness. Together, they enable you to detect problems early, diagnose root causes, and continuously improve your RAG system.
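
One lightweight way to capture these signals is a small feedback store keyed by response ID. The sketch below is illustrative only; the FeedbackRecord fields and the agreement calculation are assumptions, not a prescribed design:

from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    response_id: str
    thumbs_up: bool | None = None       # explicit feedback
    had_follow_up: bool = False         # behavioral signal
    automated_faithfulness: float | None = None

class FeedbackStore:
    def __init__(self):
        self.records: dict[str, FeedbackRecord] = {}

    def record_vote(self, response_id: str, thumbs_up: bool):
        rec = self.records.setdefault(
            response_id, FeedbackRecord(response_id)
        )
        rec.thumbs_up = thumbs_up

    def record_follow_up(self, response_id: str):
        rec = self.records.setdefault(
            response_id, FeedbackRecord(response_id)
        )
        rec.had_follow_up = True

    def agreement_rate(self, threshold: float = 0.7) -> float:
        """Fraction of voted responses where the automated
        faithfulness score and the user's thumb agree --
        useful for calibrating alert thresholds."""
        pairs = [
            r for r in self.records.values()
            if r.thumbs_up is not None
            and r.automated_faithfulness is not None
        ]
        if not pairs:
            return 0.0
        agree = sum(
            1 for r in pairs
            if (r.automated_faithfulness >= threshold) == r.thumbs_up
        )
        return agree / len(pairs)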

FAQ

What sample rate should I use for automated evaluation?

Start with 10% of queries. This gives you statistically meaningful data without excessive LLM evaluation costs. For critical applications (medical, financial, legal), increase to 25-50%. You can also evaluate 100% of queries from specific user segments or query categories that are high risk.

How quickly can degradation detection catch a problem?

With a 10% sample rate and 100-query window, you need approximately 1,000 queries before the window fills. At high traffic volumes this happens within hours. For faster detection, increase the sample rate or reduce the window size, accepting more noise in exchange for quicker alerts.

Should I use an LLM judge or fine-tuned classifier for evaluation?

Start with an LLM judge (GPT-4o-mini is cost-effective and accurate enough). As you accumulate labeled evaluation data, train a fine-tuned classifier that can evaluate in milliseconds instead of hundreds of milliseconds. The LLM judge becomes your labeling tool, and the classifier becomes your production evaluator.
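
Once enough judge-scored samples have accumulated, exporting them as training data for a small classifier is straightforward. A minimal sketch, reusing the RAGEvaluation records from earlier (the JSONL format shown is just one possible choice):

import json

def export_judge_labels(
    evaluations: list[RAGEvaluation], path: str
) -> None:
    """Write LLM-judge scores as labeled examples for training
    a faster evaluator (e.g. a small cross-encoder classifier)."""
    with open(path, "w") as f:
        for e in evaluations:
            f.write(json.dumps({
                "query": e.query,
                "context": e.retrieved_docs,
                "answer": e.generated_answer,
                "faithfulness_label": e.faithfulness,
            }) + "\n")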


#RAGEvaluation #ProductionMonitoring #QualityMetrics #ABTesting #MLOps #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
