---
title: "Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents"
description: "Learn systematic approaches to debugging RAG retrieval failures including query analysis, embedding inspection, relevance scoring evaluation, and chunk quality review for more accurate AI agent responses."
canonical: https://callsphere.ai/blog/debugging-rag-retrieval-wrong-irrelevant-documents
category: "Learn Agentic AI"
tags: ["Debugging", "RAG", "Embeddings", "Vector Search", "AI Agents"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.638Z
---

# Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents

> Learn systematic approaches to debugging RAG retrieval failures including query analysis, embedding inspection, relevance scoring evaluation, and chunk quality review for more accurate AI agent responses.

## The Right Question, the Wrong Answer

Your RAG-powered agent has access to thousands of documents. A user asks a straightforward question. The agent retrieves three chunks, synthesizes a response, and delivers it confidently. The response is wrong — not because the model hallucinated, but because it was given the wrong documents to work with.

RAG retrieval failures are particularly dangerous because the agent has no way to know it retrieved bad chunks. It trusts what it receives and generates a plausible-sounding answer from irrelevant source material. Debugging this requires inspecting every stage of the retrieval pipeline.

## The RAG Retrieval Pipeline

Every RAG query passes through four retrieval stages before the augmented prompt ever reaches the model, and failures can occur at each one:

```mermaid
flowchart LR
    Q(["User query"])
    EMB["Embed query
text-embedding-3"]
    VEC[("Vector DB
pgvector or Pinecone")]
    RET["Top-k retrieval
k = 8"]
    PROMPT["Augmented prompt
system plus context"]
    LLM["LLM generation
Claude or GPT"]
    CITE["Inline citations
and page anchors"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

1. **Query formation**: The user question is transformed into a search query
2. **Embedding**: The query is converted to a vector
3. **Vector search**: The nearest neighbor chunks are retrieved
4. **Relevance filtering**: Results below a threshold are discarded

Build a debugger that captures data at every stage:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class RetrievalDebugInfo:
    original_query: str = ""
    search_query: str = ""
    query_embedding: list[float] = field(default_factory=list)
    raw_results: list[dict] = field(default_factory=list)
    filtered_results: list[dict] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)

class RAGDebugger:
    def __init__(self, embedding_client, vector_store):
        self.embedding_client = embedding_client
        self.vector_store = vector_store

    async def debug_retrieve(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7,
    ) -> RetrievalDebugInfo:
        info = RetrievalDebugInfo(original_query=query)

        # Stage 1: Query formation
        info.search_query = query  # or apply transformation
        print(f"[1] Query: {info.search_query}")

        # Stage 2: Embedding
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=info.search_query,
        )
        info.query_embedding = response.data[0].embedding
        print(f"[2] Embedding dim: {len(info.query_embedding)}")

        # Stage 3: Vector search
        results = await self.vector_store.query(
            embedding=info.query_embedding,
            top_k=top_k,
        )
        info.raw_results = results
        info.similarity_scores = [r["score"] for r in results]
        print(f"[3] Raw results: {len(results)}")
        for i, r in enumerate(results):
            print(f"    [{i}] score={r['score']:.4f} | {r['text'][:80]}...")

        # Stage 4: Filtering
        info.filtered_results = [
            r for r in results if r["score"] >= threshold
        ]
        print(f"[4] After filter (>={threshold}): {len(info.filtered_results)}")

        return info
```

## Diagnosing Query-Document Mismatch

The most common RAG failure is a semantic gap between the query and the stored chunks: the user's phrasing embeds far from the relevant documents in vector space, so nearest-neighbor search surfaces topically adjacent but unhelpful material:

```python
import numpy as np

async def diagnose_query_mismatch(
    debugger, query: str, expected_doc_ids: list[str]
):
    """Check if expected documents score higher than retrieved ones."""
    info = await debugger.debug_retrieve(query, top_k=20)

    retrieved_ids = {r["id"] for r in info.raw_results}
    expected_set = set(expected_doc_ids)

    found = expected_set & retrieved_ids
    missed = expected_set - retrieved_ids

    print(f"Expected docs found in top-20: {len(found)}/{len(expected_set)}")
    if missed:
        print(f"Missing doc IDs: {missed}")
        # Fetch embeddings for missing docs and compute similarity
        for doc_id in missed:
            doc = await debugger.vector_store.get_by_id(doc_id)
            if doc:
                doc_emb = doc["embedding"]
                query_emb = np.array(info.query_embedding)
                similarity = np.dot(query_emb, np.array(doc_emb)) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
                )
                print(f"  {doc_id}: similarity={similarity:.4f}")
                print(f"    Content: {doc['text'][:100]}...")
```

## Inspecting Chunk Quality

Bad chunking is a silent killer of RAG accuracy. Chunks that split important information across boundaries lose semantic coherence:

```python
import numpy as np

class ChunkQualityAnalyzer:
    def __init__(self, embedding_client):
        self.client = embedding_client

    async def analyze_chunks(self, chunks: list[str], query: str):
        """Score each chunk for self-containedness and relevance."""
        # Embed query and all chunks
        all_texts = [query] + chunks
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=all_texts,
        )
        embeddings = [d.embedding for d in response.data]
        query_emb = np.array(embeddings[0])

        print(f"Analyzing {len(chunks)} chunks against query")
        print("-" * 60)

        for i, chunk in enumerate(chunks):
            chunk_emb = np.array(embeddings[i + 1])
            similarity = float(np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            ))
            word_count = len(chunk.split())
            has_incomplete_sentence = (
                not chunk.strip().endswith((".", "!", "?", '."', ".'"))
            )

            print(f"Chunk {i}: similarity={similarity:.4f}, "
                  f"words={word_count}, "
                  f"incomplete={'YES' if has_incomplete_sentence else 'no'}")
            if has_incomplete_sentence:
                print(f"  Ends with: ...{chunk[-60:]}")
```

## Testing with Known-Good Queries

Build a test suite of queries with expected document matches to catch retrieval regressions:

```python
class RAGTestSuite:
    def __init__(self, debugger):
        self.debugger = debugger
        self.test_cases = []

    def add_case(self, query: str, expected_doc_ids: list[str], threshold=0.7):
        self.test_cases.append({
            "query": query,
            "expected": expected_doc_ids,
            "threshold": threshold,
        })

    async def run(self):
        results = []
        for case in self.test_cases:
            info = await self.debugger.debug_retrieve(
                case["query"], top_k=10, threshold=case["threshold"]
            )
            retrieved_ids = {r["id"] for r in info.filtered_results}
            expected = set(case["expected"])
            recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0

            results.append({
                "query": case["query"],
                "recall": recall,
                "pass": recall >= 0.8,
            })
            status = "PASS" if recall >= 0.8 else "FAIL"
            print(f"[{status}] recall={recall:.0%} | {case['query'][:60]}")
        return results
```
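The suite above measures recall only. Mean reciprocal rank (MRR) is a useful companion metric because it also rewards ranking an expected document near the top, which matters when only the first few chunks fit in the prompt. A self-contained sketch of both:

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected doc IDs present in the top-k retrieved list."""
    if not expected:
        return 1.0
    return len(expected & set(retrieved[:k])) / len(expected)

def mrr(retrieved: list[str], expected: set[str]) -> float:
    """Reciprocal rank of the first expected document (0.0 if absent)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

retrieved = ["doc-7", "doc-3", "doc-1", "doc-9"]
expected = {"doc-3", "doc-1"}
print(recall_at_k(retrieved, expected, k=4))  # → 1.0
print(mrr(retrieved, expected))               # → 0.5 (first hit at rank 2)
```

A run with perfect recall but low MRR tells you the right documents are indexed but losing the ranking battle — a reranking problem rather than an embedding problem.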

## FAQ

### My RAG retrieves documents that are topically related but do not answer the specific question. How do I fix this?

This is a precision problem. Increase your similarity threshold to filter out loosely related chunks. Also consider using a reranker model as a second-stage filter — cross-encoder rerankers like Cohere Rerank or BGE Reranker evaluate query-document pairs more accurately than cosine similarity on embeddings alone.
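A minimal sketch of that two-stage pattern — note that `cross_encoder_score` here is a hypothetical stand-in using token overlap as a crude proxy; in practice you would call a real reranker such as Cohere Rerank or a local BGE model, which jointly encodes each (query, document) pair:

```python
def cross_encoder_score(query: str, text: str) -> float:
    # Hypothetical stand-in for a real cross-encoder reranker:
    # scores by token overlap between query and chunk text.
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / max(len(q_tokens), 1)

def rerank(query: str, candidates: list[dict], top_n: int = 3,
           score_fn=cross_encoder_score) -> list[dict]:
    """Second-stage filter: rescore vector-search candidates and
    keep only the best top_n."""
    scored = [
        {**c, "rerank_score": score_fn(query, c["text"])}
        for c in candidates
    ]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]

candidates = [
    {"id": "a", "text": "refund policy lasts 30 days"},
    {"id": "b", "text": "our shipping times vary by region"},
]
best = rerank("what is the refund policy", candidates, top_n=1)
print(best[0]["id"])  # → a
```

The important design point is over-retrieving in stage one (say, top-20) so the reranker has enough candidates to rescue precision in stage two.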

### Should I embed the user question directly or rewrite it before searching?

Query rewriting often improves retrieval significantly. Use the LLM to expand abbreviations, resolve pronouns from conversation history, and rephrase colloquial language into terminology that matches your documents. A simple rewriting step can increase recall by 20 to 40 percent.
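A sketch of how that rewrite step might be wired in. This builds only the rewrite prompt — deterministic and easy to test — and leaves the actual LLM call to your client; the prompt wording and the four-turn history window are illustrative assumptions:

```python
def build_rewrite_prompt(question: str, history: list[str]) -> str:
    """Assemble an LLM prompt asking for a standalone, search-friendly
    rewrite of the user's question, with pronouns resolved from history."""
    context = "\n".join(f"- {turn}" for turn in history[-4:])
    return (
        "Rewrite the user's question as a standalone search query.\n"
        "Expand abbreviations, resolve pronouns from the conversation, and\n"
        "use formal terminology. Return only the rewritten query.\n\n"
        f"Conversation so far:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rewrite_prompt(
    "how long does it take?",
    ["User asked about the premium plan refund process."],
)
# The prompt carries the history the model needs to resolve "it".
print("premium plan" in prompt)  # → True
```

The rewritten query then feeds into `search_query` at stage 1 of the debugger, so you can compare retrieval quality with and without rewriting.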

### How do I decide the right chunk size for my documents?

There is no universal answer — it depends on your content. Start with 500 to 800 tokens with 100-token overlap. Test with your actual queries and measure recall. If chunks are too small, they lack context. If too large, they dilute relevance. Technical documentation often benefits from smaller chunks while narrative content works better with larger ones.
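The overlap recommendation above can be sketched as a simple sliding-window chunker. This version splits on whitespace words as a rough token proxy; a production pipeline would count model tokens and respect sentence or heading boundaries:

```python
def chunk_words(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into word windows of chunk_size, carrying overlap
    words of shared context between consecutive chunks."""
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1500))
chunks = chunk_words(doc, chunk_size=600, overlap=100)
print(len(chunks))  # → 3 chunks covering words 0-599, 500-1099, 1000-1499
```

Re-run the chunk quality analyzer from earlier against a few candidate sizes; the `incomplete` flag it prints is a quick signal that your window is cutting sentences mid-thought.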

---

#Debugging #RAG #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/debugging-rag-retrieval-wrong-irrelevant-documents
