Learn Agentic AI

What Is RAG: Retrieval-Augmented Generation Explained from Scratch

Understand what Retrieval-Augmented Generation is, why it exists, how the core architecture works, and when to choose RAG over fine-tuning for grounding LLM responses in your own data.

The Problem RAG Solves

Large language models are trained on static datasets with a fixed knowledge cutoff. When you ask an LLM about your company's internal documentation, last week's product changelog, or a proprietary research paper, it either hallucinates an answer or admits it does not know. Fine-tuning can inject new knowledge, but it is expensive, slow, and the model still cannot access information that changes daily.

Retrieval-Augmented Generation (RAG) solves this by giving the LLM a search engine at inference time. Instead of relying solely on parametric memory baked into the model's weights, RAG retrieves relevant documents from an external knowledge base and passes them into the prompt as context. The model then generates an answer grounded in those documents.

Core Architecture

A RAG system has two phases that execute in sequence for every user query:


Phase 1 — Retrieval: The user's question is converted into a vector embedding, then compared against pre-indexed document embeddings in a vector database. The top-k most similar chunks are returned.
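The similarity search at the heart of this phase can be sketched in plain Python. Toy 3-dimensional vectors stand in for real embeddings, which typically have hundreds or thousands of dimensions, and the brute-force scan stands in for the approximate nearest-neighbor index a real vector database uses:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy index: (embedding, chunk text) pairs, as a vector DB stores them
index = [
    ([0.9, 0.1, 0.0], "Refunds are processed within 14 days."),
    ([0.0, 1.0, 0.1], "Our office is closed on public holidays."),
    ([0.8, 0.2, 0.1], "Enterprise plans include priority refunds."),
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "refund policy?"
print(top_k(query_vec, index, k=2))
```

Both refund-related chunks outrank the unrelated one, which is exactly the behavior the retriever relies on: semantically similar text produces nearby vectors.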

Phase 2 — Generation: The retrieved chunks are injected into the LLM prompt alongside the original question. The model synthesizes an answer using only (or primarily) the provided context.

The data flow looks like this:

User Query
    |
    v
Embedding Model --> Query Vector
    |
    v
Vector Database (similarity search)
    |
    v
Top-K Document Chunks
    |
    v
LLM Prompt = System Instructions + Retrieved Context + User Query
    |
    v
Generated Answer (grounded in retrieved documents)

Offline Ingestion Pipeline

Before queries can be answered, documents must be preprocessed and indexed. This happens offline:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
documents = load_your_documents()  # PDFs, markdown, HTML, etc.

# 2. Chunk them
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

Each chunk gets an embedding vector. These vectors are stored in the vector database alongside the original text.


RAG vs Fine-Tuning: When to Use Which

The decision between RAG and fine-tuning depends on five factors:

Factor                     | RAG                                        | Fine-Tuning
Data changes frequently    | Excellent — re-index new docs              | Poor — retrain required
Need citations / sources   | Built-in — retrieved docs are traceable    | Not possible
Domain-specific style/tone | Weaker — model writes in its default style | Strong — model learns the style
Latency budget             | Higher — retrieval adds 100-500ms          | Lower — single model call
Cost                       | Lower — no GPU training costs              | Higher — compute for training

Use RAG when your knowledge base changes often, you need source attribution, or you cannot afford fine-tuning compute. Use fine-tuning when you need the model to adopt a specific writing style or deeply understand a narrow domain.

In practice, many production systems combine both: fine-tune for tone and format, then use RAG for factual grounding.

A Minimal RAG Query in Python

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load the pre-built vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve
query = "What is our refund policy for enterprise plans?"
docs = retriever.invoke(query)
context = "\n\n".join([doc.page_content for doc in docs])

# Generate
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = f"""Answer the question based ONLY on the following context.
If the context does not contain the answer, say "I don't have that information."

Context:
{context}

Question: {query}"""

response = llm.invoke(prompt)
print(response.content)

Common Pitfalls

Chunks too large: The model gets flooded with irrelevant text and misses the key passage. Keep chunks between 256 and 1024 tokens.

No overlap between chunks: Important information that spans a chunk boundary gets split and lost. Use 10-15% overlap.
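A minimal sketch of fixed-size chunking with overlap illustrates why this matters. It is character-based for simplicity; real splitters like the one above count tokens and prefer natural separators:

```python
def chunk_text(text, chunk_size=100, overlap=15):
    """Split text into fixed-size chunks. Each chunk repeats the last
    `overlap` characters of the previous one, so a sentence that spans
    a chunk boundary survives intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A key sentence placed right across the 100-character boundary
text = "A" * 90 + "The refund window is 14 days." + "B" * 90
chunks = chunk_text(text, chunk_size=100, overlap=15)

# With ~15% overlap, the boundary-spanning sentence appears whole
# in one of the chunks; with overlap=0 it would be cut in two.
print(any("The refund window is 14 days." in c for c in chunks))
```

Setting `overlap=0` in the same sketch splits the sentence across two chunks, and neither half alone can answer a question about the refund window.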

Ignoring retrieval quality: Teams focus on the generation model but the answer quality ceiling is set by retrieval. If the right document is not retrieved, no model can produce a correct answer.
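One way to catch this is to measure retrieval on its own, before looking at generation at all. A common metric is recall@k: the fraction of test questions for which a known-relevant chunk shows up in the top-k results. A sketch, with a hypothetical retriever and a toy evaluation set standing in for your real pipeline:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (question, relevant_chunk_id) pairs.
    retrieve: function mapping a question to a ranked list of chunk ids."""
    hits = 0
    for question, relevant_id in eval_set:
        if relevant_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)

# Fake rankings for illustration only; in practice this calls your retriever
fake_rankings = {
    "refund policy?": ["doc-7", "doc-2", "doc-9"],
    "office hours?": ["doc-4", "doc-1", "doc-3"],
}
retrieve = lambda question: fake_rankings[question]

eval_set = [("refund policy?", "doc-2"), ("office hours?", "doc-8")]
print(recall_at_k(eval_set, retrieve, k=3))  # doc-2 found, doc-8 missed -> 0.5
```

A recall@k of 0.5 here means that for half the questions, no generation model, however capable, can produce a grounded answer, because the evidence never reaches the prompt.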

FAQ

How is RAG different from just pasting documents into a prompt?

RAG is selective — it retrieves only the most relevant chunks rather than dumping entire documents into the context window. This keeps costs low, avoids hitting token limits, and reduces noise so the model focuses on pertinent information.
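The cost difference is easy to quantify with a rough word count as a token proxy (real token counts come from the model's tokenizer, and the corpus here is a made-up stand-in):

```python
# Hypothetical knowledge base: word counts approximate token counts
corpus = {
    "handbook": "word " * 4000,      # ~4,000-word employee handbook
    "changelog": "word " * 1500,     # ~1,500-word product changelog
    "refund_policy": "word " * 800,  # ~800-word policy document
}

def word_count(text):
    return len(text.split())

# Naive approach: paste every document into the prompt
full_dump = sum(word_count(doc) for doc in corpus.values())

# RAG: retrieve, say, five 128-word chunks instead
rag_context = 5 * 128

print(full_dump)    # 6300
print(rag_context)  # 640
```

Even on this tiny corpus the naive prompt is roughly ten times larger; with a realistic knowledge base it simply does not fit in the context window.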

Can RAG work with open-source LLMs or only OpenAI models?

RAG is model-agnostic. The retrieval phase uses embedding models (which can be open-source like sentence-transformers) and the generation phase works with any LLM — Llama, Mistral, Gemma, or any other model that accepts a text prompt.

When should I NOT use RAG?

Skip RAG when all the knowledge the model needs is already in its training data (general knowledge tasks), when you need sub-50ms latency with no room for a retrieval step, or when your use case is purely generative (creative writing, brainstorming) with no factual grounding requirement.


#RAG #RetrievalAugmentedGeneration #LLM #VectorSearch #AIArchitecture #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

