---
title: "Building Your First RAG Pipeline in Python: End-to-End Tutorial"
description: "A complete hands-on tutorial that walks you through building a working RAG pipeline from scratch — loading documents, chunking, embedding, storing in a vector database, retrieving, and generating answers."
canonical: https://callsphere.ai/blog/building-first-rag-pipeline-python-end-to-end-tutorial
category: "Learn Agentic AI"
tags: ["RAG", "Python", "Tutorial", "LangChain", "Vector Search", "OpenAI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T18:49:58.410Z
---

# Building Your First RAG Pipeline in Python: End-to-End Tutorial

> A complete hands-on tutorial that walks you through building a working RAG pipeline from scratch — loading documents, chunking, embedding, storing in a vector database, retrieving, and generating answers.

## What You Will Build

By the end of this tutorial, you will have a fully working RAG pipeline that can answer questions about any collection of documents. The pipeline includes six stages: load, chunk, embed, store, retrieve, and generate. Every line of code is explained.

## Prerequisites

Before installing anything, here is the query-time flow you will build (ingestion happens once, up front: load, chunk, embed, store):

```mermaid
flowchart LR
    Q(["User query"])
    EMB["Embed query
text-embedding-3-small"]
    VEC[("Vector DB
Chroma")]
    RET["Top-k retrieval
k = 5"]
    PROMPT["Augmented prompt
system plus context"]
    LLM["LLM generation
GPT-4o"]
    CITE["Source attribution
file and page metadata"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

Install the required packages:

```bash
pip install langchain langchain-openai langchain-community chromadb pypdf python-dotenv
```

Set your OpenAI API key:

```bash
export OPENAI_API_KEY="sk-proj-your-key-here"
```
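
The install list includes `python-dotenv`, so you can also keep the key in a local `.env` file instead of exporting it in every shell. A small helper for that (the name `ensure_api_key` is ours, not a library function):

```python
import os

def ensure_api_key(name: str = "OPENAI_API_KEY") -> bool:
    """Load a .env file when python-dotenv is available, then report whether `name` is set."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # silently a no-op if there is no .env file
    except ImportError:
        pass  # fall back to whatever is already in the environment
    return bool(os.getenv(name))
```

Call `ensure_api_key()` once at startup and fail fast (for example with `raise SystemExit(...)`) if it returns `False`.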

## Step 1: Load Documents

We will use LangChain's document loaders to read PDF files from a directory. The same pattern works for markdown, HTML, CSV, and dozens of other formats.

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "./docs",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
)

raw_documents = loader.load()
print(f"Loaded {len(raw_documents)} pages from PDF files")

# Each document has page_content (text) and metadata (source, page number)
for doc in raw_documents[:2]:
    print(f"Source: {doc.metadata['source']}, Page: {doc.metadata.get('page', 'N/A')}")
    print(f"Content preview: {doc.page_content[:150]}...")
    print()
```

## Step 2: Chunk Documents

Split the loaded documents into smaller, semantically coherent chunks using recursive character splitting:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter  # installed as a dependency of langchain

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

chunks = text_splitter.split_documents(raw_documents)
print(f"Split {len(raw_documents)} pages into {len(chunks)} chunks")

# Inspect chunk size distribution
sizes = [len(c.page_content) for c in chunks]
print(f"Chunk sizes — min: {min(sizes)}, max: {max(sizes)}, avg: {sum(sizes)//len(sizes)}")
```

Each chunk retains the metadata from its parent document (source file, page number), which is critical for source attribution in answers.
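
To build intuition for what the splitter does, here is a simplified, stdlib-only sketch of recursive splitting. It is illustrative only: the real `RecursiveCharacterTextSplitter` also keeps separators in the output, merges small pieces, and applies `chunk_overlap`.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator present, recursing until pieces fit chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the first (coarsest) non-empty separator that occurs in the text
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # No separator left: hard-split at chunk_size boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, separators))
    return chunks

# A paragraph that fits is returned whole; longer text splits along word boundaries
print(recursive_split("short paragraph", chunk_size=100))
```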

## Step 3: Create Embeddings and Store in Vector DB

We use OpenAI's `text-embedding-3-small` model and Chroma as the vector store:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dimensions, $0.02/1M tokens
)

# Build the vector store — this embeds all chunks and stores them
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_rag_db",
    collection_name="my_documents",
)

print(f"Stored {len(chunks)} chunks in Chroma vector store")
```

This step makes API calls to OpenAI to generate embeddings for every chunk. For 1,000 chunks of 512 characters each (about 128k tokens at roughly 4 characters per token), the cost is well under a cent.
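
A quick back-of-the-envelope helper for the embedding bill, using the rough 4-characters-per-token heuristic (the heuristic and the function name are ours; check current OpenAI pricing before relying on the default rate):

```python
def estimate_embedding_cost(num_chunks, avg_chars, usd_per_million_tokens=0.02):
    """Rough embedding cost: chars / 4 approximates tokens, priced per million tokens."""
    est_tokens = num_chunks * avg_chars / 4
    return est_tokens / 1_000_000 * usd_per_million_tokens

print(f"${estimate_embedding_cost(1_000, 512):.4f}")  # → $0.0026
```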

## Step 4: Build the Retriever

The retriever wraps the vector store and provides a clean interface for finding relevant chunks:

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 5,  # return top 5 most similar chunks
    },
)

# Test retrieval
test_query = "What are the main product features?"
retrieved_docs = retriever.invoke(test_query)

print(f"Retrieved {len(retrieved_docs)} chunks for: '{test_query}'")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Chunk {i+1} (from {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:200])
```
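
What does `similarity` search actually compute? Conceptually, the store ranks every chunk embedding by its similarity to the query embedding and returns the top k. A toy, stdlib-only sketch with hand-made 2-D vectors (real stores like Chroma use approximate-nearest-neighbor indexes rather than the full scan shown here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]]
print(top_k(query, chunks, k=2))  # → [2, 1]: the exact match first, near match second
```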

## Step 5: Build the Generation Chain

Now we connect retrieval to generation. The LLM receives the retrieved context and produces a grounded answer:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

RAG_PROMPT = ChatPromptTemplate.from_template("""You are a helpful assistant that answers questions based on the provided context.

Instructions:
- Answer the question using ONLY the information in the context below.
- If the context does not contain enough information, say "I don't have sufficient information to answer that question."
- Cite the source document when possible.
- Be concise and direct.

Context:
{context}

Question: {question}

Answer:""")

def format_docs(docs):
    """Join retrieved chunks into one context string, labeling each with its source."""
    formatted = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)
```
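
You can sanity-check `format_docs` without any API calls by feeding it lightweight stand-ins for LangChain's `Document`. The `FakeDoc` class below is ours, defined only for this test, and `format_docs` is repeated so the snippet runs on its own:

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    # Same function as above, repeated so this snippet is self-contained
    formatted = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)

docs = [
    FakeDoc("Refunds are issued within 30 days.", {"source": "policy.pdf"}),
    FakeDoc("The enterprise plan includes SSO."),  # no metadata: labeled "unknown"
]
print(format_docs(docs))
```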

## Step 6: Put It All Together

```python
def ask(question: str) -> str:
    """Complete RAG pipeline: retrieve then generate."""
    # Retrieve relevant chunks
    docs = retriever.invoke(question)

    # Format context
    context = format_docs(docs)

    # Generate answer
    messages = RAG_PROMPT.format_messages(context=context, question=question)
    response = llm.invoke(messages)

    return response.content

# Try it
answer = ask("What are the main product features?")
print(answer)

answer = ask("What is the pricing for the enterprise plan?")
print(answer)
```

## Adding Source Attribution

A production-quality RAG system should tell users where the answer came from:

```python
def ask_with_sources(question: str) -> dict:
    """RAG pipeline that returns answer with sources."""
    docs = retriever.invoke(question)
    context = format_docs(docs)

    messages = RAG_PROMPT.format_messages(context=context, question=question)
    response = llm.invoke(messages)

    # Deduplicate while preserving retrieval order (set ordering is nondeterministic)
    sources = list(dict.fromkeys(
        doc.metadata.get("source", "unknown") for doc in docs
    ))

    return {
        "answer": response.content,
        "sources": sources,
        "num_chunks_used": len(docs),
    }

result = ask_with_sources("What is the refund policy?")
print(f"Answer: {result['answer']}")
print(f"Sources: {', '.join(result['sources'])}")
print(f"Chunks used: {result['num_chunks_used']}")
```

## Testing Your Pipeline

Verify the pipeline works correctly by testing edge cases:

```python
test_questions = [
    "What is the refund policy?",           # should find answer
    "What is the capital of Mars?",          # should say not in context
    "Summarize the main features",           # broad question
]

for q in test_questions:
    result = ask_with_sources(q)
    print(f"Q: {q}")
    print(f"A: {result['answer'][:200]}")
    print(f"Sources: {result['sources']}")
    print()
```
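
To go beyond eyeballing the printed answers, pair each test question with substrings the answer must (or must not) contain. The helper below is plain Python and our own convention, not a library API; wire it up to `ask_with_sources` however fits your test runner:

```python
def check_answer(answer, must_contain=(), must_not_contain=()):
    """Return a list of failed expectations for one answer (empty list means pass)."""
    failures = []
    lowered = answer.lower()
    for phrase in must_contain:
        if phrase.lower() not in lowered:
            failures.append(f"missing: {phrase!r}")
    for phrase in must_not_contain:
        if phrase.lower() in lowered:
            failures.append(f"unexpected: {phrase!r}")
    return failures

# Example: a refusal should not pretend to know the pricing
print(check_answer("I don't have sufficient information to answer that question.",
                   must_contain=["sufficient information"],
                   must_not_contain=["$"]))  # → []
```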

## FAQ

### How much does it cost to run this pipeline?

Embedding costs are minimal: roughly $0.02 per million tokens with `text-embedding-3-small`. The main cost is the generation LLM call. With GPT-4o, each query costs roughly $0.005 to $0.02 depending on context length, so most applications spend a few dollars per thousand queries.

### Can I use a local LLM instead of OpenAI for generation?

Yes. Replace `ChatOpenAI` with any LangChain-compatible LLM wrapper. For local models, use `ChatOllama` with Llama or Mistral. The retrieval pipeline remains identical — only the generation step changes.

### How do I update the knowledge base when documents change?

Re-run the ingestion pipeline (Steps 1-3) on the new or updated documents. For incremental updates, add new chunks to the existing Chroma collection using `vectorstore.add_documents(new_chunks)`. For deletions, use Chroma's delete API with document IDs.

---

Source: https://callsphere.ai/blog/building-first-rag-pipeline-python-end-to-end-tutorial
