
Document Chunking Strategies for RAG: Fixed-Size, Semantic, and Recursive

Learn the most effective document chunking methods for RAG pipelines including fixed-size, semantic, and recursive splitting, with guidance on overlap, chunk sizes, and markdown-aware strategies.

Why Chunking Matters More Than You Think

Chunking is the single most impactful decision in a RAG pipeline. If your chunks are too large, they contain too much noise and the embedding becomes a blurry average of unrelated ideas. If they are too small, they lose context and the retrieved snippet is meaningless on its own. The embedding model and the LLM both perform best when each chunk represents one coherent idea.

This post covers the three primary chunking strategies, their tradeoffs, and production-ready implementations.

Strategy 1: Fixed-Size Chunking

The simplest approach splits text into chunks of a fixed token or character count with optional overlap.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,       # characters
    chunk_overlap=50,     # overlap between consecutive chunks
    length_function=len,
)

chunks = splitter.split_text(document_text)
print(f"Created {len(chunks)} chunks")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

Pros: Simple, predictable chunk sizes, easy to reason about token costs.

Cons: Splits mid-sentence and mid-paragraph, breaking semantic coherence. A chunk might start with "...the patient should take 200mg" without any indication of which medication is being discussed.

Best for: Unstructured plain text where no natural boundaries exist, or as a baseline to compare against smarter methods.

Strategy 2: Recursive Character Splitting

This is the most popular strategy in production RAG systems. It tries to split on natural boundaries — paragraphs first, then sentences, then words — and only falls back to character-level splits when necessary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=[
        "\n\n",  # Try paragraph breaks first
        "\n",    # Then line breaks
        ". ",    # Then sentence endings
        ", ",    # Then clause boundaries
        " ",     # Then word boundaries
        "",      # Last resort: character-level
    ]
)

chunks = splitter.split_text(document_text)

The algorithm walks through the separator list in order. It first tries to split on double newlines (paragraphs). If a resulting chunk exceeds chunk_size, it recursively splits that chunk using the next separator.
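The recursion described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not LangChain's actual implementation, which also merges small adjacent pieces and handles overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text on the coarsest separator that keeps pieces under chunk_size,
    recursing to finer separators for pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard character-level split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # greedily pack pieces up to the budget
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # Piece itself is too big: recurse with the next separator
                chunks.extend(recursive_split(piece, chunk_size, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because `""` is the final separator, the recursion always terminates: any piece that survives every natural boundary gets a hard character split.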

Pros: Preserves semantic boundaries in most cases. Paragraphs stay intact when possible.


Cons: Chunk sizes still vary. Does not understand the actual meaning of the text.

Strategy 3: Semantic Chunking

Semantic chunking uses embedding similarity to detect topic boundaries. It embeds each sentence, then groups consecutive sentences that are semantically similar into the same chunk. When the similarity drops below a threshold, a new chunk begins.
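The boundary-detection idea is simple at its core: compare each sentence's embedding to the previous one and cut wherever similarity drops below the threshold. A minimal sketch follows, using a toy bag-of-words embedding purely for illustration; a real implementation calls an embedding model and typically derives the threshold from the distribution of similarities:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity
    to the previous sentence drops below the threshold."""
    groups = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            groups.append([sent])   # topic shift detected: new chunk
        else:
            groups[-1].append(sent)
        prev_vec = vec
    return [" ".join(g) for g in groups]
```

LangChain's experimental SemanticChunker, shown below, packages this idea with configurable breakpoint thresholds.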

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,  # split at 75th percentile dissimilarity
)

chunks = chunker.split_text(document_text)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk)} chars | Preview: {chunk[:80]}...")

Pros: Each chunk genuinely covers one coherent topic. Embedding quality improves significantly because the vector represents a single concept.

Cons: Requires an embedding API call for every sentence during indexing (higher cost). Chunk sizes are unpredictable. Slower ingestion.

Markdown-Aware Splitting

Technical documentation, wikis, and README files use markdown headers as natural section boundaries. A markdown-aware splitter respects these headings:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = md_splitter.split_text(markdown_text)

# Each chunk carries its header hierarchy as metadata
for chunk in chunks[:2]:
    print(f"Headers: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print()

The metadata (which headers this chunk falls under) is extremely valuable for retrieval. You can prepend headers to the chunk text before embedding so the vector captures the full context.
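That header-prepending step can be sketched as below, assuming metadata shaped like the h1/h2/h3 keys configured above (the helper name is illustrative, not a library API):

```python
def contextualize(chunk_text, header_metadata):
    """Prepend the header path to a chunk so its embedding captures
    the section context, e.g. {"h1": "API Reference", "h2": "Auth"}."""
    # "h1" < "h2" < "h3" sorts correctly, so sorting keys gives the hierarchy
    path = " > ".join(header_metadata[k] for k in sorted(header_metadata))
    return f"{path}\n\n{chunk_text}" if path else chunk_text
```

Embed the returned string instead of the bare chunk text; at answer time you can still show the user the original `page_content`.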

Choosing Chunk Size: A Practical Guide

There is no universal optimal chunk size. Here are guidelines based on production experience:

| Use Case | Chunk Size | Overlap | Reasoning |
|---|---|---|---|
| Q&A over docs | 256-512 tokens | 10-15% | Small, focused chunks match specific questions |
| Summarization | 1024-2048 tokens | 5% | Larger chunks preserve narrative flow |
| Code search | 64-256 tokens | 0% | Functions/classes are natural boundaries |
| Legal/medical | 512-1024 tokens | 15-20% | Higher overlap prevents splitting critical clauses |
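One practical wrinkle: these budgets are in tokens, while `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` as configured earlier measure characters. A rough conversion, assuming about 4 characters per token for English text (use a tokenizer such as tiktoken for exact counts):

```python
def approx_tokens(text, chars_per_token=4):
    """Rough token estimate for English text; use a real tokenizer for exact counts."""
    return max(1, len(text) // chars_per_token)

def splitter_params(chunk_tokens, overlap_fraction, chars_per_token=4):
    """Convert a token budget into the character-based chunk_size and
    chunk_overlap that character-counting splitters expect."""
    chunk_size = chunk_tokens * chars_per_token
    return {"chunk_size": chunk_size,
            "chunk_overlap": int(chunk_size * overlap_fraction)}
```

For example, a 512-token chunk with 12.5% overlap maps to roughly a 2048-character chunk with 256 characters of overlap.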

Overlap: Why It Matters

Overlap ensures that information spanning a chunk boundary is not lost. Consider a document where paragraph A ends with a key fact and paragraph B provides the explanation. Without overlap, the fact and its explanation land in separate chunks. With a 64-token overlap, the end of chunk N is repeated at the start of chunk N+1.

# Visualize overlap
for i in range(min(3, len(chunks) - 1)):
    end_of_current = chunks[i][-80:]
    start_of_next = chunks[i + 1][:80]
    overlap = set(end_of_current.split()) & set(start_of_next.split())
    print(f"Chunks {i}-{i+1} share {len(overlap)} words in overlap region")

FAQ

What chunk size should I start with for a new RAG project?

Start with 512 tokens using recursive character splitting with 64-token overlap. This works well for most question-answering use cases. Then measure retrieval quality and adjust — decrease chunk size if retrieved chunks contain too much irrelevant text, increase if chunks lack sufficient context.

Should I use semantic chunking in production?

Semantic chunking produces higher-quality chunks but is slower and more expensive during ingestion because every sentence requires an embedding call. Use it when ingestion is infrequent (you index documents once or nightly) and retrieval quality is critical. For real-time or high-volume ingestion, recursive splitting is more practical.

How do I handle tables and images in documents?

Tables should be extracted as structured text (CSV or markdown table format) and chunked as complete units — never split a table row across chunks. For images, use a multimodal embedding model or generate a text description of the image and embed that description alongside the surrounding text.
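As an illustration of treating a table as one unit, here is a hypothetical helper that renders extracted rows as a markdown table and returns it as a single string to embed whole (the row format is an assumption about your extractor's output):

```python
def table_to_chunk(rows):
    """Render a list of rows (first row = header) as one markdown table.
    Embed the returned string as a single chunk; never split it."""
    header, *body = rows
    lines = [" | ".join(header), " | ".join("---" for _ in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in body]
    return "\n".join(lines)
```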


#RAG #DocumentChunking #TextSplitting #NLP #VectorSearch #AgenticAI #LearnAI #AIEngineering

Written by CallSphere Team