
Parent-Child Chunking for RAG: Small Chunks for Search, Large Chunks for Context

Learn the parent-child chunking strategy where small chunks provide precise search matches while their larger parent chunks provide the full context needed for accurate generation.

The Chunking Dilemma

Every RAG system faces a fundamental tension in chunk sizing. Small chunks (100-200 tokens) produce precise embeddings that match specific queries accurately, but they lack the surrounding context needed for the LLM to generate comprehensive answers. Large chunks (1000-2000 tokens) provide rich context for generation, but their embeddings average over too many concepts, reducing retrieval precision.

This is not a theoretical problem. In practice, a 100-token chunk containing "The annual renewal rate increased to 94% in Q3" will match a revenue retention query perfectly. But the LLM needs the surrounding paragraphs to understand what drove that increase, which segments improved, and what caveats apply. Conversely, a 2000-token chunk about Q3 performance might not rank highly for a specific retention query because the embedding averages over dozens of different topics.

Parent-child chunking resolves this by decoupling search from context.

How Parent-Child Chunking Works

The strategy maintains two levels of chunks:

  • Child chunks (small, 100-300 tokens) — Used for embedding and similarity search. These are precise and topically focused.
  • Parent chunks (large, 1000-2000 tokens) — Used for context in generation. Each parent contains multiple children.

When a query comes in, the system searches against child chunk embeddings. When a child matches, the system retrieves its parent chunk and sends that larger context to the LLM.

Implementation

from dataclasses import dataclass, field
import uuid

@dataclass
class Chunk:
    id: str
    content: str
    parent_id: str | None = None
    children: list[str] = field(default_factory=list)
    embedding: list[float] | None = None

class ParentChildChunker:
    def __init__(
        self,
        parent_size: int = 1500,
        child_size: int = 300,
        child_overlap: int = 50,
    ):
        self.parent_size = parent_size
        self.child_size = child_size
        self.child_overlap = child_overlap
        self.parents: dict[str, Chunk] = {}
        self.children: dict[str, Chunk] = {}

    def chunk_document(self, text: str) -> list[Chunk]:
        """Split document into parent and child chunks.

        Sizes are counted in words here as a rough proxy
        for tokens; swap in a real tokenizer for exact
        token budgets.
        """
        words = text.split()
        all_children = []

        # Create parent chunks
        for i in range(0, len(words), self.parent_size):
            parent_text = " ".join(
                words[i:i + self.parent_size]
            )
            parent_id = str(uuid.uuid4())
            parent = Chunk(
                id=parent_id, content=parent_text
            )
            self.parents[parent_id] = parent

            # Create child chunks within this parent
            parent_words = parent_text.split()
            step = self.child_size - self.child_overlap

            for j in range(0, len(parent_words), step):
                child_text = " ".join(
                    parent_words[j:j + self.child_size]
                )
                if len(child_text.split()) < 20:
                    continue  # Skip tiny fragments

                child_id = str(uuid.uuid4())
                child = Chunk(
                    id=child_id,
                    content=child_text,
                    parent_id=parent_id,
                )
                self.children[child_id] = child
                parent.children.append(child_id)
                all_children.append(child)

        return all_children
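As a sanity check on the sliding-window arithmetic above: the step is `child_size - child_overlap`, so consecutive children share exactly `child_overlap` words. With toy sizes (5-word children, 2-word overlap, chosen only for illustration):

```python
# Toy numbers to make the overlap visible; real runs use
# the 100-300 word sizes discussed above.
words = [f"w{i}" for i in range(12)]
child_size, child_overlap = 5, 2
step = child_size - child_overlap  # 3

children = [
    words[j:j + child_size]
    for j in range(0, len(words), step)
]
# children[0] ends with the same two words children[1] starts with
```

The last window may be shorter than `child_size`, which is why the chunker above skips fragments under 20 words.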

Embedding and Retrieval

Only the child chunks get embedded and stored in the vector index:


from openai import OpenAI

client = OpenAI()

def embed_children(
    chunker: ParentChildChunker,
) -> list[Chunk]:
    """Embed only child chunks for search indexing."""
    children = list(chunker.children.values())
    batch_size = 100

    for i in range(0, len(children), batch_size):
        batch = children[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[c.content for c in batch],
        )
        for chunk, emb in zip(batch, response.data):
            chunk.embedding = emb.embedding

    return children
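Whatever store backs `similarity_search`, the search side only needs cosine similarity over child embeddings plus a `chunk_id` to hop back to the parent. A minimal sketch with toy 3-dimensional vectors standing in for real model embeddings (ids and values are invented for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy index: child_id -> embedding. In practice these come
# from embed_children() and live in a vector database.
index = {
    "child-a": [1.0, 0.0, 0.0],
    "child-b": [0.0, 1.0, 0.0],
    "child-c": [0.9, 0.1, 0.0],
}

def top_k_children(query_emb: list[float], k: int = 2) -> list[str]:
    """Rank child ids by similarity to the query embedding."""
    ranked = sorted(
        index,
        key=lambda cid: cosine(query_emb, index[cid]),
        reverse=True,
    )
    return ranked[:k]
```

A query vector close to "child-a" ranks "child-c" second because their embeddings point in nearly the same direction, which is exactly the behavior a dedicated vector store optimizes at scale.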

def parent_child_search(
    query: str,
    chunker: ParentChildChunker,
    vectorstore,
    k: int = 5,
) -> list[str]:
    """Search children, return parents for context."""
    # Search against child embeddings
    child_results = vectorstore.similarity_search(query, k=k)

    # Retrieve unique parent chunks
    seen_parents = set()
    parent_contexts = []

    for child_doc in child_results:
        child_id = child_doc.metadata["chunk_id"]
        child = chunker.children.get(child_id)
        if child and child.parent_id not in seen_parents:
            seen_parents.add(child.parent_id)
            parent = chunker.parents[child.parent_id]
            parent_contexts.append(parent.content)

    return parent_contexts
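The `seen_parents` deduplication matters because several top-ranked children often live in the same parent. A toy illustration with hypothetical ids, where three of four retrieved children collapse into one parent context:

```python
# Hypothetical mapping: four top-ranked children, three
# sharing parent "p1", one belonging to "p2".
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2", "c4": "p1"}
ranked_children = ["c2", "c1", "c3", "c4"]

seen, parent_order = set(), []
for cid in ranked_children:
    pid = child_to_parent[cid]
    if pid not in seen:  # keep first occurrence, preserve rank order
        seen.add(pid)
        parent_order.append(pid)
```

Without the dedup step, the LLM would receive the same parent text three times, wasting context window on duplicates.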

Handling Section-Aware Parent Chunks

For structured documents, align parent chunks with document sections rather than using fixed token counts:

import re

def section_aware_chunking(
    markdown_text: str,
) -> list[tuple[str, str]]:
    """Create parent chunks aligned with document sections."""
    # Split on headings
    sections = re.split(
        r'(?=^##?\s)', markdown_text, flags=re.MULTILINE
    )

    parents = []
    for section in sections:
        section = section.strip()
        if not section:
            continue

        # Extract heading as metadata
        lines = section.split("\n")
        heading = lines[0].strip("# ").strip()
        body = "\n".join(lines[1:]).strip()

        if len(body.split()) > 50:  # Skip near-empty sections
            parents.append((heading, body))

    return parents
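On a small markdown snippet (invented for illustration), the heading-anchored split behaves like this — the lookahead `(?=^##?\s)` splits before each H1/H2 heading without consuming it:

```python
import re

doc = """# Quarterly Report
Overview text for the whole report.

## Revenue
Revenue grew across all segments this quarter.

## Retention
The annual renewal rate increased to 94% in Q3.
"""

# Zero-width lookahead keeps each heading with its own section
sections = [
    s for s in re.split(r"(?=^##?\s)", doc, flags=re.MULTILINE)
    if s.strip()
]
headings = [s.splitlines()[0].lstrip("#").strip() for s in sections]
```

Each resulting section keeps its heading as the first line, so the heading can travel with the parent chunk as metadata for citation and filtering.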

Choosing Chunk Sizes

The optimal sizes depend on your documents and queries. The following starting points work well in practice:

  • Technical documentation: Parent 1500 tokens, Child 200 tokens. Technical queries are precise and benefit from small child chunks.
  • Legal contracts: Parent 2000 tokens, Child 300 tokens. Legal context requires broad surrounding text for accurate interpretation.
  • Support conversations: Parent 1000 tokens, Child 150 tokens. Individual messages are short but need thread context.

Always evaluate on your specific query patterns. Measure retrieval precision at the child level and answer quality at the parent level.
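Child-level retrieval precision can be tracked with a simple precision@k over a labeled query set. A minimal sketch — the ids and relevance labels below are hypothetical, and a full harness would average over many queries:

```python
def precision_at_k(
    retrieved_ids: list[str],
    relevant_ids: set[str],
    k: int = 5,
) -> float:
    """Fraction of the top-k retrieved children labeled relevant."""
    top = retrieved_ids[:k]
    return sum(1 for cid in top if cid in relevant_ids) / k

# Hypothetical labels: 3 of the top 5 retrieved children
# were judged relevant for this query.
score = precision_at_k(
    ["c1", "c2", "c3", "c4", "c5"],
    {"c1", "c3", "c5"},
    k=5,
)
```

Run the same labeled queries after each chunk-size change; if child precision rises but answer quality drops, the parents are likely too small to carry the needed context.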

FAQ

Does parent-child chunking increase storage requirements?

It increases storage by roughly 5-15% compared to single-level chunking because child chunks overlap within parents. However, you only embed and index the children, so vector storage scales with the number of children, not parents. The parent documents can be stored in a simple key-value store.

Can I use more than two levels in the hierarchy?

Yes, three-level hierarchies (grandparent-parent-child) work well for very long documents. Grandparent chunks represent entire sections, parents represent subsections, and children represent individual paragraphs. However, more levels add complexity to the retrieval logic, so only add a level if two levels provably underperform on your evaluation dataset.

How does this compare to overlapping windows in standard chunking?

Overlapping windows add context at the edges of each chunk but do not solve the core precision-context tradeoff. A 500-token chunk with 100-token overlap is still a compromise. Parent-child chunking fully decouples search precision from generation context, giving you the best of both worlds.


#ChunkingStrategy #RAG #ParentChildChunks #VectorSearch #DocumentProcessing #AgenticAI #LearnAI #AIEngineering
