LLM-Powered Data Extraction and Document Processing: Patterns That Work in 2026

Practical architectures for using LLMs to extract structured data from unstructured documents, covering schema design, chunking strategies, and production reliability patterns.

From Unstructured to Structured at Scale

Every enterprise sits on mountains of unstructured data: contracts, invoices, medical records, research papers, emails, support tickets. Extracting structured information from these documents has traditionally required custom NLP pipelines, regex patterns, and domain-specific models for each document type.

LLMs have changed this: a single model can extract structured data from most document types given little more than a schema and a prompt. Doing this reliably at scale, however, still requires careful architecture.

The Basic Extraction Pattern

At its simplest, LLM-based extraction involves sending a document with a schema and asking the model to populate it:

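import anthropic
from pydantic import BaseModel, Field
from typing import Optional

# LineItem is defined first because InvoiceData references it
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str = Field(description="ISO 8601 format")
    due_date: Optional[str] = None
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    currency: str = Field(default="USD")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Force a tool call so the model returns arguments that follow the schema.
# document_text is assumed to hold the plain text of the document.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,  # required by the Messages API
    system="Extract invoice data from the provided document. "
           "Return ONLY data explicitly stated in the document.",
    messages=[{"role": "user", "content": document_text}],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    tools=[{
        "name": "extract_invoice",
        "description": "Extract structured invoice data",
        "input_schema": InvoiceData.model_json_schema()
    }]
)

# The forced tool call arrives as a tool_use block; validate its input
tool_use = next(block for block in response.content if block.type == "tool_use")
invoice = InvoiceData.model_validate(tool_use.input)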

Chunking Strategies for Long Documents

Documents that exceed the model's context window (or are too expensive to process whole) need chunking. But naive chunking breaks extraction because relevant information may span chunk boundaries.

Sliding Window with Overlap:

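def chunk_document(text, chunk_size=3000, overlap=500):
    # Sizes are in characters; overlap must be smaller than chunk_size,
    # otherwise the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid a redundant trailing chunk
        start = end - overlap
    return chunks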

Section-Aware Chunking: Parse the document structure first (headings, tables, paragraphs) and chunk at logical boundaries. This preserves the semantic integrity of each chunk.
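
A minimal sketch of section-aware chunking, assuming headings are either Markdown-style (`# Title`) or ALL-CAPS lines. The heading pattern and the names `split_sections` and `chunk_by_section` are illustrative assumptions; real documents usually need a format-specific parser.

```python
import re

# Assumed heading convention: "# Title" lines or ALL-CAPS lines
HEADING = re.compile(r"^(#{1,6} .+|[A-Z][A-Z0-9 ,&/-]{3,})$", re.MULTILINE)

def split_sections(text: str) -> list[str]:
    """Split text into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]

def chunk_by_section(text: str, max_chars: int = 3000) -> list[str]:
    """Pack whole sections into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for section in split_sections(text):
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = section  # an oversized section becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Sections longer than `max_chars` are kept whole here; in practice they could be passed through a sliding window as a fallback.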

Two-Pass Extraction: First pass identifies which sections contain relevant information. Second pass extracts from only those sections.
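
The two passes can be sketched as a filter followed by targeted extraction. Both model calls are replaced by hypothetical stand-ins here (`keyword_filter` and `toy_extract`) so the control flow is runnable without an API; in a real pipeline each would be an LLM call.

```python
from typing import Callable

def two_pass_extract(
    sections: list[str],
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], dict],
) -> dict:
    """Run extract only on sections the first pass flags as relevant."""
    relevant = [s for s in sections if is_relevant(s)]  # pass 1: cheap filter
    merged: dict = {}
    for section in relevant:                            # pass 2: extraction
        for key, value in extract(section).items():
            merged.setdefault(key, value)  # first value found wins
    return merged

# Stand-in for the first-pass model call (assumption: keyword match)
def keyword_filter(section: str) -> bool:
    return any(k in section.lower() for k in ("total", "invoice", "due"))

# Stand-in for the second-pass model call (assumption: "Key: value" lines)
def toy_extract(section: str) -> dict:
    fields = {}
    for line in section.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields
```

The saving comes from the second, more expensive call seeing only the sections that passed the filter.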

Handling Multi-Page Documents

For complex documents like contracts or medical records:

  1. Page-level extraction: Extract data from each page independently
  2. Merge and deduplicate: Combine results across pages, resolving conflicts
  3. Cross-reference validation: Check extracted values for consistency (e.g., does the sum of line items equal the total?)

import asyncio
from pydantic import BaseModel

# extract_page, merge_extractions, validate_extraction, and resolve_conflicts
# are application-specific helpers that are not shown here.
async def extract_from_document(pages: list[str], schema: type[BaseModel]):
    # Extract from each page in parallel
    page_results = await asyncio.gather(*[
        extract_page(page, schema) for page in pages
    ])

    # Merge results with conflict resolution
    merged = merge_extractions(page_results, strategy="highest_confidence")

    # Validate consistency
    validation_errors = validate_extraction(merged)
    if validation_errors:
        # Re-extract with targeted prompts for inconsistent fields
        merged = await resolve_conflicts(merged, validation_errors, pages)

    return merged
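
One way the merge step could work under a highest-confidence strategy is sketched below. The `FieldResult` type and the function `merge_highest_confidence` are assumptions made for illustration, not the helper called in the snippet above; the sketch also records which fields had conflicting values so they can be re-extracted or reviewed.

```python
from dataclasses import dataclass

@dataclass
class FieldResult:
    value: str
    confidence: float
    page: int

def merge_highest_confidence(
    page_results: list[dict[str, FieldResult]],
) -> tuple[dict[str, FieldResult], dict[str, set[str]]]:
    """Keep the most confident value per field and collect conflicts."""
    merged: dict[str, FieldResult] = {}
    conflicts: dict[str, set[str]] = {}
    for result in page_results:
        for name, candidate in result.items():
            best = merged.get(name)
            if best is None:
                merged[name] = candidate
                continue
            if candidate.value != best.value:
                # Remember every distinct value seen for this field
                conflicts.setdefault(name, {best.value}).add(candidate.value)
            if candidate.confidence > best.confidence:
                merged[name] = candidate
    return merged, conflicts
```

Keeping the page number on each result makes it possible to send only the relevant pages back for targeted re-extraction.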

Quality Assurance Patterns

Confidence Scoring

Ask the model to rate its confidence for each extracted field:

class ExtractedField(BaseModel):
    value: str
    confidence: float = Field(ge=0, le=1, description="Extraction confidence 0-1")
    source_text: str = Field(description="Exact text from document supporting this value")

Route low-confidence extractions to human review.
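
A sketch of that routing step follows. Because self-reported scores are not guaranteed to be calibrated, it also checks that the quoted `source_text` actually appears in the document. The 0.8 threshold and the names `ExtractedValue`, `needs_review`, and `route` are assumptions; a plain dataclass stands in for the Pydantic model so the sketch runs with the standard library only.

```python
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    value: str
    confidence: float
    source_text: str

def needs_review(field: ExtractedValue, document_text: str,
                 threshold: float = 0.8) -> bool:
    """Flag a field if confidence is low or its quote is not in the document."""
    # Collapse whitespace so line breaks do not defeat the substring check
    normalized_doc = " ".join(document_text.split())
    normalized_src = " ".join(field.source_text.split())
    grounded = bool(normalized_src) and normalized_src in normalized_doc
    return field.confidence < threshold or not grounded

def route(fields: dict[str, ExtractedValue], document_text: str,
          threshold: float = 0.8):
    """Split fields into auto-accepted and human-review groups."""
    accepted, review = {}, {}
    for name, field in fields.items():
        target = review if needs_review(field, document_text, threshold) else accepted
        target[name] = field
    return accepted, review
```

The threshold is best tuned against a labeled sample rather than chosen up front.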

Dual Extraction

Run extraction twice (potentially with different models or prompts) and compare results. Disagreements flag potential errors:

  • Both agree: high confidence, auto-accept
  • One extraction has the field, other does not: medium confidence, review if critical
  • Both have different values: low confidence, always route to human review
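
The three rules above can be sketched as a field-by-field comparison of two extraction results. The decision labels and the function name `compare_extractions` are illustrative assumptions.

```python
def compare_extractions(a: dict, b: dict) -> dict[str, str]:
    """Compare two extraction results and assign a routing decision per field."""
    decisions = {}
    for name in sorted(set(a) | set(b)):
        left, right = a.get(name), b.get(name)
        if left is None or right is None:
            # Field missing (or null) in at least one run: medium confidence
            decisions[name] = "review_if_critical"
        elif left == right:
            decisions[name] = "auto_accept"    # both runs agree
        else:
            decisions[name] = "human_review"   # runs disagree
    return decisions
```

Exact equality is a strict test; normalizing dates, currency formats, and whitespace before comparing reduces false disagreements.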

Schema Validation

Use Pydantic validators to catch impossible values:

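from pydantic import BaseModel, model_validator

# LineItem is the model defined in the basic extraction example
class InvoiceData(BaseModel):
    total: float
    line_items: list[LineItem]

    # A model-level validator runs after every field is parsed. A field
    # validator on 'total' would run before line_items is available,
    # so the check would be silently skipped.
    @model_validator(mode="after")
    def total_matches_line_items(self):
        # If tax is itemized separately, compare against the subtotal instead
        expected = sum(item.total for item in self.line_items)
        if abs(self.total - expected) > 0.01:
            raise ValueError(f"Total {self.total} doesn't match sum of line items {expected}")
        return self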

Production Architecture

A production document processing pipeline typically looks like:

Document Upload -> OCR (if scanned) -> Text Extraction
    -> Classification (what type of document?)
    -> Schema Selection (which extraction schema to use?)
    -> Chunking -> Parallel Extraction -> Merge -> Validation
    -> Confidence Routing:
        High confidence -> Auto-accept -> Database
        Low confidence -> Human Review Queue -> Database
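
The classification, schema selection, and confidence routing stages can be sketched as follows. OCR, chunking, and merging are omitted, and `classify_by_keyword` and `toy_invoice_extractor` are hypothetical stand-ins for model calls; the 0.8 threshold is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# An extractor returns the extracted data plus an overall confidence score
Extractor = Callable[[str], tuple[dict, float]]

@dataclass
class PipelineResult:
    doc_type: str
    data: dict
    destination: str

def process_document(text: str, classify: Callable[[str], str],
                     extractors: dict[str, Extractor],
                     threshold: float = 0.8) -> PipelineResult:
    doc_type = classify(text)                # Classification
    extractor = extractors.get(doc_type)     # Schema selection
    if extractor is None:
        # No schema registered for this type: send to a person
        return PipelineResult(doc_type, {}, "human_review_queue")
    data, confidence = extractor(text)       # Extraction
    destination = "database" if confidence >= threshold else "human_review_queue"
    return PipelineResult(doc_type, data, destination)

# Stand-ins for the model-backed stages
def classify_by_keyword(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "unknown"

def toy_invoice_extractor(text: str) -> tuple[dict, float]:
    return {"raw_text": text}, 0.9
```

Registering extractors in a dictionary keyed by document type keeps schema selection a simple lookup as new document types are added.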

Cost Optimization

Document extraction can be expensive at scale. Optimize by:

  • Using cheaper models (for example Claude Haiku or GPT-4o mini) for classification and simple extractions
  • Reserving expensive models for complex documents or low-confidence re-extraction
  • Caching extraction results for identical documents (hash-based dedup)
  • Batch processing during off-peak hours for non-urgent documents
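
Hash-based deduplication can be sketched with a content hash as the cache key. The `ExtractionCache` class is a hypothetical in-memory example; a production system would more likely back it with a database or key-value store.

```python
import hashlib
from typing import Callable

class ExtractionCache:
    """In-memory cache keyed by schema name and document content hash."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.misses = 0

    @staticmethod
    def key(document_bytes: bytes, schema_name: str) -> str:
        # Include the schema name so one document can be cached per schema
        digest = hashlib.sha256(document_bytes).hexdigest()
        return f"{schema_name}:{digest}"

    def get_or_extract(self, document_bytes: bytes, schema_name: str,
                       extract: Callable[[bytes], dict]) -> dict:
        cache_key = self.key(document_bytes, schema_name)
        if cache_key not in self._store:
            self.misses += 1  # only cache misses pay for a model call
            self._store[cache_key] = extract(document_bytes)
        return self._store[cache_key]
```

Hashing the raw bytes only catches byte-identical uploads; hashing normalized text instead would also catch re-exports of the same document, at the cost of running text extraction first.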

Sources: Anthropic Structured Output | LlamaIndex Document Processing | Unstructured.io

Written by

CallSphere Team
