Learn Agentic AI

Claude PDF and Document Analysis Agent: Processing Complex Documents at Scale

Build a document analysis agent that uploads PDFs to Claude, performs page-level analysis, extracts tables and structured data, and compares information across multiple documents.

Claude's Native PDF Understanding

Claude can process PDF documents directly through the Messages API. Rather than converting PDFs to text first (losing formatting, tables, and layout information), Claude analyzes the rendered pages as images while simultaneously processing any embedded text. This dual understanding — visual layout plus textual content — makes it exceptionally capable at extracting structured data from complex documents.

This capability is particularly valuable for contracts, financial reports, research papers, invoices, and any document where layout carries meaning.

Uploading PDFs to Claude

PDFs are sent as base64-encoded content in the message:

import anthropic
import base64

client = anthropic.Anthropic()

def analyze_pdf(file_path: str, question: str) -> str:
    with open(file_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {
                    "type": "text",
                    "text": question,
                }
            ]
        }]
    )
    return response.content[0].text

Claude processes each page of the PDF, understanding both the text content and the visual layout. This means it can correctly interpret tables, charts, headers, footnotes, and multi-column layouts.
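Because the PDF is base64-encoded inline, request size matters: base64 inflates the payload by about a third, and requests have a hard size cap (roughly 32 MB at the time of writing; the constant below is an assumption you should verify against the current docs). A small pre-flight check avoids failed uploads:

```python
import math

# Assumed request size limit for inline PDF uploads (~32 MB); verify against current docs.
MAX_REQUEST_BYTES = 32 * 1024 * 1024

def base64_size(raw_bytes: int) -> int:
    """Size in bytes after base64 encoding: 4 output characters per 3 input bytes."""
    return 4 * math.ceil(raw_bytes / 3)

def fits_in_request(pdf_bytes: bytes) -> bool:
    """True if the encoded PDF stays under the assumed request limit."""
    return base64_size(len(pdf_bytes)) <= MAX_REQUEST_BYTES
```

Run this on the raw file bytes before encoding; if it fails, split the document first.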

Page-Level Analysis

For large documents, you often need targeted answers about specific sections or page ranges rather than one global summary. A simple pattern sends the same PDF several times, with one focused question per call:

def analyze_pages(file_path: str, analyses: list[dict]) -> list[dict]:
    """Run multiple analyses on a single PDF."""
    with open(file_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    results = []
    for analysis in analyses:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data,
                        }
                    },
                    {
                        "type": "text",
                        "text": analysis["question"],
                    }
                ]
            }]
        )
        results.append({
            "analysis": analysis["name"],
            "result": response.content[0].text
        })
    return results

# Usage
results = analyze_pages("annual_report.pdf", [
    {"name": "financial_summary", "question": "Extract all revenue figures, costs, and profit margins from the financial statements."},
    {"name": "risk_factors", "question": "List all risk factors mentioned in the document with their severity."},
    {"name": "key_metrics", "question": "What are the key performance indicators and their year-over-year changes?"},
])

Structured Data Extraction with Tools

Combine PDF analysis with tool use to extract structured data that can be programmatically processed:


extraction_tool = {
    "name": "extract_invoice_data",
    "description": "Extract structured data from an invoice document",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "invoice_date": {"type": "string", "description": "ISO format date"},
            "due_date": {"type": "string", "description": "ISO format date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    },
                    "required": ["description", "quantity", "unit_price", "total"]
                }
            },
            "subtotal": {"type": "number"},
            "tax": {"type": "number"},
            "total": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["vendor_name", "invoice_number", "invoice_date", "line_items", "total"]
    }
}

def extract_invoice(pdf_path: str) -> dict:
    with open(pdf_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_invoice_data"},
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {"type": "text", "text": "Extract all invoice data from this document."}
            ]
        }]
    )

    for block in response.content:
        if block.type == "tool_use":
            return block.input
    return {}

Forcing tool use with tool_choice guarantees structured JSON output that you can insert directly into a database or feed to a downstream system.
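Schema conformance does not guarantee arithmetic correctness, so it is worth cross-checking the numbers Claude extracted. A minimal consistency check over the tool output (field names match the schema above; the tolerance is an assumption for float rounding):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies found in extracted invoice data."""
    problems = []
    items = data.get("line_items", [])
    # Each line item's total should equal quantity * unit_price.
    for i, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(f"line {i}: total {item['total']} != {expected:.2f}")
    # The subtotal, if present, should equal the sum of line totals.
    if "subtotal" in data:
        line_sum = sum(item["total"] for item in items)
        if abs(line_sum - data["subtotal"]) > tolerance:
            problems.append(f"subtotal {data['subtotal']} != line sum {line_sum:.2f}")
    return problems
```

An empty list means the extraction is internally consistent; anything else should trigger a re-extraction or human review before the data reaches a database.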

Multi-Document Comparison

One of Claude's strongest capabilities is comparing information across multiple documents in a single conversation:

def compare_documents(pdf_paths: list[str], comparison_prompt: str) -> str:
    content = []

    for i, path in enumerate(pdf_paths):
        with open(path, "rb") as f:
            pdf_data = base64.standard_b64encode(f.read()).decode()

        content.append({
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_data,
            }
        })
        content.append({
            "type": "text",
            "text": f"The above is Document {i + 1}: {path}",
        })

    content.append({"type": "text", "text": comparison_prompt})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# Compare two contracts
result = compare_documents(
    ["contract_v1.pdf", "contract_v2.pdf"],
    "Compare these two contract versions. List every change including "
    "additions, deletions, and modifications to terms. Flag any changes "
    "that affect liability, payment terms, or termination clauses."
)

Scaling Document Processing

For batch document processing, combine PDF analysis with the Batches API:

def batch_analyze_pdfs(pdf_paths: list[str], question: str) -> str:
    requests = []
    for i, path in enumerate(pdf_paths):
        with open(path, "rb") as f:
            pdf_data = base64.standard_b64encode(f.read()).decode()

        requests.append({
            "custom_id": f"pdf-{i}-{path}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 2048,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": pdf_data,
                            }
                        },
                        {"type": "text", "text": question}
                    ]
                }]
            }
        })

    batch = client.messages.batches.create(requests=requests)
    return batch.id

This approach processes hundreds of PDFs at 50% of standard API pricing, and the Batches API manages throughput and rate limits for you. Keep a mapping from each custom_id to its source path so you can match results back to files.
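The create call returns immediately, so a second step waits for completion and gathers answers. A sketch assuming the SDK's batches.retrieve / batches.results methods and the result shapes documented for the Batches API (verify field names against the current SDK):

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: int = 30) -> dict:
    """Poll until the batch finishes, then return {custom_id: answer text or None}."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":  # assumed terminal status value
            break
        time.sleep(poll_seconds)
    return collect_results(client.messages.batches.results(batch_id))

def collect_results(results) -> dict:
    """Map each custom_id to its answer text, or None for failed requests."""
    out = {}
    for entry in results:
        if entry.result.type == "succeeded":
            out[entry.custom_id] = entry.result.message.content[0].text
        else:  # errored, canceled, or expired
            out[entry.custom_id] = None
    return out
```

collect_results is deliberately separate so failure handling (retries, logging) can evolve without touching the polling loop.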

FAQ

What is the maximum PDF size Claude can process?

Each PDF is converted to images internally. Claude can handle PDFs up to approximately 100 pages per request, though performance is optimal with shorter documents. For very large documents, split them into sections and process each section separately, then use a final synthesis step.
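The split-then-synthesize pattern can reuse analyze_pdf from earlier: summarize each section separately, then combine the summaries in one final call. Building that final prompt is pure string work (the wording here is just one reasonable template):

```python
def build_synthesis_prompt(section_summaries: list[str], question: str) -> str:
    """Combine per-section summaries into one prompt for a final synthesis call."""
    parts = [
        f"Section {i + 1} summary:\n{summary}"
        for i, summary in enumerate(section_summaries)
    ]
    parts.append(f"Using only the section summaries above, answer: {question}")
    return "\n\n".join(parts)
```

The synthesis call is then a plain text-only message, so it stays well under the size and page limits no matter how large the original document was.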

Can Claude extract data from scanned PDFs without OCR?

Yes. Because Claude processes PDF pages as images, it can read text from scanned documents directly, with no OCR preprocessing required. This works for most print-quality scans; very low-resolution or heavily distorted documents may need image enhancement first.

How accurate is table extraction from PDFs?

Claude's table extraction is highly accurate for standard table layouts — rows, columns, headers, and merged cells are handled well. Complex nested tables or tables that span multiple pages may require additional prompting to handle correctly. Always validate extracted numerical data against known totals when accuracy is critical.


#Claude #PDFProcessing #DocumentAnalysis #DataExtraction #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
