---
title: "Claude Vision: Building Multi-Modal Agents That Understand Images and Documents"
description: "Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use."
canonical: https://callsphere.ai/blog/claude-vision-building-multi-modal-agents-images-documents
category: "Learn Agentic AI"
tags: ["Anthropic", "Claude", "Vision", "Multi-Modal", "Document Analysis"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.555Z
---

# Claude Vision: Building Multi-Modal Agents That Understand Images and Documents

> Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use.

## Claude's Vision Capabilities

Claude can process images as part of its input, enabling agents that understand screenshots, photographs, diagrams, charts, documents, and handwritten text. This is not a separate vision API — images are simply another content type within the standard messages API, meaning you can combine vision with tool use, system prompts, and multi-turn conversations seamlessly.

Claude's vision excels at understanding context within images: reading text, interpreting charts, describing scenes, analyzing UI layouts, and extracting structured data from documents. This makes it particularly powerful for document processing agents, QA testing agents, and data extraction workflows.

## Sending Images via Base64

The most common approach is encoding images as base64 and including them in the message content:


```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_data = Path("screenshot.png").read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot. Identify any UI elements and their states."
                }
            ]
        }
    ]
)

print(message.content[0].text)
```

The `content` field accepts a list of content blocks — you can mix text and image blocks freely within a single message. Supported image formats include PNG, JPEG, GIF, and WebP.
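Since the `media_type` must match the file you send, a small helper can map file suffixes to the four supported media types and build a reusable image block. The function name and error handling below are illustrative assumptions, not part of the SDK:

```python
import base64
from pathlib import Path

# Maps file suffixes to the image media types Claude's API accepts.
SUPPORTED_MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def image_block_from_file(path: str) -> dict:
    """Build a base64 image content block from a local file."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {suffix}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": SUPPORTED_MEDIA_TYPES[suffix],
            "data": data,
        },
    }
```

Rejecting unsupported suffixes up front gives you a clear local error instead of an API-side validation failure.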

## Sending Images via URL

For publicly accessible images, you can provide a URL directly:

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all data points from this bar chart and return them as a JSON array."
                }
            ]
        }
    ]
)

print(message.content[0].text)
```

URL-based images avoid the overhead of base64 encoding and reduce request payload size, making them preferable when the image is already hosted.
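One convenient pattern is a builder that emits a `url` source for anything that looks like an http(s) reference and falls back to base64 for local paths. The helper below is a sketch; its name and default media type are assumptions:

```python
import base64
from pathlib import Path

def image_block(ref: str, media_type: str = "image/png") -> dict:
    """Build an image content block from a public URL or a local file path."""
    if ref.startswith(("http://", "https://")):
        # Hosted image: let the API fetch it, keeping the request payload small
        return {"type": "image", "source": {"type": "url", "url": ref}}
    # Local file: fall back to base64 encoding
    data = base64.standard_b64encode(Path(ref).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

With this in place, calling code can pass either `"https://example.com/chart.png"` or `"screenshot.png"` without branching.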

## Building a Document Analysis Agent

Combine vision with structured output for production document processing:

```python
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_invoice(image_path: str) -> dict:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Determine media type
    suffix = Path(image_path).suffix.lower()
    media_types = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
    media_type = media_types.get(suffix, "image/png")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""You are an invoice processing agent. Extract structured data from invoice images.
Always return valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- line_items: array of {description, quantity, unit_price, total}
- subtotal: number
- tax: number
- total: number""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image
                        }
                    },
                    {
                        "type": "text",
                        "text": "Extract all data from this invoice and return it as JSON."
                    }
                ]
            }
        ]
    )

    # Models sometimes wrap JSON in markdown code fences; strip them first
    text = message.content[0].text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

result = analyze_invoice("sample_invoice.png")
print(json.dumps(result, indent=2))
```

This pattern works for invoices, receipts, forms, business cards, and any structured document. The system prompt defines the exact output schema, and Claude extracts the relevant fields from the image.
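Because extraction can occasionally drift (a misread digit, a dropped line item), production pipelines typically validate the returned JSON before storing it. A minimal check against the schema defined in the system prompt above might look like this; the function name and tolerance are assumptions:

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted invoice dict."""
    problems = []
    required = ["vendor_name", "invoice_number", "date", "line_items",
                "subtotal", "tax", "total"]
    for field in required:
        if field not in data:
            problems.append(f"missing field: {field}")
    if not problems:
        # Cross-check the arithmetic the model extracted
        items_total = sum(item["total"] for item in data["line_items"])
        if abs(items_total - data["subtotal"]) > tolerance:
            problems.append(
                f"line items sum to {items_total}, subtotal is {data['subtotal']}"
            )
        if abs(data["subtotal"] + data["tax"] - data["total"]) > tolerance:
            problems.append("subtotal + tax does not equal total")
    return problems
```

An empty list means the extraction passed; anything else can be routed to a retry or a human review queue.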

## Multi-Image Analysis

Claude can process multiple images in a single request, enabling comparison tasks:

```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> dict:
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    suffix = Path(path).suffix.lower()
    media_type = "image/jpeg" if suffix in [".jpg", ".jpeg"] else "image/png"
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("design_v1.png"),
                encode_image("design_v2.png"),
                {
                    "type": "text",
                    "text": "Compare these two UI designs. List specific differences in layout, color, typography, and component placement."
                }
            ]
        }
    ]
)

print(message.content[0].text)
```

This is powerful for UI regression testing, before/after comparisons, and visual QA agents that need to spot differences between designs and implementations.
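When a request contains several images, prefixing each one with a short text label such as "Image 1:" makes it easier for the model to refer to them unambiguously in its answer. A small content builder can do the interleaving; the helper name is an assumption:

```python
def build_comparison_content(image_blocks: list[dict], prompt: str) -> list[dict]:
    """Interleave 'Image N:' labels with image blocks, then append the prompt."""
    content = []
    for i, block in enumerate(image_blocks, start=1):
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append(block)
    content.append({"type": "text", "text": prompt})
    return content
```

The returned list drops straight into the `content` field of a user message.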

## Vision Combined with Tool Use

The most powerful pattern is combining vision with tools so the agent can see and act:

```python
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket for a UI bug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
            },
            "required": ["title", "description", "severity"]
        }
    }
]

image_data = base64.standard_b64encode(Path("bug_screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    tools=tools,
    system="You are a QA agent. Analyze screenshots for bugs and file tickets for any issues found.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Review this screenshot and file tickets for any visual bugs you find."}
            ]
        }
    ]
)
```

The response contains Claude's requested tool calls. Add a loop that executes each call and returns the results, and this becomes a QA agent that looks at a screenshot, identifies visual bugs, and files tickets automatically: a complete vision-to-action pipeline.
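To close that loop, the agent must execute each `tool_use` block and send back a `tool_result` in the next turn. A minimal dispatcher sketch, with a stubbed `create_jira_ticket` (the stub and handler names are assumptions, not a real Jira integration):

```python
import json

def create_jira_ticket(title: str, description: str, severity: str) -> dict:
    """Stub: a real implementation would call the Jira REST API."""
    return {"ticket_id": "QA-101", "title": title, "severity": severity}

TOOL_HANDLERS = {"create_jira_ticket": create_jira_ticket}

def handle_tool_use(name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Execute a requested tool call and wrap the result for the next API turn."""
    result = TOOL_HANDLERS[name](**tool_input)
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(result),
    }
```

In the loop, you would call `handle_tool_use(block.name, block.input, block.id)` for each `tool_use` block in `response.content`, then send the resulting `tool_result` blocks back in a `user` message to continue the conversation.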

## FAQ

### What is the maximum image size Claude can process?

Claude accepts images up to approximately 20 megapixels. For larger images, resize before sending. The API also has a payload size limit, so very large base64-encoded images may need compression. In practice, most screenshots and document scans work without any preprocessing.
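If you do need to resize, the target dimensions can be computed with a proportional scale factor. The sketch below handles only the arithmetic (the 20-megapixel constant is an approximation of the limit); you would pair it with an image library such as Pillow for the actual resize:

```python
import math

MAX_PIXELS = 20_000_000  # ~20 megapixels, approximate API limit

def fit_dimensions(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Scale (width, height) down proportionally so the pixel count fits."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    return max(1, int(width * scale)), max(1, int(height * scale))
```

Scaling both axes by the square root of the pixel ratio preserves the aspect ratio while landing just under the cap.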

### Can Claude read PDFs directly?

Claude supports PDF input via a `document` content block: base64-encode the file and set `media_type: "application/pdf"`. You can send multi-page PDFs and Claude will analyze all pages. For very long documents, consider splitting into page ranges and processing them separately to stay within token limits.
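A PDF is sent much like an image, except the content block type is `document`. A small builder sketch (the function name is an assumption):

```python
import base64
from pathlib import Path

def pdf_block_from_file(path: str) -> dict:
    """Build a document content block from a local PDF file."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": data,
        },
    }
```

The resulting block goes into a user message's `content` list alongside a text block, exactly like the image examples above.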

### How accurate is Claude's OCR compared to dedicated OCR tools?

Claude's text extraction from images is remarkably accurate for printed text, typed documents, and clean handwriting. For degraded images, unusual fonts, or historical documents, a dedicated OCR tool like Tesseract or Google Vision may perform better. Many production systems use a hybrid approach: OCR for raw text extraction, then Claude for understanding and structuring the extracted content.

---

#Anthropic #Claude #Vision #MultiModal #DocumentAnalysis #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/claude-vision-building-multi-modal-agents-images-documents
