
Claude Vision: Building Multi-Modal Agents That Understand Images and Documents

Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use.

Claude's Vision Capabilities

Claude can process images as part of its input, enabling agents that understand screenshots, photographs, diagrams, charts, documents, and handwritten text. This is not a separate vision API — images are simply another content type within the standard messages API, meaning you can combine vision with tool use, system prompts, and multi-turn conversations seamlessly.

Claude's vision excels at understanding context within images: reading text, interpreting charts, describing scenes, analyzing UI layouts, and extracting structured data from documents. This makes it particularly powerful for document processing agents, QA testing agents, and data extraction workflows.

Sending Images via Base64

The most common approach is encoding images as base64 and including them in the message content:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_data = Path("screenshot.png").read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot. Identify any UI elements and their states."
                }
            ]
        }
    ]
)

print(message.content[0].text)

The content field accepts a list of content blocks — you can mix text and image blocks freely within a single message. Supported image formats include PNG, JPEG, GIF, and WebP.
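Because the media_type field must match the file's actual format, a small helper keeps the mapping in one place. A minimal sketch (the extension-to-type table is an assumption covering the four formats listed above; function and constant names are mine):

```python
from pathlib import Path

# Extension → media type for the image formats Claude accepts.
SUPPORTED_MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def media_type_for(path: str) -> str:
    """Return the media type for an image path, raising for unsupported formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {suffix}")
    return SUPPORTED_MEDIA_TYPES[suffix]
```

Failing fast on an unknown extension is preferable to silently defaulting to one media type, since a mismatched media_type can cause the API to reject the request.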

Sending Images via URL

For publicly accessible images, you can provide a URL directly:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all data points from this bar chart and return them as a JSON array."
                }
            ]
        }
    ]
)

print(message.content[0].text)

URL-based images avoid the overhead of base64 encoding and reduce request payload size, making them preferable when the image is already hosted.

Building a Document Analysis Agent

Combine vision with structured output for production document processing:

import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_invoice(image_path: str) -> dict:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Determine media type
    suffix = Path(image_path).suffix.lower()
    media_types = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
    media_type = media_types.get(suffix, "image/png")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""You are an invoice processing agent. Extract structured data from invoice images.
Always return valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- line_items: array of {description, quantity, unit_price, total}
- subtotal: number
- tax: number
- total: number""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image
                        }
                    },
                    {
                        "type": "text",
                        "text": "Extract all data from this invoice and return it as JSON."
                    }
                ]
            }
        ]
    )

    raw = message.content[0].text
    # Claude may wrap JSON in markdown fences; strip them before parsing
    if raw.strip().startswith("```"):
        raw = raw.strip().strip("`").removeprefix("json").strip()
    return json.loads(raw)

result = analyze_invoice("sample_invoice.png")
print(json.dumps(result, indent=2))

This pattern works for invoices, receipts, forms, business cards, and any structured document. The system prompt defines the exact output schema, and Claude extracts the relevant fields from the image.
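Model output is not guaranteed to match the schema, so production pipelines should validate the parsed result before using it downstream. A minimal sketch (field names follow the system prompt above; the tolerance value is an arbitrary choice for float comparison):

```python
REQUIRED_FIELDS = {
    "vendor_name", "invoice_number", "date",
    "line_items", "subtotal", "tax", "total",
}

def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of validation problems; an empty list means the invoice looks consistent."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - data.keys()]
    if not problems:
        # Cross-check the arithmetic the model extracted: line items should
        # sum to the subtotal, and subtotal + tax should equal the total.
        items_sum = sum(item["total"] for item in data["line_items"])
        if abs(items_sum - data["subtotal"]) > tolerance:
            problems.append(f"line items sum to {items_sum}, subtotal is {data['subtotal']}")
        if abs(data["subtotal"] + data["tax"] - data["total"]) > tolerance:
            problems.append("subtotal + tax does not equal total")
    return problems
```

Arithmetic cross-checks like these catch most extraction errors cheaply; invoices that fail validation can be routed to a human reviewer or retried.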

Multi-Image Analysis

Claude can process multiple images in a single request, enabling comparison tasks:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> dict:
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    suffix = Path(path).suffix.lower()
    media_type = "image/jpeg" if suffix in [".jpg", ".jpeg"] else "image/png"
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("design_v1.png"),
                encode_image("design_v2.png"),
                {
                    "type": "text",
                    "text": "Compare these two UI designs. List specific differences in layout, color, typography, and component placement."
                }
            ]
        }
    ]
)

print(message.content[0].text)

This is powerful for UI regression testing, before/after comparisons, and visual QA agents that need to spot differences between designs and implementations.

Vision Combined with Tool Use

The most powerful pattern is combining vision with tools so the agent can see and act:

import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket for a UI bug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
            },
            "required": ["title", "description", "severity"]
        }
    }
]

image_data = base64.standard_b64encode(Path("bug_screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    tools=tools,
    system="You are a QA agent. Analyze screenshots for bugs and file tickets for any issues found.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Review this screenshot and file tickets for any visual bugs you find."}
            ]
        }
    ]
)

This creates a QA agent that can look at a screenshot, identify visual bugs, and automatically file tickets — a complete vision-to-action pipeline.
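The response then needs a dispatch step: iterate over the content blocks, execute any create_jira_ticket calls, and collect the results. A minimal sketch (the real SDK returns blocks as objects with .type and .input attributes; plain dicts stand in here so the logic is easy to test, and file_ticket is a hypothetical stand-in for your actual Jira client):

```python
def dispatch_tool_calls(content_blocks: list[dict], file_ticket) -> list[dict]:
    """Execute every create_jira_ticket tool call found in a response's content."""
    filed = []
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "create_jira_ticket":
            # block["input"] already conforms to the tool's input_schema
            filed.append(file_ticket(**block["input"]))
    return filed
```

In a full agent loop you would then send each result back to Claude as a tool_result block so it can confirm the tickets were filed or continue analyzing.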

FAQ

What is the maximum image size Claude can process?

Claude accepts images up to approximately 20 megapixels. For larger images, resize before sending. The API also has a payload size limit, so very large base64-encoded images may need compression. In practice, most screenshots and document scans work without any preprocessing.
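Base64 inflates payloads by roughly a third, so it helps to estimate the encoded size before sending. A minimal sketch (the 5 MB budget below is an illustrative figure, not a documented API constant; check the current limits in the API docs):

```python
def base64_size(raw_bytes: int) -> int:
    # Base64 encodes every 3 raw bytes as 4 ASCII characters, padded to a multiple of 4
    return (raw_bytes + 2) // 3 * 4

def fits_payload(raw_bytes: int, limit: int = 5 * 1024 * 1024) -> bool:
    """Rough check that an image's base64 encoding stays under a payload budget."""
    return base64_size(raw_bytes) <= limit
```

Images that fail this check are candidates for resizing or recompressing before encoding.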

Can Claude read PDFs directly?

Claude supports PDF input via base64 encoding with media_type: "application/pdf". You can send multi-page PDFs and Claude will analyze all pages. For very long documents, consider splitting into page ranges and processing them separately to stay within token limits.
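The request shape mirrors the image block, with a document block in place of an image block. A minimal sketch of building one (the helper name is mine; verify the exact block shape against the current API documentation before relying on it):

```python
import base64
from pathlib import Path

def pdf_block(path: str) -> dict:
    """Build a base64 document content block for a PDF file."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": data},
    }
```

The returned dict slots into the same content list as image and text blocks, so a single message can mix a PDF with a follow-up instruction.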

How accurate is Claude's OCR compared to dedicated OCR tools?

Claude's text extraction from images is remarkably accurate for printed text, typed documents, and clean handwriting. For degraded images, unusual fonts, or historical documents, a dedicated OCR tool like Tesseract or Google Vision may perform better. Many production systems use a hybrid approach: OCR for raw text extraction, then Claude for understanding and structuring the extracted content.


#Anthropic #Claude #Vision #MultiModal #DocumentAnalysis #AgenticAI #LearnAI #AIEngineering
