
Claude Vision: Building Multi-Modal Agents That Understand Images and Documents

Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use.

Claude's Vision Capabilities

Claude can process images as part of its input, enabling agents that understand screenshots, photographs, diagrams, charts, documents, and handwritten text. This is not a separate vision API — images are simply another content type within the standard messages API, meaning you can combine vision with tool use, system prompts, and multi-turn conversations seamlessly.

Claude's vision excels at understanding context within images: reading text, interpreting charts, describing scenes, analyzing UI layouts, and extracting structured data from documents. This makes it particularly powerful for document processing agents, QA testing agents, and data extraction workflows.

Sending Images via Base64

The most common approach is encoding images as base64 and including them in the message content:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_data = Path("screenshot.png").read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot. Identify any UI elements and their states."
                }
            ]
        }
    ]
)

print(message.content[0].text)

The content field accepts a list of content blocks — you can mix text and image blocks freely within a single message. Supported image formats include PNG, JPEG, GIF, and WebP.
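Because the media_type field must match the file's actual format, a small helper keeps the mapping in one place. A minimal sketch (the extension-to-type table is an assumption covering the four formats listed above; function and constant names are mine):

```python
from pathlib import Path

# Extension → media type for the image formats Claude accepts.
SUPPORTED_MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def media_type_for(path: str) -> str:
    """Return the media type for an image path, raising for unsupported formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {suffix}")
    return SUPPORTED_MEDIA_TYPES[suffix]
```

Failing fast on an unknown extension is preferable to silently defaulting to one media type, since a mismatched media_type can cause the API to reject the request.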

Sending Images via URL

For publicly accessible images, you can provide a URL directly:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all data points from this bar chart and return them as a JSON array."
                }
            ]
        }
    ]
)

print(message.content[0].text)

URL-based images avoid the overhead of base64 encoding and reduce request payload size, making them preferable when the image is already hosted.

Building a Document Analysis Agent

Combine vision with structured output for production document processing:

import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_invoice(image_path: str) -> dict:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Determine media type
    suffix = Path(image_path).suffix.lower()
    media_types = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
    media_type = media_types.get(suffix, "image/png")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""You are an invoice processing agent. Extract structured data from invoice images.
Always return valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- line_items: array of {description, quantity, unit_price, total}
- subtotal: number
- tax: number
- total: number""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image
                        }
                    },
                    {
                        "type": "text",
                        "text": "Extract all data from this invoice and return it as JSON."
                    }
                ]
            }
        ]
    )

    raw = message.content[0].text
    # Claude may wrap JSON in markdown fences; strip them before parsing
    if raw.strip().startswith("```"):
        raw = raw.strip().strip("`").removeprefix("json").strip()
    return json.loads(raw)

result = analyze_invoice("sample_invoice.png")
print(json.dumps(result, indent=2))

This pattern works for invoices, receipts, forms, business cards, and any structured document. The system prompt defines the exact output schema, and Claude extracts the relevant fields from the image.
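Model output is not guaranteed to match the schema, so production pipelines should validate the parsed result before using it downstream. A minimal sketch (field names follow the system prompt above; the tolerance value is an arbitrary choice for float comparison):

```python
REQUIRED_FIELDS = {
    "vendor_name", "invoice_number", "date",
    "line_items", "subtotal", "tax", "total",
}

def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of validation problems; an empty list means the invoice looks consistent."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - data.keys()]
    if not problems:
        # Cross-check the arithmetic the model extracted: line items should
        # sum to the subtotal, and subtotal + tax should equal the total.
        items_sum = sum(item["total"] for item in data["line_items"])
        if abs(items_sum - data["subtotal"]) > tolerance:
            problems.append(f"line items sum to {items_sum}, subtotal is {data['subtotal']}")
        if abs(data["subtotal"] + data["tax"] - data["total"]) > tolerance:
            problems.append("subtotal + tax does not equal total")
    return problems
```

Arithmetic cross-checks like these catch most extraction errors cheaply; invoices that fail validation can be routed to a human reviewer or retried.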

Multi-Image Analysis

Claude can process multiple images in a single request, enabling comparison tasks:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> dict:
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    suffix = Path(path).suffix.lower()
    media_type = "image/jpeg" if suffix in [".jpg", ".jpeg"] else "image/png"
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("design_v1.png"),
                encode_image("design_v2.png"),
                {
                    "type": "text",
                    "text": "Compare these two UI designs. List specific differences in layout, color, typography, and component placement."
                }
            ]
        }
    ]
)

print(message.content[0].text)

This is powerful for UI regression testing, before/after comparisons, and visual QA agents that need to spot differences between designs and implementations.

Vision Combined with Tool Use

The most powerful pattern is combining vision with tools so the agent can see and act:

import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket for a UI bug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
            },
            "required": ["title", "description", "severity"]
        }
    }
]

image_data = base64.standard_b64encode(Path("bug_screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    tools=tools,
    system="You are a QA agent. Analyze screenshots for bugs and file tickets for any issues found.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Review this screenshot and file tickets for any visual bugs you find."}
            ]
        }
    ]
)

This creates a QA agent that can look at a screenshot, identify visual bugs, and automatically file tickets — a complete vision-to-action pipeline.
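The response then needs a dispatch step: iterate over the content blocks, execute any create_jira_ticket calls, and collect the results. A minimal sketch (the real SDK returns blocks as objects with .type and .input attributes; plain dicts stand in here so the logic is easy to test, and file_ticket is a hypothetical stand-in for your actual Jira client):

```python
def dispatch_tool_calls(content_blocks: list[dict], file_ticket) -> list[dict]:
    """Execute every create_jira_ticket tool call found in a response's content."""
    filed = []
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "create_jira_ticket":
            # block["input"] already conforms to the tool's input_schema
            filed.append(file_ticket(**block["input"]))
    return filed
```

In a full agent loop you would then send each result back to Claude as a tool_result block so it can confirm the tickets were filed or continue analyzing.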

FAQ

What is the maximum image size Claude can process?

Claude accepts images up to approximately 20 megapixels. For larger images, resize before sending. The API also has a payload size limit, so very large base64-encoded images may need compression. In practice, most screenshots and document scans work without any preprocessing.
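Base64 inflates payloads by roughly a third, so it helps to estimate the encoded size before sending. A minimal sketch (the 5 MB budget below is an illustrative figure, not a documented API constant; check the current limits in the API docs):

```python
def base64_size(raw_bytes: int) -> int:
    # Base64 encodes every 3 raw bytes as 4 ASCII characters, padded to a multiple of 4
    return (raw_bytes + 2) // 3 * 4

def fits_payload(raw_bytes: int, limit: int = 5 * 1024 * 1024) -> bool:
    """Rough check that an image's base64 encoding stays under a payload budget."""
    return base64_size(raw_bytes) <= limit
```

Images that fail this check are candidates for resizing or recompressing before encoding.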

Can Claude read PDFs directly?

Claude supports PDF input via base64 encoding with media_type: "application/pdf". You can send multi-page PDFs and Claude will analyze all pages. For very long documents, consider splitting into page ranges and processing them separately to stay within token limits.
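The request shape mirrors the image block, with a document block in place of an image block. A minimal sketch of building one (the helper name is mine; verify the exact block shape against the current API documentation before relying on it):

```python
import base64
from pathlib import Path

def pdf_block(path: str) -> dict:
    """Build a base64 document content block for a PDF file."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": data},
    }
```

The returned dict slots into the same content list as image and text blocks, so a single message can mix a PDF with a follow-up instruction.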

How accurate is Claude's OCR compared to dedicated OCR tools?

Claude's text extraction from images is remarkably accurate for printed text, typed documents, and clean handwriting. For degraded images, unusual fonts, or historical documents, a dedicated OCR tool like Tesseract or Google Vision may perform better. Many production systems use a hybrid approach: OCR for raw text extraction, then Claude for understanding and structuring the extracted content.


#Anthropic #Claude #Vision #MultiModal #DocumentAnalysis #AgenticAI #LearnAI #AIEngineering
