
Building an Image Analysis Agent: OCR, Object Detection, and Visual QA

Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting.

What an Image Analysis Agent Does

An image analysis agent accepts an image and a natural language question, then uses a combination of computer vision tools — OCR, object detection, and visual question answering — to produce a structured answer. Unlike a simple API call to a vision model, an agent can decide which tools to apply based on the question, chain multiple analysis steps, and format results according to the user's needs.

Setting Up the Vision Toolbox

The agent needs three core capabilities. Start by installing the dependencies:

pip install openai pillow pytesseract ultralytics

Note that pytesseract is only a Python wrapper: the Tesseract binary itself must be installed separately (for example with apt install tesseract-ocr on Debian/Ubuntu or brew install tesseract on macOS).

Each tool serves a distinct purpose:

  • OCR (Tesseract) — extracts text from images, useful for documents, signs, and labels
  • Object Detection (YOLO) — identifies and locates objects with bounding boxes
  • Visual QA (GPT-4o) — answers open-ended questions about image content

Image Preprocessing Pipeline

Raw images often need preprocessing before analysis. Resizing, normalization, and format conversion improve accuracy across all tools:

from PIL import Image, ImageEnhance, ImageFilter
import io


def preprocess_image(
    image_bytes: bytes,
    max_dimension: int = 2048,
    enhance_for_ocr: bool = False,
) -> Image.Image:
    """Preprocess an image for analysis."""
    img = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize if too large (preserves aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Enhance for OCR: sharpen and increase contrast
    if enhance_for_ocr:
        img = img.filter(ImageFilter.SHARPEN)
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.5)

    return img
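A quick way to sanity-check the resize rule is to run it on a synthetic Pillow image; the 4000×3000 size below is just an illustrative stand-in for a large photo, and the snippet inlines the same resize logic so it runs on its own:

```python
from PIL import Image
import io

# Synthetic 4000x3000 stand-in for a large upload, round-tripped through bytes
buf = io.BytesIO()
Image.new("RGB", (4000, 3000), "white").save(buf, format="PNG")
img = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")

# Same rule as preprocess_image: cap the longest side at 2048 pixels
max_dimension = 2048
if max(img.size) > max_dimension:
    ratio = max_dimension / max(img.size)
    img = img.resize(
        (int(img.width * ratio), int(img.height * ratio)), Image.LANCZOS
    )

print(img.size)  # (2048, 1536) -- aspect ratio preserved
```

Because both dimensions are scaled by the same ratio, the aspect ratio survives the resize exactly.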

Building the OCR Tool

Tesseract handles text extraction. Wrap it as an agent tool with structured output:

import pytesseract
from dataclasses import dataclass


@dataclass
class OCRResult:
    full_text: str
    confidence: float
    word_count: int
    blocks: list[dict]


def extract_text(img: Image.Image) -> OCRResult:
    """Extract text from an image using Tesseract OCR."""
    # Get detailed data including confidence scores
    data = pytesseract.image_to_data(
        img, output_type=pytesseract.Output.DICT
    )

    words = []
    confidences = []
    for i, text in enumerate(data["text"]):
        conf = int(float(data["conf"][i]))  # conf may arrive as a float string
        if conf > 0 and text.strip():
            words.append(text.strip())
            confidences.append(conf)

    full_text = " ".join(words)
    avg_confidence = (
        sum(confidences) / len(confidences) if confidences else 0.0
    )

    # Build text blocks by grouping lines
    blocks = []
    current_block = []
    current_block_num = -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})

    return OCRResult(
        full_text=full_text,
        confidence=avg_confidence,
        word_count=len(words),
        blocks=blocks,
    )

Object Detection with YOLO

The YOLO model identifies objects and their locations within an image:

from ultralytics import YOLO


@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2


def detect_objects(
    img: Image.Image, confidence_threshold: float = 0.5
) -> list[DetectedObject]:
    """Detect objects in an image using YOLOv8."""
    model = YOLO("yolov8n.pt")  # nano model for speed; load once at module level in production to avoid reloading per call
    results = model(img, verbose=False)

    detected = []
    for result in results:
        for box in result.boxes:
            conf = float(box.conf[0])
            if conf >= confidence_threshold:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                label = result.names[int(box.cls[0])]
                detected.append(DetectedObject(
                    label=label,
                    confidence=round(conf, 3),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                ))
    return detected

The Agent: Routing Questions to Tools

The agent decides which tools to use based on the user's question. A keyword-based router works well for most cases:

import openai
import base64


class ImageAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    def _select_tools(self, question: str) -> list[str]:
        """Select which tools to run based on the question."""
        q = question.lower()
        tools = []
        if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
            tools.append("ocr")
        if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
            tools.append("detection")
        # Always include VQA as the reasoning backbone
        tools.append("vqa")
        return tools

    async def analyze(
        self, image_bytes: bytes, question: str
    ) -> dict:
        selected_tools = self._select_tools(question)
        context_parts = []

        img = preprocess_image(image_bytes)

        if "ocr" in selected_tools:
            ocr_result = extract_text(
                preprocess_image(image_bytes, enhance_for_ocr=True)
            )
            context_parts.append(
                f"OCR extracted text ({ocr_result.word_count} words, "
                f"confidence {ocr_result.confidence:.1f}%): "
                f"{ocr_result.full_text}"
            )

        if "detection" in selected_tools:
            objects = detect_objects(img)
            obj_summary = ", ".join(
                f"{o.label} ({o.confidence:.0%})" for o in objects
            )
            context_parts.append(
                f"Detected objects: {obj_summary or 'none'}"
            )

        # VQA with GPT-4o, enriched by tool outputs
        b64 = base64.b64encode(image_bytes).decode()  # assumes PNG bytes; match the data-URL MIME type to the real format
        tool_context = "\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Tool analysis results:\n{tool_context}\n\n"
                            f"Question: {question}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }],
        )
        return {
            "answer": response.choices[0].message.content,
            "tools_used": selected_tools,
        }

Structured Output Formatting

For programmatic consumers, format the analysis results as structured JSON:

from pydantic import BaseModel


class ImageAnalysisResult(BaseModel):
    answer: str
    extracted_text: str | None = None
    detected_objects: list[dict] | None = None
    tools_used: list[str]
    confidence: float

FAQ

When should I use OCR versus a vision language model for text extraction?

Use Tesseract OCR when you need precise character-level extraction from clean documents, invoices, or printed text. Use a vision language model like GPT-4o when the text is embedded in complex scenes, handwritten, or when you also need to understand the context around the text. For best results, run both and let the agent cross-reference the outputs.
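One simple way to cross-reference the two sources is a confidence-gated fallback; the helper and the 80% cutoff below are illustrative choices, not part of either library:

```python
def reconcile_text(ocr_text: str, vlm_text: str, ocr_confidence: float) -> str:
    """Prefer Tesseract output on clean, high-confidence scans; otherwise
    fall back to the vision model's reading. 80.0 is an arbitrary cutoff."""
    if ocr_confidence >= 80.0 and ocr_text.strip():
        return ocr_text
    return vlm_text


print(reconcile_text("Invoice #1042", "Invoice 1042", 91.2))   # keeps OCR text
print(reconcile_text("Invo1ce #l042", "Invoice #1042", 41.7))  # falls back to VLM
```

A production agent might instead merge the two at the word level, but even this coarse gate avoids the worst failure mode of each tool.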

How do I handle images that are too large for the API?

Resize images to a maximum dimension of 2048 pixels while preserving the aspect ratio, as shown in the preprocessing function. For GPT-4o specifically, the API automatically handles resizing, but sending smaller images reduces latency and cost. If detail is critical for a specific region, crop that region and send it as a separate analysis request.

Can this agent process multiple images in a single request?

Yes. Extend the analyze method to accept a list of image bytes. Process each image independently through the tool pipeline, then send all results along with all images to the VQA step. GPT-4o supports multiple images in a single message, so the reasoning model can compare and cross-reference across images.
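Extending the single-image message from the agent above, a multi-image payload can be assembled as one text part followed by one image part per image; the helper name is my own, and the snippet assumes PNG bytes:

```python
import base64


def build_multi_image_content(
    question: str, tool_context: str, images: list[bytes]
) -> list[dict]:
    """Build the content list for one user message carrying several images."""
    # One text part with the question plus accumulated tool context
    content = [{
        "type": "text",
        "text": f"Tool analysis results:\n{tool_context}\n\nQuestion: {question}",
    }]
    # Then one image_url part per image (adjust the MIME type for non-PNG input)
    for image_bytes in images:
        b64 = base64.b64encode(image_bytes).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return content


content = build_multi_image_content(
    "Which photo shows a receipt?", "Detected objects: none", [b"img-a", b"img-b"]
)
print(len(content))  # 3: one text part + two image parts
```

The returned list drops into the same messages structure the analyze method already uses.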


#ImageAnalysis #OCR #ObjectDetection #VisualQA #ComputerVision #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
