---
title: "Building an Image Analysis Agent: OCR, Object Detection, and Visual QA"
description: "Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting."
canonical: https://callsphere.ai/blog/building-image-analysis-agent-ocr-object-detection-visual-qa
category: "Learn Agentic AI"
tags: ["Image Analysis", "OCR", "Object Detection", "Visual QA", "Computer Vision"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.702Z
---

# Building an Image Analysis Agent: OCR, Object Detection, and Visual QA

> Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting.

## What an Image Analysis Agent Does

An image analysis agent accepts an image and a natural language question, then uses a combination of computer vision tools — OCR, object detection, and visual question answering — to produce a structured answer. Unlike a simple API call to a vision model, an agent can decide which tools to apply based on the question, chain multiple analysis steps, and format results according to the user's needs.

## Setting Up the Vision Toolbox

The agent needs three core capabilities, wired together as shown below:

```mermaid
flowchart LR
    IN(["Image + question"])
    ROUTE["Keyword router
tool selection"]
    OCR["OCR
Tesseract"]
    DETECT["Object detection
YOLOv8"]
    VQA["Visual QA
GPT-4o"]
    OUT(["Structured answer"])
    IN --> ROUTE
    ROUTE --> OCR --> VQA
    ROUTE --> DETECT --> VQA
    ROUTE --> VQA
    VQA --> OUT
    style VQA fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

Start by installing the dependencies:

```bash
pip install openai pillow pytesseract ultralytics
```

Each tool serves a distinct purpose:

- **OCR (Tesseract)** — extracts text from images, useful for documents, signs, and labels
- **Object Detection (YOLO)** — identifies and locates objects with bounding boxes
- **Visual QA (GPT-4o)** — answers open-ended questions about image content

## Image Preprocessing Pipeline

Raw images often need preprocessing before analysis. Resizing, color-mode conversion, and contrast enhancement improve accuracy across all three tools:

```python
from PIL import Image, ImageEnhance, ImageFilter
import io

def preprocess_image(
    image_bytes: bytes,
    max_dimension: int = 2048,
    enhance_for_ocr: bool = False,
) -> Image.Image:
    """Preprocess an image for analysis."""
    img = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize if too large (preserves aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Enhance for OCR: sharpen and increase contrast
    if enhance_for_ocr:
        img = img.filter(ImageFilter.SHARPEN)
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.5)

    return img
```
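Before wiring `preprocess_image` into the agent, it helps to sanity-check the resize arithmetic. This standalone sketch restates the RGB-conversion and bounded-resize steps from the function above and runs them on a synthetic oversized image:

```python
from PIL import Image
import io

def preprocess(image_bytes: bytes, max_dimension: int = 2048) -> Image.Image:
    """Minimal restatement of the pipeline above: RGB conversion + bounded resize."""
    img = Image.open(io.BytesIO(image_bytes))
    if img.mode != "RGB":
        img = img.convert("RGB")
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        img = img.resize(
            (int(img.width * ratio), int(img.height * ratio)), Image.LANCZOS
        )
    return img

# Synthesize a 4000x3000 RGBA image in memory and round-trip it
buf = io.BytesIO()
Image.new("RGBA", (4000, 3000), (255, 0, 0, 255)).save(buf, format="PNG")
out = preprocess(buf.getvalue())
print(out.mode, out.size)  # RGB (2048, 1536)
```

The longest side lands exactly on `max_dimension` because the same ratio scales both axes, so aspect ratio is preserved.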

## Building the OCR Tool

Tesseract handles text extraction. Wrap it as an agent tool with structured output:

```python
import pytesseract
from PIL import Image
from dataclasses import dataclass

@dataclass
class OCRResult:
    full_text: str
    confidence: float
    word_count: int
    blocks: list[dict]

def extract_text(img: Image.Image) -> OCRResult:
    """Extract text from an image using Tesseract OCR."""
    # Get detailed data including confidence scores
    data = pytesseract.image_to_data(
        img, output_type=pytesseract.Output.DICT
    )

    words = []
    confidences = []
    for i, text in enumerate(data["text"]):
        # conf is -1 for non-word entries; cast via float, since some
        # Tesseract versions report fractional confidences as strings
        conf = float(data["conf"][i])
        if conf > 0 and text.strip():
            words.append(text.strip())
            confidences.append(conf)

    full_text = " ".join(words)
    avg_confidence = (
        sum(confidences) / len(confidences) if confidences else 0.0
    )

    # Build text blocks by grouping lines
    blocks = []
    current_block = []
    current_block_num = -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})

    return OCRResult(
        full_text=full_text,
        confidence=avg_confidence,
        word_count=len(words),
        blocks=blocks,
    )
```
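The grouping logic can be exercised without a Tesseract install by feeding it a hand-built payload in the same DICT shape that `image_to_data` returns (`text`, `conf`, and `block_num` as parallel lists). This sketch combines the confidence filter from the first loop with the block grouping from the second:

```python
# Mocked Tesseract DICT output: two blocks, one empty and one low-confidence entry
data = {
    "text":      ["Invoice", "#1042", "", "Total:", "$99.00", "???"],
    "conf":      [96,        91,      -1,  88,       93,       0],
    "block_num": [1,         1,       1,   2,        2,        2],
}

blocks, current, current_num = [], [], -1
for i, text in enumerate(data["text"]):
    # Skip empty entries and anything Tesseract marked as low/no confidence
    if not text.strip() or data["conf"][i] <= 0:
        continue
    if data["block_num"][i] != current_num:
        if current:
            blocks.append(" ".join(current))
        current, current_num = [text.strip()], data["block_num"][i]
    else:
        current.append(text.strip())
if current:
    blocks.append(" ".join(current))

print(blocks)  # ['Invoice #1042', 'Total: $99.00']
```

The low-confidence `???` entry is dropped, and the remaining words fold into one string per block, which is exactly what downstream prompt construction wants.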

## Object Detection with YOLO

The YOLO model identifies objects and their locations within an image:

```python
from ultralytics import YOLO

# Load the model once at import time; re-reading the weights on every
# call would add significant latency per request
_model = YOLO("yolov8n.pt")  # nano model for speed

@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2

def detect_objects(
    img: Image.Image, confidence_threshold: float = 0.5
) -> list[DetectedObject]:
    """Detect objects in an image using YOLOv8."""
    results = _model(img, verbose=False)

    detected = []
    for result in results:
        for box in result.boxes:
            conf = float(box.conf[0])
            if conf >= confidence_threshold:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                label = result.names[int(box.cls[0])]
                detected.append(DetectedObject(
                    label=label,
                    confidence=round(conf, 3),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                ))
    return detected
```
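For "how many" questions, the detection output folds naturally into per-label counts before it reaches the VQA prompt. A small helper along these lines (a hypothetical addition, not part of the agent code above) keeps that aggregation in one place:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2

def count_by_label(objects: list[DetectedObject]) -> dict[str, int]:
    """Aggregate detections into label -> count, e.g. for 'how many cars?'."""
    return dict(Counter(o.label for o in objects))

sample = [
    DetectedObject("car", 0.91, (10, 10, 120, 80)),
    DetectedObject("car", 0.77, (140, 12, 260, 85)),
    DetectedObject("person", 0.88, (300, 40, 340, 160)),
]
print(count_by_label(sample))  # {'car': 2, 'person': 1}
```

Passing counts rather than raw boxes to the VQA step gives the model a ready-made answer for counting questions instead of asking it to tally bounding boxes itself.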

## The Agent: Routing Questions to Tools

The agent decides which tools to use based on the user's question. A keyword-based router works well for most cases:

```python
import openai
import base64

class ImageAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    def _select_tools(self, question: str) -> list[str]:
        """Select which tools to run based on the question."""
        q = question.lower()
        tools = []
        if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
            tools.append("ocr")
        if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
            tools.append("detection")
        # Always include VQA as the reasoning backbone
        tools.append("vqa")
        return tools

    async def analyze(
        self, image_bytes: bytes, question: str
    ) -> dict:
        selected_tools = self._select_tools(question)
        context_parts = []

        img = preprocess_image(image_bytes)

        if "ocr" in selected_tools:
            ocr_result = extract_text(
                preprocess_image(image_bytes, enhance_for_ocr=True)
            )
            context_parts.append(
                f"OCR extracted text ({ocr_result.word_count} words, "
                f"confidence {ocr_result.confidence:.1f}%): "
                f"{ocr_result.full_text}"
            )

        if "detection" in selected_tools:
            objects = detect_objects(img)
            obj_summary = ", ".join(
                f"{o.label} ({o.confidence:.0%})" for o in objects
            )
            context_parts.append(
                f"Detected objects: {obj_summary or 'none'}"
            )

        # VQA with GPT-4o, enriched by tool outputs
        b64 = base64.b64encode(image_bytes).decode()
        tool_context = "\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Tool analysis results:\n{tool_context}\n\n"
                            f"Question: {question}"
                        ),
                    },
                    {
                        # Assumes PNG input; detect the real MIME type in production
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }],
        )
        return {
            "answer": response.choices[0].message.content,
            "tools_used": selected_tools,
        }
```

## Structured Output Formatting

For programmatic consumers, format the analysis results as structured JSON:

```python
from pydantic import BaseModel

class ImageAnalysisResult(BaseModel):
    answer: str
    extracted_text: str | None = None
    detected_objects: list[dict] | None = None
    tools_used: list[str]
    confidence: float
```

## FAQ

### When should I use OCR versus a vision language model for text extraction?

Use Tesseract OCR when you need precise character-level extraction from clean documents, invoices, or printed text. Use a vision language model like GPT-4o when the text is embedded in complex scenes, handwritten, or when you also need to understand the context around the text. For best results, run both and let the agent cross-reference the outputs.

### How do I handle images that are too large for the API?

Resize images to a maximum dimension of 2048 pixels while preserving the aspect ratio, as shown in the preprocessing function. For GPT-4o specifically, the API automatically handles resizing, but sending smaller images reduces latency and cost. If detail is critical for a specific region, crop that region and send it as a separate analysis request.

### Can this agent process multiple images in a single request?

Yes. Extend the `analyze` method to accept a list of image bytes. Process each image independently through the tool pipeline, then send all results along with all images to the VQA step. GPT-4o supports multiple images in a single message, so the reasoning model can compare and cross-reference across images.
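The multi-image variant only changes how the message content is assembled: each image becomes its own `image_url` part alongside the single text part. A sketch of that content-parts construction, using a hypothetical helper name and assuming PNG inputs:

```python
import base64

def build_multi_image_content(question: str, images: list[bytes]) -> list[dict]:
    """Build an OpenAI chat content list: one text part plus N image parts."""
    parts: list[dict] = [{"type": "text", "text": question}]
    for image_bytes in images:
        b64 = base64.b64encode(image_bytes).decode()
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return parts

content = build_multi_image_content(
    "Which image shows more cars?", [b"fake-image-a", b"fake-image-b"]
)
print(len(content))  # 3: one text part + two image parts
```

The resulting list drops straight into the `"content"` field of the user message in `analyze`.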

---

#ImageAnalysis #OCR #ObjectDetection #VisualQA #ComputerVision #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/building-image-analysis-agent-ocr-object-detection-visual-qa
