Learn Agentic AI

Building a Document Intelligence Agent: OCR, Layout Analysis, and Information Extraction

Learn how to build an end-to-end document intelligence agent that combines Tesseract OCR, layout detection, zone classification, and structured information extraction to process any document type automatically.

Why Document Intelligence Needs More Than OCR

Traditional OCR converts pixels to characters, but that is only the first step. Real document intelligence requires understanding the spatial layout — headers, paragraphs, tables, footnotes — and extracting structured information that downstream systems can consume. A document intelligence agent orchestrates these stages, deciding which regions need deeper analysis and which extraction strategy fits each zone.

The core pipeline follows four stages: image preprocessing, OCR with confidence scoring, layout analysis to identify semantic zones, and structured extraction that maps content to fields your application expects.

Setting Up the Foundation

Install the necessary libraries for the full pipeline:

pip install pytesseract Pillow layoutparser opencv-python-headless pydantic openai

Make sure Tesseract is installed on your system:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# macOS
brew install tesseract
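
Before wiring up the pipeline, it is worth confirming that the `tesseract` binary is actually discoverable, since pytesseract fails with an opaque error otherwise. A minimal stdlib check (a convenience sketch, not part of pytesseract):

```python
import shutil


def tesseract_available() -> bool:
    """Return True if the tesseract binary is discoverable on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print(f"Found tesseract at: {shutil.which('tesseract')}")
    else:
        print("tesseract not found; pytesseract calls will fail")
```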

Building the Document Preprocessing Layer

Raw scans often arrive skewed, poorly lit, or at inconsistent resolutions. Preprocessing normalizes images before OCR:

import cv2
import numpy as np
from PIL import Image


def preprocess_document(image_path: str) -> np.ndarray:
    """Prepare a document image for OCR and layout analysis."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew: estimate the text angle from foreground pixels.
    # Invert with Otsu thresholding first so dark text (not the bright
    # background) becomes the nonzero mask we measure.
    inverted = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )[1]
    coords = np.column_stack(np.where(inverted > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:        # OpenCV < 4.5 reports angles in [-90, 0)
        angle = -(90 + angle)
    elif angle > 45:       # OpenCV >= 4.5 reports angles in [0, 90)
        angle = 90 - angle
    else:
        angle = -angle
    if abs(angle) > 0.5:
        h, w = gray.shape
        center = (w // 2, h // 2)
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        gray = cv2.warpAffine(
            gray, matrix, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

    # Adaptive thresholding for variable lighting
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # Noise removal
    denoised = cv2.medianBlur(binary, 3)

    return denoised

OCR with Confidence Scoring

Tesseract provides word-level confidence scores through its detailed output mode. This lets the agent flag low-confidence regions for human review:

import pytesseract
import numpy as np
from dataclasses import dataclass


@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def extract_with_confidence(image: np.ndarray) -> list[OCRResult]:
    """Run OCR and return word-level results with confidence."""
    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )

    results = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf is -1 for non-text elements; newer pytesseract versions
        # may return it as a float, so parse with float() before comparing.
        conf = float(data["conf"][i])

        if text and conf > 0:
            results.append(OCRResult(
                text=text,
                confidence=conf / 100.0,
                bbox=(
                    data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]
                ),
                block_num=data["block_num"][i],
                line_num=data["line_num"][i],
            ))

    return results
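
Word-level scores are often too granular for routing decisions. A small helper (a sketch, not part of pytesseract) can aggregate them per line using the block and line numbers Tesseract already returns:

```python
from collections import defaultdict


def line_confidences(
    words: list[tuple[int, int, float]]
) -> dict[tuple[int, int], float]:
    """Average word confidences per (block_num, line_num) key.

    `words` holds (block_num, line_num, confidence) triples, e.g. pulled
    from the OCRResult list produced above.
    """
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for block, line, conf in words:
        buckets[(block, line)].append(conf)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Two words on line (1, 1) average to 0.9; line (1, 2) has one word at 0.5
scores = line_confidences([(1, 1, 0.95), (1, 1, 0.85), (1, 2, 0.5)])
```

Line-level averages smooth over single misread words while still surfacing lines that are uniformly poor.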

Zone Classification with Layout Analysis

Layout analysis segments the page into semantic regions — title, body text, table, figure, footer — so the agent can apply the right extraction strategy per zone:

from enum import Enum


class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"


def classify_zones(
    ocr_results: list[OCRResult],
    page_height: int
) -> dict[ZoneType, list[OCRResult]]:
    """Classify OCR results into semantic zones by position."""
    zones: dict[ZoneType, list[OCRResult]] = {z: [] for z in ZoneType}

    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height

        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)

    return zones
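
To sanity-check the heuristic, here is a standalone demo on synthetic word boxes; the enum, dataclass, and classifier from above are repeated so the snippet runs on its own:

```python
from dataclasses import dataclass
from enum import Enum


class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"


@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def classify_zones(ocr_results, page_height):
    zones = {z: [] for z in ZoneType}
    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height
        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)
    return zones


# Synthetic words on a 1000px-tall page
words = [
    OCRResult("Invoice", 0.98, (40, 30, 120, 20), 1, 1),   # y=30  -> header
    OCRResult("Total:", 0.95, (40, 500, 80, 18), 2, 1),    # y=500 -> body
    OCRResult("Page 1", 0.90, (40, 960, 70, 14), 3, 1),    # y=960 -> footer
]
zones = classify_zones(words, page_height=1000)
```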

The Agent Orchestrator

The agent ties all stages together, using an LLM to interpret extracted content and produce structured output:

from pydantic import BaseModel
from openai import OpenAI


class DocumentFields(BaseModel):
    title: str | None = None
    date: str | None = None
    author: str | None = None
    summary: str | None = None
    key_entities: list[str] = []
    confidence_score: float = 0.0


def run_document_agent(image_path: str) -> DocumentFields:
    """Full pipeline: preprocess, OCR, classify, extract."""
    preprocessed = preprocess_document(image_path)
    ocr_results = extract_with_confidence(preprocessed)

    h, _ = preprocessed.shape[:2]
    zones = classify_zones(ocr_results, h)

    header_text = " ".join(r.text for r in zones[ZoneType.HEADER])
    body_text = " ".join(r.text for r in zones[ZoneType.BODY])
    avg_conf = float(np.mean([r.confidence for r in ocr_results])) if ocr_results else 0.0

    client = OpenAI()
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Extract structured fields from this document text. "
                "Return title, date, author, summary, and key entities."
            )},
            {"role": "user", "content": (
                f"HEADER: {header_text}\n\nBODY: {body_text}"
            )},
        ],
        response_format=DocumentFields,
    )

    result = response.choices[0].message.parsed
    if result is None:
        raise ValueError("Model returned no parsed fields (likely a refusal)")
    result.confidence_score = round(avg_conf, 3)
    return result
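
The orchestrator joins words with spaces in whatever order Tesseract emitted them. For single-column pages, sorting by block, then line, then X coordinate restores reading order before the text reaches the LLM. A sketch, using a local stand-in for OCRResult so it runs on its own:

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Minimal stand-in for OCRResult: text plus position metadata."""
    text: str
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def reading_order_text(words: list[Word]) -> str:
    """Join words left-to-right within lines, top-to-bottom across blocks."""
    ordered = sorted(words, key=lambda w: (w.block_num, w.line_num, w.bbox[0]))
    lines: list[str] = []
    current_key = None
    for w in ordered:
        key = (w.block_num, w.line_num)
        if key != current_key:
            lines.append(w.text)  # start a new line
            current_key = key
        else:
            lines[-1] += " " + w.text  # continue the current line
    return "\n".join(lines)


# Words arrive out of order; sorting recovers "hello world" then "line2"
text = reading_order_text([
    Word("world", (60, 10, 40, 12), 1, 1),
    Word("hello", (10, 10, 40, 12), 1, 1),
    Word("line2", (10, 30, 40, 12), 1, 2),
])
```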

Handling Low-Confidence Regions

A production agent should flag uncertain results rather than silently producing bad data:

def identify_review_regions(
    ocr_results: list[OCRResult],
    threshold: float = 0.6
) -> list[dict]:
    """Flag regions where OCR confidence is below threshold."""
    flagged = []
    for result in ocr_results:
        if result.confidence < threshold:
            flagged.append({
                "text": result.text,
                "confidence": result.confidence,
                "bbox": result.bbox,
                "suggestion": "Route to human reviewer",
            })
    return flagged

This human-in-the-loop pattern is essential for any document processing system where accuracy is critical, such as legal or financial documents.
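
Beyond flagging individual words, the agent needs a document-level decision: accept automatically or route the whole page to a reviewer. One simple policy (a sketch, not from the pipeline above) auto-accepts only when the fraction of low-confidence words stays under a budget:

```python
def routing_decision(
    confidences: list[float],
    word_threshold: float = 0.6,
    max_flagged_ratio: float = 0.05,
) -> str:
    """Return 'auto_accept' or 'human_review' for a whole document."""
    if not confidences:
        return "human_review"  # empty OCR output is itself suspicious
    flagged = sum(1 for c in confidences if c < word_threshold)
    ratio = flagged / len(confidences)
    return "auto_accept" if ratio <= max_flagged_ratio else "human_review"


decision = routing_decision([0.95, 0.91, 0.88, 0.45])  # 1 of 4 words flagged
```

The thresholds here are placeholders; tune them against the cost of a reviewer's time versus the cost of a bad extraction in your domain.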

FAQ

How accurate is Tesseract compared to cloud OCR services?

Tesseract v5 achieves 95-98% accuracy on clean printed text but drops to 70-85% on degraded scans, handwriting, or unusual fonts. Cloud services like Google Document AI and AWS Textract often outperform it on difficult inputs because they use deep learning models trained on massive datasets. However, Tesseract is free, runs locally, and handles most standard business documents well.

Can layout analysis work on multi-column documents?

Yes, but it requires more sophisticated approaches than simple Y-coordinate thresholding. Libraries like LayoutParser use deep learning models trained on document layout datasets (PubLayNet, DocBank) to detect columns, tables, and figures regardless of their position. For production systems, combining LayoutParser with Tesseract yields much better results on complex layouts.
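
For a rough two-column split without a layout model, you can cluster word X-centers around the widest horizontal gap. This heuristic (a hypothetical helper, not part of LayoutParser) only works for clearly separated two-column pages:

```python
def split_columns(x_centers: list[float]) -> tuple[list[float], list[float]]:
    """Split word X-centers into two columns at the widest gap."""
    xs = sorted(x_centers)
    if len(xs) < 2:
        return xs, []
    # Find the largest gap between consecutive sorted centers
    gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
    _, idx = max(gaps)
    boundary = (xs[idx] + xs[idx + 1]) / 2
    left = [x for x in x_centers if x < boundary]
    right = [x for x in x_centers if x >= boundary]
    return left, right


# Two clusters of word centers: one column around x=110, one around x=530
left, right = split_columns([100, 120, 110, 520, 540, 530])
```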

How should I handle documents in multiple languages?

Tesseract supports over 100 languages. Install the relevant language packs and either specify the language explicitly or use a language detection step first. For mixed-language documents, run OCR multiple times with different language hints and merge results by comparing confidence scores per region.
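
Once each language pass has produced per-region text with a confidence score, the merge step is pure bookkeeping. A sketch, assuming each pass yields a hypothetical `{region_id: (text, confidence)}` mapping:

```python
def merge_language_passes(
    passes: list[dict[str, tuple[str, float]]]
) -> dict[str, tuple[str, float]]:
    """Keep, per region, the (text, confidence) pair with highest confidence."""
    merged: dict[str, tuple[str, float]] = {}
    for results in passes:
        for region, (text, conf) in results.items():
            if region not in merged or conf > merged[region][1]:
                merged[region] = (text, conf)
    return merged


# English pass reads region r1 well; French pass wins on region r2
eng = {"r1": ("Hello", 0.97), "r2": ("Bonjour", 0.42)}
fra = {"r1": ("Hcllo", 0.55), "r2": ("Bonjour", 0.93)}
merged = merge_language_passes([eng, fra])
```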


#DocumentAI #OCR #Tesseract #LayoutAnalysis #InformationExtraction #VisionAI #AgenticAI #Python


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

