Learn Agentic AI

Building a Document Intelligence Agent: OCR, Layout Analysis, and Information Extraction

Learn how to build an end-to-end document intelligence agent that combines Tesseract OCR, layout detection, zone classification, and structured information extraction to process any document type automatically.

Why Document Intelligence Needs More Than OCR

Traditional OCR converts pixels to characters, but that is only the first step. Real document intelligence requires understanding the spatial layout — headers, paragraphs, tables, footnotes — and extracting structured information that downstream systems can consume. A document intelligence agent orchestrates these stages, deciding which regions need deeper analysis and which extraction strategy fits each zone.

The core pipeline follows four stages: image preprocessing, OCR with confidence scoring, layout analysis to identify semantic zones, and structured extraction that maps content to fields your application expects.

Setting Up the Foundation

Install the necessary libraries for the full pipeline:

pip install pytesseract Pillow layoutparser opencv-python-headless pydantic openai

Make sure Tesseract is installed on your system:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# macOS
brew install tesseract
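
Before wiring up the pipeline, it is worth confirming that the `tesseract` binary is actually discoverable, since pytesseract fails with an opaque error otherwise. A minimal stdlib check (a convenience sketch, not part of pytesseract):

```python
import shutil


def tesseract_available() -> bool:
    """Return True if the tesseract binary is discoverable on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print(f"Found tesseract at: {shutil.which('tesseract')}")
    else:
        print("tesseract not found; pytesseract calls will fail")
```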

Building the Document Preprocessing Layer

Raw scans often arrive skewed, poorly lit, or at inconsistent resolutions. Preprocessing normalizes images before OCR:

import cv2
import numpy as np
from PIL import Image


def preprocess_document(image_path: str) -> np.ndarray:
    """Prepare a document image for OCR and layout analysis."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew: estimate the text angle from foreground pixels.
    # Invert with Otsu thresholding first so dark text (not the bright
    # background) becomes the nonzero mask we measure.
    inverted = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )[1]
    coords = np.column_stack(np.where(inverted > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:        # OpenCV < 4.5 reports angles in [-90, 0)
        angle = -(90 + angle)
    elif angle > 45:       # OpenCV >= 4.5 reports angles in [0, 90)
        angle = 90 - angle
    else:
        angle = -angle
    if abs(angle) > 0.5:
        h, w = gray.shape
        center = (w // 2, h // 2)
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        gray = cv2.warpAffine(
            gray, matrix, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

    # Adaptive thresholding for variable lighting
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # Noise removal
    denoised = cv2.medianBlur(binary, 3)

    return denoised

OCR with Confidence Scoring

Tesseract provides word-level confidence scores through its detailed output mode. This lets the agent flag low-confidence regions for human review:

import pytesseract
import numpy as np
from dataclasses import dataclass


@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def extract_with_confidence(image: np.ndarray) -> list[OCRResult]:
    """Run OCR and return word-level results with confidence."""
    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )

    results = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf is -1 for non-text elements; newer pytesseract versions
        # may return it as a float, so parse with float() before comparing.
        conf = float(data["conf"][i])

        if text and conf > 0:
            results.append(OCRResult(
                text=text,
                confidence=conf / 100.0,
                bbox=(
                    data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]
                ),
                block_num=data["block_num"][i],
                line_num=data["line_num"][i],
            ))

    return results
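
Word-level scores are often too granular for routing decisions. A small helper (a sketch, not part of pytesseract) can aggregate them per line using the block and line numbers Tesseract already returns:

```python
from collections import defaultdict


def line_confidences(
    words: list[tuple[int, int, float]]
) -> dict[tuple[int, int], float]:
    """Average word confidences per (block_num, line_num) key.

    `words` holds (block_num, line_num, confidence) triples, e.g. pulled
    from the OCRResult list produced above.
    """
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for block, line, conf in words:
        buckets[(block, line)].append(conf)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Two words on line (1, 1) average to 0.9; line (1, 2) has one word at 0.5
scores = line_confidences([(1, 1, 0.95), (1, 1, 0.85), (1, 2, 0.5)])
```

Line-level averages smooth over single misread words while still surfacing lines that are uniformly poor.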

Zone Classification with Layout Analysis

Layout analysis segments the page into semantic regions — title, body text, table, figure, footer — so the agent can apply the right extraction strategy per zone:

from enum import Enum


class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"


def classify_zones(
    ocr_results: list[OCRResult],
    page_height: int
) -> dict[ZoneType, list[OCRResult]]:
    """Classify OCR results into semantic zones by position."""
    zones: dict[ZoneType, list[OCRResult]] = {z: [] for z in ZoneType}

    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height

        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)

    return zones
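
To sanity-check the heuristic, here is a standalone demo on synthetic word boxes; the enum, dataclass, and classifier from above are repeated so the snippet runs on its own:

```python
from dataclasses import dataclass
from enum import Enum


class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"


@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def classify_zones(ocr_results, page_height):
    zones = {z: [] for z in ZoneType}
    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height
        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)
    return zones


# Synthetic words on a 1000px-tall page
words = [
    OCRResult("Invoice", 0.98, (40, 30, 120, 20), 1, 1),   # y=30  -> header
    OCRResult("Total:", 0.95, (40, 500, 80, 18), 2, 1),    # y=500 -> body
    OCRResult("Page 1", 0.90, (40, 960, 70, 14), 3, 1),    # y=960 -> footer
]
zones = classify_zones(words, page_height=1000)
```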

The Agent Orchestrator

The agent ties all stages together, using an LLM to interpret extracted content and produce structured output:

from pydantic import BaseModel
from openai import OpenAI


class DocumentFields(BaseModel):
    title: str | None = None
    date: str | None = None
    author: str | None = None
    summary: str | None = None
    key_entities: list[str] = []
    confidence_score: float = 0.0


def run_document_agent(image_path: str) -> DocumentFields:
    """Full pipeline: preprocess, OCR, classify, extract."""
    preprocessed = preprocess_document(image_path)
    ocr_results = extract_with_confidence(preprocessed)

    h, _ = preprocessed.shape[:2]
    zones = classify_zones(ocr_results, h)

    header_text = " ".join(r.text for r in zones[ZoneType.HEADER])
    body_text = " ".join(r.text for r in zones[ZoneType.BODY])
    avg_conf = float(np.mean([r.confidence for r in ocr_results])) if ocr_results else 0.0

    client = OpenAI()
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Extract structured fields from this document text. "
                "Return title, date, author, summary, and key entities."
            )},
            {"role": "user", "content": (
                f"HEADER: {header_text}\n\nBODY: {body_text}"
            )},
        ],
        response_format=DocumentFields,
    )

    result = response.choices[0].message.parsed
    if result is None:
        raise ValueError("Model returned no parsed fields (likely a refusal)")
    result.confidence_score = round(avg_conf, 3)
    return result
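
The orchestrator joins words with spaces in whatever order Tesseract emitted them. For single-column pages, sorting by block, then line, then X coordinate restores reading order before the text reaches the LLM. A sketch, using a local stand-in for OCRResult so it runs on its own:

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Minimal stand-in for OCRResult: text plus position metadata."""
    text: str
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int


def reading_order_text(words: list[Word]) -> str:
    """Join words left-to-right within lines, top-to-bottom across blocks."""
    ordered = sorted(words, key=lambda w: (w.block_num, w.line_num, w.bbox[0]))
    lines: list[str] = []
    current_key = None
    for w in ordered:
        key = (w.block_num, w.line_num)
        if key != current_key:
            lines.append(w.text)  # start a new line
            current_key = key
        else:
            lines[-1] += " " + w.text  # continue the current line
    return "\n".join(lines)


# Words arrive out of order; sorting recovers "hello world" then "line2"
text = reading_order_text([
    Word("world", (60, 10, 40, 12), 1, 1),
    Word("hello", (10, 10, 40, 12), 1, 1),
    Word("line2", (10, 30, 40, 12), 1, 2),
])
```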

Handling Low-Confidence Regions

A production agent should flag uncertain results rather than silently producing bad data:

def identify_review_regions(
    ocr_results: list[OCRResult],
    threshold: float = 0.6
) -> list[dict]:
    """Flag regions where OCR confidence is below threshold."""
    flagged = []
    for result in ocr_results:
        if result.confidence < threshold:
            flagged.append({
                "text": result.text,
                "confidence": result.confidence,
                "bbox": result.bbox,
                "suggestion": "Route to human reviewer",
            })
    return flagged

This human-in-the-loop pattern is essential for any document processing system where accuracy is critical, such as legal or financial documents.
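
Beyond flagging individual words, the agent needs a document-level decision: accept automatically or route the whole page to a reviewer. One simple policy (a sketch, not from the pipeline above) auto-accepts only when the fraction of low-confidence words stays under a budget:

```python
def routing_decision(
    confidences: list[float],
    word_threshold: float = 0.6,
    max_flagged_ratio: float = 0.05,
) -> str:
    """Return 'auto_accept' or 'human_review' for a whole document."""
    if not confidences:
        return "human_review"  # empty OCR output is itself suspicious
    flagged = sum(1 for c in confidences if c < word_threshold)
    ratio = flagged / len(confidences)
    return "auto_accept" if ratio <= max_flagged_ratio else "human_review"


decision = routing_decision([0.95, 0.91, 0.88, 0.45])  # 1 of 4 words flagged
```

The thresholds here are placeholders; tune them against the cost of a reviewer's time versus the cost of a bad extraction in your domain.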

FAQ

How accurate is Tesseract compared to cloud OCR services?

Tesseract v5 achieves 95-98% accuracy on clean printed text but drops to 70-85% on degraded scans, handwriting, or unusual fonts. Cloud services like Google Document AI and AWS Textract often outperform it on difficult inputs because they use deep learning models trained on massive datasets. However, Tesseract is free, runs locally, and handles most standard business documents well.

Can layout analysis work on multi-column documents?

Yes, but it requires more sophisticated approaches than simple Y-coordinate thresholding. Libraries like LayoutParser use deep learning models trained on document layout datasets (PubLayNet, DocBank) to detect columns, tables, and figures regardless of their position. For production systems, combining LayoutParser with Tesseract yields much better results on complex layouts.
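
For a rough two-column split without a layout model, you can cluster word X-centers around the widest horizontal gap. This heuristic (a hypothetical helper, not part of LayoutParser) only works for clearly separated two-column pages:

```python
def split_columns(x_centers: list[float]) -> tuple[list[float], list[float]]:
    """Split word X-centers into two columns at the widest gap."""
    xs = sorted(x_centers)
    if len(xs) < 2:
        return xs, []
    # Find the largest gap between consecutive sorted centers
    gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
    _, idx = max(gaps)
    boundary = (xs[idx] + xs[idx + 1]) / 2
    left = [x for x in x_centers if x < boundary]
    right = [x for x in x_centers if x >= boundary]
    return left, right


# Two clusters of word centers: one column around x=110, one around x=530
left, right = split_columns([100, 120, 110, 520, 540, 530])
```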

How should I handle documents in multiple languages?

Tesseract supports over 100 languages. Install the relevant language packs and either specify the language explicitly or use a language detection step first. For mixed-language documents, run OCR multiple times with different language hints and merge results by comparing confidence scores per region.
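
Once each language pass has produced per-region text with a confidence score, the merge step is pure bookkeeping. A sketch, assuming each pass yields a hypothetical `{region_id: (text, confidence)}` mapping:

```python
def merge_language_passes(
    passes: list[dict[str, tuple[str, float]]]
) -> dict[str, tuple[str, float]]:
    """Keep, per region, the (text, confidence) pair with highest confidence."""
    merged: dict[str, tuple[str, float]] = {}
    for results in passes:
        for region, (text, conf) in results.items():
            if region not in merged or conf > merged[region][1]:
                merged[region] = (text, conf)
    return merged


# English pass reads region r1 well; French pass wins on region r2
eng = {"r1": ("Hello", 0.97), "r2": ("Bonjour", 0.42)}
fra = {"r1": ("Hcllo", 0.55), "r2": ("Bonjour", 0.93)}
merged = merge_language_passes([eng, fra])
```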


#DocumentAI #OCR #Tesseract #LayoutAnalysis #InformationExtraction #VisionAI #AgenticAI #Python


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

