---
title: "Building a Document Comparison Agent: AI-Powered Contract and Document Diff"
description: "Build an AI agent that extracts text from documents, aligns corresponding sections, detects meaningful differences between versions, and generates clear summaries highlighting what changed and why it matters."
canonical: https://callsphere.ai/blog/building-document-comparison-agent-contract-diff
category: "Learn Agentic AI"
tags: ["Document Comparison", "Text Extraction", "Contracts", "Diff", "AI Agents"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T01:49:30.113Z
---

# Building a Document Comparison Agent: AI-Powered Contract and Document Diff

> Build an AI agent that extracts text from documents, aligns corresponding sections, detects meaningful differences between versions, and generates clear summaries highlighting what changed and why it matters.

## Beyond Simple Text Diff

Standard diff tools compare text line by line. They will tell you that line 47 changed from "30 days" to "45 days", but not that this is a payment terms extension that affects your cash flow. A document comparison agent understands context. It groups changes by section, classifies their significance (cosmetic, substantive, material), and explains the business impact of each change.

This is especially valuable for contract review, policy updates, regulatory filings, and any document where the meaning of changes matters as much as their location.

## Text Extraction Tool

Documents arrive in various formats. This tool extracts clean text from PDFs, DOCX files, and plain text:

```python
from pathlib import Path
from agents import Agent, Runner, function_tool

@function_tool
def extract_text(file_path: str) -> str:
    """Extract text content from a document file.
    Supports .txt, .pdf, and .docx formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    try:
        if suffix == ".txt":
            return path.read_text(encoding="utf-8")

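        elif suffix == ".pdf":
            # PyMuPDF; older releases expose the same API as `import fitz`
            import pymupdf
            doc = pymupdf.open(file_path)
            pages = []
            for page in doc:
                pages.append(page.get_text())
            doc.close()
            return "\n\n".join(pages)

        elif suffix == ".docx":
            # python-docx; doc.paragraphs covers body paragraphs only,
            # so text inside tables is not included here
            from docx import Document
            doc = Document(file_path)
            paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
            return "\n\n".join(paragraphs)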

        else:
            return f"Unsupported format: {suffix}"

    except Exception as e:
        return f"Extraction error: {e}"
```
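
Text extracted from PDFs often carries layout noise (hard line wraps, ligatures, non-breaking spaces) that a line-based diff would report as changes. The sketch below shows one way to normalize extracted text before storing it. `normalize_text` is a hypothetical helper, not part of the agent above, and the specific rules are assumptions you would tune for your documents.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Reduce layout noise so it does not show up as a spurious diff."""
    # Fold compatibility characters: ligatures such as U+FB01 and
    # non-breaking spaces become their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break ("termi-\nnation").
    # Caveat: this also joins genuine hyphenated terms split at a break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Applying the same normalization to both versions matters more than the exact rules: the goal is that identical clauses produce identical strings.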

## Section Alignment Tool

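Contracts and legal documents are structured into sections. The tools below load each version into a shared in-memory store under a label and report a rough section count based on common heading patterns. The comparison tools that follow read from that store: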

```python
import re
import difflib

_documents: dict[str, str] = {}

@function_tool
def load_document(label: str, file_path: str) -> str:
    """Load and store a document for comparison. Use labels like
    'original' and 'revised'."""
    from pathlib import Path
    path = Path(file_path)
    if path.suffix.lower() != ".txt":
        # Other formats must go through extract_text first
        return f"Use extract_text for {path.suffix} files, then call store_text."
    # Read with an explicit encoding, matching extract_text
    text = path.read_text(encoding="utf-8")

    _documents[label] = text
    word_count = len(text.split())
    section_count = len(re.split(r"\n(?=\d+\.|Section |Article |ARTICLE )", text))
    return f"Loaded '{label}': {word_count} words, ~{section_count} sections."

@function_tool
def store_text(label: str, text: str) -> str:
    """Store already-extracted text under a label for comparison."""
    _documents[label] = text
    return f"Stored '{label}': {len(text.split())} words."
```
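
The tools above only store text and estimate a section count. A minimal sketch of the splitting and alignment step itself might look like the following. `split_sections` and `align_sections` are hypothetical helpers, and the heading pattern is an assumption that mirrors the one used in `load_document`.

```python
import re

# Assumed heading pattern: a line starting with "1.", "Section ",
# "Article " or "ARTICLE ". Numbered list items inside a clause would
# also match, so treat this as a starting point only.
HEADING_RE = re.compile(
    r"^(?:\d+\.|Section\s|Article\s|ARTICLE\s).*$", re.MULTILINE
)


def split_sections(text: str) -> dict[str, str]:
    """Map each heading line to the body text that follows it."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs until the next heading or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group().strip()] = text[m.end():end].strip()
    return sections


def align_sections(original: str, revised: str):
    """Pair sections by identical heading.

    Returns (heading, old_body, new_body) tuples; a body is None when
    the section exists in only one version.
    """
    a, b = split_sections(original), split_sections(revised)
    headings = list(a) + [h for h in b if h not in a]
    return [(h, a.get(h), b.get(h)) for h in headings]
```

Matching on exact heading text is the simplest option. It breaks when sections are renumbered or retitled, and repeated headings overwrite each other in the dictionary.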

## Difference Detection Tool

This tool finds the actual differences between two document versions:

```python
@function_tool
def compute_diff(label_a: str, label_b: str) -> str:
    """Compute differences between two loaded documents.
    Returns additions, deletions, and modifications."""
    if label_a not in _documents or label_b not in _documents:
        available = ", ".join(_documents.keys())
        return f"Missing document. Available: {available}"

    lines_a = _documents[label_a].splitlines()
    lines_b = _documents[label_b].splitlines()

    differ = difflib.unified_diff(
        lines_a, lines_b,
        fromfile=label_a, tofile=label_b,
        lineterm="",
    )
    diff_lines = list(differ)

    if not diff_lines:
        return "Documents are identical."

    # Summarize changes
    additions = sum(1 for l in diff_lines if l.startswith("+") and not l.startswith("+++"))
    deletions = sum(1 for l in diff_lines if l.startswith("-") and not l.startswith("---"))

    # Group diff lines into change blocks; each block starts at an "@@" header.
    # current_change stays None until the first "@@", so the "---"/"+++"
    # file header lines are not collected as a bogus first block.
    changes = []
    current_change = None
    for line in diff_lines:
        if line.startswith("@@"):
            if current_change:
                changes.append("\n".join(current_change))
            current_change = [line]
        elif current_change is not None:
            current_change.append(line)
    if current_change:
        changes.append("\n".join(current_change))

    output = (
        f"Diff Summary: {additions} additions, {deletions} deletions, "
        f"{len(changes)} change blocks\n\n"
    )
    # Show first 10 change blocks
    for i, change in enumerate(changes[:10]):
        output += f"--- Change {i+1} ---\n{change}\n\n"

    if len(changes) > 10:
        output += f"... and {len(changes) - 10} more change blocks."

    return output
```
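
A unified diff reports whole lines, so a one-word edit in a long clause shows up as a deleted line plus an added line. As a sketch of how the changed words could be isolated, the hypothetical helper below runs `difflib.SequenceMatcher` over word lists. It is not one of the agent's tools above.

```python
import difflib


def word_changes(old_line: str, new_line: str) -> list[tuple[str, str]]:
    """Return (old_words, new_words) for each span that differs."""
    old_words, new_words = old_line.split(), new_line.split()
    matcher = difflib.SequenceMatcher(
        None, old_words, new_words, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tags are "equal", "replace", "delete" or "insert".
        if tag != "equal":
            changes.append(
                (" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


print(word_changes(
    "Payment is due within 30 days of invoice.",
    "Payment is due within 45 days of invoice.",
))
```

For the payment example this isolates the pair `("30", "45")`, which gives the model a much smaller target to classify than two full lines.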

## Similarity Scoring Tool

Quantify how different two documents are overall:

```python
@function_tool
def similarity_score(label_a: str, label_b: str) -> str:
    """Calculate overall similarity between two documents."""
    if label_a not in _documents or label_b not in _documents:
        return "Missing document."

    text_a = _documents[label_a]
    text_b = _documents[label_b]

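    # Sequence matcher for overall similarity. autojunk=False because the
    # default heuristic treats frequent characters as junk in long texts,
    # which skews a character-level ratio (at the cost of more compute).
    ratio = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()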

    # Word-level comparison
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    jaccard = len(words_a & words_b) / len(words_a | words_b) if (words_a | words_b) else 0

    return (
        f"Similarity between '{label_a}' and '{label_b}':\n"
        f"  Character-level similarity: {ratio:.1%}\n"
        f"  Word overlap (Jaccard): {jaccard:.1%}\n"
        f"  Unique to '{label_a}': {len(words_a - words_b)} words\n"
        f"  Unique to '{label_b}': {len(words_b - words_a)} words"
    )
```
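
The two metrics measure different things, and neither measures meaning. The worked example below uses a standalone `jaccard` helper (a hypothetical function that repeats the word-overlap formula from `similarity_score`) on sentences invented for illustration.

```python
import difflib


def jaccard(text_a: str, text_b: str) -> float:
    """Word-set overlap: |A & B| / |A | B|."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# One changed number: 4 shared words out of 6 distinct words overall.
changed_term = jaccard("Payment due in 30 days", "Payment due in 45 days")

# Swapped parties: the word sets are identical, so Jaccard is 1.0
# even though the obligation is reversed.
a = "Buyer indemnifies Seller"
b = "Seller indemnifies Buyer"
swapped_jaccard = jaccard(a, b)

# The order-sensitive sequence ratio does register the swap.
swapped_ratio = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

print(f"{changed_term:.3f} {swapped_jaccard:.1f} {swapped_ratio:.3f}")
```

Treat both scores as a coarse signal of how much text moved, and rely on the diff and the agent's classification to decide whether a change matters.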

## Assembling the Document Comparison Agent

```python
doc_agent = Agent(
    name="Document Comparator",
    instructions="""You are a document comparison agent specializing in contracts
and legal documents. When given two document versions:

1. Extract text from both documents using extract_text.
2. Store them with store_text using labels 'original' and 'revised'.
3. Call similarity_score for an overall comparison metric.
4. Call compute_diff to get the detailed differences.
5. Analyze each change block and classify it as:
   - Cosmetic: formatting, typos, rephrasing with same meaning
   - Substantive: meaningful change to terms, obligations, or rights
   - Material: high-impact change affecting financial terms, liability,
     termination, or indemnification
6. Produce a report with:
   - Executive Summary (overall similarity, number of material changes)
   - Material Changes (each with before/after text and impact analysis)
   - Substantive Changes (grouped by section)
   - Cosmetic Changes (brief list)
   - Risk Assessment (what the changes mean for the parties involved)""",
    tools=[extract_text, load_document, store_text, compute_diff, similarity_score],
)
```

## Example Usage

```python
result = Runner.run_sync(
    doc_agent,
    "Compare the original contract at /docs/contract_v1.pdf with the "
    "revised version at /docs/contract_v2.pdf. Focus on any changes to "
    "payment terms, liability clauses, and termination conditions.",
)
print(result.final_output)
```

In an illustrative run, the agent extracts text from both PDFs and might report a 94.2% similarity score and 12 change blocks: 2 material (payment terms extended from 30 to 60 days, liability cap increased from $1M to $5M), 5 substantive (such as a new force majeure clause and updated data handling provisions), and 5 cosmetic. The risk assessment would then highlight the cash flow impact of the extended payment terms.

## FAQ

### Can this agent handle scanned PDFs without selectable text?

Not directly. Scanned PDFs contain page images rather than selectable text, so they require OCR. Add a preprocessing step using `pytesseract` or a cloud OCR service like Google Document AI. Extract the text via OCR first, then feed it to the comparison agent through `store_text`.

### How does the agent handle documents with completely different structures?

The diff tool works best when documents share a similar structure. For documents with reorganized sections, add a section-matching tool that uses semantic similarity (embeddings) to align sections by content rather than position before computing differences.
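
As a rough sketch of content-based matching, the hypothetical `match_sections` helper below pairs each original section with its most similar unused section in the revision. To stay self-contained it uses a `difflib` ratio as a stand-in for embedding similarity; the `similarity` parameter is where a cosine-similarity function over embeddings would plug in. The 0.6 threshold is an assumption, not a tuned value.

```python
import difflib
from typing import Callable, Optional


def text_similarity(a: str, b: str) -> float:
    """Stand-in scorer; swap in embedding cosine similarity if available."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def match_sections(
    old: dict[str, str],
    new: dict[str, str],
    similarity: Callable[[str, str], float] = text_similarity,
    threshold: float = 0.6,
) -> dict[str, Optional[str]]:
    """Greedily map each old heading to the best unused new heading.

    A value of None means no revised section scored above the threshold.
    """
    unused = list(new)
    matches: dict[str, Optional[str]] = {}
    for old_name, old_body in old.items():
        scored = [(similarity(old_body, new[name]), name) for name in unused]
        best_score, best_name = max(scored, default=(0.0, None))
        if best_name is not None and best_score >= threshold:
            matches[old_name] = best_name
            unused.remove(best_name)
        else:
            matches[old_name] = None
    return matches
```

Greedy matching is order-dependent and can mis-pair near-duplicate clauses. Sections left unmatched on either side are candidates for "removed" or "newly added" in the report.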

### Is this suitable for comparing legal contracts in production?

This agent provides a strong first pass that saves hours of manual review. However, for legally binding decisions, always have a qualified attorney review the agent's findings. The agent excels at surfacing changes that might be missed during manual review, not at replacing legal judgment.

---

Source: https://callsphere.ai/blog/building-document-comparison-agent-contract-diff
