---
title: "AI-Powered Document Comparison: Redline Generation and Change Tracking with Vision"
description: "Build an AI agent that compares two versions of a document, identifies additions, deletions, and modifications, generates visual redlines, and produces annotated change summaries for legal, contract, and policy review workflows."
canonical: https://callsphere.ai/blog/ai-powered-document-comparison-redline-change-tracking-vision
category: "Learn Agentic AI"
tags: ["Document Comparison", "Redline Generation", "Change Tracking", "Legal AI", "NLP"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-07T07:42:36.662Z
---

# AI-Powered Document Comparison: Redline Generation and Change Tracking with Vision

> Build an AI agent that compares two versions of a document, identifies additions, deletions, and modifications, generates visual redlines, and produces annotated change summaries for legal, contract, and policy review workflows.

## Why Document Comparison Needs AI

Traditional diff tools work character-by-character or line-by-line. That works for code but fails for documents. When a lawyer restructures a paragraph — moving sentences around, changing "shall" to "must," and splitting a clause into two — a naive diff shows the entire paragraph as deleted and re-added. What you actually want is a semantic understanding of what changed and whether those changes matter.

AI-powered document comparison works at the meaning level. It aligns paragraphs across document versions, detects rewording versus substantive changes, and generates human-readable summaries of what shifted and why it might matter.

## The Comparison Pipeline

The system works in four stages: text extraction from both documents, alignment of corresponding sections, change detection and classification, and output generation (redlines, annotations, summary).

```mermaid
flowchart TD
    OLD["Old version
(PDF / DOCX)"] --> EXT["Text extraction
& segmentation"]
    NEW["New version
(PDF / DOCX)"] --> EXT
    EXT --> ALIGN["Semantic section
alignment (embeddings)"]
    ALIGN --> CLS["Change detection
& classification (LLM)"]
    CLS --> RED(["Redline HTML"])
    CLS --> SUM(["Change summary"])
    style EXT fill:#4f46e5,stroke:#4338ca,color:#fff
    style CLS fill:#f59e0b,stroke:#d97706,color:#1f2937
    style RED fill:#0ea5e9,stroke:#0369a1,color:#fff
    style SUM fill:#059669,stroke:#047857,color:#fff
```
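The four stages can be wired together in a single driver. This sketch injects the stage functions as parameters (the concrete implementations are built in the rest of this post), which also makes the pipeline easy to test with stubs:

```python
def compare_documents(
    old_path: str,
    new_path: str,
    extract,    # e.g. extract_sections
    align,      # e.g. align_sections
    classify,   # e.g. classify_change
    render,     # e.g. generate_redline_html
    summarize,  # e.g. generate_change_summary
) -> dict:
    """Run extraction -> alignment -> classification -> output generation."""
    old_sections = extract(old_path)
    new_sections = extract(new_path)

    alignments = align(old_sections, new_sections)
    classified = [classify(a) for a in alignments]

    return {
        "redline_html": render(classified),
        "summary": summarize(classified),
        "changes": classified,
    }
```

Dependency injection here is a convenience, not a requirement: swapping a stub extractor in for `extract` lets you exercise the pipeline without PDFs or API calls.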

## Text Extraction and Segmentation

First, extract and segment both documents into comparable units:

```python
import pdfplumber
from dataclasses import dataclass

@dataclass
class DocumentSection:
    index: int
    heading: str | None
    text: str
    page: int
    section_type: str  # "heading", "paragraph", "list", "table"

def extract_sections(pdf_path: str) -> list[DocumentSection]:
    """Extract structured sections from a PDF document."""
    sections = []
    current_idx = 0

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            paragraphs = text.split("\n\n")

            for para in paragraphs:
                para = para.strip()
                if not para:
                    continue

                section_type = classify_section(para)
                heading = para if section_type == "heading" else None

                sections.append(DocumentSection(
                    index=current_idx,
                    heading=heading,
                    text=para,
                    page=page_num + 1,
                    section_type=section_type,
                ))
                current_idx += 1

    return sections

def classify_section(text: str) -> str:
    """Classify a text block as heading, paragraph, or list."""
    lines = text.strip().split("\n")

    # Heuristic: a single short line without a terminal period is likely a heading
    if len(lines) == 1 and len(text) < 80 and not text.rstrip().endswith("."):
        return "heading"
    if any(line.lstrip().startswith(("-", "*", "•")) or line.lstrip()[:2].rstrip(".").isdigit()
           for line in lines):
        return "list"
    return "paragraph"
```

## Semantic Section Alignment

With both documents segmented, align sections across versions using embeddings so that reworded or reordered paragraphs still find their counterparts:

```python
import numpy as np
from openai import OpenAI

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get embeddings for a list of text sections."""
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(
        np.dot(a_arr, b_arr) /
        (np.linalg.norm(a_arr) * np.linalg.norm(b_arr) + 1e-10)
    )

def align_sections(
    old_sections: list[DocumentSection],
    new_sections: list[DocumentSection],
    threshold: float = 0.75,
) -> list[dict]:
    """Align sections between old and new document versions."""
    old_texts = [s.text for s in old_sections]
    new_texts = [s.text for s in new_sections]

    old_embeds = get_embeddings(old_texts)
    new_embeds = get_embeddings(new_texts)

    alignments = []
    used_new = set()

    for i, old_embed in enumerate(old_embeds):
        best_score = 0.0
        best_j = -1

        for j, new_embed in enumerate(new_embeds):
            if j in used_new:
                continue
            score = cosine_similarity(old_embed, new_embed)
            if score > best_score:
                best_score = score
                best_j = j

        if best_score >= threshold:
            alignments.append({
                "old": old_sections[i],
                "new": new_sections[best_j],
                "similarity": best_score,
                "status": "modified" if best_score  dict:
    """Classify the type and severity of a detected change."""
    if alignment["status"] == "added":
        return {**alignment, "change_type": ChangeType.ADDITION, "severity": "high"}
    if alignment["status"] == "deleted":
        return {**alignment, "change_type": ChangeType.DELETION, "severity": "high"}
    if alignment["status"] == "unchanged":
        return {**alignment, "change_type": None, "severity": "none"}

    # For modified sections, use LLM to classify
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Compare these two text versions and classify the change as "
                "'cosmetic' (rewording without meaning change), "
                "'substantive' (meaning, obligation, or number changed), "
                "or 'structural' (reorganized but same content). "
                "Respond with just the classification and a one-sentence explanation."
            )},
            {"role": "user", "content": (
                f"OLD: {alignment['old'].text}\n\n"
                f"NEW: {alignment['new'].text}"
            )},
        ],
    )

    classification = response.choices[0].message.content.lower()
    if "substantive" in classification:
        change_type = ChangeType.SUBSTANTIVE
        severity = "high"
    elif "structural" in classification:
        change_type = ChangeType.STRUCTURAL
        severity = "medium"
    else:
        change_type = ChangeType.COSMETIC
        severity = "low"

    return {
        **alignment,
        "change_type": change_type,
        "severity": severity,
        "explanation": response.choices[0].message.content,
    }
```
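For review workflows it helps to surface the riskiest edits first. A small sort over the classified changes (the `severity` key matches the dicts produced by `classify_change` above) does the job:

```python
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2, "none": 3}

def sort_by_severity(classified_changes: list[dict]) -> list[dict]:
    """Order changes so high-severity items appear first in the report."""
    return sorted(
        classified_changes,
        key=lambda c: SEVERITY_ORDER.get(c.get("severity", "none"), 3),
    )
```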

## Generating the Redline Output

Produce an HTML redline document showing additions in green and deletions in red:

```python
import difflib

def generate_redline_html(
    classified_changes: list[dict],
) -> str:
    """Generate an HTML redline document from classified changes."""
    html_parts = [
        "",
        ".added { background: #d4edda; color: #155724; }",
        ".deleted { background: #f8d7da; color: #721c24; text-decoration: line-through; }",
        ".modified { background: #fff3cd; color: #856404; }",
        ".section { margin: 16px 0; padding: 12px; border-left: 4px solid #ccc; }",
        ".severity-high { border-left-color: #dc3545; }",
        ".severity-medium { border-left-color: #ffc107; }",
        ".severity-low { border-left-color: #28a745; }",
        "",
    ]

    for change in classified_changes:
        severity = change.get("severity", "none")

        if change["status"] == "unchanged":
            html_parts.append(f'{change["old"].text}')
        elif change["status"] == "added":
            html_parts.append(
                f''
                f'{change["new"].text}'
            )
        elif change["status"] == "deleted":
            html_parts.append(
                f''
                f'{change["old"].text}'
            )
        elif change["status"] == "modified":
            old_words = change["old"].text.split()
            new_words = change["new"].text.split()
            diff = difflib.ndiff(old_words, new_words)

            diff_html = []
            for token in diff:
                if token.startswith("- "):
                    diff_html.append(f'{token[2:]}')
                elif token.startswith("+ "):
                    diff_html.append(f'{token[2:]}')
                elif token.startswith("  "):
                    diff_html.append(token[2:])

            html_parts.append(
                f''
                f'{" ".join(diff_html)}'
            )

    html_parts.append("")
    return "\n".join(html_parts)
```
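To see what the modified-section branch produces, here is the word-level diff core in isolation, with no LLM or PDF dependencies:

```python
import difflib

old_words = "The vendor shall deliver goods within 30 days".split()
new_words = "The vendor must deliver goods within 45 days".split()

parts = []
for token in difflib.ndiff(old_words, new_words):
    if token.startswith("- "):
        parts.append(f'<span class="deleted">{token[2:]}</span>')
    elif token.startswith("+ "):
        parts.append(f'<span class="added">{token[2:]}</span>')
    elif token.startswith("  "):
        parts.append(token[2:])

# Unchanged words pass through; "shall"/"30" are struck out, "must"/"45" highlighted
html = " ".join(parts)
```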

## Change Summary Generation

Produce a high-level summary for reviewers who need the highlights without reading every redline:

```python
def generate_change_summary(
    classified_changes: list[dict],
) -> str:
    """Generate a human-readable summary of all changes."""
    substantive = [c for c in classified_changes if c.get("change_type") == ChangeType.SUBSTANTIVE]
    additions = [c for c in classified_changes if c["status"] == "added"]
    deletions = [c for c in classified_changes if c["status"] == "deleted"]

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Summarize the key changes between two document versions. "
                "Focus on substantive changes that affect meaning, "
                "obligations, or numbers. Be concise and precise."
            )},
            {"role": "user", "content": (
                f"Substantive changes ({len(substantive)}):\n" +
                "\n".join(c.get("explanation", "") for c in substantive) +
                f"\n\nNew sections added: {len(additions)}" +
                f"\nSections removed: {len(deletions)}"
            )},
        ],
    )

    return response.choices[0].message.content
```

## FAQ

### How does semantic comparison differ from traditional diff tools?

Traditional diff tools operate at the character or line level — they see every reworded sentence as a delete-then-add. Semantic comparison uses embeddings to understand meaning, so it can recognize that "The vendor shall deliver goods within 30 days" and "Goods must be delivered by the vendor within thirty days" are the same clause with cosmetic rewording, not a deletion and addition.
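You can see the gap concretely: a surface-level comparison with `difflib.SequenceMatcher` scores the two clauses above as mostly different, even though an embedding comparison would score them as near-identical in meaning:

```python
import difflib

old = "The vendor shall deliver goods within 30 days"
new = "Goods must be delivered by the vendor within thirty days"

# Word-level surface similarity: only "vendor", "within", and "days" line up,
# so the score lands well below any sensible match threshold
ratio = difflib.SequenceMatcher(None, old.split(), new.split()).ratio()
```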

### Can this handle comparing documents in different formats (Word vs PDF)?

Yes, but you need format-specific extractors. Use python-docx for Word files and pdfplumber for PDFs. The key insight is that comparison happens at the extracted text level, not the file format level. Extract sections from both documents into the same DocumentSection structure, then the rest of the pipeline works identically regardless of source format.

### What about legal documents with numbered clause references?

Clause renumbering is a common trap. When a new clause is inserted, all subsequent numbers shift, making every following clause appear "changed." Handle this by stripping clause numbers before comparison and treating numbering as metadata. After alignment, regenerate the numbering analysis as a separate section of the change report.

---

#DocumentComparison #RedlineGeneration #ChangeTracking #LegalAI #NLP #AgenticAI #Python #ContractReview

