
Table Extraction from Images and PDFs with AI: Building Reliable Data Pipelines

Build an AI-powered table extraction pipeline that detects tables in images and PDFs, recognizes cell boundaries, infers structure, and outputs clean CSV data for downstream consumption.

The Table Extraction Challenge

Tables are one of the most information-dense structures in documents, yet they are among the hardest to extract reliably. A table in a PDF might be a true table object with embedded coordinates, a scanned image of a printed table, or text that is visually aligned but has no structural markup at all. Each case requires a different extraction strategy.

A reliable table extraction pipeline needs four stages: detection (finding tables on the page), structure recognition (identifying rows, columns, and cell boundaries), content extraction (reading the text in each cell), and output formatting (producing clean structured data).
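These four stages compose naturally as a chain of callables. A minimal skeleton of that contract (names are illustrative, not from any library):

```python
def run_pipeline(page, detect, recognize, extract, format_output):
    """Chain the four stages: each stage consumes the previous stage's output."""
    tables = []
    for region in detect(page):              # Stage 1: find table regions
        grid = recognize(page, region)       # Stage 2: rows, columns, cells
        cells = extract(page, region, grid)  # Stage 3: per-cell text
        tables.append(format_output(cells))  # Stage 4: structured output
    return tables
```

Each concrete function in the sections below slots into one of these four roles.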

Setting Up the Pipeline

pip install "camelot-py[cv]" tabula-py pdfplumber img2table opencv-python-headless pandas pytesseract pillow

For image-based table extraction, you also need the Tesseract binary installed on your system (on Debian/Ubuntu, `apt install tesseract-ocr`); the `pytesseract` package is only a wrapper around it.

Stage 1: Table Detection

The first step is locating tables within a document. For PDFs with embedded structure, pdfplumber excels:

import pdfplumber
from dataclasses import dataclass


@dataclass
class DetectedTable:
    page_number: int
    bbox: tuple  # (x0, y0, x1, y1)
    row_count: int
    col_count: int
    source: str  # "native" or "image"


def detect_tables_native(pdf_path: str) -> list[DetectedTable]:
    """Detect tables in PDFs with embedded structure."""
    detected = []

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.find_tables()
            for table in tables:
                rows = table.extract()
                if rows and len(rows) > 1:
                    detected.append(DetectedTable(
                        page_number=i + 1,
                        bbox=table.bbox,
                        row_count=len(rows),
                        col_count=max(len(r) for r in rows),
                        source="native",
                    ))

    return detected

For scanned documents where tables exist only as images, use contour-based detection:

import cv2
import numpy as np


def detect_tables_in_image(image_path: str) -> list[dict]:
    """Detect table regions in scanned document images."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 15, 5
    )

    # Detect horizontal lines
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)

    # Detect vertical lines
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

    # Combine to find grid intersections
    table_mask = cv2.add(h_lines, v_lines)

    contours, _ = cv2.findContours(
        table_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )

    tables = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 100 and h > 50:  # Filter noise
            tables.append({
                "bbox": (x, y, x + w, y + h),
                "area": w * h,
            })

    return sorted(tables, key=lambda t: t["area"], reverse=True)

Stage 2: Structure Recognition

Once a table region is identified, the next step is figuring out the row-column structure:

def extract_grid_structure(
    binary_image: np.ndarray,
    bbox: tuple
) -> dict:
    """Identify row and column boundaries within a table region."""
    x0, y0, x1, y1 = bbox
    table_region = binary_image[y0:y1, x0:x1]

    # Project horizontally to find row boundaries
    h_projection = np.sum(table_region, axis=1)
    row_boundaries = find_boundaries(h_projection, axis="horizontal")

    # Project vertically to find column boundaries
    v_projection = np.sum(table_region, axis=0)
    col_boundaries = find_boundaries(v_projection, axis="vertical")

    return {
        "rows": row_boundaries,
        "cols": col_boundaries,
        "cell_count": (len(row_boundaries) - 1) * (len(col_boundaries) - 1),
    }


def find_boundaries(projection: np.ndarray, axis: str) -> list[int]:
    """Find row or column boundaries from a pixel projection profile.

    The `axis` argument is informational only, for readability at call sites.
    """
    threshold = np.max(projection) * 0.3
    in_gap = True
    boundaries = [0]

    for i, val in enumerate(projection):
        if in_gap and val > threshold:
            if i != boundaries[-1]:  # avoid a duplicate 0 when content starts at pixel 0
                boundaries.append(i)
            in_gap = False
        elif not in_gap and val <= threshold:
            in_gap = True

    boundaries.append(len(projection))
    return boundaries
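For intuition about the projection profile, here is a toy example of the same threshold-crossing idea on a synthetic profile (standalone, so the helper is re-stated inline):

```python
import numpy as np


def boundaries_from_profile(profile: np.ndarray, frac: float = 0.3) -> list[int]:
    """Threshold-crossing boundary finder, as in find_boundaries above."""
    threshold = profile.max() * frac
    in_gap, cuts = True, [0]
    for i, v in enumerate(profile):
        if in_gap and v > threshold:
            cuts.append(i)
            in_gap = False
        elif not in_gap and v <= threshold:
            in_gap = True
    cuts.append(len(profile))
    return cuts


# Two ink bands separated by a gap of low-sum pixels
profile = np.array([0, 0, 10, 10, 0, 0, 10, 10, 0, 0])
print(boundaries_from_profile(profile))  # [0, 2, 6, 10]
```

The two bands produce cuts at indices 2 and 6, plus the implicit outer edges at 0 and 10.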

Stage 3: Cell Content Extraction

With the grid structure known, extract text from each cell using OCR:

import pytesseract
from PIL import Image


def extract_cell_contents(
    image: np.ndarray,
    rows: list[int],
    cols: list[int],
    table_offset: tuple
) -> list[list[str]]:
    """Extract text from each cell in the detected grid."""
    ox, oy = table_offset[0], table_offset[1]
    table_data = []

    for r in range(len(rows) - 1):
        row_data = []
        for c in range(len(cols) - 1):
            cell = image[
                oy + rows[r]:oy + rows[r + 1],
                ox + cols[c]:ox + cols[c + 1]
            ]

            cell_pil = Image.fromarray(cell)
            text = pytesseract.image_to_string(
                cell_pil, config="--psm 6"
            ).strip()

            row_data.append(text)
        table_data.append(row_data)

    return table_data

Stage 4: Output Formatting

Convert the extracted data to a clean DataFrame, treating the first row as a header when `has_header` is set:

import pandas as pd


def table_to_dataframe(
    raw_data: list[list[str]],
    has_header: bool = True
) -> pd.DataFrame:
    """Convert extracted table data to a pandas DataFrame."""
    if not raw_data:
        return pd.DataFrame()

    if has_header:
        headers = [
            cell.replace("\n", " ").strip()
            for cell in raw_data[0]
        ]
        df = pd.DataFrame(raw_data[1:], columns=headers)
    else:
        df = pd.DataFrame(raw_data)

    # Normalize whitespace, then drop columns that are entirely empty.
    # OCR yields empty strings rather than NaN, so map "" to NA first;
    # otherwise dropna(how="all") never fires.
    df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
    df = df.replace("", pd.NA).dropna(axis=1, how="all").fillna("")

    return df


def export_tables(tables: list[pd.DataFrame], output_dir: str):
    """Export extracted tables to CSV files."""
    for i, df in enumerate(tables):
        path = f"{output_dir}/table_{i + 1}.csv"
        df.to_csv(path, index=False)
        print(f"Exported {len(df)} rows to {path}")
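A quick standalone check of the cleanup behavior on noisy OCR output. Note that empty strings are mapped to `NA` before dropping columns, because `dropna` alone ignores empty strings:

```python
import pandas as pd

raw = [
    ["Name\n", " Qty ", ""],
    ["Widget", "4", ""],
    ["Gadget ", "7", ""],
]
headers = [c.replace("\n", " ").strip() for c in raw[0]]
df = pd.DataFrame(raw[1:], columns=headers)
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
df = df.replace("", pd.NA).dropna(axis=1, how="all").fillna("")
print(list(df.columns), df.shape)  # ['Name', 'Qty'] (2, 2)
```

The all-empty third column is dropped, headers lose their embedded newlines, and cell whitespace is stripped.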

Combining Native and Image Pipelines

A robust agent should automatically choose the right extraction strategy:

def extract_tables_auto(pdf_path: str) -> list[pd.DataFrame]:
    """Automatically select the best extraction method."""
    native_tables = detect_tables_native(pdf_path)

    if native_tables:
        # Use pdfplumber for native PDF tables
        results = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                for table in page.find_tables():
                    rows = table.extract()
                    if rows:
                        results.append(table_to_dataframe(rows))
        return results
    else:
        # Fall back to image-based extraction: extract_tables_from_images
        # (not shown) rasterizes each page and runs the Stage 1-4 pipeline.
        print("No native tables found, using image-based extraction")
        return extract_tables_from_images(pdf_path)

FAQ

How do I handle merged cells in tables?

Merged cells are one of the hardest problems in table extraction. When a cell spans multiple rows or columns, the grid structure becomes irregular. The best approach is to detect merged cells by looking for cells where the boundary lines are absent, then use spanning metadata to reconstruct the logical structure. Libraries like img2table handle this better than raw contour detection.

What accuracy can I expect from table extraction?

On clean, well-formatted tables with clear gridlines, extraction accuracy typically reaches 95%+ for both structure and content. Borderless tables drop to 70-85% accuracy because column alignment must be inferred from whitespace. Always validate extracted data by checking row/column counts against expectations and flagging anomalies.

Can this pipeline handle tables that span multiple pages?

Yes, but it requires additional logic to detect continuation tables. Look for tables that start at the top of a page without a header row, or tables on consecutive pages with matching column counts and widths. Merge them by concatenating rows and deduplicating any repeated header rows.
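The continuation-merge heuristic described above can be sketched as a hypothetical helper. It assumes tables are already extracted as DataFrames with string cells, and uses identical column names as the match criterion (the answer also suggests comparing column widths, omitted here):

```python
import pandas as pd


def merge_continuation(tables: list[pd.DataFrame]) -> list[pd.DataFrame]:
    """Merge consecutive tables with identical columns, dropping repeated headers."""
    merged: list[pd.DataFrame] = []
    for df in tables:
        if merged and list(df.columns) == list(merged[-1].columns):
            # A continuation table often repeats the header as its first data row
            if len(df) and list(df.iloc[0]) == list(df.columns):
                df = df.iloc[1:]
            merged[-1] = pd.concat([merged[-1], df], ignore_index=True)
        else:
            merged.append(df)
    return merged
```

Feeding it per-page results in reading order collapses split tables while leaving unrelated tables untouched.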


#TableExtraction #PDFProcessing #DataPipelines #DocumentAI #ComputerVision #OCR #Python #AgenticAI
