---
title: "PDF Processing Agent: Extracting Text, Tables, and Charts from Documents"
description: "Build a PDF processing agent that extracts text, tables, and charts from documents using Python. Covers page-level parsing, table detection with pdfplumber, chart analysis with vision models, and structured output generation."
canonical: https://callsphere.ai/blog/pdf-processing-agent-extracting-text-tables-charts-documents
category: "Learn Agentic AI"
tags: ["PDF Processing", "Document AI", "Table Extraction", "Chart Analysis", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:45.052Z
---

# PDF Processing Agent: Extracting Text, Tables, and Charts from Documents

> Build a PDF processing agent that extracts text, tables, and charts from documents using Python. Covers page-level parsing, table detection with pdfplumber, chart analysis with vision models, and structured output generation.

## The Challenge of PDF Processing

PDFs are the most common format for business documents, yet they are notoriously difficult to process programmatically. A single PDF might contain flowing paragraphs, multi-column layouts, embedded tables, charts rendered as vector graphics, and scanned images of handwritten notes. An effective PDF processing agent must detect and handle each of these content types with the right tool.

## Architecture of a PDF Processing Agent

The agent follows a three-stage pipeline:

```mermaid
flowchart LR
    PDF(["PDF or image"])
    OCR["OCR plus layout
LayoutLM or Donut"]
    DETECT["Table detector
bounding boxes"]
    STRUCT["Cell structure
rows and columns"]
    LLM["LLM normalization
headers and types"]
    VAL["Schema validation
Pydantic"]
    DB[(Structured store)]
    OUT(["Clean rows"])
    PDF --> OCR --> DETECT --> STRUCT --> LLM --> VAL --> DB --> OUT
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style VAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
```

1. **Page extraction** — convert each page to both text and image representations
2. **Content classification** — determine what type of content each page region contains
3. **Specialized extraction** — apply the right tool to each content type
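Before wiring in real extractors, the three stages can be sketched as a dispatch loop. Everything below (the `Region` type, the handler map) is an illustrative scaffold, not part of any library; the concrete implementations follow in the rest of the article:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Region:
    page: int
    kind: str       # "text", "table", or "chart"
    payload: object

def run_pipeline(
    pdf_path: str,
    extract: Callable[[str], list[Region]],
    handlers: dict[str, Callable[[Region], object]],
) -> list[object]:
    """Stage 1 extracts regions, stage 2's classification is carried
    in Region.kind, and stage 3 dispatches each region to the
    specialized handler registered for its content type."""
    results = []
    for region in extract(pdf_path):
        handler = handlers.get(region.kind)
        if handler is not None:
            results.append(handler(region))
    return results
```

Injecting the stage functions keeps the pipeline testable with stubs before any PDF library is involved.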

Install the required dependencies:

```bash
pip install pdfplumber pymupdf pillow openai
```

## Stage 1: Page Extraction

Start by extracting both text and rendered images from each page. Having both representations lets the agent fall back to vision-based analysis when text extraction fails:

```python
import pdfplumber
import fitz  # PyMuPDF
from dataclasses import dataclass, field
from PIL import Image
import io

@dataclass
class PageContent:
    page_number: int
    raw_text: str
    image: Image.Image
    tables: list[list[list[str]]] = field(default_factory=list)
    has_charts: bool = False

def extract_pages(pdf_path: str) -> list[PageContent]:
    """Extract text and images from every page of a PDF."""
    pages = []

    # Use PyMuPDF for page images
    doc = fitz.open(pdf_path)

    # Keep the pdfplumber file open while extracting: its page
    # objects read lazily from the underlying file handle, so
    # they must not outlive the `with` block
    with pdfplumber.open(pdf_path) as pdf:
        for i, plumber_page in enumerate(pdf.pages):
            # Extract raw text
            raw_text = plumber_page.extract_text() or ""

            # Extract tables, dropping empty rows and None cells
            tables = plumber_page.extract_tables() or []
            cleaned_tables = []
            for table in tables:
                cleaned = [
                    [cell or "" for cell in row]
                    for row in table
                    if any(cell for cell in row)
                ]
                if cleaned:
                    cleaned_tables.append(cleaned)

            # Render page as image
            mupdf_page = doc[i]
            mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for clarity
            pix = mupdf_page.get_pixmap(matrix=mat)
            img = Image.open(io.BytesIO(pix.tobytes("png")))

            pages.append(PageContent(
                page_number=i + 1,
                raw_text=raw_text,
                image=img,
                tables=cleaned_tables,
            ))

    doc.close()
    return pages
```

## Stage 2: Detecting Charts and Visual Elements

Tables are extracted directly by pdfplumber, but charts — bar graphs, pie charts, line plots — are rendered as graphics with no extractable text. Detect likely chart pages with a simple text-density heuristic (a page with little text outside of its tables probably contains figures), then send those pages to a vision model to recover the underlying data:

```python
import base64

def detect_charts(page: PageContent) -> bool:
    """Heuristic: a page likely has charts if it has
    little text but significant visual content."""
    text_density = len(page.raw_text.strip())
    # Subtract text already accounted for by extracted tables
    if page.tables:
        text_in_tables = sum(
            len(cell)
            for table in page.tables
            for row in table
            for cell in row
        )
        non_table_text = text_density - text_in_tables
    else:
        non_table_text = text_density

    # If the page has very little non-table text, it likely
    # contains charts or figures (the threshold is a starting
    # point; tune it for your documents)
    return non_table_text < 200

async def analyze_chart(img: Image.Image, client) -> dict:
    """Use GPT-4o to extract data from a chart image."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Analyze this chart. Return a JSON object with: "
                        "chart_type, title, x_axis_label, y_axis_label, "
                        "and data_points as a list of {label, value} objects."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }],
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(response.choices[0].message.content)
```
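Vision-model output is not guaranteed to match the requested schema, so it is worth checking the parsed JSON before trusting it downstream. A minimal stdlib validator (the key names mirror the prompt above; a production system might use Pydantic instead, as the diagram suggests):

```python
REQUIRED_KEYS = {"chart_type", "title", "x_axis_label", "y_axis_label", "data_points"}

def validate_chart_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - payload.keys())]
    points = payload.get("data_points")
    if not isinstance(points, list):
        problems.append("data_points is not a list")
    else:
        for i, point in enumerate(points):
            if not isinstance(point, dict) or not {"label", "value"} <= point.keys():
                problems.append(f"data_points[{i}] lacks label/value")
    return problems
```

A non-empty problem list is a natural trigger for a retry with a corrective prompt, rather than letting malformed chart data leak into answers.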

## Stage 3: The PDF Agent

Combine everything into an agent that answers questions about PDF content:

```python
import openai

class PDFProcessingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.pages: list[PageContent] = []

    def load(self, pdf_path: str) -> int:
        """Load a PDF and return the page count."""
        self.pages = extract_pages(pdf_path)
        for page in self.pages:
            page.has_charts = detect_charts(page)
        return len(self.pages)

    def _format_tables(self, tables: list[list[list[str]]]) -> str:
        """Convert tables to markdown format."""
        parts = []
        for table in tables:
            if not table:
                continue
            header = "| " + " | ".join(table[0]) + " |"
            sep = "| " + " | ".join("---" for _ in table[0]) + " |"
            rows = [
                "| " + " | ".join(row) + " |"
                for row in table[1:]
            ]
            parts.append("\n".join([header, sep] + rows))
        return "\n\n".join(parts)

    async def query(self, question: str) -> str:
        """Answer a question about the loaded PDF."""
        context_parts = []
        for page in self.pages:
            parts = [f"--- Page {page.page_number} ---"]
            if page.raw_text.strip():
                parts.append(page.raw_text.strip())
            if page.tables:
                parts.append(
                    "Tables:\n" + self._format_tables(page.tables)
                )
            if page.has_charts:
                chart_data = await analyze_chart(
                    page.image, self.client
                )
                parts.append(f"Chart data: {chart_data}")
            context_parts.append("\n".join(parts))

        full_context = "\n\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a document analysis agent. Answer "
                        "questions based on the extracted PDF content."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        f"Document content:\n{full_context}\n\n"
                        f"Question: {question}"
                    ),
                },
            ],
        )
        return response.choices[0].message.content
```

## Usage Example

```python
import asyncio

async def main():
    agent = PDFProcessingAgent()
    page_count = agent.load("quarterly_report.pdf")
    print(f"Loaded {page_count} pages")

    answer = await agent.query(
        "What was the revenue growth rate in Q3?"
    )
    print(answer)

asyncio.run(main())
```

## FAQ

### How do I handle scanned PDFs with no extractable text?

For scanned PDFs, pdfplumber returns empty text. In that case, fall back to OCR by running Tesseract on the rendered page image. Add a check in the extraction stage: if `raw_text` is empty or very short, apply `pytesseract.image_to_string(page.image)` and use that as the text content.
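A sketch of that fallback, with the OCR engine injected as a callable so the logic stays testable without Tesseract installed. In production you would pass `pytesseract.image_to_string`; the `min_chars` threshold is an illustrative choice, not a pytesseract parameter:

```python
from typing import Callable

def text_with_ocr_fallback(
    raw_text: str,
    page_image,
    ocr: Callable[[object], str],
    min_chars: int = 20,  # illustrative threshold; tune per corpus
) -> str:
    """Use the PDF text layer when present; otherwise OCR the page image."""
    if len(raw_text.strip()) >= min_chars:
        return raw_text
    return ocr(page_image)
```

Wiring it into `extract_pages` only requires routing the `raw_text` assignment through this helper.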

### What is the best approach for extracting complex nested tables?

pdfplumber handles simple tables well but struggles with merged cells, nested headers, and spanning rows. For complex tables, send the page image to GPT-4o with a prompt asking it to extract the table as a JSON array. The vision model understands visual table structure better than rule-based parsers for complex layouts.
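One way to structure that request is to reuse the multimodal message format from `analyze_chart`. The prompt wording and the `rows` JSON shape below are illustrative choices, not an OpenAI requirement:

```python
def build_table_extraction_messages(b64_png: str) -> list[dict]:
    """Chat messages asking a vision model to transcribe a table image
    as JSON rows, resolving merged cells by repeating their values."""
    prompt = (
        "Extract the table in this image as a JSON object with a 'rows' "
        "key: a list of objects keyed by the column headers. If a cell "
        "spans multiple rows or columns, repeat its value in each "
        "covered cell."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_png}"}},
        ],
    }]
```

Pass the result to `client.chat.completions.create` with `response_format={"type": "json_object"}`, exactly as the chart path does, then validate the parsed rows before use.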

### How do I process very large PDFs without running out of memory?

Process pages in batches rather than loading the entire document at once. Modify `extract_pages` to yield pages lazily using a generator. For the agent query step, first identify which pages are relevant to the question using a lightweight text search or embedding-based retrieval, then only process those pages in detail.
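The relevance step can start as plain keyword overlap before reaching for embeddings. The scoring scheme below is a naive illustration (no stemming or punctuation handling), but it is often enough to cut a long report down to a handful of candidate pages:

```python
def rank_pages(question: str, page_texts: list[str], top_k: int = 3) -> list[int]:
    """Return indices of the top_k pages sharing the most terms
    with the question, ignoring very short words."""
    q_terms = {t for t in question.lower().split() if len(t) > 2}
    scores = []
    for idx, text in enumerate(page_texts):
        page_terms = set(text.lower().split())
        scores.append((len(q_terms & page_terms), idx))
    # Highest overlap first; stable on page order for ties
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for score, idx in scores[:top_k] if score > 0]
```

Only the pages this returns would then go through the expensive chart-analysis and LLM steps.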

---

#PDFProcessing #DocumentAI #TableExtraction #ChartAnalysis #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/pdf-processing-agent-extracting-text-tables-charts-documents
