Learn Agentic AI

Building a Diagram Understanding Agent: Flowcharts, Architecture Diagrams, and Charts

Create an AI agent that classifies diagram types, extracts elements and relationships from flowcharts and architecture diagrams, and converts visual diagrams into structured data and code representations.

Why Diagram Understanding Is Valuable

Technical documentation is full of diagrams — flowcharts describing business processes, architecture diagrams showing system components, sequence diagrams illustrating API interactions, and data flow charts mapping pipelines. An agent that can read and understand these diagrams can answer questions about system architecture, generate code from flowcharts, identify missing components, and convert visual documentation into machine-readable formats.

Diagram Classification

The first step is identifying what type of diagram the agent is looking at, because each type requires a different extraction strategy:

import openai
import base64
from pydantic import BaseModel
from enum import Enum


class DiagramType(str, Enum):
    FLOWCHART = "flowchart"
    ARCHITECTURE = "architecture"
    SEQUENCE = "sequence"
    ER_DIAGRAM = "er_diagram"
    DATA_FLOW = "data_flow"
    ORG_CHART = "org_chart"
    CHART = "chart"  # bar, line, pie
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):
    diagram_type: DiagramType
    confidence: float
    description: str


async def classify_diagram(
    image_bytes: bytes, client: openai.AsyncOpenAI
) -> DiagramClassification:
    """Classify the type of diagram in an image."""
    b64 = base64.b64encode(image_bytes).decode()

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this diagram. Identify the type, "
                    "your confidence level (0-1), and a brief "
                    "description of what the diagram shows."
                ),
            },
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramClassification,
    )
    return response.choices[0].message.parsed
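Vision classifications are not always trustworthy, so it can help to gate low-confidence results before choosing an extraction strategy. A minimal sketch of such a gate — the `gate_classification` helper and the 0.6 threshold are assumptions, not part of any API above (the model definitions mirror the ones in this article):

```python
from enum import Enum

from pydantic import BaseModel


class DiagramType(str, Enum):  # mirrors the enum defined above
    FLOWCHART = "flowchart"
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):  # mirrors the model above
    diagram_type: DiagramType
    confidence: float
    description: str


def gate_classification(
    c: DiagramClassification, threshold: float = 0.6
) -> DiagramClassification:
    """Fall back to UNKNOWN when the model is unsure, so the
    extraction step uses the generic prompt instead of a wrong
    type-specific hint."""
    if c.confidence >= threshold:
        return c
    return c.model_copy(update={"diagram_type": DiagramType.UNKNOWN})
```

With this gate in front of `extract_structure`, a 0.4-confidence "flowchart" classification is treated as `UNKNOWN` and the extraction falls back to the generic prompt rather than a misleading type-specific one.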

Extracting Elements and Relationships

Once classified, extract the structural components. For flowcharts, this means nodes and edges. For architecture diagrams, it means components and connections:

class DiagramNode(BaseModel):
    id: str
    label: str
    node_type: str  # process, decision, start, end, component
    properties: dict = {}


class DiagramEdge(BaseModel):
    source_id: str
    target_id: str
    label: str = ""
    edge_type: str = "directed"  # directed, bidirectional


class DiagramStructure(BaseModel):
    nodes: list[DiagramNode]
    edges: list[DiagramEdge]
    title: str = ""
    notes: list[str] = []


async def extract_structure(
    image_bytes: bytes,
    diagram_type: DiagramType,
    client: openai.AsyncOpenAI,
) -> DiagramStructure:
    """Extract nodes and edges from a diagram."""
    b64 = base64.b64encode(image_bytes).decode()

    type_hints = {
        DiagramType.FLOWCHART: (
            "This is a flowchart. Extract all process steps, "
            "decision points, start/end nodes, and the arrows "
            "connecting them. Use node types: process, decision, "
            "start, end, subprocess."
        ),
        DiagramType.ARCHITECTURE: (
            "This is an architecture diagram. Extract all system "
            "components (services, databases, queues, load "
            "balancers, etc.) and their connections. Use node "
            "types: service, database, queue, cache, gateway, "
            "client, external."
        ),
        DiagramType.SEQUENCE: (
            "This is a sequence diagram. Extract all participants "
            "as nodes and messages as edges in chronological order."
        ),
    }

    hint = type_hints.get(
        diagram_type,
        "Extract all elements and their relationships.",
    )

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": hint},
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramStructure,
    )
    return response.choices[0].message.parsed
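Once extracted, a `DiagramStructure` is often handier as a graph you can query. A small sketch that builds an adjacency map from `(source_id, target_id)` pairs such as those on `structure.edges` — the helper names here are assumptions for illustration:

```python
from collections import defaultdict


def build_adjacency(
    edges: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Map each source node id to the ids it points at."""
    adj: dict[str, list[str]] = defaultdict(list)
    for source_id, target_id in edges:
        adj[source_id].append(target_id)
    return dict(adj)


def successors(adj: dict[str, list[str]], node_id: str) -> list[str]:
    """Nodes directly reachable from node_id."""
    return adj.get(node_id, [])
```

Calling `build_adjacency([(e.source_id, e.target_id) for e in structure.edges])` gives cheap answers to questions like "which steps follow the validation node?" without another model call.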

Converting Diagrams to Code

One of the most powerful capabilities is converting a visual diagram into executable code or infrastructure-as-code:


async def diagram_to_mermaid(
    structure: DiagramStructure,
    diagram_type: DiagramType,
) -> str:
    """Convert extracted diagram structure to Mermaid syntax."""
    if diagram_type == DiagramType.FLOWCHART:
        lines = ["flowchart TD"]
        for node in structure.nodes:
            shape = {
                "decision": f"{node.id}{{{node.label}}}",
                "start": f"{node.id}([{node.label}])",
                "end": f"{node.id}([{node.label}])",
                "process": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {shape}")

        for edge in structure.edges:
            if edge.label:
                lines.append(
                    f"    {edge.source_id} -->|{edge.label}| "
                    f"{edge.target_id}"
                )
            else:
                lines.append(
                    f"    {edge.source_id} --> {edge.target_id}"
                )

        return "\n".join(lines)

    elif diagram_type == DiagramType.ARCHITECTURE:
        lines = ["flowchart LR"]
        for node in structure.nodes:
            icon = {
                "database": f"{node.id}[({node.label})]",
                "queue": f"{node.id}>{node.label}]",
                "service": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {icon}")

        for edge in structure.edges:
            arrow = (
                " <--> " if edge.edge_type == "bidirectional"
                else " --> "
            )
            lines.append(
                f"    {edge.source_id}{arrow}{edge.target_id}"
            )

        return "\n".join(lines)

    return "# Unsupported diagram type for Mermaid conversion"
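One practical pitfall with the conversion above: extracted labels often contain characters that are significant in Mermaid syntax (brackets, braces, pipes, quotes), which silently produce unparseable output. A defensive sketch — `sanitize_label` is a hypothetical helper, not part of Mermaid or the code above:

```python
def sanitize_label(label: str) -> str:
    """Make a free-text label safe inside Mermaid node shapes by
    stripping delimiter characters and collapsing whitespace."""
    for ch in '[]{}()|"<>':
        label = label.replace(ch, " ")
    return " ".join(label.split())
```

Applying it when formatting node shapes, e.g. `f"{node.id}[{sanitize_label(node.label)}]"`, keeps labels like "Check input [v2]" from breaking the generated diagram.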

The Diagram Agent

class DiagramUnderstandingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def analyze(self, image_bytes: bytes) -> dict:
        classification = await classify_diagram(
            image_bytes, self.client
        )
        structure = await extract_structure(
            image_bytes, classification.diagram_type, self.client
        )
        mermaid = await diagram_to_mermaid(
            structure, classification.diagram_type
        )

        return {
            "type": classification.diagram_type.value,
            "description": classification.description,
            "nodes": len(structure.nodes),
            "edges": len(structure.edges),
            "structure": structure.model_dump(),
            "mermaid_code": mermaid,
        }

    async def ask(
        self, image_bytes: bytes, question: str
    ) -> str:
        b64 = base64.b64encode(image_bytes).decode()
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            }],
        )
        return response.choices[0].message.content
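The dict returned by `analyze()` is plain data, so it is easy to persist for later rendering or diffing. A sketch that writes the Mermaid source and the structured dump side by side — the `persist_analysis` helper and its file-naming scheme are arbitrary choices, not part of the agent above:

```python
import json
from pathlib import Path


def persist_analysis(result: dict, out_dir: str, name: str) -> None:
    """Write <name>.mmd (Mermaid source) and <name>.json (the rest
    of the analysis) so the diagram can be re-rendered or diffed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{name}.mmd").write_text(result["mermaid_code"])
    payload = {k: v for k, v in result.items() if k != "mermaid_code"}
    (out / f"{name}.json").write_text(json.dumps(payload, indent=2))
```

The `.mmd` file can be rendered directly with the Mermaid CLI or any Mermaid-aware viewer, which is also the basis of the visual validation approach discussed in the FAQ below.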

FAQ

How accurate is GPT-4o at extracting diagram structures compared to dedicated diagram parsers?

For clean, well-formatted diagrams, GPT-4o extracts nodes and edges with approximately 90% accuracy. It excels at understanding context and labels but can miss precise spatial relationships in dense diagrams. Dedicated parsers like those in draw.io or Lucidchart have access to the underlying XML and achieve near-perfect accuracy on their own formats. Use vision models when you only have a screenshot or image of the diagram.

Can this agent handle hand-drawn diagrams on whiteboards?

Yes, with reduced accuracy. GPT-4o can interpret hand-drawn flowcharts and architecture sketches, identifying boxes, arrows, and labels even when the drawing is rough. For best results, ensure the whiteboard photo has good lighting, minimal glare, and the handwriting is reasonably legible. The classification step still works well because the overall layout patterns — boxes connected by arrows — are recognizable regardless of drawing quality.

How do I validate that the extracted structure is correct?

Convert the extracted structure to Mermaid or Graphviz and render it visually. Compare the rendered output against the original diagram. You can also automate validation by checking that every node has at least one edge (no orphan nodes), decision nodes have exactly two outgoing edges, and start nodes have no incoming edges. These structural constraints catch most extraction errors.
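The structural checks described above can be run directly against the extracted nodes and edges. A sketch using `(id, type)` and `(source, target)` pairs pulled from a `DiagramStructure` — the `validate_flowchart` name is an assumption:

```python
def validate_flowchart(
    nodes: list[tuple[str, str]],  # (node_id, node_type)
    edges: list[tuple[str, str]],  # (source_id, target_id)
) -> list[str]:
    """Return human-readable structural problems: orphan nodes,
    decision nodes without exactly two branches, and start nodes
    with incoming edges."""
    connected = {s for s, _ in edges} | {t for _, t in edges}
    out_degree: dict[str, int] = {}
    in_degree: dict[str, int] = {}
    for s, t in edges:
        out_degree[s] = out_degree.get(s, 0) + 1
        in_degree[t] = in_degree.get(t, 0) + 1

    problems = []
    for node_id, node_type in nodes:
        if node_id not in connected:
            problems.append(f"orphan node: {node_id}")
        if node_type == "decision" and out_degree.get(node_id, 0) != 2:
            problems.append(
                f"decision {node_id} should have 2 outgoing edges"
            )
        if node_type == "start" and in_degree.get(node_id, 0) > 0:
            problems.append(f"start node {node_id} has incoming edges")
    return problems
```

Running it on `[(n.id, n.node_type) for n in structure.nodes]` and `[(e.source_id, e.target_id) for e in structure.edges]` flags the most common extraction mistakes before you trust the structure downstream.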


#DiagramAnalysis #Flowcharts #ArchitectureDiagrams #VisualUnderstanding #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.