Learn Agentic AI

Building a Diagram Understanding Agent: Flowcharts, Architecture Diagrams, and Charts

Create an AI agent that classifies diagram types, extracts elements and relationships from flowcharts and architecture diagrams, and converts visual diagrams into structured data and code representations.

Why Diagram Understanding Is Valuable

Technical documentation is full of diagrams — flowcharts describing business processes, architecture diagrams showing system components, sequence diagrams illustrating API interactions, and data flow charts mapping pipelines. An agent that can read and understand these diagrams can answer questions about system architecture, generate code from flowcharts, identify missing components, and convert visual documentation into machine-readable formats.

Diagram Classification

The first step is identifying what type of diagram the agent is looking at, because each type requires a different extraction strategy:

import openai
import base64
from pydantic import BaseModel
from enum import Enum


class DiagramType(str, Enum):
    FLOWCHART = "flowchart"
    ARCHITECTURE = "architecture"
    SEQUENCE = "sequence"
    ER_DIAGRAM = "er_diagram"
    DATA_FLOW = "data_flow"
    ORG_CHART = "org_chart"
    CHART = "chart"  # bar, line, pie
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):
    diagram_type: DiagramType
    confidence: float
    description: str


async def classify_diagram(
    image_bytes: bytes, client: openai.AsyncOpenAI
) -> DiagramClassification:
    """Classify the type of diagram in an image."""
    b64 = base64.b64encode(image_bytes).decode()

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this diagram. Identify the type, "
                    "your confidence level (0-1), and a brief "
                    "description of what the diagram shows."
                ),
            },
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramClassification,
    )
    return response.choices[0].message.parsed
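Vision classifications are not always trustworthy, so it can help to gate low-confidence results before choosing an extraction strategy. A minimal sketch of such a gate — the `gate_classification` helper and the 0.6 threshold are assumptions, not part of any API above (the model definitions mirror the ones in this article):

```python
from enum import Enum

from pydantic import BaseModel


class DiagramType(str, Enum):  # mirrors the enum defined above
    FLOWCHART = "flowchart"
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):  # mirrors the model above
    diagram_type: DiagramType
    confidence: float
    description: str


def gate_classification(
    c: DiagramClassification, threshold: float = 0.6
) -> DiagramClassification:
    """Fall back to UNKNOWN when the model is unsure, so the
    extraction step uses the generic prompt instead of a wrong
    type-specific hint."""
    if c.confidence >= threshold:
        return c
    return c.model_copy(update={"diagram_type": DiagramType.UNKNOWN})
```

With this gate in front of `extract_structure`, a 0.4-confidence "flowchart" classification is treated as `UNKNOWN` and the extraction falls back to the generic prompt rather than a misleading type-specific one.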

Extracting Elements and Relationships

Once classified, extract the structural components. For flowcharts, this means nodes and edges. For architecture diagrams, it means components and connections:

class DiagramNode(BaseModel):
    id: str
    label: str
    node_type: str  # process, decision, start, end, component
    properties: dict = {}


class DiagramEdge(BaseModel):
    source_id: str
    target_id: str
    label: str = ""
    edge_type: str = "directed"  # directed, bidirectional


class DiagramStructure(BaseModel):
    nodes: list[DiagramNode]
    edges: list[DiagramEdge]
    title: str = ""
    notes: list[str] = []


async def extract_structure(
    image_bytes: bytes,
    diagram_type: DiagramType,
    client: openai.AsyncOpenAI,
) -> DiagramStructure:
    """Extract nodes and edges from a diagram."""
    b64 = base64.b64encode(image_bytes).decode()

    type_hints = {
        DiagramType.FLOWCHART: (
            "This is a flowchart. Extract all process steps, "
            "decision points, start/end nodes, and the arrows "
            "connecting them. Use node types: process, decision, "
            "start, end, subprocess."
        ),
        DiagramType.ARCHITECTURE: (
            "This is an architecture diagram. Extract all system "
            "components (services, databases, queues, load "
            "balancers, etc.) and their connections. Use node "
            "types: service, database, queue, cache, gateway, "
            "client, external."
        ),
        DiagramType.SEQUENCE: (
            "This is a sequence diagram. Extract all participants "
            "as nodes and messages as edges in chronological order."
        ),
    }

    hint = type_hints.get(
        diagram_type,
        "Extract all elements and their relationships.",
    )

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": hint},
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramStructure,
    )
    return response.choices[0].message.parsed
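Once extracted, a `DiagramStructure` is often handier as a graph you can query. A small sketch that builds an adjacency map from `(source_id, target_id)` pairs such as those on `structure.edges` — the helper names here are assumptions for illustration:

```python
from collections import defaultdict


def build_adjacency(
    edges: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Map each source node id to the ids it points at."""
    adj: dict[str, list[str]] = defaultdict(list)
    for source_id, target_id in edges:
        adj[source_id].append(target_id)
    return dict(adj)


def successors(adj: dict[str, list[str]], node_id: str) -> list[str]:
    """Nodes directly reachable from node_id."""
    return adj.get(node_id, [])
```

Calling `build_adjacency([(e.source_id, e.target_id) for e in structure.edges])` gives cheap answers to questions like "which steps follow the validation node?" without another model call.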

Converting Diagrams to Code

One of the most powerful capabilities is converting a visual diagram into executable code or infrastructure-as-code:


async def diagram_to_mermaid(
    structure: DiagramStructure,
    diagram_type: DiagramType,
) -> str:
    """Convert extracted diagram structure to Mermaid syntax."""
    if diagram_type == DiagramType.FLOWCHART:
        lines = ["flowchart TD"]
        for node in structure.nodes:
            shape = {
                "decision": f"{node.id}{{{node.label}}}",
                "start": f"{node.id}([{node.label}])",
                "end": f"{node.id}([{node.label}])",
                "process": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {shape}")

        for edge in structure.edges:
            if edge.label:
                lines.append(
                    f"    {edge.source_id} -->|{edge.label}| "
                    f"{edge.target_id}"
                )
            else:
                lines.append(
                    f"    {edge.source_id} --> {edge.target_id}"
                )

        return "\n".join(lines)

    elif diagram_type == DiagramType.ARCHITECTURE:
        lines = ["flowchart LR"]
        for node in structure.nodes:
            icon = {
                "database": f"{node.id}[({node.label})]",
                "queue": f"{node.id}>{node.label}]",
                "service": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {icon}")

        for edge in structure.edges:
            arrow = (
                " <--> " if edge.edge_type == "bidirectional"
                else " --> "
            )
            lines.append(
                f"    {edge.source_id}{arrow}{edge.target_id}"
            )

        return "\n".join(lines)

    return "# Unsupported diagram type for Mermaid conversion"
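One practical pitfall with the conversion above: extracted labels often contain characters that are significant in Mermaid syntax (brackets, braces, pipes, quotes), which silently produce unparseable output. A defensive sketch — `sanitize_label` is a hypothetical helper, not part of Mermaid or the code above:

```python
def sanitize_label(label: str) -> str:
    """Make a free-text label safe inside Mermaid node shapes by
    stripping delimiter characters and collapsing whitespace."""
    for ch in '[]{}()|"<>':
        label = label.replace(ch, " ")
    return " ".join(label.split())
```

Applying it when formatting node shapes, e.g. `f"{node.id}[{sanitize_label(node.label)}]"`, keeps labels like "Check input [v2]" from breaking the generated diagram.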

The Diagram Agent

class DiagramUnderstandingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def analyze(self, image_bytes: bytes) -> dict:
        classification = await classify_diagram(
            image_bytes, self.client
        )
        structure = await extract_structure(
            image_bytes, classification.diagram_type, self.client
        )
        mermaid = await diagram_to_mermaid(
            structure, classification.diagram_type
        )

        return {
            "type": classification.diagram_type.value,
            "description": classification.description,
            "nodes": len(structure.nodes),
            "edges": len(structure.edges),
            "structure": structure.model_dump(),
            "mermaid_code": mermaid,
        }

    async def ask(
        self, image_bytes: bytes, question: str
    ) -> str:
        b64 = base64.b64encode(image_bytes).decode()
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            }],
        )
        return response.choices[0].message.content
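The dict returned by `analyze()` is plain data, so it is easy to persist for later rendering or diffing. A sketch that writes the Mermaid source and the structured dump side by side — the `persist_analysis` helper and its file-naming scheme are arbitrary choices, not part of the agent above:

```python
import json
from pathlib import Path


def persist_analysis(result: dict, out_dir: str, name: str) -> None:
    """Write <name>.mmd (Mermaid source) and <name>.json (the rest
    of the analysis) so the diagram can be re-rendered or diffed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{name}.mmd").write_text(result["mermaid_code"])
    payload = {k: v for k, v in result.items() if k != "mermaid_code"}
    (out / f"{name}.json").write_text(json.dumps(payload, indent=2))
```

The `.mmd` file can be rendered directly with the Mermaid CLI or any Mermaid-aware viewer, which is also the basis of the visual validation approach discussed in the FAQ below.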

FAQ

How accurate is GPT-4o at extracting diagram structures compared to dedicated diagram parsers?

For clean, well-formatted diagrams, GPT-4o extracts nodes and edges with approximately 90% accuracy. It excels at understanding context and labels but can miss precise spatial relationships in dense diagrams. Dedicated parsers like those in draw.io or Lucidchart have access to the underlying XML and achieve near-perfect accuracy on their own formats. Use vision models when you only have a screenshot or image of the diagram.

Can this agent handle hand-drawn diagrams on whiteboards?

Yes, with reduced accuracy. GPT-4o can interpret hand-drawn flowcharts and architecture sketches, identifying boxes, arrows, and labels even when the drawing is rough. For best results, ensure the whiteboard photo has good lighting, minimal glare, and the handwriting is reasonably legible. The classification step still works well because the overall layout patterns — boxes connected by arrows — are recognizable regardless of drawing quality.

How do I validate that the extracted structure is correct?

Convert the extracted structure to Mermaid or Graphviz and render it visually. Compare the rendered output against the original diagram. You can also automate validation by checking that every node has at least one edge (no orphan nodes), decision nodes have exactly two outgoing edges, and start nodes have no incoming edges. These structural constraints catch most extraction errors.
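The structural checks described above can be run directly against the extracted nodes and edges. A sketch using `(id, type)` and `(source, target)` pairs pulled from a `DiagramStructure` — the `validate_flowchart` name is an assumption:

```python
def validate_flowchart(
    nodes: list[tuple[str, str]],  # (node_id, node_type)
    edges: list[tuple[str, str]],  # (source_id, target_id)
) -> list[str]:
    """Return human-readable structural problems: orphan nodes,
    decision nodes without exactly two branches, and start nodes
    with incoming edges."""
    connected = {s for s, _ in edges} | {t for _, t in edges}
    out_degree: dict[str, int] = {}
    in_degree: dict[str, int] = {}
    for s, t in edges:
        out_degree[s] = out_degree.get(s, 0) + 1
        in_degree[t] = in_degree.get(t, 0) + 1

    problems = []
    for node_id, node_type in nodes:
        if node_id not in connected:
            problems.append(f"orphan node: {node_id}")
        if node_type == "decision" and out_degree.get(node_id, 0) != 2:
            problems.append(
                f"decision {node_id} should have 2 outgoing edges"
            )
        if node_type == "start" and in_degree.get(node_id, 0) > 0:
            problems.append(f"start node {node_id} has incoming edges")
    return problems
```

Running it on `[(n.id, n.node_type) for n in structure.nodes]` and `[(e.source_id, e.target_id) for e in structure.edges]` flags the most common extraction mistakes before you trust the structure downstream.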


#DiagramAnalysis #Flowcharts #ArchitectureDiagrams #VisualUnderstanding #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.