---
title: "Building a Form Filler Agent with GPT Vision: Understanding and Completing Web Forms"
description: "Build an AI agent that uses GPT Vision to detect form fields, understand their purpose, map values to the correct inputs, and verify successful submission — all without relying on CSS selectors."
canonical: https://callsphere.ai/blog/form-filler-agent-gpt-vision-understanding-completing-web-forms
category: "Learn Agentic AI"
tags: ["GPT Vision", "Form Automation", "Browser Agent", "Web Forms", "AI Agent"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-06T21:54:22.773Z
---

# Building a Form Filler Agent with GPT Vision: Understanding and Completing Web Forms

> Build an AI agent that uses GPT Vision to detect form fields, understand their purpose, map values to the correct inputs, and verify successful submission — all without relying on CSS selectors.

## Why Forms Are Hard for Traditional Automation

Web forms are the most common interaction point for browser automation and, paradoxically, the most fragile. Labels can be associated through `for` attributes, visual proximity, placeholder text, or floating labels that animate on focus. Dropdowns might be native `<select>` elements, custom React components, or components built on headless UI libraries. Date pickers vary wildly across sites.

GPT Vision cuts through this complexity by analyzing the form the way a human does: reading labels, understanding spatial relationships, and identifying what each field expects.

## Detecting Form Structure

The first step is capturing the form and asking GPT-4V to map out its structure.

```mermaid
flowchart LR
    GOAL(["High level goal"])
    PLAN["Planner LLM"]
    SCREEN["Screen capture<br/>every step"]
    VLM["Vision LLM<br/>reads UI state"]
    ACT{"Action type"}
    CLICK["Click coordinate"]
    TYPE["Type text"]
    KEY["Keyboard shortcut"]
    GUARD["Safety filter<br/>allow lists"]
    OS[("OS sandbox<br/>ephemeral VM")]
    DONE(["Goal verified"])
    GOAL --> PLAN --> SCREEN --> VLM --> ACT
    ACT --> CLICK --> GUARD
    ACT --> TYPE --> GUARD
    ACT --> KEY --> GUARD
    GUARD --> OS --> SCREEN
    OS --> DONE
    style PLAN fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style DONE fill:#059669,stroke:#047857,color:#fff
```

```python
from pydantic import BaseModel
from openai import OpenAI

class FormField(BaseModel):
    label: str
    field_type: str  # text, email, phone, date, dropdown, checkbox, etc.
    is_required: bool
    x_center: int
    y_center: int
    placeholder: str
    options: list[str]  # for dropdowns/radio groups
    current_value: str

class FormStructure(BaseModel):
    form_title: str
    fields: list[FormField]
    submit_button_label: str
    submit_button_x: int
    submit_button_y: int

client = OpenAI()

def detect_form(screenshot_b64: str) -> FormStructure:
    """Detect form structure from a screenshot."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a form analysis expert. The viewport is "
                    "1280x720 pixels. Identify every form field, its "
                    "label, type, whether it appears required (asterisk "
                    "or 'required' text), its center coordinates, and "
                    "any visible placeholder text or dropdown options. "
                    "Also locate the submit button."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this form."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=FormStructure,
    )
    return response.choices[0].message.parsed
```
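The `screenshot_b64` argument can come from any capture pipeline. A minimal sketch, assuming `page` is a Playwright async `Page` (the helper names `capture_screenshot` and `encode_png` are ours, not part of either library):

```python
import base64

async def capture_screenshot(page) -> str:
    """Capture the current viewport as PNG and return it base64-encoded.

    `page` is assumed to be a Playwright async Page; page.screenshot()
    defaults to the viewport, matching the 1280x720 assumption in the
    system prompt above.
    """
    png_bytes = await page.screenshot(type="png")
    return encode_png(png_bytes)

def encode_png(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as the base64 string the data URL expects."""
    return base64.b64encode(png_bytes).decode("ascii")
```

`detect_form(await capture_screenshot(page))` then yields the structure for one viewport; a form taller than the viewport needs a scroll and a second detection pass.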

## Mapping Data to Fields

Once you know the form structure, you need to map your data to the detected fields. This mapping is a text-only reasoning step, so it needs no screenshot; the same model handles it from the field labels alone.

```python
class FieldMapping(BaseModel):
    field_label: str
    value_to_enter: str
    interaction_type: str  # type, select, check, click

class FormFillingPlan(BaseModel):
    mappings: list[FieldMapping]
    unmapped_fields: list[str]  # fields with no matching data
    unused_data: list[str]  # data keys with no matching field

def plan_form_filling(
    form: FormStructure, data: dict[str, str]
) -> FormFillingPlan:
    """Map data values to form fields (text-only LLM call, no image)."""
    fields_desc = "\n".join(
        f"- {f.label} ({f.field_type})" for f in form.fields
    )
    data_desc = "\n".join(f"- {k}: {v}" for k, v in data.items())

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data-to-form mapping expert. Match "
                    "each data value to the correct form field based "
                    "on semantic understanding. For example, map "
                    "'email_address' to a field labeled 'Email' or "
                    "'E-mail Address'."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Form fields:\n{fields_desc}\n\n"
                    f"Data to enter:\n{data_desc}\n\n"
                    "Create the mapping."
                ),
            },
        ],
        response_format=FormFillingPlan,
    )
    return response.choices[0].message.parsed
```

## Executing the Form Fill

With the plan in hand, execute each field interaction sequentially.

```python
from playwright.async_api import Page
import asyncio

async def fill_form(
    page: Page, form: FormStructure, plan: FormFillingPlan
) -> None:
    """Execute the form filling plan."""
    field_lookup = {f.label.lower(): f for f in form.fields}

    for mapping in plan.mappings:
        field = field_lookup.get(mapping.field_label.lower())
        if not field:
            print(f"Warning: field '{mapping.field_label}' not found")
            continue

        if mapping.interaction_type == "type":
            # Click the field to focus it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.3)
            # Clear any existing value
            await page.keyboard.press("Control+a")
            await page.keyboard.press("Backspace")
            # Type the value
            await page.keyboard.type(mapping.value_to_enter, delay=30)

        elif mapping.interaction_type == "select":
            # Click the dropdown to open it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.5)
            # Type to filter options, then press Enter
            await page.keyboard.type(mapping.value_to_enter, delay=50)
            await asyncio.sleep(0.3)
            await page.keyboard.press("Enter")

        elif mapping.interaction_type == "check":
            await page.mouse.click(field.x_center, field.y_center)

        await asyncio.sleep(0.2)
```
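One portability caveat in the code above: `Control+a` selects all text on Windows and Linux, but macOS expects `Meta+a`. A small helper (the name is ours) picks the chord from the host platform, assuming the browser runs on the same machine as the script:

```python
import sys

def select_all_shortcut() -> str:
    """Return the platform-appropriate select-all chord for keyboard.press().

    Assumes a locally running browser; a remote browser would need the
    remote host's convention instead.
    """
    return "Meta+a" if sys.platform == "darwin" else "Control+a"
```

Substituting `await page.keyboard.press(select_all_shortcut())` for the hardcoded chord keeps the clear-before-type step working across operating systems.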

## Verifying Submission

After filling and submitting, capture a new screenshot and verify the result.

```python
class SubmissionResult(BaseModel):
    success: bool
    confirmation_message: str
    errors: list[str]

async def submit_and_verify(
    page: Page, form: FormStructure, screenshot_fn
) -> SubmissionResult:
    """Submit the form and verify the result."""
    # Click submit
    await page.mouse.click(
        form.submit_button_x, form.submit_button_y
    )
    await page.wait_for_load_state("networkidle")
    await asyncio.sleep(1)

    # Capture post-submission screenshot
    post_screenshot = await screenshot_fn(page)

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Analyze this screenshot taken after a form "
                    "submission. Determine if the submission was "
                    "successful, extract any confirmation message, "
                    "and list any validation errors shown."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Was this form submission successful?",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{post_screenshot}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=SubmissionResult,
    )
    return response.choices[0].message.parsed
```

## Handling Edge Cases

Real-world forms present several challenges. Multi-step wizard forms require detecting "Next" buttons and tracking progress across pages. CAPTCHA fields need human escalation. Auto-complete dropdowns require waiting for suggestions to load before selecting. Date pickers often need a click-then-navigate approach through month/year selectors.

Build defensive logic: after each field interaction, optionally re-capture and verify the field now shows the expected value. This catch-and-retry pattern prevents silent failures that only surface at submission time.

## FAQ

### How does the agent handle multi-step forms with "Next" buttons?

Treat each step as a separate form detection cycle. After filling visible fields, detect and click the "Next" button, wait for the new step to load, then re-analyze the screenshot for new fields. Track completed steps to avoid repeating data entry if the page reloads.

### What happens when the form has validation errors after submission?

The verification step detects error messages visually. When errors are found, the agent can re-analyze the form screenshot to identify which fields have errors, correct the values, and resubmit. Build a maximum retry count to prevent infinite loops.

### Can GPT Vision handle custom-styled form components like date pickers or color selectors?

GPT-4V recognizes most custom components visually, but interacting with them requires multi-step sequences. For a date picker, the agent might need to click the field, detect the calendar popup in a new screenshot, navigate to the correct month, and click the date. Each sub-interaction needs its own screenshot-action cycle.
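One piece of that sequence is pure arithmetic: given the month the calendar currently shows and the target month, how many times to click the next or previous arrow. A hypothetical helper (the name is ours):

```python
def month_clicks(shown: tuple[int, int], target: tuple[int, int]) -> int:
    """Arrow clicks needed to move a calendar from `shown` to `target`.

    Both arguments are (year, month) pairs; a positive result means
    click "next month" that many times, negative means "previous month".
    """
    shown_year, shown_month = shown
    target_year, target_month = target
    return (target_year - shown_year) * 12 + (target_month - shown_month)
```

After each arrow click, re-capture the popup and confirm the header actually changed before clicking again; animated calendars can swallow clicks that arrive too quickly.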

---

#FormAutomation #GPTVision #BrowserAgent #WebForms #AIFormFiller #VisualAI #AgenticAI #Python

