Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages
Learn Agentic AI | 13 min read | 15 views

Build GPT Vision agents that handle complex multi-step web workflows spanning multiple pages. Learn task decomposition, state tracking, page transition handling, and verification at each step.

Why Single-Step Vision Is Not Enough

Browsing a single page is straightforward, but real web tasks span multiple pages. Booking a flight requires searching, filtering results, selecting a flight, entering passenger details, choosing seats, and confirming payment. Each page looks different, expects different inputs, and may fail in different ways.

A multi-step vision agent needs three capabilities beyond basic screenshot analysis: task decomposition to plan ahead, state tracking to remember what it has done, and verification to confirm each step succeeded before proceeding.

Task Decomposition

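Start by having the model break a high-level task into discrete steps. Planning needs no screenshot, so the code below makes a text-only call to gpt-4o and uses a Pydantic schema to get the plan back as structured data.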

flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

from pydantic import BaseModel
from openai import OpenAI

class TaskStep(BaseModel):
    step_number: int
    description: str
    expected_page_type: str  # search, results, form, confirmation
    success_indicator: str  # what to look for to confirm step worked
    data_to_extract: list[str]  # info to capture for later steps

class TaskPlan(BaseModel):
    task_description: str
    steps: list[TaskStep]
    estimated_total_steps: int

client = OpenAI()

def decompose_task(task: str) -> TaskPlan:
    """Break a complex web task into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a web task planner. Break complex web tasks "
                    "into discrete steps. Each step should represent one "
                    "page interaction or page transition. Include what "
                    "success looks like for each step and what data needs "
                    "to be extracted for subsequent steps."
                ),
            },
            {
                "role": "user",
                "content": f"Plan the steps for this task: {task}",
            },
        ],
        response_format=TaskPlan,
    )
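    plan = response.choices[0].message.parsed
    if plan is None:
        # `parsed` is empty when the model refuses or the reply is cut off.
        raise ValueError("Model did not return a parsable task plan")
    return plan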
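
A plan produced by a model is not guaranteed to be well-formed, so it can help to sanity-check it before opening a browser. The sketch below is an assumption layered on top of the code above, not part of it: `validate_plan`, its rules, and the `max_steps` limit are hypothetical, and it assumes the model numbers steps from 1. It works on plain dicts (for example, the `steps` entry of `plan.model_dump()`), so it runs without an API call.

```python
def validate_plan(steps: list[dict], max_steps: int = 20) -> list[str]:
    """Return a list of problems found in a decomposed plan (empty if none)."""
    problems: list[str] = []
    if not steps:
        return ["plan has no steps"]
    if len(steps) > max_steps:
        problems.append(f"plan has {len(steps)} steps, limit is {max_steps}")
    for position, step in enumerate(steps, start=1):
        # Assumption: step numbers count up from 1 with no gaps or repeats.
        if step.get("step_number") != position:
            problems.append(
                f"step at position {position} is numbered "
                f"{step.get('step_number')}"
            )
        # Without a success indicator the step cannot be verified later.
        if not str(step.get("success_indicator", "")).strip():
            problems.append(f"step {position} has no success indicator")
    return problems
```

If the returned list is non-empty, one option is to feed the problems back to the planner and ask for a corrected plan rather than starting the workflow.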

State Tracking Across Pages

The agent must maintain state as it moves through pages. This includes data extracted from earlier steps, which step it is on, and any errors encountered.

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class StepStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

@dataclass
class WorkflowState:
    task: str
    plan: TaskPlan
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, StepStatus] = field(default_factory=dict)
    screenshots: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)

    @property
    def current_task_step(self) -> TaskStep | None:
        if self.current_step < len(self.plan.steps):
            return self.plan.steps[self.current_step]
        return None

    def advance(self):
        """Move to the next step."""
        self.step_statuses[self.current_step] = StepStatus.COMPLETED
        self.current_step += 1
        if self.current_step < len(self.plan.steps):
            self.step_statuses[self.current_step] = StepStatus.IN_PROGRESS

    def record_error(self, error: str):
        """Record an error for the current step."""
        self.errors.append(
            f"Step {self.current_step}: {error}"
        )
        self.step_statuses[self.current_step] = StepStatus.FAILED

    def get_context_summary(self) -> str:
        """Summarize state for the GPT-4V prompt."""
        lines = [f"Task: {self.task}"]
        lines.append(f"Current step: {self.current_step + 1} "
                      f"of {len(self.plan.steps)}")
        if self.extracted_data:
            lines.append("Extracted data so far:")
            for k, v in self.extracted_data.items():
                lines.append(f"  - {k}: {v}")
        if self.errors:
            lines.append(f"Previous errors: {self.errors[-3:]}")
        return "\n".join(lines)

The Multi-Step Execution Engine

The engine ties together planning, execution, verification, and state management.

import asyncio
import base64
from playwright.async_api import async_playwright, Page

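class ExtractedField(BaseModel):
    name: str
    value: str

class StepResult(BaseModel):
    success: bool
    action_taken: str
    # Strict structured outputs expect closed object schemas, and an
    # open-ended dict[str, str] is likely to be rejected, so the model
    # returns name/value pairs instead.
    extracted_fields: list[ExtractedField]
    error: str  # empty string when the step succeeded
    next_action: str  # what to do next: proceed, retry, escalate

    @property
    def extracted_data(self) -> dict[str, str]:
        """Extracted pairs as a dict, for merging into WorkflowState."""
        return {f.name: f.value for f in self.extracted_fields}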

class MultiStepAgent:
    def __init__(self, max_retries: int = 2):
        self.client = OpenAI()
        self.max_retries = max_retries

    async def capture(self, page: Page) -> str:
        screenshot = await page.screenshot(type="png")
        return base64.b64encode(screenshot).decode()

    async def execute_step(
        self, page: Page, state: WorkflowState
    ) -> StepResult:
        """Execute a single step with vision guidance."""
        step = state.current_task_step
        screenshot = await self.capture(page)

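        # The OpenAI client here is synchronous; run it in a worker thread
        # so the request does not block the event loop.
        response = await asyncio.to_thread(
            self.client.beta.chat.completions.parse,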
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a web automation agent executing a "
                        "multi-step workflow. Analyze the current page "
                        "and determine the action needed for this step. "
                        "The viewport is 1280x720."
                    ),
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"{state.get_context_summary()}\n\n"
                                f"Current step: {step.description}\n"
                                f"Success indicator: "
                                f"{step.success_indicator}\n"
                                f"Data to extract: "
                                f"{step.data_to_extract}\n\n"
                                "Analyze the page and report the result."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": (
                                    "data:image/png;base64,"
                                    f"{screenshot}"
                                ),
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=StepResult,
        )
        return response.choices[0].message.parsed

    async def run_workflow(self, url: str, task: str) -> WorkflowState:
        """Run a complete multi-step workflow."""
        plan = decompose_task(task)
        state = WorkflowState(task=task, plan=plan)
        state.step_statuses[0] = StepStatus.IN_PROGRESS

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page(
                viewport={"width": 1280, "height": 720}
            )
            await page.goto(url, wait_until="networkidle")

            while state.current_step < len(plan.steps):
                retries = 0
                while retries <= self.max_retries:
                    result = await self.execute_step(page, state)

                    if result.success:
                        state.extracted_data.update(
                            result.extracted_data
                        )
                        state.advance()
                        await asyncio.sleep(1)
                        break

                    retries += 1
                    if retries > self.max_retries:
                        state.record_error(result.error)
                        await browser.close()
                        return state

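                    # Mark the step as retrying, then give the page time to settle.
                    state.step_statuses[state.current_step] = (
                        StepStatus.RETRYING
                    )
                    await asyncio.sleep(2)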

            await browser.close()

        return state
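
The engine above asks the model to analyze each page and report a result, but it does not show how a chosen action reaches the browser. One possible shape for that piece is sketched below. `PageAction` and `apply_action` are hypothetical names, and the sketch assumes the model returns pixel coordinates inside the 1280x720 viewport. The calls it makes (`page.mouse.click`, `page.mouse.wheel`, `page.keyboard.type`, `page.keyboard.press`) are standard Playwright input methods.

```python
import asyncio
from dataclasses import dataclass
from typing import Any

@dataclass
class PageAction:
    """Hypothetical action format the model could be asked to return."""
    kind: str          # "click", "type", "press", or "scroll"
    x: int = 0         # click x coordinate, in viewport pixels
    y: int = 0         # click y coordinate, or scroll distance
    text: str = ""     # text to type, or key name for "press"

async def apply_action(page: Any, action: PageAction) -> None:
    """Translate one model-proposed action into Playwright input calls."""
    if action.kind == "click":
        await page.mouse.click(action.x, action.y)
    elif action.kind == "type":
        # Types into whichever element currently has focus.
        await page.keyboard.type(action.text)
    elif action.kind == "press":
        await page.keyboard.press(action.text)  # e.g. "Enter"
    elif action.kind == "scroll":
        await page.mouse.wheel(0, action.y)
    else:
        # Refuse anything outside the known action set.
        raise ValueError(f"unsupported action kind: {action.kind!r}")
```

Keeping the action set small and rejecting unknown kinds means a malformed model response fails loudly instead of doing something unexpected on the page.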

Handling Page Transitions

Page transitions are the trickiest part of multi-step workflows. After clicking a link or submitting a form, the page URL may change, content may load asynchronously, or a modal may appear instead of a navigation.

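from playwright.async_api import TimeoutError as PlaywrightTimeoutError

async def wait_for_page_change(
    page: Page, previous_url: str, timeout: int = 10000
) -> bool:
    """Wait for the URL to change; return False if it stayed the same."""
    try:
        await page.wait_for_url(
            lambda url: url != previous_url, timeout=timeout
        )
        await page.wait_for_load_state("networkidle")
        return True
    except PlaywrightTimeoutError:
        # No navigation within the timeout. The step may have opened a
        # modal or updated content in place (SPA), so pause briefly and
        # let the caller inspect the page with a fresh screenshot.
        await asyncio.sleep(1)
        return False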
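
When the URL does not change, one way to tell whether anything happened is to compare a fingerprint of the page's HTML before and after the action. The following is a minimal sketch under that assumption. `fingerprint` and `wait_for_content_change` are hypothetical helpers, and `get_html` stands in for any coroutine that returns the current markup (with Playwright, `page.content` would fit).

```python
import asyncio
import hashlib
import time
from typing import Awaitable, Callable

def fingerprint(html: str) -> str:
    """Hash the page HTML so two snapshots can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

async def wait_for_content_change(
    get_html: Callable[[], Awaitable[str]],
    previous: str,
    timeout_s: float = 10.0,
    interval_s: float = 0.25,
) -> bool:
    """Poll until the HTML fingerprint differs from `previous` or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fingerprint(await get_html()) != previous:
            return True
        await asyncio.sleep(interval_s)
    return False
```

Take the `previous` fingerprint just before the click or submit. Pages with timers, ads, or other constantly changing markup will report a change immediately, so on those pages a narrower check (for example, waiting for a specific element) is likely a better fit.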

FAQ

How do I handle workflows that require authentication?

Authenticate before starting the workflow. Use Playwright's storage_state to save and restore cookies and local storage. You can log in once manually, save the state with context.storage_state(path="auth.json"), then reuse it in subsequent runs with browser.new_context(storage_state="auth.json").
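
The save-and-restore pattern can be wrapped in a few small helpers. This is a sketch, not a fixed recipe: `context_options`, `open_context`, and `save_login` are hypothetical names, and `browser` and `context` are expected to be Playwright `Browser` and `BrowserContext` objects created elsewhere.

```python
from pathlib import Path
from typing import Any

def context_options(auth_path: str = "auth.json") -> dict[str, str]:
    """Return new_context() keyword arguments, reusing saved login state if present."""
    if Path(auth_path).is_file():
        return {"storage_state": auth_path}
    # No saved state yet: open a fresh, logged-out context.
    return {}

async def open_context(browser: Any, auth_path: str = "auth.json") -> Any:
    """Open a browser context, restoring cookies and local storage when available."""
    return await browser.new_context(**context_options(auth_path))

async def save_login(context: Any, auth_path: str = "auth.json") -> None:
    """Persist the current session after a successful login."""
    await context.storage_state(path=auth_path)
```

The saved file contains session cookies, so treat it like a credential and keep it out of version control.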

What happens when a step fails partway through a multi-step workflow?

The state tracker records exactly which step failed and why. You have three recovery options: retry the failed step, restart from a known checkpoint (e.g., after login), or escalate to a human operator with the full state and screenshots for manual completion. The extracted_data dictionary preserves everything learned in previous steps.
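
Restarting from a checkpoint requires the state to outlive the process. One way to do that, sketched here with hypothetical helpers `save_checkpoint` and `load_checkpoint`, is to write the step index, extracted data, and errors to a JSON file after each completed step.

```python
import json
from pathlib import Path
from typing import Optional

def save_checkpoint(
    path: str, current_step: int, extracted_data: dict, errors: list[str]
) -> None:
    """Write the resumable parts of the workflow state to disk."""
    snapshot = {
        "current_step": current_step,
        "extracted_data": extracted_data,
        "errors": errors,
    }
    # Write to a temp file first, then rename, so a crash mid-write is
    # less likely to leave a half-written checkpoint behind.
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(json.dumps(snapshot, indent=2))
    tmp.replace(path)

def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved snapshot, or None if no checkpoint exists."""
    p = Path(path)
    if not p.is_file():
        return None
    return json.loads(p.read_text())
```

Restoring the data is only half of a resume: the browser also has to be navigated back to the page that matches `current_step` before the loop continues.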

How do I prevent the agent from getting stuck in infinite loops?

Set hard limits at multiple levels: a maximum number of retries per step (2-3), a maximum total number of actions across the workflow (50), and a wall-clock timeout (5-10 minutes). If any limit is hit, the agent stops and returns the current state for debugging.
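
The three limits can live in one small object that the execution loop consults before every action. The sketch below uses a hypothetical `RunLimits` class; the default values mirror the numbers above and are meant to be tuned.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunLimits:
    max_retries_per_step: int = 3
    max_total_actions: int = 50
    max_seconds: float = 600.0  # 10-minute wall-clock budget
    actions: int = 0
    started: float = field(default_factory=time.monotonic)

    def record_action(self) -> None:
        """Count one action against the workflow-wide budget."""
        self.actions += 1

    def exceeded(self, retries_this_step: int = 0) -> Optional[str]:
        """Return the name of the first limit that was hit, or None."""
        if retries_this_step > self.max_retries_per_step:
            return "retries"
        if self.actions >= self.max_total_actions:
            return "actions"
        if time.monotonic() - self.started > self.max_seconds:
            return "timeout"
        return None
```

In the loop, call `record_action()` after each model-driven action and stop as soon as `exceeded(...)` returns a value, recording that value alongside the state so it is clear which limit ended the run.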

