---
title: "Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages"
description: "Build GPT Vision agents that handle complex multi-step web workflows spanning multiple pages. Learn task decomposition, state tracking, page transition handling, and verification at each step."
canonical: https://callsphere.ai/blog/multi-step-web-tasks-gpt-vision-complex-workflows-pages
category: "Learn Agentic AI"
tags: ["Multi-Step Tasks", "GPT Vision", "Web Workflows", "State Tracking", "Task Decomposition"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-19T22:33:00.053Z
---

# Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages

> Build GPT Vision agents that handle complex multi-step web workflows spanning multiple pages. Learn task decomposition, state tracking, page transition handling, and verification at each step.

## Why Single-Step Vision Is Not Enough

Browsing a single page is straightforward, but real web tasks span multiple pages. Booking a flight requires searching, filtering results, selecting a flight, entering passenger details, choosing seats, and confirming payment. Each page looks different, expects different inputs, and may fail in different ways.

A multi-step vision agent needs three capabilities beyond basic screenshot analysis: task decomposition to plan ahead, state tracking to remember what it has done, and verification to confirm each step succeeded before proceeding.

## Task Decomposition

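Start by having the vision model (the code in this article uses `gpt-4o`) break a high-level task into discrete steps, each with a success indicator and the data it should capture for later steps.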

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from pydantic import BaseModel
from openai import OpenAI

class TaskStep(BaseModel):
    step_number: int
    description: str
    expected_page_type: str  # search, results, form, confirmation
    success_indicator: str  # what to look for to confirm step worked
    data_to_extract: list[str]  # info to capture for later steps

class TaskPlan(BaseModel):
    task_description: str
    steps: list[TaskStep]
    estimated_total_steps: int

client = OpenAI()

def decompose_task(task: str) -> TaskPlan:
    """Break a complex web task into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a web task planner. Break complex web tasks "
                    "into discrete steps. Each step should represent one "
                    "page interaction or page transition. Include what "
                    "success looks like for each step and what data needs "
                    "to be extracted for subsequent steps."
                ),
            },
            {
                "role": "user",
                "content": f"Plan the steps for this task: {task}",
            },
        ],
        response_format=TaskPlan,
    )
    return response.choices[0].message.parsed
```
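
Model-generated plans can skip a step number or leave a success indicator empty, so it can help to sanity-check a plan before executing it. The sketch below is a hypothetical helper (`validate_plan_steps` is not part of the code above). It assumes the steps arrive as plain dictionaries with the same field names as `TaskStep`, for example from `plan.model_dump()["steps"]` under Pydantic v2.

```python
def validate_plan_steps(steps: list[dict]) -> list[str]:
    """Return a list of problems found in a decomposed plan (empty if none).

    Hypothetical helper: each dict is assumed to carry the TaskStep fields
    `step_number`, `description`, and `success_indicator`.
    """
    if not steps:
        return ["plan has no steps"]
    problems = []
    for position, step in enumerate(steps, start=1):
        # Steps should be numbered 1, 2, 3, ... in list order.
        if step.get("step_number") != position:
            problems.append(
                f"step at position {position} is numbered "
                f"{step.get('step_number')}"
            )
        if not str(step.get("description", "")).strip():
            problems.append(f"step {position} has no description")
        # Without a success indicator the step cannot be verified later.
        if not str(step.get("success_indicator", "")).strip():
            problems.append(f"step {position} has no success indicator")
    return problems


example_steps = [
    {"step_number": 1, "description": "Search for flights",
     "success_indicator": "Results list is visible"},
    {"step_number": 3, "description": "Select a flight",
     "success_indicator": ""},
]
for problem in validate_plan_steps(example_steps):
    print(problem)
```

If the list comes back non-empty, one reasonable response is to call `decompose_task` again rather than start a workflow that cannot be verified.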

## State Tracking Across Pages

The agent must maintain state as it moves through pages. This includes data extracted from earlier steps, which step it is on, and any errors encountered.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class StepStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

@dataclass
class WorkflowState:
    task: str
    plan: TaskPlan
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, StepStatus] = field(default_factory=dict)
    screenshots: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)

    @property
    def current_task_step(self) -> TaskStep | None:
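        if self.current_step < len(self.plan.steps):
            return self.plan.steps[self.current_step]
        return None

    def record_error(self, error: str) -> None:
        """Append an error message (assumed helper; the engine calls it)."""
        self.errors.append(error)

    def get_context_summary(self) -> str: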
        """Summarize state for the GPT-4V prompt."""
        lines = [f"Task: {self.task}"]
        lines.append(f"Current step: {self.current_step + 1} "
                      f"of {len(self.plan.steps)}")
        if self.extracted_data:
            lines.append("Extracted data so far:")
            for k, v in self.extracted_data.items():
                lines.append(f"  - {k}: {v}")
        if self.errors:
            lines.append(f"Previous errors: {self.errors[-3:]}")
        return "\n".join(lines)
```

## The Multi-Step Execution Engine

The engine ties together planning, execution, verification, and state management.

```python
import asyncio
import base64
from playwright.async_api import async_playwright, Page

class StepResult(BaseModel):
    success: bool
    action_taken: str
    extracted_data: dict[str, str]
    error: str
    next_action: str  # what to do next: proceed, retry, escalate

class MultiStepAgent:
    def __init__(self, max_retries: int = 2):
        self.client = OpenAI()
        self.max_retries = max_retries

    async def capture(self, page: Page) -> str:
        screenshot = await page.screenshot(type="png")
        return base64.b64encode(screenshot).decode()

    async def execute_step(
        self, page: Page, state: WorkflowState
    ) -> StepResult:
        """Execute a single step with vision guidance."""
        step = state.current_task_step
        screenshot = await self.capture(page)

        response = self.client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a web automation agent executing a "
                        "multi-step workflow. Analyze the current page "
                        "and determine the action needed for this step. "
                        "The viewport is 1280x720."
                    ),
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"{state.get_context_summary()}\n\n"
                                f"Current step: {step.description}\n"
                                f"Success indicator: "
                                f"{step.success_indicator}\n"
                                f"Data to extract: "
                                f"{step.data_to_extract}\n\n"
                                "Analyze the page and report the result."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": (
                                    "data:image/png;base64,"
                                    f"{screenshot}"
                                ),
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=StepResult,
        )
        return response.choices[0].message.parsed

    async def run_workflow(self, url: str, task: str) -> WorkflowState:
        """Run a complete multi-step workflow."""
        plan = decompose_task(task)
        state = WorkflowState(task=task, plan=plan)
        state.step_statuses[0] = StepStatus.IN_PROGRESS

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page(
                viewport={"width": 1280, "height": 720}
            )
            await page.goto(url, wait_until="networkidle")

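            # Assumed control flow: advance on success, otherwise retry
            # the same step until max_retries is exceeded.
            retries = 0  # failed attempts on the current step
            while state.current_step < len(plan.steps):
                result = await self.execute_step(page, state)

                if result.success:
                    # Keep extracted values for later steps, then advance.
                    idx = state.current_step
                    state.extracted_data.update(result.extracted_data)
                    state.step_statuses[idx] = StepStatus.COMPLETED
                    state.current_step += 1
                    retries = 0
                else:
                    retries += 1
                    if retries > self.max_retries: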
                        state.record_error(result.error)
                        await browser.close()
                        return state

                    await asyncio.sleep(2)

            await browser.close()

        return state
```

## Handling Page Transitions

Page transitions are the trickiest part of multi-step workflows. After clicking a link or submitting a form, the page URL may change, content may load asynchronously, or a modal may appear instead of a navigation.

```python
import asyncio
from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError

async def wait_for_page_change(
    page: Page, previous_url: str, timeout: int = 10000
) -> bool:
    """Wait for a URL change; return False if the URL stayed the same."""
    try:
        await page.wait_for_url(
            lambda url: url != previous_url, timeout=timeout
        )
        await page.wait_for_load_state("networkidle")
        return True
    except PlaywrightTimeoutError:
        # The URL did not change within the timeout (modal or in-page
        # update), so give the content a moment to settle instead.
        await asyncio.sleep(1)
        return False
```
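
When the URL stays the same (a modal or an in-page update), `wait_for_page_change` returns `False` even though the page may look different. One way to cover that case, offered here as an assumption rather than something the code above does, is to compare a hash of the screenshot bytes taken before and after the action. `screenshot_fingerprint` and `content_changed` are hypothetical names; the bytes would come from `await page.screenshot(type="png")`.

```python
import hashlib


def screenshot_fingerprint(png_bytes: bytes) -> str:
    """Hash raw screenshot bytes so two captures can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()


def content_changed(before: bytes, after: bytes) -> bool:
    """Return True when two captures differ at the byte level."""
    return screenshot_fingerprint(before) != screenshot_fingerprint(after)
```

An exact hash comparison flags any pixel difference, including animations or a blinking cursor, so a `True` result is best treated as a hint to re-run the vision check, not as proof that the step succeeded.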

## FAQ

### How do I handle workflows that require authentication?

Authenticate before starting the workflow. Use Playwright's `storage_state` to save and restore cookies and local storage. You can log in once manually, save the state with `context.storage_state(path="auth.json")`, then reuse it in subsequent runs with `browser.new_context(storage_state="auth.json")`.
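
A minimal sketch of that save-and-reuse flow follows. The helper names and the `auth.json` location are assumptions for illustration; the Playwright calls used are `context.storage_state(path=...)`, `browser.new_context(storage_state=...)`, and `page.pause()`, which opens the Playwright Inspector in headed mode so a person can log in before resuming.

```python
import json
from pathlib import Path

AUTH_FILE = Path("auth.json")  # assumed location for the saved session


def has_saved_session(path: Path = AUTH_FILE) -> bool:
    """Return True if a storage-state file exists and holds a JSON object."""
    try:
        return isinstance(json.loads(path.read_text()), dict)
    except (OSError, ValueError):
        return False


async def save_session(login_url: str, path: Path = AUTH_FILE) -> None:
    """Open a visible browser, let a human log in, then save the session."""
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(login_url)
        await page.pause()  # log in manually, then resume in the Inspector
        await context.storage_state(path=str(path))
        await browser.close()


async def open_authenticated_page(browser, path: Path = AUTH_FILE):
    """Create a page in a context that reuses the saved cookies and storage."""
    context = await browser.new_context(
        storage_state=str(path),
        viewport={"width": 1280, "height": 720},
    )
    return await context.new_page()
```

A workflow run could call `has_saved_session()` first and fall back to `save_session(...)` when it returns `False`. Saved sessions expire, so a login page showing up mid-workflow is a signal to refresh the file.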

### What happens when a step fails partway through a multi-step workflow?

The state tracker records exactly which step failed and why. You have three recovery options: retry the failed step, restart from a known checkpoint (e.g., after login), or escalate to a human operator with the full state and screenshots for manual completion. The `extracted_data` dictionary preserves everything learned in previous steps.

### How do I prevent the agent from getting stuck in infinite loops?

Set hard limits at multiple levels: a maximum number of retries per step (2-3), a maximum total number of actions across the workflow (50), and a wall-clock timeout (5-10 minutes). If any limit is hit, the agent stops and returns the current state for debugging.

---


Source: https://callsphere.ai/blog/multi-step-web-tasks-gpt-vision-complex-workflows-pages
