Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages
Build GPT Vision agents that handle complex multi-step web workflows spanning multiple pages. Learn task decomposition, state tracking, page transition handling, and verification at each step.
Why Single-Step Vision Is Not Enough
Browsing a single page is straightforward, but real web tasks span multiple pages. Booking a flight requires searching, filtering results, selecting a flight, entering passenger details, choosing seats, and confirming payment. Each page looks different, expects different inputs, and may fail in different ways.
A multi-step vision agent needs three capabilities beyond basic screenshot analysis: task decomposition to plan ahead, state tracking to remember what it has done, and verification to confirm each step succeeded before proceeding.
Task Decomposition
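Start by having the model break a high-level task into discrete steps. The planner below uses gpt-4o with a structured TaskPlan response, so each step comes back with a description, the expected page type, a success indicator, and the data to extract for later steps.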
Flowchart: user intent is parsed and classified, then passed to planning and tool selection, then to the agent loop (LLM plus tools), which is checked by guardrails and policy. On pass, the agent executes and verifies the result, then reports the outcome and next action. On fail, control returns to the agent loop. The agent loop also emits traces and metrics.
from pydantic import BaseModel
from openai import OpenAI


class TaskStep(BaseModel):
    step_number: int
    description: str
    expected_page_type: str  # search, results, form, confirmation
    success_indicator: str  # what to look for to confirm step worked
    data_to_extract: list[str]  # info to capture for later steps


class TaskPlan(BaseModel):
    task_description: str
    steps: list[TaskStep]
    estimated_total_steps: int


client = OpenAI()


def decompose_task(task: str) -> TaskPlan:
    """Break a complex web task into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a web task planner. Break complex web tasks "
                    "into discrete steps. Each step should represent one "
                    "page interaction or page transition. Include what "
                    "success looks like for each step and what data needs "
                    "to be extracted for subsequent steps."
                ),
            },
            {
                "role": "user",
                "content": f"Plan the steps for this task: {task}",
            },
        ],
        response_format=TaskPlan,
    )
    return response.choices[0].message.parsed
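A model-generated plan is worth a sanity check before the agent acts on it. The sketch below shows one possible check and is not part of the original code: validate_plan is a hypothetical helper, Step and Plan are plain dataclass stand-ins that mirror the fields of TaskStep and TaskPlan so the example runs without an API call, and the 1-based contiguous numbering and the max_steps limit of 20 are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    # Stand-in for TaskStep; only the fields the check needs.
    step_number: int
    description: str
    success_indicator: str = ""


@dataclass
class Plan:
    # Stand-in for TaskPlan.
    task_description: str
    steps: list[Step] = field(default_factory=list)
    estimated_total_steps: int = 0


def validate_plan(plan: Plan, max_steps: int = 20) -> list[str]:
    """Return a list of problems; an empty list means the plan looks usable."""
    problems: list[str] = []
    if not plan.steps:
        problems.append("plan has no steps")
    if len(plan.steps) > max_steps:
        problems.append(f"plan has {len(plan.steps)} steps, limit is {max_steps}")
    # Assumption: step numbers are 1-based and contiguous.
    for expected, step in enumerate(plan.steps, start=1):
        if step.step_number != expected:
            problems.append(f"step {expected} is numbered {step.step_number}")
        if not step.success_indicator.strip():
            problems.append(f"step {expected} has no success indicator")
    if plan.estimated_total_steps != len(plan.steps):
        problems.append("estimated_total_steps does not match len(steps)")
    return problems
```

If the returned list is non-empty, one option is to feed the problems back to the planner and ask for a corrected plan rather than starting the browser.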
State Tracking Across Pages
The agent must maintain state as it moves through pages. This includes data extracted from earlier steps, which step it is on, and any errors encountered.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class StepStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"


@dataclass
class WorkflowState:
    task: str
    plan: TaskPlan
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, StepStatus] = field(default_factory=dict)
    screenshots: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)

    @property
    def current_task_step(self) -> TaskStep | None:
        if self.current_step < len(self.plan.steps):
            return self.plan.steps[self.current_step]
        return None

    def advance(self):
        """Move to the next step."""
        self.step_statuses[self.current_step] = StepStatus.COMPLETED
        self.current_step += 1
        if self.current_step < len(self.plan.steps):
            self.step_statuses[self.current_step] = StepStatus.IN_PROGRESS

    def record_error(self, error: str):
        """Record an error for the current step."""
        self.errors.append(
            f"Step {self.current_step}: {error}"
        )
        self.step_statuses[self.current_step] = StepStatus.FAILED

    def get_context_summary(self) -> str:
        """Summarize state for the GPT-4V prompt."""
        lines = [f"Task: {self.task}"]
        lines.append(f"Current step: {self.current_step + 1} "
                     f"of {len(self.plan.steps)}")
        if self.extracted_data:
            lines.append("Extracted data so far:")
            for k, v in self.extracted_data.items():
                lines.append(f"  - {k}: {v}")
        if self.errors:
            lines.append(f"Previous errors: {self.errors[-3:]}")
        return "\n".join(lines)
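Tracked state is most useful when it can outlive the process, for example to resume a run or hand it to a person. The following is a minimal sketch of a JSON round trip, not something the original code does: MiniState and Status are stand-ins for WorkflowState and StepStatus with only JSON-friendly fields, and snapshot and restore are hypothetical helper names.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from enum import Enum


class Status(str, Enum):
    # Stand-in for StepStatus.
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class MiniState:
    # Stand-in for WorkflowState (plan and screenshots left out).
    task: str
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, Status] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)


def snapshot(state: MiniState) -> str:
    """Serialize the state to JSON text."""
    data = asdict(state)
    data["started_at"] = state.started_at.isoformat()
    # JSON object keys must be strings; enum members become their values.
    data["step_statuses"] = {
        str(i): s.value for i, s in state.step_statuses.items()
    }
    return json.dumps(data, indent=2)


def restore(text: str) -> MiniState:
    """Rebuild the state from text produced by snapshot()."""
    data = json.loads(text)
    data["started_at"] = datetime.fromisoformat(data["started_at"])
    data["step_statuses"] = {
        int(i): Status(s) for i, s in data["step_statuses"].items()
    }
    return MiniState(**data)
```

Writing the snapshot to disk after every completed step would give the run a restart point at the cost of one small file write per step.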
The Multi-Step Execution Engine
The engine ties together planning, execution, verification, and state management.
import asyncio
import base64
from playwright.async_api import async_playwright, Page


class StepResult(BaseModel):
    success: bool
    action_taken: str
    extracted_data: dict[str, str]
    error: str
    next_action: str  # what to do next: proceed, retry, escalate


class MultiStepAgent:
    def __init__(self, max_retries: int = 2):
        self.client = OpenAI()
        self.max_retries = max_retries

    async def capture(self, page: Page) -> str:
        screenshot = await page.screenshot(type="png")
        return base64.b64encode(screenshot).decode()

    async def execute_step(
        self, page: Page, state: WorkflowState
    ) -> StepResult:
        """Execute a single step with vision guidance."""
        step = state.current_task_step
        screenshot = await self.capture(page)
        response = self.client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a web automation agent executing a "
                        "multi-step workflow. Analyze the current page "
                        "and determine the action needed for this step. "
                        "The viewport is 1280x720."
                    ),
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"{state.get_context_summary()}\n\n"
                                f"Current step: {step.description}\n"
                                f"Success indicator: "
                                f"{step.success_indicator}\n"
                                f"Data to extract: "
                                f"{step.data_to_extract}\n\n"
                                "Analyze the page and report the result."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": (
                                    "data:image/png;base64,"
                                    f"{screenshot}"
                                ),
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=StepResult,
        )
        return response.choices[0].message.parsed

    async def run_workflow(self, url: str, task: str) -> WorkflowState:
        """Run a complete multi-step workflow."""
        plan = decompose_task(task)
        state = WorkflowState(task=task, plan=plan)
        state.step_statuses[0] = StepStatus.IN_PROGRESS
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page(
                viewport={"width": 1280, "height": 720}
            )
            await page.goto(url, wait_until="networkidle")
            while state.current_step < len(plan.steps):
                retries = 0
                while retries <= self.max_retries:
                    result = await self.execute_step(page, state)
                    if result.success:
                        state.extracted_data.update(
                            result.extracted_data
                        )
                        state.advance()
                        await asyncio.sleep(1)
                        break
                    retries += 1
                    if retries > self.max_retries:
                        state.record_error(result.error)
                        await browser.close()
                        return state
                    await asyncio.sleep(2)
            await browser.close()
            return state
Handling Page Transitions
Page transitions are the trickiest part of multi-step workflows. After clicking a link or submitting a form, the page URL may change, content may load asynchronously, or a modal may appear instead of a navigation.
async def wait_for_page_change(
    page: Page, previous_url: str, timeout: int = 10000
) -> bool:
    """Wait for a page transition or significant content change."""
    try:
        await page.wait_for_url(
            lambda url: url != previous_url, timeout=timeout
        )
        await page.wait_for_load_state("networkidle")
        return True
    except Exception:
        # URL might not change (modal, SPA navigation)
        await asyncio.sleep(1)
        return False
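When the URL stays the same, as with a modal or an in-place SPA update, one way to detect a transition is to compare the page HTML before and after the action. The sketch below illustrates that idea under stated assumptions: fingerprint and wait_for_content_change are hypothetical helpers, and PageLike names only the part of Playwright's async Page that is used here (the url attribute and the async content() method), so the code can be exercised with a fake page.

```python
import asyncio
import hashlib
from typing import Protocol


class PageLike(Protocol):
    # The subset of Playwright's async Page used by this sketch.
    url: str

    async def content(self) -> str: ...


def fingerprint(html: str) -> str:
    """Hash the page HTML so two snapshots can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


async def wait_for_content_change(
    page: PageLike,
    previous_fingerprint: str,
    timeout_s: float = 10.0,
    poll_s: float = 0.25,
) -> bool:
    """Poll until the page HTML differs from the earlier snapshot."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if fingerprint(await page.content()) != previous_fingerprint:
            return True
        await asyncio.sleep(poll_s)
    return False
```

Take the fingerprint just before the click or submit, then call wait_for_content_change afterwards. Pages that mutate the DOM on their own (clocks, carousels, ad slots) will register as changed immediately, so this check is a weak signal that is best combined with the step's success indicator.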
FAQ
How do I handle workflows that require authentication?
Authenticate before starting the workflow. Use Playwright's storage_state to save and restore cookies and local storage. You can log in once manually, save the state with context.storage_state(path="auth.json"), then reuse it in subsequent runs with browser.new_context(storage_state="auth.json").
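The save-once, reuse-later flow described above might look like the following sketch. The helper names save_login and context_kwargs are hypothetical, and waiting for the Enter key in the terminal is just one assumed way to signal that the manual login is finished. Only storage_state on the context and on new_context come from the answer above.

```python
import asyncio
from pathlib import Path

AUTH_FILE = Path("auth.json")  # file name taken from the answer above


def context_kwargs(auth_file: Path = AUTH_FILE) -> dict:
    """Keyword arguments for browser.new_context(); empty if no saved login."""
    if auth_file.exists():
        return {"storage_state": str(auth_file)}
    return {}


async def save_login(login_url: str, auth_file: Path = AUTH_FILE) -> None:
    """Open a visible browser, wait for a manual login, then save the state."""
    # Imported here so context_kwargs() works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(login_url)
        # Runs input() in a worker thread so the event loop stays responsive.
        await asyncio.to_thread(input, "Log in, then press Enter here: ")
        await context.storage_state(path=str(auth_file))
        await browser.close()
```

A later run can then create its context with browser.new_context(**context_kwargs()), which falls back to a fresh, logged-out context when no auth file exists.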
What happens when a step fails partway through a multi-step workflow?
The state tracker records exactly which step failed and why. You have three recovery options: retry the failed step, restart from a known checkpoint (e.g., after login), or escalate to a human operator with the full state and screenshots for manual completion. The extracted_data dictionary preserves everything learned in previous steps.
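The three recovery options can be expressed as a small decision function. This is a sketch of one possible policy, not the article's implementation: choose_recovery, the Recovery enum, the checkpoints list of step indexes that are safe to re-enter, and the one-restart default are all assumptions made for illustration.

```python
from enum import Enum


class Recovery(str, Enum):
    RETRY = "retry"
    RESTART_FROM_CHECKPOINT = "restart_from_checkpoint"
    ESCALATE = "escalate"


def choose_recovery(
    failed_step: int,
    retries_used: int,
    max_retries: int,
    checkpoints: list[int],
    restarts_used: int = 0,
    max_restarts: int = 1,
) -> tuple[Recovery, int]:
    """Pick a recovery option and the step index to resume from."""
    # Cheapest option first: retry the same step while retries remain.
    if retries_used < max_retries:
        return Recovery.RETRY, failed_step
    # Otherwise fall back to the latest checkpoint at or before the failure.
    earlier = [c for c in checkpoints if c <= failed_step]
    if earlier and restarts_used < max_restarts:
        return Recovery.RESTART_FROM_CHECKPOINT, max(earlier)
    # Nothing left to try automatically: hand over to a person.
    return Recovery.ESCALATE, failed_step
```

For example, choose_recovery(3, 2, 2, [1, 2, 5]) selects a restart from step index 2, because the retries are exhausted and 2 is the latest checkpoint that is not past the failed step.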
How do I prevent the agent from getting stuck in infinite loops?
Set hard limits at multiple levels: a maximum number of retries per step (2-3), a maximum total number of actions across the workflow (50), and a wall-clock timeout (5-10 minutes). If any limit is hit, the agent stops and returns the current state for debugging.
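The workflow-level limits can live in one small object that the agent consults before every action. A minimal sketch, assuming the 50-action and 10-minute figures from the answer above. WorkflowBudget and BudgetExceeded are hypothetical names, and the per-step retry limit is left to the existing max_retries setting.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when any workflow-level limit is hit."""


@dataclass
class WorkflowBudget:
    max_actions: int = 50       # total actions across the workflow
    max_seconds: float = 600.0  # wall-clock limit (10 minutes)
    actions_used: int = 0
    started: float = field(default_factory=time.monotonic)

    def spend(self) -> None:
        """Call once before every action; raises when a limit is exceeded."""
        elapsed = time.monotonic() - self.started
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"timeout after {elapsed:.0f}s")
        if self.actions_used >= self.max_actions:
            raise BudgetExceeded(f"action limit {self.max_actions} reached")
        self.actions_used += 1
```

Calling budget.spend() at the top of the retry loop and catching BudgetExceeded around the whole run lets the agent stop cleanly and return its current state. time.monotonic is used so that system clock adjustments cannot extend or shorten the timeout.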