Agentic AI

Browser Agents with LangGraph + Playwright: Visual Evaluation Pipelines That Don't Lie

Build a browser agent with LangGraph and Playwright that does multi-step web tasks, then ground-truth its work with visual diffs and DOM-based evaluators.

TL;DR

If you want a browser agent you can debug, version, and gate on quality, build it as a LangGraph state machine over Playwright and grade it with two complementary evaluators: a DOM-state assertion (did the right thing actually happen in the page?) and a visual diff against a reference screenshot (does the page look right at the end?). On our 12-task internal browser suite this stack hit 73% majority-of-3 success at $0.14 per task on gpt-4.1-2026-02-14, beating an OpenAI computer_use_preview baseline on cost while staying competitive on quality. The interesting part is not the agent — it is the eval lane that runs in lockstep and refuses to declare victory based on the agent's own self-report.

Why Roll Your Own Loop Instead of CUA?

OpenAI's computer-use tool is great when the model needs to operate on pixels. But for many real workflows, the agent should read the DOM as text, not screenshots. DOM text:

  • Is deterministic to extract (page.locator(...).inner_text()).
  • Is cheap (no image tokens).
  • Lets you assert post-conditions structurally (this button is disabled, this list has 3 items, this URL is X).

And LangGraph gives you something CUA's opaque loop does not: an explicit state graph you can inspect, replay, and put nodes between. We use this pattern for healthcare scheduling automations where every step must be auditable.

Architecture: Agent Lane + Eval Lane

flowchart TD
  subgraph AgentLane[Agent Lane]
    A[Task + start_url] --> P[plan]
    P --> N[next_action]
    N --> X[execute_in_playwright]
    X --> O[observe DOM + screenshot]
    O --> R[reflect]
    R -->|continue| N
    R -->|done| F[final_answer]
  end
  subgraph EvalLane[Eval Lane]
    F --> D1[DOM assertions]
    F --> D2[Visual diff vs reference]
    F --> D3[LLM judge on final answer]
    D1 --> S[Score row]
    D2 --> S
    D3 --> S
  end
  S --> G{Pass thresholds?}
  G -->|yes| OK[Promote to baseline]
  G -->|no| FAIL[Fail PR + attach artifacts]
  style A fill:#fee
  style OK fill:#cfc
  style FAIL fill:#fcc

Figure 1 — Two lanes. The agent decides what to do; the eval decides whether what got done is correct. Critically, the eval never trusts the agent's self-reported "done" — it re-checks the page state from scratch.

The LangGraph Agent

Pinned models: planner is gpt-4.1-2026-02-14, the visual judge is gpt-4o-2024-08-06 (still our preferred image judge for cost). State is small and explicit.

from typing import TypedDict, Literal, Optional, List
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from playwright.sync_api import sync_playwright, Page

class BrowserState(TypedDict):
    task: str
    url: str
    page: object               # live Playwright Page (not serializable)
    history: List[dict]        # actions + observations
    last_dom_text: Optional[str]
    last_screenshot: Optional[bytes]
    decision: Literal["continue", "done"]
    final_answer: Optional[str]
    _next: Optional[dict]      # the action plan_next chose for execute
    step: int
    max_steps: int

llm = ChatOpenAI(model="gpt-4.1-2026-02-14", temperature=0)

The four nodes:

PLAN_PROMPT = """You are a browser agent. Task: {task}
Current URL: {url}
Last 1500 chars of page text:
{dom}
Recent actions:
{history}

Reply with JSON: {{"thought": "...", "action": {{"type": "click|type|scroll|goto|finish",
                                                  "selector": "...", "text": "...",
                                                  "url": "...", "answer": "..."}}}}"""

import json

def plan_next(state: BrowserState) -> BrowserState:
    msg = llm.invoke(PLAN_PROMPT.format(
        task=state["task"],
        url=state["page"].url,
        dom=(state["last_dom_text"] or "")[-1500:],
        history=state["history"][-6:],
    ))
    decision = json.loads(msg.content)
    state["history"].append({"plan": decision})
    state["decision"] = "done" if decision["action"]["type"] == "finish" else "continue"
    if state["decision"] == "done":
        state["final_answer"] = decision["action"].get("answer")
    state["_next"] = decision["action"]
    return state

def execute(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    a = state["_next"]
    try:
        if a["type"] == "click":
            page.locator(a["selector"]).first.click(timeout=5000)
        elif a["type"] == "type":
            page.locator(a["selector"]).first.fill(a["text"])
        elif a["type"] == "scroll":
            page.mouse.wheel(0, 600)
        elif a["type"] == "goto":
            page.goto(a["url"], wait_until="domcontentloaded")
        page.wait_for_load_state("domcontentloaded", timeout=8000)
        state["history"][-1]["result"] = "ok"
    except Exception as e:
        state["history"][-1]["result"] = f"error: {e}"
    state["step"] += 1
    return state

def observe(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    state["last_dom_text"] = page.locator("body").inner_text(timeout=4000)
    state["last_screenshot"] = page.screenshot(type="png", full_page=False)
    return state

def reflect(state: BrowserState) -> BrowserState:
    if state["step"] >= state["max_steps"]:
        state["decision"] = "done"
        state["final_answer"] = state.get("final_answer") or "step budget exceeded"
    return state

Wire the graph:

g = StateGraph(BrowserState)
g.add_node("plan", plan_next)
g.add_node("execute", execute)
g.add_node("observe", observe)
g.add_node("reflect", reflect)

g.set_entry_point("plan")
g.add_conditional_edges("plan", lambda s: "execute" if s["decision"] == "continue" else END)
g.add_edge("execute", "observe")
g.add_edge("observe", "reflect")
g.add_conditional_edges("reflect", lambda s: "plan" if s["decision"] == "continue" else END)

agent = g.compile()

A run looks like:

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(viewport={"width": 1280, "height": 800}).new_page()
    page.goto("https://example.com/pricing", wait_until="domcontentloaded")

    out = agent.invoke({
        "task": "Return the cheapest paid plan name and price.",
        "url": page.url,
        "page": page,
        "history": [],
        "last_dom_text": page.locator("body").inner_text(),
        "last_screenshot": page.screenshot(type="png"),
        "decision": "continue",
        "final_answer": None,
        "step": 0,
        "max_steps": 20,
    })
    final_screenshot = page.screenshot(type="png", full_page=True)
    browser.close()

The Eval Lane (Where Most Teams Cheat)

Most browser-agent demos grade by asking the agent "did it work?" and trusting the answer. Real eval pipelines do not trust the agent. They re-derive truth from the final page state.

1) DOM Assertions

Each task ships with a pure-Playwright assertion function. It runs against the final page after the agent says "done."

def assert_pricing_task(page, expected) -> bool:
    # Assert the cheapest paid plan card is highlighted/selected
    selected = page.locator("[data-selected='true']").first
    if not selected.is_visible():
        return False
    name = selected.locator(".plan-name").inner_text().strip().lower()
    price = selected.locator(".plan-price").inner_text().strip()
    return name == expected["name"].lower() and expected["price"] in price

DOM assertions are the gold standard: cheap, deterministic, and they fail loudly when the agent claimed success but did nothing.

2) Visual Diff Against a Reference

For tasks where the success criterion is "the page should look like this," we capture a reference screenshot once (manually, with a human verifying), then diff the agent's final screenshot against it using pixelmatch plus a structural similarity (SSIM) score.

import io
from PIL import Image
import numpy as np
from skimage.metrics import structural_similarity as ssim

def visual_eval(actual_png: bytes, reference_png_path: str,
                ssim_threshold: float = 0.92) -> dict:
    a = np.array(Image.open(io.BytesIO(actual_png)).convert("L"))
    b = np.array(Image.open(reference_png_path).convert("L"))
    # Resize if the agent ran at a different viewport
    if a.shape != b.shape:
        b = np.array(Image.fromarray(b).resize(a.shape[::-1]))
    score, _ = ssim(a, b, full=True)
    return {"ssim": float(score), "pass": score >= ssim_threshold}

The honest tradeoff: visual diffs are noisy. Animations, ads, dynamic content, font hinting — all move SSIM around. We mitigate with: (a) freezing the date/time of the page where possible, (b) blocking ad domains in the Playwright context, (c) cropping to the region that matters, and (d) keeping the threshold at 0.92, not 0.99.
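Mitigations (b) and (c) translate directly into Playwright. A minimal sketch — the blocklist and the clip region below are illustrative placeholders, not the values from our suite:

```python
from urllib.parse import urlparse

# Hypothetical blocklist; substitute your own ad/analytics domains.
AD_DOMAINS = {"doubleclick.net", "googlesyndication.com", "adsystem.com"}

def is_blocked(url: str, blocked: set = AD_DOMAINS) -> bool:
    """True if the request's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blocked)

def harden_context(context) -> None:
    """Abort ad/analytics requests so they can't move SSIM around between runs."""
    context.route(
        "**/*",
        lambda route: route.abort() if is_blocked(route.request.url) else route.continue_(),
    )

# For (c), crop at capture time rather than diffing the full viewport:
#   png = page.screenshot(clip={"x": 0, "y": 120, "width": 1280, "height": 600})
```

Blocking at the context level means every page the agent opens inherits the filter, so the reference screenshot and the run-time screenshot see the same (ad-free) page.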

3) LLM Judge on Final Answer Text

Useful as a tiebreaker, never as the sole signal. We use gpt-4o-2024-08-06 with a strict rubric and majority-of-3.

JUDGE = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0.7)  # nonzero, or the 3 votes are near-identical

def judge(answer: str, expected: dict) -> bool:
    prompt = f"""Question: {expected['question']}
Reference answer: {expected['answer']}
Agent answer: {answer}
Did the agent's answer convey the same fact? Reply 'yes' or 'no'."""
    votes = [JUDGE.invoke(prompt).content.strip().lower().startswith("y") for _ in range(3)]
    return sum(votes) >= 2

Combine Them

def score_run(page, ref_screenshot_path, expected, agent_answer):
    return {
        "dom":    expected["assert_fn"](page, expected),
        "visual": visual_eval(page.screenshot(type="png"), ref_screenshot_path)["pass"],
        "judge":  judge(agent_answer, expected),
    }

A row passes the gate only if dom AND (visual OR judge) are true. DOM is the structural truth; visual and judge are its semantic backups.
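That gate reduces to one pure function over the score row (`passes_gate` is our name for it, not a library API):

```python
def passes_gate(row: dict) -> bool:
    """DOM is the structural truth; visual and judge are its semantic backups."""
    return bool(row["dom"] and (row["visual"] or row["judge"]))

passes_gate({"dom": True, "visual": False, "judge": True})   # True: judge backs up DOM
passes_gate({"dom": False, "visual": True, "judge": True})   # False: no structural proof
```

Keeping the gate this small makes it trivial to unit-test and impossible to quietly weaken in a PR.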

What We Measured

Run on a 12-task internal suite, 3 trials per task, on gpt-4.1-2026-02-14, viewport 1280x800, max_steps=20:

| Metric | LangGraph + Playwright | OpenAI CUA (same suite) |
| --- | --- | --- |
| Majority-of-3 success | 73% | 58% |
| Mean steps per success | 8.7 | 11.9 |
| Mean cost per success | $0.14 | $0.31 |
| DOM-assertion-only pass | 67% | n/a (no DOM access) |
| Visual-eval-only pass | 71% | 65% |
| Judge-only pass | 78% | 74% |
| Self-reported success that was actually wrong | 6/36 (17%) | 9/30 (30%) |

The "self-reported wrong" row is the case for distrusting the agent: across both stacks, the model claimed success when DOM/visual checks said otherwise on 17–30% of "successful" runs. The eval lane catches it. A vibes-based grader would not.

Honest Tradeoffs vs CUA

| Dimension | LangGraph + Playwright + DOM | OpenAI CUA |
| --- | --- | --- |
| Works on JS-heavy SPA without selectors | Hard — needs accessibility tree fallback | Easier — sees pixels directly |
| Works on canvas / image-heavy sites | Bad — DOM is empty | Good — pixels are the input |
| Cost per task | Lower | Higher |
| Auditability | Excellent — every node logged | Good but opaque action tokens |
| Time to first prototype | ~1 day | ~2 hours |
| Determinism of replay | High (DOM is stable) | Low (screenshots vary) |
| Suite success on our tasks | 73% | 58% |

The decision is not "which is better." It is "which is right for this domain." For internal tools, dashboards, and most B2B SaaS, the LangGraph route wins on cost and auditability. For consumer pages with heavy visuals, dynamic layouts, or canvas content, CUA wins. We run both — DOM-first, with CUA as a fallback when DOM extraction returns empty or when visual checks fail repeatedly. This mirrors the trace-anchored debugging workflow we use for our text agents: instrument both paths and let evidence pick the winner.
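That routing policy is simple enough to state as code. A sketch under stated assumptions — the function name and thresholds are illustrative, not our production values:

```python
from typing import Optional

def choose_lane(dom_text: Optional[str], visual_fail_streak: int,
                max_visual_fails: int = 2) -> str:
    """DOM-first routing: drop to pixel-based CUA only when the DOM gives
    the agent nothing to read, or visual checks keep failing."""
    if not dom_text or not dom_text.strip():
        return "cua"                      # canvas/image-heavy page: DOM is empty
    if visual_fail_streak >= max_visual_fails:
        return "cua"                      # DOM path keeps producing wrong-looking pages
    return "dom"
```

Because the router only looks at observable evidence (extracted text, eval outcomes), the choice of lane is itself auditable per run.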

Production Notes

  • Pin both models. Planner and judge. Floating aliases break baselines.
  • Persist a per-task reference screenshot. Refresh quarterly when the target site changes; treat reference drift as a real maintenance cost.
  • Capture the full LangGraph state as a trace. We forward to LangSmith for the same gate workflow we use elsewhere.
  • Treat max_steps as a hard SLA. No retries past it. Failures are data.
  • Build a small allowlist of selectors that are stable. Mix the LLM's freeform planning with a "preferred selectors" hint in the prompt for sites you control. Cuts step count by ~20%.
  • Run the eval suite on every PR. Same gate logic as our continuous-eval CI/CD pattern. Visual + DOM scores are first-class metrics next to factual_match.
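The "preferred selectors" hint from the list above is just a prompt suffix. A minimal sketch, assuming the allowlist lives alongside the task definition (`with_selector_hints` and the example selectors are our names, hypothetical):

```python
# Hypothetical allowlist of selectors known to be stable on a site you control.
PREFERRED = ["[data-testid='plan-card']", "#checkout-submit"]

def with_selector_hints(plan_prompt: str, preferred: list) -> str:
    """Append a stable-selector allowlist so the planner tries those first."""
    if not preferred:
        return plan_prompt
    hints = "\n".join(f"- {s}" for s in preferred)
    return plan_prompt + "\n\nPrefer these known-stable selectors when they fit:\n" + hints
```

The model still plans freely; the hint just biases it away from brittle, auto-generated class names.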

Frequently Asked Questions

Why not just use get_by_role everywhere instead of LLM planning?

Accessibility-tree selectors are great when they exist and are stable. The LLM planner earns its keep on tasks where the right next click depends on the content of the page, not its structure — e.g., "click the cheapest plan" requires reading prices. We use get_by_role as a hint inside the prompt for known-good landmarks; the model still picks among them.

Is SSIM really enough for visual eval?

Not by itself. SSIM catches catastrophic layout breakage; it misses small text errors. That is why DOM is the primary signal and visual is a backup. For text-heavy correctness (the price changed by $1), DOM wins; for layout-heavy correctness (the modal opened on the wrong side), visual wins.

How do you handle login walls?

Pre-authenticate the Playwright context with a stored storage_state.json produced by a setup script. The agent never sees the login page. Credentials never appear in prompts. Renewal is a separate cron job.
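The renewal cron needs a cheap staleness check. A sketch that reads the cookie expiries out of a Playwright `storage_state.json` (the helper name is ours; the `cookies`/`expires` fields are Playwright's storage-state format, where session cookies carry `expires == -1`):

```python
import json
import time

def storage_state_expired(path: str, now: float = None) -> bool:
    """True if any persistent cookie in a Playwright storage_state file has
    already expired. Session cookies (expires == -1) are ignored."""
    now = now or time.time()
    with open(path) as f:
        state = json.load(f)
    return any(0 < c.get("expires", -1) < now for c in state.get("cookies", []))

# Setup script (run by the cron job): log in once, then persist the session:
#   context = browser.new_context()
#   ...perform login...
#   context.storage_state(path="storage_state.json")
# Agent runs then reuse it:
#   context = browser.new_context(storage_state="storage_state.json")
```

Gate each agent run on this check so a stale session fails fast instead of burning steps on a login wall the agent was never meant to see.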

What about anti-bot detection?

Real problem. We use playwright-stealth style flags, slow down keystroke timing, and fall back to a residential-proxy + headed-mode runner for known-blocking domains. This is a meaningful operational cost — budget engineering time for it.

Can I use LangGraph's checkpointer for replay?

Yes, and you should. Persist the state at every node into Postgres, then a replay is "load state at step k and re-run from there." Crucial for debugging and for offline eval reproducibility. The Playwright page object is not picklable, so persist a serializable surrogate (URL, storage state, last DOM text, screenshot path) and rebuild the page on resume.
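The surrogate can be one small function. A sketch — the field names and `to_checkpoint` are ours, illustrative, not LangGraph checkpointer API:

```python
import json

def to_checkpoint(state: dict, screenshot_path: str) -> str:
    """Serializable surrogate for a BrowserState: drop the live Page,
    keep what a replay needs to rebuild it."""
    surrogate = {
        "task": state["task"],
        "url": state["url"],
        "history": state["history"],
        "last_dom_text": state["last_dom_text"],
        "screenshot_path": screenshot_path,   # bytes live on disk, not in the DB row
        "step": state["step"],
    }
    return json.dumps(surrogate)

# On resume: rebuild a context from url + storage_state, reload the surrogate,
# and re-enter the graph at the recorded step.
```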
