---
title: "Browser Agents with LangGraph + Playwright: Visual Evaluation Pipelines That Don't Lie"
description: "Build a browser agent with LangGraph and Playwright that does multi-step web tasks, then ground-truth its work with visual diffs and DOM-based evaluators."
canonical: https://callsphere.ai/blog/browser-agent-langgraph-playwright-visual-eval-2026
category: "Agentic AI"
tags: ["LangGraph", "Browser Agents", "Playwright", "Agent Evaluation", "Visual Testing", "Production AI", "OpenAI"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.647Z
---

# Browser Agents with LangGraph + Playwright: Visual Evaluation Pipelines That Don't Lie

> Build a browser agent with LangGraph and Playwright that does multi-step web tasks, then ground-truth its work with visual diffs and DOM-based evaluators.

## TL;DR

If you want a browser agent you can debug, version, and gate on quality, build it as a [LangGraph](https://langchain-ai.github.io/langgraph/) state machine over Playwright and grade it with two complementary evaluators: a **DOM-state assertion** (did the right thing actually happen in the page?) and a **visual diff** against a reference screenshot (does the page *look* right at the end?). On our 12-task internal browser suite this stack hit 73% majority-of-3 success at $0.14 per successful task on `gpt-4.1-2026-02-14`, beating an OpenAI [`computer_use_preview`](/blog/openai-computer-use-agents-cua-eval-2026) baseline on cost while staying competitive on quality. The interesting part is not the agent — it is the eval lane that runs in lockstep and refuses to declare victory based on the agent's own self-report.

## Why Roll Your Own Loop Instead of CUA?

OpenAI's computer-use tool is great when the model needs to operate on pixels. But for a lot of real workflows, the agent should be reading the DOM as text, not screenshots. DOM-text:

- Is **deterministic** to extract (`page.locator(...).inner_text()`).
- Is **cheap** (no image tokens).
- Lets you **assert** post-conditions structurally (this button is disabled, this list has 3 items, this URL is X).
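
A minimal sketch of what those structural post-conditions look like with Playwright's `expect` assertions (the selectors and URL here are hypothetical):

```python
from playwright.sync_api import expect

def check_postconditions(page) -> None:
    # Structural assertions against the live page; no model in the loop.
    expect(page.get_by_role("button", name="Submit")).to_be_disabled()   # button state
    expect(page.locator("ul.results > li")).to_have_count(3)             # list length
    expect(page).to_have_url("https://example.com/pricing")              # final URL
```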

And LangGraph gives you something CUA's opaque loop does not: an explicit state graph you can inspect, replay, and insert nodes into. We use this pattern for [healthcare scheduling automations](/industries), where every step must be auditable.

## Architecture: Agent Lane + Eval Lane

```mermaid
flowchart TD
  subgraph AgentLane[Agent Lane]
    A[Task + start_url] --> P[plan]
    P --> N[next_action]
    N --> X[execute_in_playwright]
    X --> O[observe DOM + screenshot]
    O --> R[reflect]
    R -->|continue| N
    R -->|done| F[final_answer]
  end
  subgraph EvalLane[Eval Lane]
    F --> D1[DOM assertions]
    F --> D2[Visual diff vs reference]
    F --> D3[LLM judge on final answer]
    D1 --> S[Score row]
    D2 --> S
    D3 --> S
  end
  S --> G{Pass thresholds?}
  G -->|yes| OK[Promote to baseline]
  G -->|no| FAIL[Fail PR + attach artifacts]
  style A fill:#fee
  style OK fill:#cfc
  style FAIL fill:#fcc
```

*Figure 1 — Two lanes. The agent decides what to do; the eval decides whether what got done is correct. Critically, the eval never trusts the agent's self-reported "done" — it re-checks the page state from scratch.*

## The LangGraph Agent

Pinned models: the planner is `gpt-4.1-2026-02-14`; the judge is `gpt-4o-2024-08-06` (still our preferred image-capable judge for cost). State is small and explicit.

```python
from typing import TypedDict, Literal, Optional, List
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from playwright.sync_api import sync_playwright, Page

class BrowserState(TypedDict):
    task: str
    url: str
    page: object               # live Playwright Page
    history: List[dict]        # actions + observations
    last_dom_text: Optional[str]
    last_screenshot: Optional[bytes]
    decision: Literal["continue", "done"]
    final_answer: Optional[str]
    step: int
    max_steps: int
    _next: Optional[dict]       # pending action, set by plan_next and consumed by execute

llm = ChatOpenAI(model="gpt-4.1-2026-02-14", temperature=0)
```

The four nodes:

```python
PLAN_PROMPT = """You are a browser agent. Task: {task}
Current URL: {url}
Last 1500 chars of page text:
{dom}
Recent actions:
{history}

Reply with JSON: {{"thought": "...", "action": {{"type": "click|type|scroll|goto|finish",
                                                  "selector": "...", "text": "...",
                                                  "url": "...", "answer": "..."}}}}"""

def plan_next(state: BrowserState) -> BrowserState:
    msg = llm.invoke(PLAN_PROMPT.format(
        task=state["task"],
        url=state["page"].url,
        dom=(state["last_dom_text"] or "")[-1500:],
        history=state["history"][-6:],
    ))
    import json
    decision = json.loads(msg.content)
    state["history"].append({"plan": decision})
    state["decision"] = "done" if decision["action"]["type"] == "finish" else "continue"
    if state["decision"] == "done":
        state["final_answer"] = decision["action"].get("answer")
    state["_next"] = decision["action"]
    return state

def execute(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    a = state["_next"]
    try:
        if a["type"] == "click":
            page.locator(a["selector"]).first.click(timeout=5000)
        elif a["type"] == "type":
            page.locator(a["selector"]).first.fill(a["text"])
        elif a["type"] == "scroll":
            page.mouse.wheel(0, 600)
        elif a["type"] == "goto":
            page.goto(a["url"], wait_until="domcontentloaded")
        page.wait_for_load_state("domcontentloaded", timeout=8000)
        state["history"][-1]["result"] = "ok"
    except Exception as e:
        state["history"][-1]["result"] = f"error: {e}"
    state["step"] += 1
    return state

def observe(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    state["last_dom_text"] = page.locator("body").inner_text(timeout=4000)
    state["last_screenshot"] = page.screenshot(type="png", full_page=False)
    return state

def reflect(state: BrowserState) -> BrowserState:
    if state["step"] >= state["max_steps"]:
        state["decision"] = "done"
        state["final_answer"] = state.get("final_answer") or "step budget exceeded"
    return state
```

Wire the graph:

```python
g = StateGraph(BrowserState)
g.add_node("plan", plan_next)
g.add_node("execute", execute)
g.add_node("observe", observe)
g.add_node("reflect", reflect)

g.set_entry_point("plan")
g.add_conditional_edges("plan", lambda s: "execute" if s["decision"] == "continue" else END)
g.add_edge("execute", "observe")
g.add_edge("observe", "reflect")
g.add_conditional_edges("reflect", lambda s: "plan" if s["decision"] == "continue" else END)

agent = g.compile()
```

A run looks like:

```python
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(viewport={"width": 1280, "height": 800}).new_page()
    page.goto("https://example.com/pricing", wait_until="domcontentloaded")

    out = agent.invoke({
        "task": "Return the cheapest paid plan name and price.",
        "url": page.url,
        "page": page,
        "history": [],
        "last_dom_text": page.locator("body").inner_text(),
        "last_screenshot": page.screenshot(type="png"),
        "decision": "continue",
        "final_answer": None,
        "step": 0,
        "max_steps": 20,
    })
    final_screenshot = page.screenshot(type="png", full_page=True)
    browser.close()
```

## The Eval Lane (Where Most Teams Cheat)

Most browser-agent demos grade by asking the agent "did it work?" and trusting the answer. Real eval pipelines do not trust the agent. They re-derive truth from the final page state.

### 1) DOM Assertions

Each task ships with a pure-Playwright assertion function. It runs against the *final page* after the agent says "done."

```python
def assert_pricing_task(page, expected) -> bool:
    # Assert the cheapest paid plan card is highlighted/selected
    selected = page.locator("[data-selected='true']").first
    if not selected.is_visible():
        return False
    name = selected.locator(".plan-name").inner_text().strip().lower()
    price = selected.locator(".plan-price").inner_text().strip()
    return name == expected["name"].lower() and expected["price"] in price
```

DOM assertions are the gold standard: cheap, deterministic, and they fail loudly when the agent claimed success but did nothing.

### 2) Visual Diff Against a Reference

For tasks where the success criterion is "the page should look like this," we capture a reference screenshot once (manually, with a human verifying), then diff the agent's final screenshot against it using `pixelmatch` plus a structural similarity (SSIM) score.

```python
import io
from PIL import Image
import numpy as np
from skimage.metrics import structural_similarity as ssim

def visual_eval(actual_png: bytes, reference_png_path: str,
                ssim_threshold: float = 0.92) -> dict:
    a = np.array(Image.open(io.BytesIO(actual_png)).convert("L"))
    b = np.array(Image.open(reference_png_path).convert("L"))
    # Resize if the agent ran at a different viewport
    if a.shape != b.shape:
        b = np.array(Image.fromarray(b).resize(a.shape[::-1]))
    score, _ = ssim(a, b, full=True)
    return {"ssim": float(score), "pass": score >= ssim_threshold}
```

The honest tradeoff: visual diffs are noisy. Animations, ads, dynamic content, font hinting — all move SSIM around. We mitigate with: (a) freezing the date/time of the page where possible, (b) blocking ad domains in the Playwright context, (c) cropping to the region that matters, and (d) keeping the threshold at 0.92, not 0.99.
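
A sketch of mitigations (b) and (c), assuming a hypothetical `AD_HOSTS` blocklist and a fixed crop region for the task:

```python
AD_HOSTS = ("doubleclick.net", "googlesyndication.com")  # hypothetical blocklist

def hardened_context(browser):
    # (b) Block ad/analytics requests so they cannot shift the layout between runs.
    ctx = browser.new_context(viewport={"width": 1280, "height": 800})
    ctx.route("**/*", lambda route: route.abort()
              if any(h in route.request.url for h in AD_HOSTS) else route.continue_())
    return ctx

def cropped_screenshot(page) -> bytes:
    # (c) Diff only the region that matters instead of the whole viewport.
    return page.screenshot(clip={"x": 0, "y": 0, "width": 1280, "height": 600})
```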

### 3) LLM Judge on Final Answer Text

Useful as a tiebreaker, never as the sole signal. We use `gpt-4o-2024-08-06` with a strict rubric and majority-of-3.

```python
JUDGE = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0)
def judge(answer: str, expected: dict) -> bool:
    prompt = f"""Question: {expected['question']}
Reference answer: {expected['answer']}
Agent answer: {answer}
Did the agent's answer convey the same fact? Reply 'yes' or 'no'."""
    votes = [JUDGE.invoke(prompt).content.strip().lower().startswith("y") for _ in range(3)]
    return sum(votes) >= 2
```

### Combine Them

```python
def score_run(page, ref_screenshot_path, expected, agent_answer):
    return {
        "dom":    expected["assert_fn"](page, expected),
        "visual": visual_eval(page.screenshot(type="png"), ref_screenshot_path)["pass"],
        "judge":  judge(agent_answer, expected),
    }
```

A row passes the gate only if **dom AND (visual OR judge)** are true. DOM is the structural truth; visual and judge are its semantic backups.
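
As a sketch, the gate over the `score_run` output is just:

```python
def passes_gate(row: dict) -> bool:
    # DOM is the structural truth; visual and judge are its semantic backups.
    return row["dom"] and (row["visual"] or row["judge"])

# Usage: passes_gate(score_run(page, ref_path, expected, out["final_answer"]))
```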

## What We Measured

Run on a 12-task internal suite, 3 trials per task, on `gpt-4.1-2026-02-14`, viewport `1280x800`, `max_steps=20`:

| Metric | LangGraph + Playwright | OpenAI CUA (same suite) |
| --- | --- | --- |
| Majority-of-3 success | 73% | 58% |
| Mean steps per success | 8.7 | 11.9 |
| Mean cost per success | $0.14 | $0.31 |
| DOM-assertion-only pass | 67% | n/a (no DOM access) |
| Visual-eval-only pass | 71% | 65% |
| Judge-only pass | 78% | 74% |
| Self-reported success that was actually wrong | 6/36 (17%) | 9/30 (30%) |

The "self-reported wrong" row is the case for distrusting the agent: across both stacks, the model claimed success when DOM/visual checks said otherwise on 17–30% of "successful" runs. The eval lane catches it. A vibes-based grader would not.

## Honest Tradeoffs vs CUA

| Dimension | LangGraph + Playwright + DOM | OpenAI CUA |
| --- | --- | --- |
| Works on JS-heavy SPA without selectors | Hard — needs accessibility tree fallback | Easier — sees pixels directly |
| Works on canvas / image-heavy sites | Bad — DOM is empty | Good — pixels are the input |
| Cost per task | Lower | Higher |
| Auditability | Excellent — every node logged | Good but opaque action tokens |
| Time to first prototype | ~1 day | ~2 hours |
| Determinism of replay | High (DOM is stable) | Low (screenshots vary) |
| Suite success on our tasks | 73% | 58% |

The decision is not "which is better." It is "which is right for this domain." For internal tools, dashboards, and most B2B SaaS, the LangGraph route wins on cost and auditability. For consumer pages with heavy visuals, dynamic layouts, or canvas content, CUA wins. We run both — DOM-first, with CUA as a fallback when DOM extraction returns empty or when visual checks fail repeatedly. This mirrors the [trace-anchored debugging workflow](/blog/trace-to-production-fix-agent-observability-workflow) we use for our text agents: instrument both paths and let evidence pick the winner.
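
A minimal sketch of that routing policy; `run_dom_agent` and `run_cua_agent` are hypothetical wrappers around the LangGraph agent above and a computer-use loop, and the failure counter is an assumption about what those wrappers return:

```python
def run_task(task: str, page, run_dom_agent, run_cua_agent,
             max_visual_failures: int = 2) -> dict:
    # DOM-first: if the page exposes no readable text (canvas / image-heavy), go straight to CUA.
    if not page.locator("body").inner_text().strip():
        return run_cua_agent(task, page)
    result = run_dom_agent(task, page)
    # Fall back when visual checks keep failing despite the DOM agent reporting success.
    if result.get("visual_failures", 0) >= max_visual_failures:
        return run_cua_agent(task, page)
    return result
```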

## Production Notes

- **Pin both models.** Planner and judge. Floating aliases break baselines.
- **Persist a per-task reference screenshot.** Refresh quarterly when the target site changes; treat reference drift as a real maintenance cost.
- **Capture the full LangGraph state as a trace.** We forward to LangSmith for the same gate workflow we use elsewhere.
- **Treat `max_steps` as a hard SLA.** No retries past it. Failures are data.
- **Build a small allowlist of selectors that are stable.** Mix the LLM's freeform planning with a "preferred selectors" hint in the prompt for sites you control (a sketch follows this list). Cuts step count by ~20%.
- **Run the eval suite on every PR.** Same gate logic as our [continuous-eval CI/CD pattern](/blog/continuous-evaluation-langsmith-cicd-agent-releases). Visual + DOM scores are first-class metrics next to factual_match.
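
A minimal sketch of the "preferred selectors" hint, with hypothetical selector names; the planner still chooses freely among them:

```python
from urllib.parse import urlparse

# Hypothetical per-site allowlist of stable selectors for sites we control.
PREFERRED_SELECTORS = {
    "example.com": [
        "[data-testid='plan-card']",
        "button[data-action='select-plan']",
    ],
}

def selector_hint(url: str) -> str:
    hints = PREFERRED_SELECTORS.get(urlparse(url).hostname or "", [])
    if not hints:
        return ""
    return "Prefer these selectors when they match your intent:\n" + "\n".join(hints)

# plan_next would append selector_hint(state["page"].url) to PLAN_PROMPT before invoking the LLM.
```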

## Frequently Asked Questions

### Why not just use `get_by_role` everywhere instead of LLM planning?

Accessibility-tree selectors are great when they exist and are stable. The LLM planner earns its keep on tasks where the right next click depends on the *content* of the page, not its structure — e.g., "click the cheapest plan" requires reading prices. We use `get_by_role` as a hint inside the prompt for known-good landmarks; the model still picks among them.

### Is SSIM really enough for visual eval?

Not by itself. SSIM catches catastrophic layout breakage; it misses small text errors. That is why DOM is the primary signal and visual is a backup. For text-heavy correctness (the price changed by $1), DOM wins; for layout-heavy correctness (the modal opened on the wrong side), visual wins.

### How do you handle login walls?

Pre-authenticate the Playwright context with a stored `storage_state.json` produced by a setup script. The agent never sees the login page. Credentials never appear in prompts. Renewal is a separate cron job.
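
A sketch of that setup script, assuming hypothetical `LOGIN_URL`, `APP_USER`, and `APP_PASSWORD` environment variables and form selectors:

```python
import os
from playwright.sync_api import sync_playwright

def refresh_storage_state(path: str = "storage_state.json") -> None:
    # Runs on a cron: log in once, persist cookies + localStorage, never expose creds to the agent.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(os.environ["LOGIN_URL"])
        page.fill("#email", os.environ["APP_USER"])
        page.fill("#password", os.environ["APP_PASSWORD"])
        page.click("button[type=submit]")
        page.wait_for_load_state("networkidle")
        page.context.storage_state(path=path)
        browser.close()

# Agent side: browser.new_context(storage_state="storage_state.json"); the login page never appears.
```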

### What about anti-bot detection?

Real problem. We use `playwright-stealth` style flags, slow down keystroke timing, and fall back to a residential-proxy + headed-mode runner for known-blocking domains. This is a meaningful operational cost — budget engineering time for it.

### Can I use LangGraph's checkpointer for replay?

Yes, and you should. Persist the state at every node into Postgres, then a replay is "load state at step k and re-run from there." Crucial for debugging and for offline eval reproducibility. The Playwright `page` object is not picklable, so persist a serializable surrogate (URL, storage state, last DOM text, screenshot path) and rebuild the page on resume.
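
A sketch of the serializable surrogate and the rebuild step; the field names here are ours, not a LangGraph or Playwright API:

```python
def snapshot_for_checkpoint(state: BrowserState, shot_path: str) -> dict:
    # Serializable stand-in for the live Page: enough to rebuild the browser on resume.
    page = state["page"]
    with open(shot_path, "wb") as f:
        f.write(state["last_screenshot"] or b"")
    return {
        "url": page.url,
        "storage_state": page.context.storage_state(),  # cookies + localStorage, as a plain dict
        "last_dom_text": state["last_dom_text"],
        "screenshot_path": shot_path,
        "step": state["step"],
    }

def rebuild_page(browser, snap: dict):
    # On resume: recreate the context from the stored auth state and navigate back to where we were.
    ctx = browser.new_context(storage_state=snap["storage_state"])
    page = ctx.new_page()
    page.goto(snap["url"], wait_until="domcontentloaded")
    return page
```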

---

Source: https://callsphere.ai/blog/browser-agent-langgraph-playwright-visual-eval-2026
