Agentic AI

Browser Agents with LangGraph + Playwright: Visual Evaluation Pipelines That Don't Lie

Build a browser agent with LangGraph and Playwright that does multi-step web tasks, then ground-truth its work with visual diffs and DOM-based evaluators.

TL;DR

If you want a browser agent you can debug, version, and gate on quality, build it as a LangGraph state machine over Playwright and grade it with two complementary evaluators: a DOM-state assertion (did the right thing actually happen in the page?) and a visual diff against a reference screenshot (does the page look right at the end?). On our 12-task internal browser suite this stack hit 73% majority-of-3 success at $0.14 per task on gpt-4.1-2026-02-14, beating an OpenAI computer_use_preview baseline on cost while staying competitive on quality. The interesting part is not the agent — it is the eval lane that runs in lockstep and refuses to declare victory based on the agent's own self-report.

Why Roll Your Own Loop Instead of CUA?

OpenAI's computer-use tool is great when the model needs to operate on pixels. But for many real workflows, the agent should read the DOM as text, not screenshots. DOM text:

  • Is deterministic to extract (page.locator(...).inner_text()).
  • Is cheap (no image tokens).
  • Lets you assert post-conditions structurally (this button is disabled, this list has 3 items, this URL is X).

And LangGraph gives you something CUA's opaque loop does not: an explicit state graph you can inspect, replay, and put nodes between. We use this pattern for healthcare scheduling automations where every step must be auditable.

Architecture: Agent Lane + Eval Lane

flowchart TD
  subgraph AgentLane[Agent Lane]
    A[Task + start_url] --> P[plan]
    P --> N[next_action]
    N --> X[execute_in_playwright]
    X --> O[observe DOM + screenshot]
    O --> R[reflect]
    R -->|continue| N
    R -->|done| F[final_answer]
  end
  subgraph EvalLane[Eval Lane]
    F --> D1[DOM assertions]
    F --> D2[Visual diff vs reference]
    F --> D3[LLM judge on final answer]
    D1 --> S[Score row]
    D2 --> S
    D3 --> S
  end
  S --> G{Pass thresholds?}
  G -->|yes| OK[Promote to baseline]
  G -->|no| FAIL[Fail PR + attach artifacts]
  style A fill:#fee
  style OK fill:#cfc
  style FAIL fill:#fcc

Figure 1 — Two lanes. The agent decides what to do; the eval decides whether what got done is correct. Critically, the eval never trusts the agent's self-reported "done" — it re-checks the page state from scratch.

The LangGraph Agent

Pinned models: planner is gpt-4.1-2026-02-14, the visual judge is gpt-4o-2024-08-06 (still our preferred image judge for cost). State is small and explicit.

from typing import TypedDict, Literal, Optional, List
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from playwright.sync_api import sync_playwright, Page

class BrowserState(TypedDict):
    task: str
    url: str
    page: object               # live Playwright Page (not serializable)
    history: List[dict]        # actions + observations
    last_dom_text: Optional[str]
    last_screenshot: Optional[bytes]
    decision: Literal["continue", "done"]
    final_answer: Optional[str]
    _next: Optional[dict]      # the action plan_next chose for execute
    step: int
    max_steps: int

llm = ChatOpenAI(model="gpt-4.1-2026-02-14", temperature=0)

The four nodes:

PLAN_PROMPT = """You are a browser agent. Task: {task}
Current URL: {url}
Last 1500 chars of page text:
{dom}
Recent actions:
{history}

Reply with JSON: {{"thought": "...", "action": {{"type": "click|type|scroll|goto|finish",
                                                  "selector": "...", "text": "...",
                                                  "url": "...", "answer": "..."}}}}"""

import json

def plan_next(state: BrowserState) -> BrowserState:
    msg = llm.invoke(PLAN_PROMPT.format(
        task=state["task"],
        url=state["page"].url,
        dom=(state["last_dom_text"] or "")[-1500:],
        history=state["history"][-6:],
    ))
    decision = json.loads(msg.content)
    state["history"].append({"plan": decision})
    state["decision"] = "done" if decision["action"]["type"] == "finish" else "continue"
    if state["decision"] == "done":
        state["final_answer"] = decision["action"].get("answer")
    state["_next"] = decision["action"]
    return state

def execute(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    a = state["_next"]
    try:
        if a["type"] == "click":
            page.locator(a["selector"]).first.click(timeout=5000)
        elif a["type"] == "type":
            page.locator(a["selector"]).first.fill(a["text"])
        elif a["type"] == "scroll":
            page.mouse.wheel(0, 600)
        elif a["type"] == "goto":
            page.goto(a["url"], wait_until="domcontentloaded")
        page.wait_for_load_state("domcontentloaded", timeout=8000)
        state["history"][-1]["result"] = "ok"
    except Exception as e:
        state["history"][-1]["result"] = f"error: {e}"
    state["step"] += 1
    return state

def observe(state: BrowserState) -> BrowserState:
    page: Page = state["page"]
    state["last_dom_text"] = page.locator("body").inner_text(timeout=4000)
    state["last_screenshot"] = page.screenshot(type="png", full_page=False)
    return state

def reflect(state: BrowserState) -> BrowserState:
    if state["step"] >= state["max_steps"]:
        state["decision"] = "done"
        state["final_answer"] = state.get("final_answer") or "step budget exceeded"
    return state

Wire the graph:

g = StateGraph(BrowserState)
g.add_node("plan", plan_next)
g.add_node("execute", execute)
g.add_node("observe", observe)
g.add_node("reflect", reflect)

g.set_entry_point("plan")
g.add_conditional_edges("plan", lambda s: "execute" if s["decision"] == "continue" else END)
g.add_edge("execute", "observe")
g.add_edge("observe", "reflect")
g.add_conditional_edges("reflect", lambda s: "plan" if s["decision"] == "continue" else END)

agent = g.compile()

A run looks like:

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(viewport={"width": 1280, "height": 800}).new_page()
    page.goto("https://example.com/pricing", wait_until="domcontentloaded")

    out = agent.invoke({
        "task": "Return the cheapest paid plan name and price.",
        "url": page.url,
        "page": page,
        "history": [],
        "last_dom_text": page.locator("body").inner_text(),
        "last_screenshot": page.screenshot(type="png"),
        "decision": "continue",
        "final_answer": None,
        "step": 0,
        "max_steps": 20,
    })
    final_screenshot = page.screenshot(type="png", full_page=True)
    browser.close()

The Eval Lane (Where Most Teams Cheat)

Most browser-agent demos grade by asking the agent "did it work?" and trusting the answer. Real eval pipelines do not trust the agent. They re-derive truth from the final page state.

1) DOM Assertions

Each task ships with a pure-Playwright assertion function. It runs against the final page after the agent says "done."

def assert_pricing_task(page, expected) -> bool:
    # Assert the cheapest paid plan card is highlighted/selected
    selected = page.locator("[data-selected='true']").first
    if not selected.is_visible():
        return False
    name = selected.locator(".plan-name").inner_text().strip().lower()
    price = selected.locator(".plan-price").inner_text().strip()
    return name == expected["name"].lower() and expected["price"] in price

DOM assertions are the gold standard: cheap, deterministic, and they fail loudly when the agent claimed success but did nothing.

2) Visual Diff Against a Reference

For tasks where the success criterion is "the page should look like this," we capture a reference screenshot once (manually, with a human verifying), then diff the agent's final screenshot against it using pixelmatch plus a structural similarity (SSIM) score.

import io
from PIL import Image
import numpy as np
from skimage.metrics import structural_similarity as ssim

def visual_eval(actual_png: bytes, reference_png_path: str,
                ssim_threshold: float = 0.92) -> dict:
    a = np.array(Image.open(io.BytesIO(actual_png)).convert("L"))
    b = np.array(Image.open(reference_png_path).convert("L"))
    # Resize if the agent ran at a different viewport
    if a.shape != b.shape:
        b = np.array(Image.fromarray(b).resize(a.shape[::-1]))
    score, _ = ssim(a, b, full=True)
    return {"ssim": float(score), "pass": score >= ssim_threshold}

The honest tradeoff: visual diffs are noisy. Animations, ads, dynamic content, font hinting — all move SSIM around. We mitigate with: (a) freezing the date/time of the page where possible, (b) blocking ad domains in the Playwright context, (c) cropping to the region that matters, and (d) keeping the threshold at 0.92, not 0.99.
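Mitigations (b) and (c) translate directly into Playwright. A minimal sketch — the blocklist and the clip region below are illustrative placeholders, not the values from our suite:

```python
from urllib.parse import urlparse

# Hypothetical blocklist; substitute your own ad/analytics domains.
AD_DOMAINS = {"doubleclick.net", "googlesyndication.com", "adsystem.com"}

def is_blocked(url: str, blocked: set = AD_DOMAINS) -> bool:
    """True if the request's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blocked)

def harden_context(context) -> None:
    """Abort ad/analytics requests so they can't move SSIM around between runs."""
    context.route(
        "**/*",
        lambda route: route.abort() if is_blocked(route.request.url) else route.continue_(),
    )

# For (c), crop at capture time rather than diffing the full viewport:
#   png = page.screenshot(clip={"x": 0, "y": 120, "width": 1280, "height": 600})
```

Blocking at the context level means every page the agent opens inherits the filter, so the reference screenshot and the run-time screenshot see the same (ad-free) page.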

3) LLM Judge on Final Answer Text

Useful as a tiebreaker, never as the sole signal. We use gpt-4o-2024-08-06 with a strict rubric and majority-of-3.

JUDGE = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0.7)  # nonzero, or the 3 votes are near-identical

def judge(answer: str, expected: dict) -> bool:
    prompt = f"""Question: {expected['question']}
Reference answer: {expected['answer']}
Agent answer: {answer}
Did the agent's answer convey the same fact? Reply 'yes' or 'no'."""
    votes = [JUDGE.invoke(prompt).content.strip().lower().startswith("y") for _ in range(3)]
    return sum(votes) >= 2

Combine Them

def score_run(page, ref_screenshot_path, expected, agent_answer):
    return {
        "dom":    expected["assert_fn"](page, expected),
        "visual": visual_eval(page.screenshot(type="png"), ref_screenshot_path)["pass"],
        "judge":  judge(agent_answer, expected),
    }

A row passes the gate only if dom AND (visual OR judge) are true. DOM is the structural truth; visual and judge are its semantic backups.
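That gate reduces to one pure function over the score row (`passes_gate` is our name for it, not a library API):

```python
def passes_gate(row: dict) -> bool:
    """DOM is the structural truth; visual and judge are its semantic backups."""
    return bool(row["dom"] and (row["visual"] or row["judge"]))

passes_gate({"dom": True, "visual": False, "judge": True})   # True: judge backs up DOM
passes_gate({"dom": False, "visual": True, "judge": True})   # False: no structural proof
```

Keeping the gate this small makes it trivial to unit-test and impossible to quietly weaken in a PR.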

What We Measured

Run on a 12-task internal suite, 3 trials per task, on gpt-4.1-2026-02-14, viewport 1280x800, max_steps=20:

| Metric | LangGraph + Playwright | OpenAI CUA (same suite) |
| --- | --- | --- |
| Majority-of-3 success | 73% | 58% |
| Mean steps per success | 8.7 | 11.9 |
| Mean cost per success | $0.14 | $0.31 |
| DOM-assertion-only pass | 67% | n/a (no DOM access) |
| Visual-eval-only pass | 71% | 65% |
| Judge-only pass | 78% | 74% |
| Self-reported success that was actually wrong | 6/36 (17%) | 9/30 (30%) |

The "self-reported wrong" row is the case for distrusting the agent: across both stacks, the model claimed success when DOM/visual checks said otherwise on 17–30% of "successful" runs. The eval lane catches it. A vibes-based grader would not.

Honest Tradeoffs vs CUA

| Dimension | LangGraph + Playwright + DOM | OpenAI CUA |
| --- | --- | --- |
| Works on JS-heavy SPA without selectors | Hard — needs accessibility tree fallback | Easier — sees pixels directly |
| Works on canvas / image-heavy sites | Bad — DOM is empty | Good — pixels are the input |
| Cost per task | Lower | Higher |
| Auditability | Excellent — every node logged | Good but opaque action tokens |
| Time to first prototype | ~1 day | ~2 hours |
| Determinism of replay | High (DOM is stable) | Low (screenshots vary) |
| Suite success on our tasks | 73% | 58% |

The decision is not "which is better." It is "which is right for this domain." For internal tools, dashboards, and most B2B SaaS, the LangGraph route wins on cost and auditability. For consumer pages with heavy visuals, dynamic layouts, or canvas content, CUA wins. We run both — DOM-first, with CUA as a fallback when DOM extraction returns empty or when visual checks fail repeatedly. This mirrors the trace-anchored debugging workflow we use for our text agents: instrument both paths and let evidence pick the winner.
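That routing policy is simple enough to state as code. A sketch under stated assumptions — the function name and thresholds are illustrative, not our production values:

```python
from typing import Optional

def choose_lane(dom_text: Optional[str], visual_fail_streak: int,
                max_visual_fails: int = 2) -> str:
    """DOM-first routing: drop to pixel-based CUA only when the DOM gives
    the agent nothing to read, or visual checks keep failing."""
    if not dom_text or not dom_text.strip():
        return "cua"                      # canvas/image-heavy page: DOM is empty
    if visual_fail_streak >= max_visual_fails:
        return "cua"                      # DOM path keeps producing wrong-looking pages
    return "dom"
```

Because the router only looks at observable evidence (extracted text, eval outcomes), the choice of lane is itself auditable per run.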

Production Notes

  • Pin both models. Planner and judge. Floating aliases break baselines.
  • Persist a per-task reference screenshot. Refresh quarterly when the target site changes; treat reference drift as a real maintenance cost.
  • Capture the full LangGraph state as a trace. We forward to LangSmith for the same gate workflow we use elsewhere.
  • Treat max_steps as a hard SLA. No retries past it. Failures are data.
  • Build a small allowlist of selectors that are stable. Mix the LLM's freeform planning with a "preferred selectors" hint in the prompt for sites you control. Cuts step count by ~20%.
  • Run the eval suite on every PR. Same gate logic as our continuous-eval CI/CD pattern. Visual + DOM scores are first-class metrics next to factual_match.
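The "preferred selectors" hint from the list above is just a prompt suffix. A minimal sketch, assuming the allowlist lives alongside the task definition (`with_selector_hints` and the example selectors are our names, hypothetical):

```python
# Hypothetical allowlist of selectors known to be stable on a site you control.
PREFERRED = ["[data-testid='plan-card']", "#checkout-submit"]

def with_selector_hints(plan_prompt: str, preferred: list) -> str:
    """Append a stable-selector allowlist so the planner tries those first."""
    if not preferred:
        return plan_prompt
    hints = "\n".join(f"- {s}" for s in preferred)
    return plan_prompt + "\n\nPrefer these known-stable selectors when they fit:\n" + hints
```

The model still plans freely; the hint just biases it away from brittle, auto-generated class names.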

Frequently Asked Questions

Why not just use get_by_role everywhere instead of LLM planning?

Accessibility-tree selectors are great when they exist and are stable. The LLM planner earns its keep on tasks where the right next click depends on the content of the page, not its structure — e.g., "click the cheapest plan" requires reading prices. We use get_by_role as a hint inside the prompt for known-good landmarks; the model still picks among them.

Is SSIM really enough for visual eval?

Not by itself. SSIM catches catastrophic layout breakage; it misses small text errors. That is why DOM is the primary signal and visual is a backup. For text-heavy correctness (the price changed by $1), DOM wins; for layout-heavy correctness (the modal opened on the wrong side), visual wins.

How do you handle login walls?

Pre-authenticate the Playwright context with a stored storage_state.json produced by a setup script. The agent never sees the login page. Credentials never appear in prompts. Renewal is a separate cron job.
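The renewal cron needs a cheap staleness check. A sketch that reads the cookie expiries out of a Playwright `storage_state.json` (the helper name is ours; the `cookies`/`expires` fields are Playwright's storage-state format, where session cookies carry `expires == -1`):

```python
import json
import time

def storage_state_expired(path: str, now: float = None) -> bool:
    """True if any persistent cookie in a Playwright storage_state file has
    already expired. Session cookies (expires == -1) are ignored."""
    now = now or time.time()
    with open(path) as f:
        state = json.load(f)
    return any(0 < c.get("expires", -1) < now for c in state.get("cookies", []))

# Setup script (run by the cron job): log in once, then persist the session:
#   context = browser.new_context()
#   ...perform login...
#   context.storage_state(path="storage_state.json")
# Agent runs then reuse it:
#   context = browser.new_context(storage_state="storage_state.json")
```

Gate each agent run on this check so a stale session fails fast instead of burning steps on a login wall the agent was never meant to see.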

What about anti-bot detection?

Real problem. We use playwright-stealth style flags, slow down keystroke timing, and fall back to a residential-proxy + headed-mode runner for known-blocking domains. This is a meaningful operational cost — budget engineering time for it.

Can I use LangGraph's checkpointer for replay?

Yes, and you should. Persist the state at every node into Postgres, then a replay is "load state at step k and re-run from there." Crucial for debugging and for offline eval reproducibility. The Playwright page object is not picklable, so persist a serializable surrogate (URL, storage state, last DOM text, screenshot path) and rebuild the page on resume.
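The surrogate can be one small function. A sketch — the field names and `to_checkpoint` are ours, illustrative, not LangGraph checkpointer API:

```python
import json

def to_checkpoint(state: dict, screenshot_path: str) -> str:
    """Serializable surrogate for a BrowserState: drop the live Page,
    keep what a replay needs to rebuild it."""
    surrogate = {
        "task": state["task"],
        "url": state["url"],
        "history": state["history"],
        "last_dom_text": state["last_dom_text"],
        "screenshot_path": screenshot_path,   # bytes live on disk, not in the DB row
        "step": state["step"],
    }
    return json.dumps(surrogate)

# On resume: rebuild a context from url + storage_state, reload the surrogate,
# and re-enter the graph at the recorded step.
```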
