Learn Agentic AI

Building an AI Testing Agent: Automated QA That Explores and Finds Bugs

Build an AI-powered testing agent that performs exploratory testing, automatically generates test cases, classifies discovered bugs, and produces structured reports for development teams.

Beyond Scripted Test Suites

Traditional automated testing follows scripts: visit this URL, click this button, assert that element appears. This approach catches regressions but never discovers new bugs because it only tests paths that a human already thought to check. AI testing agents flip this model. They explore the application like a curious tester, trying unexpected inputs, clicking buttons in unusual orders, and flagging behavior that looks wrong.

The difference is profound. A scripted test suite with 500 tests will always run the same 500 paths. An AI testing agent generates novel test paths on every run, covering UI states and interaction sequences that no human thought to script.

Architecture of an AI Testing Agent

An AI testing agent consists of four components: an explorer that navigates the application, a test case generator that produces structured test scenarios, a bug classifier that determines whether observed behavior is actually a defect, and a report generator that produces actionable output. The data model below captures the shared vocabulary: actions taken, bugs found, and accumulated exploration state.

from dataclasses import dataclass, field
from enum import Enum
from datetime import datetime
from typing import Optional

class BugSeverity(Enum):
    CRITICAL = "critical"  # App crashes, data loss
    HIGH = "high"          # Feature broken, no workaround
    MEDIUM = "medium"      # Feature broken, workaround exists
    LOW = "low"            # Cosmetic, minor usability

@dataclass
class TestAction:
    action_type: str  # click, fill, navigate, scroll
    target: str
    value: Optional[str] = None
    screenshot_before: Optional[str] = None
    screenshot_after: Optional[str] = None

@dataclass
class BugReport:
    title: str
    severity: BugSeverity
    description: str
    steps_to_reproduce: list[TestAction]
    expected_behavior: str
    actual_behavior: str
    screenshot_path: Optional[str] = None
    url: str = ""
    discovered_at: datetime = field(
        default_factory=datetime.utcnow
    )

@dataclass
class ExplorationState:
    visited_urls: set[str] = field(default_factory=set)
    clicked_elements: set[str] = field(default_factory=set)
    forms_submitted: int = 0
    bugs_found: list[BugReport] = field(default_factory=list)
    action_history: list[TestAction] = field(default_factory=list)
    error_count: int = 0

The Exploration Engine

The explorer navigates the application systematically, prioritizing unvisited pages and untested interaction patterns. It uses an LLM to decide what to do next based on the current page state and exploration history.

from playwright.async_api import async_playwright, Page
from openai import AsyncOpenAI
import json

class ExplorationEngine:
    def __init__(self, client: AsyncOpenAI, base_url: str):
        self.client = client
        self.base_url = base_url
        self.state = ExplorationState()

    async def explore(self, max_steps: int = 100):
        """Main exploration loop."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()

            # Collect console errors and unhandled page exceptions
            console_errors = []

            def _on_console(msg):
                if msg.type == "error":
                    console_errors.append(msg.text)

            page.on("console", _on_console)
            page.on(
                "pageerror",
                lambda err: console_errors.append(str(err)),
            )

            await page.goto(self.base_url)
            self.state.visited_urls.add(self.base_url)

            for step in range(max_steps):
                try:
                    action = await self._decide_next_action(page)
                    await self._execute_action(page, action)

                    # Record coverage so the final report reflects it
                    self.state.action_history.append(action)
                    self.state.visited_urls.add(page.url)
                    if action.action_type == "click":
                        self.state.clicked_elements.add(action.target)

                    # Check for bugs after each action
                    bugs = await self._check_for_bugs(
                        page, action, console_errors
                    )
                    self.state.bugs_found.extend(bugs)
                    console_errors.clear()

                except Exception as e:
                    self.state.error_count += 1
                    if self.state.error_count > 10:
                        break

            await browser.close()

        return self.state

    async def _decide_next_action(self, page: Page) -> TestAction:
        """Use LLM to decide the next exploration action."""
        # Get interactive elements
        elements = await page.evaluate("""
            () => {
                const els = document.querySelectorAll(
                    'a, button, input, select, textarea, '
                    + '[onclick], [role="button"]'
                );
                return Array.from(els).slice(0, 50).map(el => ({
                    tag: el.tagName,
                    text: el.textContent?.trim().slice(0, 50),
                    type: el.type || '',
                    href: el.href || '',
                    id: el.id,
                    name: el.name,
                    selector: el.id ? '#' + el.id
                        : el.name ? '[name="' + el.name + '"]'
                        : el.tagName.toLowerCase(),
                }));
            }
        """)

        visited_summary = (
            f"Visited {len(self.state.visited_urls)} pages, "
            f"clicked {len(self.state.clicked_elements)} elements, "
            f"found {len(self.state.bugs_found)} bugs so far."
        )

        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "You are a QA tester exploring a web app to find "
                    "bugs. Choose the next action to maximize test "
                    "coverage. Prioritize untested elements and "
                    "edge cases. Return JSON: action_type, target "
                    "(selector), value (for inputs)."
                )},
                {"role": "user", "content": (
                    f"Current URL: {page.url}\n"
                    f"Page title: {await page.title()}\n"
                    f"Progress: {visited_summary}\n"
                    f"Available elements:\n"
                    f"{json.dumps(elements[:30], indent=2)}"
                )},
            ],
            response_format={"type": "json_object"},
            temperature=0.7,  # Some randomness for exploration
        )

        data = json.loads(response.choices[0].message.content)
        return TestAction(
            action_type=data.get("action_type", "click"),
            target=data.get("target", ""),
            value=data.get("value"),
        )

Bug Detection and Classification

After each action, the agent checks for bugs by analyzing the page state. It looks for HTTP errors, JavaScript console errors, visual anomalies, broken layouts, and unexpected behavior. The engine's _check_for_bugs hook delegates to the BugDetector class below.


class BugDetector:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def check_for_bugs(self, page: Page,
                              action: TestAction,
                              console_errors: list[str]) -> list[BugReport]:
        """Analyze current page state for potential bugs."""
        bugs = []

        # Check 1: HTTP error pages
        status_check = await self._check_http_status(page)
        if status_check:
            bugs.append(status_check)

        # Check 2: Console errors
        for error in console_errors:
            if self._is_significant_error(error):
                bugs.append(BugReport(
                    title=f"JavaScript error: {error[:80]}",
                    severity=BugSeverity.HIGH,
                    description=f"Console error after action: {error}",
                    steps_to_reproduce=[action],
                    expected_behavior="No JavaScript errors",
                    actual_behavior=f"Console error: {error}",
                    url=page.url,
                ))

        # Check 3: Visual/functional bugs via LLM
        screenshot = await page.screenshot()
        visual_bugs = await self._llm_visual_check(
            page, screenshot, action
        )
        bugs.extend(visual_bugs)

        return bugs

    async def _check_http_status(self, page) -> Optional[BugReport]:
        """Heuristic: scan the body text for common HTTP error-page strings."""
        content = await page.text_content("body") or ""
        error_patterns = [
            "500 Internal Server Error",
            "404 Not Found",
            "403 Forbidden",
            "502 Bad Gateway",
        ]
        for pattern in error_patterns:
            if pattern.lower() in content.lower():
                return BugReport(
                    title=f"HTTP error page: {pattern}",
                    severity=BugSeverity.HIGH,
                    description=f"Page shows {pattern}",
                    steps_to_reproduce=[],
                    expected_behavior="Page loads successfully",
                    actual_behavior=f"Error page: {pattern}",
                    url=page.url,
                )
        return None

    def _is_significant_error(self, error: str) -> bool:
        """Filter out noise from console errors."""
        noise_patterns = [
            "favicon.ico",
            "third-party",
            "analytics",
            "deprecated",
        ]
        return not any(p in error.lower() for p in noise_patterns)
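The detector's _llm_visual_check is not shown; it would send the screenshot to a vision-capable model and parse its JSON reply. The parsing half is a pure function and can be sketched. The {"bugs": [...]} response schema is an assumption for illustration, and BugReport is a trimmed mirror of the article's dataclass so the snippet runs alone:

```python
import json
from dataclasses import dataclass
from enum import Enum

class BugSeverity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

# Trimmed mirror of the article's BugReport for self-containment
@dataclass
class BugReport:
    title: str
    severity: BugSeverity
    description: str = ""
    expected_behavior: str = ""
    actual_behavior: str = ""
    url: str = ""

def parse_visual_findings(raw_json: str, url: str) -> list[BugReport]:
    """Turn a model's {"bugs": [...]} JSON reply into BugReport objects."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # Malformed model output: report nothing rather than crash
    reports = []
    for item in data.get("bugs", []):
        try:
            severity = BugSeverity(item.get("severity", "low"))
        except ValueError:
            severity = BugSeverity.LOW  # Unknown labels degrade gracefully
        reports.append(BugReport(
            title=item.get("title", "Untitled visual bug"),
            severity=severity,
            description=item.get("description", ""),
            expected_behavior=item.get("expected", ""),
            actual_behavior=item.get("actual", ""),
            url=url,
        ))
    return reports
```

Isolating the parsing like this keeps the only nondeterministic step (the model call) at the edge, so malformed or surprising model output can never crash the detection loop.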

Report Generation

The report generator compiles all discovered bugs into a structured, actionable report.

class TestReportGenerator:
    def generate_report(self, state: ExplorationState) -> str:
        """Generate a structured test report."""
        lines = [
            "# AI Exploratory Test Report",
            f"Generated: {datetime.utcnow().isoformat()}",
            "",
            "## Summary",
            f"- Pages visited: {len(state.visited_urls)}",
            f"- Elements tested: {len(state.clicked_elements)}",
            f"- Forms submitted: {state.forms_submitted}",
            f"- Bugs found: {len(state.bugs_found)}",
            "",
        ]

        # Group bugs by severity
        for severity in BugSeverity:
            severity_bugs = [
                b for b in state.bugs_found
                if b.severity == severity
            ]
            if not severity_bugs:
                continue

            lines.append(f"## {severity.value.upper()} ({len(severity_bugs)})")
            for i, bug in enumerate(severity_bugs, 1):
                lines.extend([
                    f"### {i}. {bug.title}",
                    f"**URL:** {bug.url}",
                    f"**Description:** {bug.description}",
                    f"**Expected:** {bug.expected_behavior}",
                    f"**Actual:** {bug.actual_behavior}",
                    "",
                ])

        return "\n".join(lines)

Running the Full Testing Pipeline

from pathlib import Path

async def run_ai_testing(target_url: str,
                          max_steps: int = 200) -> str:
    """Run a complete AI testing session."""
    client = AsyncOpenAI()
    engine = ExplorationEngine(client, target_url)

    state = await engine.explore(max_steps=max_steps)

    reporter = TestReportGenerator()
    report = reporter.generate_report(state)

    Path("test_report.md").write_text(report)
    print(f"Testing complete. Found {len(state.bugs_found)} bugs.")

    return report

FAQ

How does AI exploratory testing compare to traditional test suites in terms of bug detection rate?

AI exploratory testing excels at finding bugs in areas that scripted tests never cover — unusual navigation sequences, unexpected input combinations, and edge cases in form validation. In practice, AI exploratory testing finds 15-30% more unique bugs than scripted suites alone, but it is not a replacement. The best approach combines both: scripted tests for regression coverage and AI exploration for novel bug discovery.

How do I prevent the AI tester from performing destructive actions like deleting data?

Implement an action filter that blocks dangerous operations before execution. Maintain a blocklist of selectors and action patterns (delete buttons, admin operations, payment submissions) and require explicit opt-in for destructive tests. Run the agent against a staging environment with seed data that can be reset after each test session.
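A minimal sketch of such a filter, assuming the article's TestAction shape (the pattern list here is illustrative, not exhaustive; a real deployment would tune it per application):

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestAction:
    action_type: str
    target: str
    value: Optional[str] = None

# Illustrative blocklist of selector/URL patterns to refuse
BLOCKED_PATTERNS = [
    r"delete|remove|destroy",       # destructive verbs
    r"payment|checkout|purchase",   # anything that moves money
    r"\badmin\b",                   # privileged areas
]

def is_action_allowed(action: TestAction) -> bool:
    """Return False when the action matches any blocklisted pattern."""
    haystack = f"{action.target} {action.value or ''}".lower()
    return not any(re.search(p, haystack) for p in BLOCKED_PATTERNS)
```

The filter runs before every action execution, so even if the LLM proposes a destructive step, it is rejected rather than performed.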

Can AI testing agents generate regression test scripts from their explorations?

Yes. When the agent discovers a bug, it has a complete record of the actions that led to it. These can be converted to Playwright or Selenium test scripts that reproduce the bug deterministically. This converts exploratory findings into permanent regression tests.


#QAAutomation #AITesting #ExploratoryTesting #BugDetection #TestGeneration #AgenticAI #Playwright #AutomatedQA

Written by CallSphere Team