Learn Agentic AI

Building an AI Testing Agent: Automated QA That Explores and Finds Bugs

Build an AI-powered testing agent that performs exploratory testing, automatically generates test cases, classifies discovered bugs, and produces structured reports for development teams.

Beyond Scripted Test Suites

Traditional automated testing follows scripts: visit this URL, click this button, assert that element appears. This approach catches regressions but never discovers new bugs because it only tests paths that a human already thought to check. AI testing agents flip this model. They explore the application like a curious tester, trying unexpected inputs, clicking buttons in unusual orders, and flagging behavior that looks wrong.

The difference is profound. A scripted test suite with 500 tests will always run the same 500 paths. An AI testing agent generates novel test paths on every run, covering UI states and interaction sequences that no human thought to script.

Architecture of an AI Testing Agent

An AI testing agent consists of four components: an explorer that navigates the application, a test case generator that produces structured test scenarios, a bug classifier that determines whether observed behavior is actually a defect, and a report generator that produces actionable output. The data model below captures the shared vocabulary: actions taken, bugs found, and accumulated exploration state.

from dataclasses import dataclass, field
from enum import Enum
from datetime import datetime
from typing import Optional

class BugSeverity(Enum):
    CRITICAL = "critical"  # App crashes, data loss
    HIGH = "high"          # Feature broken, no workaround
    MEDIUM = "medium"      # Feature broken, workaround exists
    LOW = "low"            # Cosmetic, minor usability

@dataclass
class TestAction:
    action_type: str  # click, fill, navigate, scroll
    target: str
    value: Optional[str] = None
    screenshot_before: Optional[str] = None
    screenshot_after: Optional[str] = None

@dataclass
class BugReport:
    title: str
    severity: BugSeverity
    description: str
    steps_to_reproduce: list[TestAction]
    expected_behavior: str
    actual_behavior: str
    screenshot_path: Optional[str] = None
    url: str = ""
    discovered_at: datetime = field(
        default_factory=datetime.utcnow
    )

@dataclass
class ExplorationState:
    visited_urls: set[str] = field(default_factory=set)
    clicked_elements: set[str] = field(default_factory=set)
    forms_submitted: int = 0
    bugs_found: list[BugReport] = field(default_factory=list)
    action_history: list[TestAction] = field(default_factory=list)
    error_count: int = 0

The Exploration Engine

The explorer navigates the application systematically, prioritizing unvisited pages and untested interaction patterns. It uses an LLM to decide what to do next based on the current page state and exploration history.

from playwright.async_api import async_playwright, Page
from openai import AsyncOpenAI
import json

class ExplorationEngine:
    def __init__(self, client: AsyncOpenAI, base_url: str):
        self.client = client
        self.base_url = base_url
        self.state = ExplorationState()

    async def explore(self, max_steps: int = 100):
        """Main exploration loop."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()

            # Collect console errors and unhandled page exceptions
            console_errors = []

            def _on_console(msg):
                if msg.type == "error":
                    console_errors.append(msg.text)

            page.on("console", _on_console)
            page.on(
                "pageerror",
                lambda err: console_errors.append(str(err)),
            )

            await page.goto(self.base_url)
            self.state.visited_urls.add(self.base_url)

            for step in range(max_steps):
                try:
                    action = await self._decide_next_action(page)
                    await self._execute_action(page, action)

                    # Record coverage so the final report reflects it
                    self.state.action_history.append(action)
                    self.state.visited_urls.add(page.url)
                    if action.action_type == "click":
                        self.state.clicked_elements.add(action.target)

                    # Check for bugs after each action
                    bugs = await self._check_for_bugs(
                        page, action, console_errors
                    )
                    self.state.bugs_found.extend(bugs)
                    console_errors.clear()

                except Exception as e:
                    self.state.error_count += 1
                    if self.state.error_count > 10:
                        break

            await browser.close()

        return self.state

    async def _decide_next_action(self, page: Page) -> TestAction:
        """Use LLM to decide the next exploration action."""
        # Get interactive elements
        elements = await page.evaluate("""
            () => {
                const els = document.querySelectorAll(
                    'a, button, input, select, textarea, '
                    + '[onclick], [role="button"]'
                );
                return Array.from(els).slice(0, 50).map(el => ({
                    tag: el.tagName,
                    text: el.textContent?.trim().slice(0, 50),
                    type: el.type || '',
                    href: el.href || '',
                    id: el.id,
                    name: el.name,
                    selector: el.id ? '#' + el.id
                        : el.name ? '[name="' + el.name + '"]'
                        : el.tagName.toLowerCase(),
                }));
            }
        """)

        visited_summary = (
            f"Visited {len(self.state.visited_urls)} pages, "
            f"clicked {len(self.state.clicked_elements)} elements, "
            f"found {len(self.state.bugs_found)} bugs so far."
        )

        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "You are a QA tester exploring a web app to find "
                    "bugs. Choose the next action to maximize test "
                    "coverage. Prioritize untested elements and "
                    "edge cases. Return JSON: action_type, target "
                    "(selector), value (for inputs)."
                )},
                {"role": "user", "content": (
                    f"Current URL: {page.url}\n"
                    f"Page title: {await page.title()}\n"
                    f"Progress: {visited_summary}\n"
                    f"Available elements:\n"
                    f"{json.dumps(elements[:30], indent=2)}"
                )},
            ],
            response_format={"type": "json_object"},
            temperature=0.7,  # Some randomness for exploration
        )

        data = json.loads(response.choices[0].message.content)
        return TestAction(
            action_type=data.get("action_type", "click"),
            target=data.get("target", ""),
            value=data.get("value"),
        )

Bug Detection and Classification

After each action, the agent checks for bugs by analyzing the page state. It looks for HTTP errors, JavaScript console errors, visual anomalies, broken layouts, and unexpected behavior. The engine's _check_for_bugs hook delegates to the BugDetector class below.


class BugDetector:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def check_for_bugs(self, page: Page,
                              action: TestAction,
                              console_errors: list[str]) -> list[BugReport]:
        """Analyze current page state for potential bugs."""
        bugs = []

        # Check 1: HTTP error pages
        status_check = await self._check_http_status(page)
        if status_check:
            bugs.append(status_check)

        # Check 2: Console errors
        for error in console_errors:
            if self._is_significant_error(error):
                bugs.append(BugReport(
                    title=f"JavaScript error: {error[:80]}",
                    severity=BugSeverity.HIGH,
                    description=f"Console error after action: {error}",
                    steps_to_reproduce=[action],
                    expected_behavior="No JavaScript errors",
                    actual_behavior=f"Console error: {error}",
                    url=page.url,
                ))

        # Check 3: Visual/functional bugs via LLM
        screenshot = await page.screenshot()
        visual_bugs = await self._llm_visual_check(
            page, screenshot, action
        )
        bugs.extend(visual_bugs)

        return bugs

    async def _check_http_status(self, page) -> Optional[BugReport]:
        """Heuristic: scan the body text for common HTTP error-page strings."""
        content = await page.text_content("body") or ""
        error_patterns = [
            "500 Internal Server Error",
            "404 Not Found",
            "403 Forbidden",
            "502 Bad Gateway",
        ]
        for pattern in error_patterns:
            if pattern.lower() in content.lower():
                return BugReport(
                    title=f"HTTP error page: {pattern}",
                    severity=BugSeverity.HIGH,
                    description=f"Page shows {pattern}",
                    steps_to_reproduce=[],
                    expected_behavior="Page loads successfully",
                    actual_behavior=f"Error page: {pattern}",
                    url=page.url,
                )
        return None

    def _is_significant_error(self, error: str) -> bool:
        """Filter out noise from console errors."""
        noise_patterns = [
            "favicon.ico",
            "third-party",
            "analytics",
            "deprecated",
        ]
        return not any(p in error.lower() for p in noise_patterns)
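The detector's _llm_visual_check is not shown; it would send the screenshot to a vision-capable model and parse its JSON reply. The parsing half is a pure function and can be sketched. The {"bugs": [...]} response schema is an assumption for illustration, and BugReport is a trimmed mirror of the article's dataclass so the snippet runs alone:

```python
import json
from dataclasses import dataclass
from enum import Enum

class BugSeverity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

# Trimmed mirror of the article's BugReport for self-containment
@dataclass
class BugReport:
    title: str
    severity: BugSeverity
    description: str = ""
    expected_behavior: str = ""
    actual_behavior: str = ""
    url: str = ""

def parse_visual_findings(raw_json: str, url: str) -> list[BugReport]:
    """Turn a model's {"bugs": [...]} JSON reply into BugReport objects."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # Malformed model output: report nothing rather than crash
    reports = []
    for item in data.get("bugs", []):
        try:
            severity = BugSeverity(item.get("severity", "low"))
        except ValueError:
            severity = BugSeverity.LOW  # Unknown labels degrade gracefully
        reports.append(BugReport(
            title=item.get("title", "Untitled visual bug"),
            severity=severity,
            description=item.get("description", ""),
            expected_behavior=item.get("expected", ""),
            actual_behavior=item.get("actual", ""),
            url=url,
        ))
    return reports
```

Isolating the parsing like this keeps the only nondeterministic step (the model call) at the edge, so malformed or surprising model output can never crash the detection loop.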

Report Generation

The report generator compiles all discovered bugs into a structured, actionable report.

class TestReportGenerator:
    def generate_report(self, state: ExplorationState) -> str:
        """Generate a structured test report."""
        lines = [
            "# AI Exploratory Test Report",
            f"Generated: {datetime.utcnow().isoformat()}",
            "",
            "## Summary",
            f"- Pages visited: {len(state.visited_urls)}",
            f"- Elements tested: {len(state.clicked_elements)}",
            f"- Forms submitted: {state.forms_submitted}",
            f"- Bugs found: {len(state.bugs_found)}",
            "",
        ]

        # Group bugs by severity
        for severity in BugSeverity:
            severity_bugs = [
                b for b in state.bugs_found
                if b.severity == severity
            ]
            if not severity_bugs:
                continue

            lines.append(f"## {severity.value.upper()} ({len(severity_bugs)})")
            for i, bug in enumerate(severity_bugs, 1):
                lines.extend([
                    f"### {i}. {bug.title}",
                    f"**URL:** {bug.url}",
                    f"**Description:** {bug.description}",
                    f"**Expected:** {bug.expected_behavior}",
                    f"**Actual:** {bug.actual_behavior}",
                    "",
                ])

        return "\n".join(lines)

Running the Full Testing Pipeline

from pathlib import Path

async def run_ai_testing(target_url: str,
                          max_steps: int = 200) -> str:
    """Run a complete AI testing session."""
    client = AsyncOpenAI()
    engine = ExplorationEngine(client, target_url)

    state = await engine.explore(max_steps=max_steps)

    reporter = TestReportGenerator()
    report = reporter.generate_report(state)

    Path("test_report.md").write_text(report)
    print(f"Testing complete. Found {len(state.bugs_found)} bugs.")

    return report

FAQ

How does AI exploratory testing compare to traditional test suites in terms of bug detection rate?

AI exploratory testing excels at finding bugs in areas that scripted tests never cover — unusual navigation sequences, unexpected input combinations, and edge cases in form validation. In practice, AI exploratory testing finds 15-30% more unique bugs than scripted suites alone, but it is not a replacement. The best approach combines both: scripted tests for regression coverage and AI exploration for novel bug discovery.

How do I prevent the AI tester from performing destructive actions like deleting data?

Implement an action filter that blocks dangerous operations before execution. Maintain a blocklist of selectors and action patterns (delete buttons, admin operations, payment submissions) and require explicit opt-in for destructive tests. Run the agent against a staging environment with seed data that can be reset after each test session.
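A minimal sketch of such a filter, assuming the article's TestAction shape (the pattern list here is illustrative, not exhaustive; a real deployment would tune it per application):

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestAction:
    action_type: str
    target: str
    value: Optional[str] = None

# Illustrative blocklist of selector/URL patterns to refuse
BLOCKED_PATTERNS = [
    r"delete|remove|destroy",       # destructive verbs
    r"payment|checkout|purchase",   # anything that moves money
    r"\badmin\b",                   # privileged areas
]

def is_action_allowed(action: TestAction) -> bool:
    """Return False when the action matches any blocklisted pattern."""
    haystack = f"{action.target} {action.value or ''}".lower()
    return not any(re.search(p, haystack) for p in BLOCKED_PATTERNS)
```

The filter runs before every action execution, so even if the LLM proposes a destructive step, it is rejected rather than performed.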

Can AI testing agents generate regression test scripts from their explorations?

Yes. When the agent discovers a bug, it has a complete record of the actions that led to it. These can be converted to Playwright or Selenium test scripts that reproduce the bug deterministically. This converts exploratory findings into permanent regression tests.


#QAAutomation #AITesting #ExploratoryTesting #BugDetection #TestGeneration #AgenticAI #Playwright #AutomatedQA

Written by CallSphere Team