
Error Tracking in AI Agent Systems: Sentry, PagerDuty, and Custom Alerting

Implement comprehensive error tracking for AI agent systems with error classification, severity-based alert routing to Sentry and PagerDuty, and incident response workflows tailored to LLM failure modes.

Agent Error Modes Are Different

Traditional applications have well-understood failure modes: null pointer exceptions, connection timeouts, authentication failures. AI agents add an entirely new category of errors that are harder to detect and classify. An LLM might return a syntactically valid response that calls a nonexistent tool. A tool call might succeed with HTTP 200 but return data the agent misinterprets. The agent might enter an infinite loop of tool calls without ever producing a final answer.

These failure modes require error tracking that goes beyond exception monitoring. You need to classify errors by type, route alerts based on severity and impact, and build incident response workflows that account for the probabilistic nature of LLM behavior.

Classifying Agent Errors

Define a taxonomy of error types so your alerting can be granular. Not all agent errors deserve the same response.

from enum import Enum

class AgentErrorType(Enum):
    # Infrastructure errors - immediate attention
    LLM_API_UNREACHABLE = "llm_api_unreachable"
    DATABASE_CONNECTION_FAILED = "database_connection_failed"
    TOOL_SERVER_DOWN = "tool_server_down"

    # LLM behavior errors - investigate if frequent
    LLM_INVALID_TOOL_CALL = "llm_invalid_tool_call"
    LLM_REFUSED_REQUEST = "llm_refused_request"
    LLM_INFINITE_LOOP = "llm_infinite_loop"
    LLM_CONTEXT_OVERFLOW = "llm_context_overflow"

    # Tool execution errors - may need tool-specific fixes
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_INVALID_RESPONSE = "tool_invalid_response"

    # Validation errors - usually indicates prompt issues
    OUTPUT_VALIDATION_FAILED = "output_validation_failed"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"

class AgentError(Exception):
    def __init__(
        self,
        error_type: AgentErrorType,
        message: str,
        severity: str = "error",
        context: dict | None = None,
    ):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}

Integrating Sentry for Error Tracking

Sentry captures exceptions with full stack traces, groups them by root cause, and tracks their frequency over time. Configure it to enrich agent errors with custom context.

import sentry_sdk
from sentry_sdk import capture_exception

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project-id>",  # your project's DSN
    traces_sample_rate=0.1,
    environment="production",
    release="agent-service@<version>",  # pin to the deployed build
)

async def handle_agent_error(error: AgentError, conversation_id: str, user_id: str):
    """Report agent errors to Sentry with rich context."""
    # Use an isolated scope so tags and context do not leak into unrelated events.
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("error_type", error.error_type.value)
        scope.set_tag("severity", error.severity)
        scope.set_tag("agent_name", error.context.get("agent_name", "unknown"))

        scope.set_context("agent", {
            "conversation_id": conversation_id,
            "user_id": user_id,
            "error_type": error.error_type.value,
            "model": error.context.get("model"),
            "tool_name": error.context.get("tool_name"),
            "step": error.context.get("step"),
        })

        capture_exception(error)

Building a Custom Alert Router

Different error types warrant different responses. Infrastructure errors need PagerDuty pages. LLM behavior errors need Slack notifications. Validation errors need logging for later analysis.


from dataclasses import dataclass
import httpx

@dataclass
class AlertConfig:
    pagerduty_key: str
    slack_webhook: str
    email_endpoint: str

class AlertRouter:
    def __init__(self, config: AlertConfig):
        self.config = config
        self.client = httpx.AsyncClient()

    async def route_alert(self, error: AgentError, conversation_id: str):
        error_type = error.error_type

        # Critical infrastructure errors -> PagerDuty
        if error_type in (
            AgentErrorType.LLM_API_UNREACHABLE,
            AgentErrorType.DATABASE_CONNECTION_FAILED,
            AgentErrorType.TOOL_SERVER_DOWN,
        ):
            await self._page_oncall(error, conversation_id)
            await self._notify_slack(error, conversation_id, channel="#incidents")

        # LLM behavior errors -> Slack warning
        elif error_type in (
            AgentErrorType.LLM_INFINITE_LOOP,
            AgentErrorType.LLM_CONTEXT_OVERFLOW,
        ):
            await self._notify_slack(error, conversation_id, channel="#agent-alerts")

        # Tool errors -> Slack if frequent
        elif error_type in (
            AgentErrorType.TOOL_EXECUTION_FAILED,
            AgentErrorType.TOOL_TIMEOUT,
        ):
            if await self._error_rate_exceeds_threshold(error_type, threshold=10, window_minutes=5):
                await self._notify_slack(error, conversation_id, channel="#agent-alerts")

    async def _page_oncall(self, error: AgentError, conversation_id: str):
        await self.client.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.config.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Agent error: {error.error_type.value} - {str(error)}",
                    "severity": "critical",
                    "source": "agent-service",
                    "custom_details": {
                        "conversation_id": conversation_id,
                        **error.context,
                    },
                },
            },
        )

    async def _notify_slack(self, error: AgentError, conversation_id: str, channel: str):
        await self.client.post(
            self.config.slack_webhook,
            json={
                "channel": channel,
                "text": f"*Agent Error*: {error.error_type.value}\n"
                        f"Message: {str(error)}\n"
                        f"Conversation: {conversation_id}",
            },
        )
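
The router above calls `_error_rate_exceeds_threshold`, which is not shown. A minimal in-process version can keep a sliding window of timestamps per error type. This is a single-replica sketch; with multiple workers you would back the counts with Redis or your metrics store instead.

```python
import time
from collections import defaultdict, deque

class ErrorRateTracker:
    """Counts errors per type inside a sliding time window (single-process sketch)."""

    def __init__(self) -> None:
        self._events: dict[str, deque] = defaultdict(deque)

    def record_and_check(self, error_type: str, threshold: int, window_minutes: int) -> bool:
        """Record one error and return True once the windowed count exceeds threshold."""
        now = time.monotonic()
        window = self._events[error_type]
        window.append(now)
        # Drop events that have aged out of the window.
        cutoff = now - window_minutes * 60
        while window and window[0] < cutoff:
            window.popleft()
        return len(window) > threshold

tracker = ErrorRateTracker()
# The 11th error inside the window is the first to cross a threshold of 10.
hits = [tracker.record_and_check("tool_timeout", threshold=10, window_minutes=5) for _ in range(11)]
```

Inside `AlertRouter`, `_error_rate_exceeds_threshold` would then delegate to `tracker.record_and_check(error_type.value, threshold, window_minutes)`.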

Detecting Agent-Specific Failure Patterns

Some agent failures do not raise exceptions. Detect them with runtime checks.

MAX_TOOL_CALLS_PER_TURN = 10
MAX_AGENT_TURNS = 25

class AgentLoopGuard:
    def __init__(self):
        self.tool_call_count = 0
        self.turn_count = 0
        self.seen_tool_calls = []

    def check_tool_call(self, tool_name: str, arguments: dict):
        self.tool_call_count += 1
        call_signature = f"{tool_name}:{hash(str(sorted(arguments.items())))}"

        # Detect a loop: the same tool call appearing within the last 3 calls
        if call_signature in self.seen_tool_calls[-3:]:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Repeated tool call detected: {tool_name}",
                severity="critical",
                context={"tool_name": tool_name, "lookback_window": 3},
            )

        if self.tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Tool call limit exceeded: {self.tool_call_count}",
                severity="error",
            )

        self.seen_tool_calls.append(call_signature)

    def check_turn(self):
        self.turn_count += 1
        if self.turn_count > MAX_AGENT_TURNS:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Agent turn limit exceeded: {self.turn_count}",
                severity="error",
            )

FAQ

How do I avoid alert fatigue with AI agents?

Use rate-based alerting instead of per-error alerting. A single tool failure is normal — tools can be temporarily unavailable. Page oncall only when the error rate for a given type exceeds a threshold within a time window. For LLM behavior errors, alert on percentage of conversations affected rather than raw count. Review and tune thresholds weekly during the first month of deployment.
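
The "percentage of conversations affected" check can be sketched as a windowed set comparison. This is a standalone illustration; the 15-minute window and any threshold you compare against are arbitrary starting points to tune.

```python
import time
from collections import deque

class ConversationImpactTracker:
    """Tracks what fraction of recent conversations hit an LLM behavior error."""

    def __init__(self, window_minutes: int = 15):
        self.window_seconds = window_minutes * 60
        self._events: deque = deque()  # (timestamp, conversation_id, errored)

    def record(self, conversation_id: str, errored: bool) -> None:
        self._events.append((time.monotonic(), conversation_id, errored))

    def affected_fraction(self) -> float:
        # Prune events older than the window, then compare distinct conversations.
        cutoff = time.monotonic() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        seen = {cid for _, cid, _ in self._events}
        errored = {cid for _, cid, bad in self._events if bad}
        return len(errored) / len(seen) if seen else 0.0

tracker = ConversationImpactTracker()
for i in range(10):
    tracker.record(f"conv-{i}", errored=(i < 2))
# 2 of 10 recent conversations errored; alert only if this exceeds your SLO.
```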

Should I retry LLM calls automatically before raising an error?

Yes, but with limits. Retry transient errors like rate limits (HTTP 429) and server errors (HTTP 500-503) with exponential backoff, up to 3 attempts. Do not retry content policy violations (HTTP 400) or context length errors — these will fail again with the same input. Track retry counts in your error metadata so you can monitor retry rates.
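
That policy can be sketched as a small wrapper. The status codes, attempt count, and the `LLMHTTPError` type here are illustrative stand-ins for whatever your LLM client library raises.

```python
import asyncio
import random

RETRYABLE_STATUS = {429, 500, 502, 503}

class LLMHTTPError(Exception):
    """Stand-in for an LLM client error that carries an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

async def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry transient HTTP errors with exponential backoff; fail fast otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except LLMHTTPError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise  # non-transient error, or out of attempts
            # Full jitter keeps retries from synchronizing across workers.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

With these defaults, a 400 (content policy, context length) propagates on the first attempt, while a 429 is retried up to twice more before giving up.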

How do I handle errors gracefully so the user gets a useful response?

Implement a fallback chain. If the primary model fails, try a fallback model. If all LLM calls fail, return a static message like "I am having trouble processing your request. Please try again in a moment." Never expose raw error messages or stack traces to users. Log the full error details for your engineering team and return a user-friendly message with a reference ID they can share with support.
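
A minimal version of that chain might look like this; the model callables and the reference-ID scheme are hypothetical.

```python
import asyncio
import uuid

async def answer_with_fallbacks(prompt: str, models: list) -> str:
    """Try each model in order; if all fail, return a safe static message."""
    for call_model in models:
        try:
            return await call_model(prompt)
        except Exception:
            continue  # log full details internally; never surface them to users
    reference_id = uuid.uuid4().hex[:8]  # user shares this with support; look it up in logs
    return (
        "I am having trouble processing your request. Please try again in a "
        f"moment. (Reference: {reference_id})"
    )

async def primary(prompt: str) -> str:
    raise RuntimeError("primary model unavailable")

async def fallback(prompt: str) -> str:
    return f"(fallback) {prompt}"
```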


#ErrorTracking #Sentry #PagerDuty #Alerting #IncidentResponse #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
