---
title: "Error Recovery Patterns: Self-Healing Agents That Fix Their Own Mistakes"
description: "Build AI agents that detect their own errors, apply correction strategies, and learn from failures through feedback loops. Covers error detection, self-correction, escalation paths, and continuous improvement."
canonical: https://callsphere.ai/blog/error-recovery-patterns-self-healing-agents-fix-own-mistakes
category: "Learn Agentic AI"
tags: ["Self-Healing", "Error Recovery", "Feedback Loops", "AI Agents", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.153Z
---

# Error Recovery Patterns: Self-Healing Agents That Fix Their Own Mistakes

> Build AI agents that detect their own errors, apply correction strategies, and learn from failures through feedback loops. Covers error detection, self-correction, escalation paths, and continuous improvement.

## Beyond Crash and Retry: Agents That Correct Themselves

Traditional error handling stops at retry and abort. But LLM-powered agents have a unique capability that conventional software does not — they can reason about their own failures. When a tool call returns an error, the agent can read the error message, understand what went wrong, and try a different approach. This self-healing capability is what separates fragile demos from production-grade agents.

The challenge is building structured self-healing that is reliable, bounded, and observable.

## The Self-Healing Loop

A self-healing agent wraps its execution in a loop that detects errors, diagnoses the cause, and applies a correction strategy.

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse and
classify"]
    PLAN["Plan and tool
selection"]
    AGENT["Agent loop
LLM and tools"]
    GUARD{"Guardrails
and policy"}
    EXEC["Execute and
verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome and
next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional
import logging

logger = logging.getLogger("agent.self_heal")

class RecoveryAction(Enum):
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"

@dataclass
class ErrorDiagnosis:
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None

@dataclass
class HealingAttempt:
    diagnosis: ErrorDiagnosis
    success: bool
    result: Optional[dict] = None

class SelfHealingAgent:
    def __init__(self, llm_client, tool_registry: dict, max_healing_attempts: int = 3):
        self.llm = llm_client
        self.tools = tool_registry
        self.max_healing_attempts = max_healing_attempts
        self.healing_history: list[HealingAttempt] = []

    async def execute_with_healing(
        self, tool_name: str, args: dict, context: str = "",
    ) -> dict:
        """Execute a tool call with self-healing on failure."""
        # First attempt
        try:
            return await self._call_tool(tool_name, args)
        except Exception as first_error:
            logger.warning(f"Tool {tool_name} failed: {first_error}")
            # Capture the exception now: Python clears the `except ... as`
            # variable when the block exits, so it cannot be read later.
            last_error: Exception = first_error

        # Self-healing loop
        for attempt in range(self.max_healing_attempts):
            diagnosis = await self._diagnose_error(
                tool_name, args, last_error, context,
            )
            logger.info(
                f"Healing attempt {attempt + 1}: {diagnosis.recovery_action.value}"
            )

            if diagnosis.recovery_action == RecoveryAction.ABORT:
                raise RuntimeError(f"Unrecoverable: {diagnosis.root_cause}")

            if diagnosis.recovery_action == RecoveryAction.ASK_USER:
                return {"needs_input": True, "message": diagnosis.user_message}

            if diagnosis.recovery_action == RecoveryAction.ESCALATE:
                return {"escalated": True, "reason": diagnosis.root_cause}

            try:
                result = await self._apply_recovery(diagnosis, tool_name, args)
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=True, result=result)
                )
                return result
            except Exception as exc:
                last_error = exc
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=False)
                )

        raise RuntimeError(
            f"Failed after {self.max_healing_attempts} healing attempts"
        )
```

## LLM-Powered Error Diagnosis

The agent uses its LLM to analyze the error and determine the best recovery strategy.

```python
    async def _diagnose_error(
        self, tool_name: str, args: dict, error: Exception, context: str,
    ) -> ErrorDiagnosis:
        """Use the LLM to diagnose the error and recommend recovery."""
        diagnosis_prompt = f"""A tool call failed. Diagnose the error and recommend a recovery action.

Tool: {tool_name}
Arguments: {args}
Error: {type(error).__name__}: {error}
Context: {context}

Previous healing attempts for this request:
{self._format_history()}

Choose ONE recovery action:
- RETRY_SAME: The failure looks transient; retry with the same arguments
- RETRY_MODIFIED: Fix the arguments and retry (provide corrected args)
- USE_ALTERNATIVE: Use a different tool (specify which)
- ASK_USER: Need clarification from the user (provide a question)
- ESCALATE: This needs human operator intervention
- ABORT: This cannot be recovered

Respond in this exact format:
ERROR_TYPE:
ACTION:
ROOT_CAUSE:
MODIFIED_ARGS:
ALTERNATIVE_TOOL:
USER_MESSAGE: """

        response = await self.llm.complete(diagnosis_prompt)
        return self._parse_diagnosis(response)
```
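The code above calls `self._parse_diagnosis`, which is not shown. Here is a minimal sketch of that helper, written as a standalone function and assuming the LLM follows the `KEY: value` format the prompt requests; the enum and dataclass are repeated so the snippet runs on its own. A production parser should fail closed (default to ABORT) on anything it cannot interpret, as this one does.

```python
import ast
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RecoveryAction(Enum):  # as defined earlier
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"

@dataclass
class ErrorDiagnosis:  # as defined earlier
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None

def parse_diagnosis(response: str) -> ErrorDiagnosis:
    """Parse the KEY: value lines the diagnosis prompt requests."""
    fields: dict[str, str] = {}
    for line in response.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().upper()] = value.strip()

    # Unknown or missing action: fail closed rather than guess.
    try:
        action = RecoveryAction[fields.get("ACTION", "ABORT").upper()]
    except KeyError:
        action = RecoveryAction.ABORT

    # MODIFIED_ARGS arrives as a Python-literal dict; parse it safely.
    modified = None
    if fields.get("MODIFIED_ARGS"):
        try:
            modified = ast.literal_eval(fields["MODIFIED_ARGS"])
        except (ValueError, SyntaxError):
            modified = None

    return ErrorDiagnosis(
        error_type=fields.get("ERROR_TYPE", "unknown"),
        root_cause=fields.get("ROOT_CAUSE", "unspecified"),
        recovery_action=action,
        modified_args=modified,
        alternative_tool=fields.get("ALTERNATIVE_TOOL") or None,
        user_message=fields.get("USER_MESSAGE") or None,
    )
```

Using `ast.literal_eval` rather than `eval` keeps the LLM's output from executing arbitrary code; structured-output APIs or JSON mode are sturdier alternatives to line-oriented parsing when your provider supports them.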

## Structured Recovery Strategies

Each recovery action maps to a concrete execution path.

```python
    async def _apply_recovery(
        self, diagnosis: ErrorDiagnosis, original_tool: str, original_args: dict,
    ) -> dict:
        if diagnosis.recovery_action == RecoveryAction.RETRY_SAME:
            return await self._call_tool(original_tool, original_args)

        elif diagnosis.recovery_action == RecoveryAction.RETRY_MODIFIED:
            modified = {**original_args, **(diagnosis.modified_args or {})}
            return await self._call_tool(original_tool, modified)

        elif diagnosis.recovery_action == RecoveryAction.USE_ALTERNATIVE:
            alt_tool = diagnosis.alternative_tool
            if alt_tool not in self.tools:
                raise ValueError(f"Alternative tool '{alt_tool}' not found")
            return await self._call_tool(alt_tool, original_args)

        raise ValueError(f"Unhandled recovery: {diagnosis.recovery_action}")

    async def _call_tool(self, tool_name: str, args: dict) -> dict:
        tool_fn = self.tools.get(tool_name)
        if not tool_fn:
            raise ValueError(f"Tool '{tool_name}' not registered")
        return await tool_fn(args)

    def _format_history(self) -> str:
        if not self.healing_history:
            return "None"
        lines = []
        for h in self.healing_history:
            lines.append(
                f"- {h.diagnosis.recovery_action.value}: "
                f"{'succeeded' if h.success else 'failed'} "
                f"(cause: {h.diagnosis.root_cause})"
            )
        return "\n".join(lines)
```
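To make the registry contract concrete, here is a hypothetical tool (the `lookup_order` name and its arguments are invented for illustration) and a manual walkthrough of what a RETRY_MODIFIED recovery does: the corrected arguments are merged over the originals, then the same tool is retried.

```python
import asyncio

# A registry entry: an async callable taking a dict and returning a dict,
# raising on bad input -- exactly what the healing loop is built to catch.
async def lookup_order(args: dict) -> dict:
    if "order_id" not in args:
        raise ValueError("missing required field: order_id")
    return {"order_id": args["order_id"], "status": "shipped"}

tool_registry = {"lookup_order": lookup_order}

async def demo() -> dict:
    original_args = {"email": "a@b.c"}
    try:
        return await tool_registry["lookup_order"](original_args)
    except ValueError:
        # What _apply_recovery does for RETRY_MODIFIED: the diagnosed
        # corrections win on key collisions, untouched keys survive.
        fixed = {**original_args, **{"order_id": "A-1001"}}
        return await tool_registry["lookup_order"](fixed)

result = asyncio.run(demo())
```

The merge order matters: `{**original_args, **modified}` means the LLM's corrections override the originals, but fields it did not mention pass through unchanged.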

## Feedback Loop for Continuous Improvement

Track which error patterns the agent encounters and how successfully it recovers. This data informs prompt improvements and tool hardening.

```python
from collections import defaultdict

class HealingMetrics:
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def success_rate(self, error_type: str, recovery_action: str) -> float:
        key = f"{error_type}:{recovery_action}"
        results = self.recovery_success.get(key, [])
        if not results:
            return 0.0
        return sum(results) / len(results)

    def report(self) -> dict:
        report = {}
        for key, results in self.recovery_success.items():
            rate = sum(results) / len(results) if results else 0
            report[key] = {
                "attempts": len(results),
                "success_rate": round(rate, 2),
            }
        return report
```
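A quick usage sketch with invented numbers (the class is abridged and repeated so the snippet runs standalone): record each healing outcome as it happens, then query the success rate per error-and-action pair to see which strategies are actually earning their keep.

```python
from collections import defaultdict

class HealingMetrics:  # abridged repeat of the class above
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def success_rate(self, error_type: str, recovery_action: str) -> float:
        results = self.recovery_success.get(f"{error_type}:{recovery_action}", [])
        return sum(results) / len(results) if results else 0.0

metrics = HealingMetrics()

# Hypothetical outcomes from a batch of requests.
metrics.record("TimeoutError", "retry_modified", True)
metrics.record("TimeoutError", "retry_modified", False)
metrics.record("TimeoutError", "retry_modified", True)
metrics.record("ValueError", "use_alternative", True)

rate = metrics.success_rate("TimeoutError", "retry_modified")  # 2 of 3
```

A low success rate for a pair like `ValueError:retry_modified` is a signal to harden the tool's input validation or tighten the prompt, rather than keep paying for LLM-driven repairs.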

## Guardrails: Preventing Infinite Healing Loops

Always cap the number of healing attempts, track token spend during recovery, and prevent the agent from trying the same failed strategy twice.

```python
class HealingGuardrails:
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens
```
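Here is how those checks play out over a hypothetical sequence of proposed actions (the class is repeated so the snippet runs standalone): the second proposal is rejected because the same strategy has already been tried, while a genuinely new strategy is still allowed through.

```python
class HealingGuardrails:  # repeated from above so this snippet runs standalone
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens

guard = HealingGuardrails(max_attempts=3, max_token_budget=5000)

# Hypothetical sequence of actions proposed by the diagnoser.
proposals = ["retry_modified", "retry_modified", "use_alternative"]
allowed = []
for attempt, action in enumerate(proposals):
    ok = guard.can_continue(attempt, action)
    allowed.append(ok)
    if ok:
        guard.record_attempt(action, tokens=800)  # assumed per-diagnosis cost
```

When `can_continue` returns `False`, the loop should fall through to escalation rather than silently stopping, so the failure surfaces somewhere a human can see it.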

## FAQ

### Is it safe to let the LLM decide how to fix its own errors?

Yes, with guardrails. The LLM's diagnosis should be constrained to a fixed set of recovery actions (the RecoveryAction enum). The agent code validates the proposed action and prevents unsafe operations like modifying arguments in ways that bypass business rules. The LLM provides intelligence; the code provides safety boundaries.

### How do I prevent the agent from looping between two failing strategies?

Track all attempted strategies in a set and reject any strategy that has already been tried. The HealingGuardrails class above implements this. Additionally, include the full healing history in the diagnosis prompt so the LLM knows which approaches have already failed and can choose a different path.

### When should self-healing escalate to a human?

Escalate when the error involves ambiguous user intent (the agent is unsure what the user wants), when the failure involves financial or irreversible actions, or when the maximum healing attempts are exhausted. The escalation path should capture the full context — original request, error, all healing attempts — so the human reviewer can resolve the issue without asking the user to repeat themselves.
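One way to capture that context is a single structured payload, assembled at escalation time. Every field name below is a hypothetical sketch, not a prescribed schema:

```python
# Hypothetical escalation payload: everything a human reviewer needs to
# resolve the case without asking the user to repeat themselves.
escalation = {
    "original_request": "Cancel order A-1001 and refund it",
    "failed_tool": "refund_order",
    "error": "PermissionError: refund exceeds auto-approval limit",
    "healing_attempts": [
        {"action": "retry_modified", "root_cause": "amount over limit", "success": False},
    ],
    "recommended_action": "manual refund review",
}
```

Attaching the trace ID from your observability layer to this payload lets the reviewer jump straight to the full execution history.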

---

#SelfHealing #ErrorRecovery #FeedbackLoops #AIAgents #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/error-recovery-patterns-self-healing-agents-fix-own-mistakes
