Platform Reliability: Building 99.9% Uptime for an AI Agent SaaS

Engineer 99.9% uptime for an AI agent platform through redundancy design, health checking, circuit breakers, graceful degradation, and chaos engineering practices that find failures before your customers do.

The Math of 99.9%

99.9% uptime sounds impressive until you do the math. It allows 8.76 hours of downtime per year, or 43.8 minutes per month. For an agent platform serving customer-facing chatbots, 43 minutes of downtime means 43 minutes where your customers' customers get error messages instead of answers. That is enough to lose enterprise accounts.
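
The arithmetic above can be reproduced with a short sketch. The function name and the use of an average 730-hour month are assumptions made for illustration.

```python
def downtime_budget_minutes(slo: float, period_hours: float) -> float:
    """Minutes of allowed downtime for an availability target over a period."""
    return (1.0 - slo) * period_hours * 60

HOURS_PER_YEAR = 365 * 24              # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month (assumption)

for slo in (0.99, 0.999, 0.9999):
    yearly_h = downtime_budget_minutes(slo, HOURS_PER_YEAR) / 60
    monthly_min = downtime_budget_minutes(slo, HOURS_PER_MONTH)
    print(f"{slo:.2%}: {yearly_h:.2f} h/year, {monthly_min:.1f} min/month")
```

For a 0.999 target this gives the 8.76 hours per year and 43.8 minutes per month quoted above.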

The path to 99.9% is not about preventing all failures — it is about ensuring that no single failure takes down the entire system. Every component must be redundant, every dependency must have a fallback, and every failure mode must be detected and isolated within seconds.
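
To see why redundancy matters, here is a rough sketch of how component availabilities combine. It assumes failures are independent, which real systems only approximate, and the function names and example figures are illustrative.

```python
from math import prod

def serial_availability(parts: list[float]) -> float:
    """All parts must be up: availabilities multiply (assumes independent failures)."""
    return prod(parts)

def redundant_availability(replicas: list[float]) -> float:
    """At least one replica must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in replicas)

# Three hard dependencies at 99.9% each already fall below a 99.9% target.
chain = serial_availability([0.999, 0.999, 0.999])

# Two replicas at 99% each beat a single 99.9% component.
pair = redundant_availability([0.99, 0.99])

print(f"serial chain: {chain:.4%}, redundant pair: {pair:.4%}")
```

Correlated failures (a shared region, a shared deploy) break the independence assumption, so treat these numbers as an upper bound.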

Health Check System

Reliable systems start with reliable health checks. Shallow checks that return 200 OK without testing dependencies are useless. Deep health checks verify that the service can actually do its job:

flowchart TD
    CALL(["Inbound Call"])
    HEALTH{"Primary<br/>agent healthy?"}
    PRIMARY["Primary agent<br/>LLM provider A"]
    SECONDARY["Hot standby<br/>LLM provider B"]
    QUEUE[("Persisted<br/>call state")]
    HUMAN(["Live human<br/>fallback"])
    DONE(["Caller served"])
    CALL --> HEALTH
    HEALTH -->|Yes| PRIMARY
    HEALTH -->|Timeout or 5xx| SECONDARY
    PRIMARY --> QUEUE
    SECONDARY --> QUEUE
    PRIMARY --> DONE
    SECONDARY --> DONE
    SECONDARY -->|Both fail| HUMAN
    style HEALTH fill:#f59e0b,stroke:#d97706,color:#1f2937
    style PRIMARY fill:#4f46e5,stroke:#4338ca,color:#fff
    style SECONDARY fill:#0ea5e9,stroke:#0369a1,color:#fff
    style HUMAN fill:#dc2626,stroke:#b91c1c,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
# health.py — Deep health check implementation
import asyncio
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

@dataclass
class ComponentHealth:
    name: str
    status: HealthStatus
    latency_ms: float
    message: str = ""

@dataclass
class SystemHealth:
    status: HealthStatus
    components: list[ComponentHealth] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

class HealthChecker:
    def __init__(self, db, redis_client, llm_client):
        self.db = db
        self.redis = redis_client
        self.llm = llm_client

    async def check(self) -> SystemHealth:
        checks = await asyncio.gather(
            self._check_database(),
            self._check_redis(),
            self._check_llm_provider(),
            return_exceptions=True,
        )

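        # gather() preserves order, so pair each result with its component
        # name; an unexpected exception is then attributed to the right dependency.
        names = ("database", "redis", "llm_provider")
        components = []
        for name, result in zip(names, checks):
            if isinstance(result, Exception):
                components.append(ComponentHealth(
                    name=name, status=HealthStatus.UNHEALTHY,
                    latency_ms=0, message=str(result),
                ))
            else:
                components.append(result)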

        unhealthy = sum(1 for c in components if c.status == HealthStatus.UNHEALTHY)
        degraded = sum(1 for c in components if c.status == HealthStatus.DEGRADED)

        if unhealthy > 0:
            overall = HealthStatus.UNHEALTHY
        elif degraded > 0:
            overall = HealthStatus.DEGRADED
        else:
            overall = HealthStatus.HEALTHY

        return SystemHealth(status=overall, components=components)

    async def _check_database(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.db.execute("SELECT 1")
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 100 else HealthStatus.DEGRADED
            return ComponentHealth("database", status, latency)
        except Exception as e:
            return ComponentHealth("database", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_redis(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.redis.ping()
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 50 else HealthStatus.DEGRADED
            return ComponentHealth("redis", status, latency)
        except Exception as e:
            return ComponentHealth("redis", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_llm_provider(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            # Minimal completion to verify API connectivity.
            # Chat models take `messages` via chat.completions, the same
            # call path the failover client uses.
            await self.llm.chat.completions.create(
                model="gpt-4o-mini", messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 2000 else HealthStatus.DEGRADED
            return ComponentHealth("llm_provider", status, latency)
        except Exception as e:
            return ComponentHealth("llm_provider", HealthStatus.UNHEALTHY, 0, str(e))
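
The checks above measure latency but do not bound it, so a hung dependency can stall the health endpoint itself. One possible way to cap each probe is `asyncio.wait_for`. The helper below is a minimal sketch: `check_with_timeout`, its dict result, and the 2-second default are assumptions, not part of the code above.

```python
import asyncio
import time
from typing import Awaitable, Callable

async def check_with_timeout(
    name: str,
    probe: Callable[[], Awaitable[None]],
    timeout_s: float = 2.0,
) -> dict:
    """Run one probe; report 'unhealthy' if it raises or exceeds timeout_s."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
        status, message = "healthy", ""
    except asyncio.TimeoutError:
        status, message = "unhealthy", f"timed out after {timeout_s}s"
    except Exception as exc:
        status, message = "unhealthy", str(exc)
    latency_ms = (time.monotonic() - start) * 1000
    return {"name": name, "status": status, "latency_ms": latency_ms, "message": message}

async def _demo() -> None:
    async def hung_dependency() -> None:
        await asyncio.sleep(1)  # stands in for a dependency that never answers

    print(await check_with_timeout("database", hung_dependency, timeout_s=0.05))

if __name__ == "__main__":
    asyncio.run(_demo())
```

A timed-out probe is reported with its real elapsed time rather than 0, which makes slow failures visible on dashboards.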

Circuit Breaker Pattern

When an LLM provider goes down, you do not want every request to wait 30 seconds for a timeout. A circuit breaker detects failure patterns and fails fast:

# circuit_breaker.py — Circuit breaker for external dependencies
import time
from enum import Enum
from dataclasses import dataclass

class CircuitState(str, Enum):
    CLOSED = "closed"       # Normal operation
    OPEN = "open"           # Failing fast, not sending requests
    HALF_OPEN = "half_open" # Testing if the service recovered

@dataclass
class CircuitBreaker:
    name: str
    failure_threshold: int = 5
    recovery_timeout: float = 30.0  # seconds
    half_open_max_calls: int = 3

    _state: CircuitState = CircuitState.CLOSED
    _failure_count: int = 0
    _last_failure_time: float = 0
    _half_open_calls: int = 0

    @property
    def state(self) -> CircuitState:
        if self._state == CircuitState.OPEN:
            if time.monotonic() - self._last_failure_time > self.recovery_timeout:
                self._state = CircuitState.HALF_OPEN
                self._half_open_calls = 0
        return self._state

    def record_success(self):
        if self._state == CircuitState.HALF_OPEN:
            self._half_open_calls += 1
            if self._half_open_calls >= self.half_open_max_calls:
                self._state = CircuitState.CLOSED
                self._failure_count = 0
        else:
            self._failure_count = 0

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.monotonic()
        if self._failure_count >= self.failure_threshold:
            self._state = CircuitState.OPEN

    def allow_request(self) -> bool:
        state = self.state
        if state == CircuitState.CLOSED:
            return True
        if state == CircuitState.HALF_OPEN:
            return True
        return False  # OPEN — fail fast
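
Note that `allow_request` above admits every request while half-open, so a provider that is still recovering can receive full traffic. One possible variant caps the number of trial calls. The class below is a sketch, not the breaker above: `ProbeLimitedBreaker`, its string states, and the injectable `clock` (useful for tests) are hypothetical.

```python
import time
from typing import Callable

class ProbeLimitedBreaker:
    """Hypothetical breaker variant that caps trial calls while half-open."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 half_open_max_calls: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_admitted = 0
        self.probe_successes = 0

    def allow_request(self) -> bool:
        if self.state == "open" and self.clock() - self.opened_at > self.recovery_timeout:
            # Recovery window elapsed: start a fresh round of trial calls.
            self.state = "half_open"
            self.probes_admitted = 0
            self.probe_successes = 0
        if self.state == "closed":
            return True
        if self.state == "half_open" and self.probes_admitted < self.half_open_max_calls:
            self.probes_admitted += 1
            return True
        return False  # open, or half-open with all probe slots taken

    def record_success(self) -> None:
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_max_calls:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        if self.state == "open":
            return  # late result from an earlier call; already failing fast
        self.failures += 1
        # Any failed probe reopens the circuit immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

This sketch is not thread-safe; under concurrent use the counters would need a lock.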

Multi-Provider LLM Failover

The highest-risk dependency for an agent platform is the LLM provider. If OpenAI goes down, your entire platform goes down — unless you have failover:

# llm_failover.py — Multi-provider LLM failover
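from circuit_breaker import CircuitBreaker  # the class defined in circuit_breaker.py above

class AllProvidersFailedError(Exception):
    """Raised when every configured provider failed or had an open circuit."""

class LLMFailoverClient: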
    def __init__(self, providers: list[dict]):
        self.providers = providers  # [{"name": "openai", "client": ..., "models": {...}}]
        self.breakers = {p["name"]: CircuitBreaker(name=p["name"]) for p in providers}

    async def complete(self, messages: list, model: str, **kwargs) -> dict:
        errors = []
        for provider in self.providers:
            breaker = self.breakers[provider["name"]]
            if not breaker.allow_request():
                errors.append(f"{provider['name']}: circuit open")
                continue

            mapped_model = provider["models"].get(model, model)
            try:
                result = await provider["client"].chat.completions.create(
                    model=mapped_model, messages=messages, **kwargs,
                )
                breaker.record_success()
                return {
                    "content": result.choices[0].message.content,
                    "provider": provider["name"],
                    "model": mapped_model,
                    "input_tokens": result.usage.prompt_tokens,
                    "output_tokens": result.usage.completion_tokens,
                }
            except Exception as e:
                breaker.record_failure()
                errors.append(f"{provider['name']}: {str(e)}")

        raise AllProvidersFailedError(
            f"All LLM providers failed: {'; '.join(errors)}"
        )

# Configuration
failover_client = LLMFailoverClient([
    {
        "name": "openai",
        "client": openai_client,
        "models": {"gpt-4o": "gpt-4o", "gpt-4o-mini": "gpt-4o-mini"},
    },
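    {
        "name": "anthropic",
        # Must expose an OpenAI-style chat.completions.create(); the native
        # Anthropic SDK does not, so wrap it in an adapter or use an
        # OpenAI-compatible endpoint. Verify model IDs against the provider's
        # current model list before deploying.
        "client": anthropic_client,
        "models": {"gpt-4o": "claude-sonnet-4-20250514", "gpt-4o-mini": "claude-haiku-4-20250414"},
    },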
])
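
`LLMFailoverClient` calls `chat.completions.create(...)` on every provider and reads `choices[0].message.content` and `usage.prompt_tokens` from the result. The Anthropic SDK's native call is `messages.create(...)`, which returns a different shape. One way to bridge the gap is a thin adapter; the sketch below is an assumption about how that could look (`AnthropicChatAdapter` and its default `max_tokens` are hypothetical), and it only forwards extra keyword arguments unchanged, so callers should pass only parameters both APIs accept.

```python
from types import SimpleNamespace
from typing import Optional

class AnthropicChatAdapter:
    """Hypothetical shim exposing chat.completions.create() over an
    Anthropic-style client whose native call is messages.create()."""

    def __init__(self, anthropic_client, default_max_tokens: int = 1024):
        self._client = anthropic_client
        self._default_max_tokens = default_max_tokens
        # Mirror the attribute path the failover client calls.
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    async def _create(self, model: str, messages: list,
                      max_tokens: Optional[int] = None, **kwargs):
        # Anthropic takes the system prompt as a separate argument, not a message.
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        chat = [m for m in messages if m["role"] != "system"]
        if system:
            kwargs["system"] = system
        resp = await self._client.messages.create(
            model=model,
            messages=chat,
            max_tokens=max_tokens or self._default_max_tokens,  # required by Anthropic
            **kwargs,
        )
        # Join the text blocks of the reply into a single string.
        text = "".join(b.text for b in resp.content if getattr(b, "type", "text") == "text")
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=text))],
            usage=SimpleNamespace(
                prompt_tokens=resp.usage.input_tokens,
                completion_tokens=resp.usage.output_tokens,
            ),
        )
```

With such a shim, the configuration would pass `AnthropicChatAdapter(anthropic_client)` as the `"client"` value, assuming `anthropic_client` is an async Anthropic client.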

Graceful Degradation Strategy

When components fail, the system should degrade gracefully rather than crash entirely:

# degradation.py — Graceful degradation policies
from health import HealthChecker, HealthStatus  # defined in health.py above

class DegradationPolicy:
    def __init__(self, health_checker: HealthChecker):
        self.health = health_checker

    async def get_capabilities(self) -> dict:
        health = await self.health.check()
        component_status = {c.name: c.status for c in health.components}

        return {
            "chat": component_status.get("llm_provider") != HealthStatus.UNHEALTHY,
            "streaming": component_status.get("llm_provider") == HealthStatus.HEALTHY,
            "conversation_history": component_status.get("database") != HealthStatus.UNHEALTHY,
            "analytics": component_status.get("database") == HealthStatus.HEALTHY,
            "caching": component_status.get("redis") != HealthStatus.UNHEALTHY,
            "real_time_usage": component_status.get("redis") == HealthStatus.HEALTHY,
        }

If Redis is down, the system still works; it just skips caching. If the database is degraded, analytics queries are disabled while conversation history stays available. If the database is unhealthy, chat continues using in-memory conversation state. This layered degradation keeps the core functionality running even when supporting services fail.
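
As a rough illustration of how a caller might act on the capability flags, the sketch below keeps a per-process copy of each conversation and only touches the database when `conversation_history` is enabled. `ConversationHistory` and its `db_load`/`db_save` callables are hypothetical stand-ins, and the synchronous interface is a simplification.

```python
from collections import defaultdict
from typing import Callable

class ConversationHistory:
    """Hypothetical history store that falls back to process memory."""

    def __init__(self, db_load: Callable[[str], list], db_save: Callable[[str, str], None]):
        self._db_load = db_load
        self._db_save = db_save
        self._memory: dict[str, list] = defaultdict(list)

    def append(self, capabilities: dict, conversation_id: str, message: str) -> None:
        # Always keep the in-memory copy so chat survives a database outage.
        self._memory[conversation_id].append(message)
        if capabilities.get("conversation_history", False):
            self._db_save(conversation_id, message)

    def load(self, capabilities: dict, conversation_id: str) -> list:
        if capabilities.get("conversation_history", False):
            return self._db_load(conversation_id)
        return list(self._memory[conversation_id])
```

Two limits of this sketch: messages written during an outage are not backfilled to the database afterwards, and the in-memory copy grows without bound, so a real version would need a replay step and an eviction policy.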

FAQ

How do I implement chaos engineering without breaking production?

Start with game days in a staging environment. Use tools like Chaos Monkey or LitmusChaos to randomly kill pods, inject network latency, and simulate LLM provider outages. Once your team is comfortable with the failure modes, introduce controlled chaos in production during business hours with the team ready to intervene. Never run chaos experiments during peak traffic or outside business hours.
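
Before reaching for a dedicated tool, a provider outage can also be simulated in application code during a staging game day. The wrapper below is a minimal sketch: `with_chaos`, `InjectedFault`, and the parameters are hypothetical names, and it should stay behind a flag that is off by default.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional

class InjectedFault(Exception):
    """Raised by the chaos wrapper to imitate a failing dependency."""

def with_chaos(
    fn: Callable[..., Awaitable],
    failure_rate: float = 0.0,
    extra_latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[..., Awaitable]:
    """Wrap an async call so it sometimes fails or responds slowly."""
    rng = rng or random.Random()

    async def wrapper(*args, **kwargs):
        if extra_latency_s:
            await asyncio.sleep(extra_latency_s)  # injected network latency
        if rng.random() < failure_rate:
            raise InjectedFault("chaos: simulated provider outage")
        return await fn(*args, **kwargs)

    return wrapper

async def _demo() -> None:
    async def call_llm(prompt: str) -> str:  # stand-in for a real provider call
        return f"echo: {prompt}"

    flaky = with_chaos(call_llm, failure_rate=1.0)
    try:
        await flaky("hello")
    except InjectedFault as exc:
        print("caught:", exc)

if __name__ == "__main__":
    asyncio.run(_demo())
```

Wrapping the primary provider's call with `failure_rate=1.0` is a quick way to confirm that the circuit breaker opens and traffic shifts to the secondary.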

What monitoring and alerting should I set up for 99.9% uptime?

Monitor the four golden signals: latency (P50, P95, P99 response times), traffic (requests per second), errors (error rate by status code), and saturation (CPU, memory, connection pool usage). Set alerts on error rate exceeding 1% for 5 minutes and P95 latency exceeding 5 seconds for 10 minutes. Use PagerDuty or Opsgenie for on-call rotation. Dashboard these in Grafana with a 30-day uptime counter visible to the entire team.
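
The "exceeding X for N minutes" rules can be read as: fire only when every one of the last N one-minute samples is above the threshold. In practice the monitoring stack evaluates this; the sketch below only illustrates the logic, and `SustainedThresholdAlert` is a hypothetical name.

```python
from collections import deque

class SustainedThresholdAlert:
    """Fires when every sample in the last `window` observations exceeds `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples: deque = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one per-minute sample; return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)

# The two rules from the text, assuming one sample per minute.
error_rate_alert = SustainedThresholdAlert(threshold=0.01, window=5)   # >1% for 5 min
p95_latency_alert = SustainedThresholdAlert(threshold=5.0, window=10)  # >5 s for 10 min
```

Requiring the whole window to breach avoids paging on a single bad minute, at the cost of a detection delay equal to the window length.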

How do I handle planned maintenance without counting against my uptime SLA?

Schedule maintenance windows in advance and communicate them to customers 72 hours ahead. Use blue-green deployments so that most updates require zero downtime. For database migrations that require downtime, run them during the lowest-traffic window and keep the maintenance window under 15 minutes. Your SLA should explicitly exclude pre-announced maintenance windows.
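
To make the effect of the exclusion concrete, here is a small sketch of the calculation. The function name and the example figures (a 30-day month, 30 unplanned minutes, one 15-minute window) are illustrative assumptions, and the exact formula should follow whatever your SLA text defines.

```python
def measured_availability(period_min: float, unplanned_min: float,
                          planned_min: float = 0.0,
                          exclude_planned: bool = True) -> float:
    """Availability over a period, with or without pre-announced maintenance."""
    if exclude_planned:
        # Announced windows shrink the measured period instead of counting as downtime.
        return 1 - unplanned_min / (period_min - planned_min)
    return 1 - (unplanned_min + planned_min) / period_min

MONTH_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day month

with_exclusion = measured_availability(MONTH_MIN, unplanned_min=30, planned_min=15)
without_exclusion = measured_availability(MONTH_MIN, 30, 15, exclude_planned=False)
print(f"excluded: {with_exclusion:.4%}, counted: {without_exclusion:.4%}")
```

In this example the same month meets a 99.9% target when the window is excluded and misses it when the window counts as downtime.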

