Reliability Patterns for AI Systems: Circuit Breakers, Retries, Fallbacks
Circuit breakers, retries, and fallbacks for AI systems require LLM-aware tweaks. The 2026 reliability patterns that actually hold up.
Why LLM Reliability Patterns Differ
Standard reliability patterns (circuit breakers, retries, fallbacks) apply to LLM systems but need LLM-aware adaptations. Naive retries on LLM 429s amplify the rate-limit problem. Circuit breakers tuned for traditional services fire too late or too early. Fallbacks need to preserve quality.
This piece walks through the LLM-aware versions.
Circuit Breakers
flowchart LR
Closed[Closed: healthy] -->|failures cross threshold| Open[Open: fail fast]
Open -->|cool-down elapses| HalfOpen[Half-open: probing]
HalfOpen -->|probe succeeds| Closed
HalfOpen -->|probe fails| Open
A circuit breaker tracks the recent failure rate. When it crosses a threshold, the breaker opens and short-circuits requests (fail fast). After a cool-down it moves to half-open and lets a probe through: success closes the breaker, failure re-opens it.
For LLM APIs, tune it as follows (a minimal sketch follows the list):
- Open on persistent 5xx or 429 errors
- Cool-down typically 30-60 seconds
- Probe with synthetic traffic, not user traffic
- Different breakers per provider in multi-provider stacks
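A minimal breaker sketch in Python; the threshold and cool-down values are illustrative, and this is one way to implement the state machine above, not any particular library's API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at: float | None = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Half-open: re-arm the clock so only one probe slips through
            # per cool-down window (ideally synthetic traffic, per above).
            self.opened_at = time.monotonic()
            return True
        return False  # open: short-circuit and fail fast

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None  # probe succeeded: close the breaker

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open (or re-open after a failed probe)
```

One instance per provider: check allow_request() before each call and record the outcome afterward.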
Retries
Use standard exponential backoff with caps. For LLM APIs (sketch after the list):
- Cap retry count (3-5 typical)
- Cap total retry time (10-30 seconds)
- Respect retry-after headers
- Distinguish retryable (5xx, 429, timeout) from non-retryable (400, 401)
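A backoff sketch along those lines. The `send` callable and its `(status, response, retry_after)` return shape are assumptions for illustration; real SDKs surface these as exceptions and response headers.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def call_with_retries(send, max_attempts: int = 4, max_total_seconds: float = 20.0):
    """send() -> (status_code, response, retry_after_seconds_or_None)."""
    deadline = time.monotonic() + max_total_seconds
    for attempt in range(max_attempts):
        status, response, retry_after = send()
        if status < 400:
            return response
        if status not in RETRYABLE_STATUSES:
            raise RuntimeError(f"non-retryable error {status}")  # e.g. 400, 401
        if attempt == max_attempts - 1:
            break  # retry count exhausted
        # Honor the server's Retry-After when present; otherwise back off exponentially.
        delay = retry_after if retry_after is not None else min(2 ** attempt, 8.0)
        delay += random.uniform(0, 0.5)  # jitter so retries don't synchronize
        if time.monotonic() + delay > deadline:
            break  # total retry-time budget exhausted
        time.sleep(delay)
    raise RuntimeError("retries exhausted")
```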
Fallbacks
Multi-tier degradation:
flowchart TD
Try[Try primary] --> Pri{OK?}
Pri -->|Yes| Done[Return]
Pri -->|No| Sec[Try secondary provider]
Sec --> Sec2{OK?}
Sec2 -->|Yes| Done
Sec2 -->|No| Cache[Use cached recent response]
Cache --> Cache2{Available?}
Cache2 -->|Yes| Done
Cache2 -->|No| Static[Static fallback message]
Four tiers of degradation; each successive tier is faster but lower-quality than the one before (sketch below).
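One way to express the chain in Python; `try_primary`, `try_secondary`, `cache`, and `static_message` are placeholders for your own clients and storage.

```python
def answer(query: str, try_primary, try_secondary, cache, static_message: str):
    """Walk the degradation tiers in order; each tier trades quality for availability."""
    for tier in (try_primary, try_secondary):
        try:
            return tier(query)  # tiers 1-2: live providers
        except Exception:
            continue  # provider failed or timed out: drop to the next tier
    cached = cache.get(query)  # tier 3: recent cached response, if any
    if cached is not None:
        return cached
    return static_message  # tier 4: always available
```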
Idempotency
Retries assume idempotency. For LLM calls with side effects (tool calls), idempotency is not free:
- Track operation IDs
- Don't repeat the side effect
- Use the operation ID to detect duplicates server-side
For pure response generation (no side effect), retry is safe.
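A sketch of server-side duplicate detection keyed on a client-supplied operation ID. The in-memory dict is illustrative; a production system would use a shared store with a TTL.

```python
# operation_id -> stored result (use Redis or a database with a TTL in production)
_completed: dict[str, object] = {}

def execute_once(operation_id: str, side_effect):
    """Run the side effect at most once per operation ID; retries get the stored result."""
    if operation_id in _completed:
        return _completed[operation_id]  # duplicate retry: do not repeat the side effect
    result = side_effect()
    _completed[operation_id] = result
    return result
```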
Hedged Requests
For latency-sensitive workloads, send the same request to two providers, use whichever responds first, and cancel the other when the first response arrives.
- Cost: 2x request cost
- Benefit: latency = min of two; reduces tail latency
Used for premium-tier workloads where p99 latency matters more than cost.
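An asyncio sketch: fire the same request at two providers, take the first result, cancel the loser. `call_a` and `call_b` stand in for your provider coroutines.

```python
import asyncio

async def hedged(call_a, call_b):
    """Send to both providers; use whichever responds first and cancel the other."""
    tasks = [asyncio.ensure_future(call_a()), asyncio.ensure_future(call_b())]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # abandon the slower request (you are still billed for both)
    # If the fastest task failed, this re-raises; a fuller version would
    # fall back to awaiting the other task instead.
    return done.pop().result()
```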
Timeouts
Set per-request timeouts at three levels:
- Total request timeout
- Streaming idle timeout (no token in N seconds)
- Connection timeout
Without timeouts, hung connections accumulate.
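As one concrete example, here is how the three timeouts might map onto httpx with an outer deadline; the values are illustrative and the endpoint is assumed to stream its response.

```python
import asyncio
import httpx

# Connection setup and per-read timeouts. During streaming, read=10.0 acts as
# the idle timeout: fail if no bytes arrive for 10 seconds mid-stream.
TIMEOUT = httpx.Timeout(connect=5.0, read=10.0, write=5.0, pool=5.0)

async def generate(client: httpx.AsyncClient, url: str, payload: dict) -> str:
    async with client.stream("POST", url, json=payload, timeout=TIMEOUT) as resp:
        return "".join([chunk async for chunk in resp.aiter_text()])

async def generate_with_deadline(url: str, payload: dict) -> str:
    async with httpx.AsyncClient() as client:
        # Total end-to-end budget, independent of per-chunk progress.
        return await asyncio.wait_for(generate(client, url, payload), timeout=60.0)
```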
Bulkheads
Isolate failure domains:
- One tenant's high load does not consume all gateway capacity
- One model's outage does not affect others
- One feature's bug does not crash unrelated features
Per-tenant pools, per-model pools, per-feature instances.
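A bulkhead can be as simple as a bounded semaphore per tenant (the same shape works per model or per feature); the concurrency limit here is illustrative.

```python
import asyncio
from collections import defaultdict

MAX_CONCURRENT_PER_TENANT = 10  # illustrative limit

_pools: defaultdict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(MAX_CONCURRENT_PER_TENANT)
)

async def with_bulkhead(tenant_id: str, make_call):
    """Run the call inside the tenant's pool; a burst from one tenant queues
    behind its own limit instead of starving everyone else."""
    async with _pools[tenant_id]:
        return await make_call()
```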
Graceful Degradation
When all else fails:
- Static cached responses for common queries
- Queue requests for later
- Inform user with helpful message
- Log for review
The user sees something useful, not a 500 error.
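A sketch of that last-resort handler, assuming a dict-like `cache` of common-query responses and a list-like `queue` for deferred work.

```python
import logging

def last_resort(query: str, cache, queue) -> str:
    """All providers are down: return something useful and keep a record."""
    logging.error("all providers failed", extra={"query": query})  # log for review
    cached = cache.get(query)
    if cached is not None:
        return cached  # static cached response for a common query
    queue.append(query)  # queue the request for later processing
    return "We're having trouble right now. Your request is saved and we'll follow up."
```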
Observability for Reliability
For each request:
- Provider used
- Whether retries occurred
- Whether fallbacks engaged
- End-to-end success
- Total latency
Without these, debugging reliability is guesswork.
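A minimal shape for that per-request record, emitted as one structured log line (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    """Per-request reliability telemetry; emit one of these per request."""
    provider: str            # which provider actually served the request
    retries: int = 0         # how many retries occurred
    fallback_tier: int = 0   # 0 = primary, 1 = secondary, 2 = cache, 3 = static
    success: bool = False    # end-to-end success
    latency_ms: float = 0.0  # total latency, including retries and fallbacks
```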
A Production Reliability Stack
flowchart LR
Req[Request] --> Time[Timeout]
Time --> Circuit[Circuit breaker]
Circuit --> Gate[Gateway]
Gate --> Hedge[Hedged?]
Hedge --> P1[Primary provider]
Hedge --> P2[Secondary]
P1 --> Retry[Retry on transient]
P2 --> Retry
Retry --> Fallback[Fallback chain]
Layered. Each layer is testable in isolation, and the failure of any one layer does not bring down the whole system.
What CallSphere Implements
For voice agents:
- Per-provider circuit breakers
- Hedged requests for latency-critical tool calls
- Multi-provider failover at gateway
- Cached recent responses as last resort
- Static "we're experiencing issues" message as final fallback
Reliability target: 99.9 percent perceived uptime, even when any single provider delivers only 99.5 percent.