LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks
Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments.
Why LLM Applications Need a Specialized Gateway
Standard API gateways handle authentication, rate limiting, and routing for traditional APIs. LLM APIs have additional requirements that standard gateways do not address:
- Token-based billing: Costs scale with input/output tokens, not request count
- Variable latency: Streaming responses can take 5-30 seconds
- Multi-provider routing: Most production systems use multiple LLM providers (OpenAI, Anthropic, Google) for redundancy and cost optimization
- Semantic-aware caching: Equivalent queries should hit the same cache entry even when worded slightly differently
- Content safety: Inputs and outputs may need content filtering before reaching the LLM or the user
An LLM API gateway sits between your application and LLM providers, handling these concerns in a single layer.
Core Pattern 1: Token-Aware Rate Limiting
Standard rate limiters count requests. LLM rate limiters need to count tokens, because a single request with a 100K context window costs 100x more than a simple query.
```python
import time


class TokenAwareRateLimiter:
    # Assumes an async redis-py client (redis.asyncio.Redis)
    def __init__(self, redis, tenant_limits: dict[str, int] | None = None):
        self.redis = redis
        self.tenant_limits = tenant_limits or {}

    def current_window(self) -> int:
        # Fixed 1-minute windows, keyed by epoch minute
        return int(time.time() // 60)

    async def check_and_consume(
        self, tenant_id: str, estimated_tokens: int
    ) -> bool:
        key = f"ratelimit:{tenant_id}:{self.current_window()}"
        current = await self.redis.get(key)
        if current and int(current) + estimated_tokens > self.get_limit(tenant_id):
            return False  # Rate limited
        pipe = self.redis.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 60)  # Expire with the 1-minute window
        await pipe.execute()
        return True

    def get_limit(self, tenant_id: str) -> int:
        # Per-tenant token limits
        return self.tenant_limits.get(tenant_id, 100_000)  # Default 100K/min
```
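`check_and_consume` needs a token estimate before the provider reports actual usage. One pre-call sketch, where the roughly-4-characters-per-token heuristic and the `estimate_tokens` helper are illustrative assumptions, not part of the gateway above:

```python
def estimate_tokens(messages: list[dict], max_output_tokens: int) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Reserve the full output budget, since output is unknown up front.
    input_chars = sum(len(m.get("content", "")) for m in messages)
    return input_chars // 4 + max_output_tokens


messages = [{"role": "user", "content": "x" * 4000}]
print(estimate_tokens(messages, max_output_tokens=500))  # 1500
```

After the provider responds, reconcile the estimate against the actual usage reported in the response so the window reflects real consumption.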
Cost Budgets
Beyond rate limiting, implement cost budgets that track spending per tenant, team, or project. Alert when spending approaches the budget and hard-stop when it is exceeded.
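A minimal sketch of such a budget tracker; the class and field names here are illustrative, not from an existing library:

```python
from dataclasses import dataclass


@dataclass
class CostBudget:
    limit_usd: float
    alert_ratio: float = 0.8  # Warn at 80% of budget
    spent_usd: float = 0.0


class BudgetTracker:
    def __init__(self):
        self.budgets: dict[str, CostBudget] = {}

    def record(self, tenant_id: str, cost_usd: float) -> str:
        budget = self.budgets[tenant_id]
        budget.spent_usd += cost_usd
        if budget.spent_usd >= budget.limit_usd:
            return "blocked"  # Hard-stop: reject further requests
        if budget.spent_usd >= budget.limit_usd * budget.alert_ratio:
            return "alert"  # Approaching budget: notify owners
        return "ok"


tracker = BudgetTracker()
tracker.budgets["tenant-456"] = CostBudget(limit_usd=10.0)
print(tracker.record("tenant-456", 5.0))  # ok
print(tracker.record("tenant-456", 3.5))  # alert
print(tracker.record("tenant-456", 2.0))  # blocked
```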
Core Pattern 2: Semantic Caching Layer
Cache responses for semantically similar queries to reduce costs and latency.
```python
import time
from dataclasses import dataclass


@dataclass
class CacheResult:
    response: str
    cache_hit: bool


class SemanticCacheLayer:
    def __init__(self, vector_store, embedder, ttl_seconds: int = 3600):
        self.vector_store = vector_store
        self.embedder = embedder  # Async embedding client
        self.ttl = ttl_seconds

    def extract_cache_key(self, messages: list[dict]) -> str:
        # Key on the last user message; earlier turns vary too much
        return next(m["content"] for m in reversed(messages) if m["role"] == "user")

    def is_expired(self, result) -> bool:
        return time.time() - result.metadata["timestamp"] > self.ttl

    async def get(self, messages: list[dict], model: str) -> CacheResult | None:
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embedder.embed(cache_query)
        # High similarity threshold (0.97) to avoid serving near-miss answers
        results = await self.vector_store.search(
            embedding, threshold=0.97, filter={"model": model}
        )
        if results and not self.is_expired(results[0]):
            return CacheResult(
                response=results[0].metadata["response"],
                cache_hit=True,
            )
        return None

    async def set(self, messages: list[dict], model: str, response: str):
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embedder.embed(cache_query)
        await self.vector_store.insert(
            embedding,
            metadata={"response": response, "model": model, "timestamp": time.time()},
        )
```
Important: Only cache deterministic, factual queries. Do not cache creative tasks, personalized responses, or time-sensitive queries.
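One way to enforce this is a cacheability gate in front of the cache layer. A sketch, where the request field names (`temperature`, `user_id`, `tools`) are assumptions about the gateway's internal request shape:

```python
def is_cacheable(request: dict) -> bool:
    # Treat a missing temperature as sampled (non-deterministic) by default.
    if request.get("temperature", 1.0) > 0.0:
        return False  # Sampled output: not deterministic
    if request.get("user_id"):
        return False  # Personalized: risks leaking answers across users
    if request.get("tools"):
        return False  # Tool results are often time-sensitive
    return True


print(is_cacheable({"temperature": 0.0}))  # True
print(is_cacheable({"temperature": 0.7}))  # False
```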
Core Pattern 3: Provider Fallback and Load Balancing
When your primary LLM provider experiences outages or rate limits, automatically fall back to alternatives.
```python
import random
from itertools import groupby


class AllProvidersUnavailable(Exception):
    """Raised when every provider in every priority group is down."""


class LLMProviderRouter:
    # ProviderConfig, CircuitBreaker, and the provider error types
    # (RateLimitError, ServerError) are defined elsewhere in the gateway.
    def __init__(self):
        self.providers = [
            ProviderConfig("anthropic", "claude-sonnet-4", priority=1, weight=0.6),
            ProviderConfig("openai", "gpt-4o", priority=1, weight=0.4),
            ProviderConfig("anthropic", "claude-haiku-4", priority=2, weight=1.0),  # Fallback
        ]
        # Key breakers by (provider, model): two models from the same
        # provider must not share failure state.
        self.circuit_breakers = {
            (p.name, p.model): CircuitBreaker() for p in self.providers
        }

    def group_by_priority(self) -> list:
        # Lower priority number = tried first
        ordered = sorted(self.providers, key=lambda p: p.priority)
        return [list(g) for _, g in groupby(ordered, key=lambda p: p.priority)]

    def weighted_select(self, providers: list):
        return random.choices(providers, weights=[p.weight for p in providers])[0]

    async def route(self, request: "LLMRequest") -> "LLMResponse":
        # Try the highest-priority group first; fall through on failure
        for priority_group in self.group_by_priority():
            available = [
                p for p in priority_group
                if self.circuit_breakers[(p.name, p.model)].is_closed()
            ]
            if not available:
                continue
            # Weighted random selection within the priority group
            provider = self.weighted_select(available)
            try:
                response = await provider.complete(request)
                self.circuit_breakers[(provider.name, provider.model)].record_success()
                return response
            except (RateLimitError, TimeoutError, ServerError):
                self.circuit_breakers[(provider.name, provider.model)].record_failure()
                continue
        raise AllProvidersUnavailable()
```
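The router relies on a `CircuitBreaker` that is referenced but not defined above. A minimal consecutive-failure breaker with a cooldown-based half-open state might look like this (the threshold and cooldown values are illustrative):

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown_s`."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def is_closed(self) -> bool:
        if self.opened_at is None:
            return True
        # Allow a trial request once the cooldown has elapsed (half-open)
        return time.time() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()
```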
Core Pattern 4: Request/Response Transformation
Normalize requests and responses across providers so your application code does not need provider-specific logic.
The gateway translates between a unified internal format and each provider's API format:
- Normalize message formats (OpenAI's messages array vs. Anthropic's format)
- Map model names to provider-specific identifiers
- Standardize tool/function calling formats
- Normalize streaming event formats
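As one concrete example of message-format normalization: OpenAI passes the system prompt as a message with role "system", while Anthropic's Messages API takes it as a separate top-level `system` parameter. A sketch of that single translation (real adapters also need to handle tool calls, images, and streaming):

```python
def to_anthropic(openai_messages: list[dict]) -> dict:
    # Split the OpenAI-style system message out into Anthropic's
    # top-level `system` field; pass the rest through unchanged.
    system = ""
    messages = []
    for m in openai_messages:
        if m["role"] == "system":
            system = m["content"]
        else:
            messages.append({"role": m["role"], "content": m["content"]})
    return {"system": system, "messages": messages}


print(to_anthropic([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
]))
```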
Core Pattern 5: Observability and Logging
Every request through the gateway should be logged with:
- Request/response token counts
- Cost calculation (based on model pricing)
- Latency breakdown (queue time, TTFT, total)
- Cache hit/miss status
- Provider used (primary vs. fallback)
- Content safety filter results
Structured Logging
```json
{
  "trace_id": "abc-123",
  "tenant_id": "tenant-456",
  "model_requested": "claude-sonnet-4",
  "provider_used": "anthropic",
  "input_tokens": 1523,
  "output_tokens": 487,
  "cost_usd": 0.0061,
  "latency_ms": 2340,
  "ttft_ms": 890,
  "cache_hit": false,
  "fallback_used": false
}
```
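The `cost_usd` field comes from multiplying token counts by per-model rates. A sketch of that calculation; the per-million-token prices below are placeholders, so check each provider's current pricing page rather than relying on these figures:

```python
# Pricing in USD per million tokens -- placeholder values, not real rates.
PRICING = {
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


print(round(cost_usd("claude-sonnet-4", 1523, 487), 4))  # 0.0119
```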
Existing Solutions
Before building your own gateway, evaluate existing options:
- LiteLLM: Open-source proxy supporting 100+ LLM providers with a unified OpenAI-compatible API
- Portkey: Managed LLM gateway with built-in caching, fallbacks, and observability
- Helicone: Observability-focused LLM proxy with cost tracking and prompt management
For most teams, starting with LiteLLM and adding custom middleware for your specific needs is the fastest path to production.
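As a starting point, a minimal LiteLLM proxy config might look like the fragment below. The file schema and model identifiers should be verified against LiteLLM's current documentation before use:

```yaml
model_list:
  - model_name: claude-sonnet-4          # Name your application requests
    litellm_params:
      model: anthropic/claude-sonnet-4   # Provider-prefixed model id
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```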
Written by
CallSphere Team