
LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks

Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments.

Why LLM Applications Need a Specialized Gateway

Standard API gateways handle authentication, rate limiting, and routing for traditional APIs. LLM APIs have additional requirements that standard gateways do not address:

  • Token-based billing: Costs scale with input/output tokens, not request count
  • Variable latency: Streaming responses can take 5-30 seconds
  • Multi-provider routing: Most production systems use multiple LLM providers (OpenAI, Anthropic, Google) for redundancy and cost optimization
  • Semantic-aware caching: Semantically equivalent queries should hit the cache even when they are worded differently
  • Content safety: Inputs and outputs may need content filtering before reaching the LLM or the user

An LLM API gateway sits between your application and LLM providers, handling these concerns in a single layer.

Core Pattern 1: Token-Aware Rate Limiting

Standard rate limiters count requests. LLM rate limiters need to count tokens, because a single request with a 100K-token context can cost 100x more than a short query.

import time

from redis.asyncio import Redis


class TokenAwareRateLimiter:
    def __init__(self, redis: Redis, tenant_limits: dict[str, int] | None = None):
        self.redis = redis
        self.tenant_limits = tenant_limits or {}

    def current_window(self) -> int:
        # 1-minute fixed windows
        return int(time.time() // 60)

    async def check_and_consume(
        self, tenant_id: str, estimated_tokens: int
    ) -> bool:
        key = f"ratelimit:{tenant_id}:{self.current_window()}"
        current = await self.redis.get(key)

        # Note: read-then-increment is approximate under concurrency;
        # a Lua script would make the check-and-consume atomic
        if current and int(current) + estimated_tokens > self.get_limit(tenant_id):
            return False  # Rate limited

        pipe = self.redis.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 60)  # 1-minute window
        await pipe.execute()
        return True

    def get_limit(self, tenant_id: str) -> int:
        # Per-tenant token limits
        return self.tenant_limits.get(tenant_id, 100_000)  # Default 100K tokens/min

Cost Budgets

Beyond rate limiting, implement cost budgets that track spending per tenant, team, or project. Alert when spending approaches the budget and hard-stop when it is exceeded.
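The budget logic above can be sketched as a small in-memory tracker. This is illustrative only: a real gateway would persist spend in Redis or a database, and names like `BudgetTracker` and the 80% alert threshold are assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class CostBudget:
    limit_usd: float
    alert_threshold: float = 0.8  # warn at 80% of budget
    spent_usd: float = 0.0


class BudgetTracker:
    def __init__(self):
        self.budgets: dict[str, CostBudget] = {}

    def record_spend(self, tenant_id: str, cost_usd: float) -> str:
        budget = self.budgets[tenant_id]
        budget.spent_usd += cost_usd
        if budget.spent_usd >= budget.limit_usd:
            return "blocked"  # hard-stop: reject further requests
        if budget.spent_usd >= budget.limit_usd * budget.alert_threshold:
            return "alert"    # notify owners, but keep serving
        return "ok"
```

The same check can be scoped per team or project by keying the budgets dict on a composite identifier instead of the tenant alone.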

Core Pattern 2: Semantic Caching Layer

Cache responses for semantically similar queries to reduce costs and latency.

import time
from dataclasses import dataclass


@dataclass
class CacheResult:
    response: str
    cache_hit: bool


class SemanticCacheLayer:
    def __init__(self, vector_store, embedder, ttl_seconds: int = 3600):
        self.vector_store = vector_store
        self.embedder = embedder  # any model that maps text -> vector
        self.ttl = ttl_seconds

    def extract_cache_key(self, messages: list[dict]) -> str:
        # Key on the last user message; earlier turns are treated as context
        return next(m["content"] for m in reversed(messages) if m["role"] == "user")

    def is_expired(self, result) -> bool:
        return time.time() - result.metadata["timestamp"] > self.ttl

    async def embed(self, text: str) -> list[float]:
        return await self.embedder.embed(text)

    async def get(self, messages: list[dict], model: str) -> CacheResult | None:
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)

        # High similarity threshold: only near-duplicates count as hits
        results = await self.vector_store.search(
            embedding, threshold=0.97, filter={"model": model}
        )

        if results and not self.is_expired(results[0]):
            return CacheResult(
                response=results[0].metadata["response"],
                cache_hit=True,
            )
        return None

    async def set(self, messages: list[dict], model: str, response: str):
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)
        await self.vector_store.insert(
            embedding,
            metadata={"response": response, "model": model, "timestamp": time.time()},
        )

Important: Only cache deterministic, factual queries. Do not cache creative tasks, personalized responses, or time-sensitive queries.
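That rule can be enforced with a gate in front of the cache. The sketch below is one possible heuristic, assuming the gateway sees the raw request dict; the keyword list and the `user_context`/`needs_fresh_data` flags are illustrative, not part of any real API.

```python
CREATIVE_KEYWORDS = ("write a poem", "brainstorm", "imagine")  # illustrative only


def is_cacheable(request: dict) -> bool:
    """Heuristic gate: only deterministic, impersonal, non-fresh queries qualify."""
    # Nonzero sampling temperature means outputs are intentionally varied
    if request.get("temperature", 1.0) > 0.0:
        return False
    # Personalized or time-sensitive requests must bypass the cache
    if request.get("user_context") or request.get("needs_fresh_data"):
        return False
    last_user = next(
        (m["content"] for m in reversed(request["messages"]) if m["role"] == "user"),
        "",
    )
    return not any(kw in last_user.lower() for kw in CREATIVE_KEYWORDS)
```

The gateway calls this before `SemanticCacheLayer.get` and `set`; anything that fails the gate goes straight to the provider.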


Core Pattern 3: Provider Fallback and Load Balancing

When your primary LLM provider experiences outages or rate limits, automatically fall back to alternatives.

import random
from collections import defaultdict


class LLMProviderRouter:
    def __init__(self):
        self.providers = [
            ProviderConfig("anthropic", "claude-sonnet-4", priority=1, weight=0.6),
            ProviderConfig("openai", "gpt-4o", priority=1, weight=0.4),
            ProviderConfig("anthropic", "claude-haiku-4", priority=2, weight=1.0),  # Fallback
        ]
        # Key breakers by (provider, model) so two models from the same
        # provider trip independently
        self.circuit_breakers = {
            (p.name, p.model): CircuitBreaker() for p in self.providers
        }

    def group_by_priority(self) -> list[list]:
        groups: dict[int, list] = defaultdict(list)
        for p in self.providers:
            groups[p.priority].append(p)
        return [groups[k] for k in sorted(groups)]

    def weighted_select(self, providers: list):
        return random.choices(providers, weights=[p.weight for p in providers])[0]

    async def route(self, request: LLMRequest) -> LLMResponse:
        # Group by priority, try highest priority first
        for priority_group in self.group_by_priority():
            available = [
                p for p in priority_group
                if self.circuit_breakers[(p.name, p.model)].is_closed()
            ]
            if not available:
                continue

            # Weighted random selection within the priority group
            provider = self.weighted_select(available)
            try:
                response = await provider.complete(request)
                self.circuit_breakers[(provider.name, provider.model)].record_success()
                return response
            except (RateLimitError, TimeoutError, ServerError):
                self.circuit_breakers[(provider.name, provider.model)].record_failure()
                continue

        raise AllProvidersUnavailable()
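The router relies on a `CircuitBreaker` that was not shown. A minimal sketch follows; the failure threshold and cooldown values are illustrative defaults, and a production breaker would typically add a proper half-open trial state.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then allows
    requests again once `cooldown` seconds have passed."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def is_closed(self) -> bool:
        if self.opened_at is None:
            return True
        # Let traffic through again after the cooldown expires
        return time.time() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()
```

Because the breaker is per (provider, model), a rate-limited model stops receiving traffic while its sibling models stay in rotation.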

Core Pattern 4: Request/Response Transformation

Normalize requests and responses across providers so your application code does not need provider-specific logic.


The gateway translates between a unified internal format and each provider's API format:

  • Normalize message formats (OpenAI's messages array vs. Anthropic's format)
  • Map model names to provider-specific identifiers
  • Standardize tool/function calling formats
  • Normalize streaming event formats
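As an example of the first two points, here is a sketch that moves OpenAI-style system messages into Anthropic's top-level `system` field and remaps model names. The entries in `MODEL_MAP` are hypothetical placeholders, and real adapters also need to handle tool calls, images, and streaming.

```python
# Hypothetical mapping from gateway-level names to provider identifiers
MODEL_MAP = {"claude-sonnet-4": "claude-sonnet-4-20250514"}


def to_anthropic_format(openai_request: dict) -> dict:
    """Translate an OpenAI-style chat request into Anthropic's shape:
    system prompts move from the messages array to a top-level field."""
    system_parts = []
    messages = []
    for m in openai_request["messages"]:
        if m["role"] == "system":
            system_parts.append(m["content"])
        else:
            messages.append({"role": m["role"], "content": m["content"]})

    request = {
        "model": MODEL_MAP.get(openai_request["model"], openai_request["model"]),
        "messages": messages,
        # Anthropic requires max_tokens; pick a default when the caller omits it
        "max_tokens": openai_request.get("max_tokens", 1024),
    }
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    return request
```

The reverse direction (normalizing each provider's response back into one internal shape) follows the same pattern in the other direction.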

Core Pattern 5: Observability and Logging

Every request through the gateway should be logged with:

  • Request/response token counts
  • Cost calculation (based on model pricing)
  • Latency breakdown (queue time, TTFT, total)
  • Cache hit/miss status
  • Provider used (primary vs. fallback)
  • Content safety filter results

Structured Logging

{
  "trace_id": "abc-123",
  "tenant_id": "tenant-456",
  "model_requested": "claude-sonnet-4",
  "provider_used": "anthropic",
  "input_tokens": 1523,
  "output_tokens": 487,
  "cost_usd": 0.0061,
  "latency_ms": 2340,
  "ttft_ms": 890,
  "cache_hit": false,
  "fallback_used": false
}
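A record like the one above can be assembled at the end of each request. In the sketch below the per-million-token prices are purely illustrative (real provider pricing differs and changes over time), and `build_log_record` is a hypothetical helper, not a library API.

```python
import json

# Illustrative prices in USD per million tokens -- NOT real provider pricing
PRICING_USD_PER_MTOK = {"claude-sonnet-4": {"input": 3.00, "output": 15.00}}


def build_log_record(trace_id: str, tenant_id: str, model: str, provider: str,
                     input_tokens: int, output_tokens: int,
                     latency_ms: int, ttft_ms: int,
                     cache_hit: bool = False, fallback_used: bool = False) -> str:
    price = PRICING_USD_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return json.dumps({
        "trace_id": trace_id,
        "tenant_id": tenant_id,
        "model_requested": model,
        "provider_used": provider,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 4),
        "latency_ms": latency_ms,
        "ttft_ms": ttft_ms,
        "cache_hit": cache_hit,
        "fallback_used": fallback_used,
    })
```

Emitting one such line per request makes per-tenant cost dashboards and cache-hit-rate alerts a matter of querying the log store.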

Existing Solutions

Before building your own gateway, evaluate existing options:

  • LiteLLM: Open-source proxy supporting 100+ LLM providers with a unified OpenAI-compatible API
  • Portkey: Managed LLM gateway with built-in caching, fallbacks, and observability
  • Helicone: Observability-focused LLM proxy with cost tracking and prompt management

For most teams, starting with LiteLLM and adding custom middleware for your specific needs is the fastest path to production.


Written by

CallSphere Team
