---
title: "LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks"
description: "Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments."
canonical: https://callsphere.ai/blog/llm-api-gateway-design-rate-limiting-caching-fallbacks
category: "Technology"
tags: ["API Gateway", "LLM APIs", "Rate Limiting", "Caching", "System Design", "Backend Engineering"]
author: "CallSphere Team"
published: 2026-02-17T00:00:00.000Z
updated: 2026-05-06T01:02:41.205Z
---

# LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks

> Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments.

## Why LLM Applications Need a Specialized Gateway

Standard API gateways handle authentication, rate limiting, and routing for traditional APIs. LLM APIs have additional requirements that standard gateways do not address:

- **Token-based billing**: Costs scale with input/output tokens, not request count
- **Variable latency**: Streaming responses can take 5-30 seconds
- **Multi-provider routing**: Most production systems use multiple LLM providers (OpenAI, Anthropic, Google) for redundancy and cost optimization
- **Semantic-aware caching**: Semantically equivalent queries should be cacheable even when they are worded differently
- **Content safety**: Inputs and outputs may need content filtering before reaching the LLM or the user

An LLM API gateway sits between your application and LLM providers, handling these concerns in a single layer.

## Core Pattern 1: Token-Aware Rate Limiting

Standard rate limiters count requests. LLM rate limiters need to count tokens, because a single request with a 100K context window costs 100x more than a simple query.

```mermaid
flowchart LR
    CLIENT(["Client SDK"])
    GW["LLM gateway
auth plus token estimate"]
    RL["Token-aware
rate limiter"]
    REDIS[(Redis
per-tenant counters)]
    LLM["LLM provider"]
    CLIENT --> GW --> RL
    RL --> REDIS
    RL -->|allowed| LLM
    RL -->|rejected| CLIENT
    LLM --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style RL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style REDIS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
```

```python
import time

from redis.asyncio import Redis


class TokenAwareRateLimiter:
    def __init__(self, redis: Redis, tenant_limits: dict[str, int] | None = None):
        self.redis = redis
        self.tenant_limits = tenant_limits or {}

    def current_window(self) -> int:
        # Fixed 1-minute windows keyed by epoch minute
        return int(time.time() // 60)

    async def check_and_consume(
        self, tenant_id: str, estimated_tokens: int
    ) -> bool:
        key = f"ratelimit:{tenant_id}:{self.current_window()}"
        current = await self.redis.get(key)

        # Note: this check-then-increment is not atomic; under heavy
        # concurrency, move it into a Lua script for a strict guarantee
        if current and int(current) + estimated_tokens > self.get_limit(tenant_id):
            return False  # Rate limited

        pipe = self.redis.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 60)  # 1-minute window
        await pipe.execute()
        return True

    def get_limit(self, tenant_id: str) -> int:
        # Per-tenant token limits
        return self.tenant_limits.get(tenant_id, 100_000)  # Default 100K/min
```

### Cost Budgets

Beyond rate limiting, implement cost budgets that track spending per tenant, team, or project. Alert when spending approaches the budget and hard-stop when it is exceeded.
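A minimal in-memory sketch of such a budget tracker follows; the class name, thresholds, and return values are illustrative, not part of any gateway API:

```python
class CostBudget:
    """Track per-tenant spend against a budget (illustrative sketch)."""

    def __init__(self, budgets_usd: dict[str, float], alert_ratio: float = 0.8):
        self.budgets = budgets_usd
        self.alert_ratio = alert_ratio  # Alert when 80% of budget is spent
        self.spend: dict[str, float] = {}

    def record(self, tenant_id: str, cost_usd: float) -> str:
        """Record spend and return 'ok', 'alert', or 'blocked'."""
        self.spend[tenant_id] = self.spend.get(tenant_id, 0.0) + cost_usd
        budget = self.budgets.get(tenant_id, float("inf"))
        if self.spend[tenant_id] >= budget:
            return "blocked"  # Hard stop: reject further requests
        if self.spend[tenant_id] >= budget * self.alert_ratio:
            return "alert"  # Approaching budget: notify the tenant
        return "ok"
```

A production version would persist the counters in Redis or a database and reset them per billing period.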

## Core Pattern 2: Semantic Caching Layer

Cache responses for semantically similar queries to reduce costs and latency.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheResult:
    response: str
    cache_hit: bool


class SemanticCacheLayer:
    def __init__(self, vector_store, embedder, ttl_seconds: int = 3600):
        self.vector_store = vector_store
        self.embed = embedder  # Async fn: str -> embedding vector
        self.ttl = ttl_seconds

    def extract_cache_key(self, messages: list[dict]) -> str:
        # Key on the last user message; system prompts and history vary less
        return next(m["content"] for m in reversed(messages) if m["role"] == "user")

    def is_expired(self, result) -> bool:
        return time.time() - result.metadata["timestamp"] > self.ttl

    async def get(self, messages: list[dict], model: str) -> CacheResult | None:
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)

        # High similarity threshold: near-duplicates only, not loosely
        # related queries
        results = await self.vector_store.search(
            embedding, threshold=0.97, filter={"model": model}
        )

        if results and not self.is_expired(results[0]):
            return CacheResult(
                response=results[0].metadata["response"],
                cache_hit=True,
            )
        return None

    async def set(self, messages: list[dict], model: str, response: str):
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)
        await self.vector_store.insert(
            embedding,
            metadata={"response": response, "model": model, "timestamp": time.time()},
        )
```

**Important**: Only cache deterministic, factual queries. Do not cache creative tasks, personalized responses, or time-sensitive queries.
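One way to enforce that rule is a gate in front of the cache. The request fields checked below (`temperature`, a per-request `no_cache` flag) and the keyword list are assumptions about your request schema, not a standard:

```python
def should_cache(request: dict) -> bool:
    """Heuristic cacheability gate (illustrative; field names are assumptions)."""
    # Temperature above zero means outputs are intentionally varied
    if request.get("temperature", 1.0) > 0.0:
        return False
    # Respect explicit opt-outs, e.g. personalized endpoints
    if request.get("no_cache", False):
        return False
    # Skip prompts that reference the current moment
    last_user = next(
        (m["content"] for m in reversed(request["messages"]) if m["role"] == "user"),
        "",
    )
    time_words = ("today", "now", "latest", "current")
    return not any(w in last_user.lower() for w in time_words)
```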

## Core Pattern 3: Provider Fallback and Load Balancing

When your primary LLM provider experiences outages or rate limits, automatically fall back to alternatives.

```python
class LLMProviderRouter:
    def __init__(self):
        self.providers = [
            ProviderConfig("anthropic", "claude-sonnet-4", priority=1, weight=0.6),
            ProviderConfig("openai", "gpt-4o", priority=1, weight=0.4),
            ProviderConfig("anthropic", "claude-haiku-4", priority=2, weight=1.0),  # Fallback
        ]
        # Key breakers by (provider, model) so the two Anthropic entries
        # trip independently
        self.circuit_breakers = {
            (p.name, p.model): CircuitBreaker() for p in self.providers
        }

    async def route(self, request: LLMRequest) -> LLMResponse:
        # Group by priority, try highest priority first
        for priority_group in self.group_by_priority():
            available = [
                p for p in priority_group
                if self.circuit_breakers[(p.name, p.model)].is_closed()
            ]
            if not available:
                continue

            # Weighted random selection within priority group
            provider = self.weighted_select(available)
            try:
                response = await provider.complete(request)
                self.circuit_breakers[(provider.name, provider.model)].record_success()
                return response
            except (RateLimitError, TimeoutError, ServerError):
                self.circuit_breakers[(provider.name, provider.model)].record_failure()
                continue

        raise AllProvidersUnavailable()
```
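The `CircuitBreaker` referenced above can be as simple as a failure counter with a cooldown. This sketch (thresholds are arbitrary) omits the half-open probing state a full implementation would add:

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, recloses after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def is_closed(self) -> bool:
        if self.opened_at is None:
            return True
        # Cooldown elapsed: allow traffic again
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```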

## Core Pattern 4: Request/Response Transformation

Normalize requests and responses across providers so your application code does not need provider-specific logic.

The gateway translates between a unified internal format and each provider's API format:

- Normalize message formats (OpenAI's `messages` array vs. Anthropic's format)
- Map model names to provider-specific identifiers
- Standardize tool/function calling formats
- Normalize streaming event formats
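As a concrete sketch of the first two points: OpenAI's Chat Completions API takes the system prompt as a message with role `system` inside the `messages` array, while Anthropic's Messages API takes it as a top-level `system` field and requires `max_tokens`. The helpers below are illustrative, not a complete translation layer:

```python
def to_openai(system: str, messages: list[dict], model: str) -> dict:
    # System prompt travels inside the messages array
    return {
        "model": model,
        "messages": [{"role": "system", "content": system}] + messages,
    }


def to_anthropic(system: str, messages: list[dict], model: str,
                 max_tokens: int = 1024) -> dict:
    # System prompt is a top-level field; messages hold only
    # user/assistant turns, and max_tokens is required
    return {
        "model": model,
        "system": system,
        "max_tokens": max_tokens,
        "messages": messages,
    }
```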

## Core Pattern 5: Observability and Logging

Every request through the gateway should be logged with:

- Request/response token counts
- Cost calculation (based on model pricing)
- Latency breakdown (queue time, TTFT, total)
- Cache hit/miss status
- Provider used (primary vs. fallback)
- Content safety filter results

### Structured Logging

```json
{
  "trace_id": "abc-123",
  "tenant_id": "tenant-456",
  "model_requested": "claude-sonnet-4",
  "provider_used": "anthropic",
  "input_tokens": 1523,
  "output_tokens": 487,
  "cost_usd": 0.0061,
  "latency_ms": 2340,
  "ttft_ms": 890,
  "cache_hit": false,
  "fallback_used": false
}
```
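The `cost_usd` field above is derived from the token counts and a per-model price table. The prices in this sketch are placeholders, not actual provider pricing:

```python
# Hypothetical USD prices per million tokens; substitute current provider pricing
PRICING = {
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = PRICING[model]
    cost = (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000
    return round(cost, 6)
```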

## Existing Solutions

Before building your own gateway, evaluate existing options:

- **LiteLLM**: Open-source proxy supporting 100+ LLM providers with a unified OpenAI-compatible API
- **Portkey**: Managed LLM gateway with built-in caching, fallbacks, and observability
- **Helicone**: Observability-focused LLM proxy with cost tracking and prompt management

For most teams, starting with LiteLLM and adding custom middleware for your specific needs is the fastest path to production.

**Sources:**

- [https://docs.litellm.ai/docs/](https://docs.litellm.ai/docs/)
- [https://portkey.ai/docs](https://portkey.ai/docs)
- [https://www.helicone.ai/docs](https://www.helicone.ai/docs)

---

Source: https://callsphere.ai/blog/llm-api-gateway-design-rate-limiting-caching-fallbacks
