---
title: "FastAPI Middleware for AI Agents: Logging, Auth, and Rate Limiting"
description: "Build a production middleware stack for AI agent APIs in FastAPI. Covers structured request logging, Bearer token authentication, sliding window rate limiting, and CORS configuration for agent frontends."
canonical: https://callsphere.ai/blog/fastapi-middleware-ai-agents-logging-auth-rate-limiting
category: "Learn Agentic AI"
tags: ["FastAPI", "Middleware", "Authentication", "Rate Limiting", "AI Agents"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T17:44:27.348Z
---

# FastAPI Middleware for AI Agents: Logging, Auth, and Rate Limiting

> Build a production middleware stack for AI agent APIs in FastAPI. Covers structured request logging, Bearer token authentication, sliding window rate limiting, and CORS configuration for agent frontends.

## The Middleware Stack for AI Agent APIs

Middleware sits between the incoming HTTP request and your endpoint handler. For AI agent backends, a proper middleware stack handles cross-cutting concerns: logging every request for debugging, authenticating callers before they reach agent endpoints, rate limiting to prevent LLM cost overruns, and adding CORS headers for browser-based agent frontends.

FastAPI middleware wraps your endpoint like layers of an onion and executes in reverse order of registration: the last middleware added is the outermost layer, meaning it sees the request first and the response last.
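The layering is easier to see with plain functions. This sketch involves no FastAPI at all; it just mimics how each registered middleware wraps whatever app has been built so far:

```python
# A plain-Python sketch of the onion model: each "middleware" wraps
# the app it receives, the same way add_middleware does.

def handler(request):
    return f"handled {request}"

def make_logging(inner):
    def wrapped(request):
        return f"log({inner(request)})"
    return wrapped

def make_auth(inner):
    def wrapped(request):
        return f"auth({inner(request)})"
    return wrapped

# Registration order: logging first, auth second.
app = handler
for middleware in (make_logging, make_auth):
    app = middleware(app)

# The last middleware registered (auth) is the outermost layer.
print(app("ping"))  # auth(log(handled ping))
```

Because auth was registered last, it runs first on the way in and last on the way out.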

## Structured Request Logging

Every AI agent request should be logged with enough context to debug issues in production. This middleware captures timing, status codes, and request metadata:

```mermaid
flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway
auth plus rate limit"]
    APP["FastAPI app
handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer
business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
```

```python
import logging
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("agent_api")

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    request.state.request_id = request_id

    start_time = time.monotonic()

    # Log request
    logger.info(
        "request_started",
        extra={
            "request_id": request_id,
            "method": request.method,
            "path": request.url.path,
            "client_ip": request.client.host,
        },
    )

    try:
        response = await call_next(request)
        duration_ms = (time.monotonic() - start_time) * 1000

        logger.info(
            "request_completed",
            extra={
                "request_id": request_id,
                "status_code": response.status_code,
                "duration_ms": round(duration_ms, 2),
                "path": request.url.path,
            },
        )

        response.headers["X-Request-ID"] = request_id
        response.headers["X-Response-Time"] = f"{duration_ms:.0f}ms"
        return response

    except Exception as e:
        duration_ms = (time.monotonic() - start_time) * 1000
        logger.error(
            "request_failed",
            extra={
                "request_id": request_id,
                "error": str(e),
                "duration_ms": round(duration_ms, 2),
            },
        )
        raise
```

The `X-Request-ID` header lets clients and support teams correlate frontend errors with backend logs.
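The `extra` fields only reach your log output if the formatter emits them. A minimal stdlib-only JSON formatter is one option (the field names match the middleware above; `python-json-logger` is a common off-the-shelf alternative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Serialize the log message plus any extra= fields as one JSON line."""

    # Attributes every LogRecord has; anything else came from extra=
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload)

json_handler = logging.StreamHandler()
json_handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent_api")
logger.addHandler(json_handler)
logger.setLevel(logging.INFO)
```

With this in place, `logger.info("request_started", extra={...})` produces one JSON object per line, which log aggregators can index by `request_id`.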

## Token-Based Authentication Middleware

AI agent APIs should authenticate every request. This middleware validates Bearer tokens and attaches user context to the request:

```python
from fastapi import Request
from fastapi.responses import JSONResponse
import jwt  # PyJWT

# Paths that never require authentication
SKIP_AUTH_PATHS = {"/health", "/docs", "/openapi.json"}

@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": "Missing or invalid auth token"},
        )

    token = auth_header.split(" ", 1)[1]

    try:
        payload = jwt.decode(
            token,
            settings.jwt_secret,  # loaded from your app's config object
            algorithms=["HS256"],
        )
        request.state.user_id = payload["sub"]
        request.state.user_tier = payload.get("tier", "free")
    except jwt.ExpiredSignatureError:
        return JSONResponse(
            status_code=401,
            content={"error": "Token expired"},
        )
    except jwt.InvalidTokenError:
        return JSONResponse(
            status_code=401,
            content={"error": "Invalid token"},
        )

    return await call_next(request)
```

Notice this returns `JSONResponse` instead of raising `HTTPException`. FastAPI's exception handlers run inside the routing layer, so an `HTTPException` raised from middleware never reaches them and surfaces as an unhandled 500. Returning a response directly is safer.

## Sliding Window Rate Limiting

AI agent APIs are expensive because every request triggers LLM calls. Rate limiting prevents abuse and cost overruns. This implementation uses Redis for a sliding window algorithm:

```python
import time

import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379/2")

RATE_LIMITS = {
    "free": {"requests": 20, "window_seconds": 3600},
    "pro": {"requests": 200, "window_seconds": 3600},
    "enterprise": {"requests": 2000, "window_seconds": 3600},
}

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    user_id = getattr(request.state, "user_id", "anonymous")
    user_tier = getattr(request.state, "user_tier", "free")
    limits = RATE_LIMITS[user_tier]

    key = f"ratelimit:{user_id}"
    now = time.time()
    window_start = now - limits["window_seconds"]

    pipe = redis_client.pipeline()
    # Remove old entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining entries
    pipe.zcard(key)
    # Add current request. Note: rejected requests also record an entry,
    # so clients that keep retrying stay throttled.
    pipe.zadd(key, {str(now): now})
    # Set expiry on the key
    pipe.expire(key, limits["window_seconds"])
    results = await pipe.execute()

    request_count = results[1]

    if request_count >= limits["requests"]:
        retry_after = int(limits["window_seconds"])
        return JSONResponse(
            status_code=429,
            content={
                "error": "Rate limit exceeded",
                "limit": limits["requests"],
                "window": f"{limits['window_seconds']}s",
                "retry_after": retry_after,
            },
            headers={"Retry-After": str(retry_after)},
        )

    response = await call_next(request)
    remaining = limits["requests"] - request_count - 1
    response.headers["X-RateLimit-Limit"] = str(limits["requests"])
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    return response
```

The Redis sorted set tracks each request timestamp. On each new request, old entries outside the window are pruned, the current count is checked, and the new request is added. This gives an accurate sliding window rather than a fixed window that resets.
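The same prune/count/add algorithm is easy to prototype without Redis. This in-memory sketch mirrors the logic for a single process (it loses the atomicity and shared state that Redis provides, so it is for illustration and testing only):

```python
from __future__ import annotations

import time
from collections import defaultdict

class SlidingWindowLimiter:
    """In-memory sliding window: one timestamp list per key."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, list[float]] = defaultdict(list)

    def allow(self, key: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        window_start = now - self.window_seconds
        # Prune entries that have slid out of the window
        hits = [t for t in self._hits[key] if t > window_start]
        if len(hits) >= self.max_requests:
            self._hits[key] = hits
            return False
        hits.append(now)
        self._hits[key] = hits
        return True

limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("user-1", now=t) for t in (0, 1, 2, 3)]
print(results)  # [True, True, True, False]
# The oldest hit (t=0) slides out of the window, freeing a slot
print(limiter.allow("user-1", now=61))  # True
```

Passing `now` explicitly makes the window behavior trivial to unit test without sleeping.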

## CORS Configuration

Browser-based agent frontends need proper CORS headers:

```python
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://app.yourdomain.com",
        "http://localhost:3000",
    ],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=[
        "X-Request-ID",
        "X-RateLimit-Remaining",
    ],
)
```

Add CORS middleware last so it is the outermost layer and properly handles preflight OPTIONS requests before any other middleware runs.

## FAQ

### What is the correct order for middleware in a FastAPI AI agent API?

Add middleware in this order: CORS (outermost, handles preflight), logging (captures all requests including rejected ones), authentication (rejects unauthenticated requests early), rate limiting (checks limits for authenticated users). Since FastAPI middleware wraps in reverse order of addition, add CORS last in your code so it executes first. This ensures OPTIONS preflight requests get CORS headers without triggering auth or rate limiting.

### Should I use middleware or Dependencies for authentication?

Middleware is better when every endpoint needs authentication because it runs automatically without any per-endpoint configuration. Dependencies are better when only some endpoints need auth, or when different endpoints need different auth levels. A common pattern is using middleware for basic token validation and a dependency for fine-grained permission checks on specific endpoints.
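The fine-grained check itself can stay a plain function, which you would then wrap in a FastAPI dependency via `Depends` and feed from `request.state.user_tier`. The tier-to-capability mapping below is a hypothetical example, not part of any framework:

```python
# Hypothetical tier-based permission table for agent capabilities.
TIER_PERMISSIONS = {
    "free": {"chat"},
    "pro": {"chat", "tools"},
    "enterprise": {"chat", "tools", "batch"},
}

def check_permission(user_tier: str, capability: str) -> bool:
    """True if the given tier may use the given agent capability."""
    return capability in TIER_PERMISSIONS.get(user_tier, set())

print(check_permission("pro", "batch"))  # False
```

Keeping the logic framework-free means the same function serves both the dependency and your unit tests.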

### How do I handle rate limiting for streaming endpoints?

Count the initial request, not individual streamed chunks. A streaming response that sends 500 tokens is still one API request from a rate limiting perspective. However, you may want to track token usage separately for billing purposes. Use the logging middleware to record total tokens consumed per request and apply token-based quotas as a separate check from request-count rate limiting.
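One way to sketch that separate token quota (class and method names here are hypothetical; it assumes your streaming handler can report tokens consumed after the stream finishes):

```python
from __future__ import annotations

import time
from collections import defaultdict

class TokenQuota:
    """Track LLM tokens consumed per user over a rolling window."""

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        # (timestamp, tokens) pairs per user
        self._usage: dict[str, list[tuple[float, int]]] = defaultdict(list)

    def record(self, user_id: str, tokens: int, now: float | None = None) -> None:
        now = time.time() if now is None else now
        self._usage[user_id].append((now, tokens))

    def over_quota(self, user_id: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        window_start = now - self.window_seconds
        recent = [(t, n) for t, n in self._usage[user_id] if t > window_start]
        self._usage[user_id] = recent
        return sum(n for _, n in recent) >= self.max_tokens

quota = TokenQuota(max_tokens=1000, window_seconds=3600)
quota.record("user-1", 400, now=0)
quota.record("user-1", 700, now=10)
print(quota.over_quota("user-1", now=20))  # True
```

Because tokens are only known after the response completes, this check naturally gates the *next* request rather than the current one.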

---

#FastAPI #Middleware #Authentication #RateLimiting #AIAgents #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/fastapi-middleware-ai-agents-logging-auth-rate-limiting
