
Prompt Caching Strategies: Reducing Latency and Cost with Cached Prefixes

Learn how to leverage prompt caching features from OpenAI and Anthropic to dramatically reduce latency and cost by reusing cached prompt prefixes across requests.

The Hidden Cost of Repeated Prefixes

In production LLM applications, the same text gets sent to the model thousands of times per day. Your system prompt, few-shot examples, tool definitions, and retrieval context templates are largely identical across requests. Every time you send this prefix, the model processes it from scratch — computing attention over the same tokens it processed moments ago.

Prompt caching eliminates this redundancy. Both OpenAI and Anthropic now offer server-side caching where the model stores the computed key-value (KV) cache for prompt prefixes. When a subsequent request shares the same prefix, the model skips recomputation and starts generating immediately.

The impact is substantial: OpenAI's prompt caching offers 50 percent cost reduction on cached tokens and up to 80 percent latency reduction. Anthropic's caching charges a small write fee for the first request but then offers 90 percent savings on cached reads.
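The arithmetic is worth seeing concretely. A back-of-envelope sketch (the price and token counts are hypothetical; the 50 percent cached-token discount matches OpenAI's model):

```python
def request_cost(prefix_tokens: int, suffix_tokens: int,
                 price_per_mtok: float, cached_discount: float,
                 cache_hit: bool) -> float:
    """Input cost in dollars for one request; the prefix is discounted on a hit."""
    prefix_rate = price_per_mtok * (1 - cached_discount) if cache_hit else price_per_mtok
    return (prefix_tokens * prefix_rate + suffix_tokens * price_per_mtok) / 1_000_000

# Hypothetical: $2.50 per 1M input tokens, 3,000-token static prefix,
# 200-token dynamic suffix, 50% discount on cached tokens
miss = request_cost(3000, 200, 2.50, 0.5, cache_hit=False)
hit = request_cost(3000, 200, 2.50, 0.5, cache_hit=True)
print(f"miss: ${miss:.6f}  hit: ${hit:.6f}  saved: {1 - hit / miss:.0%}")
```

Because the prefix dominates the prompt, a 50 percent discount on it translates to nearly a 50 percent cut in input cost for every request after the first.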

How OpenAI Prompt Caching Works

OpenAI's prompt caching is automatic for supported models. When a request shares a prefix of at least 1024 tokens with a recent request, the cached portion is served at half price:

import openai

client = openai.OpenAI()

# This long system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a financial analysis assistant for Acme Corp.

## Company Context
Acme Corp is a mid-cap technology company with the following key metrics:
- Revenue: $2.4B (FY 2025)
- Operating margin: 18.3%
- Employee count: 12,400
...
(imagine 2000+ tokens of company context, policies, and instructions)

## Analysis Guidelines
1. Always cite specific numbers from the provided data
2. Compare metrics to industry benchmarks
3. Flag any year-over-year changes exceeding 15%
4. Present findings in order of business impact
"""

def analyze_financial_data(user_query: str) -> dict:
    """Query with cached system prompt prefix."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    usage = response.usage
    # prompt_tokens_details is an object, not a dict, so use getattr
    details = getattr(usage, "prompt_tokens_details", None)
    return {
        "answer": response.choices[0].message.content,
        "cached_tokens": getattr(details, "cached_tokens", 0) or 0,
        "total_prompt_tokens": usage.prompt_tokens,
    }

# First call: full processing (no cache)
result1 = analyze_financial_data("What is the revenue trend?")
print(f"Cached: {result1['cached_tokens']} / {result1['total_prompt_tokens']}")

# Subsequent calls: prefix is cached
result2 = analyze_financial_data("Analyze operating margins.")
print(f"Cached: {result2['cached_tokens']} / {result2['total_prompt_tokens']}")

Designing Cache-Friendly Prompts

The critical insight is that caching works on prefixes — the matching starts from the first token. Any change at the beginning of the prompt invalidates the entire cache. This means you should structure your prompts with static content first and dynamic content last:


def build_cache_friendly_prompt(
    static_instructions: str,
    static_examples: list[str],
    dynamic_context: str,
    user_query: str,
) -> list[dict]:
    """Structure prompt for maximum cache reuse."""
    # Static prefix — identical across requests, cached
    system_content = (
        f"{static_instructions}\n\n"
        "## Examples\n\n"
        + "\n\n".join(static_examples)
    )

    # Dynamic content — changes per request, not cached
    user_content = (
        f"## Context\n\n{dynamic_context}\n\n"
        f"## Question\n\n{user_query}"
    )

    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]

# Anti-pattern: dynamic content in system prompt breaks cache
def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    """This breaks caching because user_id changes per request."""
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

# Better: move dynamic content after the static prefix
def good_prompt_design(user_id: str, query: str) -> list[dict]:
    """Static prefix stays cacheable, dynamic content is appended."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]

Anthropic's Explicit Cache Control

Anthropic takes a different approach with explicit cache breakpoints. You mark exactly where in the prompt the cache should apply:

import anthropic

anthropic_client = anthropic.Anthropic()

def cached_anthropic_query(
    static_context: str,
    user_query: str,
) -> dict:
    """Use Anthropic's explicit cache control."""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": user_query},
        ],
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "cache_read_tokens": getattr(
            response.usage, "cache_read_input_tokens", 0
        ),
        "cache_write_tokens": getattr(
            response.usage, "cache_creation_input_tokens", 0
        ),
    }
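The write fee changes the break-even math slightly. A sketch of when caching pays off, assuming a 1.25x multiplier on cache writes and a 0.1x multiplier on cache reads (these multipliers are my assumption of typical pricing, consistent with the 90 percent read savings mentioned above):

```python
def anthropic_cache_cost(prefix_tokens: int, num_requests: int,
                         write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total prefix cost in base-price token units: one cache write, then reads."""
    if num_requests == 0:
        return 0.0
    return prefix_tokens * (write_mult + (num_requests - 1) * read_mult)

def uncached_cost(prefix_tokens: int, num_requests: int) -> float:
    """Same prefix reprocessed at full price on every request."""
    return float(prefix_tokens * num_requests)

# With these multipliers, caching costs more for a single request
# but wins from the second request onward:
for n in (1, 2, 5):
    print(n, anthropic_cache_cost(2000, n), uncached_cost(2000, n))
```

In other words, the write premium is recovered as soon as the prefix is read from cache once, so any prompt reused more than once comes out ahead.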

Measuring Cache Effectiveness

Track your cache hit rate to validate that your prompt design is actually benefiting from caching:

class CacheMetrics:
    """Track prompt caching effectiveness over time."""

    def __init__(self):
        self.total_requests = 0
        self.total_prompt_tokens = 0
        self.total_cached_tokens = 0

    def record(self, prompt_tokens: int, cached_tokens: int):
        self.total_requests += 1
        self.total_prompt_tokens += prompt_tokens
        self.total_cached_tokens += cached_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.total_prompt_tokens == 0:
            return 0.0
        return self.total_cached_tokens / self.total_prompt_tokens

    @property
    def estimated_savings(self) -> float:
        """Token-equivalent savings, assuming cached tokens bill at half price."""
        return self.total_cached_tokens * 0.5

    def report(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            "total_tokens_cached": self.total_cached_tokens,
            "estimated_token_savings": self.estimated_savings,
        }
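A quick sanity check of the math the class implements, using hypothetical token counts for three calls that share a roughly 2,000-token prefix:

```python
# (prompt_tokens, cached_tokens) per request — hypothetical values.
# Inline tally so the example is self-contained:
records = [
    (2200, 0),     # first call: cache miss, prefix gets written
    (2150, 2048),  # prefix now served from cache
    (2300, 2048),
]
total_prompt = sum(p for p, _ in records)
total_cached = sum(c for _, c in records)
hit_rate = total_cached / total_prompt
print(f"cache hit rate: {hit_rate:.1%}")  # about 62% across three calls
```

Note that the first request always registers zero cached tokens, so the hit rate climbs toward its ceiling as request volume grows.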

A well-designed caching strategy achieves 60 to 80 percent cache hit rates on the prompt prefix. If your hit rate is below 40 percent, audit your prompt construction to find dynamic content that is breaking the prefix match.

FAQ

How long do cached prefixes persist?

OpenAI caches persist for 5 to 10 minutes of inactivity. Anthropic's ephemeral caches persist for roughly 5 minutes. Neither provider guarantees cache persistence — your application should work correctly whether the cache hits or misses. Design for caching but do not depend on it for correctness.
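Because neither cache survives long idle periods, some teams keep a hot prefix warm with a cheap periodic request during low-traffic windows. A minimal sketch — the interval and the idea of a lightweight no-op query are assumptions, not provider guidance:

```python
import threading
import time

def keep_cache_warm(send_request, interval_s: float = 240.0) -> threading.Thread:
    """Periodically re-issue a minimal request that reuses the full static
    prefix, so the provider-side cache TTL keeps resetting.

    send_request: zero-arg callable, e.g. a one-word user query appended
    to the cached system prompt.
    """
    def loop():
        while True:
            time.sleep(interval_s)
            try:
                send_request()
            except Exception:
                pass  # warming is best-effort; never let it crash the app

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Weigh the warming requests' cost against the cache-miss cost they avoid; for low-traffic applications it is often cheaper to simply accept the occasional miss.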

What is the minimum prefix length for caching?

OpenAI requires at least 1024 tokens in the matching prefix. Anthropic requires at least 1024 tokens for the content marked with cache control. Short system prompts do not benefit from caching. If your system prompt is under 1024 tokens, consider prepending static context like tool definitions or few-shot examples to reach the threshold.
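To check whether a prefix clears the threshold before shipping it, count its tokens. A sketch using a rough four-characters-per-token heuristic (for exact counts, use the model's real tokenizer, e.g. tiktoken):

```python
CACHE_MIN_TOKENS = 1024  # minimum cacheable prefix length

def estimate_tokens(text: str) -> int:
    """Very rough estimate: English prose averages ~4 characters per token."""
    return max(1, len(text) // 4)

def is_cacheable(prefix: str, min_tokens: int = CACHE_MIN_TOKENS) -> bool:
    """True if the static prefix is long enough to be eligible for caching."""
    return estimate_tokens(prefix) >= min_tokens

short_prompt = "You are a helpful assistant."
print(is_cacheable(short_prompt))  # False — pad with static context first
```

Run this against your assembled static prefix in CI so a refactor that shortens the prompt below the threshold is caught before it silently disables caching.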

Can I cache tool definitions and function schemas?

Yes, and this is one of the highest-value caching targets. Tool schemas are identical across requests and can be very long — 20 tools with detailed schemas easily exceed 2000 tokens. Place tool definitions in the system prompt before any dynamic content to maximize cache reuse.
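With Anthropic's explicit breakpoints, tool definitions can be pulled into the cached prefix by marking the last tool in the array. A sketch that builds the request kwargs without sending them (the tool schema and helper name are hypothetical):

```python
def cached_tools_request_kwargs(tools: list[dict], user_query: str) -> dict:
    """Build messages.create kwargs with tool definitions in the cached prefix."""
    tools = [dict(t) for t in tools]
    # Anthropic caches everything up to the breakpoint, so marking the
    # last tool places the entire tools array inside the cached prefix.
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "messages": [{"role": "user", "content": user_query}],
    }

example_tools = [
    {
        "name": "get_financial_metric",
        "description": "Look up a metric (revenue, margin, headcount) by fiscal year.",
        "input_schema": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "fiscal_year": {"type": "integer"},
            },
            "required": ["metric", "fiscal_year"],
        },
    },
]

# Usage: anthropic.Anthropic().messages.create(
#     **cached_tools_request_kwargs(example_tools, "What was FY2025 revenue?"))
```

Keeping the breakpoint on the tools rather than the system text means edits to downstream instructions do not invalidate the (usually much larger) cached tool schemas.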


#PromptEngineering #Caching #CostOptimization #Latency #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

