
Prompt Caching Strategies: Reducing Latency and Cost with Cached Prefixes

Learn how to leverage prompt caching features from OpenAI and Anthropic to dramatically reduce latency and cost by reusing cached prompt prefixes across requests.

The Hidden Cost of Repeated Prefixes

In production LLM applications, the same text gets sent to the model thousands of times per day. Your system prompt, few-shot examples, tool definitions, and retrieval context templates are largely identical across requests. Every time you send this prefix, the model processes it from scratch — computing attention over the same tokens it processed moments ago.

Prompt caching eliminates this redundancy. Both OpenAI and Anthropic now offer server-side caching where the model stores the computed key-value (KV) cache for prompt prefixes. When a subsequent request shares the same prefix, the model skips recomputation and starts generating immediately.

The impact is substantial: OpenAI's prompt caching offers 50 percent cost reduction on cached tokens and up to 80 percent latency reduction. Anthropic's caching charges a small write fee for the first request but then offers 90 percent savings on cached reads.
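The arithmetic is worth seeing concretely. A back-of-envelope sketch (the price and token counts are hypothetical; the 50 percent cached-token discount matches OpenAI's model):

```python
def request_cost(prefix_tokens: int, suffix_tokens: int,
                 price_per_mtok: float, cached_discount: float,
                 cache_hit: bool) -> float:
    """Input cost in dollars for one request; the prefix is discounted on a hit."""
    prefix_rate = price_per_mtok * (1 - cached_discount) if cache_hit else price_per_mtok
    return (prefix_tokens * prefix_rate + suffix_tokens * price_per_mtok) / 1_000_000

# Hypothetical: $2.50 per 1M input tokens, 3,000-token static prefix,
# 200-token dynamic suffix, 50% discount on cached tokens
miss = request_cost(3000, 200, 2.50, 0.5, cache_hit=False)
hit = request_cost(3000, 200, 2.50, 0.5, cache_hit=True)
print(f"miss: ${miss:.6f}  hit: ${hit:.6f}  saved: {1 - hit / miss:.0%}")
```

Because the prefix dominates the prompt, a 50 percent discount on it translates to nearly a 50 percent cut in input cost for every request after the first.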

How OpenAI Prompt Caching Works

OpenAI's prompt caching is automatic for supported models. When a request shares a prefix of at least 1024 tokens with a recent request, the cached portion is served at half price:

import openai

client = openai.OpenAI()

# This long system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a financial analysis assistant for Acme Corp.

## Company Context
Acme Corp is a mid-cap technology company with the following key metrics:
- Revenue: $2.4B (FY 2025)
- Operating margin: 18.3%
- Employee count: 12,400
...
(imagine 2000+ tokens of company context, policies, and instructions)

## Analysis Guidelines
1. Always cite specific numbers from the provided data
2. Compare metrics to industry benchmarks
3. Flag any year-over-year changes exceeding 15%
4. Present findings in order of business impact
"""

def analyze_financial_data(user_query: str) -> dict:
    """Query with cached system prompt prefix."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    usage = response.usage
    # prompt_tokens_details is an object, not a dict, so use getattr
    details = getattr(usage, "prompt_tokens_details", None)
    return {
        "answer": response.choices[0].message.content,
        "cached_tokens": getattr(details, "cached_tokens", 0) or 0,
        "total_prompt_tokens": usage.prompt_tokens,
    }

# First call: full processing (no cache)
result1 = analyze_financial_data("What is the revenue trend?")
print(f"Cached: {result1['cached_tokens']} / {result1['total_prompt_tokens']}")

# Subsequent calls: prefix is cached
result2 = analyze_financial_data("Analyze operating margins.")
print(f"Cached: {result2['cached_tokens']} / {result2['total_prompt_tokens']}")

Designing Cache-Friendly Prompts

The critical insight is that caching works on prefixes — the matching starts from the first token. Any change at the beginning of the prompt invalidates the entire cache. This means you should structure your prompts with static content first and dynamic content last:


def build_cache_friendly_prompt(
    static_instructions: str,
    static_examples: list[str],
    dynamic_context: str,
    user_query: str,
) -> list[dict]:
    """Structure prompt for maximum cache reuse."""
    # Static prefix — identical across requests, cached
    system_content = (
        f"{static_instructions}\n\n"
        "## Examples\n\n"
        + "\n\n".join(static_examples)
    )

    # Dynamic content — changes per request, not cached
    user_content = (
        f"## Context\n\n{dynamic_context}\n\n"
        f"## Question\n\n{user_query}"
    )

    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]

# Anti-pattern: dynamic content in system prompt breaks cache
def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    """This breaks caching because user_id changes per request."""
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

# Better: move dynamic content after the static prefix
def good_prompt_design(user_id: str, query: str) -> list[dict]:
    """Static prefix stays cacheable, dynamic content is appended."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]

Anthropic's Explicit Cache Control

Anthropic takes a different approach with explicit cache breakpoints. You mark exactly where in the prompt the cache should apply:

import anthropic

anthropic_client = anthropic.Anthropic()

def cached_anthropic_query(
    static_context: str,
    user_query: str,
) -> dict:
    """Use Anthropic's explicit cache control."""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": user_query},
        ],
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "cache_read_tokens": getattr(
            response.usage, "cache_read_input_tokens", 0
        ),
        "cache_write_tokens": getattr(
            response.usage, "cache_creation_input_tokens", 0
        ),
    }
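The write fee changes the break-even math slightly. A sketch of when caching pays off, assuming a 1.25x multiplier on cache writes and a 0.1x multiplier on cache reads (these multipliers are my assumption of typical pricing, consistent with the 90 percent read savings mentioned above):

```python
def anthropic_cache_cost(prefix_tokens: int, num_requests: int,
                         write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total prefix cost in base-price token units: one cache write, then reads."""
    if num_requests == 0:
        return 0.0
    return prefix_tokens * (write_mult + (num_requests - 1) * read_mult)

def uncached_cost(prefix_tokens: int, num_requests: int) -> float:
    """Same prefix reprocessed at full price on every request."""
    return float(prefix_tokens * num_requests)

# With these multipliers, caching costs more for a single request
# but wins from the second request onward:
for n in (1, 2, 5):
    print(n, anthropic_cache_cost(2000, n), uncached_cost(2000, n))
```

In other words, the write premium is recovered as soon as the prefix is read from cache once, so any prompt reused more than once comes out ahead.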

Measuring Cache Effectiveness

Track your cache hit rate to validate that your prompt design is actually benefiting from caching:

class CacheMetrics:
    """Track prompt caching effectiveness over time."""

    def __init__(self):
        self.total_requests = 0
        self.total_prompt_tokens = 0
        self.total_cached_tokens = 0

    def record(self, prompt_tokens: int, cached_tokens: int):
        self.total_requests += 1
        self.total_prompt_tokens += prompt_tokens
        self.total_cached_tokens += cached_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.total_prompt_tokens == 0:
            return 0.0
        return self.total_cached_tokens / self.total_prompt_tokens

    @property
    def estimated_savings(self) -> float:
        """Token-equivalent savings, assuming cached tokens bill at half price."""
        return self.total_cached_tokens * 0.5

    def report(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            "total_tokens_cached": self.total_cached_tokens,
            "estimated_token_savings": self.estimated_savings,
        }
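A quick sanity check of the math the class implements, using hypothetical token counts for three calls that share a roughly 2,000-token prefix:

```python
# (prompt_tokens, cached_tokens) per request — hypothetical values.
# Inline tally so the example is self-contained:
records = [
    (2200, 0),     # first call: cache miss, prefix gets written
    (2150, 2048),  # prefix now served from cache
    (2300, 2048),
]
total_prompt = sum(p for p, _ in records)
total_cached = sum(c for _, c in records)
hit_rate = total_cached / total_prompt
print(f"cache hit rate: {hit_rate:.1%}")  # about 62% across three calls
```

Note that the first request always registers zero cached tokens, so the hit rate climbs toward its ceiling as request volume grows.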

A well-designed caching strategy achieves 60 to 80 percent cache hit rates on the prompt prefix. If your hit rate is below 40 percent, audit your prompt construction to find dynamic content that is breaking the prefix match.

FAQ

How long do cached prefixes persist?

OpenAI caches persist for 5 to 10 minutes of inactivity. Anthropic's ephemeral caches persist for roughly 5 minutes. Neither provider guarantees cache persistence — your application should work correctly whether the cache hits or misses. Design for caching but do not depend on it for correctness.
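Because neither cache survives long idle periods, some teams keep a hot prefix warm with a cheap periodic request during low-traffic windows. A minimal sketch — the interval and the idea of a lightweight no-op query are assumptions, not provider guidance:

```python
import threading
import time

def keep_cache_warm(send_request, interval_s: float = 240.0) -> threading.Thread:
    """Periodically re-issue a minimal request that reuses the full static
    prefix, so the provider-side cache TTL keeps resetting.

    send_request: zero-arg callable, e.g. a one-word user query appended
    to the cached system prompt.
    """
    def loop():
        while True:
            time.sleep(interval_s)
            try:
                send_request()
            except Exception:
                pass  # warming is best-effort; never let it crash the app

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Weigh the warming requests' cost against the cache-miss cost they avoid; for low-traffic applications it is often cheaper to simply accept the occasional miss.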

What is the minimum prefix length for caching?

OpenAI requires at least 1024 tokens in the matching prefix. Anthropic requires at least 1024 tokens for the content marked with cache control. Short system prompts do not benefit from caching. If your system prompt is under 1024 tokens, consider prepending static context like tool definitions or few-shot examples to reach the threshold.
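To check whether a prefix clears the threshold before shipping it, count its tokens. A sketch using a rough four-characters-per-token heuristic (for exact counts, use the model's real tokenizer, e.g. tiktoken):

```python
CACHE_MIN_TOKENS = 1024  # minimum cacheable prefix length

def estimate_tokens(text: str) -> int:
    """Very rough estimate: English prose averages ~4 characters per token."""
    return max(1, len(text) // 4)

def is_cacheable(prefix: str, min_tokens: int = CACHE_MIN_TOKENS) -> bool:
    """True if the static prefix is long enough to be eligible for caching."""
    return estimate_tokens(prefix) >= min_tokens

short_prompt = "You are a helpful assistant."
print(is_cacheable(short_prompt))  # False — pad with static context first
```

Run this against your assembled static prefix in CI so a refactor that shortens the prompt below the threshold is caught before it silently disables caching.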

Can I cache tool definitions and function schemas?

Yes, and this is one of the highest-value caching targets. Tool schemas are identical across requests and can be very long — 20 tools with detailed schemas easily exceed 2000 tokens. Place tool definitions in the system prompt before any dynamic content to maximize cache reuse.
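With Anthropic's explicit breakpoints, tool definitions can be pulled into the cached prefix by marking the last tool in the array. A sketch that builds the request kwargs without sending them (the tool schema and helper name are hypothetical):

```python
def cached_tools_request_kwargs(tools: list[dict], user_query: str) -> dict:
    """Build messages.create kwargs with tool definitions in the cached prefix."""
    tools = [dict(t) for t in tools]
    # Anthropic caches everything up to the breakpoint, so marking the
    # last tool places the entire tools array inside the cached prefix.
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "messages": [{"role": "user", "content": user_query}],
    }

example_tools = [
    {
        "name": "get_financial_metric",
        "description": "Look up a metric (revenue, margin, headcount) by fiscal year.",
        "input_schema": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "fiscal_year": {"type": "integer"},
            },
            "required": ["metric", "fiscal_year"],
        },
    },
]

# Usage: anthropic.Anthropic().messages.create(
#     **cached_tools_request_kwargs(example_tools, "What was FY2025 revenue?"))
```

Keeping the breakpoint on the tools rather than the system text means edits to downstream instructions do not invalidate the (usually much larger) cached tool schemas.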


#PromptEngineering #Caching #CostOptimization #Latency #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

