
LLM Caching Strategies for Cost Optimization: Prompt, Semantic, and KV Caching

Practical techniques to reduce LLM inference costs by 40-80 percent through prompt caching, semantic caching, and KV cache optimization in production systems.

LLM Inference Costs Add Up Fast

At $3-15 per million input tokens for frontier models, LLM costs become significant at scale. A customer support agent handling 10,000 conversations per day with 2,000 tokens per conversation costs $60-300 daily on input tokens alone. Caching strategies can reduce these costs by 40-80 percent while simultaneously improving latency.
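The arithmetic behind that estimate is worth making explicit. A quick sketch, using the illustrative volumes and prices above (not quotes for any specific model):

```python
def daily_input_cost(conversations_per_day: int, tokens_per_conversation: int,
                     price_per_million_tokens: float) -> float:
    """Daily spend on input tokens alone."""
    total_tokens = conversations_per_day * tokens_per_conversation
    return total_tokens / 1_000_000 * price_per_million_tokens

# 10,000 conversations/day at 2,000 input tokens each = 20M tokens/day
low = daily_input_cost(10_000, 2_000, 3.0)    # at $3/M tokens  -> $60/day
high = daily_input_cost(10_000, 2_000, 15.0)  # at $15/M tokens -> $300/day
```

A 40-80 percent cache hit rate scales these numbers down proportionally, which is where the savings ranges later in this article come from.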

Three caching approaches address different patterns: exact prompt caching, semantic caching, and KV cache optimization.

Exact Prompt Caching

The simplest approach: hash the full prompt and cache the response. If the same prompt appears again, return the cached response without calling the LLM.

import hashlib
import json

import redis
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
openai_client = AsyncOpenAI()

async def cached_llm_call(messages: list, model: str, ttl: int = 3600):
    # sort_keys keeps the hash stable regardless of dict key ordering
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    response = await openai_client.chat.completions.create(
        model=model, messages=messages
    )
    # Serialize so hits and misses return the same dict shape
    result = response.model_dump()
    cache.setex(cache_key, ttl, json.dumps(result))
    return result

When Exact Caching Works

  • Repeated system prompts: Many requests share identical system prompts
  • Structured queries: Classification tasks with a fixed set of inputs
  • Batch processing: Re-running analysis on unchanged data

When It Fails

Exact caching has a low hit rate for conversational applications where each message includes unique user input: even a one-character difference produces a completely different hash.
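One mitigation is to normalize prompts before hashing so trivial surface differences (case, leading/trailing whitespace, collapsed internal spaces) map to the same key. A minimal sketch; whether this is safe depends on whether your prompts are whitespace- or case-sensitive:

```python
import hashlib
import re

def normalized_cache_key(text: str) -> str:
    # Lowercase, trim, and collapse internal whitespace before hashing
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

# These produce identical keys despite surface differences
k1 = normalized_cache_key("What's the weather in NYC?")
k2 = normalized_cache_key("  what's the   weather in NYC? ")
```

Normalization raises the hit rate only modestly; genuinely different phrasings still miss, which is the gap semantic caching addresses.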

Semantic Caching

Semantic caching matches queries by meaning rather than exact text. "What's the weather in NYC?" and "How's the weather in New York City?" should return the same cached response.


Implementation uses embedding models and vector similarity:

from openai import AsyncOpenAI

embed_client = AsyncOpenAI()

async def embed(text: str) -> list[float]:
    result = await embed_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return result.data[0].embedding

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    query_embedding = await embed(query)

    # vector_store is your vector database client (Pinecone, Qdrant, etc.);
    # ttl_cutoff filters out entries older than the cache TTL
    results = vector_store.search(
        vector=query_embedding,
        limit=1,
        filter={"created_at": {"$gt": ttl_cutoff}}
    )

    if results and results[0].score > threshold:
        return results[0].metadata["response"]

    # Cache miss: call the LLM and store the new entry
    response = await llm_call(query)
    vector_store.upsert({
        "vector": query_embedding,
        "metadata": {"query": query, "response": response}
    })
    return response

Tuning the Similarity Threshold

  • 0.98+: Nearly identical queries only. Low hit rate, very safe.
  • 0.95-0.98: Paraphrases and minor variations. Good balance.
  • 0.90-0.95: Loosely similar queries. Higher hit rate but risk of returning irrelevant cached responses.

Test with your actual query distribution to find the right threshold.
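Threshold selection can be made empirical: label a small set of query pairs as should-match or should-not-match, then check how a candidate threshold trades hits against false matches. A toy sketch with hand-made vectors standing in for real embedding output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (embedding_a, embedding_b, should_match) — stand-ins for real embeddings
pairs = [
    ([1.0, 0.0, 0.1], [0.98, 0.02, 0.12], True),   # paraphrase
    ([1.0, 0.0, 0.1], [0.1, 1.0, 0.0], False),     # unrelated query
]

def false_match_rate(pairs, threshold: float) -> float:
    negatives = [(a, b) for a, b, ok in pairs if not ok]
    wrong = sum(1 for a, b in negatives if cosine(a, b) >= threshold)
    return wrong / max(1, len(negatives))
```

Sweep `false_match_rate` (and the corresponding hit rate on the positive pairs) across candidate thresholds on a few hundred labeled pairs from production logs, and pick the highest threshold that keeps false matches at an acceptable level.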


Provider-Level Prompt Caching

Anthropic and OpenAI now offer server-side prompt caching that reduces costs for repeated prompt prefixes.

Anthropic Prompt Caching

Anthropic caches prompt prefixes marked with a cache_control parameter. Subsequent requests with the same prefix hit the cache, reducing input token costs by 90 percent for the cached portion; cache writes cost 25 percent more than regular input tokens, so the savings begin with the first reuse. The default cache has a 5-minute TTL that resets on each hit.

This is particularly effective for:

  • Long system prompts (1,000+ tokens)
  • RAG contexts where the retrieved documents are appended to a fixed instruction prefix
  • Multi-turn conversations where the history grows but the system prompt remains constant
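With the Anthropic Messages API, the prefix to cache is marked with a cache_control field on a system content block. A minimal sketch of the request payload (the model name and prompt text are illustrative); the dict is what you would pass as keyword arguments to client.messages.create:

```python
# Keyword arguments for anthropic.Anthropic().messages.create(**request)
request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for Acme Corp. <long policy text>",
            # Everything up to and including this block is cached (~5 min TTL)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

On the response, usage.cache_creation_input_tokens and usage.cache_read_input_tokens report how much of the prompt was written to or served from the cache.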

OpenAI Cached Tokens

OpenAI automatically caches prompt prefixes of 1,024 tokens or more and charges 50 percent less for cached input tokens. Unlike Anthropic's approach, caching is automatic — no API changes required.
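The discount shows up in the usage object: response.usage.prompt_tokens_details.cached_tokens reports how many prefix tokens were billed at the reduced rate. A small helper to translate that into dollars, assuming the 50 percent discount quoted above:

```python
def effective_input_cost(prompt_tokens: int, cached_tokens: int,
                         price_per_million: float,
                         cached_discount: float = 0.5) -> float:
    """Input cost in dollars when cached prefix tokens are billed at a discount.

    cached_tokens comes from response.usage.prompt_tokens_details.cached_tokens.
    """
    uncached = prompt_tokens - cached_tokens
    discounted = cached_tokens * (1 - cached_discount)
    return (uncached + discounted) * price_per_million / 1_000_000

# 3,000-token prompt with a 2,048-token cached prefix at $3/M input tokens
cost = effective_input_cost(3_000, 2_048, 3.0)
full_price = 3_000 * 3.0 / 1_000_000
```

Logging this per request makes it easy to verify that your prompt structure (static prefix first, variable content last) is actually producing cache hits.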

KV Cache Optimization

For self-hosted models, the key-value (KV) cache built up during autoregressive decoding is a major memory and compute bottleneck.


Techniques

  • PagedAttention (vLLM): Manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling higher batch sizes
  • Prefix caching: Shares KV cache entries across requests with identical prompt prefixes, avoiding redundant computation
  • Quantized KV cache: Storing cached keys and values in FP8 or INT8 precision reduces memory by 50 percent with minimal quality impact
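In vLLM, for example, prefix caching is a single engine flag. A sketch, assuming a recent vLLM version and an illustrative model name (requires a GPU, so treat this as a configuration fragment):

```python
from vllm import LLM

# Share KV-cache blocks across requests with identical prompt prefixes,
# so a common system prompt is prefilled once rather than per request
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,
)
```

The equivalent for the OpenAI-compatible server is the --enable-prefix-caching flag on vllm serve.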

Cost Savings Calculator

For a system processing 100,000 LLM calls per day:

Strategy                  Typical Hit Rate      Cost Reduction
Exact prompt cache        5-15%                 5-15%
Semantic cache            15-40%                15-40%
Provider prompt caching   60-90% of tokens      30-50%
Combined approach         —                     50-80%

The strategies are complementary. A production system should layer exact caching (cheapest to implement), semantic caching (catches paraphrases), and provider-level caching (reduces per-token cost for cache misses).
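A rough model of how the layers compound; the rates below are illustrative mid-range figures from the table, not measurements:

```python
def blended_cost(base_cost: float, exact_hit: float, semantic_hit: float,
                 provider_reduction_on_misses: float) -> float:
    """Daily cost after layering the three caches.

    exact_hit: fraction of requests served from the exact cache
    semantic_hit: fraction of exact-cache misses served from the semantic cache
    provider_reduction_on_misses: fractional per-token saving, from
        provider-level prompt caching, on requests that still reach the LLM
    """
    miss_fraction = (1 - exact_hit) * (1 - semantic_hit)
    return base_cost * miss_fraction * (1 - provider_reduction_on_misses)

# $200/day base, 10% exact hits, 25% semantic hits on the rest,
# 40% provider-level saving on the remaining calls
cost = blended_cost(200.0, 0.10, 0.25, 0.40)
savings_fraction = 1 - cost / 200.0
```

Even with individually modest hit rates, the layers multiply into a combined reduction in the 50-80 percent range claimed above.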

Sources: Anthropic Prompt Caching Documentation | vLLM PagedAttention Paper | GPTCache GitHub


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
