
LLM Caching Strategies for Cost Optimization: Prompt, Semantic, and KV Caching

Practical techniques to reduce LLM inference costs by 40-80 percent through prompt caching, semantic caching, and KV cache optimization in production systems.

LLM Inference Costs Add Up Fast

At $3-15 per million input tokens for frontier models, LLM costs become significant at scale. A customer support agent handling 10,000 conversations per day with 2,000 tokens per conversation costs $60-300 daily on input tokens alone. Caching strategies can reduce these costs by 40-80 percent while simultaneously improving latency.
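The arithmetic behind that estimate is worth making explicit. A quick sketch, using the illustrative volumes and prices above (not quotes for any specific model):

```python
def daily_input_cost(conversations_per_day: int, tokens_per_conversation: int,
                     price_per_million_tokens: float) -> float:
    """Daily spend on input tokens alone."""
    total_tokens = conversations_per_day * tokens_per_conversation
    return total_tokens / 1_000_000 * price_per_million_tokens

# 10,000 conversations/day at 2,000 input tokens each = 20M tokens/day
low = daily_input_cost(10_000, 2_000, 3.0)    # at $3/M tokens  -> $60/day
high = daily_input_cost(10_000, 2_000, 15.0)  # at $15/M tokens -> $300/day
```

A 40-80 percent cache hit rate scales these numbers down proportionally, which is where the savings ranges later in this article come from.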

Three caching approaches address different patterns: exact prompt caching, semantic caching, and KV cache optimization.

Exact Prompt Caching

The simplest approach: hash the full prompt and cache the response. If the same prompt appears again, return the cached response without calling the LLM.

import hashlib
import json

import redis
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
openai_client = AsyncOpenAI()

async def cached_llm_call(messages: list, model: str, ttl: int = 3600):
    # sort_keys keeps the hash stable regardless of dict key ordering
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    response = await openai_client.chat.completions.create(
        model=model, messages=messages
    )
    # Serialize so hits and misses return the same dict shape
    result = response.model_dump()
    cache.setex(cache_key, ttl, json.dumps(result))
    return result

When Exact Caching Works

  • Repeated system prompts: Many requests share identical system prompts
  • Structured queries: Classification tasks with a fixed set of inputs
  • Batch processing: Re-running analysis on unchanged data

When It Fails

Exact caching has a low hit rate for conversational applications where each message includes unique user input: even a one-character difference produces a completely different hash.
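One mitigation is to normalize prompts before hashing so trivial surface differences (case, leading/trailing whitespace, collapsed internal spaces) map to the same key. A minimal sketch; whether this is safe depends on whether your prompts are whitespace- or case-sensitive:

```python
import hashlib
import re

def normalized_cache_key(text: str) -> str:
    # Lowercase, trim, and collapse internal whitespace before hashing
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

# These produce identical keys despite surface differences
k1 = normalized_cache_key("What's the weather in NYC?")
k2 = normalized_cache_key("  what's the   weather in NYC? ")
```

Normalization raises the hit rate only modestly; genuinely different phrasings still miss, which is the gap semantic caching addresses.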

Semantic Caching

Semantic caching matches queries by meaning rather than exact text. "What's the weather in NYC?" and "How's the weather in New York City?" should return the same cached response.


Implementation uses embedding models and vector similarity:

from openai import AsyncOpenAI

embed_client = AsyncOpenAI()

async def embed(text: str) -> list[float]:
    result = await embed_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return result.data[0].embedding

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    query_embedding = await embed(query)

    # vector_store is your vector database client (Pinecone, Qdrant, etc.);
    # ttl_cutoff filters out entries older than the cache TTL
    results = vector_store.search(
        vector=query_embedding,
        limit=1,
        filter={"created_at": {"$gt": ttl_cutoff}}
    )

    if results and results[0].score > threshold:
        return results[0].metadata["response"]

    # Cache miss: call the LLM and store the new entry
    response = await llm_call(query)
    vector_store.upsert({
        "vector": query_embedding,
        "metadata": {"query": query, "response": response}
    })
    return response

Tuning the Similarity Threshold

  • 0.98+: Nearly identical queries only. Low hit rate, very safe.
  • 0.95-0.98: Paraphrases and minor variations. Good balance.
  • 0.90-0.95: Loosely similar queries. Higher hit rate but risk of returning irrelevant cached responses.

Test with your actual query distribution to find the right threshold.
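Threshold selection can be made empirical: label a small set of query pairs as should-match or should-not-match, then check how a candidate threshold trades hits against false matches. A toy sketch with hand-made vectors standing in for real embedding output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (embedding_a, embedding_b, should_match) — stand-ins for real embeddings
pairs = [
    ([1.0, 0.0, 0.1], [0.98, 0.02, 0.12], True),   # paraphrase
    ([1.0, 0.0, 0.1], [0.1, 1.0, 0.0], False),     # unrelated query
]

def false_match_rate(pairs, threshold: float) -> float:
    negatives = [(a, b) for a, b, ok in pairs if not ok]
    wrong = sum(1 for a, b in negatives if cosine(a, b) >= threshold)
    return wrong / max(1, len(negatives))
```

Sweep `false_match_rate` (and the corresponding hit rate on the positive pairs) across candidate thresholds on a few hundred labeled pairs from production logs, and pick the highest threshold that keeps false matches at an acceptable level.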


Provider-Level Prompt Caching

Anthropic and OpenAI now offer server-side prompt caching that reduces costs for repeated prompt prefixes.

Anthropic Prompt Caching

Anthropic caches prompt prefixes marked with a cache_control parameter. Subsequent requests with the same prefix hit the cache, reducing input token costs by 90 percent for the cached portion; cache writes cost 25 percent more than regular input tokens, so the savings begin with the first reuse. The default cache has a 5-minute TTL that resets on each hit.

This is particularly effective for:

  • Long system prompts (1,000+ tokens)
  • RAG contexts where the retrieved documents are appended to a fixed instruction prefix
  • Multi-turn conversations where the history grows but the system prompt remains constant
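With the Anthropic Messages API, the prefix to cache is marked with a cache_control field on a system content block. A minimal sketch of the request payload (the model name and prompt text are illustrative); the dict is what you would pass as keyword arguments to client.messages.create:

```python
# Keyword arguments for anthropic.Anthropic().messages.create(**request)
request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for Acme Corp. <long policy text>",
            # Everything up to and including this block is cached (~5 min TTL)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

On the response, usage.cache_creation_input_tokens and usage.cache_read_input_tokens report how much of the prompt was written to or served from the cache.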

OpenAI Cached Tokens

OpenAI automatically caches prompt prefixes of 1,024 tokens or more and charges 50 percent less for cached input tokens. Unlike Anthropic's approach, caching is automatic — no API changes required.
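The discount shows up in the usage object: response.usage.prompt_tokens_details.cached_tokens reports how many prefix tokens were billed at the reduced rate. A small helper to translate that into dollars, assuming the 50 percent discount quoted above:

```python
def effective_input_cost(prompt_tokens: int, cached_tokens: int,
                         price_per_million: float,
                         cached_discount: float = 0.5) -> float:
    """Input cost in dollars when cached prefix tokens are billed at a discount.

    cached_tokens comes from response.usage.prompt_tokens_details.cached_tokens.
    """
    uncached = prompt_tokens - cached_tokens
    discounted = cached_tokens * (1 - cached_discount)
    return (uncached + discounted) * price_per_million / 1_000_000

# 3,000-token prompt with a 2,048-token cached prefix at $3/M input tokens
cost = effective_input_cost(3_000, 2_048, 3.0)
full_price = 3_000 * 3.0 / 1_000_000
```

Logging this per request makes it easy to verify that your prompt structure (static prefix first, variable content last) is actually producing cache hits.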

KV Cache Optimization

For self-hosted models, the key-value (KV) cache built up during autoregressive decoding is a major memory and compute bottleneck.


Techniques

  • PagedAttention (vLLM): Manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling higher batch sizes
  • Prefix caching: Shares KV cache entries across requests with identical prompt prefixes, avoiding redundant computation
  • Quantized KV cache: Storing cached keys and values in FP8 or INT8 precision reduces memory by 50 percent with minimal quality impact
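In vLLM, for example, prefix caching is a single engine flag. A sketch, assuming a recent vLLM version and an illustrative model name (requires a GPU, so treat this as a configuration fragment):

```python
from vllm import LLM

# Share KV-cache blocks across requests with identical prompt prefixes,
# so a common system prompt is prefilled once rather than per request
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,
)
```

The equivalent for the OpenAI-compatible server is the --enable-prefix-caching flag on vllm serve.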

Cost Savings Calculator

For a system processing 100,000 LLM calls per day:

Strategy                  Typical Hit Rate      Cost Reduction
Exact prompt cache        5-15%                 5-15%
Semantic cache            15-40%                15-40%
Provider prompt caching   60-90% of tokens      30-50%
Combined approach         —                     50-80%

The strategies are complementary. A production system should layer exact caching (cheapest to implement), semantic caching (catches paraphrases), and provider-level caching (reduces per-token cost for cache misses).
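A rough model of how the layers compound; the rates below are illustrative mid-range figures from the table, not measurements:

```python
def blended_cost(base_cost: float, exact_hit: float, semantic_hit: float,
                 provider_reduction_on_misses: float) -> float:
    """Daily cost after layering the three caches.

    exact_hit: fraction of requests served from the exact cache
    semantic_hit: fraction of exact-cache misses served from the semantic cache
    provider_reduction_on_misses: fractional per-token saving, from
        provider-level prompt caching, on requests that still reach the LLM
    """
    miss_fraction = (1 - exact_hit) * (1 - semantic_hit)
    return base_cost * miss_fraction * (1 - provider_reduction_on_misses)

# $200/day base, 10% exact hits, 25% semantic hits on the rest,
# 40% provider-level saving on the remaining calls
cost = blended_cost(200.0, 0.10, 0.25, 0.40)
savings_fraction = 1 - cost / 200.0
```

Even with individually modest hit rates, the layers multiply into a combined reduction in the 50-80 percent range claimed above.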

Sources: Anthropic Prompt Caching Documentation | vLLM PagedAttention Paper | GPTCache GitHub


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
