Learn Agentic AI · 10 min read

RAG Pipeline Optimization: Reducing Latency from Seconds to Milliseconds

Learn practical techniques to dramatically reduce RAG pipeline latency including async retrieval, semantic caching, pre-computation, and embedding optimization without sacrificing answer quality.

Where RAG Latency Comes From

A typical RAG pipeline has five latency-contributing stages:

  1. Embedding the query — 50-200ms (API call to embedding model)
  2. Vector search — 10-500ms (depends on index size and infrastructure)
  3. Document retrieval — 5-50ms (fetching full documents from storage)
  4. Context assembly — 1-5ms (concatenating and formatting)
  5. LLM generation — 500-5000ms (the dominant cost)

A naive implementation runs these sequentially, resulting in 1-6 seconds of total latency. With the optimizations in this guide, you can reduce stages 1-4 to under 100ms combined and significantly improve the perceived speed of stage 5 through streaming.
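
To see where a naive pipeline spends its time, each stage can be wrapped with a timer. This is an illustrative sketch: the stage functions below are stand-ins that sleep for roughly the documented low-end latencies rather than making real API calls.

```python
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, (time.perf_counter() - start) * 1000

# Stub stages sleeping for roughly the low end of each latency range
def embed_query(query):
    time.sleep(0.050)
    return [0.0]

def vector_search(embedding):
    time.sleep(0.010)
    return ["doc-1"]

def fetch_documents(ids):
    time.sleep(0.005)
    return ["full text"]

def assemble_context(docs):
    return "\n".join(docs)

timings = {}
emb, timings["embed"] = timed(embed_query, "What is RAG?")
hits, timings["search"] = timed(vector_search, emb)
docs, timings["fetch"] = timed(fetch_documents, hits)
ctx, timings["assemble"] = timed(assemble_context, docs)
total_ms = sum(timings.values())
```

Instrumenting a real pipeline this way tells you which optimization below pays off first for your workload.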

Optimization 1: Semantic Cache

The highest-impact optimization is caching. If two users ask semantically similar questions, the second query can return a cached response instantly:

import hashlib
import numpy as np
from openai import OpenAI
import redis
import json

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.embedding_cache_key = "rag:embeddings"
        self.response_cache_key = "rag:responses"

    def _get_embedding(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(
        self, a: list[float], b: list[float]
    ) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(
            np.dot(a_np, b_np)
            / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
        )

    def get(self, query: str) -> str | None:
        """Check if a semantically similar query was cached."""
        query_emb = self._get_embedding(query)

        # Linear scan over every cached embedding; fine for small
        # caches, but use a vector index once this grows large
        cached = cache.hgetall(self.embedding_cache_key)
        for key, emb_json in cached.items():
            cached_emb = json.loads(emb_json)
            similarity = self._cosine_similarity(
                query_emb, cached_emb
            )
            if similarity >= self.threshold:
                response = cache.hget(
                    self.response_cache_key, key
                )
                if response:
                    return response.decode()

        return None

    def set(
        self, query: str, response: str, ttl: int = 3600
    ):
        """Cache a query-response pair."""
        query_emb = self._get_embedding(query)
        key = hashlib.md5(query.encode()).hexdigest()
        cache.hset(
            self.embedding_cache_key,
            key,
            json.dumps(query_emb),
        )
        cache.hset(self.response_cache_key, key, response)
        # Redis hashes expire as a whole, so the TTL applies to
        # the entire cache rather than to individual entries
        cache.expire(self.embedding_cache_key, ttl)
        cache.expire(self.response_cache_key, ttl)

Optimization 2: Async Parallel Retrieval

When searching multiple sources, run them concurrently:

import asyncio
from typing import Any

async def async_embed(text: str) -> list[float]:
    """Non-blocking embedding call."""
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        lambda: client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
    )
    return response.data[0].embedding

async def async_search(
    vectorstore, query_embedding: list[float], k: int
) -> list[dict]:
    """Non-blocking vector search."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None,
        lambda: vectorstore.search_by_vector(
            query_embedding, k=k
        )
    )

async def optimized_retrieval(
    query: str,
    vectorstores: list,
    k_per_store: int = 3,
) -> list[dict]:
    """Search all vector stores in parallel."""
    # Single embedding call shared across all stores
    query_embedding = await async_embed(query)

    # Search all stores concurrently
    tasks = [
        async_search(vs, query_embedding, k_per_store)
        for vs in vectorstores
    ]
    results = await asyncio.gather(*tasks)

    # Flatten and return
    return [doc for store_results in results
            for doc in store_results]
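
A self-contained way to see the fan-out's effect is to time stub stores that each simulate network latency (StubStore and its delays are hypothetical; real stores would be wrapped as above):

```python
import asyncio
import time

class StubStore:
    """Hypothetical store whose search just sleeps to mimic latency."""
    def __init__(self, name: str, delay: float):
        self.name, self.delay = name, delay

    async def search(self, query_embedding, k: int):
        await asyncio.sleep(self.delay)  # simulated network round trip
        return [{"store": self.name, "rank": i} for i in range(k)]

async def fan_out(stores, query_embedding, k=2):
    tasks = [s.search(query_embedding, k) for s in stores]
    results = await asyncio.gather(*tasks)
    return [doc for store_results in results for doc in store_results]

stores = [StubStore(name, 0.1) for name in ("a", "b", "c")]
start = time.perf_counter()
docs = asyncio.run(fan_out(stores, [0.0]))
elapsed = time.perf_counter() - start
# Concurrent: elapsed is close to the slowest store (~0.1s),
# not the ~0.3s a sequential loop over three stores would take
```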

Optimization 3: Matryoshka Embeddings

Modern embedding models like text-embedding-3-small support dimensionality reduction through Matryoshka representation learning: the leading dimensions of the full vector carry most of the signal, so embeddings can be truncated. Shorter embeddings mean faster similarity computation:


def get_compact_embedding(
    text: str, dimensions: int = 256
) -> list[float]:
    """Get a reduced-dimension embedding for faster search.
    text-embedding-3-small natively supports 256, 512,
    or 1536 dimensions."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dimensions,  # Reduce from 1536 to 256
    )
    return response.data[0].embedding

# 256-dim embeddings are 6x smaller and search is
# approximately 4x faster with minimal quality loss
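
A quick, self-contained sanity check of the size claim: build two brute-force indexes of random unit vectors at 256 and 1536 dimensions and compare their footprint (the corpus here is synthetic).

```python
import numpy as np

def build_index(n: int, dims: int, seed: int = 0) -> np.ndarray:
    """Random normalized float32 vectors standing in for a real corpus."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((n, dims)).astype(np.float32)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

small = build_index(10_000, 256)
large = build_index(10_000, 1536)

# With unit vectors, cosine search is a single matrix-vector product
query = small[0]
scores = small @ query
best = int(np.argmax(scores))  # the query matches itself

size_ratio = large.nbytes / small.nbytes  # 1536 / 256 = 6x
```

The smaller index also fits more readily in memory or CPU cache, which is where much of the search speedup comes from.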

Optimization 4: Streaming Generation

The LLM generation step dominates latency. Streaming gives users immediate feedback:

def streaming_rag(
    query: str,
    context: str,
):
    """Stream the RAG response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\n"
                       f"Question: {query}"
        }],
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
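
Consuming the stream is just iterating the generator; the metric to watch is time to first token. The stub below stands in for streaming_rag so the pattern runs offline (the token list and per-token delay are made up):

```python
import time

def stub_streaming_rag(query: str, context: str):
    """Offline stand-in for the OpenAI stream above."""
    for token in ["The ", "answer ", "is ", "42."]:
        time.sleep(0.01)  # simulated per-token latency
        yield token

start = time.perf_counter()
first_token_ms = None
parts = []
for token in stub_streaming_rag("q", "ctx"):
    if first_token_ms is None:
        # Users start reading here, long before the answer completes
        first_token_ms = (time.perf_counter() - start) * 1000
    parts.append(token)
full_answer = "".join(parts)
```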

Optimization 5: Pre-Computed Popular Queries

For queries that follow predictable patterns, pre-compute and cache results during off-peak hours:

from datetime import datetime

def precompute_popular_queries(
    popular_queries: list[str],
    rag_pipeline,
    semantic_cache: SemanticCache,
):
    """Pre-compute answers for frequently asked questions
    during off-peak hours."""
    for query in popular_queries:
        # Check if already cached and fresh
        cached = semantic_cache.get(query)
        if cached:
            continue

        # Generate and cache
        answer = rag_pipeline.answer(query)
        semantic_cache.set(query, answer, ttl=86400)

    print(
        f"Pre-computed {len(popular_queries)} queries "
        f"at {datetime.now()}"
    )
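
One hedged way to build the popular_queries list is to tally a normalized query log and keep the most frequent entries (top_queries and the sample log are illustrative):

```python
from collections import Counter

def top_queries(query_log: list[str], n: int = 100) -> list[str]:
    """Most frequent queries after light normalization."""
    normalized = [q.strip().lower() for q in query_log]
    return [query for query, _ in Counter(normalized).most_common(n)]

log = [
    "How do I reset my password?",
    "how do i reset my password?  ",
    "What does the pro plan cost?",
    "what does the pro plan cost?",
    "Can I export my data?",
]
popular = top_queries(log, n=2)
```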

Combined Pipeline with All Optimizations

When you apply all these optimizations together, the typical latency profile changes dramatically. Cache hits return in under 100ms. Cache misses with parallel retrieval and streaming return the first token in 300-500ms. The user perceives near-instant responses for common queries and fast streaming for novel ones.
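
The control flow of the combined pipeline can be sketched with stubbed components (DictCache, stub_store, and stub_generate are placeholders; a real version would wire in SemanticCache, optimized_retrieval, and streaming_rag):

```python
import asyncio

class DictCache:
    """Exact-match stand-in for the semantic cache."""
    def __init__(self):
        self._store = {}
    def get(self, query):
        return self._store.get(query)
    def set(self, query, answer):
        self._store[query] = answer

async def stub_store(query_embedding):
    await asyncio.sleep(0)  # pretend network hop
    return ["retrieved doc"]

def stub_generate(query, context):  # stand-in for streaming_rag
    yield "stub answer"

async def cached_rag(query, cache, stores, generate):
    hit = cache.get(query)
    if hit is not None:
        return hit, True  # fast path: no retrieval, no LLM call
    query_embedding = [0.0]  # real code: await async_embed(query)
    results = await asyncio.gather(*(s(query_embedding) for s in stores))
    context = " ".join(doc for batch in results for doc in batch)
    answer = "".join(generate(query, context))  # drain the token stream
    cache.set(query, answer)
    return answer, False

cache = DictCache()
first, was_hit_1 = asyncio.run(
    cached_rag("q", cache, [stub_store, stub_store], stub_generate))
second, was_hit_2 = asyncio.run(
    cached_rag("q", cache, [stub_store, stub_store], stub_generate))
```

The first call pays the full miss path and populates the cache; the second returns before any retrieval or generation happens.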

FAQ

What cache hit rate should I expect?

In production RAG systems with enterprise users, cache hit rates of 30-50% are common because users often ask variations of the same questions. Consumer-facing systems see lower hit rates (10-20%) due to query diversity. Even a 30% hit rate means nearly a third of your queries return instantly.
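
The arithmetic behind that claim, made explicit (the 50ms hit and 2000ms miss figures are illustrative):

```python
def expected_latency_ms(hit_rate: float,
                        hit_ms: float = 50,
                        miss_ms: float = 2000) -> float:
    """Expected latency as a weighted average of the two paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

no_cache = expected_latency_ms(0.0)     # every query pays full cost
enterprise = expected_latency_ms(0.30)  # ~30% hit rate
```

At a 30% hit rate the average drops from 2000ms to roughly 1415ms, and every hit individually feels instant, which matters more to users than the mean.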

Does reducing embedding dimensions hurt retrieval quality?

At 256 dimensions (down from 1536), text-embedding-3-small retains approximately 95% of its retrieval quality on standard benchmarks. For most applications, this is an excellent tradeoff. If you work in a domain with very fine-grained semantic distinctions (like legal or medical), test on your specific evaluation set before committing to reduced dimensions.

Should I optimize the retrieval pipeline or the generation step first?

Optimize generation first with streaming — it gives the biggest perceived latency improvement because users see tokens appearing immediately instead of waiting for the full response. Then add semantic caching, which eliminates both retrieval and generation latency for repeated queries. Async retrieval and embedding optimization are worthwhile refinements after those two are in place.


#RAGOptimization #LatencyReduction #Caching #AsyncRetrieval #Performance #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
