Learn Agentic AI · 10 min read

RAG Pipeline Optimization: Reducing Latency from Seconds to Milliseconds

Learn practical techniques to dramatically reduce RAG pipeline latency including async retrieval, semantic caching, pre-computation, and embedding optimization without sacrificing answer quality.

Where RAG Latency Comes From

A typical RAG pipeline has five latency-contributing stages:

  1. Embedding the query — 50-200ms (API call to embedding model)
  2. Vector search — 10-500ms (depends on index size and infrastructure)
  3. Document retrieval — 5-50ms (fetching full documents from storage)
  4. Context assembly — 1-5ms (concatenating and formatting)
  5. LLM generation — 500-5000ms (the dominant cost)

A naive implementation runs these sequentially, resulting in 1-6 seconds of total latency. With the optimizations in this guide, you can reduce stages 1-4 to under 100ms combined and significantly improve the perceived speed of stage 5 through streaming.
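
To see where a naive pipeline spends its time, each stage can be wrapped with a timer. This is an illustrative sketch: the stage functions below are stand-ins that sleep for roughly the documented low-end latencies rather than making real API calls.

```python
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, (time.perf_counter() - start) * 1000

# Stub stages sleeping for roughly the low end of each latency range
def embed_query(query):
    time.sleep(0.050)
    return [0.0]

def vector_search(embedding):
    time.sleep(0.010)
    return ["doc-1"]

def fetch_documents(ids):
    time.sleep(0.005)
    return ["full text"]

def assemble_context(docs):
    return "\n".join(docs)

timings = {}
emb, timings["embed"] = timed(embed_query, "What is RAG?")
hits, timings["search"] = timed(vector_search, emb)
docs, timings["fetch"] = timed(fetch_documents, hits)
ctx, timings["assemble"] = timed(assemble_context, docs)
total_ms = sum(timings.values())
```

Instrumenting a real pipeline this way tells you which optimization below pays off first for your workload.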

Optimization 1: Semantic Cache

The highest-impact optimization is caching. If two users ask semantically similar questions, the second query can return a cached response instantly:

import hashlib
import numpy as np
from openai import OpenAI
import redis
import json

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.embedding_cache_key = "rag:embeddings"
        self.response_cache_key = "rag:responses"

    def _get_embedding(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(
        self, a: list[float], b: list[float]
    ) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(
            np.dot(a_np, b_np)
            / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
        )

    def get(self, query: str) -> str | None:
        """Check if a semantically similar query was cached."""
        query_emb = self._get_embedding(query)

        # Linear scan over every cached embedding; fine for small
        # caches, but use a vector index once this grows large
        cached = cache.hgetall(self.embedding_cache_key)
        for key, emb_json in cached.items():
            cached_emb = json.loads(emb_json)
            similarity = self._cosine_similarity(
                query_emb, cached_emb
            )
            if similarity >= self.threshold:
                response = cache.hget(
                    self.response_cache_key, key
                )
                if response:
                    return response.decode()

        return None

    def set(
        self, query: str, response: str, ttl: int = 3600
    ):
        """Cache a query-response pair."""
        query_emb = self._get_embedding(query)
        key = hashlib.md5(query.encode()).hexdigest()
        cache.hset(
            self.embedding_cache_key,
            key,
            json.dumps(query_emb),
        )
        cache.hset(self.response_cache_key, key, response)
        # Redis hashes expire as a whole, so the TTL applies to
        # the entire cache rather than to individual entries
        cache.expire(self.embedding_cache_key, ttl)
        cache.expire(self.response_cache_key, ttl)

Optimization 2: Async Parallel Retrieval

When searching multiple sources, run them concurrently:

import asyncio
from typing import Any

async def async_embed(text: str) -> list[float]:
    """Non-blocking embedding call."""
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        lambda: client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
    )
    return response.data[0].embedding

async def async_search(
    vectorstore, query_embedding: list[float], k: int
) -> list[dict]:
    """Non-blocking vector search."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None,
        lambda: vectorstore.search_by_vector(
            query_embedding, k=k
        )
    )

async def optimized_retrieval(
    query: str,
    vectorstores: list,
    k_per_store: int = 3,
) -> list[dict]:
    """Search all vector stores in parallel."""
    # Single embedding call shared across all stores
    query_embedding = await async_embed(query)

    # Search all stores concurrently
    tasks = [
        async_search(vs, query_embedding, k_per_store)
        for vs in vectorstores
    ]
    results = await asyncio.gather(*tasks)

    # Flatten and return
    return [doc for store_results in results
            for doc in store_results]
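
A self-contained way to see the fan-out's effect is to time stub stores that each simulate network latency (StubStore and its delays are hypothetical; real stores would be wrapped as above):

```python
import asyncio
import time

class StubStore:
    """Hypothetical store whose search just sleeps to mimic latency."""
    def __init__(self, name: str, delay: float):
        self.name, self.delay = name, delay

    async def search(self, query_embedding, k: int):
        await asyncio.sleep(self.delay)  # simulated network round trip
        return [{"store": self.name, "rank": i} for i in range(k)]

async def fan_out(stores, query_embedding, k=2):
    tasks = [s.search(query_embedding, k) for s in stores]
    results = await asyncio.gather(*tasks)
    return [doc for store_results in results for doc in store_results]

stores = [StubStore(name, 0.1) for name in ("a", "b", "c")]
start = time.perf_counter()
docs = asyncio.run(fan_out(stores, [0.0]))
elapsed = time.perf_counter() - start
# Concurrent: elapsed is close to the slowest store (~0.1s),
# not the ~0.3s a sequential loop over three stores would take
```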

Optimization 3: Matryoshka Embeddings

Modern embedding models like text-embedding-3-small support dimensionality reduction through Matryoshka representation learning: the leading dimensions of the full vector carry most of the signal, so embeddings can be truncated. Shorter embeddings mean faster similarity computation:


def get_compact_embedding(
    text: str, dimensions: int = 256
) -> list[float]:
    """Get a reduced-dimension embedding for faster search.
    text-embedding-3-small natively supports 256, 512,
    or 1536 dimensions."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dimensions,  # Reduce from 1536 to 256
    )
    return response.data[0].embedding

# 256-dim embeddings are 6x smaller and search is
# approximately 4x faster with minimal quality loss
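
A quick, self-contained sanity check of the size claim: build two brute-force indexes of random unit vectors at 256 and 1536 dimensions and compare their footprint (the corpus here is synthetic).

```python
import numpy as np

def build_index(n: int, dims: int, seed: int = 0) -> np.ndarray:
    """Random normalized float32 vectors standing in for a real corpus."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((n, dims)).astype(np.float32)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

small = build_index(10_000, 256)
large = build_index(10_000, 1536)

# With unit vectors, cosine search is a single matrix-vector product
query = small[0]
scores = small @ query
best = int(np.argmax(scores))  # the query matches itself

size_ratio = large.nbytes / small.nbytes  # 1536 / 256 = 6x
```

The smaller index also fits more readily in memory or CPU cache, which is where much of the search speedup comes from.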

Optimization 4: Streaming Generation

The LLM generation step dominates latency. Streaming gives users immediate feedback:

def streaming_rag(
    query: str,
    context: str,
):
    """Stream the RAG response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\n"
                       f"Question: {query}"
        }],
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
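
Consuming the stream is just iterating the generator; the metric to watch is time to first token. The stub below stands in for streaming_rag so the pattern runs offline (the token list and per-token delay are made up):

```python
import time

def stub_streaming_rag(query: str, context: str):
    """Offline stand-in for the OpenAI stream above."""
    for token in ["The ", "answer ", "is ", "42."]:
        time.sleep(0.01)  # simulated per-token latency
        yield token

start = time.perf_counter()
first_token_ms = None
parts = []
for token in stub_streaming_rag("q", "ctx"):
    if first_token_ms is None:
        # Users start reading here, long before the answer completes
        first_token_ms = (time.perf_counter() - start) * 1000
    parts.append(token)
full_answer = "".join(parts)
```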

Optimization 5: Pre-Computed Popular Queries

For queries that follow predictable patterns, pre-compute and cache results during off-peak hours:

from datetime import datetime

def precompute_popular_queries(
    popular_queries: list[str],
    rag_pipeline,
    semantic_cache: SemanticCache,
):
    """Pre-compute answers for frequently asked questions
    during off-peak hours."""
    for query in popular_queries:
        # Check if already cached and fresh
        cached = semantic_cache.get(query)
        if cached:
            continue

        # Generate and cache
        answer = rag_pipeline.answer(query)
        semantic_cache.set(query, answer, ttl=86400)

    print(
        f"Pre-computed {len(popular_queries)} queries "
        f"at {datetime.now()}"
    )
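
One hedged way to build the popular_queries list is to tally a normalized query log and keep the most frequent entries (top_queries and the sample log are illustrative):

```python
from collections import Counter

def top_queries(query_log: list[str], n: int = 100) -> list[str]:
    """Most frequent queries after light normalization."""
    normalized = [q.strip().lower() for q in query_log]
    return [query for query, _ in Counter(normalized).most_common(n)]

log = [
    "How do I reset my password?",
    "how do i reset my password?  ",
    "What does the pro plan cost?",
    "what does the pro plan cost?",
    "Can I export my data?",
]
popular = top_queries(log, n=2)
```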

Combined Pipeline with All Optimizations

When you apply all these optimizations together, the typical latency profile changes dramatically. Cache hits return in under 100ms. Cache misses with parallel retrieval and streaming return the first token in 300-500ms. The user perceives near-instant responses for common queries and fast streaming for novel ones.
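
The control flow of the combined pipeline can be sketched with stubbed components (DictCache, stub_store, and stub_generate are placeholders; a real version would wire in SemanticCache, optimized_retrieval, and streaming_rag):

```python
import asyncio

class DictCache:
    """Exact-match stand-in for the semantic cache."""
    def __init__(self):
        self._store = {}
    def get(self, query):
        return self._store.get(query)
    def set(self, query, answer):
        self._store[query] = answer

async def stub_store(query_embedding):
    await asyncio.sleep(0)  # pretend network hop
    return ["retrieved doc"]

def stub_generate(query, context):  # stand-in for streaming_rag
    yield "stub answer"

async def cached_rag(query, cache, stores, generate):
    hit = cache.get(query)
    if hit is not None:
        return hit, True  # fast path: no retrieval, no LLM call
    query_embedding = [0.0]  # real code: await async_embed(query)
    results = await asyncio.gather(*(s(query_embedding) for s in stores))
    context = " ".join(doc for batch in results for doc in batch)
    answer = "".join(generate(query, context))  # drain the token stream
    cache.set(query, answer)
    return answer, False

cache = DictCache()
first, was_hit_1 = asyncio.run(
    cached_rag("q", cache, [stub_store, stub_store], stub_generate))
second, was_hit_2 = asyncio.run(
    cached_rag("q", cache, [stub_store, stub_store], stub_generate))
```

The first call pays the full miss path and populates the cache; the second returns before any retrieval or generation happens.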

FAQ

What cache hit rate should I expect?

In production RAG systems with enterprise users, cache hit rates of 30-50% are common because users often ask variations of the same questions. Consumer-facing systems see lower hit rates (10-20%) due to query diversity. Even a 30% hit rate means nearly a third of your queries return instantly.
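
The arithmetic behind that claim, made explicit (the 50ms hit and 2000ms miss figures are illustrative):

```python
def expected_latency_ms(hit_rate: float,
                        hit_ms: float = 50,
                        miss_ms: float = 2000) -> float:
    """Expected latency as a weighted average of the two paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

no_cache = expected_latency_ms(0.0)     # every query pays full cost
enterprise = expected_latency_ms(0.30)  # ~30% hit rate
```

At a 30% hit rate the average drops from 2000ms to roughly 1415ms, and every hit individually feels instant, which matters more to users than the mean.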

Does reducing embedding dimensions hurt retrieval quality?

At 256 dimensions (down from 1536), text-embedding-3-small retains approximately 95% of its retrieval quality on standard benchmarks. For most applications, this is an excellent tradeoff. If you work in a domain with very fine-grained semantic distinctions (like legal or medical), test on your specific evaluation set before committing to reduced dimensions.

Should I optimize the retrieval pipeline or the generation step first?

Optimize generation first with streaming — it gives the biggest perceived latency improvement because users see tokens appearing immediately instead of waiting for the full response. Then add semantic caching, which eliminates both retrieval and generation latency for repeated queries. Async retrieval and embedding optimization are worthwhile refinements after those two are in place.


#RAGOptimization #LatencyReduction #Caching #AsyncRetrieval #Performance #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
