
Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts

Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.

What Is Semantic Memory?

In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.

The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.
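As a toy illustration of that idea, here are hand-made three-dimensional vectors standing in for real embeddings (the dimensions and values are invented for this example, not produced by any model). A paraphrase of the query scores high; an unrelated sentence scores near zero:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: near 1.0 for aligned vectors, near 0.0 for orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors; imagine the dimensions roughly meaning
# (pricing, billing-period, weather).
query        = np.array([0.9, 0.8, 0.0])  # "monthly subscription cost"
pricing_fact = np.array([1.0, 0.7, 0.1])  # "The plan is priced at $49/month"
weather_fact = np.array([0.0, 0.1, 1.0])  # "It rained in Boston yesterday"

print(cosine(query, pricing_fact))  # high similarity despite no shared words
print(cosine(query, weather_fact))  # low similarity
```

Real embedding models do the same thing in hundreds or thousands of dimensions, learned from data rather than assigned by hand.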

Generating Embeddings

Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.

import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. For higher accuracy at a higher price, text-embedding-3-large produces 3072-dimensional vectors.
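Because each embedding call is a billed API request, repeated texts are worth caching. Here is a minimal sketch of a cache wrapper (make_cached_embedder and fake_embed are illustrative names, not library APIs; in real use you would pass embed_text from above):

```python
import hashlib
from typing import Callable, Dict, List

def make_cached_embedder(embed_fn: Callable[[str], List[float]]):
    """Wrap an embedding function with an in-memory cache keyed by a hash
    of the text, so identical texts never trigger a second API call."""
    cache: Dict[str, List[float]] = {}

    def cached(text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)
        return cache[key]

    return cached

# Demo with a stand-in embedder that records how often it is called:
calls = []
def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text)), 1.0]

embed = make_cached_embedder(fake_embed)
embed("hello"); embed("hello"); embed("world")
print(len(calls))  # 2 — the repeated "hello" hit the cache
```

For long-lived agents, the same idea extends naturally to a persistent cache (a local key-value store) so embeddings survive restarts.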

Building a Semantic Memory Store

Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.


from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)

class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)

        # Near-duplicates: update the existing memory instead of adding a copy
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.importance = max(existing.importance, importance)
            existing.last_accessed = datetime.utcnow()
            return existing

        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory

    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )

        # Update access metadata
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()

        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
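The retrieval path in _find_similar is just filter, score, sort. A self-contained sketch of that loop on toy data (no API calls; the vectors are hand-made stand-ins for real embeddings, and find_similar here is a simplified standalone version, not the class method):

```python
import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# (content, embedding, category) triples standing in for SemanticMemory objects.
memories = [
    ("The plan is priced at $49/month",     [1.0, 0.7, 0.1], "billing"),
    ("Support hours are 9am-5pm EST",       [0.1, 0.2, 0.9], "support"),
    ("Annual billing gets a 10% discount",  [0.9, 0.8, 0.2], "billing"),
]

def find_similar(query_vec, threshold=0.7, category=None):
    scored = []
    for content, vec, cat in memories:
        if category is not None and cat != category:
            continue  # category filter, as in the class method
        score = cosine(query_vec, vec)
        if score >= threshold:
            scored.append((content, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored

query = [0.95, 0.75, 0.1]  # stand-in for embed_text("monthly subscription cost")
for content, score in find_similar(query, category="billing"):
    print(f"{score:.3f}  {content}")
```

With a handful of memories a linear scan like this is fine; at tens of thousands of entries you would swap the loop for a vector index (FAISS, pgvector, or a managed vector database) without changing the interface.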

Relevance-Weighted Retrieval

Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.

import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay

    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )

This formula ensures that recent, important memories rank higher when similarity scores are close.
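To see the reordering in action, here is the same formula inlined as a standalone function (relevance and its argument names are illustrative) applied to two memories whose raw similarities are nearly tied:

```python
import math

def relevance(similarity, hours_since_access, importance,
              recency_weight=0.2, importance_weight=0.15):
    """Same weighting as compute_relevance, on plain numbers."""
    recency = math.exp(-0.01 * hours_since_access)  # exponential decay
    return ((1 - recency_weight - importance_weight) * similarity
            + recency_weight * recency
            + importance_weight * importance)

# Two memories with nearly identical similarity to the query:
stale_minor = relevance(similarity=0.80, hours_since_access=72, importance=0.3)
fresh_vital = relevance(similarity=0.78, hours_since_access=1,  importance=0.9)
print(f"stale+minor: {stale_minor:.3f}  fresh+vital: {fresh_vital:.3f}")
```

The fresher, more important memory wins despite a slightly lower raw similarity, which is exactly the behavior you want when similarity alone cannot break the tie.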

Memory Consolidation

Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.

def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()

    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the content of the more important memory
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                    mem_a.importance = mem_b.importance
                mem_a.access_count += mem_b.access_count
                mem_a.last_accessed = max(mem_a.last_accessed, mem_b.last_accessed)
                skip_indices.add(j)
                merged_count += 1

    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count

FAQ

How do I choose the right similarity threshold?

Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.

Are there alternatives to OpenAI embeddings?

Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.

How do I handle memory that becomes outdated?

Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
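A minimal TTL sweep might look like this (Memory and sweep_expired are illustrative names, not part of the store above; a real integration would add a ttl field to SemanticMemory instead of using tuples):

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (content, created_at, ttl) triples; ttl=None means the fact never expires.
Memory = Tuple[str, datetime, Optional[timedelta]]

def sweep_expired(memories: List[Memory], now: datetime) -> List[Memory]:
    """Drop memories whose TTL has elapsed; keep everything else."""
    return [
        (content, created, ttl)
        for content, created, ttl in memories
        if ttl is None or now - created < ttl
    ]

now = datetime(2025, 6, 1)
memories = [
    ("User's address: 12 Oak St", now - timedelta(days=400), timedelta(days=365)),
    ("User prefers email contact", now - timedelta(days=10), None),
]
alive = sweep_expired(memories, now)
print([content for content, _, _ in alive])  # the expired address entry is gone
```

Running the sweep on a schedule (for example, alongside consolidation) keeps stale facts from ever reaching the retrieval path.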


#SemanticMemory #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
