
UFO Memory and Learning: How the Agent Remembers Successful Task Patterns

Learn how Microsoft UFO's experience learning system stores successful task executions, retrieves relevant past patterns for new tasks, and optimizes performance through memory-based action prediction.

Why Agent Memory Matters

Without memory, every UFO task starts from scratch. The agent has no recollection of successfully completing the same task yesterday or of discovering that a particular sequence of clicks is the fastest way to apply a filter in Excel. Every execution involves the same number of LLM calls, the same trial-and-error, and the same cost.

UFO addresses this with an experience learning system that records successful task executions and retrieves relevant experiences when handling new tasks. This is functionally Retrieval-Augmented Generation (RAG) applied to UI automation: past executions form the knowledge base, and each new task is the retrieval query.

How Experience Learning Works

UFO's memory system operates in three phases: record, index, and retrieve.

Phase 1: Recording Experiences

After a task completes successfully, UFO serializes the entire execution trace — every observation, action, and outcome — into a structured experience record:

from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
import json
import uuid

@dataclass
class TaskExperience:
    """A recorded successful task execution."""
    task_id: str
    task_description: str
    application: str
    steps: list[dict]
    total_steps: int
    start_time: datetime
    end_time: datetime
    success: bool
    metadata: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "task_id": self.task_id,
            "task_description": self.task_description,
            "application": self.application,
            "steps": self.steps,
            "total_steps": self.total_steps,
            "duration_seconds": (self.end_time - self.start_time).total_seconds(),
            "success": self.success,
            "metadata": self.metadata,
        }


def record_experience(task: str, execution_trace: list[dict]) -> TaskExperience:
    """Record a successful task execution for future reference."""
    experience = TaskExperience(
        task_id=str(uuid.uuid4()),
        task_description=task,
        application=execution_trace[0].get("application", "Unknown"),
        steps=[
            {
                "step_number": step["step"],
                "observation": step["thought"],
                "action_type": step["action_type"],
                "target_control": step.get("control_text", ""),
                "parameters": step.get("parameters", {}),
                "result": step.get("result", "success"),
            }
            for step in execution_trace
        ],
        total_steps=len(execution_trace),
        start_time=execution_trace[0]["timestamp"],
        end_time=execution_trace[-1]["timestamp"],
        success=True,
    )

    # Save to disk, creating the database directory on first use
    save_dir = Path("experience_db")
    save_dir.mkdir(exist_ok=True)
    save_path = save_dir / f"{experience.task_id}.json"
    with open(save_path, "w") as f:
        json.dump(experience.to_dict(), f, indent=2, default=str)

    return experience
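For reference, the snippet below shows the kind of execution_trace entry that record_experience consumes. The field names mirror the mapping above, but they are illustrative assumptions, not UFO's actual log schema:

```python
from datetime import datetime

# Hypothetical trace of a two-step Excel task (assumed field names).
trace = [
    {
        "step": 1,
        "application": "Excel",
        "thought": "The Data tab contains the Filter button.",
        "action_type": "click",
        "control_text": "Data",
        "parameters": {},
        "result": "success",
        "timestamp": datetime(2024, 5, 1, 9, 0, 0),
    },
    {
        "step": 2,
        "application": "Excel",
        "thought": "Clicking Filter applies it to the current selection.",
        "action_type": "click",
        "control_text": "Filter",
        "parameters": {},
        "result": "success",
        "timestamp": datetime(2024, 5, 1, 9, 0, 12),
    },
]

# The same mapping record_experience applies to each trace entry.
steps = [
    {
        "step_number": s["step"],
        "observation": s["thought"],
        "action_type": s["action_type"],
        "target_control": s.get("control_text", ""),
        "parameters": s.get("parameters", {}),
        "result": s.get("result", "success"),
    }
    for s in trace
]

# Duration is derived from the first and last timestamps.
duration = (trace[-1]["timestamp"] - trace[0]["timestamp"]).total_seconds()
```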

Phase 2: Indexing With Embeddings

Stored experiences are indexed using text embeddings so they can be retrieved by semantic similarity:

from pathlib import Path

from openai import OpenAI
import numpy as np

client = OpenAI()

def create_experience_index(experiences_dir: str) -> dict:
    """Build a vector index of task experiences."""
    index = {"embeddings": [], "task_ids": [], "descriptions": []}

    for exp_file in Path(experiences_dir).glob("*.json"):
        with open(exp_file) as f:
            exp = json.load(f)

        # Create embedding from task description + key actions
        summary = build_experience_summary(exp)

        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=summary
        )

        index["embeddings"].append(response.data[0].embedding)
        index["task_ids"].append(exp["task_id"])
        index["descriptions"].append(summary)

    # Convert to numpy for efficient similarity search
    index["embeddings"] = np.array(index["embeddings"])
    return index


def build_experience_summary(experience: dict) -> str:
    """Create a searchable summary of an experience."""
    steps_summary = " -> ".join(
        f"{s['action_type']}({s['target_control']})"
        for s in experience["steps"][:10]
    )
    return (
        f"Task: {experience['task_description']} "
        f"App: {experience['application']} "
        f"Steps: {steps_summary}"
    )
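Re-embedding every experience on each run wastes API calls, so the index is worth persisting between sessions. A minimal sketch, assuming a save_index/load_index pair of our own naming rather than a UFO API: the embedding matrix goes into a .npz file and the string metadata into a JSON sidecar.

```python
import json
import os
import tempfile

import numpy as np

def save_index(index: dict, path: str) -> None:
    """Persist the index: embeddings as .npz, metadata as JSON (sketch)."""
    np.savez(path + ".npz", embeddings=index["embeddings"])
    with open(path + ".json", "w") as f:
        json.dump(
            {"task_ids": index["task_ids"], "descriptions": index["descriptions"]},
            f,
        )

def load_index(path: str) -> dict:
    """Reload a persisted index into the in-memory dict shape used above."""
    data = np.load(path + ".npz")
    with open(path + ".json") as f:
        meta = json.load(f)
    return {"embeddings": data["embeddings"], **meta}

# Round-trip demo with a toy two-experience index.
toy = {"embeddings": np.eye(2), "task_ids": ["a", "b"], "descriptions": ["x", "y"]}
path = os.path.join(tempfile.mkdtemp(), "experience_index")
save_index(toy, path)
restored = load_index(path)
```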

Phase 3: Retrieving Relevant Experiences

When a new task arrives, UFO searches the index for similar past experiences:

def retrieve_relevant_experiences(
    new_task: str,
    index: dict,
    top_k: int = 3,
    similarity_threshold: float = 0.75,
) -> list[dict]:
    """Find past experiences relevant to the new task."""
    # Embed the new task
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=new_task
    )
    query_embedding = np.array(response.data[0].embedding)

    # Cosine similarity search
    similarities = np.dot(index["embeddings"], query_embedding) / (
        np.linalg.norm(index["embeddings"], axis=1)
        * np.linalg.norm(query_embedding)
    )

    # Filter by threshold and get top-k
    candidates = [
        (i, sim) for i, sim in enumerate(similarities)
        if sim >= similarity_threshold
    ]
    candidates.sort(key=lambda x: x[1], reverse=True)
    top_candidates = candidates[:top_k]

    # Load full experience records
    results = []
    for idx, score in top_candidates:
        task_id = index["task_ids"][idx]
        with open(f"experience_db/{task_id}.json") as f:
            exp = json.load(f)
        exp["similarity_score"] = float(score)
        results.append(exp)

    return results
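The similarity math can be exercised without any embedding-API calls. The helper below isolates the same cosine ranking and threshold filter on toy three-dimensional vectors (rank_by_similarity is an illustrative name, not part of UFO):

```python
import numpy as np

def rank_by_similarity(query, embeddings, threshold=0.75, top_k=3):
    """Cosine-similarity ranking, as in retrieve_relevant_experiences,
    isolated so it can be tested with toy vectors."""
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )
    hits = [(i, s) for i, s in enumerate(sims) if s >= threshold]
    hits.sort(key=lambda x: x[1], reverse=True)
    return hits[:top_k]

# Toy "embeddings": rows 0 and 2 point roughly along the query direction,
# row 1 is orthogonal and should be filtered out by the threshold.
emb = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.2, 0.1],
])
query = np.array([1.0, 0.0, 0.0])
hits = rank_by_similarity(query, emb)
print(hits)
```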

Injecting Memory Into the Prompt

Retrieved experiences are included in the GPT-4V prompt as few-shot examples, giving the model a proven action sequence to follow:

def build_prompt_with_memory(
    task: str,
    screenshot: str,
    controls: list[dict],
    relevant_experiences: list[dict],
) -> str:
    """Build the text part of the action prompt, enriched with past
    experiences. The screenshot and control list are attached separately
    as the multimodal parts of the message."""
    experience_text = ""
    if relevant_experiences:
        experience_text = "\n\nRelevant past experiences:\n"
        for exp in relevant_experiences:
            experience_text += f"\nTask: {exp['task_description']}\n"
            experience_text += f"Similarity: {exp['similarity_score']:.2f}\n"
            experience_text += "Successful steps:\n"
            for step in exp["steps"]:
                experience_text += (
                    f"  {step['step_number']}. {step['action_type']}"
                    f"({step['target_control']}) - {step['observation']}\n"
                )

    return f"""Task: {task}
{experience_text}

Based on the annotated screenshot and any relevant past experience,
select the next action. Past experiences are suggestions — adapt them
to the current UI state if controls have changed."""

Performance Impact

Memory reduces both cost and execution time:

  • Fewer exploratory actions — the agent follows proven paths instead of experimenting
  • Lower token usage — successful patterns provide shorter reasoning chains
  • Better first-attempt accuracy — relevant examples guide the model toward correct actions
# Measuring memory impact
def compare_with_without_memory(task: str):
    """Run the same task with and without memory retrieval."""
    # Without memory
    result_no_mem = run_ufo_task(task, use_memory=False)

    # With memory
    result_with_mem = run_ufo_task(task, use_memory=True)

    print(f"Without memory: {result_no_mem['steps']} steps, "
          f"${result_no_mem['cost']:.3f}")
    print(f"With memory: {result_with_mem['steps']} steps, "
          f"${result_with_mem['cost']:.3f}")
    print(f"Step reduction: "
          f"{(1 - result_with_mem['steps']/result_no_mem['steps'])*100:.0f}%")

In practice, memory-augmented execution typically reduces step count by 20-40% for tasks similar to previously recorded experiences.

FAQ

How much storage does the experience database require?

Each experience record is a JSON file of 5-50 KB depending on task complexity. The embeddings index adds roughly 6 KB per experience (1536-dimensional float32 vector). A database of 1,000 experiences takes approximately 50-60 MB total — negligible on modern systems.
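The arithmetic behind these figures is easy to verify; the average record size below is an assumption chosen from the upper end of the stated range:

```python
# Back-of-the-envelope check of the storage figures above.
dims = 1536                 # text-embedding-3-small vector size
embedding_bytes = dims * 4  # float32: 4 bytes per dimension -> 6144 B (~6 KB)

avg_record_kb = 50          # assumed average, upper end of the 5-50 KB range
n_experiences = 1_000
total_mb = n_experiences * (avg_record_kb + embedding_bytes / 1024) / 1024
print(f"~{total_mb:.0f} MB for {n_experiences} experiences")
```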

Does UFO learn from failed tasks?

By default, UFO only records successful completions. However, you can configure it to also record failures and use them as negative examples in the prompt — telling the model "this approach was tried and failed" to steer it toward alternative strategies.
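A minimal sketch of what that negative-example formatting could look like; format_negative_examples and the record shape are assumptions, since UFO does not store failures out of the box:

```python
def format_negative_examples(failed_experiences: list[dict]) -> str:
    """Render failed attempts as warnings for the prompt (a sketch;
    the record fields mirror the TaskExperience shape used above)."""
    if not failed_experiences:
        return ""
    text = "\nApproaches that were tried and FAILED; avoid repeating them:\n"
    for exp in failed_experiences:
        step_chain = " -> ".join(
            f"{s['action_type']}({s['target_control']})" for s in exp["steps"]
        )
        text += f"- {exp['task_description']}: {step_chain}\n"
    return text

# Example: one failed attempt rendered as a warning block.
warning = format_negative_examples([{
    "task_description": "Apply a filter in Excel",
    "steps": [{"action_type": "click", "target_control": "Home"}],
}])
print(warning)
```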

Can experiences transfer between machines with different screen resolutions?

Experiences are stored as abstract action sequences (click control type X, type text Y) rather than pixel coordinates, so they transfer well between machines. The vision model adapts to different layouts and resolutions when following experience-suggested action sequences.


#AgentMemory #ExperienceLearning #RAG #TaskPatterns #MicrosoftUFO #PerformanceOptimization #AIMemory #VectorSearch

Written by

CallSphere Team