Evaluating Agent Memory: Recall, Precision, and the Eval Pipeline Most Teams Don't Build
Memory is supposed to make agents better — but does it? Build a memory eval pipeline that measures recall, precision, contradiction rate, and the freshness/staleness tradeoff.
TL;DR
Most teams ship agent memory the same way they ship retrieval: turn it on, observe that it sometimes does the right thing, and call it done. Then six months later the agent confidently tells a user her spouse's name is wrong, the support team files a P1, and nobody can explain how it happened because nobody ever measured. This post is the eval pipeline I wish more teams ran from week one — a multi-turn dataset that plants facts on turn 1 and probes for them on turn 5+, four metrics that actually matter (recall@k, precision, contradiction rate, staleness handling), and a working LangSmith multi-turn evaluator. Numbers from our pinned benchmark on `gpt-4o-2024-11-20` and `gpt-4o-mini-2024-07-18` at the end.
Why Single-Turn Evals Cannot Catch Memory Bugs
The standard agent eval — input → output → score — cannot evaluate memory by construction. Memory is, definitionally, a property that emerges across turns. If you only score one turn, you are scoring the prompt and the model, not the memory layer. To detect "the agent forgot what the user said three turns ago" you need a dataset where the correct answer at turn N depends on information given at turn 1.
This is the same insight behind tool-use evals — see our end-to-end multi-step evaluation post for the broader pattern. Memory is just the special case where the "tool" being measured is the agent's own state.
The Multi-Turn Dataset Schema
Every memory eval row is a scripted conversation with planted facts and probe turns. The minimal schema:
```python
from pydantic import BaseModel
from typing import Literal


class Turn(BaseModel):
    role: Literal["user", "agent"]
    content: str
    # Annotations used by the evaluator (not shown to the agent):
    plants: list[str] = []       # facts this turn introduces
    probes: list[str] = []       # facts this turn should recall
    contradicts: list[str] = []  # facts this turn supersedes (staleness test)


class MemoryEvalCase(BaseModel):
    case_id: str
    user_id: str  # determines memory namespace
    turns: list[Turn]
    expected_facts: dict[str, str]  # canonical (subject:predicate) -> object map
```
A real case from our internal suite:
```python
case = MemoryEvalCase(
    case_id="rec-prefname-001",
    user_id="eval-user-7421",
    expected_facts={
        "user:preferred_name": "Sam",
        "user:timezone": "America/Los_Angeles",
    },
    turns=[
        Turn(role="user",
             content="Hi, please call me Sam — Samantha feels too formal.",
             plants=["user:preferred_name=Sam"]),
        Turn(role="agent", content="Got it, Sam."),
        Turn(role="user",
             content="I'm based in LA so afternoons work best.",
             plants=["user:timezone=America/Los_Angeles"]),
        # ... three filler turns about something else ...
        Turn(role="user",
             content="What time should we schedule? You know what works for me.",
             probes=["user:timezone=America/Los_Angeles",
                     "user:preferred_name=Sam"]),
    ],
)
```
The probe turn is where evaluation happens. The agent's response should reflect both planted facts. If it asks "what is your name?" it failed recall. If it suggests a 9 AM EST slot it failed timezone recall. If it calls the user "Samantha" it actively contradicted memory.
The Four Metrics
Memory quality is multi-dimensional. A single number hides regressions. We track four:
| Metric | Question it answers | Failure mode it catches |
|---|---|---|
| Recall@k | Of the planted facts that should have been retrieved, how many were? | Agent forgot |
| Precision | Of the facts the agent used, how many were actually true? | Agent hallucinated a memory |
| Contradiction rate | Did the agent assert something that contradicts a stored fact? | Memory and reasoning disagree |
| Staleness handling | When a fact is superseded mid-conversation, does the agent update? | Agent stuck on old info |
Recall and precision are obvious. Contradiction rate is the one that catches the worst bugs — when the store has the right fact but the agent ignores it because retrieval did not surface it, or because the prompt was poorly structured. Staleness handling catches the opposite failure: the agent retrieved correctly but failed to update when the user said "actually, I moved."
The Multi-Turn Evaluator
LangSmith does not natively know your agent has memory. You drive the conversation yourself, then score the final state. Here is the working pattern:
```python
# pip install langsmith==0.2.4 langgraph==0.2.55
from langsmith import Client, evaluate

from my_agent import build_agent_with_memory

client = Client()


async def run_conversation(case: dict) -> dict:
    """Replay the scripted conversation and capture probe-turn responses."""
    agent = build_agent_with_memory(
        model="gpt-4o-2024-11-20",
        user_id=case["user_id"],
    )

    probe_responses = []  # probe-turn replies plus the facts asserted in them
    for turn in case["turns"]:
        if turn["role"] != "user":
            continue  # scripted agent turns are not replayed; the agent answers live
        result = await agent.ainvoke(
            {"messages": [{"role": "user", "content": turn["content"]}]}
        )
        agent_reply = result["messages"][-1].content
        if turn.get("probes"):
            probe_responses.append({
                "turn_content": turn["content"],
                "probes": turn["probes"],
                "reply": agent_reply,
                "facts_used": _extract_asserted_facts(agent_reply),
            })

    return {"probe_responses": probe_responses, "case_id": case["case_id"]}
# --- Evaluators ---
def recall_at_k(run, example):
    """Fraction of probed facts that appear in the agent's reply."""
    hits, total = 0, 0
    for pr in run.outputs["probe_responses"]:
        for probe in pr["probes"]:
            total += 1
            if _fact_satisfied(probe, pr["reply"]):
                hits += 1
    return {"key": "recall_at_k", "score": hits / total if total else 1.0}
def contradiction_rate(run, example):
    """Fraction of probe turns where the reply contradicts a planted fact."""
    expected = example.outputs["expected_facts"]
    contradictions = 0
    for pr in run.outputs["probe_responses"]:
        for fact in pr["facts_used"]:
            key = f"{fact['subject']}:{fact['predicate']}"
            if key in expected and expected[key] != fact["object"]:
                contradictions += 1
    total_probes = sum(len(pr["probes"]) for pr in run.outputs["probe_responses"])
    return {"key": "contradiction_rate",
            "score": contradictions / total_probes if total_probes else 0.0}
def precision(run, example):
    """Of the facts the agent asserted, fraction that match expected_facts."""
    expected = example.outputs["expected_facts"]
    asserted = [f for pr in run.outputs["probe_responses"] for f in pr["facts_used"]]
    if not asserted:
        return {"key": "precision", "score": 1.0}
    correct = sum(
        1 for f in asserted
        if expected.get(f"{f['subject']}:{f['predicate']}") == f["object"]
    )
    return {"key": "precision", "score": correct / len(asserted)}
# --- Run it ---
results = evaluate(
    run_conversation,
    data="memory-eval-suite-v3",
    evaluators=[recall_at_k, contradiction_rate, precision, staleness_handling],
    experiment_prefix="memory-gpt-4o-2024-11-20-pgvector",
    metadata={"agent_version": "v0.42.1", "store": "PostgresStore"},
    max_concurrency=4,
)
```
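The `staleness_handling` evaluator referenced above follows the same shape as the others. Here is a minimal sketch, assuming the example carries a hypothetical `superseded_facts` map (canonical key to the outdated value) alongside `expected_facts`; the field name and scoring rule are illustrative, not our exact implementation:

```python
def staleness_handling(run, example):
    """Fraction of superseded facts where the reply reflects the updated value
    and does not reuse the outdated one."""
    expected = example.outputs["expected_facts"]
    superseded = example.outputs.get("superseded_facts", {})  # hypothetical field
    if not superseded:
        return {"key": "staleness_handling", "score": 1.0}
    hits, total = 0, 0
    for pr in run.outputs["probe_responses"]:
        for key, old_value in superseded.items():
            # Only score probe turns that actually exercise the superseded fact.
            if not any(p.startswith(f"{key}=") for p in pr["probes"]):
                continue
            total += 1
            new_ok = _fact_satisfied(f"{key}={expected[key]}", pr["reply"])
            old_used = _fact_satisfied(f"{key}={old_value}", pr["reply"])
            if new_ok and not old_used:
                hits += 1
    return {"key": "staleness_handling", "score": hits / total if total else 1.0}
```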
The two helper functions `_extract_asserted_facts` and `_fact_satisfied` are themselves small LLM-as-judge calls (using `gpt-4o-mini-2024-07-18`) — pattern-matching on free-text replies is too brittle for anything past toy datasets. We covered that judging pattern in our LLM-as-judge post.
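For reference, a minimal sketch of what `_fact_satisfied` can look like as a judge call. The prompt wording and the direct use of the OpenAI client are assumptions for illustration, not the exact helper from our suite:

```python
from openai import OpenAI

_judge = OpenAI()


def _fact_satisfied(probe: str, reply: str) -> bool:
    """LLM-as-judge check: does `reply` behave consistently with the fact in `probe`?

    `probe` is a canonical "subject:predicate=value" string,
    e.g. "user:preferred_name=Sam". Prompt wording is illustrative.
    """
    key, _, value = probe.partition("=")
    resp = _judge.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer YES or NO only. Does the assistant reply behave "
                        "consistently with the given fact about the user?"},
            {"role": "user",
             "content": f"Fact: {key} = {value}\n\nAssistant reply:\n{reply}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```

`_extract_asserted_facts` works the same way, except the judge returns a structured list of (subject, predicate, object) triples instead of a YES/NO verdict.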
The Eval Loop, Visualized
```mermaid
flowchart LR
    A[Scripted multi-turn case] --> B[Drive turns through agent + memory store]
    B --> C[Capture probe-turn replies]
    C --> D[Extract asserted facts via judge LLM]
    D --> E1[recall@k]
    D --> E2[precision]
    D --> E3[contradiction rate]
    D --> E4[staleness handling]
    E1 --> F[LangSmith experiment]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G{regress vs baseline?}
    G -->|yes| H[Block merge]
    G -->|no| I[Ship + watch online evals]
    style B fill:#e0f2fe
    style D fill:#fef3c7
    style F fill:#dcfce7
```
Figure 1 — The memory eval pipeline. The critical detail: every metric ultimately resolves to a per-experiment score in LangSmith, so PR diffs are visual and merges are gated.
The Rubric — What Counts as a Pass
Numbers without thresholds are decoration. Our current bars, after about six months of tuning:
| Metric | Threshold to ship | Internal stretch goal |
|---|---|---|
| recall@5 | >= 0.85 | >= 0.92 |
| precision | >= 0.95 | >= 0.98 |
| contradiction_rate | <= 0.02 | <= 0.005 |
| staleness_handling | >= 0.80 | >= 0.90 |
Note the asymmetry: precision and contradiction rate have much tighter bars than recall. A memory miss is annoying ("I told you that already!"). A memory contradiction is brand damage ("the bot is making things up about me"). Optimize accordingly. We will accept a ship that drops recall by 3 points if it drops contradictions by 1 point — almost always.
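One way to keep these bars out of tribal knowledge is to encode them as data that the CI gate (shown later) reads. A minimal sketch; the structure and names are our own choice, not part of any framework:

```python
# Ship bars from the rubric table above, keyed by the evaluator's metric key.
# The direction flag lets one gate handle "higher is better" (recall, precision,
# staleness) and "lower is better" (contradiction rate) uniformly.
THRESHOLDS = {
    "recall_at_k":        (0.85, "higher_is_better"),
    "precision":          (0.95, "higher_is_better"),
    "contradiction_rate": (0.02, "lower_is_better"),
    "staleness_handling": (0.80, "higher_is_better"),
}
```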
Designing the Dataset — Where Most Teams Trip
The dataset is harder than the evaluator. A few patterns we converged on:
- Plant on turn 1, probe no earlier than turn 4. Anything closer and you are mostly testing the chat history sliding window, not the long-term store.
- Include filler turns. Real conversations are not "fact, probe, fact, probe." Mix in 2–4 unrelated turns between plant and probe. Forgetting under irrelevant load is the realistic failure mode.
- Plant contradictions deliberately. Turn 1: "I work at Acme." Turn 6: "I just started at Globex." Turn 9: "Where do I work?" Correct answer is Globex; old answer is Acme. This is the staleness test; a full case using this pattern is sketched right after this list.
- Use stable user IDs across the suite, not per-case. This catches cross-case bleed — if your namespacing is wrong, user-7421's facts will leak into user-7422's results and recall will look amazing for the wrong reason.
- Keep cases short (8–14 turns). Beyond ~14 turns the eval gets expensive and slow without meaningfully improving signal.
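Here is what the supersession pattern from the list looks like as a case, using the schema defined earlier. The wording, IDs, and filler placement are illustrative:

```python
staleness_case = MemoryEvalCase(
    case_id="stale-employer-003",
    user_id="eval-user-7421",  # same user ID as other cases, deliberately
    expected_facts={"user:employer": "Globex"},  # the *current* truth
    turns=[
        Turn(role="user", content="Quick background: I work at Acme.",
             plants=["user:employer=Acme"]),
        Turn(role="agent", content="Noted."),
        # ... filler turns about something unrelated ...
        Turn(role="user", content="Actually, I just started at Globex last week.",
             plants=["user:employer=Globex"],
             contradicts=["user:employer=Acme"]),
        # ... more filler ...
        Turn(role="user",
             content="Can you draft my out-of-office reply? Mention where I work.",
             probes=["user:employer=Globex"]),
    ],
)
```

The `staleness_handling` evaluator scores this as a pass only if the probe reply says Globex and never falls back to Acme.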
We currently run 180 cases per release. Total cost on `gpt-4o-2024-11-20`: about $4.20 per full eval pass, 11 minutes wall-clock at concurrency 4.
A Result From Our Pinned Benchmark
Same dataset, same agent code, two configurations, both pinned to `gpt-4o-2024-11-20` for the agent and `gpt-4o-mini-2024-07-18` for judges:
| Configuration | recall@5 | precision | contradiction | staleness |
|---|---|---|---|---|
| No long-term memory (sliding window only) | 0.41 | 0.99 | 0.001 | 0.12 |
| Naive memory (every fact saved, no schema) | 0.78 | 0.81 | 0.071 | 0.34 |
| Structured memory (Pydantic + supersession) | 0.92 | 0.97 | 0.008 | 0.86 |
The naive configuration is the cautionary tale. Recall jumped, but precision and contradictions got noticeably worse — the agent had more facts to draw from and a meaningful chunk of them were duplicates or stale. This is the empirical backing for the schema discipline we covered in the LangGraph memory architecture post: structure is not aesthetic, it is what makes precision survive scale.
Where to Wire the Eval
Two integration points matter:
- CI gate on PR. Run the full memory eval on any PR that touches memory code, prompts, or schemas. Block merge if any metric drops below threshold vs the last green main commit. We use the LangSmith CI integration pattern for this — same pipeline, different dataset. A minimal gate sketch follows this list.
- Nightly on a held-out subset. A 20-case subset runs every night against production-pinned models. If a vendor silently drifts a model snapshot, this catches it before a customer does.
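A minimal sketch of the gate itself, assuming you have already aggregated per-metric mean scores from the experiment into a dict (how you read results back from LangSmith depends on your setup, so that step is elided) and reusing the `THRESHOLDS` map sketched in the rubric section:

```python
import sys


def gate(metric_means: dict[str, float]) -> None:
    """Exit non-zero if any metric misses its bar, failing the CI job."""
    failures = []
    for metric, (bar, direction) in THRESHOLDS.items():
        score = metric_means[metric]
        ok = score >= bar if direction == "higher_is_better" else score <= bar
        if not ok:
            failures.append(f"{metric}: {score:.3f} vs bar {bar}")
    if failures:
        print("Memory eval gate failed:\n  " + "\n  ".join(failures))
        sys.exit(1)
    print("Memory eval gate passed.")
```

In CI this runs right after `evaluate(...)` completes; a non-zero exit blocks the merge like any other failing check.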
Online evals are weaker for memory than for single-turn quality, because in production you rarely have ground truth for "did the agent recall correctly?" Stick to offline for memory. Use online evals (latency, error rate, user thumbs-down rate) as a leading indicator that something is wrong, then reproduce in the offline suite.
What Most Teams Get Wrong
The pattern I see repeatedly:
- They build memory before they build the eval, ship it because "it seems to work in the demo," and accumulate quiet contradictions until a customer churns.
- They measure recall and ignore precision, which trains the team to optimize for "remembers more" instead of "remembers correctly."
- They evaluate at turn 2 instead of turn 8, which means the metric is mostly testing the prompt window, not the memory store.
- They never test staleness, so the agent confidently parrots a fact the user updated three turns ago.
The fix is not exotic. It is: write the dataset, write the four evaluators, wire them into CI, and treat regressions as bugs. The agents that survive 2026 in production will be the ones whose teams ran this loop on day one — not the ones that shipped the cleverest retrieval and hoped.