Evaluating Agent Memory: Recall, Precision, and the Eval Pipeline Most Teams Don't Build
Memory is supposed to make agents better — but does it? Build a memory eval pipeline that measures recall, precision, contradiction rate, and the freshness/staleness tradeoff.
TL;DR
Most teams ship agent memory the same way they ship retrieval: turn it on, observe that it sometimes does the right thing, and call it done. Then six months later the agent confidently tells a user her spouse's name is wrong, the support team files a P1, and nobody can explain how it happened because nobody ever measured. This post is the eval pipeline I wish more teams ran from week one — a multi-turn dataset that plants facts on turn 1 and probes for them on turn 5+, four metrics that actually matter (recall@k, precision, contradiction rate, staleness handling), and a working LangSmith multi-turn evaluator. Numbers from our pinned benchmark on `gpt-4o-2024-11-20` and `gpt-4o-mini-2024-07-18` at the end.
Why Single-Turn Evals Cannot Catch Memory Bugs
The standard agent eval — input → output → score — cannot evaluate memory by construction. Memory is, definitionally, a property that emerges across turns. If you only score one turn, you are scoring the prompt and the model, not the memory layer. To detect "the agent forgot what the user said three turns ago" you need a dataset where the correct answer at turn N depends on information given at turn 1.
This is the same insight behind tool-use evals — see our end-to-end multi-step evaluation post for the broader pattern. Memory is just the special case where the "tool" being measured is the agent's own state.
The Multi-Turn Dataset Schema
Every memory eval row is a scripted conversation with planted facts and probe turns. The minimal schema:
```python
from pydantic import BaseModel
from typing import Literal


class Turn(BaseModel):
    role: Literal["user", "agent"]
    content: str
    # Annotations used by the evaluator (not shown to the agent):
    plants: list[str] = []       # facts this turn introduces
    probes: list[str] = []       # facts this turn should recall
    contradicts: list[str] = []  # facts this turn supersedes (staleness test)


class MemoryEvalCase(BaseModel):
    case_id: str
    user_id: str  # determines memory namespace
    turns: list[Turn]
    expected_facts: dict[str, str]  # canonical (subject:predicate) -> object map
```
A real case from our internal suite:
```python
case = MemoryEvalCase(
    case_id="rec-prefname-001",
    user_id="eval-user-7421",
    expected_facts={
        "user:preferred_name": "Sam",
        "user:timezone": "America/Los_Angeles",
    },
    turns=[
        Turn(role="user",
             content="Hi, please call me Sam — Samantha feels too formal.",
             plants=["user:preferred_name=Sam"]),
        Turn(role="agent", content="Got it, Sam."),
        Turn(role="user",
             content="I'm based in LA so afternoons work best.",
             plants=["user:timezone=America/Los_Angeles"]),
        # ... three filler turns about something else ...
        Turn(role="user",
             content="What time should we schedule? You know what works for me.",
             probes=["user:timezone=America/Los_Angeles",
                     "user:preferred_name=Sam"]),
    ],
)
```
The probe turn is where evaluation happens. The agent's response should reflect both planted facts. If it asks "what is your name?" it failed recall. If it suggests a 9 AM EST slot it failed timezone recall. If it calls the user "Samantha" it actively contradicted memory.
The Four Metrics
Memory quality is multi-dimensional. A single number hides regressions. We track four:
| Metric | Question it answers | Failure mode it catches |
|---|---|---|
| Recall@k | Of the planted facts that should have been retrieved, how many were? | Agent forgot |
| Precision | Of the facts the agent used, how many were actually true? | Agent hallucinated a memory |
| Contradiction rate | Did the agent assert something that contradicts a stored fact? | Memory and reasoning disagree |
| Staleness handling | When a fact is superseded mid-conversation, does the agent update? | Agent stuck on old info |
Recall and precision are obvious. Contradiction rate is the one that catches the worst bugs — when the store has the right fact but the agent ignores it because retrieval did not surface it, or because the prompt was poorly structured. Staleness handling catches the opposite failure: the agent retrieved correctly but failed to update when the user said "actually, I moved."
The Multi-Turn Evaluator
LangSmith does not natively know your agent has memory. You drive the conversation yourself, then score the final state. Here is the working pattern:
```python
# pip install langsmith==0.2.4 langgraph==0.2.55
from langsmith import Client, evaluate

from my_agent import build_agent_with_memory

client = Client()


async def run_conversation(case: dict) -> dict:
    """Replay the scripted conversation and capture probe-turn responses."""
    agent = build_agent_with_memory(
        model="gpt-4o-2024-11-20",
        user_id=case["user_id"],
    )

    probe_responses = []  # probe-turn replies plus the facts asserted in them
    for turn in case["turns"]:
        if turn["role"] != "user":
            continue  # scripted agent turns are not replayed; the agent answers live
        result = await agent.ainvoke(
            {"messages": [{"role": "user", "content": turn["content"]}]}
        )
        agent_reply = result["messages"][-1].content
        if turn.get("probes"):
            probe_responses.append({
                "turn_content": turn["content"],
                "probes": turn["probes"],
                "reply": agent_reply,
                "facts_used": _extract_asserted_facts(agent_reply),
            })

    return {"probe_responses": probe_responses, "case_id": case["case_id"]}
# --- Evaluators ---
def recall_at_k(run, example):
    """Fraction of probed facts that appear in the agent's reply."""
    hits, total = 0, 0
    for pr in run.outputs["probe_responses"]:
        for probe in pr["probes"]:
            total += 1
            if _fact_satisfied(probe, pr["reply"]):
                hits += 1
    return {"key": "recall_at_k", "score": hits / total if total else 1.0}
def contradiction_rate(run, example):
    """Fraction of probe turns where the reply contradicts a planted fact."""
    expected = example.outputs["expected_facts"]
    contradictions = 0
    for pr in run.outputs["probe_responses"]:
        for fact in pr["facts_used"]:
            key = f"{fact['subject']}:{fact['predicate']}"
            if key in expected and expected[key] != fact["object"]:
                contradictions += 1
    total_probes = sum(len(pr["probes"]) for pr in run.outputs["probe_responses"])
    return {"key": "contradiction_rate",
            "score": contradictions / total_probes if total_probes else 0.0}
def precision(run, example):
    """Of the facts the agent asserted, fraction that match expected_facts."""
    expected = example.outputs["expected_facts"]
    asserted = [f for pr in run.outputs["probe_responses"] for f in pr["facts_used"]]
    if not asserted:
        return {"key": "precision", "score": 1.0}
    correct = sum(
        1 for f in asserted
        if expected.get(f"{f['subject']}:{f['predicate']}") == f["object"]
    )
    return {"key": "precision", "score": correct / len(asserted)}
# --- Run it ---
results = evaluate(
    run_conversation,
    data="memory-eval-suite-v3",
    evaluators=[recall_at_k, contradiction_rate, precision, staleness_handling],
    experiment_prefix="memory-gpt-4o-2024-11-20-pgvector",
    metadata={"agent_version": "v0.42.1", "store": "PostgresStore"},
    max_concurrency=4,
)
```
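The `staleness_handling` evaluator referenced above follows the same shape as the others. Here is a minimal sketch, assuming the example carries a hypothetical `superseded_facts` map (canonical key to the outdated value) alongside `expected_facts`; the field name and scoring rule are illustrative, not our exact implementation:

```python
def staleness_handling(run, example):
    """Fraction of superseded facts where the reply reflects the updated value
    and does not reuse the outdated one."""
    expected = example.outputs["expected_facts"]
    superseded = example.outputs.get("superseded_facts", {})  # hypothetical field
    if not superseded:
        return {"key": "staleness_handling", "score": 1.0}
    hits, total = 0, 0
    for pr in run.outputs["probe_responses"]:
        for key, old_value in superseded.items():
            # Only score probe turns that actually exercise the superseded fact.
            if not any(p.startswith(f"{key}=") for p in pr["probes"]):
                continue
            total += 1
            new_ok = _fact_satisfied(f"{key}={expected[key]}", pr["reply"])
            old_used = _fact_satisfied(f"{key}={old_value}", pr["reply"])
            if new_ok and not old_used:
                hits += 1
    return {"key": "staleness_handling", "score": hits / total if total else 1.0}
```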
The two helper functions `_extract_asserted_facts` and `_fact_satisfied` are themselves small LLM-as-judge calls (using `gpt-4o-mini-2024-07-18`) — pattern-matching on free-text replies is too brittle for anything past toy datasets. We covered that judging pattern in our LLM-as-judge post.
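For reference, a minimal sketch of what `_fact_satisfied` can look like as a judge call. The prompt wording and the direct use of the OpenAI client are assumptions for illustration, not the exact helper from our suite:

```python
from openai import OpenAI

_judge = OpenAI()


def _fact_satisfied(probe: str, reply: str) -> bool:
    """LLM-as-judge check: does `reply` behave consistently with the fact in `probe`?

    `probe` is a canonical "subject:predicate=value" string,
    e.g. "user:preferred_name=Sam". Prompt wording is illustrative.
    """
    key, _, value = probe.partition("=")
    resp = _judge.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer YES or NO only. Does the assistant reply behave "
                        "consistently with the given fact about the user?"},
            {"role": "user",
             "content": f"Fact: {key} = {value}\n\nAssistant reply:\n{reply}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```

`_extract_asserted_facts` works the same way, except the judge returns a structured list of (subject, predicate, object) triples instead of a YES/NO verdict.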
The Eval Loop, Visualized
```mermaid
flowchart LR
    A[Scripted multi-turn case] --> B[Drive turns through agent + memory store]
    B --> C[Capture probe-turn replies]
    C --> D[Extract asserted facts via judge LLM]
    D --> E1[recall@k]
    D --> E2[precision]
    D --> E3[contradiction rate]
    D --> E4[staleness handling]
    E1 --> F[LangSmith experiment]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G{regress vs baseline?}
    G -->|yes| H[Block merge]
    G -->|no| I[Ship + watch online evals]
    style B fill:#e0f2fe
    style D fill:#fef3c7
    style F fill:#dcfce7
```
Figure 1 — The memory eval pipeline. The critical detail: every metric ultimately resolves to a per-experiment score in LangSmith, so PR diffs are visual and merges are gated.
The Rubric — What Counts as a Pass
Numbers without thresholds are decoration. Our current bars, after about six months of tuning:
| Metric | Threshold to ship | Internal stretch goal |
|---|---|---|
| recall@5 | >= 0.85 | >= 0.92 |
| precision | >= 0.95 | >= 0.98 |
| contradiction_rate | <= 0.02 | <= 0.005 |
| staleness_handling | >= 0.80 | >= 0.90 |
Note the asymmetry: precision and contradiction rate have much tighter bars than recall. A memory miss is annoying ("I told you that already!"). A memory contradiction is brand damage ("the bot is making things up about me"). Optimize accordingly. We will accept a ship that drops recall by 3 points if it drops contradictions by 1 point — almost always.
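One way to keep these bars out of tribal knowledge is to encode them as data that the CI gate (shown later) reads. A minimal sketch; the structure and names are our own choice, not part of any framework:

```python
# Ship bars from the rubric table above, keyed by the evaluator's metric key.
# The direction flag lets one gate handle "higher is better" (recall, precision,
# staleness) and "lower is better" (contradiction rate) uniformly.
THRESHOLDS = {
    "recall_at_k":        (0.85, "higher_is_better"),
    "precision":          (0.95, "higher_is_better"),
    "contradiction_rate": (0.02, "lower_is_better"),
    "staleness_handling": (0.80, "higher_is_better"),
}
```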
Designing the Dataset — Where Most Teams Trip
The dataset is harder than the evaluator. A few patterns we converged on:
- Plant on turn 1, probe no earlier than turn 4. Anything closer and you are mostly testing the chat history sliding window, not the long-term store.
- Include filler turns. Real conversations are not "fact, probe, fact, probe." Mix in 2–4 unrelated turns between plant and probe. Forgetting under irrelevant load is the realistic failure mode.
- Plant contradictions deliberately. Turn 1: "I work at Acme." Turn 6: "I just started at Globex." Turn 9: "Where do I work?" Correct answer is Globex; old answer is Acme. This is the staleness test; a full case using this pattern is sketched right after this list.
- Use stable user IDs across the suite, not per-case. This catches cross-case bleed — if your namespacing is wrong, user-7421's facts will leak into user-7422's results and recall will look amazing for the wrong reason.
- Keep cases short (8–14 turns). Beyond ~14 turns the eval gets expensive and slow without meaningfully improving signal.
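Here is what the supersession pattern from the list looks like as a case, using the schema defined earlier. The wording, IDs, and filler placement are illustrative:

```python
staleness_case = MemoryEvalCase(
    case_id="stale-employer-003",
    user_id="eval-user-7421",  # same user ID as other cases, deliberately
    expected_facts={"user:employer": "Globex"},  # the *current* truth
    turns=[
        Turn(role="user", content="Quick background: I work at Acme.",
             plants=["user:employer=Acme"]),
        Turn(role="agent", content="Noted."),
        # ... filler turns about something unrelated ...
        Turn(role="user", content="Actually, I just started at Globex last week.",
             plants=["user:employer=Globex"],
             contradicts=["user:employer=Acme"]),
        # ... more filler ...
        Turn(role="user",
             content="Can you draft my out-of-office reply? Mention where I work.",
             probes=["user:employer=Globex"]),
    ],
)
```

The `staleness_handling` evaluator scores this as a pass only if the probe reply says Globex and never falls back to Acme.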
We currently run 180 cases per release. Total cost on `gpt-4o-2024-11-20`: about $4.20 per full eval pass, 11 minutes wall-clock at concurrency 4.
A Result From Our Pinned Benchmark
Same dataset, same agent code, two configurations, both pinned to `gpt-4o-2024-11-20` for the agent and `gpt-4o-mini-2024-07-18` for judges:
| Configuration | recall@5 | precision | contradiction | staleness |
|---|---|---|---|---|
| No long-term memory (sliding window only) | 0.41 | 0.99 | 0.001 | 0.12 |
| Naive memory (every fact saved, no schema) | 0.78 | 0.81 | 0.071 | 0.34 |
| Structured memory (Pydantic + supersession) | 0.92 | 0.97 | 0.008 | 0.86 |
The naive configuration is the cautionary tale. Recall jumped, but precision and contradictions got noticeably worse — the agent had more facts to draw from and a meaningful chunk of them were duplicates or stale. This is the empirical backing for the schema discipline we covered in the LangGraph memory architecture post: structure is not aesthetic, it is what makes precision survive scale.
Where to Wire the Eval
Two integration points matter:
- CI gate on PR. Run the full memory eval on any PR that touches memory code, prompts, or schemas. Block merge if any metric drops below threshold vs the last green main commit. We use the LangSmith CI integration pattern for this — same pipeline, different dataset. A minimal gate sketch follows this list.
- Nightly on a held-out subset. A 20-case subset runs every night against production-pinned models. If a vendor silently drifts a model snapshot, this catches it before a customer does.
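A minimal sketch of the gate itself, assuming you have already aggregated per-metric mean scores from the experiment into a dict (how you read results back from LangSmith depends on your setup, so that step is elided) and reusing the `THRESHOLDS` map sketched in the rubric section:

```python
import sys


def gate(metric_means: dict[str, float]) -> None:
    """Exit non-zero if any metric misses its bar, failing the CI job."""
    failures = []
    for metric, (bar, direction) in THRESHOLDS.items():
        score = metric_means[metric]
        ok = score >= bar if direction == "higher_is_better" else score <= bar
        if not ok:
            failures.append(f"{metric}: {score:.3f} vs bar {bar}")
    if failures:
        print("Memory eval gate failed:\n  " + "\n  ".join(failures))
        sys.exit(1)
    print("Memory eval gate passed.")
```

In CI this runs right after `evaluate(...)` completes; a non-zero exit blocks the merge like any other failing check.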
Online evals are weaker for memory than for single-turn quality, because in production you rarely have ground truth for "did the agent recall correctly?" Stick to offline for memory. Use online evals (latency, error rate, user thumbs-down rate) as a leading indicator that something is wrong, then reproduce in the offline suite.
What Most Teams Get Wrong
The pattern I see repeatedly:
- They build memory before they build the eval, ship it because "it seems to work in the demo," and accumulate quiet contradictions until a customer churns.
- They measure recall and ignore precision, which trains the team to optimize for "remembers more" instead of "remembers correctly."
- They evaluate at turn 2 instead of turn 8, which means the metric is mostly testing the prompt window, not the memory store.
- They never test staleness, so the agent confidently parrots a fact the user updated three turns ago.
The fix is not exotic. It is: write the dataset, write the four evaluators, wire them into CI, and treat regressions as bugs. The agents that survive 2026 in production will be the ones whose teams ran this loop on day one — not the ones that shipped the cleverest retrieval and hoped.