By Sagar Shankaran, Founder of CallSphere
Memory is supposed to make agents better — but does it? Build a memory eval pipeline that measures recall, precision, contradiction rate, and the freshness/staleness tradeoff.
Key takeaways
Most teams ship agent memory the same way they ship retrieval: turn it on, observe that it sometimes does the right thing, and call it done. Then six months later the agent confidently tells a user her spouse's name is wrong, the support team files a P1, and nobody can explain how it happened because nobody ever measured. This post is the eval pipeline I wish more teams ran from week one — a multi-turn dataset that plants facts on turn 1 and probes for them on turn 5+, four metrics that actually matter (recall@k, precision, contradiction rate, staleness handling), and a working LangSmith multi-turn evaluator. Numbers from our pinned benchmark on `gpt-4o-2024-11-20` and `gpt-4o-mini-2024-07-18` at the end.
The standard agent eval — input → output → score — cannot evaluate memory by construction. Memory is, definitionally, a property that emerges across turns. If you only score one turn, you are scoring the prompt and the model, not the memory layer. To detect "the agent forgot what the user said three turns ago" you need a dataset where the correct answer at turn N depends on information given at turn 1.
This is the same insight behind tool-use evals — see our end-to-end multi-step evaluation post for the broader pattern. Memory is just the special case where the "tool" being measured is the agent's own state.
Every memory eval row is a scripted conversation with planted facts and probe turns. The minimal schema:
```python
from pydantic import BaseModel
from typing import Literal


class Turn(BaseModel):
    role: Literal["user", "agent"]
    content: str
    # Annotations used by the evaluator (not shown to the agent):
    plants: list[str] = []       # facts this turn introduces
    probes: list[str] = []       # facts this turn should recall
    contradicts: list[str] = []  # facts this turn supersedes (staleness test)


class MemoryEvalCase(BaseModel):
    case_id: str
    user_id: str  # determines memory namespace
    turns: list[Turn]
    expected_facts: dict[str, str]  # canonical (subject:predicate) -> object map
```
A real case from our internal suite:
```python
case = MemoryEvalCase(
    case_id="rec-prefname-001",
    user_id="eval-user-7421",
    expected_facts={
        "user:preferred_name": "Sam",
        "user:timezone": "America/Los_Angeles",
    },
    turns=[
        Turn(role="user",
             content="Hi, please call me Sam — Samantha feels too formal.",
             plants=["user:preferred_name=Sam"]),
        Turn(role="agent", content="Got it, Sam."),
        Turn(role="user",
             content="I'm based in LA so afternoons work best.",
             plants=["user:timezone=America/Los_Angeles"]),
        # ... three filler turns about something else ...
        Turn(role="user",
             content="What time should we schedule? You know what works for me.",
             probes=["user:timezone=America/Los_Angeles", "user:preferred_name=Sam"]),
    ],
)
```
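Cases like this are easy to get subtly wrong, so it can help to lint the dataset before any agent sees it. The sketch below is an illustration rather than part of the original suite: `lint_case` is a hypothetical helper, it works on plain dicts shaped like `MemoryEvalCase`, and it assumes facts are written as `subject:predicate=object` strings as in the example above.

```python
def lint_case(case: dict) -> list[str]:
    """Return the problems found in one scripted case (an empty list means clean)."""
    problems = []
    planted: set[str] = set()
    for index, turn in enumerate(case["turns"]):
        for probe in turn.get("probes", []):
            # A probe only tests memory if the fact was planted on an earlier turn.
            if probe not in planted:
                problems.append(f"turn {index}: probe {probe!r} was never planted")
            # The probed value should agree with the canonical expected_facts map.
            key, _, value = probe.partition("=")
            if case["expected_facts"].get(key) != value:
                problems.append(f"turn {index}: probe {probe!r} disagrees with expected_facts")
        planted.update(turn.get("plants", []))
    return problems


# Hypothetical two-turn case, reduced from the example in the post.
example_case = {
    "expected_facts": {"user:preferred_name": "Sam"},
    "turns": [
        {"role": "user", "content": "Please call me Sam.",
         "plants": ["user:preferred_name=Sam"]},
        {"role": "user", "content": "What should you call me?",
         "probes": ["user:preferred_name=Sam"]},
    ],
}
print(lint_case(example_case))
```

Running a check like this over the whole dataset before an eval pass keeps a typo in a probe from showing up later as a recall failure.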
The probe turn is where evaluation happens. The agent's response should reflect both planted facts. If it asks "what is your name?" it failed recall. If it suggests a 9 AM EST slot it failed timezone recall. If it calls the user "Samantha" it actively contradicted memory.
Memory quality is multi-dimensional. A single number hides regressions. We track four:
| Metric | Question it answers | Failure mode it catches |
|---|---|---|
| Recall@k | Of the planted facts that should have been retrieved, how many were? | Agent forgot |
| Precision | Of the facts the agent used, how many were actually true? | Agent hallucinated a memory |
| Contradiction rate | Did the agent assert something that contradicts a stored fact? | Memory and reasoning disagree |
| Staleness handling | When a fact is superseded mid-conversation, does the agent update? | Agent stuck on old info |
Recall and precision are obvious. Contradiction rate is the one that catches the worst bugs — when the store has the right fact but the agent ignores it because retrieval did not surface it, or because the prompt was poorly structured. Staleness handling catches the opposite failure: the agent retrieved correctly but failed to update when the user said "actually, I moved."
LangSmith does not natively know your agent has memory. You drive the conversation yourself, then score the final state. Here is the working pattern:
```python
import asyncio

from langsmith import Client
from langsmith.evaluation import aevaluate

from my_agent import build_agent_with_memory

client = Client()


async def run_conversation(case: dict) -> dict:
    """Replay the scripted conversation and capture probe-turn responses."""
    agent = build_agent_with_memory(
        model="gpt-4o-2024-11-20",
        user_id=case["user_id"],
    )
    probe_responses = []
    for turn in case["turns"]:
        if turn["role"] != "user":
            continue  # only user turns are sent; the agent writes its own replies
        result = await agent.ainvoke(
            {"messages": [{"role": "user", "content": turn["content"]}]}
        )
        agent_reply = result["messages"][-1].content
        if turn.get("probes"):
            probe_responses.append({
                "turn_content": turn["content"],
                "probes": turn["probes"],
                "reply": agent_reply,
                # facts the agent surfaced in this reply
                "facts_used": _extract_asserted_facts(agent_reply),
            })
    return {"probe_responses": probe_responses, "case_id": case["case_id"]}


def recall_at_k(run, example):
    """Fraction of probed facts that appear in the agent's reply."""
    hits, total = 0, 0
    for pr in run.outputs["probe_responses"]:
        for probe in pr["probes"]:
            total += 1
            if _fact_satisfied(probe, pr["reply"]):
                hits += 1
    return {"key": "recall_at_k", "score": hits / total if total else 1.0}


def contradiction_rate(run, example):
    """Asserted facts that contradict expected_facts, per probed fact."""
    expected = example.outputs["expected_facts"]
    contradictions = 0
    for pr in run.outputs["probe_responses"]:
        for fact in pr["facts_used"]:
            key = f"{fact['subject']}:{fact['predicate']}"
            if key in expected and expected[key] != fact["object"]:
                contradictions += 1
    total_probes = sum(len(pr["probes"]) for pr in run.outputs["probe_responses"])
    return {"key": "contradiction_rate",
            "score": contradictions / total_probes if total_probes else 0.0}


def precision(run, example):
    """Of facts the agent asserted, fraction that match expected_facts."""
    expected = example.outputs["expected_facts"]
    asserted = [f for pr in run.outputs["probe_responses"] for f in pr["facts_used"]]
    if not asserted:
        return {"key": "precision", "score": 1.0}
    correct = sum(
        1 for f in asserted
        if expected.get(f"{f['subject']}:{f['predicate']}") == f["object"]
    )
    return {"key": "precision", "score": correct / len(asserted)}


async def main():
    # run_conversation is a coroutine, so it goes through aevaluate rather than evaluate.
    # staleness_handling must also be defined; it takes (run, example) like the others.
    return await aevaluate(
        run_conversation,
        data="memory-eval-suite-v3",
        evaluators=[recall_at_k, contradiction_rate, precision, staleness_handling],
        experiment_prefix="memory-gpt-4o-2024-11-20-pgvector",
        metadata={"agent_version": "v0.42.1", "store": "PostgresStore"},
        max_concurrency=4,
    )


results = asyncio.run(main())
```
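The evaluator list above includes `staleness_handling`, which the post does not show. One possible shape for it is sketched here under stated assumptions: facts are `subject:predicate=object` strings, a turn's `contradicts` list marks the fact it supersedes, and the agent's last asserted value for a key is the one that counts. `parse_fact` and `staleness_score` are hypothetical names, and `example.inputs["turns"]` assumes the dataset stores each case as the example input.

```python
def parse_fact(fact: str) -> tuple[str, str]:
    """Split 'user:timezone=America/Chicago' into ('user:timezone', 'America/Chicago')."""
    key, _, value = fact.partition("=")
    return key, value


def staleness_score(turns: list[dict], probe_responses: list[dict]) -> float:
    """Fraction of superseded facts for which the agent asserted the newest value."""
    latest: dict[str, str] = {}
    superseded: set[str] = set()
    for turn in turns:
        for fact in turn.get("contradicts", []):
            superseded.add(parse_fact(fact)[0])
        for fact in turn.get("plants", []):
            key, value = parse_fact(fact)
            latest[key] = value  # later plants overwrite earlier ones
    if not superseded:
        return 1.0  # nothing was superseded, so there is nothing to get wrong

    # Keep the last value the agent asserted for each key across probe turns.
    asserted: dict[str, str] = {}
    for pr in probe_responses:
        for f in pr["facts_used"]:
            asserted[f"{f['subject']}:{f['predicate']}"] = f["object"]

    # A superseded fact the agent never mentions counts as not updated.
    updated = sum(1 for key in superseded if asserted.get(key) == latest.get(key))
    return updated / len(superseded)


def staleness_handling(run, example):
    """LangSmith-style wrapper using the same return shape as the other evaluators."""
    score = staleness_score(example.inputs["turns"], run.outputs["probe_responses"])
    return {"key": "staleness_handling", "score": score}
```

Treating a never-mentioned fact as a failure is a deliberate choice in this sketch: it keeps an agent from scoring well on staleness by simply avoiding the topic.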
The two helper functions `_extract_asserted_facts` and `_fact_satisfied` are themselves small LLM-as-judge calls (using `gpt-4o-mini-2024-07-18`) — pattern-matching on free-text replies is too brittle for anything past toy datasets. We covered that judging pattern in our LLM-as-judge post.
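A minimal sketch of how such a judge helper might be wired, assuming the model call is injected as a plain callable so the logic can be exercised offline. The prompt wording, the `fact_satisfied` name, and the `judge` parameter are all hypothetical; the post does not show the real helpers, and in practice `judge` would wrap a call to the pinned judge model.

```python
from typing import Callable

# Hypothetical prompt; the real judge prompt is not shown in the post.
JUDGE_PROMPT = (
    "Fact: {fact}\n"
    "Agent reply: {reply}\n"
    "Does the reply reflect this fact, explicitly or implicitly? "
    "Answer YES or NO."
)


def fact_satisfied(fact: str, reply: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge whether `reply` reflects `fact`; anything but YES counts as a miss."""
    verdict = judge(JUDGE_PROMPT.format(fact=fact, reply=reply))
    return verdict.strip().upper().startswith("YES")


# Offline usage with a canned judge, just to show the wiring.
print(fact_satisfied(
    "user:timezone=America/Los_Angeles",
    "How about 2 PM Pacific, Sam?",
    judge=lambda prompt: "YES",
))
```

Defaulting to a miss on any unexpected verdict keeps a malformed judge response from inflating recall.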
```mermaid
flowchart LR
    A[Scripted multi-turn case] --> B[Drive turns through agent + memory store]
    B --> C[Capture probe-turn replies]
    C --> D[Extract asserted facts via judge LLM]
    D --> E1[recall@k]
    D --> E2[precision]
    D --> E3[contradiction rate]
    D --> E4[staleness handling]
    E1 --> F[LangSmith experiment]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G{regress vs baseline?}
    G -->|yes| H[Block merge]
    G -->|no| I[Ship + watch online evals]
    style B fill:#e0f2fe
    style D fill:#fef3c7
    style F fill:#dcfce7
```
Figure 1 — The memory eval pipeline. The critical detail: every metric ultimately resolves to a per-experiment score in LangSmith, so PR diffs are visual and merges are gated.
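The "regress vs baseline?" decision in the figure can be sketched as a direction-aware diff over the four scores. This is an assumption about how such a gate might look, not the post's CI code: `regressions`, the `DIRECTION` map, and the 0.01 tolerance are hypothetical, and the sample scores are the structured and naive rows from the benchmark table later in the post.

```python
# +1 means higher is better, -1 means lower is better.
DIRECTION = {
    "recall_at_k": 1,
    "precision": 1,
    "contradiction_rate": -1,
    "staleness_handling": 1,
}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Return each metric that got worse than baseline by more than `tolerance`."""
    worse = {}
    for metric, sign in DIRECTION.items():
        drop = sign * (baseline[metric] - candidate[metric])
        if drop > tolerance:
            worse[metric] = round(drop, 4)
    return worse


structured = {"recall_at_k": 0.92, "precision": 0.97,
              "contradiction_rate": 0.008, "staleness_handling": 0.86}
naive = {"recall_at_k": 0.78, "precision": 0.81,
         "contradiction_rate": 0.071, "staleness_handling": 0.34}

# A non-empty result would block the merge in this sketch.
print(regressions(baseline=structured, candidate=naive))
```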
Numbers without thresholds are decoration. Our current bars, after about six months of tuning:
| Metric | Threshold to ship | Internal stretch goal |
|---|---|---|
| recall@5 | >= 0.85 | >= 0.92 |
| precision | >= 0.95 | >= 0.98 |
| contradiction_rate | <= 0.02 | <= 0.005 |
| staleness_handling | >= 0.80 | >= 0.90 |
Note the asymmetry: precision and contradiction rate have much tighter bars than recall. A memory miss is annoying ("I told you that already!"). A memory contradiction is brand damage ("the bot is making things up about me"). Optimize accordingly. We will accept a ship that drops recall by 3 points if it drops contradictions by 1 point — almost always.
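The ship bars in the table can be written down as data and checked mechanically. The sketch below assumes the evaluator keys used earlier in the post (`recall_at_k` standing in for recall@5); `SHIP_BARS` and `failed_bars` are hypothetical names, and the sample scores are the three rows of the benchmark table.

```python
import operator

# (comparison, bar) per metric, copied from the "Threshold to ship" column.
SHIP_BARS = {
    "recall_at_k": (operator.ge, 0.85),
    "precision": (operator.ge, 0.95),
    "contradiction_rate": (operator.le, 0.02),
    "staleness_handling": (operator.ge, 0.80),
}


def failed_bars(scores: dict[str, float]) -> list[str]:
    """Return the metrics that miss their ship bar (an empty list means shippable)."""
    return [
        metric
        for metric, (compare, bar) in SHIP_BARS.items()
        if not compare(scores[metric], bar)
    ]


no_memory = {"recall_at_k": 0.41, "precision": 0.99,
             "contradiction_rate": 0.001, "staleness_handling": 0.12}
naive_memory = {"recall_at_k": 0.78, "precision": 0.81,
                "contradiction_rate": 0.071, "staleness_handling": 0.34}
structured_memory = {"recall_at_k": 0.92, "precision": 0.97,
                     "contradiction_rate": 0.008, "staleness_handling": 0.86}

for name, scores in [("no memory", no_memory), ("naive", naive_memory),
                     ("structured", structured_memory)]:
    print(name, failed_bars(scores))
```

Against these bars only the structured configuration ships, and the no-memory configuration fails on recall and staleness while passing the two tighter bars.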
The dataset is harder than the evaluator. A few patterns we converged on:
We currently run 180 cases per release. Total cost on `gpt-4o-2024-11-20`: about $4.20 per full eval pass, 11 minutes wall-clock at concurrency 4.
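Those totals translate into per-case figures with a little arithmetic. The calculation below is derived from the numbers above and assumes all four workers stay busy for the full run, so the per-case time is an average upper estimate rather than a measured latency.

```python
cases = 180
cost_usd = 4.20
wall_clock_s = 11 * 60
concurrency = 4

# Cost is shared evenly across cases.
cost_per_case = cost_usd / cases

# Total worker-seconds divided by cases gives average time spent per case.
seconds_per_case = wall_clock_s * concurrency / cases

print(f"${cost_per_case:.3f} per case, about {seconds_per_case:.0f} s per case")
```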
Same dataset, same agent code, two configurations, both pinned to `gpt-4o-2024-11-20` for the agent and `gpt-4o-mini-2024-07-18` for judges:
| Configuration | recall@5 | precision | contradiction | staleness |
|---|---|---|---|---|
| No long-term memory (sliding window only) | 0.41 | 0.99 | 0.001 | 0.12 |
| Naive memory (every fact saved, no schema) | 0.78 | 0.81 | 0.071 | 0.34 |
| Structured memory (Pydantic + supersession) | 0.92 | 0.97 | 0.008 | 0.86 |
The naive configuration is the cautionary tale. Recall jumped, but precision and contradictions got noticeably worse — the agent had more facts to draw from and a meaningful chunk of them were duplicates or stale. This is the empirical backing for the schema discipline we covered in the LangGraph memory architecture post: structure is not aesthetic, it is what makes precision survive scale.
Two integration points matter:
Online evals are weaker for memory than for single-turn quality, because in production you rarely have ground truth for "did the agent recall correctly?" Stick to offline for memory. Use online evals (latency, error rate, user thumbs-down rate) as a leading indicator that something is wrong, then reproduce in the offline suite.
The pattern I see repeatedly:
The fix is not exotic. It is: write the dataset, write the four evaluators, wire them into CI, and treat regressions as bugs. The agents that survive 2026 in production will be the ones whose teams ran this loop on day one — not the ones that shipped the cleverest retrieval and hoped.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.