Event Sourcing for AI Agents: Replay a Conversation, Re-Plan a Decision, Audit a Refund
Storing the agent's state mutations as immutable events lets you replay any conversation, A/B-test a new prompt against historical traffic, and prove to a regulator exactly what the agent saw and said.
TL;DR — Store every agent input and output as an immutable event in an append-only log. The current state is a fold over the events. Benefits: replay a conversation against a new prompt, full audit trail for HIPAA/SOC2, time-travel debugging, and A/B testing a new model on real historical traffic without touching production.
The pattern
A traditional CRUD agent stores "the current customer record". An event-sourced agent stores "every event that has ever happened". The current record is derived by replaying the events. This is more storage and more code, but you gain superpowers: replay, audit, time-travel, and A/B against history.
How it works (architecture)
flowchart LR
Caller --> Agent[AI agent]
Agent -->|append| ES[(Event store<br/>EventStoreDB / Postgres)]
ES -->|projection| Read1[(Customer view)]
ES -->|projection| Read2[(Conversation view)]
ES -.replay.- Replay[Replay engine]
Replay -->|fold events| New[New prompt + model]
Replay -->|compare| Diff[A/B diff]
Each event is a tuple of (aggregateId, eventType, payload, version, timestamp). Projections (CQRS read models) are derived from the log and can be rebuilt from it at any time. New prompts can be A/B tested by replaying past conversations through the new prompt and diffing the responses.
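As a rough illustration of the event tuple and of projections, here is a minimal in-memory sketch. The `Event` class, the two view functions, and the `name` key inside a tool-call payload are assumptions made for this example, not CallSphere's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


# frozen=True blocks field reassignment; the payload dict itself is still
# mutable, so treat it as read-only by convention.
@dataclass(frozen=True)
class Event:
    aggregate_id: str
    event_type: str
    payload: dict[str, Any]
    version: int
    timestamp: datetime


def conversation_view(events: list[Event]) -> list[dict]:
    """Projection: the ordered transcript of one conversation."""
    turns = []
    for e in sorted(events, key=lambda ev: ev.version):
        if e.event_type in ("user.utterance.v1", "agent.utterance.v1"):
            role = e.event_type.split(".")[0]  # "user" or "agent"
            turns.append({"role": role, "text": e.payload["text"]})
    return turns


def tool_usage_view(events: list[Event]) -> dict[str, int]:
    """Projection: how many times each tool was called."""
    counts: dict[str, int] = {}
    for e in events:
        if e.event_type == "tool.call.v1":
            name = e.payload["name"]  # assumed payload key
            counts[name] = counts.get(name, 0) + 1
    return counts


now = datetime.now(timezone.utc)
log = [
    Event("call-1", "user.utterance.v1", {"text": "I want a refund"}, 1, now),
    Event("call-1", "tool.call.v1", {"name": "lookup_order"}, 2, now),
    Event("call-1", "agent.utterance.v1", {"text": "Refund issued."}, 3, now),
]
```

Both views read the same log and never write to it, so either one can be dropped and rebuilt without losing information.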
CallSphere implementation
CallSphere event-sources every agent turn for Real Estate OneRoof, Healthcare, and Sales because replay for regulators and HIPAA audits is a contractual requirement. After-hours and Salon use a simpler journal model. The append-only log lives in Postgres with one partition per month and a Kafka projection for downstream views. We A/B test new prompts by replaying yesterday's traffic through the candidate prompt and diffing tool calls. 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · 14-day trial · 22% affiliate.
Build steps with code
- Pick a store: EventStoreDB, Axon, or Postgres with append-only constraints.
- Define event types — naming and versioning matter forever.
- Append on every state change — DB row, tool call, model decision, redaction.
- Build projections for read views (customer, call, conversation).
- Snapshot every N events so replay isn't O(n) from time zero.
- Migrate event schemas with versioning, never with mutation.
- Build a replay engine to test new prompts against historical events.
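One common way to migrate schemas with versioning rather than mutation is an upcaster: a function that lifts an old event to the newest shape at read time and leaves the stored row untouched. The sketch below is an assumption about how that could look; `user.utterance.v2` and its `language` field are hypothetical and exist only for illustration.

```python
import copy


def upcast_user_utterance_v1(event: dict) -> dict:
    """Return a v2-shaped copy of a v1 event; the stored event is not modified."""
    upgraded = copy.deepcopy(event)
    upgraded["event_type"] = "user.utterance.v2"
    # Hypothetical v2 change: a language field, defaulted for old events.
    upgraded["payload"].setdefault("language", "unknown")
    return upgraded


# Map each old event type to the function that lifts it one version up.
UPCASTERS = {"user.utterance.v1": upcast_user_utterance_v1}


def upcast(event: dict) -> dict:
    """Apply upcasters until the event reaches its latest known shape."""
    while event["event_type"] in UPCASTERS:
        event = UPCASTERS[event["event_type"]](event)
    return event


stored = {"event_type": "user.utterance.v1", "payload": {"text": "hola"}}
current = upcast(stored)
```

Folds and projections then only need to handle the latest version of each event type, while the log keeps every event exactly as it was written.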
CREATE TABLE events (
    id             bigserial   NOT NULL,
    aggregate_id   uuid        NOT NULL,
    aggregate_type text        NOT NULL,
    event_type     text        NOT NULL,
    event_version  int         NOT NULL,
    payload        jsonb       NOT NULL,
    created_at     timestamptz NOT NULL DEFAULT now(),
    -- On a partitioned table, primary key and unique constraints must include the partition key.
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Supports reading one aggregate's events in append order.
CREATE INDEX events_agg ON events (aggregate_id, id);
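A table declared with PARTITION BY RANGE (created_at) accepts a row only when a partition covering its timestamp exists. The helper below is a small sketch of generating the DDL for one monthly partition; the function name and the `events_YYYY_MM` naming scheme are assumptions, and running the statement (for example from a scheduled job) is left out.

```python
from datetime import date


def month_partition_ddl(year: int, month: int, parent: str = "events") -> str:
    """Build the DDL for one monthly range partition of `parent`."""
    start = date(year, month, 1)
    # First day of the following month; the upper bound of a range is exclusive.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(month_partition_ddl(2025, 12))
```

The date literals are interpreted in the session time zone when compared with the `timestamptz` column, so it is worth fixing that time zone (UTC is a typical choice) for whatever job creates the partitions.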
def fold_call(events: list[dict]) -> dict:
    state = {"transcript": [], "tool_calls": [], "redactions": []}
    for e in events:
        t = e["event_type"]
        if t == "user.utterance.v1":
            state["transcript"].append({"role": "user", "text": e["payload"]["text"]})
        elif t == "agent.utterance.v1":
            state["transcript"].append({"role": "agent", "text": e["payload"]["text"]})
        elif t == "tool.call.v1":
            state["tool_calls"].append(e["payload"])
        elif t == "redaction.applied.v1":
            state["redactions"].append(e["payload"])
    return state
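
# Replay against new prompt.
# load_events, run_model, and diff_against_recorded are helpers assumed to be defined elsewhere.
def replay_with_prompt(call_id: str, new_prompt: str) -> dict:
    events = load_events(call_id)
    user_turns = [e for e in events if e["event_type"] == "user.utterance.v1"]
    # Each user turn is sent on its own; earlier turns are not passed as context here.
    new_responses = [run_model(new_prompt, u["payload"]["text"]) for u in user_turns]
    return diff_against_recorded(events, new_responses)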
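The article does not show `diff_against_recorded`. One possible shape, offered only as a sketch, pairs each recorded agent reply with the candidate reply in order and flags the turns that differ. It assumes one agent reply per user turn and uses exact string comparison, which is cruder than diffing tool calls or scoring semantic similarity.

```python
def diff_against_recorded(events: list[dict], new_responses: list[str]) -> dict:
    """Pair recorded agent replies with candidate replies and flag changes."""
    recorded = [
        e["payload"]["text"]
        for e in events
        if e["event_type"] == "agent.utterance.v1"
    ]
    turns = []
    # zip stops at the shorter list, so leftover turns are counted separately.
    for i, (old, new) in enumerate(zip(recorded, new_responses)):
        turns.append(
            {"turn": i, "recorded": old, "candidate": new, "changed": old != new}
        )
    return {
        "turns": turns,
        "changed": sum(t["changed"] for t in turns),
        "unpaired": abs(len(recorded) - len(new_responses)),
    }


sample_events = [
    {"event_type": "user.utterance.v1", "payload": {"text": "hi"}},
    {"event_type": "agent.utterance.v1", "payload": {"text": "Hello!"}},
    {"event_type": "user.utterance.v1", "payload": {"text": "refund please"}},
    {"event_type": "agent.utterance.v1", "payload": {"text": "Done."}},
]
report = diff_against_recorded(sample_events, ["Hello!", "I have issued the refund."])
```

A non-zero `unpaired` count is a hint that the one-reply-per-turn assumption did not hold for that call, and the diff should be read with care.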
Common pitfalls
- Mutating event payloads — never; version events instead.
- No snapshots — replay grinds when aggregates have 10k+ events.
- Coupling read model schema to event names — keep them decoupled via projections.
- Treating event store as write-only — you'll need to query; build projections.
- Replay storms on prompt changes — schedule them off-peak.
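To make the snapshot pitfall concrete, here is a minimal sketch of resuming a fold from a snapshot instead of from the first event. It assumes each event carries a per-aggregate `version` that increases by one; the function names and the default interval of 100 are illustrative choices.

```python
from typing import Optional

EMPTY = {"transcript": [], "version": 0}


def apply(state: dict, event: dict) -> dict:
    """Fold step: return a new state with one event applied."""
    transcript = list(state["transcript"])  # copy so earlier states stay intact
    if event["event_type"].endswith("utterance.v1"):
        transcript.append(event["payload"]["text"])
    return {"transcript": transcript, "version": event["version"]}


def load_state(events: list[dict], snapshot: Optional[dict] = None) -> dict:
    """Start from the snapshot, if any, and fold only the events after it."""
    state = snapshot if snapshot is not None else EMPTY
    for e in events:
        if e["version"] > state["version"]:
            state = apply(state, e)
    return state


def should_snapshot(version: int, every: int = 100) -> bool:
    """True when the aggregate has reached a multiple of `every` events."""
    return version > 0 and version % every == 0


events = [
    {"event_type": "user.utterance.v1", "payload": {"text": f"turn {v}"}, "version": v}
    for v in range(1, 6)
]
snapshot = load_state(events[:3])       # state as of version 3
full = load_state(events)               # fold from the beginning
resumed = load_state(events, snapshot)  # fold only versions 4 and 5
```

In a real store the events at or below the snapshot version would be filtered out in the query rather than in Python, so a 10k-event aggregate costs one snapshot read plus at most N event reads.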
FAQ
Event sourcing vs append-only log? Event sourcing is the architectural pattern; the log is the storage.
EventStoreDB vs Postgres? EventStoreDB is purpose-built; Postgres is fine up to ~1k events/sec per partition.
Do we need CQRS? Not strictly — but the read/write split falls out naturally.
How does CallSphere use replay in eval? We replay the last 7 days through every prompt change before shipping. Book a demo to see it.
Cost? ~3x storage vs CRUD; bounded with snapshot+compaction.