TL;DR: Store every agent input and output as an immutable event in an append-only log. The current state is a fold over the events, that is, the result of applying each event in order to an empty starting state. Benefits: replaying a conversation against a new prompt, a full audit trail for HIPAA/SOC 2, time-travel debugging, and A/B testing a new model on real historical traffic without touching production.

The pattern

A traditional CRUD agent stores "the current customer record". An event-sourced agent stores "every event that has ever happened", and the current record is derived by replaying those events. This costs more storage and more code, but it buys four capabilities: replay, audit, time travel, and A/B testing against history.
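To make "the current record is derived by replaying the events" concrete, here is a minimal sketch. The event names and payload fields are hypothetical and chosen only for illustration; a real system would define its own.

```python
from functools import reduce

# Hypothetical events for one customer; names and fields are illustrative only.
events = [
    {"event_type": "customer.created.v1",
     "payload": {"name": "Ada", "email": "ada@example.com"}},
    {"event_type": "customer.email_changed.v1",
     "payload": {"email": "ada@new.example.com"}},
    {"event_type": "customer.plan_upgraded.v1",
     "payload": {"plan": "pro"}},
]

KNOWN_TYPES = {
    "customer.created.v1",
    "customer.email_changed.v1",
    "customer.plan_upgraded.v1",
}

def apply(state: dict, event: dict) -> dict:
    """Return a new state with one event applied; unknown types are ignored."""
    if event["event_type"] in KNOWN_TYPES:
        return {**state, **event["payload"]}
    return state

def current_record(history: list[dict]) -> dict:
    """Fold the events, in order, into the record a CRUD system would store."""
    return reduce(apply, history, {})

current = current_record(events)            # state after all events
as_of_second = current_record(events[:2])   # "time travel": state after two events
```

The CRUD row is just `current`; the event-sourced store can also answer what the record looked like at any earlier point by folding a prefix of the log.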

How it works (architecture)

flowchart LR
  Caller --> Agent[AI agent]
  Agent -->|append| ES[(Event store<br/>EventStoreDB / Postgres)]
  ES -->|projection| Read1[(Customer view)]
  ES -->|projection| Read2[(Conversation view)]
  ES -.replay.-> Replay[Replay engine]
  Replay -->|fold events| New[New prompt + model]
  Replay -->|compare| Diff[A/B diff]

Each event is a tuple (aggregateId, eventType, payload, version, timestamp). Projections (CQRS read models) are derived from the log and can be rebuilt from it at any time. New prompts can be A/B tested by replaying past conversations through the new prompt and diffing the responses.
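The event tuple and two projections might be sketched as follows. This is an illustration under assumptions: `version` is treated here as the event's position within its aggregate, the `name` field on tool-call payloads is hypothetical, and the view shapes are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    aggregate_id: str
    event_type: str
    payload: dict
    version: int          # assumed: position of the event within its aggregate
    timestamp: datetime

def project_conversation_view(events: list[Event]) -> dict[str, list[str]]:
    """Read model: one ordered transcript per aggregate."""
    view: dict[str, list[str]] = defaultdict(list)
    for e in sorted(events, key=lambda e: (e.aggregate_id, e.version)):
        if e.event_type in ("user.utterance.v1", "agent.utterance.v1"):
            role = e.event_type.split(".")[0]
            view[e.aggregate_id].append(f"{role}: {e.payload['text']}")
    return dict(view)

def project_tool_usage_view(events: list[Event]) -> dict[str, int]:
    """Read model: how often each tool was called across all aggregates."""
    counts: dict[str, int] = defaultdict(int)
    for e in events:
        if e.event_type == "tool.call.v1":
            counts[e.payload["name"]] += 1  # "name" is a hypothetical field
    return dict(counts)

now = datetime.now(timezone.utc)
log = [
    Event("call-1", "user.utterance.v1", {"text": "I need a refund"}, 1, now),
    Event("call-1", "tool.call.v1", {"name": "lookup_order"}, 2, now),
    Event("call-1", "agent.utterance.v1", {"text": "Refund issued"}, 3, now),
]
conversation_view = project_conversation_view(log)
tool_usage_view = project_tool_usage_view(log)
```

Both views read the same log and neither writes to it, so a view can be dropped and rebuilt whenever its shape needs to change.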

CallSphere implementation

CallSphere event-sources every agent turn for Real Estate OneRoof, Healthcare, and Sales because replay for regulators and HIPAA audits is a contractual requirement. After-hours and Salon use a simpler journal model. The append-only log lives in Postgres with one partition per month, and a Kafka projection feeds downstream views. We A/B test new prompts by replaying yesterday's traffic through the candidate prompt and diffing tool calls. 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · 14-day trial · 22% affiliate.

Build steps with code

Pick a store: EventStoreDB, Axon, or Postgres with append-only constraints.
Define event types — naming and versioning matter forever.
Append on every state change — DB row, tool call, model decision, redaction.
Build projections for read views (customer, call, conversation).
Snapshot every N events so replay isn't O(n) from time zero.
Migrate event schemas with versioning, never with mutation.
Build a replay engine to test new prompts against historical events.

CREATE TABLE events (
  id bigserial,
  aggregate_id uuid NOT NULL,
  aggregate_type text NOT NULL,
  event_type text NOT NULL,
  event_version int NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  -- On a partitioned table, every primary key or unique constraint
  -- must include the partition key column (created_at).
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Supports reading one aggregate's events in append order.
CREATE INDEX events_agg ON events (aggregate_id, id);
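# The schema above does not by itself stop UPDATE or DELETE. The sketch below
# shows one way to enforce append-only behaviour with triggers. It uses
# SQLite from the Python standard library so it runs anywhere; that is a
# swapped-in engine, not the Postgres setup described in this article. In
# Postgres a similar effect is usually achieved with a trigger or by revoking
# UPDATE and DELETE privileges on the table. The trimmed-down columns and the
# helper names are assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  aggregate_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  payload TEXT NOT NULL
);
-- Reject any attempt to change or remove a stored event.
CREATE TRIGGER events_no_update BEFORE UPDATE ON events
BEGIN
  SELECT RAISE(ABORT, 'events is append-only');
END;
CREATE TRIGGER events_no_delete BEFORE DELETE ON events
BEGIN
  SELECT RAISE(ABORT, 'events is append-only');
END;
""")

def append_event(aggregate_id: str, event_type: str, payload: dict) -> int:
    """Insert one event and return its id."""
    cur = conn.execute(
        "INSERT INTO events (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
        (aggregate_id, event_type, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid

def is_rejected(sql: str) -> bool:
    """Return True if the statement is blocked by the append-only triggers."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        conn.rollback()
        return True
    return False

first_id = append_event("call-1", "user.utterance.v1", {"text": "hi"})
update_blocked = is_rejected("UPDATE events SET payload = '{}'")
delete_blocked = is_rejected("DELETE FROM events")
```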

def fold_call(events: list[dict]) -> dict:
    state = {"transcript": [], "tool_calls": [], "redactions": []}
    for e in events:
        t = e["event_type"]
        if t == "user.utterance.v1":
            state["transcript"].append({"role": "user", "text": e["payload"]["text"]})
        elif t == "agent.utterance.v1":
            state["transcript"].append({"role": "agent", "text": e["payload"]["text"]})
        elif t == "tool.call.v1":
            state["tool_calls"].append(e["payload"])
        elif t == "redaction.applied.v1":
            state["redactions"].append(e["payload"])
    return state

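# Replay a recorded call against a new prompt.
# load_events, run_model, and diff_against_recorded are application-specific
# helpers that are not defined here.
def replay_with_prompt(call_id: str, new_prompt: str) -> dict:
    events = load_events(call_id)  # expected in append order (ascending id)
    user_turns = [e for e in events if e["event_type"] == "user.utterance.v1"]
    # Each user turn is sent to the model on its own, without earlier turns as context.
    new_responses = [run_model(new_prompt, u["payload"]["text"]) for u in user_turns]
    return diff_against_recorded(events, new_responses)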
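# A self-contained variant of the replay idea, as a sketch under assumptions:
# the model is passed in as a plain callable (here a fake one, so the example
# runs offline), and each candidate reply is generated with the recorded
# conversation so far as context. The function names and the diff shape are
# hypothetical.

```python
from typing import Callable

# A model takes (prompt, history) and returns the agent's reply text.
Model = Callable[[str, list[dict]], str]

def replay_conversation(events: list[dict], new_prompt: str, model: Model) -> list[dict]:
    """Re-run each user turn through the model and diff against the recorded reply."""
    history: list[dict] = []
    diffs: list[dict] = []
    pending = None  # candidate reply waiting for the matching recorded reply
    for e in events:
        if e["event_type"] == "user.utterance.v1":
            history.append({"role": "user", "text": e["payload"]["text"]})
            pending = model(new_prompt, list(history))
        elif e["event_type"] == "agent.utterance.v1":
            recorded = e["payload"]["text"]
            if pending is not None:
                diffs.append({
                    "turn": len(diffs),
                    "recorded": recorded,
                    "candidate": pending,
                    "changed": pending != recorded,
                })
                pending = None
            # Keep the recorded reply in history so later turns see the
            # context the caller actually experienced.
            history.append({"role": "agent", "text": recorded})
    return diffs

def fake_model(prompt: str, history: list[dict]) -> str:
    """Stand-in for a real model call; echoes the latest user text."""
    return f"[{prompt}] you said: {history[-1]['text']}"

recorded_events = [
    {"event_type": "user.utterance.v1", "payload": {"text": "refund please"}},
    {"event_type": "agent.utterance.v1", "payload": {"text": "[v1] you said: refund please"}},
    {"event_type": "user.utterance.v1", "payload": {"text": "order 42"}},
    {"event_type": "agent.utterance.v1", "payload": {"text": "Looking up order 42"}},
]
diffs = replay_conversation(recorded_events, "v1", fake_model)
```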

Common pitfalls

Mutating event payloads — never; version events instead.
No snapshots — replay grinds when aggregates have 10k+ events.
Coupling read model schema to event names — keep them decoupled via projections.
Treating event store as write-only — you'll need to query; build projections.
Replay storms on prompt changes — schedule them off-peak.
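The first pitfall, versioning events instead of mutating them, is commonly handled with an "upcaster" that converts old event versions to the newest shape at read time. The sketch below assumes a hypothetical `user.utterance.v2` that adds a `language` field; neither that event type nor the field appears in the schema above.

```python
import copy

def upcast_utterance_v1_to_v2(event: dict) -> dict:
    """Return a v2 copy of a v1 utterance; the stored v1 event is left untouched."""
    upgraded = copy.deepcopy(event)
    upgraded["event_type"] = "user.utterance.v2"
    # Hypothetical v2 field; old events get an explicit placeholder value.
    upgraded["payload"]["language"] = "unknown"
    return upgraded

# Map each outdated event type to the function that lifts it one version.
UPCASTERS = {"user.utterance.v1": upcast_utterance_v1_to_v2}

def upcast(event: dict) -> dict:
    """Apply upcasters repeatedly until the event is at its newest version."""
    while event["event_type"] in UPCASTERS:
        event = UPCASTERS[event["event_type"]](event)
    return event

stored = {"event_type": "user.utterance.v1", "payload": {"text": "hello"}}
latest = upcast(stored)
```

Fold and projection code then only needs to understand the newest version, while the log keeps exactly what was written at the time.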

FAQ

Event sourcing vs append-only log? Event sourcing is the architectural pattern; the log is the storage.

EventStoreDB vs Postgres? EventStoreDB is purpose-built; Postgres is fine up to ~1k events/sec per partition.

Do we need CQRS? Not strictly — but the read/write split falls out naturally.

How does CallSphere use replay in eval? We replay the last 7 days through every prompt change before shipping. Book a demo to see it.

Cost? ~3x storage vs CRUD; bounded with snapshot+compaction.

Event Sourcing for AI Agents: Replay a Conversation, Re-Plan a Decision, Audit a Refund
