
Event Sourcing for AI Agents: Replay a Conversation, Re-Plan a Decision, Audit a Refund

Storing the agent's state mutations as immutable events lets you replay any conversation, A/B-test a new prompt against historical traffic, and prove to a regulator exactly what the agent saw and said.

TL;DR — Store every agent input and output as an immutable event in an append-only log. The current state is a fold over the events. Benefits: replay a conversation against a new prompt, full audit trail for HIPAA/SOC2, time-travel debugging, and A/B testing a new model on real historical traffic without touching production.

The pattern

A traditional CRUD agent stores "the current customer record". An event-sourced agent stores "every event that has ever happened". The current record is derived by replaying the events. This is more storage and more code, but you gain superpowers: replay, audit, time-travel, and A/B against history.
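To make that concrete, here is a tiny sketch (event names and fields are illustrative): the customer record is never stored as a row; it is computed on demand by folding the events in order.

events = [
    {"event_type": "customer.created.v1",       "payload": {"name": "Ada", "plan": "starter"}},
    {"event_type": "customer.plan_changed.v1",  "payload": {"plan": "pro"}},
    {"event_type": "customer.phone_updated.v1", "payload": {"phone": "+1-555-0100"}},
]

def current_record(events: list[dict]) -> dict:
    """Derive the current customer record by replaying events in order."""
    record: dict = {}
    for e in events:
        record.update(e["payload"])  # each event overlays its fields on the state
    return record

print(current_record(events))  # {'name': 'Ada', 'plan': 'pro', 'phone': '+1-555-0100'}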

How it works (architecture)

flowchart LR
  Caller --> Agent[AI agent]
  Agent -->|append| ES[(Event store<br/>EventStoreDB / Postgres)]
  ES -->|projection| Read1[(Customer view)]
  ES -->|projection| Read2[(Conversation view)]
  ES -.replay.- Replay[Replay engine]
  Replay -->|fold events| New[New prompt + model]
  Replay -->|compare| Diff[A/B diff]

Each event is (aggregateId, eventType, payload, version, timestamp). Projections (CQRS read models) are derived. New prompts can be A/B tested by replaying past conversations through the new prompt and diffing the responses.
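As a sketch of that envelope in Python (field names mirror the table in the build steps below; treat this as one reasonable layout, not a fixed schema):

from dataclasses import dataclass, field
import datetime
import uuid

@dataclass(frozen=True)  # frozen: events are immutable once created
class Event:
    aggregate_id: uuid.UUID
    aggregate_type: str   # e.g. "call" or "customer"
    event_type: str       # e.g. "user.utterance.v1"; schema version baked into the name
    event_version: int
    payload: dict
    created_at: datetime.datetime = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc)
    )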


CallSphere implementation

CallSphere event-sources every agent turn for the Real Estate (OneRoof), Healthcare, and Sales verticals, where replay for regulators and HIPAA audits is contractual; After-hours and Salon use a simpler journal model. The append-only log lives in Postgres, partitioned by month, with a Kafka projection feeding downstream views. New prompts are A/B tested by replaying yesterday's traffic through the candidate prompt and diffing the resulting tool calls.

Build steps with code

  1. Pick a store: EventStoreDB, Axon, or Postgres with append-only constraints.
  2. Define event types — naming and versioning matter forever.
  3. Append on every state change — DB row, tool call, model decision, redaction.
  4. Build projections for read views (customer, call, conversation).
  5. Snapshot every N events so replay isn't O(n) from time zero.
  6. Migrate event schemas with versioning, never with mutation.
  7. Replay engine to test new prompts against historical events.
CREATE TABLE events (
  id bigserial,
  aggregate_id uuid NOT NULL,
  aggregate_type text NOT NULL,
  event_type text NOT NULL,
  event_version int NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  -- Postgres requires the partition key in every unique constraint on a
  -- partitioned table, so the primary key must include created_at.
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Covers per-aggregate replay in append order.
CREATE INDEX events_agg ON events (aggregate_id, id);
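Appending (step 3) is a plain INSERT; here is a minimal helper, assuming psycopg2 (its Json adapter maps a dict onto the jsonb column; retries and error handling are omitted):

from psycopg2.extras import Json

def append_event(conn, aggregate_id, aggregate_type,
                 event_type, event_version, payload):
    # Append one immutable event; this table never sees UPDATE or DELETE.
    with conn, conn.cursor() as cur:
        cur.execute(
            """INSERT INTO events
                   (aggregate_id, aggregate_type, event_type,
                    event_version, payload)
               VALUES (%s, %s, %s, %s, %s)""",
            (str(aggregate_id), aggregate_type, event_type,
             event_version, Json(payload)),
        )

The fold below then derives the current call state (step 4) as a pure function of the events: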
def fold_call(events: list[dict]) -> dict:
    """Derive the current call state by folding over its events in order."""
    state = {"transcript": [], "tool_calls": [], "redactions": []}
    for e in events:
        t = e["event_type"]
        if t == "user.utterance.v1":
            state["transcript"].append({"role": "user", "text": e["payload"]["text"]})
        elif t == "agent.utterance.v1":
            state["transcript"].append({"role": "agent", "text": e["payload"]["text"]})
        elif t == "tool.call.v1":
            state["tool_calls"].append(e["payload"])
        elif t == "redaction.applied.v1":
            state["redactions"].append(e["payload"])
    return state

# Replay against a new prompt. load_events, run_model, and
# diff_against_recorded stand for your store reader, model caller,
# and diffing helpers.
def replay_with_prompt(call_id: str, new_prompt: str) -> dict:
    events = load_events(call_id)
    user_turns = [e for e in events if e["event_type"] == "user.utterance.v1"]
    new_responses = [run_model(new_prompt, u["payload"]["text"]) for u in user_turns]
    return diff_against_recorded(events, new_responses)
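For step 5, a snapshot is just a cached fold plus a cursor into the log; replay then starts from the snapshot instead of event zero. An in-memory sketch building on fold_call above (a real system would persist the snapshot; since every field of the call state is a list, snapshot and tail merge by concatenation):

SNAPSHOT_EVERY = 500   # snapshot cadence; tune per aggregate size
_snapshots: dict = {}  # call_id -> {"state": folded state, "upto": events covered}

def load_state(call_id: str, events: list[dict]) -> dict:
    """Rebuild state from the latest snapshot plus events appended since it."""
    snap = _snapshots.get(call_id, {"state": None, "upto": 0})
    tail_state = fold_call(events[snap["upto"]:])  # fold only the new events
    if snap["state"] is None:
        state = tail_state
    else:
        state = {k: snap["state"][k] + tail_state[k] for k in tail_state}
    if len(events) - snap["upto"] >= SNAPSHOT_EVERY:
        _snapshots[call_id] = {"state": state, "upto": len(events)}
    return state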

Common pitfalls

  • Mutating event payloads — never; version events instead (see the upcaster sketch after this list).
  • No snapshots — replay grinds when aggregates have 10k+ events.
  • Coupling read model schema to event names — keep them decoupled via projections.
  • Treating event store as write-only — you'll need to query; build projections.
  • Replay storms on prompt changes — schedule them off-peak.
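The fix for that first pitfall is an upcaster: stored events stay byte-for-byte as written, and a pure function lifts old schemas to the current one at read time. A sketch, assuming a hypothetical v2 of user.utterance that renames text to content:

def upcast(e: dict) -> dict:
    # Lift older event schemas at read time; never rewrite stored rows.
    if e["event_type"] == "user.utterance.v1":
        return {**e,
                "event_type": "user.utterance.v2",
                "payload": {"content": e["payload"]["text"]}}
    return e  # already current

Folds and projections then consume only upcast(e), so read models never have to branch on dead schemas.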

FAQ

Event sourcing vs append-only log? Event sourcing is the architectural pattern; the append-only log is the storage mechanism it relies on.

EventStoreDB vs Postgres? EventStoreDB is purpose-built; Postgres is fine up to ~1k events/sec per partition.


Do we need CQRS? Not strictly — but the read/write split falls out naturally.
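A projection is just a fold persisted incrementally; a minimal in-memory sketch of a customer read model (event names and dispatch are illustrative):

class CustomerProjection:
    """CQRS read model: subscribes to the stream, keeps a queryable view."""
    def __init__(self) -> None:
        self.by_id: dict = {}  # aggregate_id -> current customer record

    def handle(self, e: dict) -> None:
        if e["aggregate_type"] != "customer":
            return  # this projection only cares about customer events
        record = self.by_id.setdefault(e["aggregate_id"], {})
        record.update(e["payload"])  # overlay the fields this event carries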

How does CallSphere use replay in eval? We replay the last 7 days through every prompt change before shipping. Book a demo to see it.

Cost? ~3x storage vs CRUD; bounded with snapshot+compaction.
