By Sagar Shankaran, Founder of CallSphere
Storing the agent's state mutations as immutable events lets you replay any conversation, A/B-test a new prompt against historical traffic, and prove to a regulator exactly what the agent saw and said.
Key takeaways
TL;DR: Store every agent input and output as an immutable event in an append-only log. The current state is a fold over those events. Benefits: replaying a conversation against a new prompt, a full audit trail for HIPAA and SOC 2, time-travel debugging, and A/B testing a new model on real historical traffic without touching production.
A traditional CRUD agent stores "the current customer record". An event-sourced agent stores "every event that has ever happened". The current record is derived by replaying the events. This is more storage and more code, but you gain superpowers: replay, audit, time-travel, and A/B against history.
flowchart LR
    Caller --> Agent[AI agent]
    Agent -->|append| ES[(Event store<br/>EventStoreDB / Postgres)]
    ES -->|projection| Read1[(Customer view)]
    ES -->|projection| Read2[(Conversation view)]
    ES -. replay .-> Replay[Replay engine]
    Replay -->|fold events| New[New prompt + model]
    Replay -->|compare| Diff[A/B diff]
Each event is (aggregateId, eventType, payload, version, timestamp). Projections (CQRS read models) are derived. New prompts can be A/B tested by replaying past conversations through the new prompt and diffing the responses.
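The event shape and derived projections described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not CallSphere's implementation: `Event`, `InMemoryEventStore`, `ConcurrencyError`, and `conversation_view` are hypothetical names, and the expected-version check is one common way to keep a per-aggregate log consistent when two writers race.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)  # frozen: an event cannot be mutated once created
class Event:
    aggregate_id: str
    event_type: str
    payload: dict[str, Any]
    version: int  # position of the event within its aggregate, starting at 1
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ConcurrencyError(Exception):
    """Raised when a writer appends against a stale version."""


class InMemoryEventStore:
    """Hypothetical append-only store: events are only ever added, never edited."""

    def __init__(self) -> None:
        self._streams: dict[str, list[Event]] = {}

    def append(self, aggregate_id: str, event_type: str,
               payload: dict[str, Any], expected_version: int) -> Event:
        stream = self._streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}"
            )
        event = Event(aggregate_id, event_type, payload, version=len(stream) + 1)
        stream.append(event)
        return event

    def load(self, aggregate_id: str) -> list[Event]:
        # Return a copy so callers cannot edit the log in place.
        return list(self._streams.get(aggregate_id, []))


def conversation_view(events: list[Event]) -> list[str]:
    """A projection (read model): derived entirely from the events."""
    return [
        f'{e.event_type.split(".")[0]}: {e.payload["text"]}'
        for e in events
        if e.event_type in ("user.utterance.v1", "agent.utterance.v1")
    ]


store = InMemoryEventStore()
store.append("call-1", "user.utterance.v1", {"text": "I need to reschedule."}, expected_version=0)
store.append("call-1", "agent.utterance.v1", {"text": "Sure, which day works?"}, expected_version=1)
view = conversation_view(store.load("call-1"))
```

Because the view is computed only from the log, it can be dropped and rebuilt at any time, which is what makes adding a new read model cheap.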
CallSphere event-sources every agent turn for Real Estate OneRoof, Healthcare, and Sales because replay for regulators and HIPAA audits is a contractual requirement. After-hours and Salon use a simpler journal model. The append-only log lives in Postgres, partitioned by month, with a Kafka projection feeding downstream views. We A/B test new prompts by replaying yesterday's traffic through the candidate prompt and diffing the tool calls.
CREATE TABLE events (
    id             bigserial   NOT NULL,
    aggregate_id   uuid        NOT NULL,
    aggregate_type text        NOT NULL,
    event_type     text        NOT NULL,
    event_version  int         NOT NULL,
    payload        jsonb       NOT NULL,
    created_at     timestamptz NOT NULL DEFAULT now(),
    -- On a partitioned table, a PRIMARY KEY or UNIQUE constraint must include the partition key.
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Serves per-aggregate reads in append order.
CREATE INDEX events_agg ON events (aggregate_id, id);
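A range-partitioned table needs a child partition for each month before rows for that month can be inserted. As a rough sketch of how the "partition per month" setup might be maintained, the hypothetical helper below builds the DDL for one monthly partition. The function name and the `events_YYYY_MM` naming scheme are assumptions, and the parent name is interpolated directly, so it must come from trusted code rather than user input.

```python
from datetime import date


def monthly_partition_ddl(parent: str, month: date) -> str:
    """Return DDL for the monthly range partition of `parent` that contains `month`."""
    start = month.replace(day=1)
    # First day of the following month; the upper bound of a range partition is exclusive.
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"  # assumed naming scheme, e.g. events_2026_01
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


ddl = monthly_partition_ddl("events", date(2026, 1, 15))
```

Running this from a scheduled job a few weeks ahead of each month is one way to avoid inserts failing for lack of a matching partition.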
def fold_call(events: list[dict]) -> dict:
    state = {"transcript": [], "tool_calls": [], "redactions": []}
    for e in events:
        t = e["event_type"]
        if t == "user.utterance.v1":
            state["transcript"].append({"role": "user", "text": e["payload"]["text"]})
        elif t == "agent.utterance.v1":
            state["transcript"].append({"role": "agent", "text": e["payload"]["text"]})
        elif t == "tool.call.v1":
            state["tool_calls"].append(e["payload"])
        elif t == "redaction.applied.v1":
            state["redactions"].append(e["payload"])
    return state
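Time-travel debugging follows directly from the fold: filter the log to a point in time, then fold only what remains. The sketch below assumes each event dict carries a `created_at` datetime, matching the column in the schema. `events_as_of`, `fold_transcript`, and `at` are hypothetical helpers, and `fold_transcript` is a compact stand-in for the fold above that handles utterances only.

```python
from datetime import datetime, timezone


def events_as_of(events: list[dict], as_of: datetime) -> list[dict]:
    """Keep only events recorded at or before `as_of`, preserving append order."""
    return [e for e in events if e["created_at"] <= as_of]


def fold_transcript(events: list[dict]) -> list[dict]:
    """Fold utterance events into a transcript; other event types are ignored."""
    roles = {"user.utterance.v1": "user", "agent.utterance.v1": "agent"}
    return [
        {"role": roles[e["event_type"]], "text": e["payload"]["text"]}
        for e in events
        if e["event_type"] in roles
    ]


def at(second: int) -> datetime:
    """Illustrative timestamps within a single minute of one call."""
    return datetime(2026, 1, 5, 9, 0, second, tzinfo=timezone.utc)


events = [
    {"event_type": "user.utterance.v1", "payload": {"text": "Cancel my appointment."}, "created_at": at(0)},
    {"event_type": "agent.utterance.v1", "payload": {"text": "Which appointment?"}, "created_at": at(5)},
    {"event_type": "user.utterance.v1", "payload": {"text": "The one on Friday."}, "created_at": at(12)},
]

# What the conversation looked like ten seconds into the call.
state_then = fold_transcript(events_as_of(events, at(10)))
```

Here `state_then` holds the first two turns only, which is the view the agent had before the caller's third message arrived.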
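# Replay a recorded call against a new prompt.
# load_events, run_model, and diff_against_recorded are assumed to be defined elsewhere.
def replay_with_prompt(call_id: str, new_prompt: str) -> dict:
    events = load_events(call_id)
    user_turns = [e for e in events if e["event_type"] == "user.utterance.v1"]
    # Each user turn is sent to the model on its own; earlier turns are not passed as context.
    new_responses = [run_model(new_prompt, u["payload"]["text"]) for u in user_turns]
    return diff_against_recorded(events, new_responses)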
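The diff step is left undefined above. One way to compare a replay with the recording, in line with diffing tool calls rather than free text, is a position-by-position comparison of the two tool-call sequences. This is a sketch under assumptions: `diff_tool_calls` is a hypothetical helper, and the `name` and `args` keys on each tool call are an assumed payload shape, not one stated in the article.

```python
def diff_tool_calls(recorded: list[dict], candidate: list[dict]) -> list[dict]:
    """Compare two tool-call sequences position by position and report mismatches."""
    diffs = []
    for i in range(max(len(recorded), len(candidate))):
        old = recorded[i] if i < len(recorded) else None
        new = candidate[i] if i < len(candidate) else None
        if old == new:
            continue
        if old is None:
            kind = "added"           # candidate made an extra call
        elif new is None:
            kind = "removed"         # candidate skipped a recorded call
        elif old["name"] != new["name"]:
            kind = "different_tool"
        else:
            kind = "different_args"
        diffs.append({"index": i, "kind": kind, "recorded": old, "candidate": new})
    return diffs


recorded = [
    {"name": "lookup_patient", "args": {"phone": "555-0100"}},
    {"name": "book_slot", "args": {"day": "Tue"}},
]
candidate = [
    {"name": "lookup_patient", "args": {"phone": "555-0100"}},
    {"name": "book_slot", "args": {"day": "Wed"}},
    {"name": "send_sms", "args": {}},
]
diffs = diff_tool_calls(recorded, candidate)
```

A positional comparison is deliberately strict: a single inserted call shifts every later index, so a sequence-alignment diff may be a better fit when candidate prompts tend to add or drop steps.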
Event sourcing vs append-only log? Event sourcing is the architectural pattern; the log is the storage.
EventStoreDB vs Postgres? EventStoreDB is purpose-built; Postgres is fine up to ~1k events/sec per partition.
Do we need CQRS? Not strictly — but the read/write split falls out naturally.
How does CallSphere use replay in eval? We replay the last 7 days of traffic against every prompt change before shipping it.
Cost? Roughly 3x the storage of CRUD, kept bounded with snapshots and compaction.
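Snapshots bound the cost of rebuilding state: persist the folded state at some event id, then fold only the events recorded after it. The sketch below is a minimal illustration. `apply_event`, `fold`, and the `last_event_id` field are hypothetical names, and only utterance events are handled to keep it short.

```python
import copy
from typing import Optional


def apply_event(state: dict, event: dict) -> dict:
    """Apply one event to the state and record how far the fold has progressed."""
    if event["event_type"] in ("user.utterance.v1", "agent.utterance.v1"):
        role = event["event_type"].split(".")[0]
        state["transcript"].append({"role": role, "text": event["payload"]["text"]})
    state["last_event_id"] = event["id"]
    return state


def fold(events: list[dict], snapshot: Optional[dict] = None) -> dict:
    """Rebuild state from an optional snapshot plus the events recorded after it."""
    # Deep-copy so the stored snapshot itself is never mutated.
    state = copy.deepcopy(snapshot) if snapshot else {"transcript": [], "last_event_id": 0}
    for e in events:
        if e["id"] > state["last_event_id"]:  # skip events already covered by the snapshot
            state = apply_event(state, e)
    return state


events = [
    {"id": 1, "event_type": "user.utterance.v1", "payload": {"text": "Hi"}},
    {"id": 2, "event_type": "agent.utterance.v1", "payload": {"text": "Hello, how can I help?"}},
    {"id": 3, "event_type": "user.utterance.v1", "payload": {"text": "Book a viewing."}},
]

snapshot = fold(events[:2])             # state as of event 2
from_snapshot = fold(events, snapshot)  # only event 3 is applied
from_scratch = fold(events)             # full replay, for comparison
```

Both paths should yield the same state. Once a snapshot exists, events at or before `last_event_id` are candidates for compaction or colder storage, provided audit retention rules allow it.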
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.