Outbox Pattern for AI Tool-Call Durability: One Transaction, Two Systems, Zero Lost Calls
When an AI agent calls a tool, the database update and the message publish must be atomic. The outbox pattern is the only correct answer — and it's essential when those tools are CRM updates and bookings.
TL;DR — When an AI tool call writes to the DB and publishes an event, you have a dual-write problem. The DB succeeds, the publish fails — or vice versa — and your system goes inconsistent. The transactional outbox pattern is the textbook fix: write the event into an OUTBOX table inside the same transaction, then a separate process drains the outbox to the message bus.
The pattern
In CallSphere, when the booking agent confirms an appointment, two things must happen: the booking row is upserted in Postgres, and a booking.confirmed event is published to NATS so the SMS sender, the calendar sync, and the analytics pipeline pick it up. Naive code does both in the application — and the dual-write fails about 1 in 10,000 transactions, which at our volume is several lost events per day.
The fix: write the event to an outbox table in the same DB transaction. A separate worker (or Debezium) tails the outbox and publishes to NATS/Kafka. The DB and the bus stay in sync — at-least-once, never zero.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
How it works (architecture)
flowchart LR
Tool[AI tool call] -->|BEGIN| DB[(Postgres)]
Tool -->|INSERT booking| DB
Tool -->|INSERT outbox event| DB
Tool -->|COMMIT| DB
DB -.WAL.- Drain[Outbox drain<br/>or Debezium]
Drain -->|publish| Bus[(NATS/Kafka)]
Bus --> SMS[SMS sender]
Bus --> Cal[Calendar sync]
Bus --> An[Analytics]
Two implementations: a polling drain that selects unsent rows and marks them sent, or CDC (Debezium tailing the WAL — see post #8) which is faster and lower-overhead.
CallSphere implementation
CallSphere uses the outbox pattern for every AI tool call that has a downstream side effect: bookings, callbacks, CRM upserts, escalations. Real Estate OneRoof's booking agent writes booking + outbox in one transaction; a Debezium connector drains the outbox to NATS within ~50 ms p95. After-hours uses a Bull/Redis queue for delayed retries — different mechanism, same principle (durable intent before publish). 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · 14-day trial · 22% affiliate. Browse /pricing or take a demo.
Build steps with code
- Create outbox table:
(id uuid pk, aggregate_id, event_type, payload jsonb, created_at, sent_at). - Tool wrapper writes domain row + outbox row in the same transaction.
- Either: poll
SELECT ... WHERE sent_at IS NULLevery 100 ms, OR use Debezium. - Mark sent_at after publish ack (idempotent; OK to publish twice).
- Add idempotency key to the published event for downstream dedup (post #14).
- Cleanup job deletes
sent_at < now() - 7 days. - Monitor outbox lag: count where
sent_at IS NULL AND created_at < now() - 5s.
CREATE TABLE outbox (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type text NOT NULL,
aggregate_id text NOT NULL,
event_type text NOT NULL,
payload jsonb NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
sent_at timestamptz
);
CREATE INDEX outbox_unsent ON outbox (created_at) WHERE sent_at IS NULL;
async def confirm_booking(call_id: str, slot: datetime):
async with db.transaction():
await db.execute(
"INSERT INTO bookings (call_id, slot) VALUES ($1, $2)",
call_id, slot,
)
await db.execute(
"""INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('booking', $1, 'booking.confirmed.v1', $2)""",
call_id, json.dumps({"callId": call_id, "slot": slot.isoformat()}),
)
Common pitfalls
- Publish inside the transaction — long-held DB locks; defeats the purpose.
- Polling without a partial index — sequential scan crushes the DB.
- Marking sent_at before ack — silent loss.
- No dedup key in payload — downstream consumers double-process.
- Outbox table never garbage-collected — table bloats; nightly cleanup is non-negotiable.
FAQ
Polling vs CDC? CDC (Debezium) is lower-latency and lower-DB-load; polling is simpler. Start with polling, graduate to CDC at scale.
Why not just publish first? Then the publish succeeds and the DB write fails — phantom event, no row.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Does this give exactly-once? No — at-least-once. Pair with idempotency keys (post #14).
Where does CallSphere put the drainer? A small Go service per pod, or a shared Debezium cluster — both in production today. See /pricing and /demo.
What's the latency? Polling: 100-500 ms. Debezium: 50-100 ms p95.
Sources
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.