TL;DR — When an AI tool call writes to the DB and publishes an event, you have a dual-write problem. The DB succeeds, the publish fails — or vice versa — and your system goes inconsistent. The transactional outbox pattern is the textbook fix: write the event into an OUTBOX table inside the same transaction, then a separate process drains the outbox to the message bus.

The pattern

In CallSphere, when the booking agent confirms an appointment, two things must happen: the booking row is upserted in Postgres, and a booking.confirmed event is published to NATS so the SMS sender, the calendar sync, and the analytics pipeline pick it up. Naive code does both in the application — and the dual-write fails about 1 in 10,000 transactions, which at our volume is several lost events per day.

The fix: write the event to an outbox table in the same DB transaction. A separate worker (or Debezium) tails the outbox and publishes to NATS/Kafka. The DB and the bus stay in sync — at-least-once, never zero.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

How it works (architecture)

flowchart LR
  Tool[AI tool call] -->|BEGIN| DB[(Postgres)]
  Tool -->|INSERT booking| DB
  Tool -->|INSERT outbox event| DB
  Tool -->|COMMIT| DB
  DB -.WAL.- Drain[Outbox drain<br/>or Debezium]
  Drain -->|publish| Bus[(NATS/Kafka)]
  Bus --> SMS[SMS sender]
  Bus --> Cal[Calendar sync]
  Bus --> An[Analytics]

Two implementations: a polling drain that selects unsent rows and marks them sent, or CDC (Debezium tailing the WAL — see post #8) which is faster and lower-overhead.

CallSphere implementation

CallSphere uses the outbox pattern for every AI tool call that has a downstream side effect: bookings, callbacks, CRM upserts, escalations. Real Estate OneRoof's booking agent writes booking + outbox in one transaction; a Debezium connector drains the outbox to NATS within ~50 ms p95. After-hours uses a Bull/Redis queue for delayed retries — different mechanism, same principle (durable intent before publish). 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · 14-day trial · 22% affiliate. Browse /pricing or take a demo.

Build steps with code

Create outbox table: (id uuid pk, aggregate_id, event_type, payload jsonb, created_at, sent_at).
Tool wrapper writes domain row + outbox row in the same transaction.
Either: poll SELECT ... WHERE sent_at IS NULL every 100 ms, OR use Debezium.
Mark sent_at after publish ack (idempotent; OK to publish twice).
Add idempotency key to the published event for downstream dedup (post #14).
Cleanup job deletes sent_at < now() - 7 days.
Monitor outbox lag: count where sent_at IS NULL AND created_at < now() - 5s.

CREATE TABLE outbox (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_type text NOT NULL,
  aggregate_id text NOT NULL,
  event_type text NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  sent_at timestamptz
);
CREATE INDEX outbox_unsent ON outbox (created_at) WHERE sent_at IS NULL;

async def confirm_booking(call_id: str, slot: datetime):
    async with db.transaction():
        await db.execute(
            "INSERT INTO bookings (call_id, slot) VALUES ($1, $2)",
            call_id, slot,
        )
        await db.execute(
            """INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
               VALUES ('booking', $1, 'booking.confirmed.v1', $2)""",
            call_id, json.dumps({"callId": call_id, "slot": slot.isoformat()}),
        )

Common pitfalls

Publish inside the transaction — long-held DB locks; defeats the purpose.
Polling without a partial index — sequential scan crushes the DB.
Marking sent_at before ack — silent loss.
No dedup key in payload — downstream consumers double-process.
Outbox table never garbage-collected — table bloats; nightly cleanup is non-negotiable.

FAQ

Polling vs CDC? CDC (Debezium) is lower-latency and lower-DB-load; polling is simpler. Start with polling, graduate to CDC at scale.

Why not just publish first? Then the publish succeeds and the DB write fails — phantom event, no row.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Does this give exactly-once? No — at-least-once. Pair with idempotency keys (post #14).

Where does CallSphere put the drainer? A small Go service per pod, or a shared Debezium cluster — both in production today. See /pricing and /demo.

What's the latency? Polling: 100-500 ms. Debezium: 50-100 ms p95.

Outbox Pattern for AI Tool-Call Durability: One Transaction, Two Systems, Zero Lost Calls

The pattern

How it works (architecture)

CallSphere implementation

Build steps with code

Common pitfalls

FAQ

Sources

Try CallSphere AI Voice Agents

Related Articles You May Like

Inngest Agent Kit: Durable Execution for Long-Running Agent Tasks

Agent Incident Retros: How to Run a Postmortem When an LLM Made the Mistake

Regional Failover for AI Voice: Multi-Cloud, Multi-Region, Multi-Provider

Chaos Engineering for AI Voice: Gremlin Patterns and Agent-Specific Failure Injection

Webhook Patterns for AI Voice Agents: Idempotency, Retries, and Security

AI Voice Agent Failover and Reliability Patterns for Production