Skip to content
AI Engineering
AI Engineering11 min read0 views

Outbox Pattern for AI Tool-Call Durability: One Transaction, Two Systems, Zero Lost Calls

When an AI agent calls a tool, the database update and the message publish must be atomic. The outbox pattern is the only correct answer — and it's essential when those tools are CRM updates and bookings.

TL;DR — When an AI tool call writes to the DB and publishes an event, you have a dual-write problem. The DB succeeds, the publish fails — or vice versa — and your system goes inconsistent. The transactional outbox pattern is the textbook fix: write the event into an OUTBOX table inside the same transaction, then a separate process drains the outbox to the message bus.

The pattern

In CallSphere, when the booking agent confirms an appointment, two things must happen: the booking row is upserted in Postgres, and a booking.confirmed event is published to NATS so the SMS sender, the calendar sync, and the analytics pipeline pick it up. Naive code does both in the application — and the dual-write fails about 1 in 10,000 transactions, which at our volume is several lost events per day.

The fix: write the event to an outbox table in the same DB transaction. A separate worker (or Debezium) tails the outbox and publishes to NATS/Kafka. The DB and the bus stay in sync — at-least-once, never zero.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

How it works (architecture)

flowchart LR
  Tool[AI tool call] -->|BEGIN| DB[(Postgres)]
  Tool -->|INSERT booking| DB
  Tool -->|INSERT outbox event| DB
  Tool -->|COMMIT| DB
  DB -.WAL.- Drain[Outbox drain<br/>or Debezium]
  Drain -->|publish| Bus[(NATS/Kafka)]
  Bus --> SMS[SMS sender]
  Bus --> Cal[Calendar sync]
  Bus --> An[Analytics]

Two implementations: a polling drain that selects unsent rows and marks them sent, or CDC (Debezium tailing the WAL — see post #8) which is faster and lower-overhead.

CallSphere implementation

CallSphere uses the outbox pattern for every AI tool call that has a downstream side effect: bookings, callbacks, CRM upserts, escalations. Real Estate OneRoof's booking agent writes booking + outbox in one transaction; a Debezium connector drains the outbox to NATS within ~50 ms p95. After-hours uses a Bull/Redis queue for delayed retries — different mechanism, same principle (durable intent before publish). 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · 14-day trial · 22% affiliate. Browse /pricing or take a demo.

Build steps with code

  1. Create outbox table: (id uuid pk, aggregate_id, event_type, payload jsonb, created_at, sent_at).
  2. Tool wrapper writes domain row + outbox row in the same transaction.
  3. Either: poll SELECT ... WHERE sent_at IS NULL every 100 ms, OR use Debezium.
  4. Mark sent_at after publish ack (idempotent; OK to publish twice).
  5. Add idempotency key to the published event for downstream dedup (post #14).
  6. Cleanup job deletes sent_at < now() - 7 days.
  7. Monitor outbox lag: count where sent_at IS NULL AND created_at < now() - 5s.
CREATE TABLE outbox (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_type text NOT NULL,
  aggregate_id text NOT NULL,
  event_type text NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  sent_at timestamptz
);
CREATE INDEX outbox_unsent ON outbox (created_at) WHERE sent_at IS NULL;
async def confirm_booking(call_id: str, slot: datetime):
    async with db.transaction():
        await db.execute(
            "INSERT INTO bookings (call_id, slot) VALUES ($1, $2)",
            call_id, slot,
        )
        await db.execute(
            """INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
               VALUES ('booking', $1, 'booking.confirmed.v1', $2)""",
            call_id, json.dumps({"callId": call_id, "slot": slot.isoformat()}),
        )

Common pitfalls

  • Publish inside the transaction — long-held DB locks; defeats the purpose.
  • Polling without a partial index — sequential scan crushes the DB.
  • Marking sent_at before ack — silent loss.
  • No dedup key in payload — downstream consumers double-process.
  • Outbox table never garbage-collected — table bloats; nightly cleanup is non-negotiable.

FAQ

Polling vs CDC? CDC (Debezium) is lower-latency and lower-DB-load; polling is simpler. Start with polling, graduate to CDC at scale.

Why not just publish first? Then the publish succeeds and the DB write fails — phantom event, no row.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Does this give exactly-once? No — at-least-once. Pair with idempotency keys (post #14).

Where does CallSphere put the drainer? A small Go service per pod, or a shared Debezium cluster — both in production today. See /pricing and /demo.

What's the latency? Polling: 100-500 ms. Debezium: 50-100 ms p95.

Sources

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.