---
title: "Outbox Pattern for AI Tool-Call Durability: One Transaction, Two Systems, Zero Lost Calls"
description: "When an AI agent calls a tool, the database update and the message publish must be atomic. The outbox pattern is the only correct answer — and it's essential when those tools are CRM updates and bookings."
canonical: https://callsphere.ai/blog/vw4c-outbox-pattern-ai-tool-call-durability
category: "AI Engineering"
tags: ["Outbox Pattern", "Transactional", "Dual Write", "AI Tools", "Reliability"]
author: "CallSphere Team"
published: 2026-04-02T00:00:00.000Z
updated: 2026-05-07T16:13:27.145Z
---

# Outbox Pattern for AI Tool-Call Durability: One Transaction, Two Systems, Zero Lost Calls

> When an AI agent calls a tool, the database update and the message publish must be atomic. The outbox pattern is the only correct answer — and it's essential when those tools are CRM updates and bookings.

> **TL;DR** — When an AI tool call writes to the DB and publishes an event, you have a dual-write problem. The DB succeeds, the publish fails — or vice versa — and your system goes inconsistent. The transactional outbox pattern is the textbook fix: write the event into an OUTBOX table inside the same transaction, then a separate process drains the outbox to the message bus.

## The pattern

In CallSphere, when the booking agent confirms an appointment, two things must happen: the booking row is upserted in Postgres, and a `booking.confirmed` event is published to NATS so the SMS sender, the calendar sync, and the analytics pipeline pick it up. Naive code does both in the application — and the dual-write fails about 1 in 10,000 transactions, which at our volume is several lost events per day.

The fix: **write the event to an outbox table in the same DB transaction**. A separate worker (or Debezium) tails the outbox and publishes to NATS/Kafka. The DB and the bus stay in sync — at-least-once, never zero.

## How it works (architecture)

```mermaid
flowchart LR
  Tool[AI tool call] -->|BEGIN| DB[(Postgres)]
  Tool -->|INSERT booking| DB
  Tool -->|INSERT outbox event| DB
  Tool -->|COMMIT| DB
  DB -.WAL.- Drain[Outbox drain
or Debezium]
  Drain -->|publish| Bus[(NATS/Kafka)]
  Bus --> SMS[SMS sender]
  Bus --> Cal[Calendar sync]
  Bus --> An[Analytics]
```

Two implementations: a polling drain that selects unsent rows and marks them sent, or CDC (Debezium tailing the WAL — see post #8) which is faster and lower-overhead.

## CallSphere implementation

CallSphere uses the outbox pattern for every AI tool call that has a downstream side effect: bookings, callbacks, CRM upserts, escalations. [Real Estate OneRoof](/industries/real-estate)'s booking agent writes booking + outbox in one transaction; a Debezium connector drains the outbox to NATS within ~50 ms p95. After-hours uses a Bull/Redis queue for delayed retries — different mechanism, same principle (durable intent before publish). 37 agents · 90+ tools · 115+ DB tables · 6 verticals · pricing $149/$499/$1499 · [14-day trial](/trial) · [22% affiliate](/affiliate). Browse [/pricing](/pricing) or take a [demo](/demo).

## Build steps with code

1. **Create outbox table**: `(id uuid pk, aggregate_id, event_type, payload jsonb, created_at, sent_at)`.
2. **Tool wrapper** writes domain row + outbox row in the same transaction.
3. **Either**: poll `SELECT ... WHERE sent_at IS NULL` every 100 ms, OR use Debezium.
4. **Mark sent_at after publish ack** (idempotent; OK to publish twice).
5. **Add idempotency key** to the published event for downstream dedup (post #14).
6. **Cleanup job** deletes `sent_at < now() - 7 days`.
7. **Monitor outbox lag**: count where `sent_at IS NULL AND created_at < now() - 5s`.

```sql
CREATE TABLE outbox (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_type text NOT NULL,
  aggregate_id text NOT NULL,
  event_type text NOT NULL,
  payload jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  sent_at timestamptz
);
CREATE INDEX outbox_unsent ON outbox (created_at) WHERE sent_at IS NULL;
```

```python
async def confirm_booking(call_id: str, slot: datetime):
    async with db.transaction():
        await db.execute(
            "INSERT INTO bookings (call_id, slot) VALUES ($1, $2)",
            call_id, slot,
        )
        await db.execute(
            """INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
               VALUES ('booking', $1, 'booking.confirmed.v1', $2)""",
            call_id, json.dumps({"callId": call_id, "slot": slot.isoformat()}),
        )
```

## Common pitfalls

- **Publish inside the transaction** — long-held DB locks; defeats the purpose.
- **Polling without a partial index** — sequential scan crushes the DB.
- **Marking sent_at before ack** — silent loss.
- **No dedup key in payload** — downstream consumers double-process.
- **Outbox table never garbage-collected** — table bloats; nightly cleanup is non-negotiable.

## FAQ

**Polling vs CDC?** CDC (Debezium) is lower-latency and lower-DB-load; polling is simpler. Start with polling, graduate to CDC at scale.

**Why not just publish first?** Then the publish succeeds and the DB write fails — phantom event, no row.

**Does this give exactly-once?** No — at-least-once. Pair with idempotency keys (post #14).

**Where does CallSphere put the drainer?** A small Go service per pod, or a shared Debezium cluster — both in production today. See [/pricing](/pricing) and [/demo](/demo).

**What's the latency?** Polling: 100-500 ms. Debezium: 50-100 ms p95.

## Sources

- [microservices.io: Transactional Outbox](https://microservices.io/patterns/data/transactional-outbox.html)
- [AWS: Transactional Outbox Pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html)
- [Confluent: The Transactional Outbox Pattern](https://developer.confluent.io/courses/microservices/the-transactional-outbox-pattern/)
- [Decodable: Revisiting the Outbox Pattern](https://www.decodable.co/blog/revisiting-the-outbox-pattern)

---

Source: https://callsphere.ai/blog/vw4c-outbox-pattern-ai-tool-call-durability
