---
title: "Database Backup and Recovery for AI Agent State: Postgres + pgvector"
description: "Your agent's memory, embeddings, and conversation state all live in Postgres. Backups must include vector data and survive a full-region loss. Here's how CallSphere does PITR for 115+ tables."
canonical: https://callsphere.ai/blog/vw3c-postgres-pgvector-backup-recovery-ai-agents
category: "AI Infrastructure"
tags: ["Postgres", "pgvector", "Backup", "Disaster Recovery"]
author: "CallSphere Team"
published: 2026-04-29T00:00:00.000Z
updated: 2026-05-07T09:59:38.183Z
---

# Database Backup and Recovery for AI Agent State: Postgres + pgvector

> Your agent's memory, embeddings, and conversation state all live in Postgres. Backups must include vector data and survive a full-region loss. Here's how CallSphere does PITR for 115+ tables.

> **TL;DR** — pg_basebackup plus WAL archiving covers vector data correctly. The hard part isn't taking backups; it's testing restores. A restore you rehearse weekly is a restore that works when you need it.

## What goes wrong


pgvector won the AI memory war by being boring: it's just Postgres with a vector type. That means standard PostgreSQL backup techniques (pg_dump, base backups, WAL archiving) work for vector columns too. Most teams know this and still get burned because:

1. They never test the restore.
2. They forget that the embedding model version is implicit — restoring an old DB without re-embedding leaves stale vectors.
3. They don't include the pgvector extension version in DR docs.
4. Their PITR retention window is shorter than their compliance or recovery requirements.

In 2026, most teams need 35-day PITR (Aurora's maximum retention) or longer for compliance, plus roughly one-hour RTO for AI agent state.

## How to back it up

Run three layers:

1. **Continuous WAL archiving** to S3 (or equivalent) for PITR.
2. **Weekly base backup** with pg_basebackup, kept 13 weeks.
3. **Logical pg_dump** monthly, kept 1 year, for portable restores and schema migration tests.

Also: **monitor your backups themselves** — replication lag, archive failures, restore test success.
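The archiver check can be sketched as a threshold rule over `pg_stat_archiver`. A minimal sketch; the thresholds are illustrative, and in production the inputs come from the query shown in the comment:

```shell
#!/bin/sh
# Hypothetical alert rule: trip if any archive attempt has failed or the
# last successful WAL push is stale relative to the shipping interval.
# In production, feed it from:
#   psql -At -c "SELECT EXTRACT(EPOCH FROM now() - last_archived_time)::int,
#                       failed_count FROM pg_stat_archiver;"
MAX_ARCHIVE_AGE=300   # seconds; 5x a 60s shipping interval before paging

check_archiver() {    # args: seconds_since_last_archive failed_count
  age=$1; failed=$2
  if [ "$age" -gt "$MAX_ARCHIVE_AGE" ] || [ "$failed" -gt 0 ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

check_archiver 45 0   # fresh archive, no failures
```

Wire the output into whatever pages you (PagerDuty, Alertmanager); the point is that a silent archiver failure is a silently shrinking PITR window.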

## CallSphere stack

CallSphere runs Postgres 16 with pgvector 0.8.0 on a k3s StatefulSet. 115+ tables including `agents`, `tools`, `calls`, `messages`, `embeddings` (pgvector with HNSW indexes), and tenant-isolated vertical tables.

DR plan:

- **WAL archiving** to S3 every 60s via wal-g. PITR window: 35 days.
- **Base backup** weekly Sunday 02:00 UTC; 13 retained.
- **Logical pg_dump** monthly; 12 retained.
- **Streaming replica** in a second region; lag alert at 30 seconds.
- **Restore drill** every Friday 11:00 UTC: spin up a fresh cluster from the latest base+WAL, point a synthetic test at it, time-to-running and time-to-correct-vector-search measured.
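A drill needs a pass/fail gate, not just timings. A minimal sketch; the 1% drift tolerance and the row counts are illustrative assumptions, not CallSphere's actual thresholds:

```shell
#!/bin/sh
# Hypothetical drill gate: the restored row count may lag the baseline
# recorded at backup time by writes made after the base backup, so
# tolerate a small drift instead of demanding exact equality.
check_rowcount() {   # args: restored_count expected_count
  restored=$1; expected=$2
  min=$(( expected - expected / 100 ))   # allow 1% drift
  if [ "$restored" -lt "$min" ]; then
    echo "FAIL"
  else
    echo "PASS"
  fi
}

check_rowcount 99500 100000
```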

Per vertical:

- **Healthcare FastAPI `:8084`** — agent state in `hc_agent_state` table; embeddings of patient intake notes in `hc_embeddings` (3072-d, OpenAI text-embedding-3-large).
- **Real Estate** — listing embeddings (1536-d) in `re_listings_embed`; the 6-container NATS pod's planning state in `re_plans`.
- **Sales** — conversation embeddings in `sales_convo_embed`; PM2 worker session metadata.
- **After-hours Bull/Redis queue** — Bull's job storage is Redis (snapshotted hourly to S3); state on completion writes to Postgres.

Last quarterly drill: full restore of a 480GB DB in 38 minutes; embedding queries returned correct results within 90 seconds of the cluster coming up. The $1499 enterprise tier ([/pricing](/pricing)) includes a documented DR plan and a restore-time SLA. Try the [14-day trial](/trial).

## Implementation

1. **Enable WAL archiving.**

```ini
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 60   # force a segment switch so WAL ships at least every 60s
```

2. **Weekly base backup.**

```bash
# k8s CronJob, Sundays 02:00 UTC
wal-g backup-push /var/lib/postgresql/data
# enforce the 13-week retention from the DR plan
wal-g delete retain FULL 13 --confirm
```

3. **Monthly logical dump.** A plain pg_dump already emits `CREATE EXTENSION` statements for `vector` and `pg_trgm`; don't pass `--extension`, which restricts the dump to only the named extensions and leaves your tables out.

```bash
# custom format (-Fc) supports parallel restore via pg_restore -j
pg_dump -Fc -d callsphere -f callsphere.dump
```

4. **Restore drill script.**

```bash
wal-g backup-fetch /restore LATEST
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /restore/postgresql.auto.conf
touch /restore/recovery.signal   # start Postgres in PITR recovery mode
pg_ctl -D /restore start
psql -d callsphere -c "SELECT count(*) FROM hc_embeddings;"
psql -d callsphere -c "SELECT id FROM hc_embeddings ORDER BY embedding <-> ARRAY[...]::vector LIMIT 1;"
```

5. **Document the embedding-model version** in the DR runbook. Querying restored vectors against embeddings from a different model is silently wrong: the distances compare incompatible spaces, so results look plausible but mean nothing.
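A post-restore check for the model-version step might count rows whose recorded model differs from what the live service calls. A sketch under assumptions: the `model_version` column and the counts below are hypothetical, and in production the input comes from the query in the comment:

```shell
#!/bin/sh
# Hypothetical check: compare the embedding model recorded with each row
# against the model the serving path uses. Input lines mimic:
#   psql -At -c "SELECT model_version, count(*) FROM hc_embeddings GROUP BY 1;"
SERVING_MODEL="text-embedding-3-large"   # recorded in the DR runbook

stale=0
while IFS='|' read -r model count; do
  [ "$model" != "$SERVING_MODEL" ] && stale=$((stale + count))
done <<EOF
text-embedding-3-small|1200
text-embedding-3-large|98000
EOF

echo "$stale rows need re-embedding"
```

Any nonzero count means the restore predates a model migration and those rows must be re-embedded before vector search can be trusted.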

## FAQ

**Q: Does pg_dump support vectors?**
A: Yes. `vector` is an ordinary Postgres type, so pg_dump handles it like any other column. Make sure the destination has the pgvector extension installed at the same (or a newer) version before restoring.
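That version check is scriptable. A minimal sketch; in production each value comes from the catalog query in the comment, and exact-match is the conservative rule (a newer destination usually works but deserves a manual look):

```shell
#!/bin/sh
# Hypothetical parity check before restoring a dump. In production:
#   psql -At -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
check_ext() {   # args: source_version destination_version
  if [ "$1" = "$2" ]; then echo "MATCH"; else echo "MISMATCH"; fi
}

check_ext "0.8.0" "0.8.0"
```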

**Q: How long does a 1TB pgvector restore take?**
A: With wal-g and parallel apply, ~40 min for the base + WAL replay catchup. Index rebuild for HNSW is the long pole — plan for 3–6 hours.

**Q: What's RPO for this setup?**
A: 60 seconds. WAL ships every minute; you lose at most a minute of writes.

**Q: Should I use logical replication for DR?**
A: Pair it with physical streaming. Logical for cross-version migrations; physical for fast cluster failover.

**Q: HNSW indexes increase restore time — worth it?**
A: Above roughly 10M vectors, HNSW is a must despite the longer rebuild.

## Sources

- [Calmops — PostgreSQL Vector Search with pgvector 2026](https://calmops.com/database/postgresql-vector-search-pgvector-2026/)
- [AWS — How Letta builds production-ready AI agents with Aurora PostgreSQL](https://aws.amazon.com/blogs/database/how-letta-builds-production-ready-ai-agents-with-amazon-aurora-postgresql/)
- [Render — Simplify Your AI Stack with Managed PostgreSQL and pgvector](https://render.com/articles/simplify-ai-stack-managed-postgresql-pgvector)
- [Instaclustr — pgvector key features tutorial 2026 guide](https://www.instaclustr.com/education/vector-database/pgvector-key-features-tutorial-and-pros-and-cons-2026-guide/)

---

Source: https://callsphere.ai/blog/vw3c-postgres-pgvector-backup-recovery-ai-agents
