AI Infrastructure
11 min read

Database Backup and Recovery for AI Agent State: Postgres + pgvector

Your agent's memory, embeddings, and conversation state all live in Postgres. Backups must include vector data and survive a full-region loss. Here's how CallSphere does PITR for 115+ tables.

TL;DR: pg_basebackup + WAL archiving covers vector data correctly. The hard part isn't taking backups, it's testing restores. Run a restore every week: only a tested restore is a restore that works.

What goes wrong

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
CallSphere reference architecture

pgvector won the AI memory war by being boring: it's just Postgres with a vector type. That means standard PostgreSQL backup techniques (pg_dump, base backups, WAL archiving) work for vector columns too. Most teams know this and still get burned because:

  1. They never test the restore.
  2. They forget that the embedding model version is implicit — restoring an old DB without re-embedding leaves stale vectors.
  3. They don't include the pgvector extension version in DR docs.
  4. Their PITR window doesn't cover their recovery objectives.

In 2026 most teams need 35-day PITR (the maximum automated-backup retention Aurora offers) or longer for compliance, and a 1-hour RTO for AI agent state.

How to monitor

Run three layers:

  1. Continuous WAL archiving to S3 (or equivalent) for PITR.
  2. Weekly base backup with pg_basebackup, kept 13 weeks.
  3. Logical pg_dump monthly, kept 1 year, for portable restores and schema migration tests.
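
The weekly base-backup layer and its 13-week retention can live in one small wrapper. What follows is a minimal sketch, not CallSphere's actual job: it assumes wal-g is already configured through its environment, and `PGDATA_DIR`, `KEEP_FULL`, `DRY_RUN`, `run`, and `weekly_backup` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: weekly base backup plus pruning to the newest 13 full backups.
# Assumes wal-g is configured through its environment (for example WALG_S3_PREFIX).
# PGDATA_DIR, KEEP_FULL, DRY_RUN and the function names are illustrative.

PGDATA_DIR="${PGDATA_DIR:-/var/lib/postgresql/data}"
KEEP_FULL="${KEEP_FULL:-13}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

weekly_backup() {
  # Push a new base backup; stop before pruning if the push fails.
  run wal-g backup-push "$PGDATA_DIR" || return 1
  # Without --confirm, wal-g only reports what it would delete.
  run wal-g delete retain FULL "$KEEP_FULL" --confirm
}

# Usage: DRY_RUN=1 weekly_backup   # prints both commands, runs nothing
```

Pruning only after a successful push means a failed backup never reduces the number of restorable copies.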

Also: monitor your backups themselves — replication lag, archive failures, restore test success.
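
Two of those signals can be read straight from Postgres: `pg_stat_archiver` reports when the last WAL segment was archived, and `pg_last_xact_replay_timestamp()` gives replay progress on a replica. The sketch below turns each reading into an OK/ALERT line. The thresholds, the function names, and the assumption that `psql` can connect without extra flags are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch: classify backup health from numbers Postgres already exposes.
# Thresholds and function names are illustrative assumptions.

# Seconds since the last WAL segment was archived (run on the primary).
# Prints an empty line if nothing has been archived yet.
archive_age_seconds() {
  psql -At -c "SELECT floor(extract(epoch FROM now() - last_archived_time))::int FROM pg_stat_archiver;"
}

# Seconds since the last replayed transaction (run on the replica).
# Note: this can grow on an idle primary without indicating a problem.
replica_lag_seconds() {
  psql -At -c "SELECT floor(extract(epoch FROM now() - pg_last_xact_replay_timestamp()))::int;"
}

# check_threshold NAME VALUE LIMIT -> prints "OK ..." or "ALERT ...",
# returns 0 or 1. A missing or non-numeric VALUE counts as an alert.
check_threshold() {
  local name="$1" value="$2" limit="$3"
  case "$value" in
    ''|*[!0-9]*) echo "ALERT $name: no reading"; return 1 ;;
  esac
  if [ "$value" -gt "$limit" ]; then
    echo "ALERT $name: ${value}s > ${limit}s"
    return 1
  fi
  echo "OK $name: ${value}s <= ${limit}s"
}

# Usage (not run here):
#   check_threshold wal_archive_age "$(archive_age_seconds)" 120
#   check_threshold replica_lag "$(replica_lag_seconds)" 30
```

Treating a missing reading as an alert is deliberate: a monitor that cannot reach the database should not report a healthy backup.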

CallSphere stack

CallSphere runs Postgres 16 with pgvector 0.8.0 on a k3s StatefulSet. The schema has 115+ tables, including agents, tools, calls, messages, embeddings (pgvector with HNSW indexes), and tenant-isolated vertical tables.

DR plan:

  • WAL archiving to S3 every 60s via wal-e/wal-g. PITR window: 35 days.
  • Base backup weekly Sunday 02:00 UTC; 13 retained.
  • Logical pg_dump monthly; 12 retained.
  • Streaming replica in a second region; lag alert at 30 seconds.
  • Restore drill every Friday 11:00 UTC: spin up a fresh cluster from the latest base+WAL, point a synthetic test at it, and measure time-to-running and time-to-correct-vector-search.
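
Timing the drill against the 1-hour RTO takes only a few lines. This is a sketch under assumptions, not the drill CallSphere runs: `RTO_SECONDS`, `wait_until_recovered`, and `drill_verdict` are hypothetical names, and `psql` is assumed to reach the restored cluster with default connection settings.

```shell
#!/usr/bin/env bash
# Sketch: time a restore drill against an RTO budget.
# RTO_SECONDS and the function names are illustrative.

RTO_SECONDS="${RTO_SECONDS:-3600}"   # 1-hour RTO

# wait_until_recovered TIMEOUT_SECONDS
# Poll until the restored server has left recovery, or give up.
wait_until_recovered() {
  local deadline=$(( $(date +%s) + $1 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if [ "$(psql -At -c 'SELECT pg_is_in_recovery();' 2>/dev/null)" = "f" ]; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# drill_verdict START_EPOCH END_EPOCH -> "PASS <n>s" or "FAIL <n>s"
drill_verdict() {
  local elapsed=$(( $2 - $1 ))
  if [ "$elapsed" -le "$RTO_SECONDS" ]; then
    echo "PASS ${elapsed}s"
  else
    echo "FAIL ${elapsed}s"
    return 1
  fi
}

# Usage (not run here):
#   start=$(date +%s)
#   ... fetch the backup and start the server ...
#   wait_until_recovered "$RTO_SECONDS" && drill_verdict "$start" "$(date +%s)"
```

Recording the verdict each week gives a trend line, so a slowly growing restore time shows up before it breaches the budget.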

Per vertical:

  • Healthcare FastAPI :8084 — agent state in hc_agent_state table; embeddings of patient intake notes in hc_embeddings (3072-d, OpenAI text-embedding-3-large).
  • Real Estate — listing embeddings (1536-d) in re_listings_embed; the 6-container NATS pod's planning state in re_plans.
  • Sales — conversation embeddings in sales_convo_embed; PM2 worker session metadata.
  • After-hours Bull/Redis queue — Bull's job storage is Redis (snapshotted hourly to S3); state on completion writes to Postgres.

Last quarterly drill: full restore of 480GB DB in 38 minutes. Embedding queries returned correctly within 90 seconds of cluster ready. $1499 enterprise tier on /pricing gets a documented DR plan + restore time SLA. Try the 14-day trial.

Implementation

  1. Enable WAL archiving.
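# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'wal-g wal-push %p'
# Force a segment switch at least every 60 s; otherwise WAL ships only when a segment fills
archive_timeout = 60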
  2. Weekly base backup.
# k8s CronJob
wal-g backup-push /var/lib/postgresql/data
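  3. Logical dump preserves extensions.
# --extension (PostgreSQL 14+) limits which extensions are dumped;
# without it, all non-system extensions are included by default.
pg_dump -Fc \
  --extension=vector \
  --extension=pg_trgm \
  -d callsphere -f callsphere.dump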
  4. Restore drill script.
wal-g backup-fetch /restore LATEST
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /restore/postgresql.auto.conf
# PostgreSQL 12+ only enters archive recovery when recovery.signal exists
touch /restore/recovery.signal
pg_ctl -D /restore start
psql -d callsphere -c "SELECT count(*) FROM hc_embeddings;"
psql -d callsphere -c "SELECT id FROM hc_embeddings ORDER BY embedding <-> ARRAY[...]::vector LIMIT 1;"
  5. Document the embedding-model version in the DR runbook. Querying restored vectors with embeddings from a newer model returns results that are silently wrong.
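
One way to make the embedding-model version explicit is to store a small manifest next to each backup and compare it during the drill. The sketch below is hypothetical throughout: the key=value manifest format, the field names, and the functions are illustrations, not part of wal-g or pgvector. The installed extension version can be read with `SELECT extversion FROM pg_extension WHERE extname = 'vector';`.

```shell
#!/usr/bin/env bash
# Sketch: record what produced the vectors, then compare during a restore drill.
# The manifest format (key=value lines) and all names here are hypothetical.

# write_manifest FILE MODEL DIMS PGVECTOR_VERSION
write_manifest() {
  printf 'embedding_model=%s\nembedding_dims=%s\npgvector_version=%s\n' \
    "$2" "$3" "$4" > "$1"
}

# manifest_get FILE KEY -> prints the value stored for KEY
manifest_get() {
  sed -n "s/^$2=//p" "$1"
}

# check_model FILE CURRENT_MODEL
# Returns 1 if the restored vectors came from a different model.
check_model() {
  local recorded
  recorded="$(manifest_get "$1" embedding_model)"
  if [ "$recorded" = "$2" ]; then
    echo "OK: vectors match $2"
  else
    echo "STALE: backup used '$recorded', application uses '$2'; re-embed before serving"
    return 1
  fi
}

# Usage (not run here):
#   write_manifest backup.manifest text-embedding-3-large 3072 0.8.0
#   check_model backup.manifest "$CURRENT_EMBEDDING_MODEL"
```

Failing the drill on a mismatch turns a silent relevance problem into a visible one.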

FAQ

Q: Does pg_dump support vectors? A: Yes (pgvector ≥ 0.5). pg_dump writes vector columns as ordinary COPY data in text form, like any other column. Make sure the destination has the pgvector extension installed, ideally at the same version.

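Q: How long does a 1TB pgvector restore take? A: With wal-g and parallel download, ~40 min for the base backup plus WAL replay catch-up. A physical restore brings the HNSW index files back as they were, so no rebuild is needed. The HNSW index rebuild is the long pole only when restoring from a logical pg_dump: plan for 3–6 hours in that case.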

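Q: What's RPO for this setup? A: About 60 seconds. WAL ships every minute (this relies on archive_timeout = 60, since a segment is otherwise archived only when it fills), so you lose at most about a minute of writes.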

Q: Should I use logical replication for DR? A: Pair it with physical streaming. Logical for cross-version migrations; physical for fast cluster failover.

Q: HNSW indexes increase restore time — worth it? A: For < 5M vectors, no — IVFFlat rebuilds faster. For > 10M, HNSW is a must despite the longer rebuild.
