Database Backup and Recovery for AI Agent State: Postgres + pgvector
Your agent's memory, embeddings, and conversation state all live in Postgres. Backups must include vector data and survive a full-region loss. Here's how CallSphere does PITR for 115+ tables.
TL;DR — pg_basebackup + WAL archiving covers vector data correctly. The hard part isn't taking backups — it's testing restores. Restore weekly; a restore you've tested is the only restore you know works.
What goes wrong
pgvector won the AI memory war by being boring: it's just Postgres with a vector type. That means standard PostgreSQL backup techniques (pg_dump, base backups, WAL archiving) work for vector columns too. Most teams know this and still get burned because:
- They never test the restore.
- They forget that the embedding model version is implicit — restoring an old DB without re-embedding leaves stale vectors.
- They don't include the pgvector extension version in DR docs.
- PITR window doesn't cover their RPO.
In 2026 most teams need a 35-day PITR window (the maximum Aurora offers) or longer for compliance, and a 1-hour RTO for AI agent state.
The backup strategy
Run three layers:
- Continuous WAL archiving to S3 (or equivalent) for PITR.
- Weekly base backup with pg_basebackup, kept 13 weeks.
- Logical pg_dump monthly, kept 1 year, for portable restores and schema migration tests.
Also: monitor your backups themselves — replication lag, archive failures, restore test success.
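Concretely, "monitor the backups" can be a handful of alert rules. A sketch in Prometheus terms; the metric names are assumptions (typical of a postgres_exporter plus Pushgateway setup), not a drop-in config:

```yaml
groups:
  - name: backup-health
    rules:
      - alert: WALArchiveFailing
        # pg_stat_archiver_failed_count is an assumed exporter metric name
        expr: increase(pg_stat_archiver_failed_count[15m]) > 0
        for: 5m
        annotations:
          summary: "WAL archiving to S3 is failing"
      - alert: ReplicaLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 2m
        annotations:
          summary: "Streaming replica more than 30s behind"
      - alert: RestoreDrillStale
        # assumes the weekly drill pushes a success timestamp to a Pushgateway
        expr: time() - restore_drill_last_success_timestamp > 8 * 86400
        annotations:
          summary: "No successful restore drill in over 8 days"
```

The third rule is the one that matters: an untested backup silently rotting is the failure mode the alerts exist to catch.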
CallSphere stack
CallSphere runs Postgres 16 with pgvector 0.8.0 on a k3s StatefulSet. 115+ tables including agents, tools, calls, messages, embeddings (pgvector with HNSW indexes), and tenant-isolated vertical tables.
DR plan:
- WAL archiving to S3 every 60s via wal-g (archive_timeout = 60). PITR window: 35 days.
- Base backup weekly Sunday 02:00 UTC; 13 retained.
- Logical pg_dump monthly; 12 retained.
- Streaming replica in a second region; lag alert at 30 seconds.
- Restore drill every Friday 11:00 UTC: spin up a fresh cluster from the latest base+WAL, point a synthetic test at it, time-to-running and time-to-correct-vector-search measured.
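Retention like this has a predictable storage bill. A back-of-envelope sketch in shell; the 480 GB base size comes from the last drill, while the daily WAL volume is a made-up placeholder to swap for your own numbers:

```shell
# Rough S3 footprint for the retention policy above.
BASE_GB=480        # size of one base backup (from the last drill)
BASE_COPIES=13     # weekly base backups retained
WAL_GB_PER_DAY=20  # ASSUMED compressed WAL rate -- measure yours
PITR_DAYS=35       # PITR window

TOTAL=$(( BASE_GB * BASE_COPIES + WAL_GB_PER_DAY * PITR_DAYS ))
echo "Approximate S3 usage: ${TOTAL} GB"
```

The monthly logical dumps add comparatively little; the weekly base backups dominate.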
Per vertical:
- Healthcare FastAPI (:8084) — agent state in the hc_agent_state table; embeddings of patient intake notes in hc_embeddings (3072-d, OpenAI text-embedding-3-large).
- Real Estate — listing embeddings (1536-d) in re_listings_embed; the 6-container NATS pod's planning state in re_plans.
- Sales — conversation embeddings in sales_convo_embed; PM2 worker session metadata.
- After-hours Bull/Redis queue — Bull's job storage is Redis (snapshotted hourly to S3); state on completion writes to Postgres.
Last quarterly drill: full restore of 480GB DB in 38 minutes. Embedding queries returned correctly within 90 seconds of cluster ready. $1499 enterprise tier on /pricing gets a documented DR plan + restore time SLA. Try the 14-day trial.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Implementation
- Enable WAL archiving.
# postgresql.conf
wal_level = replica
archive_mode = on                        # requires a server restart
archive_command = 'wal-g wal-push %p'
archive_timeout = 60                     # force a segment switch every 60s to bound RPO
- Weekly base backup.
# k8s CronJob
wal-g backup-push /var/lib/postgresql/data
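That one-liner needs a schedule around it. A minimal CronJob sketch; the image, Secret, and PVC names are placeholders, and wal-g additionally needs connection env vars (PGHOST etc.) to trigger the backup, omitted here:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-base-backup
spec:
  schedule: "0 2 * * 0"        # Sunday 02:00 UTC, matching the DR plan
  concurrencyPolicy: Forbid    # never overlap two base backups
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: wal-g
              image: example/wal-g:latest    # placeholder image
              command: ["wal-g", "backup-push", "/var/lib/postgresql/data"]
              envFrom:
                - secretRef:
                    name: walg-s3-creds      # placeholder Secret with S3 config
              volumeMounts:
                - name: pgdata
                  mountPath: /var/lib/postgresql/data
          volumes:
            - name: pgdata
              persistentVolumeClaim:
                claimName: postgres-data     # placeholder PVC
```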
- Logical dump preserves extensions.
pg_dump -Fc \
--extension=vector \
--extension=pg_trgm \
-d callsphere > callsphere.dump
- Restore drill script.
wal-g backup-fetch /restore LATEST
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /restore/postgresql.auto.conf
touch /restore/recovery.signal   # PG 12+: without this, the server won't replay WAL
pg_ctl -D /restore start
# sanity checks once recovery completes
psql -d callsphere -c "SELECT count(*) FROM hc_embeddings;"
psql -d callsphere -c "SELECT id FROM hc_embeddings ORDER BY embedding <-> ARRAY[...]::vector LIMIT 1;"
- Document the embedding-model version in DR runbook. Restoring stale vectors with a new model is silently wrong.
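One way to make that check mechanical rather than a runbook footnote: store the model name next to the vectors and assert it during the restore drill. The table and column names below are illustrative, not CallSphere's actual schema:

```sql
-- Hypothetical metadata table: record which model produced each embedding set.
CREATE TABLE IF NOT EXISTS embedding_model_versions (
    table_name  text PRIMARY KEY,
    model_name  text NOT NULL,          -- e.g. 'text-embedding-3-large'
    dimensions  int  NOT NULL,          -- must match the vector column
    updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Restore drill assertion: fail the drill if the restored vectors came from
-- a different model than the one the application currently calls.
SELECT model_name, dimensions
FROM embedding_model_versions
WHERE table_name = 'hc_embeddings';
```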
FAQ
Q: Does pg_dump support vectors? A: Yes. Vector columns dump like any other column, as their text representation; no special handling is needed. Make sure the destination has the pgvector extension installed at the same or a newer version before restoring.
Q: How long does a 1TB pgvector restore take? A: With wal-g and parallel apply, ~40 min for the base + WAL replay catchup. Index rebuild for HNSW is the long pole — plan for 3–6 hours.
Q: What's RPO for this setup? A: 60 seconds. WAL ships every minute; you lose at most a minute of writes.
Q: Should I use logical replication for DR? A: Pair it with physical streaming. Logical for cross-version migrations; physical for fast cluster failover.
Q: HNSW indexes increase restore time — worth it? A: For < 5M vectors, no — IVFFlat rebuilds faster. For > 10M, HNSW is a must despite the longer rebuild.
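On the HNSW rebuild cost above: two server settings dominate build time. A sketch of the knobs, with values meant as starting points to benchmark rather than recommendations (parallel HNSW builds require pgvector 0.6+):

```
# postgresql.conf (or SET per session before CREATE INDEX / REINDEX)
maintenance_work_mem = '8GB'          # builds are far faster when the graph fits in memory
max_parallel_maintenance_workers = 7  # pgvector >= 0.6 can build HNSW in parallel
```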
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.