Database Backup and Recovery for AI Agent State: Postgres + pgvector
Your agent's memory, embeddings, and conversation state all live in Postgres. Backups must include vector data and survive a full-region loss. Here's how CallSphere does PITR for 115+ tables.
TL;DR — pg_basebackup + WAL archiving covers vector data correctly. The hard part isn't taking backups — it's testing restores. Restore weekly; a restore you've tested is the only restore you know works.
What goes wrong
pgvector won the AI memory war by being boring: it's just Postgres with a vector type. That means standard PostgreSQL backup techniques (pg_dump, base backups, WAL archiving) work for vector columns too. Most teams know this and still get burned because:
- They never test the restore.
- They forget that the embedding model version is implicit — restoring an old DB without re-embedding leaves stale vectors.
- They don't include the pgvector extension version in DR docs.
- PITR window doesn't cover their RPO.
In 2026 most teams need a 35-day PITR window (the maximum Aurora offers) or longer for compliance, and a 1-hour RTO for AI agent state.
The backup strategy
Run three layers:
- Continuous WAL archiving to S3 (or equivalent) for PITR.
- Weekly base backup with pg_basebackup, kept 13 weeks.
- Logical pg_dump monthly, kept 1 year, for portable restores and schema migration tests.
Also: monitor your backups themselves — replication lag, archive failures, restore test success.
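Concretely, "monitor the backups" can be a handful of alert rules. A sketch in Prometheus terms; the metric names are assumptions (typical of a postgres_exporter plus Pushgateway setup), not a drop-in config:

```yaml
groups:
  - name: backup-health
    rules:
      - alert: WALArchiveFailing
        # pg_stat_archiver_failed_count is an assumed exporter metric name
        expr: increase(pg_stat_archiver_failed_count[15m]) > 0
        for: 5m
        annotations:
          summary: "WAL archiving to S3 is failing"
      - alert: ReplicaLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 2m
        annotations:
          summary: "Streaming replica more than 30s behind"
      - alert: RestoreDrillStale
        # assumes the weekly drill pushes a success timestamp to a Pushgateway
        expr: time() - restore_drill_last_success_timestamp > 8 * 86400
        annotations:
          summary: "No successful restore drill in over 8 days"
```

The third rule is the one that matters: an untested backup silently rotting is the failure mode the alerts exist to catch.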
CallSphere stack
CallSphere runs Postgres 16 with pgvector 0.8.0 on a k3s StatefulSet. 115+ tables including agents, tools, calls, messages, embeddings (pgvector with HNSW indexes), and tenant-isolated vertical tables.
DR plan:
- WAL archiving to S3 every 60s via wal-g (archive_timeout = 60). PITR window: 35 days.
- Base backup weekly Sunday 02:00 UTC; 13 retained.
- Logical pg_dump monthly; 12 retained.
- Streaming replica in a second region; lag alert at 30 seconds.
- Restore drill every Friday 11:00 UTC: spin up a fresh cluster from the latest base+WAL, point a synthetic test at it, time-to-running and time-to-correct-vector-search measured.
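Retention like this has a predictable storage bill. A back-of-envelope sketch in shell; the 480 GB base size comes from the last drill, while the daily WAL volume is a made-up placeholder to swap for your own numbers:

```shell
# Rough S3 footprint for the retention policy above.
BASE_GB=480        # size of one base backup (from the last drill)
BASE_COPIES=13     # weekly base backups retained
WAL_GB_PER_DAY=20  # ASSUMED compressed WAL rate -- measure yours
PITR_DAYS=35       # PITR window

TOTAL=$(( BASE_GB * BASE_COPIES + WAL_GB_PER_DAY * PITR_DAYS ))
echo "Approximate S3 usage: ${TOTAL} GB"
```

The monthly logical dumps add comparatively little; the weekly base backups dominate.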
Per vertical:
- Healthcare FastAPI (:8084) — agent state in the hc_agent_state table; embeddings of patient intake notes in hc_embeddings (3072-d, OpenAI text-embedding-3-large).
- Real Estate — listing embeddings (1536-d) in re_listings_embed; the 6-container NATS pod's planning state in re_plans.
- Sales — conversation embeddings in sales_convo_embed; PM2 worker session metadata.
- After-hours Bull/Redis queue — Bull's job storage is Redis (snapshotted hourly to S3); state on completion writes to Postgres.
Last quarterly drill: full restore of 480GB DB in 38 minutes. Embedding queries returned correctly within 90 seconds of cluster ready. $1499 enterprise tier on /pricing gets a documented DR plan + restore time SLA. Try the 14-day trial.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Implementation
- Enable WAL archiving.
# postgresql.conf
wal_level = replica
archive_mode = on                        # requires a server restart
archive_command = 'wal-g wal-push %p'
archive_timeout = 60                     # force a segment switch every 60s to bound RPO
- Weekly base backup.
# k8s CronJob
wal-g backup-push /var/lib/postgresql/data
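That one-liner needs a schedule around it. A minimal CronJob sketch; the image, Secret, and PVC names are placeholders, and wal-g additionally needs connection env vars (PGHOST etc.) to trigger the backup, omitted here:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-base-backup
spec:
  schedule: "0 2 * * 0"        # Sunday 02:00 UTC, matching the DR plan
  concurrencyPolicy: Forbid    # never overlap two base backups
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: wal-g
              image: example/wal-g:latest    # placeholder image
              command: ["wal-g", "backup-push", "/var/lib/postgresql/data"]
              envFrom:
                - secretRef:
                    name: walg-s3-creds      # placeholder Secret with S3 config
              volumeMounts:
                - name: pgdata
                  mountPath: /var/lib/postgresql/data
          volumes:
            - name: pgdata
              persistentVolumeClaim:
                claimName: postgres-data     # placeholder PVC
```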
- Logical dump preserves extensions.
pg_dump -Fc \
--extension=vector \
--extension=pg_trgm \
-d callsphere > callsphere.dump
- Restore drill script.
wal-g backup-fetch /restore LATEST
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /restore/postgresql.auto.conf
touch /restore/recovery.signal   # PG 12+: without this, the server won't replay WAL
pg_ctl -D /restore start
# sanity checks once recovery completes
psql -d callsphere -c "SELECT count(*) FROM hc_embeddings;"
psql -d callsphere -c "SELECT id FROM hc_embeddings ORDER BY embedding <-> ARRAY[...]::vector LIMIT 1;"
- Document the embedding-model version in DR runbook. Restoring stale vectors with a new model is silently wrong.
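One way to make that check mechanical rather than a runbook footnote: store the model name next to the vectors and assert it during the restore drill. The table and column names below are illustrative, not CallSphere's actual schema:

```sql
-- Hypothetical metadata table: record which model produced each embedding set.
CREATE TABLE IF NOT EXISTS embedding_model_versions (
    table_name  text PRIMARY KEY,
    model_name  text NOT NULL,          -- e.g. 'text-embedding-3-large'
    dimensions  int  NOT NULL,          -- must match the vector column
    updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Restore drill assertion: fail the drill if the restored vectors came from
-- a different model than the one the application currently calls.
SELECT model_name, dimensions
FROM embedding_model_versions
WHERE table_name = 'hc_embeddings';
```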
FAQ
Q: Does pg_dump support vectors? A: Yes. Vector columns dump like any other column, as their text representation; no special handling is needed. Make sure the destination has the pgvector extension installed at the same or a newer version before restoring.
Q: How long does a 1TB pgvector restore take? A: With wal-g and parallel apply, ~40 min for the base + WAL replay catchup. Index rebuild for HNSW is the long pole — plan for 3–6 hours.
Q: What's RPO for this setup? A: 60 seconds. WAL ships every minute; you lose at most a minute of writes.
Q: Should I use logical replication for DR? A: Pair it with physical streaming. Logical for cross-version migrations; physical for fast cluster failover.
Q: HNSW indexes increase restore time — worth it? A: For < 5M vectors, no — IVFFlat rebuilds faster. For > 10M, HNSW is a must despite the longer rebuild.
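On the HNSW rebuild cost above: two server settings dominate build time. A sketch of the knobs, with values meant as starting points to benchmark rather than recommendations (parallel HNSW builds require pgvector 0.6+):

```
# postgresql.conf (or SET per session before CREATE INDEX / REINDEX)
maintenance_work_mem = '8GB'          # builds are far faster when the graph fits in memory
max_parallel_maintenance_workers = 7  # pgvector >= 0.6 can build HNSW in parallel
```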
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.