Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi

TL;DR

Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.

This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.

Why Voice + Vision Together

Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:

A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
An insurance claimant photographs damage on a roadside call

In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.

Vapi's Vision Story

Vapi as of 2026-04 has:

No native multimodal input
No image upload primitive
No vision tool
Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context

The workaround works for "describe an image and tell the agent" but loses two things:

Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever

CallSphere Vision Approach

CallSphere's Real Estate Property Search specialist accepts photos via:

MMS through Twilio during the call ("text us the photo at this number")
Web link entered into a portal ("upload at callsphere.example/upload?call=...")
Returning user photo history pulled from Postgres on caller-ID match

The flow:

User says "I want a kitchen like the one I just texted you"
Twilio MMS webhook stores the image in S3, emits a photo_received event tagged with the active call ID
The agent sees a photo_available signal in its context and calls vision_analyze
vision_analyze invokes GPT-4o with the image plus a structured prompt asking for cabinet color, countertop material, flooring material, layout type, lighting style, a square footage estimate, and notable features
Returns structured JSON {cabinet_color: "white", countertop_material: "marble", layout_type: "galley", ...}
Agent calls search_listings with the structured features as filters
Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
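
The ingest half of this flow (the MMS webhook and the photo_received event) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not CallSphere's implementation: ACTIVE_CALLS, PHOTO_STORE, PhotoReceived, and handle_mms_webhook are hypothetical names, S3 is replaced by an in-memory dict, and the call is matched by caller number. From, NumMedia, MediaUrl0, and MediaContentType0 are the form fields Twilio includes on an incoming MMS webhook.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory stand-ins for the live-call registry and S3.
ACTIVE_CALLS: dict[str, str] = {"+14155550100": "call_abc"}  # caller number -> call ID
PHOTO_STORE: dict[str, dict] = {}  # photo_id -> stored metadata


@dataclass
class PhotoReceived:
    """Event the agent runtime would surface as a photo_available signal."""
    call_id: str
    photo_id: str
    content_type: str


def handle_mms_webhook(form: dict[str, str]) -> Optional[PhotoReceived]:
    """Turn a Twilio incoming-message form payload into a photo_received event."""
    if int(form.get("NumMedia", "0")) < 1:
        return None  # plain SMS, nothing to analyze
    call_id = ACTIVE_CALLS.get(form["From"])
    if call_id is None:
        return None  # no live call for this number; a real system might queue it
    photo_id = f"photo_{uuid.uuid4().hex[:12]}"
    # A real handler would download MediaUrl0 and write the bytes to S3 here.
    PHOTO_STORE[photo_id] = {"source_url": form["MediaUrl0"], "call_id": call_id}
    return PhotoReceived(call_id, photo_id, form.get("MediaContentType0", "image/jpeg"))


event = handle_mms_webhook({
    "From": "+14155550100",
    "NumMedia": "1",
    "MediaUrl0": "https://example.invalid/media/1",
    "MediaContentType0": "image/jpeg",
})
```

Tagging the event with the call ID at ingest time is what lets the agent treat the photo as part of the live conversation rather than as a separate message thread.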

Tool Schema

export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
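
Before dispatching a tool call, a runtime would typically check the model-supplied arguments against this schema. The sketch below mirrors the two required fields and the enum in Python; validate_vision_args and known_photo_ids are hypothetical names used for illustration, not part of CallSphere's API.

```python
ANALYSIS_FOCUS = {"kitchen", "bathroom", "exterior", "living_space", "general"}


def validate_vision_args(args: dict, known_photo_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the call can proceed."""
    errors = []
    photo_id = args.get("photo_id")
    if not isinstance(photo_id, str) or not photo_id:
        errors.append("photo_id must be a non-empty string")
    elif photo_id not in known_photo_ids:
        # Mirrors "Only call after photo_available" from the tool description.
        errors.append(f"no photo_available event seen for {photo_id}")
    focus = args.get("analysis_focus")
    if not isinstance(focus, str) or focus not in ANALYSIS_FOCUS:
        errors.append("analysis_focus must be one of " + ", ".join(sorted(ANALYSIS_FOCUS)))
    return errors


ok = validate_vision_args({"photo_id": "photo_1", "analysis_focus": "kitchen"}, {"photo_1"})
bad = validate_vision_args({"photo_id": "photo_9", "analysis_focus": "garage"}, {"photo_1"})
```

Here ok is an empty list, while bad carries one error for the unknown photo ID and one for the unsupported focus value.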

The Vision Prompt

The prompt that vision_analyze sends to GPT-4o is intentionally narrow:

You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:

  cabinet_color: string | null
  countertop_material: string | null
  flooring_material: string | null
  layout_type: string | null  // e.g., "galley", "open", "u-shape"
  lighting_style: string | null  // e.g., "pendant", "recessed", "natural"
  estimated_sqft: number | null  // null if not estimable
  notable_features: string[]  // max 5

Do not return prose. Do not add keys. Use null for unknown.

The strict-JSON contract is enforced via OpenAI's Structured Outputs feature. When the model cannot determine a feature, that field comes back as null, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material. Can you tell me?").
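
One way to express that contract is a JSON Schema passed as the response_format of the model call, plus a small parser that turns null fields into a follow-up question. The sketch below is an assumption about how this could look, not CallSphere's code: FEATURE_SCHEMA, parse_features, and follow_up are hypothetical names, the model call itself is omitted, and the "max 5" cap on notable_features is applied after parsing rather than in the schema.

```python
import json
from typing import Optional


def nullable(json_type: str) -> dict:
    """JSON Schema fragment for a value that may be null."""
    return {"type": [json_type, "null"]}


FEATURE_SCHEMA = {
    "type": "object",
    "properties": {
        "cabinet_color": nullable("string"),
        "countertop_material": nullable("string"),
        "flooring_material": nullable("string"),
        "layout_type": nullable("string"),
        "lighting_style": nullable("string"),
        "estimated_sqft": nullable("number"),
        "notable_features": {"type": "array", "items": {"type": "string"}},
    },
    "additionalProperties": False,  # "Do not add keys"
}
# Strict mode expects every property to be listed as required.
FEATURE_SCHEMA["required"] = list(FEATURE_SCHEMA["properties"])

RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "property_features", "strict": True, "schema": FEATURE_SCHEMA},
}


def parse_features(raw: str) -> tuple[dict, list[str]]:
    """Parse the model's JSON reply; return (features, names of null fields)."""
    features = json.loads(raw)
    features["notable_features"] = (features.get("notable_features") or [])[:5]
    missing = [k for k in FEATURE_SCHEMA["properties"] if features.get(k) is None]
    return features, missing


def follow_up(missing: list[str]) -> Optional[str]:
    """Ask the caller about the first feature the model could not determine."""
    if not missing:
        return None
    label = missing[0].replace("_", " ")
    return f"I could see the photo but couldn't make out the {label}. Can you tell me?"


raw_reply = json.dumps({
    "cabinet_color": "white", "countertop_material": None, "flooring_material": "oak",
    "layout_type": "galley", "lighting_style": "pendant", "estimated_sqft": None,
    "notable_features": ["island"],
})
features, missing = parse_features(raw_reply)
question = follow_up(missing)
```

With this sample reply, missing contains countertop_material and estimated_sqft, so the agent would ask about the countertop first.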

Returning Visual Features to Search

The structured features become filters:

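# Extract structured features from the buyer's photo (argument name matches the tool schema).
features = await vision_analyze(photo_id, analysis_focus="kitchen")

# Combine the extracted features with the buyer's existing search criteria.
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    sort_by="visual_similarity",  # rank by image-embedding similarity to the photo
)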

The visual_similarity sort ranks listings by the distance between an embedding of the buyer's photo and a CLIP-style image embedding stored on each property record.
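
The ranking step can be illustrated with a toy example. The source does not say which distance metric is used, so cosine similarity here is an assumption, as are the function names, the image_embedding field, and the three-dimensional vectors (real CLIP-style embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_visual_similarity(query_embedding: list[float], listings: list[dict]) -> list[dict]:
    """Sort listings so the most visually similar one comes first."""
    return sorted(
        listings,
        key=lambda listing: cosine_similarity(query_embedding, listing["image_embedding"]),
        reverse=True,
    )


listings = [
    {"id": "A", "image_embedding": [0.9, 0.1, 0.0]},
    {"id": "B", "image_embedding": [0.0, 1.0, 0.0]},
    {"id": "C", "image_embedding": [0.7, 0.7, 0.0]},
]
ranked = rank_by_visual_similarity([1.0, 0.0, 0.0], listings)
ranked_ids = [listing["id"] for listing in ranked]  # A, then C, then B
```

In practice this comparison would more likely run inside a vector index than in application code, but the ordering logic is the same.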

Vapi vs CallSphere Vision Comparison

Dimension	Vapi	CallSphere
Native vision	No	Yes (GPT-4o)
Image input channel	Out-of-band, DIY	MMS, web link, history
Latency to first vision answer	1-2s extra (external)	600-900ms inline
Grounding	Text description proxy	Direct image reasoning
Structured output	DIY parsing	OpenAI structured output
Multi-image conversation	Awkward	Native; agent tracks photo set
Privacy	Image touches 2 vendors	OpenAI is the only model vendor that sees the image (it arrives via Twilio and is stored in S3)
Use case fit	Voice-only	Voice + visual context

Vision-Enriched Search Flow

sequenceDiagram
    participant Buyer
    participant Twilio
    participant Agent as Property Search Agent
    participant Vision as GPT-4o Vision
    participant DB as Listings DB

    Buyer->>Agent: "I want a kitchen like this"
    Agent->>Buyer: "Text the photo to (415) 555-0123"
    Buyer->>Twilio: MMS with photo
    Twilio->>Agent: photo_received event
    Agent->>Agent: photo_available signal in context
    Agent->>Vision: vision_analyze(photo_id, analysis_focus=kitchen)
    Vision-->>Agent: { cabinet_color: "white", countertop_material: "marble", ... }
    Agent->>DB: search_listings(city, beds, feature_filters)
    DB-->>Agent: 4 matches sorted by visual_similarity
    Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
    Buyer->>Agent: "Tell me about the second one"
    Agent->>DB: get_listing_details(id)
    Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."

Other Vertical Use Cases

The vision primitive in CallSphere generalizes:

Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA

Each is a thin variant of the Real Estate pattern.

Privacy and Security

Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
Default retention 30 days, configurable
Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output

FAQ

Does vision_analyze block the conversation?

No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
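
The overlap between filler audio and the vision call can be sketched with asyncio. This is an illustration only: speak and the stubbed vision_analyze are hypothetical stand-ins, and the sleep durations are placeholders rather than measured latencies.

```python
import asyncio


async def speak(text: str, log: list[str]) -> None:
    """Hypothetical TTS call; here it only records what would be spoken."""
    log.append(f"say: {text}")
    await asyncio.sleep(0.01)  # stands in for audio playback time


async def vision_analyze(photo_id: str) -> dict:
    """Stub for the vision tool; returns a fixed result after a short delay."""
    await asyncio.sleep(0.05)  # stands in for the vision model round trip
    return {"cabinet_color": "white"}


async def analyze_with_filler(photo_id: str, log: list[str]) -> dict:
    # Start the vision call first so it runs while the filler line plays.
    task = asyncio.create_task(vision_analyze(photo_id))
    await speak("Let me look at that photo.", log)
    features = await task
    log.append("vision done")
    return features


log: list[str] = []
features = asyncio.run(analyze_with_filler("photo_1", log))
```

Because the task is created before the filler line is awaited, the caller hears speech immediately and the perceived gap is only whatever part of the vision call outlasts the filler.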

What if the buyer sends a non-property photo (selfie, etc.)?

The vision model returns mostly nulls for the structured fields, and the agent responds with something like "that doesn't look like a property photo, can you check?"
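
A simple version of that check counts how many scalar fields came back non-null. The helper name looks_like_property and the threshold of two known fields are assumptions made for illustration; the source does not state how "mostly nulls" is decided.

```python
SCALAR_FIELDS = (
    "cabinet_color", "countertop_material", "flooring_material",
    "layout_type", "lighting_style", "estimated_sqft",
)


def looks_like_property(features: dict, min_known: int = 2) -> bool:
    """Treat the photo as a property photo if enough scalar fields are non-null."""
    known = sum(1 for name in SCALAR_FIELDS if features.get(name) is not None)
    return known >= min_known


# A selfie: every scalar field is null, even if notable_features is populated.
selfie = {**{name: None for name in SCALAR_FIELDS}, "notable_features": ["person"]}
kitchen = {**selfie, "cabinet_color": "white", "layout_type": "galley"}

selfie_ok = looks_like_property(selfie)    # False
kitchen_ok = looks_like_property(kitchen)  # True
```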

Can vision be used on the LLM's own outputs?

Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.

Is multi-image conversation supported?

Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").

Is this MMS-only, or can it work over WhatsApp?

WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.

See the Vision Demo

The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.