Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi

By Sagar Shankaran, Founder of CallSphere

Quick answer

CallSphere Real Estate runs GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only; this article covers what that difference means in practice.

Key takeaways

TL;DR

Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.

This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.

Why Voice + Vision Together

Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:

  • A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
  • A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
  • An insurance claimant photographs damage on a roadside call

In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.

Vapi's Vision Story

Vapi as of 2026-04 has:

  • No native multimodal input
  • No image upload primitive
  • No vision tool
  • Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context

The workaround works for "describe an image and tell the agent" but loses two things:

  1. Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
  2. Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever

CallSphere Vision Approach

CallSphere's Real Estate Property Search specialist accepts photos via:

  • MMS through Twilio during the call ("text us the photo at this number")
  • Web link entered into a portal ("upload at callsphere.example/upload?call=...")
  • Returning user photo history pulled from Postgres on caller-ID match

The flow:

  1. User says "I want a kitchen like the one I just texted you"
  2. Twilio MMS webhook stores the image in S3, emits a photo_received event tagged with the active call ID
  3. The agent sees a photo_available signal in its context and calls vision_analyze
  4. vision_analyze invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, flooring material, layout type, lighting style, square footage estimate"
  5. Returns structured JSON {cabinet_color: "white", countertop_material: "marble", layout_type: "galley", ...}
  6. Agent calls search_listings with the structured features as filters
  7. Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
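
Steps 2 and 3 of the flow above can be sketched roughly as follows. This is a minimal illustration, not CallSphere's implementation: the `PhotoIntake` class, the in-memory `active_calls` registry, and the `stored` dict (standing in for the S3 copy) are all hypothetical. The form field names (`From`, `MessageSid`, `NumMedia`, `MediaUrl0`, `MediaContentType0`) are the ones Twilio sends on inbound message webhooks.

```python
from dataclasses import dataclass, field


@dataclass
class PhotoEvent:
    """Event handed to the agent runtime when a photo arrives mid-call."""
    type: str
    call_id: str
    photo_id: str
    content_type: str


@dataclass
class PhotoIntake:
    # Caller phone number -> active call ID (hypothetical in-memory registry).
    active_calls: dict[str, str]
    # photo_id -> source media URL; stands in for the S3 copy described above.
    stored: dict[str, str] = field(default_factory=dict)
    events: list[PhotoEvent] = field(default_factory=list)

    def handle_mms(self, form: dict[str, str]) -> list[PhotoEvent]:
        """Process one inbound-message webhook payload (form fields as a dict)."""
        call_id = self.active_calls.get(form.get("From", ""))
        if call_id is None:
            return []  # no live call for this sender, so nothing to tag
        emitted = []
        for i in range(int(form.get("NumMedia", "0"))):
            content_type = form.get(f"MediaContentType{i}", "")
            if not content_type.startswith("image/"):
                continue  # skip vCards, audio, and other non-image media
            photo_id = f"{form['MessageSid']}-{i}"
            self.stored[photo_id] = form[f"MediaUrl{i}"]
            emitted.append(
                PhotoEvent("photo_received", call_id, photo_id, content_type)
            )
        self.events.extend(emitted)
        return emitted
```

Matching on the sender's number is one plausible way to tag the photo with the active call ID; a production system would also need to handle callers who text from a different number than the one they are calling from.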

Tool Schema

export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
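
Before dispatching a tool call like this, the runtime will typically want to check the arguments the model produced. The sketch below shows one way to do that in Python; `validate_vision_args` and the `available_photo_ids` set are hypothetical names, and the enum values are copied from the schema above.

```python
ANALYSIS_FOCUS = ("kitchen", "bathroom", "exterior", "living_space", "general")


def validate_vision_args(args: dict, available_photo_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    photo_id = args.get("photo_id")
    if not isinstance(photo_id, str) or not photo_id:
        problems.append("photo_id is required and must be a non-empty string")
    elif photo_id not in available_photo_ids:
        # Enforces the "only call after photo_available" rule from the description.
        problems.append(f"photo_id {photo_id!r} was never announced via photo_available")
    focus = args.get("analysis_focus")
    if focus not in ANALYSIS_FOCUS:
        problems.append(f"analysis_focus must be one of: {', '.join(ANALYSIS_FOCUS)}")
    return problems
```

Returning the problems as text, rather than raising, lets the runtime feed them back to the model as a tool error so it can retry with corrected arguments.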

The Vision Prompt

The prompt the agent ships to GPT-4o is intentionally narrow:

You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:

  cabinet_color: string | null
  countertop_material: string | null
  flooring_material: string | null
  layout_type: string | null  // e.g., "galley", "open", "u-shape"
  lighting_style: string | null  // e.g., "pendant", "recessed", "natural"
  estimated_sqft: number | null  // null if not estimable
  notable_features: string[]  // max 5

Do not return prose. Do not add keys. Use null for unknown.

The strict-JSON contract is enforced via OpenAI's Structured Outputs feature. When the model cannot determine a feature, that field comes back as null, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material. Can you tell me?").
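
As an illustration, the prompt's key list could be expressed as a strict JSON schema and paired with a defensive parser. This is a sketch under assumptions: the article does not show how the schema is wired up, so `build_response_format` and `parse_features` are hypothetical helpers, and the `response_format` shape follows the `json_schema` form used by OpenAI's Chat Completions API. No network call is made here.

```python
import json

# Scalar keys from the prompt above, with their JSON types.
FEATURE_KEYS = {
    "cabinet_color": "string",
    "countertop_material": "string",
    "flooring_material": "string",
    "layout_type": "string",
    "lighting_style": "string",
    "estimated_sqft": "number",
}


def build_response_format() -> dict:
    """Build a strict schema: every key required, nullable, no extra keys."""
    properties = {k: {"type": [t, "null"]} for k, t in FEATURE_KEYS.items()}
    properties["notable_features"] = {"type": "array", "items": {"type": "string"}}
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "property_features",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
                "additionalProperties": False,
            },
        },
    }


def parse_features(raw: str) -> tuple[dict, list[str]]:
    """Return (features, missing_keys); unparseable output counts as all-null."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    features = {k: data.get(k) for k in FEATURE_KEYS}
    notable = data.get("notable_features")
    if not isinstance(notable, list):
        notable = []
    features["notable_features"] = [str(x) for x in notable][:5]  # prompt caps at 5
    missing = [k for k in FEATURE_KEYS if features[k] is None]
    return features, missing
```

The `missing` list is what an agent could use to decide which follow-up question to ask, for example about the countertop material when only that field is null.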

The structured features become filters:

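# Argument names mirror the vision_analyze tool schema (photo_id, analysis_focus).
features = await vision_analyze(photo_id=photo_id, analysis_focus="kitchen")

# Combine the caller's existing search filters with the features read from the photo.
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    sort_by="visual_similarity",
)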

The visual_similarity sort ranks listings by embedding distance to the buyer's photo using a CLIP-style listing image embedding stored on each property record.
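
A ranking of this kind can be sketched in a few lines. The article does not say which distance metric is used, so cosine similarity is an assumption here, as are the function names and the `image_embedding` field on each listing.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_visual_similarity(query: list[float], listings: list[dict]) -> list[dict]:
    """Sort listings so the image embedding closest to the query comes first."""
    return sorted(
        listings,
        key=lambda listing: cosine_similarity(query, listing["image_embedding"]),
        reverse=True,
    )
```

At real catalog sizes this linear scan would usually be replaced by a vector index, but the ordering logic is the same.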

Vapi vs CallSphere Vision Comparison

| Dimension | Vapi | CallSphere |
| --- | --- | --- |
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |

Vision-Enriched Search Flow

sequenceDiagram
    participant Buyer
    participant Twilio
    participant Agent as Property Search Agent
    participant Vision as GPT-4o Vision
    participant DB as Listings DB

    Buyer->>Agent: "I want a kitchen like this"
    Agent->>Buyer: "Text the photo to (415) 555-0123"
    Buyer->>Twilio: MMS with photo
    Twilio->>Agent: photo_received event
    Agent->>Agent: photo_available signal in context
    Agent->>Vision: vision_analyze(photo_id, analysis_focus=kitchen)
    Vision-->>Agent: { cabinet_color: "white", countertop_material: "marble", ... }
    Agent->>DB: search_listings(city, beds, feature_filters)
    DB-->>Agent: 4 matches sorted by visual_similarity
    Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
    Buyer->>Agent: "Tell me about the second one"
    Agent->>DB: get_listing_details(id)
    Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."

Other Vertical Use Cases

The vision primitive in CallSphere generalizes:

  • Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
  • Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
  • Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA

Each is a thin variant of the Real Estate pattern.

Privacy and Security

  • Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
  • Default retention 30 days, configurable
  • Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
  • The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output

FAQ

Does vision_analyze block the conversation?

No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
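
The overlap between filler audio and the vision call can be illustrated with asyncio. This is a sketch only: `analyze_with_filler` is a hypothetical helper, and `vision_call` and `speak` stand in for whatever async functions the runtime actually exposes.

```python
import asyncio


async def analyze_with_filler(vision_call, speak, filler="Let me look at that photo."):
    """Schedule the vision request, speak filler while it runs, return the result."""
    task = asyncio.create_task(vision_call())  # runs concurrently with the filler
    await speak(filler)                        # caller hears this during the wait
    return await task                          # resolves once vision has finished
```

Because the vision request is scheduled before the filler is spoken, the perceived gap is roughly the longer of the two durations rather than their sum.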

What if the buyer sends a non-property photo (selfie, etc.)?

The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"
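
One simple way to turn "mostly nulls" into a decision is to count the non-null scalar fields. The helper name and the threshold of two known fields are assumptions made for illustration; the article does not state what cutoff is used.

```python
def looks_like_property_photo(features: dict, min_known: int = 2) -> bool:
    """Heuristic: the photo is usable if enough scalar fields are non-null."""
    scalar_keys = (
        "cabinet_color", "countertop_material", "flooring_material",
        "layout_type", "lighting_style", "estimated_sqft",
    )
    known = sum(1 for key in scalar_keys if features.get(key) is not None)
    return known >= min_known
```

When this returns False, the agent would ask the caller to check the photo instead of running a search on empty filters.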

Can vision be used on the LLM's own outputs?

Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.

Is multi-image conversation supported?

Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").
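
A per-call photo set could look something like the sketch below. `PhotoSet` and its methods are hypothetical; the idea is only that features are kept in arrival order so "the one you sent first" can be resolved by position.

```python
from dataclasses import dataclass, field


@dataclass
class PhotoSet:
    """Photos received during one call, in arrival order, with their features."""
    call_id: str
    photos: list[tuple[str, dict]] = field(default_factory=list)

    def add(self, photo_id: str, features: dict) -> None:
        self.photos.append((photo_id, features))

    def compare(self, first: int, second: int) -> dict:
        """Return keys whose values differ between two photos, chosen by position."""
        a, b = self.photos[first][1], self.photos[second][1]
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)
        }
```

For example, `compare(0, -1)` contrasts the first photo of the call with the most recent one.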

Is this MMS-only, or can it work over WhatsApp?

WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.

See the Vision Demo

The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.

Written by

Sagar Shankaran · Founder, CallSphere

Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
