
Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi

How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice.

TL;DR

Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.

This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.

Why Voice + Vision Together

Most voice AI platforms are pipelines that turn a stream of text tokens into audio. They lack vision because their original product surface, the phone call, had no way to carry an image. But customer expectations have moved:

  • A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
  • A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
  • An insurance claimant photographs damage on a roadside call

In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.

Vapi's Vision Story

Vapi as of 2026-04 has:

  • No native multimodal input
  • No image upload primitive
  • No vision tool
  • Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context

The workaround works for "describe an image and tell the agent" but loses two things:

  1. Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
  2. Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever

CallSphere Vision Approach

CallSphere's Real Estate Property Search specialist accepts photos via:

  • MMS through Twilio during the call ("text us the photo at this number")
  • Web link entered into a portal ("upload at callsphere.example/upload?call=...")
  • Returning user photo history pulled from Postgres on caller-ID match

The flow:

  1. User says "I want a kitchen like the one I just texted you"
  2. Twilio MMS webhook stores the image in S3, emits a photo_received event tagged with the active call ID
  3. The agent sees a photo_available signal in its context and calls vision_analyze
  4. vision_analyze invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, flooring material, layout type, lighting style, square footage estimate"
  5. Returns structured JSON {cabinet_color: "white", countertop_material: "marble", layout_type: "galley", ...}
  6. Agent calls search_listings with the structured features as filters
  7. Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
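
Steps 2 and 3 of the flow can be sketched as a small webhook handler. This is a minimal sketch under stated assumptions, not CallSphere's implementation: `handle_mms_webhook`, `PhotoReceived`, the `active_calls` lookup, and the `store_image` callback are hypothetical names introduced for illustration. `From`, `NumMedia`, `MediaUrl{N}`, and `MediaContentType{N}` are the form fields Twilio sends on inbound message webhooks.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping


@dataclass(frozen=True)
class PhotoReceived:
    """Hypothetical photo_received event payload; field names are illustrative."""
    call_id: str
    photo_id: str
    content_type: str


def handle_mms_webhook(
    form: Mapping[str, str],
    active_calls: Mapping[str, str],
    store_image: Callable[[str], str],
) -> List[PhotoReceived]:
    """Turn one inbound-MMS webhook into photo_received events.

    form         -- Twilio's form-encoded webhook parameters
    active_calls -- caller phone number -> active call ID (assumed lookup)
    store_image  -- copies the media URL into storage (S3 in the article) and
                    returns a photo ID; injected so the sketch stays testable
    """
    call_id = active_calls.get(form.get("From", ""))
    if call_id is None:
        return []  # no live call for this sender, so there is nothing to tag

    events = []
    for i in range(int(form.get("NumMedia", "0"))):
        content_type = form.get(f"MediaContentType{i}", "")
        if not content_type.startswith("image/"):
            continue  # skip videos, contact cards, and other attachments
        photo_id = store_image(form[f"MediaUrl{i}"])
        events.append(PhotoReceived(call_id, photo_id, content_type))
    return events
```

Tagging each event with the call ID at ingestion time is what lets the agent treat a later photo_available signal as belonging to the conversation it is currently in.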

Tool Schema

export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
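
The schema tells the model what to send, but a server-side check before dispatch is still useful, since the description's "only call after photo_available" rule is not something a JSON schema can express. The following is a sketch in Python (the language of the article's search snippet); `validate_vision_args` and `available_photo_ids` are hypothetical names, and the enum values are copied from the schema above.

```python
from typing import List, Set

# Mirrors the analysis_focus enum in the tool schema.
ANALYSIS_FOCUS = {"kitchen", "bathroom", "exterior", "living_space", "general"}


def validate_vision_args(args: dict, available_photo_ids: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []

    photo_id = args.get("photo_id")
    if not isinstance(photo_id, str) or not photo_id:
        problems.append("photo_id must be a non-empty string")
    elif photo_id not in available_photo_ids:
        # Enforces "only call after photo_available" on the server side.
        problems.append(f"photo_id {photo_id!r} was never announced via photo_available")

    focus = args.get("analysis_focus")
    if not isinstance(focus, str) or focus not in ANALYSIS_FOCUS:
        problems.append(f"analysis_focus must be one of {sorted(ANALYSIS_FOCUS)}")

    return problems
```

Returning the problems as text, rather than raising, lets the tool result be handed back to the model so it can correct its own call.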

The Vision Prompt

The prompt the agent ships to GPT-4o is intentionally narrow:

You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:

  cabinet_color: string | null
  countertop_material: string | null
  flooring_material: string | null
  layout_type: string | null  // e.g., "galley", "open", "u-shape"
  lighting_style: string | null  // e.g., "pendant", "recessed", "natural"
  estimated_sqft: number | null  // null if not estimable
  notable_features: string[]  // max 5

Do not return prose. Do not add keys. Use null for unknown.

The strict-JSON contract is enforced via OpenAI's structured output, so the response always has exactly these keys. Any feature the model cannot determine comes back as null, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material. Can you tell me?").
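
The prompt's key list can be written down as a JSON Schema, with a small helper that turns null fields into the clarifying question quoted above. This is a sketch under assumptions: `FEATURE_SCHEMA`, `parse_features`, `missing_features`, and `follow_up_question` are hypothetical names, the request to the model is omitted, and the "max 5" cap is applied in code here rather than in the schema.

```python
import json
from typing import List, Optional

_NULLABLE_STRING = {"type": ["string", "null"]}

# Mirrors the prompt's key list. Every key is required and extra keys are
# forbidden, so "unknown" is expressed as null rather than by omitting a key.
FEATURE_SCHEMA = {
    "type": "object",
    "properties": {
        "cabinet_color": _NULLABLE_STRING,
        "countertop_material": _NULLABLE_STRING,
        "flooring_material": _NULLABLE_STRING,
        "layout_type": _NULLABLE_STRING,
        "lighting_style": _NULLABLE_STRING,
        "estimated_sqft": {"type": ["number", "null"]},
        "notable_features": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "cabinet_color", "countertop_material", "flooring_material",
        "layout_type", "lighting_style", "estimated_sqft", "notable_features",
    ],
    "additionalProperties": False,
}


def parse_features(raw: str) -> dict:
    """Parse the model's JSON reply and apply the prompt's 'max 5' cap."""
    data = json.loads(raw)
    features = {key: data.get(key) for key in FEATURE_SCHEMA["required"]}
    features["notable_features"] = (features["notable_features"] or [])[:5]
    return features


def missing_features(features: dict, wanted: List[str]) -> List[str]:
    """Names of the wanted features that came back null."""
    return [key for key in wanted if features.get(key) is None]


def follow_up_question(focus: str, missing: List[str]) -> Optional[str]:
    """Build the spoken clarification, or None when nothing is missing."""
    if not missing:
        return None
    readable = " or ".join(name.replace("_", " ") for name in missing)
    return f"I could see the {focus} but couldn't make out the {readable}. Can you tell me?"
```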

The structured features become filters:

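# Runs inside the agent's async tool loop; ctx holds the caller's saved filters.
features = await vision_analyze(photo_id, analysis_focus="kitchen")
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    # Map the extracted features onto listing attributes.
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    # Rank the filtered listings by closeness to the buyer's photo.
    sort_by="visual_similarity",
)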

The visual_similarity sort ranks listings by embedding distance to the buyer's photo using a CLIP-style listing image embedding stored on each property record.
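
As an illustration of that ranking step, here is a minimal sketch that sorts listings by cosine distance between the buyer photo's embedding and each listing's stored embedding. The article only says "embedding distance", so cosine distance is an assumption, as are the `image_embedding` field name and both function names. Real embeddings would have hundreds of dimensions; plain lists of floats stand in for them here.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # a zero vector carries no direction, so treat it as unrelated
    return 1.0 - dot / norm


def rank_by_visual_similarity(query_embedding: Sequence[float], listings: List[dict]) -> List[dict]:
    """Return listings ordered from most to least visually similar."""
    return sorted(
        listings,
        key=lambda listing: cosine_distance(query_embedding, listing["image_embedding"]),
    )
```

Because the structured filters run first, the distance computation only has to cover the listings that already match on cabinet color and countertop, not the whole catalog.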

Vapi vs CallSphere Vision Comparison

| Dimension | Vapi | CallSphere |
| --- | --- | --- |
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |

Vision-Enriched Search Flow

sequenceDiagram
    participant Buyer
    participant Twilio
    participant Agent as Property Search Agent
    participant Vision as GPT-4o Vision
    participant DB as Listings DB

    Buyer->>Agent: "I want a kitchen like this"
    Agent->>Buyer: "Text the photo to (415) 555-0123"
    Buyer->>Twilio: MMS with photo
    Twilio->>Agent: photo_received event
    Agent->>Agent: photo_available signal in context
    Agent->>Vision: vision_analyze(photo_id, analysis_focus=kitchen)
    Vision-->>Agent: { cabinet_color: "white", countertop_material: "marble", ... }
    Agent->>DB: search_listings(city, beds, feature_filters)
    DB-->>Agent: 4 matches sorted by visual_similarity
    Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
    Buyer->>Agent: "Tell me about the second one"
    Agent->>DB: get_listing_details(id)
    Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."

Other Vertical Use Cases

The vision primitive in CallSphere generalizes:

  • Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
  • Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
  • Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA

Each is a thin variant of the Real Estate pattern.

Privacy and Security

  • Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
  • Default retention 30 days, configurable
  • Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
  • The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output

FAQ

Does vision_analyze block the conversation?

No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
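
The overlap between filler audio and the vision call can be sketched with asyncio. This is an illustrative sketch, not the production code path: `answer_with_filler` is a hypothetical name, and `analyze` and `speak` are caller-supplied coroutines standing in for the vision tool and the text-to-speech output.

```python
import asyncio
from typing import Awaitable, Callable


async def answer_with_filler(
    analyze: Callable[[str], Awaitable[dict]],
    speak: Callable[[str], Awaitable[None]],
    photo_id: str,
) -> dict:
    """Start the vision call, speak filler while it runs, then return the features."""
    # create_task schedules the vision call to run concurrently
    # instead of waiting for it before saying anything.
    vision_task = asyncio.create_task(analyze(photo_id))
    await speak("Let me look at that photo.")
    # By the time the filler finishes, some or all of the vision latency has elapsed.
    return await vision_task
```

With this shape, the caller hears silence only for whatever portion of the vision call outlasts the filler phrase.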

What if the buyer sends a non-property photo (selfie, etc.)?

The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"
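
One way to turn "mostly nulls" into a decision is to count how many scalar features came back non-null. The sketch below assumes the key names from the vision prompt; the function name and the `min_known` threshold of 2 are illustrative assumptions, not documented CallSphere behavior.

```python
# Scalar keys from the vision prompt; notable_features is a list and is left out.
SCALAR_KEYS = (
    "cabinet_color",
    "countertop_material",
    "flooring_material",
    "layout_type",
    "lighting_style",
    "estimated_sqft",
)


def looks_like_property_photo(features: dict, min_known: int = 2) -> bool:
    """True when at least min_known scalar features were extracted."""
    known = sum(1 for key in SCALAR_KEYS if features.get(key) is not None)
    return known >= min_known
```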

Can vision be used on the LLM's own outputs?

Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.

Is multi-image conversation supported?

Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").
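
A per-call photo set needs little more than an ordered map from photo ID to extracted features. The following is a hypothetical sketch (the `PhotoSet` class and its method names are not from CallSphere) showing how "this kitchen vs the one you sent first" could reduce to a feature diff.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class PhotoSet:
    """Analyzed photos for one call, kept in arrival order."""
    features_by_photo: Dict[str, dict] = field(default_factory=dict)

    def add(self, photo_id: str, features: dict) -> None:
        # Python dicts preserve insertion order, so arrival order is kept for free.
        self.features_by_photo[photo_id] = features

    def first(self) -> str:
        """ID of the first photo the caller sent."""
        return next(iter(self.features_by_photo))

    def differences(self, a: str, b: str) -> Dict[str, Tuple[object, object]]:
        """Features whose values differ between photos a and b, as (a, b) pairs."""
        fa, fb = self.features_by_photo[a], self.features_by_photo[b]
        return {
            key: (fa.get(key), fb.get(key))
            for key in sorted(set(fa) | set(fb))
            if fa.get(key) != fb.get(key)
        }
```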

Is this MMS-only, or can it work over WhatsApp?

WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.

See the Vision Demo

The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.
