Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi
How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice.
TL;DR
Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.
This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.
Why Voice + Vision Together
Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:
- A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
- A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
- An insurance claimant photographs damage on a roadside call
In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.
Vapi's Vision Story
As of April 2026, Vapi has:
- No native multimodal input
- No image upload primitive
- No vision tool
- Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context
The workaround works for "describe an image and tell the agent" but loses two things:
- Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
- Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever
CallSphere Vision Approach
CallSphere's Real Estate Property Search specialist accepts photos via:
- MMS through Twilio during the call ("text us the photo at this number")
- Web link entered into a portal ("upload at callsphere.example/upload?call=...")
- Returning user photo history pulled from Postgres on caller-ID match
The flow:
- User says "I want a kitchen like the one I just texted you"
- Twilio MMS webhook stores the image in S3 and emits a photo_received event tagged with the active call ID
- The agent sees a photo_available signal in its context and calls vision_analyze
- vision_analyze invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, layout type, ceiling height estimate, lighting style, square footage estimate"
- GPT-4o returns structured JSON: {cabinet_color: "white", countertop: "marble", layout: "galley", ...}
- The agent calls search_listings with the structured features as filters
- The agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
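The MMS-to-event step above can be sketched as a pure handler. Twilio's standard webhook parameters (From, NumMedia, MediaUrl0) are real; the activeCalls map and the PhotoReceivedEvent shape are illustrative assumptions, not CallSphere's actual API:

```typescript
// Maps an inbound Twilio MMS webhook to a photo_received event for the
// active call. activeCalls (caller number -> call ID) and the event
// shape are hypothetical stand-ins for the real pipeline.
interface PhotoReceivedEvent {
  type: "photo_received";
  callId: string;
  photoUrl: string;
  receivedAt: string;
}

function handleMmsWebhook(
  params: Record<string, string>,
  activeCalls: Map<string, string>,
): PhotoReceivedEvent | null {
  const callId = activeCalls.get(params["From"] ?? "");
  // Ignore messages with no media or no matching live call.
  if (!callId || Number(params["NumMedia"] ?? "0") < 1) return null;
  return {
    type: "photo_received",
    callId,
    photoUrl: params["MediaUrl0"], // in production, copied to S3 first
    receivedAt: new Date().toISOString(),
  };
}
```

In production the handler would also copy the media to tenant storage before emitting the event, so the Twilio-hosted URL never leaves the backend.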
Tool Schema
export const visionAnalyzeTool = {
type: 'function' as const,
name: 'vision_analyze',
description:
'Analyze a photo the buyer uploaded during this call. Returns structured ' +
'features that can be passed to search_listings. Only call after photo_available.',
parameters: {
type: 'object',
properties: {
photo_id: {
type: 'string',
description: 'ID from the photo_available event in conversation context',
},
analysis_focus: {
type: 'string',
enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
description: 'Hint to the vision model on what features matter most',
},
},
required: ['photo_id', 'analysis_focus'],
},
};
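On the runtime side, a tool call emitted by the model has to be routed to the analyzer. A minimal dispatch sketch, where only the tool name and parameter names come from the schema above; the dispatch shape and the injected analyzePhoto function are hypothetical:

```typescript
// Routes a model tool-call to the vision analyzer. analyzePhoto is an
// injected stand-in for the real GPT-4o call.
type VisionArgs = {
  photo_id: string;
  analysis_focus: "kitchen" | "bathroom" | "exterior" | "living_space" | "general";
};

async function dispatchToolCall(
  name: string,
  args: unknown,
  analyzePhoto: (a: VisionArgs) => Promise<object>,
): Promise<object> {
  if (name !== "vision_analyze") throw new Error(`Unknown tool: ${name}`);
  const a = args as VisionArgs;
  // Both fields are marked required in the schema; reject partial calls.
  if (!a.photo_id || !a.analysis_focus) throw new Error("Missing required args");
  return analyzePhoto(a);
}
```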
The Vision Prompt
The prompt the agent ships to GPT-4o is intentionally narrow:
You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:
cabinet_color: string | null
countertop_material: string | null
flooring_material: string | null
layout_type: string | null // e.g., "galley", "open", "u-shape"
lighting_style: string | null // e.g., "pendant", "recessed", "natural"
estimated_sqft: number | null // null if not estimable
notable_features: string[] // max 5
Do not return prose. Do not add keys. Use null for unknown.
The strict-JSON contract is enforced via OpenAI's structured output. A failure here returns null fields, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material — can you tell me?").
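A sketch of what that contract looks like as a structured-output request body, plus the null-field fallback. The JSON Schema mirrors the prompt's key list; buildVisionRequest and followUpForNulls are illustrative helpers, not CallSphere's actual code:

```typescript
// JSON Schema matching the prompt contract; strict mode forces the model
// to return exactly these keys.
const featureSchema = {
  name: "property_features",
  strict: true,
  schema: {
    type: "object",
    additionalProperties: false,
    properties: {
      cabinet_color: { type: ["string", "null"] },
      countertop_material: { type: ["string", "null"] },
      flooring_material: { type: ["string", "null"] },
      layout_type: { type: ["string", "null"] },
      lighting_style: { type: ["string", "null"] },
      estimated_sqft: { type: ["number", "null"] },
      notable_features: { type: "array", items: { type: "string" } },
    },
    required: [
      "cabinet_color", "countertop_material", "flooring_material",
      "layout_type", "lighting_style", "estimated_sqft", "notable_features",
    ],
  },
};

// Request body for the chat completions endpoint with an image part.
function buildVisionRequest(imageUrl: string, focus: string) {
  return {
    model: "gpt-4o",
    response_format: { type: "json_schema", json_schema: featureSchema },
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: `Focus area: ${focus}. Return strict JSON per the schema.` },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  };
}

// Null fields become a follow-up question instead of a guess.
function followUpForNulls(features: Record<string, unknown>): string | null {
  const missing = Object.entries(features)
    .filter(([, v]) => v === null)
    .map(([k]) => k.replace(/_/g, " "));
  if (missing.length === 0) return null;
  return `I could see the photo but couldn't make out the ${missing.join(" or ")}, can you tell me?`;
}
```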
Returning Visual Features to Search
The structured features become filters:
features = await vision_analyze(photo_id, focus="kitchen")
matches = await search_listings(
city=ctx.user_filters.city,
beds=ctx.user_filters.beds,
feature_filters={
"kitchen.cabinet_color": features.cabinet_color,
"kitchen.countertop": features.countertop_material,
},
sort_by="visual_similarity",
)
The visual_similarity sort ranks listings by embedding distance to the buyer's photo using a CLIP-style listing image embedding stored on each property record.
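The ranking step reduces to cosine similarity over embedding vectors. A self-contained sketch, where the short vectors stand in for CLIP-style embeddings and the Listing shape is illustrative:

```typescript
// Ranks listings by cosine similarity between the buyer photo's
// embedding and each listing's precomputed image embedding.
interface Listing {
  id: string;
  imageEmbedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rankByVisualSimilarity(query: number[], listings: Listing[]): Listing[] {
  // Sort descending: most visually similar listing first.
  return [...listings].sort(
    (x, y) =>
      cosineSimilarity(query, y.imageEmbedding) -
      cosineSimilarity(query, x.imageEmbedding),
  );
}
```

At scale this sort would typically be pushed into a vector index rather than computed in application code, but the ranking criterion is the same.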
Vapi vs CallSphere Vision Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |
Vision-Enriched Search Flow
sequenceDiagram
participant Buyer
participant Twilio
participant Agent as Property Search Agent
participant Vision as GPT-4o Vision
participant DB as Listings DB
Buyer->>Agent: "I want a kitchen like this"
Agent->>Buyer: "Text the photo to (415) 555-0123"
Buyer->>Twilio: MMS with photo
Twilio->>Agent: photo_received event
Agent->>Agent: photo_available signal in context
Agent->>Vision: vision_analyze(photo_id, focus=kitchen)
Vision-->>Agent: { cabinet_color: "white", countertop: "marble", ... }
Agent->>DB: search_listings(city, beds, feature_filters)
DB-->>Agent: 4 matches sorted by visual_similarity
Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
Buyer->>Agent: "Tell me about the second one"
Agent->>DB: get_listing_details(id)
Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."
Other Vertical Use Cases
The vision primitive in CallSphere generalizes:
- Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
- Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
- Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA
Each is a thin variant of the Real Estate pattern.
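Concretely, the thin variant is mostly a different extraction key set per vertical. A sketch of that mapping, with illustrative keys rather than CallSphere's real configuration:

```typescript
// Per-vertical extraction keys; the prompt template stays the same.
const extractionKeysByVertical: Record<string, string[]> = {
  real_estate: ["cabinet_color", "countertop_material", "layout_type"],
  insurance: ["damage_type", "severity", "affected_area"],
  healthcare: ["lesion_size", "color", "urgency_signals"],
  field_service: ["part_label", "visible_damage", "model_markings"],
};

function promptFor(vertical: string): string {
  const keys = extractionKeysByVertical[vertical] ?? [];
  return `Return strict JSON with these keys ONLY: ${keys.join(", ")}. Use null for unknown.`;
}
```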
Privacy and Security
- Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
- Default retention 30 days, configurable
- Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
- The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output
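A 30-day default retention of this kind could be expressed as a standard S3 lifecycle rule; the bucket prefix here is illustrative:

```json
{
  "Rules": [
    {
      "ID": "photo-retention-30d",
      "Filter": { "Prefix": "tenant-photos/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```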
FAQ
Does vision_analyze block the conversation?
No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
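The non-blocking pattern is: speak the filler line immediately, then await the vision result. A minimal sketch, where speak is a stand-in for the real audio pipeline:

```typescript
// Speaks a filler line synchronously, then resolves with the vision
// result once it arrives; the filler covers the ~600-900ms gap.
async function answerWithFiller<T>(
  visionCall: Promise<T>,
  speak: (line: string) => void,
): Promise<T> {
  speak("Let me look at that photo.");
  return visionCall;
}
```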
What if the buyer sends a non-property photo (selfie, etc.)?
The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"
Can vision be used on the LLM's own outputs?
Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.
Is multi-image conversation supported?
Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").
Is this MMS-only, or can it work over WhatsApp?
WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.
See the Vision Demo
The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.