Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi
By Sagar Shankaran, Founder of CallSphere
How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice.
Key takeaways
TL;DR
Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.
This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.
Why Voice + Vision Together
Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:
- A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
- A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
- An insurance claimant photographs damage on a roadside call
In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.
Vapi's Vision Story
Vapi as of 2026-04 has:
- No native multimodal input
- No image upload primitive
- No vision tool
- Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context
The workaround works for "describe an image and tell the agent" but loses two things:
- Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
- Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever
CallSphere Vision Approach
CallSphere's Real Estate Property Search specialist accepts photos via:
- MMS through Twilio during the call ("text us the photo at this number")
- Web link entered into a portal ("upload at callsphere.example/upload?call=...")
- Returning user photo history pulled from Postgres on caller-ID match
The flow:
- User says "I want a kitchen like the one I just texted you"
- Twilio MMS webhook stores the image in S3, emits a photo_received event tagged with the active call ID
- The agent sees a photo_available signal in its context and calls vision_analyze
- vision_analyze invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, layout type, ceiling height estimate, lighting style, square footage estimate"
- Returns structured JSON {cabinet_color: "white", countertop: "marble", layout: "galley", ...}
- Agent calls search_listings with the structured features as filters
- Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
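The first two steps of the flow can be sketched as a pure function that turns a Twilio MMS webhook payload into a photo_received event. This is a minimal sketch, not CallSphere's actual handler: From, NumMedia, MediaUrl0, MediaContentType0, and MessageSid are standard Twilio webhook parameters, while the PhotoReceivedEvent shape, the activeCalls lookup keyed by caller number, and the storage key layout are assumptions made for illustration. The S3 upload itself is left out.

```typescript
// Subset of the form fields Twilio posts to an incoming-message webhook.
interface TwilioMmsWebhook {
  From: string;
  MessageSid: string;
  NumMedia: string; // Twilio sends the count as a string
  MediaUrl0?: string;
  MediaContentType0?: string;
}

// Hypothetical event shape; the article names the event but not its fields.
interface PhotoReceivedEvent {
  type: "photo_received";
  callId: string;
  photoId: string;
  sourceUrl: string;
  storageKey: string; // assumed S3 key layout
}

// Returns an event when the message carries an image and the sender has an
// active call; returns null otherwise.
function toPhotoReceivedEvent(
  body: TwilioMmsWebhook,
  activeCalls: Map<string, string>, // caller number -> call ID (assumed lookup)
): PhotoReceivedEvent | null {
  const mediaCount = Number.parseInt(body.NumMedia, 10);
  if (!Number.isFinite(mediaCount) || mediaCount < 1) return null;
  if (!body.MediaUrl0 || !body.MediaContentType0?.startsWith("image/")) return null;

  const callId = activeCalls.get(body.From);
  if (callId === undefined) return null;

  return {
    type: "photo_received",
    callId,
    photoId: body.MessageSid,
    sourceUrl: body.MediaUrl0,
    storageKey: `calls/${callId}/photos/${body.MessageSid}`,
  };
}
```

Tagging the event with the call ID at this point is what lets the agent surface photo_available in the right conversation rather than in whichever call happens to be active.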
Tool Schema
export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
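The schema tells the model what to send, but tool-call arguments still arrive as a JSON string that is worth checking before use. The sketch below shows one way to validate them against the schema above. parseVisionAnalyzeArgs and the ParseResult type are hypothetical helpers, not part of CallSphere's code or any SDK.

```typescript
const ANALYSIS_FOCI = ["kitchen", "bathroom", "exterior", "living_space", "general"] as const;
type AnalysisFocus = (typeof ANALYSIS_FOCI)[number];

interface VisionAnalyzeArgs {
  photo_id: string;
  analysis_focus: AnalysisFocus;
}

type ParseResult =
  | { ok: true; args: VisionAnalyzeArgs }
  | { ok: false; error: string };

// Validates the raw argument string of a vision_analyze tool call.
function parseVisionAnalyzeArgs(rawArguments: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const { photo_id, analysis_focus } = parsed as Record<string, unknown>;
  if (typeof photo_id !== "string" || photo_id.length === 0) {
    return { ok: false, error: "photo_id must be a non-empty string" };
  }
  // Match against the enum so the result carries the narrowed type.
  const focus = ANALYSIS_FOCI.find((f) => f === analysis_focus);
  if (focus === undefined) {
    return { ok: false, error: "analysis_focus is not one of the allowed values" };
  }
  return { ok: true, args: { photo_id, analysis_focus: focus } };
}
```

Returning an error string instead of throwing lets the handler pass the message back to the model as the tool result, so the agent can retry with corrected arguments.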
The Vision Prompt
The prompt the agent ships to GPT-4o is intentionally narrow:
You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:
cabinet_color: string | null
countertop_material: string | null
flooring_material: string | null
layout_type: string | null // e.g., "galley", "open", "u-shape"
lighting_style: string | null // e.g., "pendant", "recessed", "natural"
estimated_sqft: number | null // null if not estimable
notable_features: string[] // max 5
Do not return prose. Do not add keys. Use null for unknown.
The strict-JSON contract is enforced via OpenAI's structured output. A failure here returns null fields, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material — can you tell me?").
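To make the null handling concrete, here is a minimal sketch of normalizing the model's JSON into typed features and finding which requested fields came back empty. The field names come from the prompt above; normalizeFeatures and missingFields are hypothetical helpers, and the sketch does not call OpenAI.

```typescript
interface PropertyFeatures {
  cabinet_color: string | null;
  countertop_material: string | null;
  flooring_material: string | null;
  layout_type: string | null;
  lighting_style: string | null;
  estimated_sqft: number | null;
  notable_features: string[];
}

// Coerces raw model output into PropertyFeatures; anything malformed becomes null.
function normalizeFeatures(rawJson: string): PropertyFeatures {
  let data: Record<string, unknown> = {};
  try {
    const parsed: unknown = JSON.parse(rawJson);
    if (typeof parsed === "object" && parsed !== null) {
      data = parsed as Record<string, unknown>;
    }
  } catch {
    // Unparseable output degrades to all-null features.
  }
  const str = (key: string): string | null => {
    const value = data[key];
    return typeof value === "string" && value.trim() !== "" ? value : null;
  };
  const sqft = data["estimated_sqft"];
  const notable = data["notable_features"];
  return {
    cabinet_color: str("cabinet_color"),
    countertop_material: str("countertop_material"),
    flooring_material: str("flooring_material"),
    layout_type: str("layout_type"),
    lighting_style: str("lighting_style"),
    estimated_sqft: typeof sqft === "number" && Number.isFinite(sqft) && sqft > 0 ? sqft : null,
    notable_features: Array.isArray(notable)
      ? notable.filter((x): x is string => typeof x === "string").slice(0, 5)
      : [],
  };
}

// Lists the requested fields the vision pass could not fill in.
function missingFields(
  features: PropertyFeatures,
  wanted: (keyof PropertyFeatures)[],
): (keyof PropertyFeatures)[] {
  return wanted.filter((key) => features[key] === null);
}
```

The agent can then ask about only the fields in the missingFields result, which is what produces a targeted question about the countertop rather than a generic "please describe the photo."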
Returning Visual Features to Search
The structured features become filters:
features = await vision_analyze(photo_id, focus="kitchen")
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    sort_by="visual_similarity",
)
The visual_similarity sort ranks listings by embedding distance to the buyer's photo using a CLIP-style listing image embedding stored on each property record.
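As a rough illustration of ranking by embedding distance, the sketch below orders listings by cosine similarity to the photo's embedding. The article does not say which distance metric CallSphere uses, so cosine similarity is an assumption here, as are the ListingEmbedding shape and the function names. The embeddings are taken as precomputed inputs.

```typescript
// Hypothetical record: one stored image embedding per listing.
interface ListingEmbedding {
  listingId: string;
  embedding: number[];
}

// Cosine similarity in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Returns listing IDs ordered from most to least similar to the query photo.
function rankByVisualSimilarity(query: number[], listings: ListingEmbedding[]): string[] {
  return listings
    .map((l) => ({ id: l.listingId, score: cosineSimilarity(query, l.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((r) => r.id);
}
```

At production scale this ranking would more likely run inside a vector index than in application code; the loop above only shows the ordering logic.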
Vapi vs CallSphere Vision Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |
Vision-Enriched Search Flow
sequenceDiagram
participant Buyer
participant Twilio
participant Agent as Property Search Agent
participant Vision as GPT-4o Vision
participant DB as Listings DB
Buyer->>Agent: "I want a kitchen like this"
Agent->>Buyer: "Text the photo to (415) 555-0123"
Buyer->>Twilio: MMS with photo
Twilio->>Agent: photo_received event
Agent->>Agent: photo_available signal in context
Agent->>Vision: vision_analyze(photo_id, focus=kitchen)
Vision-->>Agent: { cabinet_color: "white", countertop: "marble", ... }
Agent->>DB: search_listings(city, beds, feature_filters)
DB-->>Agent: 4 matches sorted by visual_similarity
Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
Buyer->>Agent: "Tell me about the second one"
Agent->>DB: get_listing_details(id)
Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."
Other Vertical Use Cases
The vision primitive in CallSphere generalizes:
- Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
- Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
- Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA
Each is a thin variant of the Real Estate pattern.
Privacy and Security
- Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
- Default retention 30 days, configurable
- Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
- The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output
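The retention rule can be expressed as a small check. This is a sketch under stated assumptions: the 30-day default comes from the list above, but isPastRetention is a hypothetical helper, and in practice expiry of stored objects is often delegated to storage lifecycle rules rather than application code.

```typescript
const DEFAULT_RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a photo is older than the tenant's retention window.
function isPastRetention(
  uploadedAt: Date,
  now: Date,
  retentionDays: number = DEFAULT_RETENTION_DAYS,
): boolean {
  const ageMs = now.getTime() - uploadedAt.getTime();
  return ageMs > retentionDays * MS_PER_DAY;
}
```

A deployment with a shorter window, such as the healthcare variant, would pass its own retentionDays value.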
FAQ
Does vision_analyze block the conversation?
No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
What if the buyer sends a non-property photo (selfie, etc.)?
The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"
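One simple way to turn "mostly nulls" into a decision is to count how many fields carry information and compare against a threshold. The sketch below assumes the feature keys from the vision prompt; the threshold of two fields and both function names are illustrative choices, not CallSphere's documented behavior.

```typescript
type ExtractedFeatures = Record<string, string | number | string[] | null>;

// Counts fields that are non-null (or, for arrays, non-empty).
function countInformativeFields(features: ExtractedFeatures): number {
  return Object.keys(features).filter((key) => {
    const value = features[key];
    return Array.isArray(value) ? value.length > 0 : value !== null;
  }).length;
}

// Treats the photo as a property photo only if enough fields were extracted.
function looksLikePropertyPhoto(features: ExtractedFeatures, minFields = 2): boolean {
  return countInformativeFields(features) >= minFields;
}
```

When looksLikePropertyPhoto returns false, the agent skips the listing search and asks the buyer to check the photo instead.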
Can vision be used on the LLM's own outputs?
Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.
Is multi-image conversation supported?
Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").
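A per-call photo set can be as small as an ordered list plus a feature diff. The sketch below is an assumption about how such tracking might look, not CallSphere's implementation; CallPhotoSet, AnalyzedPhoto, and differingFeatures are hypothetical names.

```typescript
interface AnalyzedPhoto {
  photoId: string;
  features: Record<string, string | null>;
}

// Keeps photos in arrival order so "the first one" and "this one" resolve.
class CallPhotoSet {
  private photos: AnalyzedPhoto[] = [];

  add(photo: AnalyzedPhoto): void {
    this.photos.push(photo);
  }

  // 1-based lookup, matching how a caller would say "the first photo".
  byOrdinal(n: number): AnalyzedPhoto | undefined {
    return this.photos[n - 1];
  }

  latest(): AnalyzedPhoto | undefined {
    return this.photos[this.photos.length - 1];
  }
}

// Names of features whose values differ between two photos.
function differingFeatures(a: AnalyzedPhoto, b: AnalyzedPhoto): string[] {
  const keys = Object.keys(a.features).concat(
    Object.keys(b.features).filter((k) => !(k in a.features)),
  );
  return keys.filter((k) => (a.features[k] ?? null) !== (b.features[k] ?? null));
}
```

With this in place, "this kitchen vs the one you sent first" becomes a call to differingFeatures with byOrdinal(1) and latest().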
Is this MMS-only, or can it work over WhatsApp?
WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.
See the Vision Demo
The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.