---
title: "Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi"
description: "How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice."
canonical: https://callsphere.ai/blog/vision-capable-voice-agents-property-photos-callsphere-vs-vapi
category: "Technical Guides"
tags: ["Vision AI", "Voice AI", "CallSphere", "Vapi", "Real Estate", "GPT-4o", "Multimodal"]
author: "CallSphere Team"
published: 2026-04-20T00:00:00.000Z
updated: 2026-05-06T17:06:02.814Z
---

# Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi

> How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice.

## TL;DR

**Vapi is voice-only** — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. **CallSphere** ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.

This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.

## Why Voice + Vision Together

Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:

- A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
- A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
- An insurance claimant photographs damage on a roadside call

In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.

## Vapi's Vision Story

As of April 2026, Vapi has:

- No native multimodal input
- No image upload primitive
- No vision tool
- Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context

The workaround (sketched below) covers "describe an image and tell the agent," but it costs you two things:

1. **Latency** — the round trip to your vision service plus the agent's next turn adds an extra 1-2s
2. **Grounding** — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone for good
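
For concreteness, here is roughly what that workaround looks like in TypeScript. The `/image-uploaded` route and the `injectContextIntoVapiCall` helper are hypothetical stand-ins for your own backend and for whatever mechanism you use to push text into a live Vapi session; the OpenAI vision call is the only real API here.

```typescript
// Sketch of the voice-only workaround: vision runs outside Vapi, and only a
// text description ever reaches the agent.
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());
const openai = new OpenAI();

// Hypothetical: however you inject text into a live Vapi session
// (e.g., a system message via your own orchestration layer).
declare function injectContextIntoVapiCall(callId: string, text: string): Promise<void>;

app.post('/image-uploaded', async (req, res) => {
  const { callId, imageUrl } = req.body as { callId: string; imageUrl: string };

  // External vision pass: the image is reduced to prose before the agent sees it.
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this property photo in two sentences.' },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
  });

  const description = completion.choices[0].message.content ?? '';
  // Anything the description omits is lost: the agent never sees the pixels.
  await injectContextIntoVapiCall(callId, `Caller uploaded a photo: ${description}`);
  res.sendStatus(204);
});
```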

## CallSphere Vision Approach

CallSphere's Real Estate Property Search specialist accepts photos via:

- **MMS** through Twilio during the call ("text us the photo at this number")
- **Web link** entered into a portal ("upload at callsphere.example/upload?call=...")
- **Returning user** photo history pulled from Postgres on caller-ID match

The flow:

1. User says "I want a kitchen like the one I just texted you"
2. Twilio MMS webhook stores the image in S3 and emits a `photo_received` event tagged with the active call ID (sketched after this list)
3. The agent sees a `photo_available` signal in its context and calls `vision_analyze`
4. `vision_analyze` invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, layout type, ceiling height estimate, lighting style, square footage estimate"
5. Returns structured JSON `{cabinet_color: "white", countertop: "marble", layout: "galley", ...}`
6. Agent calls `search_listings` with the structured features as filters
7. Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
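
Before the tool schema, here is a minimal sketch of step 2, assuming an Express app behind the Twilio webhook and the AWS SDK v3. The `findActiveCallByCaller` and `emitEvent` helpers are hypothetical: one maps the sender's number to the in-progress call, the other pushes the event onto whatever bus feeds the agent's context.

```typescript
import express from 'express';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

const s3 = new S3Client({});

// Hypothetical helpers for call-ID correlation and the internal event bus.
declare function findActiveCallByCaller(phone: string): Promise<string | null>;
declare function emitEvent(name: string, payload: Record<string, unknown>): Promise<void>;

app.post('/twilio/mms', async (req, res) => {
  const { From, MessageSid, NumMedia, MediaUrl0, MediaContentType0 } = req.body;
  const callId = await findActiveCallByCaller(From);

  if (callId && Number(NumMedia) > 0) {
    // Pull the image bytes from Twilio's media URL.
    const media = await fetch(MediaUrl0);
    const bytes = Buffer.from(await media.arrayBuffer());

    const photoId = `photo-${MessageSid}`;
    await s3.send(
      new PutObjectCommand({
        Bucket: process.env.PHOTOS_BUCKET, // tenant-isolated photo bucket
        Key: `${callId}/${photoId}.jpg`,
        Body: bytes,
        ContentType: MediaContentType0,
      }),
    );

    // Becomes the photo_available signal the agent sees in its context.
    await emitEvent('photo_received', { call_id: callId, photo_id: photoId });
  }

  // Empty TwiML so Twilio doesn't send an auto-reply.
  res.type('text/xml').send('<Response/>');
});
```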

### Tool Schema

```typescript
export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
```

### The Vision Prompt

The prompt the agent ships to GPT-4o is intentionally narrow:

```
You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:

  cabinet_color: string | null
  countertop_material: string | null
  flooring_material: string | null
  layout_type: string | null  // e.g., "galley", "open", "u-shape"
  lighting_style: string | null  // e.g., "pendant", "recessed", "natural"
  estimated_sqft: number | null  // null if not estimable
  notable_features: string[]  // max 5

Do not return prose. Do not add keys. Use null for unknown.
```

The strict-JSON contract is enforced via OpenAI's structured outputs. A failed extraction returns null fields, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material — can you tell me?").
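
Here is a sketch of what enforcing that contract can look like with the OpenAI Node SDK; the `visionAnalyze` wrapper and its plumbing are illustrative, not CallSphere internals. Strict mode requires every key to appear in `required` with `additionalProperties: false`, which is what backs the no-extra-keys rule; the max-5 cap on `notable_features` stays in the prompt.

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Mirrors the prompt contract above; nullable fields use the
// ["string", "null"] unions that strict mode supports.
const featureSchema = {
  type: 'object',
  properties: {
    cabinet_color: { type: ['string', 'null'] },
    countertop_material: { type: ['string', 'null'] },
    flooring_material: { type: ['string', 'null'] },
    layout_type: { type: ['string', 'null'] },
    lighting_style: { type: ['string', 'null'] },
    estimated_sqft: { type: ['number', 'null'] },
    notable_features: { type: 'array', items: { type: 'string' } },
  },
  required: [
    'cabinet_color', 'countertop_material', 'flooring_material',
    'layout_type', 'lighting_style', 'estimated_sqft', 'notable_features',
  ],
  additionalProperties: false,
};

export async function visionAnalyze(imageUrl: string, analysisFocus: string) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: `You are a property feature extractor. Focus area: ${analysisFocus}.` },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'property_features', strict: true, schema: featureSchema },
    },
  });
  return JSON.parse(completion.choices[0].message.content ?? '{}');
}
```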

### Returning Visual Features to Search

The structured features become filters:

```python
features = await vision_analyze(photo_id, analysis_focus="kitchen")  # matches the tool schema above
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    sort_by="visual_similarity",
)
```

The `visual_similarity` sort ranks listings by embedding distance between the buyer's photo and a CLIP-style listing image embedding stored on each property record.
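
One plausible shape for that ranking, assuming the listing embeddings live in Postgres with the pgvector extension (the platform already keeps photo history in Postgres); table and column names here are illustrative.

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Hypothetical visual_similarity ranking: listings.image_embedding is assumed
// to hold a CLIP-style vector per listing photo.
export async function rankByVisualSimilarity(
  buyerPhotoEmbedding: number[], // CLIP-style embedding of the uploaded photo
  city: string,
  beds: number,
  limit = 10,
) {
  // pgvector's <=> operator is cosine distance; smaller means more similar.
  const { rows } = await pool.query(
    `SELECT id, address, image_embedding <=> $1::vector AS distance
       FROM listings
      WHERE city = $2 AND beds >= $3
      ORDER BY distance
      LIMIT $4`,
    [`[${buyerPhotoEmbedding.join(',')}]`, city, beds, limit],
  );
  return rows;
}
```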

## Vapi vs CallSphere Vision Comparison

| Dimension | Vapi | CallSphere |
| --- | --- | --- |
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |

## Vision-Enriched Search Flow

```mermaid
sequenceDiagram
    participant Buyer
    participant Twilio
    participant Agent as Property Search Agent
    participant Vision as GPT-4o Vision
    participant DB as Listings DB

    Buyer->>Agent: "I want a kitchen like this"
    Agent->>Buyer: "Text the photo to (415) 555-0123"
    Buyer->>Twilio: MMS with photo
    Twilio->>Agent: photo_received event
    Agent->>Agent: photo_available signal in context
    Agent->>Vision: vision_analyze(photo_id, analysis_focus=kitchen)
    Vision-->>Agent: { cabinet_color: "white", countertop: "marble", ... }
    Agent->>DB: search_listings(city, beds, feature_filters)
    DB-->>Agent: 4 matches sorted by visual_similarity
    Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
    Buyer->>Agent: "Tell me about the second one"
    Agent->>DB: get_listing_details(id)
    Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."
```

## Other Vertical Use Cases

The vision primitive in CallSphere generalizes:

- **Insurance** — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
- **Healthcare** — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
- **Field service** — technician texts photo of broken part, dispatch agent identifies SKU and ETA

Each is a thin variant of the Real Estate pattern.

## Privacy and Security

- Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
- Default retention 30 days, configurable (an S3 lifecycle rule sketch follows this list)
- Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
- The agent narrates only the image details needed to answer; the photo's full contents are never read back into the audio output
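
As one concrete way to implement the retention bullet, a lifecycle rule on each tenant bucket handles expiry automatically. The helper below is a sketch using the AWS SDK v3, not CallSphere's actual provisioning code.

```typescript
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// Sketch: express the 30-day default retention as an S3 lifecycle rule.
export async function setPhotoRetention(tenantBucket: string, days = 30) {
  await s3.send(
    new PutBucketLifecycleConfigurationCommand({
      Bucket: tenantBucket, // one bucket per tenant
      LifecycleConfiguration: {
        Rules: [
          {
            ID: `expire-call-photos-${days}d`,
            Status: 'Enabled',
            Filter: { Prefix: '' }, // apply to every photo in the bucket
            Expiration: { Days: days }, // objects deleted after `days`
          },
        ],
      },
    }),
  );
}
```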

## FAQ

### Does vision_analyze block the conversation?

No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
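
The non-blocking pattern is simple enough to sketch; `agent.say` and `visionAnalyze` below are hypothetical stand-ins for the platform's TTS output and the vision tool.

```typescript
declare const agent: { say(text: string): Promise<void> };
declare function visionAnalyze(photoId: string, analysisFocus: string): Promise<unknown>;

export async function analyzeWithoutBlocking(photoId: string) {
  // Start the vision call first so it overlaps with the filler audio.
  const visionPromise = visionAnalyze(photoId, 'kitchen');

  // The filler line covers most of the 600-900ms inline vision latency.
  await agent.say('Let me take a look at that photo.');

  // Usually already resolved by the time the filler finishes playing.
  return visionPromise;
}
```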

### What if the buyer sends a non-property photo (selfie, etc.)?

The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"

### Can vision be used on the LLM's own outputs?

Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.

### Is multi-image conversation supported?

Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").
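
A plausible shape for that per-call state (names are illustrative, not CallSphere internals):

```typescript
// Enough per-call state for the agent to reference photos by order or id
// ("the one you sent first") and reuse cached vision results when comparing.
interface PhotoRecord {
  photoId: string;
  receivedAt: Date;
  analysisFocus?: string;
  features?: Record<string, unknown>; // cached vision_analyze output
}

interface CallPhotoSet {
  callId: string;
  photos: PhotoRecord[]; // ordered by receipt, so "first" is photos[0]
}

// Resolve references like "the second one" against the ordered set.
function resolvePhotoReference(set: CallPhotoSet, ordinal: number): PhotoRecord | undefined {
  return set.photos[ordinal - 1];
}
```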

### Is this MMS-only, or can it work over WhatsApp?

WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.

## See the Vision Demo

The [/industries/real-estate](/industries/real-estate) page has a working video of the kitchen-photo flow, and [/demo](/demo) lets you trigger it live.

