Voicemail Detection Accuracy: CallSphere vs Vapi (with Examples)
Voicemail detection accuracy makes or breaks outbound voice AI. CallSphere VoicemailAnalyzerAgent + Twilio AMD vs Vapi defaults. Real call examples included.
TL;DR
Voicemail detection (AMD, Answering Machine Detection) is the single biggest predictor of outbound campaign quality. False negatives (treating voicemail as a human) burn message budget and look spammy; false positives (treating humans as voicemail) make you look broken. Vapi uses provider-default AMD with limited customization. CallSphere uses a three-stage cascade: Twilio AMD signals first, an audio-fingerprint check second, then a VoicemailAnalyzerAgent built on gpt-4o-mini that listens to the first 4 seconds and confirms voicemail vs human with structured reasoning.
In production traffic on After-Hours dispatch campaigns, the cascade lands at ~96% accuracy vs ~83% for AMD-only.
Why Voicemail Detection Is Hard
The naive heuristic — "wait for the beep" — fails because:
- People answer with long greetings ("Hello? Hi, this is John, who is this?")
- Voicemail systems have variable pre-beep delays (1.5s to 8s)
- Some voicemails skip the beep entirely
- Mobile carriers compress audio differently
- Background noise on human-answered calls can mimic voicemail tone shifts
A single signal source is never enough. Production systems cascade.
Vapi Voicemail Detection Approach
Vapi exposes a config block:
```json
{
  "voicemailDetection": {
    "provider": "twilio",
    "enabled": true,
    "machineDetectionTimeout": 30,
    "machineDetectionSpeechThreshold": 2400,
    "machineDetectionSpeechEndThreshold": 1200,
    "machineDetectionSilenceTimeout": 5000
  }
}
```
This delegates to Twilio's AMD plus Vapi's own assistant-side hint detection. The thresholds are exposed but the assistant logic is opaque.
Strengths: sane defaults work for most simple use cases.
Weaknesses:
- No second-pass LLM verification
- No way to inject domain knowledge ("this customer's voicemail says X")
- Hard to debug a false-positive
- Action on detection is binary (leave message / hang up)
CallSphere Voicemail Detection Approach
CallSphere uses a three-stage cascade:
- Twilio AMD runs in parallel with the call connect, returning `AnsweredBy` within ~2-3s
- Audio fingerprint — the first 1.5s of audio is matched against known voicemail intro patterns (regional carrier specifics)
- VoicemailAnalyzerAgent — a `gpt-4o-mini` agent listens to the first 4 seconds of transcript + audio features and returns `{is_voicemail: bool, confidence: float, reasoning: string}`
The decision is a weighted vote.
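A minimal sketch of what a weighted vote over the three cascade signals could look like. The weights and the `weighted_vote` helper below are illustrative assumptions, not CallSphere's production tuning:

```python
# Hypothetical weights for the three cascade signals; real values would be
# tuned against labeled call outcomes.
SIGNAL_WEIGHTS = {"twilio_amd": 0.35, "fingerprint": 0.25, "llm": 0.40}

def weighted_vote(votes: dict) -> tuple:
    """Each signal votes in [0, 1], where 1.0 means 'certainly voicemail'.

    Missing signals (e.g. the LLM pass was skipped) simply drop out of the
    weighted average. Returns (verdict, combined score).
    """
    total = sum(SIGNAL_WEIGHTS[name] for name in votes)
    score = sum(SIGNAL_WEIGHTS[name] * v for name, v in votes.items()) / total
    return ("VOICEMAIL" if score >= 0.5 else "HUMAN"), round(score, 3)
```

For example, `weighted_vote({"twilio_amd": 1.0, "fingerprint": 0.6})` still leans voicemail even when the LLM pass never ran.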
Twilio AMD Configuration
```python
from twilio.rest import Client

client = Client(account_sid, auth_token)
client.calls.create(
    to=lead.phone,
    from_=campaign.caller_id,
    url=callback_url,
    machine_detection="DetectMessageEnd",  # waits for greeting end
    async_amd=True,                        # don't block call connect
    async_amd_status_callback=amd_callback_url,
    machine_detection_timeout=30,
    machine_detection_speech_threshold=2400,
    machine_detection_speech_end_threshold=1200,
    machine_detection_silence_timeout=5000,
)
```
`DetectMessageEnd` waits for the voicemail greeting to finish — important if you want to leave a message after the beep.
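With `async_amd=True`, Twilio POSTs the result to `async_amd_status_callback` rather than delaying the connect. Here is a sketch of collapsing that callback into a cascade-level signal; the handler and the `"ambiguous"` label are assumptions for illustration, not CallSphere's actual code:

```python
# Twilio's AMD status callback is form-encoded and includes an AnsweredBy
# field with one of: human, machine_start, machine_end_beep,
# machine_end_silence, machine_end_other, fax, unknown.
MACHINE_VALUES = {
    "machine_start",
    "machine_end_beep",
    "machine_end_silence",
    "machine_end_other",
}

def amd_signal(form: dict) -> str:
    """Collapse Twilio's AnsweredBy into a three-way cascade signal."""
    answered_by = form.get("AnsweredBy", "unknown")
    if answered_by == "human":
        return "human"
    if answered_by in MACHINE_VALUES:
        return "machine_start"
    return "ambiguous"  # fax/unknown/missing: let the later stages decide
```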
VoicemailAnalyzerAgent
The second-pass agent is intentionally cheap (gpt-4o-mini) and structured:
```python
voicemail_analyzer = Agent(
    name="VoicemailAnalyzerAgent",
    model="gpt-4o-mini",
    instructions="""You analyze the first 4 seconds of an outbound call.
Return strict JSON.

Voicemail signals:
- "You've reached the voicemail of..."
- "I'm not available right now..."
- "Please leave a message after the tone"
- Long uninterrupted single voice >3s
- "Please record your message"

Human signals:
- Question response: "Hello?" "Who is this?"
- Short utterance under 2s with rising intonation
- Background noise + brief greeting
- Conversational hesitation: "Uh, hi?"

Return: {"is_voicemail": bool, "confidence": 0.0-1.0, "reasoning": "..."}
""",
    output_type=VoicemailVerdict,
)
```
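The `output_type=VoicemailVerdict` above implies a structured schema. A plausible stdlib-only sketch of that shape, mirroring the JSON contract in the prompt (the real model may well be a Pydantic class):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoicemailVerdict:
    is_voicemail: bool
    confidence: float  # 0.0-1.0, matching the prompt's output contract
    reasoning: str

    def __post_init__(self):
        # Reject out-of-range confidence early rather than downstream.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```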
Cascade Logic
```python
async def detect_voicemail(call: OutboundCall) -> Verdict:
    twilio_signal = await call.amd_signal_within(2.5)
    audio_fingerprint = await call.audio_fingerprint_first_1500ms()
    if twilio_signal == "machine_start" and audio_fingerprint.match_voicemail:
        return Verdict.VOICEMAIL  # high confidence, skip LLM
    if twilio_signal == "human" and audio_fingerprint.match_human:
        return Verdict.HUMAN  # high confidence, skip LLM

    # Ambiguous — escalate to LLM
    transcript = await call.transcript_first_4s()
    audio_features = await call.audio_features_first_4s()
    verdict = await voicemail_analyzer.run({
        "transcript": transcript,
        "audio_features": audio_features.dict(),
        "twilio_amd": twilio_signal,
    })
    if verdict.confidence < 0.65:
        return Verdict.UNCERTAIN  # treat as human, log for review
    return Verdict.VOICEMAIL if verdict.is_voicemail else Verdict.HUMAN
```
The cascade only invokes the LLM (~$0.0002/call) when Twilio + fingerprint disagree, which is roughly 12% of calls. Net cost overhead is negligible.
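Plugging in the figures above, the expected LLM overhead per dialed call works out like this:

```python
LLM_COST_PER_CALL = 0.0002  # gpt-4o-mini second pass, from the text above
AMBIGUOUS_RATE = 0.12       # share of calls where Twilio + fingerprint disagree

expected = AMBIGUOUS_RATE * LLM_COST_PER_CALL
print(f"${expected:.6f} per call")          # → $0.000024 per call
print(f"${expected * 10_000:.2f} per 10k")  # → $0.24 per 10k calls
```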
Real-World Examples
Three calls from a recent campaign (sanitized):
Call A — clear voicemail
- Audio: "You've reached Sarah Williams. I'm not available..."
- Twilio AMD: `machine_start`
- Fingerprint: voicemail match
- LLM: not invoked
- Verdict: VOICEMAIL ✓
Call B — ambiguous
- Audio: "Hello, this is the answering service for Dr. Patel's office, please wait..."
- Twilio AMD: `human` (false positive due to "this is" framing)
- Fingerprint: weak voicemail
- LLM verdict: `{is_voicemail: true, confidence: 0.81, reasoning: "answering service phrasing"}`
- Final: VOICEMAIL ✓ (cascade saved a wasted message)
Call C — long human greeting
- Audio: "Hi! I'm so glad you called. Just one second, let me find a quieter spot..."
- Twilio AMD: `machine_start` (false positive due to length)
- Fingerprint: weak human
- LLM verdict: `{is_voicemail: false, confidence: 0.92, reasoning: "second person address, conversational"}`
- Final: HUMAN ✓ (cascade saved an awkward "leave a message")
Vapi vs CallSphere Voicemail Detection Comparison
| Metric | Vapi | CallSphere |
|---|---|---|
| Detection signals | Twilio AMD + provider hints | Twilio AMD + audio fingerprint + LLM |
| LLM second-pass | No | Yes (gpt-4o-mini) |
| Production accuracy (campaign) | ~83% | ~96% |
| Cost per detection | Bundled | +$0.0002 LLM cost on ambiguous |
| Custom voicemail rules | Limited | Full LLM prompt + fingerprint config |
| Action on detection | Leave message or hang up | Leave message, hang up, replay tomorrow, send SMS |
| Inspectability | Vapi log | Per-call cascade trace + reasoning |
Detection Cascade Diagram
```mermaid
graph TD
    Start[Outbound call connects] --> Twilio[Twilio AMD<br/>2.5s window]
    Start --> FP[Audio fingerprint<br/>1.5s window]
    Twilio --> Agree{Both agree?}
    FP --> Agree
    Agree -->|yes voicemail| VM[Verdict: VOICEMAIL<br/>cost: $0]
    Agree -->|yes human| H[Verdict: HUMAN<br/>cost: $0]
    Agree -->|disagree| LLM[VoicemailAnalyzerAgent<br/>gpt-4o-mini, 4s transcript]
    LLM --> Conf{conf > 0.65?}
    Conf -->|yes voicemail| VM2[Verdict: VOICEMAIL]
    Conf -->|yes human| H2[Verdict: HUMAN]
    Conf -->|no| U[Verdict: UNCERTAIN<br/>treat as human, log]
    VM --> Action{Leave msg?}
    VM2 --> Action
    Action -->|yes| Beep[Wait beep, deliver SMS-ready msg]
    Action -->|no| Hangup[Hang up, retry tomorrow]
    H --> Live[Run human conversation flow]
    H2 --> Live
    U --> Live
```
Practical Tips
- Cascade > single signal. Always.
- Use
DetectMessageEnd, notEnable.Enablereturns too early. - Log the LLM reasoning. When detection disagrees with reality, the reasoning tells you what to fix.
- Per-region tuning. Audio fingerprints differ by carrier and region; ship a per-region config map.
- Recheck weekly. Voicemail patterns drift as carriers update prompts.
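The per-region tuning tip above might look like this in practice. The region names, delay windows, and thresholds below are made-up placeholders, not shipped CallSphere values:

```python
# Hypothetical per-region fingerprint tuning; real values would come from
# measured carrier behavior in each region.
REGION_FINGERPRINTS = {
    "us-east": {"prebeep_delay_ms": (1500, 4000), "match_threshold": 0.70},
    "us-west": {"prebeep_delay_ms": (2000, 6000), "match_threshold": 0.65},
    "default": {"prebeep_delay_ms": (1500, 8000), "match_threshold": 0.60},
}

def fingerprint_config(region: str) -> dict:
    """Fall back to 'default' for regions without a tuned profile."""
    return REGION_FINGERPRINTS.get(region, REGION_FINGERPRINTS["default"])
```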
FAQ
Does the LLM second pass slow down the call?
Slightly — about 250-400ms on top of Twilio's 2.5s window. For outbound, this is invisible because the agent isn't speaking yet.
Can I customize the voicemail message left?
Yes — CallSphere After-Hours flows include a per-campaign voicemail script tool, so the message left reflects the call's purpose.
What is the inbound counterpart?
Inbound rarely needs voicemail detection (the user is calling you), but the same cascade detects "you have reached an answering service for X" loops if you transfer.
How often does the LLM disagree with Twilio?
About 12% of ambiguous cases land on the LLM, of which ~30% flip the verdict. Net: ~3.5% of all calls have their verdict corrected by the LLM second pass.
What about regional/non-English voicemail?
The LLM prompt is multilingual; we ship Spanish-language voicemail patterns by default and add per-region configs as needed.
See It Live
The /features page lists per-vertical voicemail handling, and /demo includes an outbound test that triggers the full cascade you can inspect.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.