Voicemail Detection Accuracy: CallSphere vs Vapi (with Examples)
Voicemail detection accuracy makes or breaks outbound voice AI. CallSphere VoicemailAnalyzerAgent + Twilio AMD vs Vapi defaults. Real call examples included.
TL;DR
Voicemail detection (AMD, Answering Machine Detection) is the single biggest predictor of outbound campaign quality. False negatives (treating voicemail as a human) burn message budget and look spammy; false positives (treating humans as voicemail) make you look broken. Vapi uses provider-default AMD with limited customization. CallSphere uses a three-stage cascade: Twilio AMD signals first, an audio-fingerprint check second, then a VoicemailAnalyzerAgent built on gpt-4o-mini that listens to the first 4 seconds and confirms voicemail vs human with structured reasoning.
In production traffic on After-Hours dispatch campaigns, the cascade lands at ~96% accuracy vs ~83% for AMD-only.
Why Voicemail Detection Is Hard
The naive heuristic — "wait for the beep" — fails because:
- People answer with long greetings ("Hello? Hi, this is John, who is this?")
- Voicemail systems have variable pre-beep delays (1.5s to 8s)
- Some voicemails skip the beep entirely
- Mobile carriers compress audio differently
- Background noise on human-answered calls can mimic voicemail tone shifts
A single signal source is never enough. Production systems cascade.
Vapi Voicemail Detection Approach
Vapi exposes a config block:
```json
{
  "voicemailDetection": {
    "provider": "twilio",
    "enabled": true,
    "machineDetectionTimeout": 30,
    "machineDetectionSpeechThreshold": 2400,
    "machineDetectionSpeechEndThreshold": 1200,
    "machineDetectionSilenceTimeout": 5000
  }
}
```
This delegates to Twilio's AMD plus Vapi's own assistant-side hint detection. The thresholds are exposed but the assistant logic is opaque.
Strengths: sane defaults work for most simple use cases.
Weaknesses:
- No second-pass LLM verification
- No way to inject domain knowledge ("this customer's voicemail says X")
- Hard to debug a false-positive
- Action on detection is binary (leave message / hang up)
CallSphere Voicemail Detection Approach
CallSphere uses a three-stage cascade:
- Twilio AMD runs in parallel with the call connect, returning `AnsweredBy` within ~2-3s
- Audio fingerprint — the first 1.5s of audio is matched against known voicemail intro patterns (regional carrier specifics)
- VoicemailAnalyzerAgent — a `gpt-4o-mini` agent listens to the first 4 seconds of transcript + audio features and returns `{is_voicemail: bool, confidence: float, reasoning: string}`
The decision is a weighted vote.
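A minimal sketch of what a weighted vote over the three cascade signals could look like. The weights and the `weighted_vote` helper below are illustrative assumptions, not CallSphere's production tuning:

```python
# Hypothetical weights for the three cascade signals; real values would be
# tuned against labeled call outcomes.
SIGNAL_WEIGHTS = {"twilio_amd": 0.35, "fingerprint": 0.25, "llm": 0.40}

def weighted_vote(votes: dict) -> tuple:
    """Each signal votes in [0, 1], where 1.0 means 'certainly voicemail'.

    Missing signals (e.g. the LLM pass was skipped) simply drop out of the
    weighted average. Returns (verdict, combined score).
    """
    total = sum(SIGNAL_WEIGHTS[name] for name in votes)
    score = sum(SIGNAL_WEIGHTS[name] * v for name, v in votes.items()) / total
    return ("VOICEMAIL" if score >= 0.5 else "HUMAN"), round(score, 3)
```

For example, `weighted_vote({"twilio_amd": 1.0, "fingerprint": 0.6})` still leans voicemail even when the LLM pass never ran.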
Twilio AMD Configuration
```python
from twilio.rest import Client

client = Client(account_sid, auth_token)
client.calls.create(
    to=lead.phone,
    from_=campaign.caller_id,
    url=callback_url,
    machine_detection="DetectMessageEnd",  # waits for greeting end
    async_amd=True,                        # don't block call connect
    async_amd_status_callback=amd_callback_url,
    machine_detection_timeout=30,
    machine_detection_speech_threshold=2400,
    machine_detection_speech_end_threshold=1200,
    machine_detection_silence_timeout=5000,
)
```
`DetectMessageEnd` waits for the voicemail greeting to finish — important if you want to leave a message after the beep.
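With `async_amd=True`, Twilio POSTs the result to `async_amd_status_callback` rather than delaying the connect. Here is a sketch of collapsing that callback into a cascade-level signal; the handler and the `"ambiguous"` label are assumptions for illustration, not CallSphere's actual code:

```python
# Twilio's AMD status callback is form-encoded and includes an AnsweredBy
# field with one of: human, machine_start, machine_end_beep,
# machine_end_silence, machine_end_other, fax, unknown.
MACHINE_VALUES = {
    "machine_start",
    "machine_end_beep",
    "machine_end_silence",
    "machine_end_other",
}

def amd_signal(form: dict) -> str:
    """Collapse Twilio's AnsweredBy into a three-way cascade signal."""
    answered_by = form.get("AnsweredBy", "unknown")
    if answered_by == "human":
        return "human"
    if answered_by in MACHINE_VALUES:
        return "machine_start"
    return "ambiguous"  # fax/unknown/missing: let the later stages decide
```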
VoicemailAnalyzerAgent
The second-pass agent is intentionally cheap (gpt-4o-mini) and structured:
```python
voicemail_analyzer = Agent(
    name="VoicemailAnalyzerAgent",
    model="gpt-4o-mini",
    instructions="""You analyze the first 4 seconds of an outbound call.
Return strict JSON.

Voicemail signals:
- "You've reached the voicemail of..."
- "I'm not available right now..."
- "Please leave a message after the tone"
- Long uninterrupted single voice >3s
- "Please record your message"

Human signals:
- Question response: "Hello?" "Who is this?"
- Short utterance under 2s with rising intonation
- Background noise + brief greeting
- Conversational hesitation: "Uh, hi?"

Return: {"is_voicemail": bool, "confidence": 0.0-1.0, "reasoning": "..."}
""",
    output_type=VoicemailVerdict,
)
```
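The `output_type=VoicemailVerdict` above implies a structured schema. A plausible stdlib-only sketch of that shape, mirroring the JSON contract in the prompt (the real model may well be a Pydantic class):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoicemailVerdict:
    is_voicemail: bool
    confidence: float  # 0.0-1.0, matching the prompt's output contract
    reasoning: str

    def __post_init__(self):
        # Reject out-of-range confidence early rather than downstream.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```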
Cascade Logic
```python
async def detect_voicemail(call: OutboundCall) -> Verdict:
    twilio_signal = await call.amd_signal_within(2.5)
    audio_fingerprint = await call.audio_fingerprint_first_1500ms()
    if twilio_signal == "machine_start" and audio_fingerprint.match_voicemail:
        return Verdict.VOICEMAIL  # high confidence, skip LLM
    if twilio_signal == "human" and audio_fingerprint.match_human:
        return Verdict.HUMAN  # high confidence, skip LLM

    # Ambiguous — escalate to LLM
    transcript = await call.transcript_first_4s()
    audio_features = await call.audio_features_first_4s()
    verdict = await voicemail_analyzer.run({
        "transcript": transcript,
        "audio_features": audio_features.dict(),
        "twilio_amd": twilio_signal,
    })
    if verdict.confidence < 0.65:
        return Verdict.UNCERTAIN  # treat as human, log for review
    return Verdict.VOICEMAIL if verdict.is_voicemail else Verdict.HUMAN
```
The cascade only invokes the LLM (~$0.0002/call) when Twilio + fingerprint disagree, which is roughly 12% of calls. Net cost overhead is negligible.
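Plugging in the figures above, the expected LLM overhead per dialed call works out like this:

```python
LLM_COST_PER_CALL = 0.0002  # gpt-4o-mini second pass, from the text above
AMBIGUOUS_RATE = 0.12       # share of calls where Twilio + fingerprint disagree

expected = AMBIGUOUS_RATE * LLM_COST_PER_CALL
print(f"${expected:.6f} per call")          # → $0.000024 per call
print(f"${expected * 10_000:.2f} per 10k")  # → $0.24 per 10k calls
```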
Real-World Examples
Three calls from a recent campaign (sanitized):
Call A — clear voicemail
- Audio: "You've reached Sarah Williams. I'm not available..."
- Twilio AMD: `machine_start`
- Fingerprint: voicemail match
- LLM: not invoked
- Verdict: VOICEMAIL ✓
Call B — ambiguous
- Audio: "Hello, this is the answering service for Dr. Patel's office, please wait..."
- Twilio AMD: `human` (false positive due to "this is" framing)
- Fingerprint: weak voicemail
- LLM verdict: `{is_voicemail: true, confidence: 0.81, reasoning: "answering service phrasing"}`
- Final: VOICEMAIL ✓ (cascade saved a wasted message)
Call C — long human greeting
- Audio: "Hi! I'm so glad you called. Just one second, let me find a quieter spot..."
- Twilio AMD: `machine_start` (false positive due to length)
- Fingerprint: weak human
- LLM verdict: `{is_voicemail: false, confidence: 0.92, reasoning: "second person address, conversational"}`
- Final: HUMAN ✓ (cascade saved an awkward "leave a message")
Vapi vs CallSphere Voicemail Detection Comparison
| Metric | Vapi | CallSphere |
|---|---|---|
| Detection signals | Twilio AMD + provider hints | Twilio AMD + audio fingerprint + LLM |
| LLM second-pass | No | Yes (gpt-4o-mini) |
| Production accuracy (campaign) | ~83% | ~96% |
| Cost per detection | Bundled | +$0.0002 LLM cost on ambiguous |
| Custom voicemail rules | Limited | Full LLM prompt + fingerprint config |
| Action on detection | Leave message or hang up | Leave message, hang up, replay tomorrow, send SMS |
| Inspectability | Vapi log | Per-call cascade trace + reasoning |
Detection Cascade Diagram
```mermaid
graph TD
    Start[Outbound call connects] --> Twilio[Twilio AMD<br/>2.5s window]
    Start --> FP[Audio fingerprint<br/>1.5s window]
    Twilio --> Agree{Both agree?}
    FP --> Agree
    Agree -->|yes voicemail| VM[Verdict: VOICEMAIL<br/>cost: $0]
    Agree -->|yes human| H[Verdict: HUMAN<br/>cost: $0]
    Agree -->|disagree| LLM[VoicemailAnalyzerAgent<br/>gpt-4o-mini, 4s transcript]
    LLM --> Conf{conf > 0.65?}
    Conf -->|yes voicemail| VM2[Verdict: VOICEMAIL]
    Conf -->|yes human| H2[Verdict: HUMAN]
    Conf -->|no| U[Verdict: UNCERTAIN<br/>treat as human, log]
    VM --> Action{Leave msg?}
    VM2 --> Action
    Action -->|yes| Beep[Wait beep, deliver SMS-ready msg]
    Action -->|no| Hangup[Hang up, retry tomorrow]
    H --> Live[Run human conversation flow]
    H2 --> Live
    U --> Live
```
Practical Tips
- Cascade > single signal. Always.
- Use
DetectMessageEnd, notEnable.Enablereturns too early. - Log the LLM reasoning. When detection disagrees with reality, the reasoning tells you what to fix.
- Per-region tuning. Audio fingerprints differ by carrier and region; ship a per-region config map.
- Recheck weekly. Voicemail patterns drift as carriers update prompts.
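The per-region tuning tip above might look like this in practice. The region names, delay windows, and thresholds below are made-up placeholders, not shipped CallSphere values:

```python
# Hypothetical per-region fingerprint tuning; real values would come from
# measured carrier behavior in each region.
REGION_FINGERPRINTS = {
    "us-east": {"prebeep_delay_ms": (1500, 4000), "match_threshold": 0.70},
    "us-west": {"prebeep_delay_ms": (2000, 6000), "match_threshold": 0.65},
    "default": {"prebeep_delay_ms": (1500, 8000), "match_threshold": 0.60},
}

def fingerprint_config(region: str) -> dict:
    """Fall back to 'default' for regions without a tuned profile."""
    return REGION_FINGERPRINTS.get(region, REGION_FINGERPRINTS["default"])
```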
FAQ
Does the LLM second pass slow down the call?
Slightly — about 250-400ms on top of Twilio's 2.5s window. For outbound, this is invisible because the agent isn't speaking yet.
Can I customize the voicemail message left?
Yes — CallSphere After-Hours flows include a per-campaign voicemail script tool, so the message left reflects the call's purpose.
What is the inbound counterpart?
Inbound rarely needs voicemail detection (the user is calling you), but the same cascade detects "you have reached an answering service for X" loops if you transfer.
How often does the LLM disagree with Twilio?
About 12% of ambiguous cases land on the LLM, of which ~30% flip the verdict. Net: ~3.5% of all calls have their verdict corrected by the LLM second pass.
What about regional/non-English voicemail?
The LLM prompt is multilingual; we ship Spanish-language voicemail patterns by default and add per-region configs as needed.
See It Live
The /features page lists per-vertical voicemail handling, and /demo includes an outbound test that triggers the full cascade you can inspect.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.