DTMF Handling for Voice Agents: CallSphere vs Vapi Reliability
DTMF tone capture during agent speech, IVR-style menus, key suppression. How CallSphere handles DTMF via Twilio + custom logic vs Vapi defaults.
TL;DR
DTMF (touch-tone) handling looks trivial until you realize agents speak over user inputs, carriers compress audio in ways that mangle tones, and users press keys at unpredictable moments. Vapi offers basic DTMF capture during silent listening windows. CallSphere uses Twilio's native DTMF event stream with custom in-flight logic to capture digits even while the agent is speaking, debounce carrier echoes, and route to IVR-style menus when needed.
This is the engineer-level guide to not letting "press 1 for English" eat your call quality.
Why DTMF Still Matters in Voice AI
Voice AI is great at speech recognition; DTMF still wins for:
- Sensitive data — credit card numbers, SSNs (PCI/HIPAA compliance)
- Loud environments — drive-throughs, factory floors
- Speech-impaired callers — accessibility
- IVR fallback — when speech recognition keeps failing
- Quick confirmation — "press 1 to confirm, 2 to reschedule"
Stripping DTMF support to look more "AI-native" is a downgrade.
Vapi DTMF Approach
Vapi exposes DTMF through assistant config and webhook events:
{
  "voicemailDetection": {...},
  "endCallFunctionEnabled": true,
  "dtmfReceivedFunction": {
    "name": "on_dtmf",
    "url": "https://your-app.com/dtmf"
  }
}
Default behavior: DTMF is captured during silent listening. If the agent is mid-utterance, DTMF events may be dropped, captured on next pause, or arrive without context.
Strengths: simple to set up.
Weaknesses:
- No in-flight DTMF capture
- No echo debounce
- IVR menus require building a separate stateful flow per assistant
CallSphere DTMF Approach
CallSphere subscribes to Twilio's native DTMF event stream on the Media Stream WebSocket. Twilio delivers DTMF as discrete events independently of audio, so CallSphere captures them whether or not the agent is speaking.
Event Subscription
import json

async def media_stream_handler(ws, ctx: CallContext):
    async for raw in ws:
        event = json.loads(raw)
        if event["event"] == "dtmf":
            # Twilio sends DTMF as its own message type, separate from audio frames
            digit = event["dtmf"]["digit"]
            await handle_dtmf(digit, ctx)
        elif event["event"] == "media":
            await forward_audio(event["media"]["payload"])
In-Flight Capture and Speech Suppression
When a DTMF event arrives while the agent is speaking, CallSphere:
- Pauses the OpenAI Realtime audio output
- Captures the digit and any digits that follow within 1500ms
- Resumes (or replans) the response based on the captured key sequence
import asyncio

async def handle_dtmf(digit: str, ctx: CallContext):
    ctx.dtmf_buffer.append(digit)
    # Pause TTS if agent is speaking
    if ctx.agent_speaking:
        await ctx.realtime_session.cancel_response()
    # Reset debounce window: cancel the pending timer and start a fresh one
    if ctx.dtmf_window_task:
        ctx.dtmf_window_task.cancel()
    ctx.dtmf_window_task = asyncio.create_task(
        finalize_dtmf_after_silence(ctx, silence_ms=1500)
    )
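The handler above schedules `finalize_dtmf_after_silence`, which the post does not show. A minimal sketch of what such a coroutine could look like follows. The `CallContext` fields and the `on_sequence` callback are assumptions made for illustration, not CallSphere's actual types.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class CallContext:
    """Hypothetical minimal context; the real CallContext is not shown in the post."""
    on_sequence: Callable[[str], Awaitable[None]]  # assumed callback for a finished sequence
    dtmf_buffer: List[str] = field(default_factory=list)
    dtmf_window_task: Optional[asyncio.Task] = None


async def finalize_dtmf_after_silence(ctx: CallContext, silence_ms: int) -> None:
    # If another digit arrives, the handler cancels this task during the sleep,
    # so the lines below only run after a full quiet window.
    await asyncio.sleep(silence_ms / 1000)
    sequence = "".join(ctx.dtmf_buffer)
    ctx.dtmf_buffer.clear()
    ctx.dtmf_window_task = None
    if sequence:
        await ctx.on_sequence(sequence)
```

Because each new key press cancels and recreates the task, the callback fires once per burst of digits rather than once per key.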
Echo Debounce
Some carriers echo DTMF tones back as audio, which speech-to-text occasionally transcribes as words like "two" or "five." CallSphere maintains a 200ms suppression window after each DTMF event during which speech transcripts are filtered for digit-words paired with the just-pressed digit.
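import re
import time

def is_echo_transcript(transcript: str, recent_dtmf: list[tuple[str, float]]) -> bool:
    word_to_digit = {"one": "1", "two": "2"}  # ...and so on for the other digit words
    now = time.monotonic()
    # Match whole words so "phone" is not mistaken for "one"
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    for digit, ts in recent_dtmf:
        if now - ts > 0.2:  # outside the 200ms suppression window
            continue
        for word, d in word_to_digit.items():
            if word in words and d == digit:
                return True
    return False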
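The filter expects a `recent_dtmf` list of `(digit, timestamp)` pairs, but the post does not show how that list is maintained. One way to keep it, sketched under the assumption that timestamps come from `time.monotonic()` as in the filter; the `RecentDtmf` class and its method names are hypothetical.

```python
import time
from collections import deque
from typing import Callable, List, Tuple


class RecentDtmf:
    """Hypothetical helper: remembers (digit, timestamp) pairs for a short window."""

    def __init__(self, window_s: float = 0.2,
                 clock: Callable[[], float] = time.monotonic):
        self.window_s = window_s
        self._clock = clock  # injectable so tests can control time
        self._events = deque()

    def record(self, digit: str) -> None:
        """Call this from the DTMF event handler for every key press."""
        self._events.append((digit, self._clock()))

    def snapshot(self) -> List[Tuple[str, float]]:
        """Drop entries older than the window and return the rest, oldest first."""
        cutoff = self._clock() - self.window_s
        while self._events and self._events[0][1] < cutoff:
            self._events.popleft()
        return list(self._events)
```

Passing `recent.snapshot()` as the second argument of the filter keeps the list bounded, since stale presses are pruned on every read.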
IVR-Style Menus
For verticals that need traditional IVR fallback (Healthcare, After-Hours), CallSphere supports a config-driven menu mode:
ivr_menu:
  prompt: "For appointments, press 1. For billing, press 2. To speak with someone, press 0."
  options:
    "1": handoff:scheduling_specialist
    "2": handoff:billing_specialist
    "0": handoff:human
  timeout_ms: 8000
  on_timeout: handoff:human
  on_invalid: replay_prompt
The agent can drop into menu mode mid-call ("I'll switch to a touch-tone menu") and exit back to conversational mode after the routing decision.
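To make the menu semantics concrete, here is a small sketch of how a key press could be resolved against that config. It assumes the YAML has already been loaded into a plain dict; `resolve_menu_action` is a hypothetical name, not part of CallSphere's API.

```python
from typing import Optional

# The menu from the config above, as a plain dict (what a YAML loader would yield).
IVR_MENU = {
    "prompt": "For appointments, press 1. For billing, press 2. "
              "To speak with someone, press 0.",
    "options": {
        "1": "handoff:scheduling_specialist",
        "2": "handoff:billing_specialist",
        "0": "handoff:human",
    },
    "timeout_ms": 8000,
    "on_timeout": "handoff:human",
    "on_invalid": "replay_prompt",
}


def resolve_menu_action(menu: dict, digit: Optional[str]) -> str:
    """Return the action for a key press; digit=None means the timeout fired."""
    if digit is None:
        return menu["on_timeout"]
    # Any key not listed under options falls through to on_invalid.
    return menu["options"].get(digit, menu["on_invalid"])
```

Keeping the routing a pure function of the config and the key makes it easy to unit-test each menu without placing a call.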
PCI Mode for Card Capture
For credit card capture, CallSphere flips into PCI mode: the speech transcript is dropped (no logging, no LLM forwarding), DTMF digits go only to a tokenization service (Stripe / Square), and the agent learns only whether the card was captured successfully.
async def collect_card_pci_mode(ctx: CallContext):
    ctx.pci_mode = True  # disables transcript logging
    ctx.realtime_session.disable_speech_input()
    try:
        digits = await collect_dtmf(ctx, count=16, timeout_ms=30000)
        cvv = await collect_dtmf(ctx, count=3, timeout_ms=10000)
        expiry = await collect_dtmf(ctx, count=4, timeout_ms=10000)
        # `stripe.tokenize` is assumed to be an app-level wrapper around the tokenization API
        return await stripe.tokenize(digits, cvv, expiry)
    finally:
        # Restore conversational mode even if collection times out or tokenization fails
        ctx.pci_mode = False
        ctx.realtime_session.enable_speech_input()
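The card flow relies on a `collect_dtmf` helper that the post does not define. A minimal sketch of one possible shape follows. It assumes digits arrive on an `asyncio.Queue` fed by the DTMF event handler, which differs from the `ctx`-based signature used above; the queue parameter is an illustrative simplification.

```python
import asyncio
from typing import List


async def collect_dtmf(queue: asyncio.Queue, count: int, timeout_ms: int) -> str:
    """Hypothetical collector: read `count` digits or raise asyncio.TimeoutError."""
    digits: List[str] = []

    async def _fill() -> None:
        while len(digits) < count:
            digits.append(await queue.get())

    # One deadline covers the whole field, not each individual key press.
    await asyncio.wait_for(_fill(), timeout=timeout_ms / 1000)
    return "".join(digits)
```

A single deadline per field is a design choice in this sketch; a per-key timeout would be more forgiving for slow callers but makes the worst-case wait harder to bound.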
Vapi vs CallSphere DTMF Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| In-flight capture (during agent speech) | Limited | Yes (TTS pauses) |
| Echo debounce | None | 200ms suppression |
| IVR menu mode | DIY | Config-driven |
| PCI mode (card capture) | DIY | Built-in |
| Multi-digit sequences | Webhook per digit | Buffered with debounce |
| Carrier compatibility | Vendor-side | Twilio native, all carriers |
| Custom action per digit | Webhook | Inline handler or handoff |
DTMF Interrupt Flow
sequenceDiagram
participant User
participant Twilio
participant Agent
participant Realtime as OpenAI Realtime
participant Tokenize as Stripe Tokenize
Agent->>Realtime: Generate "Please enter your card"
Realtime-->>Twilio: G.711 μ-law audio
Twilio-->>User: "Please enter your card..."
User->>Twilio: Press 4
Twilio->>Agent: dtmf event "4"
Agent->>Realtime: cancel_response()
Agent->>Agent: pci_mode=true, disable speech
User->>Twilio: Press 1, 2, 3, ... (16 digits)
Twilio->>Agent: dtmf events
Agent->>Tokenize: tokenize(digits, cvv, expiry)
Tokenize-->>Agent: token_xyz
Agent->>Agent: pci_mode=false
Agent->>Realtime: "Card captured. Confirm?"
Realtime-->>Twilio: G.711 μ-law
Twilio-->>User: "Card captured. Confirm?"
Practical Tips
- Always suppress speech recognition during DTMF capture. Otherwise, "press 1" gets transcribed as "press one" and you double-capture.
- Buffer multi-digit input. Most users type a 4-digit code in 800-1500ms — never act on the first digit alone unless the menu is single-digit.
- Provide both modes. Voice-first users hate DTMF; accessibility users hate voice-only. Configure the same flow for both.
- Log DTMF events outside PCI mode for debugging. Inside PCI mode, log only the count of captured digits, never the digits.
- Test with at least three carriers. Verizon, AT&T, and T-Mobile show observably different DTMF behavior (tone duration, in-band versus out-of-band delivery).
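The logging tip can be enforced in one place rather than at every call site. A short sketch, assuming structured log fields are built by a helper before being handed to the logger; `dtmf_log_fields` is a hypothetical name.

```python
def dtmf_log_fields(digits: str, pci_mode: bool) -> dict:
    """Build log fields for a DTMF capture; in PCI mode only the count survives."""
    if pci_mode:
        # Never include the digits themselves while card data is being entered.
        return {"dtmf_count": len(digits), "pci_mode": True}
    return {"dtmf_digits": digits, "dtmf_count": len(digits), "pci_mode": False}
```

Routing every DTMF log line through a helper like this means a forgotten `if pci_mode` check elsewhere cannot leak card digits into logs.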
FAQ
Does CallSphere support pulse dialing?
No. Pulse dialing is rare in 2026, and Twilio does not deliver pulse events. Callers on pulse-only phones must switch to tone dialing or use voice.
What happens if the carrier strips DTMF tones in audio?
Twilio's signaling-channel DTMF events bypass audio, so this is not an issue. Inband DTMF (rare) is detected separately.
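The post does not say how in-band tones are detected. One textbook approach is the Goertzel algorithm, which measures signal power at each of the eight DTMF frequencies and picks the strongest row and column. The sketch below illustrates that technique only; it is not CallSphere's implementation, and the function names and the `min_power` threshold are illustrative.

```python
import math

LOW_FREQS = (697, 770, 852, 941)       # keypad row frequencies in Hz
HIGH_FREQS = (1209, 1336, 1477, 1633)  # keypad column frequencies in Hz
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Signal power at `freq`, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate=8000, min_power=1.0):
    """Return the key whose row and column tones dominate, or None if too quiet."""
    low = [goertzel_power(samples, f, sample_rate) for f in LOW_FREQS]
    high = [goertzel_power(samples, f, sample_rate) for f in HIGH_FREQS]
    row, col = low.index(max(low)), high.index(max(high))
    if low[row] < min_power or high[col] < min_power:
        return None
    return KEYPAD[row][col]


def make_tone(key, sample_rate=8000, duration_s=0.05):
    """Synthesize a clean two-tone signal for `key` (test helper)."""
    row = next(r for r, keys in enumerate(KEYPAD) if key in keys)
    f_low, f_high = LOW_FREQS[row], HIGH_FREQS[KEYPAD[row].index(key)]
    n = int(sample_rate * duration_s)
    return [
        0.5 * math.sin(2 * math.pi * f_low * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * f_high * i / sample_rate)
        for i in range(n)
    ]
```

A production detector would also need checks this sketch omits, such as rejecting speech that happens to contain energy at these frequencies and enforcing a minimum tone duration.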
Can DTMF interrupt the agent mid-tool-call?
Yes — the dtmf event handler can cancel an in-flight tool call if the user presses the universal cancel key (configurable, default 0).
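One way to structure that interruption is to race the tool call against the cancel key. The sketch below assumes tool calls run as asyncio tasks and that key presses arrive on an `asyncio.Queue`; `run_tool_with_dtmf_cancel` is a hypothetical helper, not a CallSphere function.

```python
import asyncio

CANCEL_KEY = "0"  # the default named above; configurable


async def run_tool_with_dtmf_cancel(tool_coro, dtmf_queue: asyncio.Queue):
    """Run a tool call; return its result, or None if the cancel key arrives first."""
    tool_task = asyncio.create_task(tool_coro)

    async def _wait_for_cancel_key() -> None:
        while await dtmf_queue.get() != CANCEL_KEY:
            pass  # other keys are ignored in this sketch

    key_task = asyncio.create_task(_wait_for_cancel_key())
    done, _ = await asyncio.wait(
        {tool_task, key_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if tool_task in done:
        key_task.cancel()
        return tool_task.result()
    tool_task.cancel()
    # Let the cancellation settle so no task is left pending.
    await asyncio.gather(tool_task, return_exceptions=True)
    return None
```

Cancellation only stops the local coroutine; a tool that has already sent a request to an external system may still need its own compensation step.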
Does PCI mode require additional certification?
Compliance posture is your call; CallSphere's PCI mode is designed to support a SAQ A scope by never persisting PAN data. Confirm with your QSA.
How does this affect voicemail-detection accuracy?
DTMF after voicemail prompts ("press 1 to skip greeting") is captured the same way; CallSphere's voicemail detector uses a separate signal cascade (covered in another post).
Build a Reliable IVR + Voice Hybrid
The /features page lists DTMF-supported verticals, and /demo includes a credit-capture flow that shows PCI mode live.