DTMF Handling for Voice Agents: CallSphere vs Vapi Reliability

TL;DR

DTMF (touch-tone) handling looks trivial until you realize agents speak over user inputs, carriers compress audio in ways that mangle tones, and users press keys at unpredictable moments. Vapi offers basic DTMF capture during silent listening windows. CallSphere uses Twilio's native DTMF event stream with custom in-flight logic to capture digits even while the agent is speaking, debounce carrier echoes, and route to IVR-style menus when needed.

This is the engineer-level guide to not letting "press 1 for English" eat your call quality.

Why DTMF Still Matters in Voice AI

Voice AI is great at speech recognition; DTMF still wins for:

Sensitive data — credit card numbers, SSNs (PCI/HIPAA compliance)
Loud environments — drive-throughs, factory floors
Speech-impaired callers — accessibility
IVR fallback — when speech recognition keeps failing
Quick confirmation — "press 1 to confirm, 2 to reschedule"

Stripping DTMF support to look more "AI-native" is a downgrade.

Vapi DTMF Approach

Vapi exposes DTMF through assistant config and webhook events:

{
  "voicemailDetection": {...},
  "endCallFunctionEnabled": true,
  "dtmfReceivedFunction": {
    "name": "on_dtmf",
    "url": "https://your-app.com/dtmf"
  }
}

Default behavior: DTMF is captured during silent listening. If the agent is mid-utterance, DTMF events may be dropped, captured on next pause, or arrive without context.

Strengths: simple to set up.

Weaknesses:

No in-flight DTMF capture
No echo debounce
IVR menus require building a separate stateful flow per assistant

CallSphere DTMF Approach

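CallSphere subscribes to Twilio's native DTMF event stream on the Media Stream WebSocket. Twilio delivers DTMF as discrete events independently of audio, so CallSphere captures them whether or not the agent is speaking. Twilio documents these dtmf messages for bidirectional streams only (those started with <Connect><Stream>), so the call has to be set up that way.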

Event Subscription

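import json

async def media_stream_handler(ws):
    ctx = CallContext()  # assumed per-call state object; constructor arguments omitted
    async for raw in ws:
        event = json.loads(raw)
        if event["event"] == "dtmf":
            # DTMF arrives as its own message, separate from audio frames
            digit = event["dtmf"]["digit"]
            await handle_dtmf(digit, ctx)
        elif event["event"] == "media":
            # Base64-encoded caller audio, forwarded to the speech pipeline
            await forward_audio(event["media"]["payload"])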
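
The handler above trusts every message to be well formed. A minimal defensive sketch follows, assuming the same message shape the handler reads (an "event" field and a nested "dtmf" object with a "digit"). The helper name extract_dtmf_digit and the set of accepted keys are illustrative, not part of CallSphere or Twilio.

```python
import json
from typing import Optional

# Keys on a standard 12-button keypad (assumption: A-D are not needed here).
VALID_DTMF_DIGITS = set("0123456789*#")


def extract_dtmf_digit(raw: str) -> Optional[str]:
    """Return the pressed key if `raw` is a well-formed dtmf message, else None."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("event") != "dtmf":
        return None
    dtmf = event.get("dtmf")
    digit = dtmf.get("digit") if isinstance(dtmf, dict) else None
    # Reject anything that is not a single known keypad symbol.
    if isinstance(digit, str) and digit in VALID_DTMF_DIGITS:
        return digit
    return None
```

Returning None for malformed input lets the stream loop skip the message instead of raising and tearing down the call.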

In-Flight Capture and Speech Suppression

When a DTMF event arrives while the agent is speaking, CallSphere:

Pauses the OpenAI Realtime audio output
Captures the digit and any digits that follow within 1500ms
Resumes (or replans) the response based on the captured key sequence

import asyncio

async def handle_dtmf(digit: str, ctx: CallContext):
    ctx.dtmf_buffer.append(digit)

    # Stop the agent's audio if it is mid-utterance
    if ctx.agent_speaking:
        await ctx.realtime_session.cancel_response()

    # Restart the debounce window: each new digit cancels the pending timer
    if ctx.dtmf_window_task:
        ctx.dtmf_window_task.cancel()
    ctx.dtmf_window_task = asyncio.create_task(
        finalize_dtmf_after_silence(ctx, silence_ms=1500)
    )
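
The snippet calls finalize_dtmf_after_silence without showing it. One way it could work is sketched below, under stated assumptions: the CallContext here is a stripped-down stand-in, on_sequence is a hypothetical callback that receives the finished key sequence, and press reproduces only the buffering half of handle_dtmf so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class CallContext:
    """Hypothetical stand-in for the article's CallContext."""
    on_sequence: Callable[[str], Awaitable[None]]
    dtmf_buffer: List[str] = field(default_factory=list)
    dtmf_window_task: Optional[asyncio.Task] = None


async def finalize_dtmf_after_silence(ctx: CallContext, silence_ms: int) -> None:
    # A newer digit cancels this task while it sleeps, so the code after
    # the sleep only runs once a full quiet window has elapsed.
    await asyncio.sleep(silence_ms / 1000)
    sequence = "".join(ctx.dtmf_buffer)
    ctx.dtmf_buffer.clear()
    # Detach before dispatching so a later key press cannot cancel the dispatch.
    ctx.dtmf_window_task = None
    if sequence:
        await ctx.on_sequence(sequence)


def press(ctx: CallContext, digit: str, silence_ms: int) -> None:
    """Buffer a digit and restart the quiet-window timer."""
    ctx.dtmf_buffer.append(digit)
    if ctx.dtmf_window_task is not None:
        ctx.dtmf_window_task.cancel()
    ctx.dtmf_window_task = asyncio.create_task(
        finalize_dtmf_after_silence(ctx, silence_ms)
    )


async def demo() -> List[str]:
    received: List[str] = []

    async def on_sequence(seq: str) -> None:
        received.append(seq)

    ctx = CallContext(on_sequence=on_sequence)
    for d in "4821":
        press(ctx, d, silence_ms=50)   # short window to keep the demo fast
        await asyncio.sleep(0.01)      # digits arrive 10 ms apart
    await asyncio.sleep(0.2)           # let the final window expire
    return received
```

Running asyncio.run(demo()) delivers the four digits as one sequence, because each press cancels the previous timer before it fires.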

Echo Debounce

Some carriers echo DTMF tones back as audio, which speech-to-text occasionally transcribes as words like "two" or "five." CallSphere maintains a 200ms suppression window after each DTMF event during which speech transcripts are filtered for digit-words paired with the just-pressed digit.

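import re
import time

def is_echo_transcript(transcript: str, recent_dtmf: list[tuple[str, float]]) -> bool:
    # Spoken digit words mapped to keys; entries after "two" are elided in the original
    word_to_digit = {"one": "1", "two": "2"}  # ...
    now = time.monotonic()
    # Compare whole words so that "phone" does not match "one"
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    for digit, ts in recent_dtmf:
        if now - ts > 0.2:  # outside the 200ms suppression window
            continue
        for word, d in word_to_digit.items():
            if word in words and d == digit:
                return True
    return False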
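
For experimenting with the filter offline, here is a self-contained variant. It makes two assumptions that are not in the original: the digit-word map is filled in for zero through nine, and the window length and clock are parameters (window_s, now) so the timing can be tested without waiting.

```python
import re
import time
from typing import List, Optional, Tuple

# Assumed complete mapping; the article's snippet elides most entries.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def is_echo_transcript(
    transcript: str,
    recent_dtmf: List[Tuple[str, float]],
    window_s: float = 0.2,
    now: Optional[float] = None,
) -> bool:
    """True if the transcript contains the spoken form of a just-pressed digit."""
    now = time.monotonic() if now is None else now
    # Whole-word matching: "phone" must not count as "one".
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    for digit, ts in recent_dtmf:
        if now - ts > window_s:
            continue  # this key press is too old to explain the transcript
        if any(WORD_TO_DIGIT.get(w) == digit for w in words):
            return True
    return False
```

Passing now explicitly makes the window deterministic in tests; production code would leave it unset and rely on time.monotonic().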

IVR-Style Menus

For verticals that need traditional IVR fallback (Healthcare, After-Hours), CallSphere supports a config-driven menu mode:

ivr_menu:
  prompt: "For appointments, press 1. For billing, press 2. To speak with someone, press 0."
  options:
    "1": handoff:scheduling_specialist
    "2": handoff:billing_specialist
    "0": handoff:human
  timeout_ms: 8000
  on_timeout: handoff:human
  on_invalid: replay_prompt

The agent can drop into menu mode mid-call ("I'll switch to a touch-tone menu") and exit back to conversational mode after the routing decision.
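
How a key press maps to an action under that config can be sketched in a few lines. The dictionary below is a hand-written mirror of the ivr_menu block, and resolve_menu_action is a hypothetical helper, not a CallSphere API; a timeout is represented by passing None.

```python
from typing import Optional

# Hand-written mirror of the ivr_menu config shown above.
IVR_MENU = {
    "prompt": (
        "For appointments, press 1. For billing, press 2. "
        "To speak with someone, press 0."
    ),
    "options": {
        "1": "handoff:scheduling_specialist",
        "2": "handoff:billing_specialist",
        "0": "handoff:human",
    },
    "timeout_ms": 8000,
    "on_timeout": "handoff:human",
    "on_invalid": "replay_prompt",
}


def resolve_menu_action(menu: dict, digit: Optional[str]) -> str:
    """Map a key press (or None for a timeout) to the configured action."""
    if digit is None:
        return menu["on_timeout"]
    # Unknown keys fall through to the on_invalid action.
    return menu["options"].get(digit, menu["on_invalid"])
```

Keeping the lookup a pure function makes the menu easy to unit test separately from the call audio.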

PCI Mode for Card Capture

For credit card capture, CallSphere switches into PCI mode: the speech transcript is dropped (no logging, no LLM forwarding), DTMF digits go only to a tokenization service (Stripe / Square), and the agent learns only whether the card was captured successfully.

async def collect_card_pci_mode(ctx: CallContext):
    ctx.pci_mode = True  # disables transcript logging
    ctx.realtime_session.disable_speech_input()
    try:
        digits = await collect_dtmf(ctx, count=16, timeout_ms=30000)
        cvv = await collect_dtmf(ctx, count=3, timeout_ms=10000)
        expiry = await collect_dtmf(ctx, count=4, timeout_ms=10000)

        # stripe.tokenize stands in for your tokenization client call
        return await stripe.tokenize(digits, cvv, expiry)
    finally:
        # Restore normal mode even if collection times out or tokenization fails
        ctx.pci_mode = False
        ctx.realtime_session.enable_speech_input()
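
The collect_dtmf helper is not shown in the article. A minimal sketch of one possible shape follows. It assumes digits are fed into an asyncio.Queue by the stream handler, so it takes that queue rather than ctx; that signature is a simplification for illustration. The timeout covers the whole entry, not each digit.

```python
import asyncio
from typing import List, Tuple


async def collect_dtmf(queue: "asyncio.Queue[str]", count: int, timeout_ms: int) -> str:
    """Read exactly `count` digits from `queue` or raise asyncio.TimeoutError."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_ms / 1000
    digits: List[str] = []
    while len(digits) < count:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError
        digit = await asyncio.wait_for(queue.get(), timeout=remaining)
        if digit.isdigit():  # this sketch ignores * and #
            digits.append(digit)
    return "".join(digits)


async def demo() -> Tuple[str, bool]:
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    for d in "12#34":
        queue.put_nowait(d)
    got = await collect_dtmf(queue, count=4, timeout_ms=200)
    try:
        # The queue is now empty, so this call should hit the deadline.
        await collect_dtmf(queue, count=1, timeout_ms=50)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return got, timed_out
```

A single overall deadline keeps a caller who stalls halfway through a card number from holding the call in PCI mode indefinitely.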

Vapi vs CallSphere DTMF Comparison

Dimension	Vapi	CallSphere
In-flight capture (during agent speech)	Limited	Yes (TTS pauses)
Echo debounce	None	200ms suppression
IVR menu mode	DIY	Config-driven
PCI mode (card capture)	DIY	Built-in
Multi-digit sequences	Webhook per digit	Buffered with debounce
Carrier compatibility	Vendor-side	Twilio native, all carriers
Custom action per digit	Webhook	Inline handler or handoff

DTMF Interrupt Flow

sequenceDiagram
    participant User
    participant Twilio
    participant Agent
    participant Realtime as OpenAI Realtime
    participant Tokenize as Stripe Tokenize

    Agent->>Realtime: Generate "Please enter your card"
    Realtime-->>Twilio: Audio (8 kHz mu-law on the Twilio leg)
    Twilio-->>User: "Please enter your card..."
    User->>Twilio: Press 4
    Twilio->>Agent: dtmf event "4"
    Agent->>Realtime: cancel_response()
    Agent->>Agent: pci_mode=true, disable speech
    User->>Twilio: Press 1, 2, 3, ... (16 digits)
    Twilio->>Agent: dtmf events
    Agent->>Tokenize: tokenize(digits, cvv, expiry)
    Tokenize-->>Agent: token_xyz
    Agent->>Agent: pci_mode=false
    Agent->>Realtime: "Card captured. Confirm?"
    Realtime-->>Twilio: Audio
    Twilio-->>User: "Card captured. Confirm?"

Practical Tips

Always suppress speech recognition during DTMF capture. Otherwise, "press 1" gets transcribed as "press one" and you double-capture.
Buffer multi-digit input. Most users type a 4-digit code in 800-1500ms — never act on the first digit alone unless the menu is single-digit.
Provide both modes. Voice-first users hate DTMF; accessibility users hate voice-only. Configure the same flow for both.
Log DTMF events outside PCI mode for debugging. Inside PCI mode, log only the count of captured digits, never the digits.
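Test with at least three carriers. DTMF frequencies are standardized, but carriers can differ in tone duration, level, and whether digits travel in-band or as out-of-band events, so Verizon, AT&T, and T-Mobile can behave observably differently.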
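
The logging tip can be enforced in one place rather than at every call site. The helper below is a hypothetical sketch (dtmf_log_fields is not a CallSphere function): it builds the fields for a log record and drops the digits themselves whenever PCI mode is on.

```python
def dtmf_log_fields(digits: str, pci_mode: bool) -> dict:
    """Build log fields for a captured DTMF sequence, redacting in PCI mode."""
    fields = {"pci_mode": pci_mode, "digit_count": len(digits)}
    if not pci_mode:
        # Outside PCI mode the raw keys are useful for debugging.
        fields["digits"] = digits
    return fields
```

Routing every DTMF log line through a helper like this means a new call site cannot leak card digits by accident.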

FAQ

Does CallSphere support pulse dialing?

No. Pulse dialing is rare in 2026 and Twilio does not deliver pulse events. Callers on pulse-only phones must switch to tone dialing, if their phone supports it, or use voice.

What happens if the carrier strips DTMF tones in audio?

Twilio's signaling-channel DTMF events bypass audio, so this is not an issue. In-band DTMF (rare) is detected separately.

Can DTMF interrupt the agent mid-tool-call?

Yes — the dtmf event handler can cancel an in-flight tool call if the user presses the universal cancel key (configurable, default 0).
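
A sketch of how such a cancel could be wired with asyncio follows. It assumes the DTMF handler sets an asyncio.Event when the cancel key arrives. The names run_tool_cancellable and cancel_event are illustrative, and the tool functions in demo are placeholders.

```python
import asyncio
from typing import Any, Awaitable, Optional, Tuple


async def run_tool_cancellable(
    tool_coro: Awaitable[Any], cancel_event: asyncio.Event
) -> Optional[Any]:
    """Run a tool call, returning None if cancel_event fires before it finishes."""
    tool_task = asyncio.ensure_future(tool_coro)
    cancel_task = asyncio.ensure_future(cancel_event.wait())
    done, _ = await asyncio.wait(
        {tool_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if tool_task in done:
        cancel_task.cancel()  # nobody is waiting on the key press anymore
        return tool_task.result()
    # The cancel key won the race: stop the tool and wait for it to unwind.
    tool_task.cancel()
    try:
        await tool_task
    except asyncio.CancelledError:
        pass
    return None


async def demo() -> Tuple[Optional[str], Optional[str]]:
    cancel = asyncio.Event()

    async def slow_tool() -> str:
        await asyncio.sleep(5)
        return "done"

    async def fast_tool() -> str:
        return "done"

    # Simulate the caller pressing the cancel key 50 ms into the slow tool.
    asyncio.get_running_loop().call_later(0.05, cancel.set)
    cancelled = await run_tool_cancellable(slow_tool(), cancel)
    cancel.clear()
    finished = await run_tool_cancellable(fast_tool(), cancel)
    return cancelled, finished
```

In this sketch a tool that finishes at the same moment the key is pressed still returns its result, which avoids discarding work that has already completed.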

Does PCI mode require additional certification?

Compliance posture is your call; CallSphere's PCI mode is designed to support a SAQ A scope by never persisting PAN data. Confirm with your QSA.

How does this affect voicemail-detection accuracy?

DTMF after voicemail prompts ("press 1 to skip greeting") is captured the same way; CallSphere's voicemail detector uses a separate signal cascade (covered in another post).

Build a Reliable IVR + Voice Hybrid

The /features page lists DTMF-supported verticals, and /demo includes a credit-capture flow that shows PCI mode live.