---
title: "DTMF Handling for Voice Agents: CallSphere vs Vapi Reliability"
description: "DTMF tone capture during agent speech, IVR-style menus, key suppression. How CallSphere handles DTMF via Twilio + custom logic vs Vapi defaults."
canonical: https://callsphere.ai/blog/dtmf-handling-voice-agents-callsphere-vs-vapi
category: "Technical Guides"
tags: ["DTMF", "IVR", "Voice AI", "CallSphere", "Vapi", "Twilio", "Telephony"]
author: "CallSphere Team"
published: 2026-04-22T00:00:00.000Z
updated: 2026-05-01T06:44:46.989Z
---

# DTMF Handling for Voice Agents: CallSphere vs Vapi Reliability

> DTMF tone capture during agent speech, IVR-style menus, key suppression. How CallSphere handles DTMF via Twilio + custom logic vs Vapi defaults.

## TL;DR

DTMF (touch-tone) handling looks trivial until you realize agents speak over user inputs, carriers compress audio in ways that mangle tones, and users press keys at unpredictable moments. **Vapi** offers basic DTMF capture during silent listening windows. **CallSphere** uses **Twilio's native DTMF event stream** with custom in-flight logic to capture digits even while the agent is speaking, debounce carrier echoes, and route to IVR-style menus when needed.

This is the engineer-level guide to not letting "press 1 for English" eat your call quality.

## Why DTMF Still Matters in Voice AI

Voice AI is great at speech recognition; DTMF still wins for:

- **Sensitive data** — credit card numbers, SSNs (PCI/HIPAA compliance)
- **Loud environments** — drive-throughs, factory floors
- **Speech-impaired callers** — accessibility
- **IVR fallback** — when speech recognition keeps failing
- **Quick confirmation** — "press 1 to confirm, 2 to reschedule"

Stripping DTMF support to look more "AI-native" is a downgrade.

## Vapi DTMF Approach

Vapi exposes DTMF through assistant config and webhook events:

```json
{
  "voicemailDetection": {...},
  "endCallFunctionEnabled": true,
  "dtmfReceivedFunction": {
    "name": "on_dtmf",
    "url": "https://your-app.com/dtmf"
  }
}
```

Default behavior: DTMF is captured during silent listening. If the agent is mid-utterance, DTMF events may be dropped, captured on the next pause, or delivered without context.

**Strengths:** simple to set up.

**Weaknesses:**

- No in-flight DTMF capture
- No echo debounce
- IVR menus require building a separate stateful flow per assistant

## CallSphere DTMF Approach

CallSphere subscribes to Twilio's native DTMF event stream on the Media Stream WebSocket. Twilio delivers DTMF as discrete events independently of audio, so CallSphere captures them whether or not the agent is speaking.

### Event Subscription

```python
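import json

async def media_stream_handler(ws, ctx):
    # Twilio sends one JSON message per WebSocket frame.
    async for raw in ws:
        event = json.loads(raw)
        if event["event"] == "dtmf":
            # DTMF arrives as its own event type, separate from audio frames.
            digit = event["dtmf"]["digit"]
            await handle_dtmf(digit, ctx)
        elif event["event"] == "media":
            # Base64-encoded audio payload from the caller.
            await forward_audio(event["media"]["payload"])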
```
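For reference, a `dtmf` message on a Twilio Media Stream is a small JSON object with the pressed key nested under `dtmf.digit`. The sketch below parses one defensively. The sample values (`MZ-example`, sequence number `5`) are made up for illustration, and `parse_dtmf` is a hypothetical helper, not part of CallSphere or any Twilio SDK.

```python
import json
from typing import Optional

# Assumption: only the 12 standard keypad keys are accepted here.
VALID_DIGITS = set("0123456789*#")

def parse_dtmf(raw: str) -> Optional[str]:
    """Return the pressed key from a Media Stream message, or None."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("event") != "dtmf":
        return None
    payload = event.get("dtmf")
    if not isinstance(payload, dict):
        return None
    digit = payload.get("digit")
    # Reject anything that is not a single known keypad key.
    return digit if isinstance(digit, str) and digit in VALID_DIGITS else None

# Illustrative message; identifiers are placeholders.
sample = json.dumps({
    "event": "dtmf",
    "streamSid": "MZ-example",
    "sequenceNumber": "5",
    "dtmf": {"track": "inbound_track", "digit": "1"},
})
print(parse_dtmf(sample))
```

Returning `None` for malformed or non-DTMF messages keeps the WebSocket loop from crashing on unexpected frames.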

### In-Flight Capture and Speech Suppression

When a DTMF event arrives while the agent is speaking, CallSphere:

1. Pauses the OpenAI Realtime audio output
2. Captures the digit and any digits that follow within 1500ms
3. Resumes (or replans) the response based on the captured key sequence

```python
import asyncio

async def handle_dtmf(digit: str, ctx: CallContext):
    ctx.dtmf_buffer.append(digit)

    # Pause TTS if agent is speaking
    if ctx.agent_speaking:
        await ctx.realtime_session.cancel_response()

    # Reset debounce window: cancel the pending finalizer, then schedule a new one
    if ctx.dtmf_window_task is not None:
        ctx.dtmf_window_task.cancel()
    ctx.dtmf_window_task = asyncio.create_task(
        finalize_dtmf_after_silence(ctx, silence_ms=1500)
    )
```
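The handler above schedules `finalize_dtmf_after_silence` but does not show it. A minimal sketch of how such a quiet-period finalizer could work follows. `DtmfBuffer`, `on_digit`, and `finalize_after_silence` are hypothetical names standing in for the real `CallContext` fields, and the demo uses a short window so it runs quickly.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DtmfBuffer:
    """Hypothetical stand-in for the DTMF fields on CallContext."""
    digits: list = field(default_factory=list)
    window_task: Optional[asyncio.Task] = None
    completed: list = field(default_factory=list)  # finalized key sequences

async def finalize_after_silence(buf: DtmfBuffer, silence_ms: int) -> None:
    # Wait out the quiet period; a newer key press cancels this task first.
    await asyncio.sleep(silence_ms / 1000)
    buf.completed.append("".join(buf.digits))
    buf.digits.clear()
    buf.window_task = None

def on_digit(buf: DtmfBuffer, digit: str, silence_ms: int = 1500) -> None:
    buf.digits.append(digit)
    if buf.window_task is not None:
        buf.window_task.cancel()  # restart the quiet-period timer
    buf.window_task = asyncio.create_task(
        finalize_after_silence(buf, silence_ms)
    )

async def demo() -> list:
    buf = DtmfBuffer()
    for d in "42":
        on_digit(buf, d, silence_ms=100)
        await asyncio.sleep(0.01)   # keys arrive 10 ms apart
    await asyncio.sleep(0.3)        # caller goes quiet, sequence finalizes
    on_digit(buf, "7", silence_ms=100)
    await asyncio.sleep(0.3)
    return buf.completed

print(asyncio.run(demo()))
```

Each key press restarts the timer, so keys pressed in quick succession are grouped into one sequence and a later press starts a new one.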

### Echo Debounce

Some carriers echo DTMF tones back as audio, which speech-to-text occasionally transcribes as words like "two" or "five." CallSphere maintains a 200ms suppression window after each DTMF event during which speech transcripts are filtered for digit-words paired with the just-pressed digit.

```python
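import re
import time

# Spoken digit words mapped to the keys they could be echoing.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def is_echo_transcript(transcript: str, recent_dtmf: list[tuple[str, float]]) -> bool:
    # Match whole words so "one" does not fire on "phone" or "someone".
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    now = time.monotonic()
    for digit, ts in recent_dtmf:
        if now - ts > 0.2:
            continue  # outside the 200ms suppression window
        if any(WORD_TO_DIGIT.get(word) == digit for word in words):
            return True
    return False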
```

### IVR-Style Menus

For verticals that need traditional IVR fallback (Healthcare, After-Hours), CallSphere supports a config-driven menu mode:

```yaml
ivr_menu:
  prompt: "For appointments, press 1. For billing, press 2. To speak with someone, press 0."
  options:
    "1": handoff:scheduling_specialist
    "2": handoff:billing_specialist
    "0": handoff:human
  timeout_ms: 8000
  on_timeout: handoff:human
  on_invalid: replay_prompt
```

The agent can drop into menu mode mid-call ("I'll switch to a touch-tone menu") and exit back to conversational mode after the routing decision.
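One way to picture how a menu config like this turns into a routing decision is a small lookup function. The sketch below is an assumption about the shape of such a resolver, not CallSphere's implementation; `IvrMenu` and `resolve_menu_action` are hypothetical names, and the action strings are copied from the YAML above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IvrMenu:
    """Hypothetical in-memory form of the ivr_menu config."""
    prompt: str
    options: dict
    on_timeout: str
    on_invalid: str
    timeout_ms: int = 8000

def resolve_menu_action(menu: IvrMenu, digit: Optional[str]) -> str:
    """Map a key press (or None for a timeout) to an action string."""
    if digit is None:
        return menu.on_timeout
    # Unknown keys fall through to the invalid-input action.
    return menu.options.get(digit, menu.on_invalid)

menu = IvrMenu(
    prompt=(
        "For appointments, press 1. For billing, press 2. "
        "To speak with someone, press 0."
    ),
    options={
        "1": "handoff:scheduling_specialist",
        "2": "handoff:billing_specialist",
        "0": "handoff:human",
    },
    on_timeout="handoff:human",
    on_invalid="replay_prompt",
)

print(resolve_menu_action(menu, "2"))
```

Keeping the resolver a pure function makes the menu easy to unit test without a live call.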

### PCI Mode for Card Capture

For credit card capture, CallSphere flips into **PCI mode**: speech transcript is dropped (no logging, no LLM forwarding), only DTMF events are captured into a tokenization service (Stripe / Square), and the agent only knows the card was captured successfully or not.

```python
async def collect_card_pci_mode(ctx: CallContext):
    ctx.pci_mode = True  # disables transcript logging
    ctx.realtime_session.disable_speech_input()
    try:
        # Fixed lengths assume a 16-digit card with a 3-digit CVV;
        # American Express, for example, uses 15 digits and a 4-digit code.
        digits = await collect_dtmf(ctx, count=16, timeout_ms=30000)
        cvv = await collect_dtmf(ctx, count=3, timeout_ms=10000)
        expiry = await collect_dtmf(ctx, count=4, timeout_ms=10000)
        return await stripe.tokenize(digits, cvv, expiry)
    finally:
        # Restore normal mode even if collection times out or tokenization fails
        ctx.pci_mode = False
        ctx.realtime_session.enable_speech_input()
```
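The snippet above relies on a `collect_dtmf` helper that is not shown. A minimal sketch of one possible shape follows, assuming DTMF keys are pushed onto an `asyncio.Queue` by the event handler. That queue-based signature is an assumption for illustration (the original takes `ctx`), and the demo uses placeholder digits, not card data.

```python
import asyncio

async def collect_dtmf(queue: asyncio.Queue, count: int, timeout_ms: int) -> str:
    """Read exactly `count` keys from `queue`, or raise asyncio.TimeoutError.

    The timeout covers the whole sequence, not each individual key.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_ms / 1000
    digits = []
    while len(digits) < count:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError
        digits.append(await asyncio.wait_for(queue.get(), timeout=remaining))
    return "".join(digits)

async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    for d in "4242":
        queue.put_nowait(d)
    got = await collect_dtmf(queue, count=4, timeout_ms=500)
    try:
        # Nothing left in the queue, so this should time out.
        await collect_dtmf(queue, count=1, timeout_ms=50)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return got, timed_out

print(asyncio.run(demo()))
```

In PCI mode the returned string should go straight to the tokenizer and never be logged or stored.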

## Vapi vs CallSphere DTMF Comparison

| Dimension | Vapi | CallSphere |
| --- | --- | --- |
| In-flight capture (during agent speech) | Limited | Yes (TTS pauses) |
| Echo debounce | None | 200ms suppression |
| IVR menu mode | DIY | Config-driven |
| PCI mode (card capture) | DIY | Built-in |
| Multi-digit sequences | Webhook per digit | Buffered with debounce |
| Carrier compatibility | Vendor-side | Twilio native, all carriers |
| Custom action per digit | Webhook | Inline handler or handoff |

## DTMF Interrupt Flow

```mermaid
sequenceDiagram
    participant User
    participant Twilio
    participant Agent
    participant Realtime as OpenAI Realtime
    participant Tokenize as Stripe Tokenize

    Agent->>Realtime: Generate "Please enter your card"
    Realtime-->>Twilio: PCM16 audio
    Twilio-->>User: "Please enter your card..."
    User->>Twilio: Press 4
    Twilio->>Agent: dtmf event "4"
    Agent->>Realtime: cancel_response()
    Agent->>Agent: pci_mode=true, disable speech
    User->>Twilio: Press 1, 2, 3, ... (16 digits)
    Twilio->>Agent: dtmf events
    Agent->>Tokenize: tokenize(digits, cvv, expiry)
    Tokenize-->>Agent: token_xyz
    Agent->>Agent: pci_mode=false
    Agent->>Realtime: "Card captured. Confirm?"
    Realtime-->>Twilio: PCM16
    Twilio-->>User: "Card captured. Confirm?"
```

## Practical Tips

- **Always suppress speech recognition during DTMF capture.** Otherwise a caller who says "one" while pressing 1 is also transcribed, and the same input is captured twice.
- **Buffer multi-digit input.** Most users type a 4-digit code in 800-1500ms — never act on the first digit alone unless the menu is single-digit.
- **Provide both modes.** Voice-first users hate DTMF; accessibility users hate voice-only. Configure the same flow for both.
- **Log DTMF events outside PCI mode for debugging.** Inside PCI mode, log only the count of captured digits, never the digits.
- **Test with at least three carriers.** Verizon, AT&T, and T-Mobile can deliver DTMF differently (tone duration, in-band audio versus out-of-band events), so a flow that works on one may misbehave on another.
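The logging tip can be enforced in one place with a small formatting helper. This is a sketch under assumptions: `describe_dtmf` and the `dtmf` logger name are hypothetical, and real code would also need to keep digits out of exception messages and traces.

```python
import logging

logger = logging.getLogger("dtmf")  # hypothetical logger name

def describe_dtmf(digits: str, pci_mode: bool) -> str:
    """Build a log message that never contains digits in PCI mode."""
    if pci_mode:
        # Only the count is safe to record while capturing card data.
        return f"dtmf captured count={len(digits)}"
    return f"dtmf captured digits={digits}"

def log_dtmf(digits: str, pci_mode: bool) -> None:
    logger.info(describe_dtmf(digits, pci_mode))

print(describe_dtmf("1234", pci_mode=True))
print(describe_dtmf("1234", pci_mode=False))
```

Routing every DTMF log line through one function makes it harder for a stray `logger.debug(digits)` to leak card data.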

## FAQ

### Does CallSphere support pulse dialing?

No — pulse dialing is rare in 2026 and Twilio does not deliver pulse events. Pulse callers must dial differently or use voice.

### What happens if the carrier strips DTMF tones in audio?

Twilio's signaling-channel DTMF events bypass audio, so this is not an issue. Inband DTMF (rare) is detected separately.

### Can DTMF interrupt the agent mid-tool-call?

Yes — the dtmf event handler can cancel an in-flight tool call if the user presses the universal cancel key (configurable, default 0).
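As a rough illustration of the idea, the sketch below races a tool call against incoming key presses using plain `asyncio`. It is not CallSphere's handler: `run_tool_with_cancel`, `slow_tool`, and the queue of keys are hypothetical, and keys other than the cancel key are simply discarded here.

```python
import asyncio
from typing import Optional

CANCEL_KEY = "0"  # matches the default described above

async def slow_tool(delay_s: float) -> str:
    """Placeholder for a long-running tool call."""
    await asyncio.sleep(delay_s)
    return "done"

async def run_tool_with_cancel(
    tool_coro, keys: asyncio.Queue, cancel_key: str = CANCEL_KEY
) -> Optional[str]:
    """Return the tool result, or None if the caller pressed cancel_key."""
    tool_task = asyncio.ensure_future(tool_coro)
    while True:
        key_task = asyncio.ensure_future(keys.get())
        done, _ = await asyncio.wait(
            {tool_task, key_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if tool_task in done:
            key_task.cancel()  # stop waiting for keys
            return tool_task.result()
        if key_task.result() == cancel_key:
            tool_task.cancel()  # abandon the in-flight tool call
            return None
        # Any other key is ignored in this sketch; keep waiting.

async def demo() -> tuple:
    keys: asyncio.Queue = asyncio.Queue()
    # Simulate the caller pressing 0 after 50 ms.
    asyncio.get_running_loop().call_later(0.05, keys.put_nowait, "0")
    cancelled = await run_tool_with_cancel(slow_tool(5.0), keys)
    finished = await run_tool_with_cancel(slow_tool(0.01), keys)
    return cancelled, finished

print(asyncio.run(demo()))
```

Cancelling the task only stops the local coroutine; a tool with remote side effects would still need its own rollback or idempotency handling.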

### Does PCI mode require additional certification?

Compliance posture is your call; CallSphere's PCI mode is designed to support a SAQ A scope by never persisting PAN data. Confirm with your QSA.

### How does this affect voicemail-detection accuracy?

DTMF after voicemail prompts ("press 1 to skip greeting") is captured the same way; CallSphere's voicemail detector uses a separate signal cascade (covered in another post).

## Build a Reliable IVR + Voice Hybrid

The [/features](/features) page lists DTMF-supported verticals, and [/demo](/demo) includes a credit-capture flow that shows PCI mode live.
