By Sagar Shankaran, Founder of CallSphere
RFC 4733 RTP events handle most DTMF, but SIP INFO is the workaround when carriers strip telephone-events or when your AI agent needs out-of-band signaling. Here is when to use which in 2026.
Key takeaways
The user pressed 4 to confirm. Your AI agent never heard it. Welcome to the DTMF transport problem - a 30-year-old wart that still bites AI voice deployments in 2026.
flowchart LR
UA[SIP UA] -- REGISTER --> Reg[Registrar]
UA -- INVITE --> Proxy[SIP Proxy]
Proxy --> Dispatcher[Kamailio dispatcher]
Dispatcher --> Worker1[FreeSWITCH worker]
Dispatcher --> Worker2[FreeSWITCH worker]
Worker1 --> AI[(AI agent)]
Worker2 --> AI
DTMF (touch-tones) over IP has three transport methods. RFC 4733 (which obsoletes the older RFC 2833) defines telephone-event payloads carried inside the RTP stream under a dedicated, dynamically negotiated payload type. SIP INFO, originally defined by RFC 2976 and now specified by RFC 6086 (which obsoletes it), carries the digit as a SIP signaling message outside the media path. In-band DTMF plays the audible tone in the audio itself.
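To make the RFC 4733 option concrete, here is a minimal sketch of decoding the 4-byte telephone-event payload from one RTP packet. The names `TelephoneEvent`, `parse_telephone_event`, and `duration_ms` are hypothetical helpers for illustration, not part of any library, and the 8000 Hz clock rate is an assumption that should come from the negotiated SDP.

```python
import struct
from dataclasses import dataclass

# Event codes 0-15 map to the 16 DTMF keys in RFC 4733.
EVENT_TO_DIGIT = "0123456789*#ABCD"


@dataclass
class TelephoneEvent:
    digit: str
    end: bool      # E bit: set on the final packets of an event
    volume: int    # tone power in -dBm0, 0-63
    duration: int  # in RTP timestamp units, not milliseconds


def parse_telephone_event(payload: bytes) -> TelephoneEvent:
    """Decode the 4-byte telephone-event payload of a single RTP packet."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    # Layout: event (8 bits), E + R + volume (8 bits), duration (16 bits).
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(EVENT_TO_DIGIT):
        raise ValueError(f"not a DTMF event code: {event}")
    return TelephoneEvent(
        digit=EVENT_TO_DIGIT[event],
        end=bool(flags & 0x80),
        volume=flags & 0x3F,
        duration=duration,
    )


def duration_ms(evt: TelephoneEvent, clock_rate: int = 8000) -> int:
    """Convert timestamp units to milliseconds (clock rate is an assumption)."""
    return evt.duration * 1000 // clock_rate
```

A sender typically repeats the final packet with the E bit set, so a receiver built on this sketch would still need to collapse packets that share an RTP timestamp into one key press.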
For AI voice agents, the picture is messy. Most US carriers prefer RFC 4733 telephone-events on egress because they are precise and tone-faithful. But carrier-level transcoding can strip the events on transit (PCMU + tel-event mismatch), wholesale resellers sometimes drop the negotiated payload type, and AI bridges that decode RTP straight to PCM may not detect tel-events at all. SIP INFO is the fallback when RTP events do not arrive.
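One way to notice a dropped payload type early is to inspect the SDP answer for a telephone-event rtpmap line before the call starts. The sketch below assumes you have the raw SDP text; the function names and the "sip-info-or-inband" label are hypothetical, and a fuller check would also confirm the payload type appears on the audio m= line.

```python
import re
from typing import Dict

# Matches lines such as: a=rtpmap:101 telephone-event/8000
_RTPMAP = re.compile(r"^a=rtpmap:(\d+)\s+telephone-event/(\d+)", re.IGNORECASE)


def telephone_event_payload_types(sdp: str) -> Dict[int, int]:
    """Return {payload_type: clock_rate} for each telephone-event rtpmap line."""
    found: Dict[int, int] = {}
    for line in sdp.splitlines():
        match = _RTPMAP.match(line.strip())
        if match:
            found[int(match.group(1))] = int(match.group(2))
    return found


def choose_dtmf_transport(answer_sdp: str) -> str:
    """Pick RFC 4733 when it was negotiated, otherwise plan for a fallback."""
    if telephone_event_payload_types(answer_sdp):
        return "rfc4733"
    return "sip-info-or-inband"
```

A negotiated payload type does not guarantee the events survive transit, so this check is best treated as a hint for logging and alerting rather than a reason to disable the other listeners.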
A SIP INFO DTMF message looks like:
INFO sip:bridge@callsphere.ai SIP/2.0
Via: SIP/2.0/TLS sbc.twilio.com;branch=z9hG4bK-info-1
Content-Type: application/dtmf-relay
Content-Length: 24

Signal=4
Duration=200
Signal is the digit (0-9, *, #, A-D); Duration is the tone duration in milliseconds. Some implementations also accept application/dtmf, a simpler one-line body, but application/dtmf-relay is the de facto standard, originating from Cisco and adopted broadly. Neither body format is defined by RFC 6086, which specifies only the INFO method and its package framework.
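A parser for the application/dtmf-relay body can stay small. This is a sketch under the assumption that the body uses the Signal= and Duration= lines shown above; `parse_dtmf_relay` is a hypothetical helper, and real traffic may include vendor-specific variations it does not handle.

```python
from typing import Optional, Tuple

VALID_SIGNALS = set("0123456789*#ABCD")


def parse_dtmf_relay(body: str) -> Tuple[str, Optional[int]]:
    """Parse an application/dtmf-relay body into (digit, duration_ms)."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not key=value
            fields[key.strip().lower()] = value.strip()

    signal = fields.get("signal", "").upper()
    if signal not in VALID_SIGNALS:
        raise ValueError(f"unsupported or missing Signal value: {signal!r}")

    duration = fields.get("duration")
    duration_ms = int(duration) if duration and duration.isdigit() else None
    return signal, duration_ms
```

For the message above, `parse_dtmf_relay("Signal=4\r\nDuration=200\r\n")` yields the digit "4" with a duration of 200 ms.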
For AI agents the typical event flow is:
A robust AI bridge listens for all three. The OpenAI Realtime API can detect DTMF tones in the audio stream as a server-side feature, but the timing is less precise than RFC 4733 events; for menu-driven flows, a parser on SIP INFO is faster and more reliable.
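For the in-band case, a bridge that already decodes RTP to PCM can run its own tone detector. The sketch below uses the Goertzel algorithm over one frame of samples; it is an illustrative technique and not necessarily how any particular vendor detects tones. The function names, the 8000 Hz rate, and the `min_ratio` threshold are assumptions.

```python
import math
from typing import Optional, Sequence

LOW_FREQS = (697, 770, 852, 941)       # keypad rows
HIGH_FREQS = (1209, 1336, 1477, 1633)  # keypad columns
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples: Sequence[float], freq: float, rate: int) -> float:
    """Signal power at a single frequency, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_digit(samples: Sequence[float], rate: int = 8000,
                 min_ratio: float = 5.0) -> Optional[str]:
    """Return the DTMF key present in one frame, or None if no clear tone pair."""
    low = [goertzel_power(samples, f, rate) for f in LOW_FREQS]
    high = [goertzel_power(samples, f, rate) for f in HIGH_FREQS]
    row = max(range(4), key=low.__getitem__)
    col = max(range(4), key=high.__getitem__)
    # Require each winning frequency to clearly dominate the rest of its group.
    for powers, best in ((low, row), (high, col)):
        others = max(p for i, p in enumerate(powers) if i != best)
        if powers[best] <= 0.0 or powers[best] < min_ratio * others:
            return None
    return KEYPAD[row][col]
```

A production detector would also need an absolute energy floor, a check on the level difference between the two tones, and a minimum tone duration across consecutive frames, since speech can briefly resemble a tone pair.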
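# FastAPI handler that merges DTMF sources
import asyncio
import json

from fastapi import FastAPI, Form, WebSocket

app = FastAPI()
dtmf_queue: asyncio.Queue = asyncio.Queue()  # shared by both DTMF sources

@app.post("/twilio/webhook/dtmf")
async def handle_dtmf(call_sid: str = Form(..., alias="CallSid"), digits: str = Form(..., alias="Digits")):
    """Twilio sends DTMF as a webhook (its preferred method)."""
    # Twilio posts form-encoded CallSid and Digits fields; Form() needs python-multipart installed.
    await dtmf_queue.put({"sid": call_sid, "digit": digits, "src": "webhook"})

@app.websocket("/realtime/{call_sid}")
async def media_stream(ws: WebSocket, call_sid: str):
    await ws.accept()  # complete the WebSocket handshake before reading frames
    async for msg in ws.iter_text():
        evt = json.loads(msg)
        if evt.get("event") == "dtmf":
            await dtmf_queue.put({"sid": call_sid, "digit": evt["dtmf"]["digit"], "src": "media"})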
The Twilio webhook path is roughly equivalent to SIP INFO out-of-band; the WebSocket media-stream DTMF event is roughly equivalent to RFC 4733. We dedupe in dtmf_queue since both can fire.
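The article does not show how the dedupe works, so the following is only one possible approach: drop a digit when the same digit for the same call arrived from a different source within a short window. `DtmfDeduper`, the 0.5-second window, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class DtmfDeduper:
    """Suppress a digit that another source already reported moments ago."""

    def __init__(self, window_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic):
        self.window_s = window_s
        self.clock = clock
        # (call sid, digit) -> (time accepted, source that reported it)
        self._last: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def accept(self, event: dict) -> bool:
        """Return True if the event is a new key press, False if a duplicate."""
        key = (event["sid"], event["digit"])
        now = self.clock()
        prev = self._last.get(key)
        if prev is not None:
            prev_time, prev_src = prev
            # Same digit from a *different* source inside the window: duplicate.
            if prev_src != event["src"] and now - prev_time < self.window_s:
                return False
        self._last[key] = (now, event["src"])
        return True
```

Comparing sources means a caller who presses the same key twice quickly still registers two presses from a single transport. The window would need tuning against real traffic, because a webhook can arrive noticeably later than a media-stream event, and entries should be cleared when the call ends.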
CallSphere uses Twilio Programmable Voice across all six verticals. Twilio handles DTMF detection across in-band tone and RFC 4733 events automatically and forwards us either a webhook (TwiML <Gather>) or a Media Streams DTMF event. For Healthcare AI on FastAPI :8084 we accept both paths into a unified queue and feed digits into the OpenAI Realtime conversation as user-content events. Sales Calling AI uses DTMF for opt-out (press 9 to stop) on outbound legs - 5 concurrent per tenant - and we log every digit for TCPA records. After-Hours AI listens for confirmation digits during a simultaneous call and SMS to on-call staff (120-second timeout) so the on-call staff member can press 1 to accept the page. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 alignment, $149/$499/$1499 pricing, and 14-day trial, the DTMF parsing layer is shared infrastructure across products.
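The confirmation-digit pattern with a timeout can be sketched with asyncio. This is not CallSphere's implementation; `wait_for_digit` is a hypothetical helper that assumes events shaped like the queue entries above, with a "digit" key.

```python
import asyncio


async def wait_for_digit(queue: asyncio.Queue, expected: str,
                         timeout_s: float = 120.0) -> bool:
    """Wait until `expected` is pressed or the overall timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return False
        try:
            event = await asyncio.wait_for(queue.get(), timeout=remaining)
        except asyncio.TimeoutError:
            return False
        # Ignore other digits and keep waiting within the same deadline.
        if event["digit"] == expected:
            return True
```

Tracking a single deadline, rather than restarting the timer on every key press, keeps stray digits from extending the wait beyond the intended limit.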
Should I support all three DTMF transports or just one? For inbound AI in 2026, accept all three. The cost is one extra parser; the cost of missing a digit is a frustrated user.
Why is RFC 4733 not enough? Some carrier interconnects strip non-default payload types during transcoding; some softphones default to in-band only; some PBXs default to SIP INFO.
Does OpenAI Realtime detect DTMF natively? The model can hear in-band tones and react to them, but timing is less precise than parsed events. For menu logic always parse the event channel.
Is SIP INFO going away? No. RFC 6086 reaffirms it, and Cisco/Avaya/Microsoft Teams continue to use application/dtmf-relay widely.
What about pulse dialing or rotary? Anachronism. Modern PSTN converts pulse to DTMF at the CO; you will never see pulse on IP signaling.
Start a 14-day trial and test DTMF flows live, see pricing, or contact us about menu-driven AI voice flows.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.