Skip to content
AI Engineering
AI Engineering10 min read0 views

SIP INFO for DTMF in AI Agent Flows in 2026: When Out-of-Band Beats RTP Events

RFC 4733 RTP events handle most DTMF, but SIP INFO is the workaround when carriers strip telephone-events or when your AI agent needs out-of-band signaling. Here is when to use which in 2026.

The user pressed 4 to confirm. Your AI agent never heard it. Welcome to the DTMF transport problem - a 30-year-old wart that still bites AI voice deployments in 2026.

Background

flowchart LR
  UA[SIP UA] -- REGISTER --> Reg[Registrar]
  UA -- INVITE --> Proxy[SIP Proxy]
  Proxy --> Dispatcher[Kamailio dispatcher]
  Dispatcher --> Worker1[FreeSWITCH worker]
  Dispatcher --> Worker2[FreeSWITCH worker]
  Worker1 --> AI[(AI agent)]
  Worker2 --> AI
CallSphere reference architecture

DTMF (touch-tones) over IP has three transport methods. RFC 4733 (which obsoletes the older RFC 2833) defines telephone-event payloads carried inside the RTP stream as a special payload type. SIP INFO, defined by RFC 2976 and refined for keypad use by RFC 6086, carries the digit as a SIP signaling message outside the media path. In-band DTMF actually plays the audible tone in the audio.

For AI voice agents, the picture is messy. Most US carriers prefer RFC 4733 telephone-events on egress because they are precise and tone-faithful. But carrier-level transcoding can strip the events on transit (PCMU + tel-event mismatch), wholesale resellers sometimes drop the negotiated payload type, and AI bridges that decode RTP straight to PCM may not detect tel-events at all. SIP INFO is the fallback when RTP events do not arrive.

Technical deep-dive

A SIP INFO DTMF message looks like:

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
INFO sip:[email protected] SIP/2.0
Via: SIP/2.0/TLS sbc.twilio.com;branch=z9hG4bK-info-1
Content-Type: application/dtmf-relay
Content-Length: 24

Signal=4
Duration=200

Signal is the digit (0-9, *, #, A-D), Duration is the tone duration in milliseconds. RFC 6086 also defines application/dtmf as a simpler one-line body but application/dtmf-relay is the de-facto standard, originating from Cisco and adopted broadly.

For AI agents the typical event flow is:

  1. Caller presses 4 on their phone
  2. Their device generates an in-band tone or RFC 4733 telephone-event
  3. The carrier hops transcode somewhere along the way
  4. By the time the call hits your AI bridge, the digit may have arrived as: in-band audio (audible tone), RFC 4733 events, or SIP INFO - or all three

A robust AI bridge listens for all three. The OpenAI Realtime API can detect DTMF tones in the audio stream as a server-side feature, but the timing is less precise than RFC 4733 events; for menu-driven flows, a parser on SIP INFO is faster and more reliable.

# FastAPI handler that merges DTMF sources
@app.post("/twilio/webhook/dtmf")
async def handle_dtmf(call_sid: str, digits: str):
    """Twilio sends DTMF as a webhook (its preferred method)."""
    await dtmf_queue.put({"sid": call_sid, "digit": digits, "src": "webhook"})

@app.websocket("/realtime/{call_sid}")
async def media_stream(ws: WebSocket, call_sid: str):
    async for msg in ws.iter_text():
        evt = json.loads(msg)
        if evt.get("event") == "dtmf":
            await dtmf_queue.put({"sid": call_sid, "digit": evt["dtmf"]["digit"], "src": "media"})

The Twilio webhook path is roughly equivalent to SIP INFO out-of-band; the WebSocket media-stream DTMF event is roughly equivalent to RFC 4733. We dedupe in dtmf_queue since both can fire.

CallSphere implementation

CallSphere uses Twilio Programmable Voice across all six verticals. Twilio handles DTMF detection across in-band tone and RFC 4733 events automatically and forwards us either a webhook (TwiML <Gather>) or a Media Streams DTMF event. For Healthcare AI on FastAPI :8084 we accept both paths into a unified queue and feed digits into the OpenAI Realtime conversation as user-content events. Sales Calling AI uses DTMF for opt-out (press 9 to stop) on outbound legs - 5 concurrent per tenant - and we log every digit for TCPA records. After-Hours AI listens for confirmation digits during simul call+SMS to on-call staff (120-second timeout) so the on-call can press 1 to accept the page. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 alignment, $149/$499/$1499 pricing, and 14-day trial, the DTMF parsing layer is shared infrastructure across products.

Implementation steps

  1. Negotiate RFC 4733 telephone-events in your SDP answer (payload type 101 is conventional).
  2. Have your AI bridge subscribe to whichever DTMF event your provider exposes (Twilio webhook + Media Streams).
  3. Add a SIP INFO listener if your provider can pass it through; useful for upstream legs that strip RTP events.
  4. Dedupe digits across sources within a 200 ms window; phones often produce both an in-band tone and an event.
  5. Feed the digit into the AI as a synthetic user message ("[user pressed 4]") so the LLM can react.
  6. Log every digit to your CDR with source and timestamp for TCPA opt-out evidence.
  7. Test on a real cell phone, a real landline, and a softphone; behavior varies wildly.
  8. Set a debounce to avoid double-firing on long key presses.

FAQ

Should I support all three DTMF transports or just one? For inbound AI in 2026, accept all three. The cost is one extra parser; the cost of missing a digit is a frustrated user.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Why is RFC 4733 not enough? Some carrier interconnects strip non-default payload types during transcoding; some softphones default to in-band only; some PBXs default to SIP INFO.

Does OpenAI Realtime detect DTMF natively? The model can hear in-band tones and react to them, but timing is less precise than parsed events. For menu logic always parse the event channel.

Is SIP INFO going away? No. RFC 6086 reaffirms it, and Cisco/Avaya/Microsoft Teams continue to use application/dtmf-relay widely.

What about pulse dialing or rotary? Anachronism. Modern PSTN converts pulse to DTMF at the CO; you will never see pulse on IP signaling.

Sources

Start a 14-day trial and test DTMF flows live, see pricing, or contact us about menu-driven AI voice flows.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

AI Voice Agents

MOS Call Quality Scoring for AI Voice Operations in 2026: Beyond 4.2

MOS 4.3+ is the band where AI voice feels human. Drop below 3.6 and conversations break. Here is how to measure, improve, and alert on MOS in production AI voice using G.711, Opus, and the underlying packet loss / jitter / latency math.

Technical Guides

DTMF Handling for Voice Agents: CallSphere vs Vapi Reliability

DTMF tone capture during agent speech, IVR-style menus, key suppression. How CallSphere handles DTMF via Twilio + custom logic vs Vapi defaults.

AI Engineering

SIP Debugging with sngrep and Wireshark for AI Voice Calls in 2026: The Hands-On Playbook

When your AI voice agent gets one-way audio, missed DTMF, or codec mismatch, sngrep and Wireshark are still the fastest path to root cause in 2026. Here is the playbook.

AI Strategy

State Data Residency for AI Voice in Healthcare — Texas, Nevada, Colorado in 2026

Texas SB 1188 requires US-resident EHRs from January 1, 2026; Nevada's consumer-health-data law constrains health data; Colorado AI Act takes effect June 30, 2026. AI voice agents must architect for state-by-state data localization.

AI Infrastructure

RTP Transcoding Cost for AI Voice in 2026: Why Edge Placement Beats Central GPU

Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute.

AI Infrastructure

Kamailio Dispatcher for AI Voice Scaling in 2026: Round-Robin Is Not Enough

Kamailio 6.0's dispatcher module is how you horizontally scale AI voice bridges behind a SIP front-end. Round-robin is the easy answer; call-load and weight-based dispatching is the right one.