By Sagar Shankaran, Founder of CallSphere
Native iOS voice agent in Swift using AVFoundation and WebRTC. Real working SwiftUI code for ephemeral key flow, RTCPeerConnection setup, and a live waveform.
Key takeaways
TL;DR — Apple ships WebRTC inside WebKit, but for native voice you want the standalone
WebRTC.framework. Pair it with an ephemeral OpenAI Realtime token, and a 200-line SwiftUI app gets you sub-700ms voice on iOS.
You will build a SwiftUI iOS app with one tap-to-talk button that opens a WebRTC RTCPeerConnection to OpenAI Realtime. The app sets the system instructions over the data channel, renders a live audio waveform, and handles background audio session interruptions.
- WebRTC.framework via SPM: https://github.com/stasel/WebRTC
- OPENAI_API_KEY on your backend (never in the app).
- NSMicrophoneUsageDescription in Info.plist.
- async/await and AVAudioSession.

sequenceDiagram
participant I as iOS app
participant K as Your /session endpoint
participant O as OpenAI Realtime
I->>K: GET /session (mint ephemeral)
K-->>I: client_secret
I->>I: RTCPeerConnection.offer
I->>O: POST /v1/realtime (SDP, Bearer eph)
O-->>I: SDP answer
I<->O: Opus + DataChannel events
```swift
import AVFoundation

func activateAudio() throws {
    let session = AVAudioSession.sharedInstance()
    // .voiceChat is the mode that gives you echo cancellation.
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,
                            options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
    try session.setActive(true)
}
```
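The session activated above can be interrupted, for example by an incoming phone call. The following is a minimal sketch of one way to reactivate it when the interruption ends. `InterruptionObserver` and its `onResume` callback are hypothetical names for illustration, not part of the original app; the notification name and user-info keys are the standard AVFoundation ones.

```swift
import AVFoundation

/// Hypothetical helper: reactivates the audio session when an interruption ends.
final class InterruptionObserver {
    private var token: NSObjectProtocol?

    /// `onResume` is an assumed hook where the app would re-enable its mic track.
    init(onResume: @escaping () -> Void) {
        token = NotificationCenter.default.addObserver(
            forName: AVAudioSession.interruptionNotification,
            object: AVAudioSession.sharedInstance(),
            queue: .main
        ) { note in
            // Only act on the end of an interruption; on .began the system
            // has already paused audio for us.
            guard let info = note.userInfo,
                  let raw = info[AVAudioSessionInterruptionTypeKey] as? UInt,
                  let type = AVAudioSession.InterruptionType(rawValue: raw),
                  type == .ended else { return }

            // The system signals whether resuming is appropriate.
            let optRaw = info[AVAudioSessionInterruptionOptionKey] as? UInt ?? 0
            let options = AVAudioSession.InterruptionOptions(rawValue: optRaw)
            guard options.contains(.shouldResume) else { return }

            do {
                try AVAudioSession.sharedInstance().setActive(true)
                onResume()
            } catch {
                print("Could not reactivate audio session: \(error)")
            }
        }
    }

    deinit {
        if let token { NotificationCenter.default.removeObserver(token) }
    }
}
```

Keep a strong reference to the observer for as long as the voice session is alive, for example as a property on the object that owns the peer connection.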
```swift
import WebRTC

// RealtimeClient must also conform to RTCPeerConnectionDelegate
// (conformance not shown here) so that `delegate: self` compiles.
final class RealtimeClient: NSObject {
    private let factory: RTCPeerConnectionFactory = {
        RTCInitializeSSL()  // must run before the first connection
        return RTCPeerConnectionFactory(
            encoderFactory: RTCDefaultVideoEncoderFactory(),
            decoderFactory: RTCDefaultVideoDecoderFactory())
    }()
    var pc: RTCPeerConnection!
    var dc: RTCDataChannel!

    func makeConnection() {
        let cfg = RTCConfiguration()
        cfg.iceServers = [RTCIceServer(urlStrings: ["stun:stun.l.google.com:19302"])]
        cfg.sdpSemantics = .unifiedPlan
        let constraints = RTCMediaConstraints(
            mandatoryConstraints: nil, optionalConstraints: nil)
        pc = factory.peerConnection(with: cfg, constraints: constraints, delegate: self)!

        // Microphone track sent to the model.
        let audioSrc = factory.audioSource(with: nil)
        let audioTrack = factory.audioTrack(with: audioSrc, trackId: "mic0")
        pc.add(audioTrack, streamIds: ["s0"])

        // Data channel that carries Realtime JSON events.
        let dcCfg = RTCDataChannelConfiguration()
        dc = pc.dataChannel(forLabel: "oai-events", configuration: dcCfg)
        dc.delegate = self
    }
}
```
```swift
import Foundation

struct Ephemeral: Decodable {
    struct Secret: Decodable { let value: String }
    let client_secret: Secret
}

// Asks your backend for a short-lived key; the long-lived key never leaves the server.
func fetchKey() async throws -> String {
    let url = URL(string: "https://api.callsphere.ai/voice/session")!
    let (data, _) = try await URLSession.shared.data(from: url)
    return try JSONDecoder().decode(Ephemeral.self, from: data).client_secret.value
}
```
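The fetch above ignores the HTTP status and any expiry on the key. A minimal sketch of stricter parsing follows. It assumes the backend forwards an `expires_at` field (Unix seconds) inside `client_secret`; that field, along with the names `EphemeralKey`, `KeyError`, and `parseKey`, is hypothetical and not part of the original code.

```swift
import Foundation

// Assumed response shape:
// {"client_secret": {"value": "...", "expires_at": <unix seconds>}}
// `expires_at` is an assumption; it is optional here so older payloads still decode.
struct EphemeralKey: Decodable {
    struct Secret: Decodable {
        let value: String
        let expires_at: TimeInterval?
    }
    let client_secret: Secret
}

enum KeyError: Error, Equatable {
    case badStatus(Int)
    case expired
}

/// Validates the HTTP status and, if present, the expiry before returning the key.
func parseKey(data: Data, statusCode: Int, now: Date = Date()) throws -> String {
    guard (200..<300).contains(statusCode) else {
        throw KeyError.badStatus(statusCode)
    }
    let secret = try JSONDecoder().decode(EphemeralKey.self, from: data).client_secret
    if let exp = secret.expires_at, Date(timeIntervalSince1970: exp) <= now {
        throw KeyError.expired
    }
    return secret.value
}
```

To use it, capture the response from `URLSession.shared.data(from:)` and pass `(response as? HTTPURLResponse)?.statusCode ?? 0` as `statusCode`.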
```swift
extension RealtimeClient {
    func connect() async throws {
        let key = try await fetchKey()
        let constraints = RTCMediaConstraints(
            mandatoryConstraints: ["OfferToReceiveAudio": "true"],
            optionalConstraints: nil)

        // Create the SDP offer and apply it as the local description.
        let offer: RTCSessionDescription = try await withCheckedThrowingContinuation { c in
            pc.offer(for: constraints) { sdp, err in
                if let sdp = sdp {
                    c.resume(returning: sdp)
                } else {
                    c.resume(throwing: err ?? URLError(.unknown))
                }
            }
        }
        try await withCheckedThrowingContinuation { (c: CheckedContinuation<Void, Error>) in
            pc.setLocalDescription(offer) { e in
                if let e = e { c.resume(throwing: e) } else { c.resume() }
            }
        }

        // POST the offer to OpenAI Realtime, authenticated with the ephemeral key.
        var req = URLRequest(url: URL(
            string: "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03")!)
        req.httpMethod = "POST"
        req.setValue("Bearer \(key)", forHTTPHeaderField: "Authorization")
        req.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
        req.httpBody = offer.sdp.data(using: .utf8)
        let (ans, _) = try await URLSession.shared.data(for: req)

        // The response body is the SDP answer as plain text.
        guard let answerSDP = String(data: ans, encoding: .utf8) else {
            throw URLError(.cannotDecodeContentData)
        }
        let answer = RTCSessionDescription(type: .answer, sdp: answerSDP)
        try await withCheckedThrowingContinuation { (c: CheckedContinuation<Void, Error>) in
            pc.setRemoteDescription(answer) { e in
                if let e = e { c.resume(throwing: e) } else { c.resume() }
            }
        }
    }
}
```
```swift
import SwiftUI

// VoiceVM and WaveformView are defined elsewhere in the app.
struct VoiceView: View {
    @StateObject var vm = VoiceVM()

    var body: some View {
        VStack(spacing: 24) {
            Text(vm.status).font(.headline)
            WaveformView(level: vm.audioLevel)
                .frame(width: 220, height: 220)
            Button(vm.connected ? "End" : "Talk") {
                Task {
                    if vm.connected { vm.end() } else { await vm.start() }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```
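The view reads `vm.audioLevel`, but the article does not show how that value is computed. One plausible approach, sketched here under assumptions, is to take the RMS of a buffer of PCM samples and map it onto 0...1 on a decibel scale. The function name, the `-60` dB floor, and the idea that samples arrive as `Float` values in -1...1 are all illustrative choices, not the author's implementation.

```swift
import Foundation

/// Maps PCM samples in -1...1 to a 0...1 level suitable for a waveform view.
/// `floorDB` is the level treated as silence; -60 dB is an arbitrary choice.
func audioLevel(samples: [Float], floorDB: Float = -60) -> Float {
    guard !samples.isEmpty else { return 0 }

    // Root mean square of the buffer.
    let meanSquare = samples.reduce(0) { $0 + $1 * $1 } / Float(samples.count)
    let rms = meanSquare.squareRoot()
    guard rms > 0 else { return 0 }

    // Convert to decibels relative to full scale, then normalize to 0...1.
    let db = 20 * log10(rms)
    return min(1, max(0, (db - floorDB) / -floorDB))
}
```

A view model could call this for each captured buffer and publish the result on the main actor. Where the samples come from (an audio tap, WebRTC stats, or something else) is left open here.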
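```swift
extension RealtimeClient: RTCDataChannelDelegate {
    // Once the channel opens, send the session configuration.
    func dataChannelDidChangeState(_ ch: RTCDataChannel) {
        guard ch.readyState == .open else { return }
        let payload: [String: Any] = [
            "type": "session.update",
            "session": [
                "instructions": "You are CallSphere's iOS demo agent.",
                "voice": "alloy",
                "turn_detection": ["type": "server_vad"]
            ]
        ]
        guard let data = try? JSONSerialization.data(withJSONObject: payload) else { return }
        ch.sendData(RTCDataBuffer(data: data, isBinary: false))
    }

    // Required by RTCDataChannelDelegate; server events arrive here.
    func dataChannel(_ ch: RTCDataChannel, didReceiveMessageWith buffer: RTCDataBuffer) {}
}
```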
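Events also flow back from the model over the same data channel, via the delegate's `dataChannel(_:didReceiveMessageWith:)` method. A minimal sketch of decoding them is shown below. It assumes each message is a JSON object with a `type` field; the `ServerEvent` struct, the optional `delta` field, and the `decodeEvent` helper are illustrative names, not something the article defines.

```swift
import Foundation

// Minimal envelope for an incoming event. Only `type` is relied on;
// `delta` is an assumed optional field carried by some streaming events.
struct ServerEvent: Decodable {
    let type: String
    let delta: String?
}

/// Decodes one data-channel message; returns nil if it is not a valid event.
func decodeEvent(_ data: Data) -> ServerEvent? {
    try? JSONDecoder().decode(ServerEvent.self, from: data)
}
```

Inside the delegate method you could call `decodeEvent(buffer.data)` and switch on `event.type` to update the UI, ignoring event types you do not handle.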
- .voiceChat is what gives you echo cancellation.
- AVAudioSession.interruptionNotification: a phone call kills your mic until you reactivate the audio session.
- RTCInitializeSSL(): skipping it causes a silent crash on first connect.

CallSphere's iOS partner app uses this exact pattern, talking to the same FastAPI :8084 backend that powers our Healthcare HIPAA voice agent. 37 agents, 115+ DB tables, SOC 2 + HIPAA. Try it for 14 days; see /pricing.
Can I skip WebRTC and use WebSocket? Yes, but you then have to handle jitter buffering and acoustic echo cancellation (AEC) yourself, which is much harder.
Why ephemeral keys? App-store binaries can be unpacked; long-lived keys leak.
Does CallKit play nice? Yes — set the audio session in your CXProvider delegate.
Background audio? Add the audio background mode capability in Info.plist.
Catalyst / iPad? Same code path — WebRTC.framework is universal.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.