Build a Swift iOS Voice Agent with SwiftUI and WebRTC
Native iOS voice agent in Swift using AVFoundation and WebRTC. Real working SwiftUI code for ephemeral key flow, RTCPeerConnection setup, and a live waveform.
TL;DR — Apple ships WebRTC inside WebKit, but for native voice you want the standalone
WebRTC.framework. Pair it with an ephemeral OpenAI Realtime token, and a 200-line SwiftUI app gets you sub-700ms voice on iOS.
What you'll build
A SwiftUI iOS app with one tap-to-talk button that opens a WebRTC RTCPeerConnection to OpenAI Realtime. We set the system instructions over the data channel, render a live audio waveform, and handle background audio session interruptions.
Prerequisites
- Xcode 16+, iOS 17 deployment target.
- WebRTC.framework via SPM: https://github.com/stasel/WebRTC.
- OPENAI_API_KEY on your backend (never in the app).
- NSMicrophoneUsageDescription in Info.plist.
- Familiarity with async/await and AVAudioSession.
Architecture
```mermaid
sequenceDiagram
    participant I as iOS app
    participant K as Your /session endpoint
    participant O as OpenAI Realtime
    I->>K: GET /session (mint ephemeral)
    K-->>I: client_secret
    I->>I: RTCPeerConnection.offer
    I->>O: POST /v1/realtime (SDP, Bearer eph)
    O-->>I: SDP answer
    I<<->>O: Opus + DataChannel events
```
Step 1 — Configure the audio session
```swift
import AVFoundation

func activateAudio() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,
                            options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
    try session.setActive(true)
}
```
Step 2 — Build the peer connection
```swift
import WebRTC
final class RealtimeClient: NSObject {
    private let factory: RTCPeerConnectionFactory = {
        RTCInitializeSSL()
        return RTCPeerConnectionFactory(
            encoderFactory: RTCDefaultVideoEncoderFactory(),
            decoderFactory: RTCDefaultVideoDecoderFactory())
    }()
    var pc: RTCPeerConnection!
    var dc: RTCDataChannel!
    func makeConnection() {
        let cfg = RTCConfiguration()
        cfg.iceServers = [RTCIceServer(urlStrings: ["stun:stun.l.google.com:19302"])]
        cfg.sdpSemantics = .unifiedPlan
        let constraints = RTCMediaConstraints(
            mandatoryConstraints: nil, optionalConstraints: nil)
        pc = factory.peerConnection(with: cfg, constraints: constraints, delegate: self)!
        let audioSrc = factory.audioSource(with: nil)
        let audioTrack = factory.audioTrack(with: audioSrc, trackId: "mic0")
        pc.add(audioTrack, streamIds: ["s0"])
        let dcCfg = RTCDataChannelConfiguration()
        dc = pc.dataChannel(forLabel: "oai-events", configuration: dcCfg)
        dc.delegate = self
    }
}
```
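One thing the snippet above glosses over: the factory call passes delegate: self, so RealtimeClient must also conform to RTCPeerConnectionDelegate or it won't compile. A mostly-empty conformance is enough for this demo — a sketch, with method signatures as they appear in the stasel/WebRTC Swift bindings:

```swift
extension RealtimeClient: RTCPeerConnectionDelegate {
    func peerConnection(_ pc: RTCPeerConnection, didChange newState: RTCIceConnectionState) {
        // .connected and .failed are the states worth surfacing in the UI
    }
    func peerConnection(_ pc: RTCPeerConnection, didChange stateChanged: RTCSignalingState) {}
    func peerConnection(_ pc: RTCPeerConnection, didChange newState: RTCIceGatheringState) {}
    func peerConnection(_ pc: RTCPeerConnection, didGenerate candidate: RTCIceCandidate) {}
    func peerConnection(_ pc: RTCPeerConnection, didRemove candidates: [RTCIceCandidate]) {}
    func peerConnection(_ pc: RTCPeerConnection, didAdd stream: RTCMediaStream) {}
    func peerConnection(_ pc: RTCPeerConnection, didRemove stream: RTCMediaStream) {}
    func peerConnection(_ pc: RTCPeerConnection, didOpen dataChannel: RTCDataChannel) {}
    func peerConnectionShouldNegotiate(_ pc: RTCPeerConnection) {}
}
```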
Step 3 — Mint ephemeral key on your server
```swift
struct Ephemeral: Decodable {
    struct Secret: Decodable { let value: String }
    let client_secret: Secret
}

func fetchKey() async throws -> String {
    let url = URL(string: "https://api.callsphere.ai/voice/session")!
    let (data, _) = try await URLSession.shared.data(from: url)
    return try JSONDecoder().decode(Ephemeral.self, from: data).client_secret.value
}
```
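fetchKey above hits a CallSphere endpoint; on your own backend, the mint step is a single POST to OpenAI's /v1/realtime/sessions endpoint using your long-lived key. A Swift sketch of that request — run this server-side only, and note that mintSessionRequest is our illustrative name, not an SDK function:

```swift
import Foundation

/// Builds the server-side request that mints an ephemeral Realtime key.
/// `apiKey` is your long-lived OPENAI_API_KEY — it must never reach the app.
func mintSessionRequest(apiKey: String) -> URLRequest {
    var req = URLRequest(url: URL(string: "https://api.openai.com/v1/realtime/sessions")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    // Model name taken from the SDP exchange in Step 4
    let body = ["model": "gpt-4o-realtime-preview-2025-06-03", "voice": "alloy"]
    req.httpBody = try? JSONSerialization.data(withJSONObject: body)
    return req
}
```

The JSON response contains client_secret.value; return that string to the app, which is exactly what fetchKey decodes.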
Step 4 — Trade SDP
```swift
func connect() async throws {
    let key = try await fetchKey()
    let constraints = RTCMediaConstraints(
        mandatoryConstraints: ["OfferToReceiveAudio": "true"],
        optionalConstraints: nil)
    let offer: RTCSessionDescription = try await withCheckedThrowingContinuation { c in
        pc.offer(for: constraints) { sdp, err in
            if let sdp = sdp { c.resume(returning: sdp) }
            else { c.resume(throwing: err!) }
        }
    }
    try await withCheckedThrowingContinuation { c in
        pc.setLocalDescription(offer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }

    var req = URLRequest(url: URL(
        string: "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(key)", forHTTPHeaderField: "Authorization")
    req.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
    req.httpBody = offer.sdp.data(using: .utf8)
    let (ans, _) = try await URLSession.shared.data(for: req)
    let answer = RTCSessionDescription(type: .answer,
                                       sdp: String(data: ans, encoding: .utf8)!)
    try await withCheckedThrowingContinuation { c in
        pc.setRemoteDescription(answer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }
}
```
Step 5 — SwiftUI screen
```swift
import SwiftUI

struct VoiceView: View {
    @StateObject var vm = VoiceVM()

    var body: some View {
        VStack(spacing: 24) {
            Text(vm.status).font(.headline)
            WaveformView(level: vm.audioLevel)
                .frame(width: 220, height: 220)
            Button(vm.connected ? "End" : "Talk") {
                Task {
                    if vm.connected { vm.end() } else { await vm.start() }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```
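WaveformView reads vm.audioLevel, but the tutorial never says how that level is computed. A common approach is to take the RMS of each PCM frame, convert to decibels, and map onto 0...1. A minimal sketch — the function name and -60 dB floor are our choices, not part of any API:

```swift
import Foundation

/// Maps a frame of linear PCM samples to a 0...1 level for the waveform.
func normalizedLevel(from samples: [Float], floorDb: Float = -60) -> Float {
    guard !samples.isEmpty else { return 0 }
    // Root-mean-square of the frame
    let rms = sqrt(samples.reduce(0) { $0 + $1 * $1 } / Float(samples.count))
    // Convert to decibels, clamp at the floor, then rescale to 0...1
    let db = 20 * log10(max(rms, .leastNonzeroMagnitude))
    let clamped = max(db, floorDb)
    return (clamped - floorDb) / -floorDb
}
```

Feed this from an AVAudioEngine tap or the WebRTC audio pipeline and publish it on the main actor so SwiftUI can animate the waveform.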
Step 6 — Push session.update on data-channel open
```swift
extension RealtimeClient: RTCDataChannelDelegate {
    func dataChannelDidChangeState(_ ch: RTCDataChannel) {
        guard ch.readyState == .open else { return }
        let payload: [String: Any] = [
            "type": "session.update",
            "session": [
                "instructions": "You are CallSphere's iOS demo agent.",
                "voice": "alloy",
                "turn_detection": ["type": "server_vad"]
            ]
        ]
        let data = try! JSONSerialization.data(withJSONObject: payload)
        ch.sendData(RTCDataBuffer(data: data, isBinary: false))
    }

    // Required by RTCDataChannelDelegate — incoming Realtime events land here.
    func dataChannel(_ ch: RTCDataChannel, didReceiveMessageWith buffer: RTCDataBuffer) {
        // Decode buffer.data and route transcripts / lifecycle events to the UI.
    }
}
```
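Events from the server arrive on the same channel as JSON. A Foundation-only decoder for the one event this demo needs on screen — the struct and handleEvent are illustrative sketches; the event type strings follow the Realtime API:

```swift
import Foundation

/// Minimal envelope for events arriving on the "oai-events" data channel.
struct RealtimeEvent: Decodable {
    let type: String
    let delta: String?   // present on *.delta events such as transcript chunks
}

/// Returns text to append to the on-screen transcript, or nil.
func handleEvent(_ data: Data) -> String? {
    guard let event = try? JSONDecoder().decode(RealtimeEvent.self, from: data) else {
        return nil   // not JSON we recognize; ignore
    }
    switch event.type {
    case "response.audio_transcript.delta":
        return event.delta
    default:
        return nil   // audio itself arrives on the media track, not here
    }
}
```

Call this from dataChannel(_:didReceiveMessageWith:) and publish the result to your view model on the main actor.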
Common pitfalls
- Wrong AVAudioSession mode — .voiceChat is what gives you echo cancellation.
- Not handling AVAudioSession.interruptionNotification — a phone call kills your mic until you reactivate.
- Shipping the API key — always mint ephemeral tokens on your server.
- Forgetting to call RTCInitializeSSL() — silent crash on first connect.
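For the interruption pitfall above, a minimal observer that reactivates the session once the system says playback may resume — where you register it depends on your audio lifecycle:

```swift
import AVFoundation

/// Sketch: reactivate the audio session after a phone call or Siri ends.
func observeInterruptions() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: AVAudioSession.sharedInstance(),
        queue: .main
    ) { note in
        guard
            let info = note.userInfo,
            let raw = info[AVAudioSessionInterruptionTypeKey] as? UInt,
            let type = AVAudioSession.InterruptionType(rawValue: raw)
        else { return }
        if type == .ended {
            // .shouldResume means the system expects audio to continue
            let optRaw = info[AVAudioSessionInterruptionOptionKey] as? UInt ?? 0
            let options = AVAudioSession.InterruptionOptions(rawValue: optRaw)
            if options.contains(.shouldResume) {
                try? AVAudioSession.sharedInstance().setActive(true)
            }
        }
    }
}
```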
How CallSphere does this in production
CallSphere's iOS partner app uses this exact pattern, talking to the same FastAPI :8084 backend that powers our Healthcare HIPAA voice agent. 37 agents, 115+ DB tables, SOC 2 + HIPAA. Try it for 14 days — see /pricing.
FAQ
Can I skip WebRTC and use WebSocket? Yes, but you then own jitter buffering and echo cancellation yourself; WebRTC gives you both for free.
Why ephemeral keys? App-store binaries can be unpacked; long-lived keys leak.
Does CallKit play nice? Yes — set the audio session in your CXProvider delegate.
Background audio? Add the audio background mode capability in Info.plist.
Catalyst / iPad? Same code path — WebRTC.framework is universal.
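For the CallKit answer above, the hook is the provider delegate's didActivate callback — a sketch, where CallManager is a hypothetical owner type in your app:

```swift
import CallKit
import AVFoundation

extension CallManager: CXProviderDelegate {
    func providerDidReset(_ provider: CXProvider) {
        // Tear down the RTCPeerConnection here
    }

    func provider(_ provider: CXProvider, didActivate audioSession: AVAudioSession) {
        // CallKit hands you an already-active session; just set category/mode.
        try? audioSession.setCategory(.playAndRecord, mode: .voiceChat)
    }
}
```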