---
title: "Build a Swift iOS Voice Agent with SwiftUI and WebRTC"
description: "Native iOS voice agent in Swift using AVFoundation and WebRTC. Real working SwiftUI code for ephemeral key flow, RTCPeerConnection setup, and a live waveform."
canonical: https://callsphere.ai/blog/vw2h-build-swift-ios-voice-agent-swiftui-webrtc
category: "AI Voice Agents"
tags: ["Tutorial", "Build", "Swift", "iOS", "SwiftUI", "WebRTC"]
author: "CallSphere Team"
published: 2026-03-29T00:00:00.000Z
updated: 2026-05-07T09:27:40.098Z
---

# Build a Swift iOS Voice Agent with SwiftUI and WebRTC

> Native iOS voice agent in Swift using AVFoundation and WebRTC. Real working SwiftUI code for ephemeral key flow, RTCPeerConnection setup, and a live waveform.

> **TL;DR** — Apple ships WebRTC inside WebKit, but for native voice you want the standalone `WebRTC.framework`. Pair it with an ephemeral OpenAI Realtime token, and a 200-line SwiftUI app gets you sub-700ms voice on iOS.

## What you'll build

A SwiftUI iOS app with one tap-to-talk button that opens a WebRTC `RTCPeerConnection` to OpenAI Realtime. We set the system instructions over the data channel, render a live audio waveform, and handle background audio session interruptions.

## Prerequisites

1. Xcode 16+, iOS 17 deployment target.
2. `WebRTC.framework` via SPM: `https://github.com/stasel/WebRTC`.
3. `OPENAI_API_KEY` on your backend (never in the app).
4. `NSMicrophoneUsageDescription` in `Info.plist`.
5. Familiarity with `async/await` and `AVAudioSession`.
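If you'd rather declare the WebRTC dependency in a `Package.swift` than through Xcode's package UI, the entry looks roughly like this (the version pin and target name are illustrative; check the repo's releases for the tag you actually want):

```swift
// Package.swift (fragment) — version number is illustrative
dependencies: [
    .package(url: "https://github.com/stasel/WebRTC", from: "120.0.0")
],
targets: [
    .target(name: "VoiceAgent",
            dependencies: [.product(name: "WebRTC", package: "WebRTC")])
]
```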

## Architecture

```mermaid
sequenceDiagram
  participant I as iOS app
  participant K as Your /session endpoint
  participant O as OpenAI Realtime
  I->>K: GET /session (mint ephemeral)
  K-->>I: client_secret
  I->>I: RTCPeerConnection.offer
  I->>O: POST /v1/realtime (SDP, Bearer eph)
  O-->>I: SDP answer
  I->>O: Opus + DataChannel events (bidirectional)
```

## Step 1 — Configure the audio session

```swift
import AVFoundation

func activateAudio() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
        mode: .voiceChat,
        options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
    try session.setActive(true)
}
```
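A phone call, Siri, or an alarm will interrupt this session and silence your mic until you reactivate it. A minimal observer looks like this (a sketch: production code should also inspect the `.shouldResume` option before reactivating):

```swift
import AVFoundation

// Sketch: reactivate the audio session once an interruption ends.
// Call this once, alongside activateAudio().
func observeInterruptions() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: AVAudioSession.sharedInstance(),
        queue: .main
    ) { note in
        guard let raw = note.userInfo?[AVAudioSessionInterruptionTypeKey] as? UInt,
              AVAudioSession.InterruptionType(rawValue: raw) == .ended
        else { return }
        // The interruption is over; bring the mic back.
        try? AVAudioSession.sharedInstance().setActive(true)
    }
}
```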

## Step 2 — Build the peer connection

```swift
import WebRTC

final class RealtimeClient: NSObject {
    private let factory: RTCPeerConnectionFactory = {
        RTCInitializeSSL()
        return RTCPeerConnectionFactory(
            encoderFactory: RTCDefaultVideoEncoderFactory(),
            decoderFactory: RTCDefaultVideoDecoderFactory())
    }()
    var pc: RTCPeerConnection!
    var dc: RTCDataChannel!

    func makeConnection() {
        let cfg = RTCConfiguration()
        cfg.iceServers = [RTCIceServer(urlStrings: ["stun:stun.l.google.com:19302"])]
        cfg.sdpSemantics = .unifiedPlan
        let constraints = RTCMediaConstraints(
            mandatoryConstraints: nil, optionalConstraints: nil)
        // RealtimeClient must also conform to RTCPeerConnectionDelegate.
        pc = factory.peerConnection(with: cfg, constraints: constraints, delegate: self)!

        // Capture the mic as a single audio track.
        let audioSrc = factory.audioSource(with: nil)
        let audioTrack = factory.audioTrack(with: audioSrc, trackId: "mic0")
        pc.add(audioTrack, streamIds: ["s0"])

        // "oai-events" is the channel label OpenAI Realtime uses for JSON events.
        let dcCfg = RTCDataChannelConfiguration()
        dc = pc.dataChannel(forLabel: "oai-events", configuration: dcCfg)
        dc.delegate = self
    }
}
```

## Step 3 — Fetch an ephemeral key minted on your server

```swift
struct Ephemeral: Decodable {
    struct Secret: Decodable { let value: String }
    let client_secret: Secret
}

func fetchKey() async throws -> String {
    let url = URL(string: "https://api.callsphere.ai/voice/session")!
    let (data, _) = try await URLSession.shared.data(from: url)
    return try JSONDecoder().decode(Ephemeral.self, from: data).client_secret.value
}
```
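Before wiring up the network call, you can sanity-check the decoder against the JSON shape the session endpoint returns. The struct is repeated here so the snippet stands alone, and the sample key value is made up:

```swift
import Foundation

// Same shape as the Ephemeral struct above, repeated so this compiles alone.
struct Ephemeral: Decodable {
    struct Secret: Decodable { let value: String }
    let client_secret: Secret
}

// Made-up sample payload mirroring { "client_secret": { "value": ... } }.
let sample = #"{"client_secret":{"value":"ek_test_123"}}"#
let key = try! JSONDecoder().decode(
    Ephemeral.self, from: Data(sample.utf8)).client_secret.value
// key == "ek_test_123"
```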

## Step 4 — Trade SDP

```swift
func connect() async throws {
    let key = try await fetchKey()
    let constraints = RTCMediaConstraints(
        mandatoryConstraints: ["OfferToReceiveAudio":"true"], optionalConstraints: nil)
    let offer: RTCSessionDescription = try await withCheckedThrowingContinuation { c in
        pc.offer(for: constraints) { sdp, err in
            if let sdp = sdp { c.resume(returning: sdp) }
            else { c.resume(throwing: err!) }
        }
    }
    try await withCheckedThrowingContinuation { (c: CheckedContinuation<Void, Error>) in
        pc.setLocalDescription(offer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }

    var req = URLRequest(url: URL(
        string: "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(key)", forHTTPHeaderField: "Authorization")
    req.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
    req.httpBody = offer.sdp.data(using: .utf8)

    // POST our offer SDP; the response body is the answer SDP as plain text.
    let (ans, _) = try await URLSession.shared.data(for: req)
    let answer = RTCSessionDescription(type: .answer,
        sdp: String(data: ans, encoding: .utf8)!)
    try await withCheckedThrowingContinuation { (c: CheckedContinuation<Void, Error>) in
        pc.setRemoteDescription(answer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }
}
```

## Step 5 — SwiftUI screen

```swift
struct VoiceView: View {
    @StateObject var vm = VoiceVM()
    var body: some View {
        VStack(spacing: 24) {
            Text(vm.status).font(.headline)
            WaveformView(level: vm.audioLevel)
                .frame(width: 220, height: 220)
            Button(vm.connected ? "End" : "Talk") {
                Task {
                    if vm.connected { vm.end() } else { await vm.start() }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```
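`WaveformView` is left to you; the only non-obvious part is turning a linear RMS amplitude into the 0-to-1 `level` it renders. A framework-free sketch of that mapping (the function name and the -60 dB floor are our choices, not an API):

```swift
import Foundation

// Map a linear RMS value to 0...1 on a decibel scale, flooring silence
// at -60 dB so quiet rooms don't pin the waveform at zero-but-jittery.
func waveformLevel(rms: Double, floorDb: Double = -60) -> Double {
    guard rms > 0 else { return 0 }
    let db = 20 * log10(rms)                       // linear amplitude → dB
    let clamped = max(db, floorDb)                 // clip below the floor
    return min(1, (clamped - floorDb) / -floorDb)  // rescale [floorDb, 0] → [0, 1]
}
// waveformLevel(rms: 1.0) == 1, waveformLevel(rms: 0.1) ≈ 0.67
```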

## Step 6 — Push session.update on data-channel open

```swift
extension RealtimeClient: RTCDataChannelDelegate {
    func dataChannelDidChangeState(_ ch: RTCDataChannel) {
        guard ch.readyState == .open else { return }
        let payload: [String: Any] = [
            "type": "session.update",
            "session": [
                "instructions":"You are CallSphere's iOS demo agent.",
                "voice":"alloy",
                "turn_detection":["type":"server_vad"]
            ]
        ]
        let data = try! JSONSerialization.data(withJSONObject: payload)
        ch.sendData(RTCDataBuffer(data: data, isBinary: false))
    }
}
```
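If `try!` plus stringly-typed dictionaries makes you nervous, the same payload can be built with `Codable` so the compiler checks the field names. A sketch of the equivalent structure:

```swift
import Foundation

// Codable alternative to the JSONSerialization payload above:
// same wire shape, compiler-checked field names.
struct SessionUpdate: Codable {
    struct TurnDetection: Codable { let type: String }
    struct Session: Codable {
        let instructions: String
        let voice: String
        let turn_detection: TurnDetection
    }
    let type: String
    let session: Session
}

let update = SessionUpdate(
    type: "session.update",
    session: .init(
        instructions: "You are CallSphere's iOS demo agent.",
        voice: "alloy",
        turn_detection: .init(type: "server_vad")))
let data = try! JSONEncoder().encode(update)
```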

## Common pitfalls

- **Wrong AVAudioSession mode** — `.voiceChat` is what gives you echo cancellation.
- **Not handling `AVAudioSession.interruptionNotification`** — phone call kills your mic until you reactivate.
- **Shipping the API key** — always mint ephemeral on your server.
- **Forgetting to call `RTCInitializeSSL()`** — silent crash on first connect.

## How CallSphere does this in production

CallSphere's iOS partner app uses this exact pattern, talking to the same FastAPI :8084 backend that powers our [Healthcare](/lp/healthcare) HIPAA voice agent. 37 agents, 115+ DB tables, SOC 2 + HIPAA. Try it for 14 days — see [/pricing](/pricing).

## FAQ

**Can I skip WebRTC and use WebSocket?** Yes, but then you own jitter buffering and echo cancellation yourself; WebRTC gives you both for free.

**Why ephemeral keys?** App-store binaries can be unpacked; long-lived keys leak.

**Does CallKit play nice?** Yes — set the audio session in your CXProvider delegate.

**Background audio?** Add the `audio` background mode capability in Info.plist.
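That capability is a one-key `Info.plist` fragment:

```xml
<key>UIBackgroundModes</key>
<array>
  <string>audio</string>
</array>
```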

**Catalyst / iPad?** Same code path — WebRTC.framework is universal.

## Sources

- [OpenAI Realtime WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc)
- [PallavAg/VoiceModeWebRTCSwift](https://github.com/PallavAg/VoiceModeWebRTCSwift)
- [m1guelpf/swift-realtime-openai](https://github.com/m1guelpf/swift-realtime-openai)
- [stasel/WebRTC SPM package](https://github.com/stasel/WebRTC)

---

Source: https://callsphere.ai/blog/vw2h-build-swift-ios-voice-agent-swiftui-webrtc
