Build a Voice Agent in Go with Pion WebRTC and OpenAI Realtime
Wire Pion WebRTC in pure Go to OpenAI Realtime over a single PeerConnection. Real working code for SDP exchange, Opus tracks, data channel events, and barge-in.
TL;DR — OpenAI's own infra runs on Pion, so you can hit the WebRTC endpoint directly from Go with zero CGo. One
PeerConnection, one Opus track, one data channel — that's the entire agent.
What you'll build
A standalone Go binary that opens a WebRTC PeerConnection to https://api.openai.com/v1/realtime, attaches a microphone Opus track, and prints model events from the data channel. Total latency on a US east-coast box lands around 480ms, and the Go + Pion client itself adds little overhead on top of network and model time.
Prerequisites
- Go 1.23+ and a working pion/webrtc/v4 install.
- OpenAI API key with Realtime access.
- An ephemeral key endpoint (do not ship the raw key in your binary).
- go get github.com/pion/webrtc/v4 and github.com/pion/mediadevices for mic capture.
- Familiarity with SDP offer/answer and ICE.
Architecture
sequenceDiagram
participant G as Go binary
participant K as Your /session endpoint
participant O as OpenAI Realtime
G->>K: POST /session (mint ephemeral key)
K-->>G: { client_secret.value }
G->>G: pc.CreateOffer() -> SDP
G->>O: POST /v1/realtime?model=... (SDP, Bearer eph)
O-->>G: SDP answer
G->>O: ICE + DTLS handshake
G->>O: Opus mic track
O-->>G: Opus TTS track + DC events
Step 1 — Mint an ephemeral key
Never ship a long-lived OpenAI key inside a desktop binary. Stand up a tiny HTTPS endpoint that calls /v1/realtime/sessions with your real key and returns the 60-second client_secret.
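```go
import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

type session struct {
	ClientSecret struct {
		Value string `json:"value"`
	} `json:"client_secret"`
}

// mintEphemeral exchanges the long-lived API key for a short-lived client secret.
func mintEphemeral(ctx context.Context) (string, error) {
	body := `{"model":"gpt-4o-realtime-preview-2025-06-03","voice":"alloy"}`
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://api.openai.com/v1/realtime/sessions", strings.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// A non-200 body will not contain client_secret, so fail early.
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("mint ephemeral key: unexpected status %s", resp.Status)
	}

	var s session
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return "", err
	}
	return s.ClientSecret.Value, nil
}
```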
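The "tiny HTTPS endpoint" itself can be sketched with the standard library alone. The block below is a minimal sketch, not a hardened service: the names sessionsURL, newMintRequest, and sessionHandler are hypothetical, the model and voice simply mirror the request above, and authentication of your own users is left out entirely.

```go
package main

import (
	"io"
	"net/http"
	"strings"
)

// sessionsURL is the same upstream endpoint mintEphemeral calls.
const sessionsURL = "https://api.openai.com/v1/realtime/sessions"

// newMintRequest builds the upstream request that trades the long-lived key
// for a short-lived client_secret. Hypothetical helper for this sketch.
func newMintRequest(apiKey string) (*http.Request, error) {
	body := `{"model":"gpt-4o-realtime-preview-2025-06-03","voice":"alloy"}`
	req, err := http.NewRequest(http.MethodPost, sessionsURL, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// sessionHandler is a hypothetical /session endpoint: it mints a key with the
// server-side API key and relays OpenAI's JSON response unchanged.
func sessionHandler(apiKey string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		req, err := newMintRequest(apiKey)
		if err != nil {
			http.Error(w, "bad upstream request", http.StatusInternalServerError)
			return
		}
		// Tie the upstream call to the caller's request lifetime.
		resp, err := http.DefaultClient.Do(req.WithContext(r.Context()))
		if err != nil {
			http.Error(w, "upstream unreachable", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		// Relay status and body so the client reads client_secret.value as-is.
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(resp.StatusCode)
		_, _ = io.Copy(w, resp.Body)
	}
}
```

Mounting it is one line, for example http.Handle("/session", sessionHandler(os.Getenv("OPENAI_API_KEY"))), ideally behind whatever auth your app already uses so strangers cannot mint keys on your account.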
Step 2 — Build the PeerConnection
```go
import "github.com/pion/webrtc/v4"

config := webrtc.Configuration{
	ICEServers: []webrtc.ICEServer{{URLs: []string{"stun:stun.l.google.com:19302"}}},
}
pc, err := webrtc.NewPeerConnection(config)
if err != nil {
	log.Fatal(err)
}
// Audio transceiver for the model's speech. It is declared sendrecv;
// OpenAI sends its TTS audio back on it.
if _, err := pc.AddTransceiverFromKind(webrtc.RTPCodecTypeAudio,
	webrtc.RTPTransceiverInit{Direction: webrtc.RTPTransceiverDirectionSendrecv}); err != nil {
	log.Fatal(err)
}

// The data channel carries JSON events in both directions.
dc, err := pc.CreateDataChannel("oai-events", nil)
if err != nil {
	log.Fatal(err)
}
dc.OnMessage(func(m webrtc.DataChannelMessage) {
	fmt.Println("event:", string(m.Data))
})
```
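Printing raw JSON gets noisy fast, so it can help to decode just the envelope of each event. The sketch below assumes the event names "error" and "response.audio_transcript.delta" as used by the preview Realtime API; the event struct and describe function are hypothetical helpers, and any event type they do not recognise is passed through by name.

```go
package main

import "encoding/json"

// event holds only the fields this sketch reads; the rest of the payload
// is ignored by the decoder.
type event struct {
	Type  string `json:"type"`
	Delta string `json:"delta"`
	Error *struct {
		Message string `json:"message"`
	} `json:"error"`
}

// describe turns one raw data channel message into a short log line.
func describe(raw []byte) string {
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "unparseable event: " + err.Error()
	}
	switch {
	case ev.Type == "error" && ev.Error != nil:
		return "error: " + ev.Error.Message
	case ev.Type == "response.audio_transcript.delta":
		// Incremental transcript of what the model is saying.
		return "assistant: " + ev.Delta
	default:
		return "event: " + ev.Type
	}
}
```

Inside the OnMessage callback this would replace the Println call with fmt.Println(describe(m.Data)).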
Step 3 — Capture mic and add the track
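Use mediadevices to wrap a host mic into an Opus-encoded track. Pion will negotiate the codec automatically. Note that the mediadevices Opus encoder and microphone driver bind to native libraries through CGo, so only the Pion networking side of this binary is pure Go: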
```go
import (
	"github.com/pion/mediadevices"
	"github.com/pion/mediadevices/pkg/codec/opus"
	_ "github.com/pion/mediadevices/pkg/driver/microphone" // registers the mic driver
)

opusParams, err := opus.NewParams()
if err != nil {
	log.Fatal(err)
}
codecSelector := mediadevices.NewCodecSelector(
	mediadevices.WithAudioEncoders(&opusParams),
)

// Empty Audio constraints accept the default input device.
ms, err := mediadevices.GetUserMedia(mediadevices.MediaStreamConstraints{
	Audio: func(c *mediadevices.MediaTrackConstraints) {},
	Codec: codecSelector,
})
if err != nil {
	log.Fatal(err)
}

// Attach each mic track as a send-only transceiver.
for _, t := range ms.GetAudioTracks() {
	if _, err := pc.AddTransceiverFromTrack(t.(webrtc.TrackLocal),
		webrtc.RTPTransceiverInit{Direction: webrtc.RTPTransceiverDirectionSendonly}); err != nil {
		log.Fatal(err)
	}
}
```
Step 4 — Trade SDP with OpenAI
This is the only OpenAI-specific bit: POST your SDP offer as application/sdp and you get the answer back as plain text.
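```go
// must aborts on any setup error; swap in real error handling for production.
must := func(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

offer, err := pc.CreateOffer(nil)
must(err)
must(pc.SetLocalDescription(offer))
<-webrtc.GatheringCompletePromise(pc) // wait until the SDP holds all ICE candidates

eph, err := mintEphemeral(ctx)
must(err)

url := "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url,
	strings.NewReader(pc.LocalDescription().SDP))
must(err)
req.Header.Set("Authorization", "Bearer "+eph)
req.Header.Set("Content-Type", "application/sdp")

resp, err := http.DefaultClient.Do(req)
must(err)
defer resp.Body.Close()
ans, err := io.ReadAll(resp.Body)
must(err)
if resp.StatusCode >= 300 {
	log.Fatalf("SDP exchange failed: %s: %s", resp.Status, ans)
}
must(pc.SetRemoteDescription(webrtc.SessionDescription{
	Type: webrtc.SDPTypeAnswer,
	SDP:  string(ans),
}))
```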
Step 5 — Send a session.update over the data channel
Once the channel is open, push your system prompt and turn-detection config:
```go
dc.OnOpen(func() {
	payload, err := json.Marshal(map[string]any{
		"type": "session.update",
		"session": map[string]any{
			"instructions":              "You are CallSphere, a friendly receptionist.",
			"voice":                     "alloy",
			"turn_detection":            map[string]any{"type": "server_vad", "threshold": 0.5},
			"input_audio_transcription": map[string]any{"model": "whisper-1"},
		},
	})
	if err != nil {
		log.Printf("marshal session.update: %v", err)
		return
	}
	if err := dc.SendText(string(payload)); err != nil {
		log.Printf("send session.update: %v", err)
	}
})
```
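If you would rather have the compiler catch a mistyped key, the same payload can be built from structs. This is only a sketch: the type names and newSessionUpdate are hypothetical, and the fields are limited to the ones the map version above already sends.

```go
package main

import "encoding/json"

// turnDetection mirrors the turn_detection object sent above.
type turnDetection struct {
	Type      string  `json:"type"`
	Threshold float64 `json:"threshold"`
}

// transcription mirrors the input_audio_transcription object.
type transcription struct {
	Model string `json:"model"`
}

// sessionConfig holds the session fields this article sets.
type sessionConfig struct {
	Instructions            string         `json:"instructions"`
	Voice                   string         `json:"voice"`
	TurnDetection           *turnDetection `json:"turn_detection,omitempty"`
	InputAudioTranscription *transcription `json:"input_audio_transcription,omitempty"`
}

// sessionUpdate is the client event envelope.
type sessionUpdate struct {
	Type    string        `json:"type"`
	Session sessionConfig `json:"session"`
}

// newSessionUpdate returns the JSON text to pass to dc.SendText.
func newSessionUpdate(instructions string) (string, error) {
	b, err := json.Marshal(sessionUpdate{
		Type: "session.update",
		Session: sessionConfig{
			Instructions:            instructions,
			Voice:                   "alloy",
			TurnDetection:           &turnDetection{Type: "server_vad", Threshold: 0.5},
			InputAudioTranscription: &transcription{Model: "whisper-1"},
		},
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

The trade-off is a little more boilerplate in exchange for field names that are checked once, in one place.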
Step 6 — Play remote audio
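Subscribe to the inbound track and pipe it to your speaker. The track delivers RTP packets whose payload is Opus, so the sink has to decode Opus to PCM and hand it to an audio output library. mediadevices covers capture, so sink below is a placeholder for whatever playback path you choose:
```go
pc.OnTrack(func(t *webrtc.TrackRemote, _ *webrtc.RTPReceiver) {
	log.Printf("got remote %s track", t.Kind())
	for {
		// ReadRTP parses the packet; pkt.Payload is the Opus data without the RTP header.
		pkt, _, err := t.ReadRTP()
		if err != nil {
			return
		}
		// sink is a placeholder io.Writer: your Opus decoder plus audio output.
		if _, err := sink.Write(pkt.Payload); err != nil {
			return
		}
	}
})
```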
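Barge-in means cutting the agent off when the caller starts talking over it. One way to sketch the client side, assuming the server events response.created, response.done, and input_audio_buffer.speech_started and the client event response.cancel: track whether a response is in flight, and when speech starts during one, cancel it and flush local playback. The playback type and onEvent method are hypothetical, and with server VAD the server may already interrupt the response on its own, so treat this as belt and braces.

```go
package main

// playback tracks whether the agent currently has a response in flight.
type playback struct {
	speaking bool
}

// onEvent consumes one server event type. It returns a client event to send
// on the data channel (empty if none) and whether to flush buffered audio.
func (p *playback) onEvent(eventType string) (send string, flush bool) {
	switch eventType {
	case "response.created":
		p.speaking = true
	case "response.done":
		p.speaking = false
	case "input_audio_buffer.speech_started":
		if p.speaking {
			// The caller talked over the agent: cancel and drop queued audio.
			p.speaking = false
			return `{"type":"response.cancel"}`, true
		}
	}
	return "", false
}
```

In the data channel callback you would decode the event type, call onEvent, send the returned string with dc.SendText when it is non-empty, and clear your audio sink's buffer when flush is true.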
Common pitfalls
- Forgetting GatheringCompletePromise. OpenAI rejects half-baked SDP. Wait for ICE gathering before POSTing.
- Long-lived API key in the binary. Always mint ephemeral keys server-side.
- Wrong codec. Force Opus on both sides; Pion will fall back to PCMU otherwise.
- No OnICEConnectionStateChange handler. You'll fly blind on transient drops.
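The first and third pitfalls can be caught before the POST with a cheap text check on the offer. The sketch below is an assumption-laden sanity check, not an SDP parser: checkOffer is a hypothetical helper that only looks for an audio section, an Opus rtpmap line, and at least one ICE candidate.

```go
package main

import (
	"errors"
	"strings"
)

// checkOffer reports the first obvious problem in a local SDP offer.
func checkOffer(sdp string) error {
	var audio, opus, candidate bool
	for _, line := range strings.Split(sdp, "\n") {
		line = strings.TrimSpace(line) // SDP lines end in CRLF
		switch {
		case strings.HasPrefix(line, "m=audio"):
			audio = true
		case strings.HasPrefix(line, "a=rtpmap:") &&
			strings.Contains(strings.ToLower(line), "opus/48000"):
			opus = true
		case strings.HasPrefix(line, "a=candidate:"):
			candidate = true
		}
	}
	switch {
	case !audio:
		return errors.New("offer has no audio section")
	case !opus:
		return errors.New("offer does not list Opus")
	case !candidate:
		return errors.New("offer has no ICE candidates; wait for gathering to complete")
	}
	return nil
}
```

Run it on pc.LocalDescription().SDP after gathering completes, since the value returned by CreateOffer does not yet carry the gathered candidates.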
How CallSphere does this in production
CallSphere's real-estate agent OneRoof runs a Pion-based Go gateway at the edge. Each call gets its own PeerConnection, NATS hands the audio frames off to a transcription worker, and Postgres stores the run. Across 6 verticals and 37 agents we see 480–620ms p50 voice latency. Try it on the 14-day trial or book a live demo.
FAQ
Why Pion over Janus or LiveKit for this? Single binary, no media server, no Docker — perfect for a per-call sidecar.
Does it work behind NAT? Yes, with Google STUN. Add a TURN server for symmetric NAT users.
Can I run this on Fly.io? Yes, but pin to one region per call — WebRTC sessions are stateful.
What about Whisper transcription? Add input_audio_transcription in session.update; deltas arrive on the data channel.
How do I scale? One pod per N concurrent calls; OpenAI's quota dominates, not Pion.
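For the "N concurrent calls per pod" answer, a buffered channel is enough to enforce the cap. This is a minimal sketch with hypothetical names (callLimiter, tryAcquire, release); picking N is up to your own load testing.

```go
package main

// callLimiter caps concurrent calls; each buffered slot is one active call.
type callLimiter chan struct{}

// newCallLimiter allows at most max calls at once.
func newCallLimiter(max int) callLimiter {
	return make(callLimiter, max)
}

// tryAcquire reserves a slot without blocking; false means the pod is full.
func (l callLimiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot; it is a no-op if nothing is held.
func (l callLimiter) release() {
	select {
	case <-l:
	default:
	}
}
```

A gateway would call tryAcquire before creating a PeerConnection, reject or redirect the call when it returns false, and defer release once the call ends.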