Build a Voice Agent in Go with Pion WebRTC and OpenAI Realtime

Wire Pion WebRTC in pure Go to OpenAI Realtime over a single PeerConnection. Real working code for SDP exchange, Opus tracks, data channel events, and barge-in.

TL;DR — OpenAI's own infra runs on Pion, so you can hit the WebRTC endpoint directly from Go with zero CGo. One PeerConnection, one Opus track, one data channel — that's the entire agent.

What you'll build

A standalone Go binary that opens a WebRTC PeerConnection to https://api.openai.com/v1/realtime, attaches a microphone Opus track, and prints model events from the data channel. Total latency on a US east-coast box lands around 480 ms, and a Go + Pion client adds very little overhead of its own.

Prerequisites

  1. Go 1.23+ and a working pion/webrtc/v4 install.
  2. OpenAI API key with Realtime access.
  3. An ephemeral key endpoint (do not ship the raw key in your binary).
  4. Both modules fetched: go get github.com/pion/webrtc/v4 github.com/pion/mediadevices (mediadevices handles mic capture).
  5. Familiarity with SDP offer/answer and ICE.

Architecture

```mermaid
sequenceDiagram
  participant G as Go binary
  participant K as Your /session endpoint
  participant O as OpenAI Realtime
  G->>K: POST /session (mint ephemeral key)
  K-->>G: { client_secret.value }
  G->>G: pc.CreateOffer() -> SDP
  G->>O: POST /v1/realtime?model=... (SDP, Bearer eph)
  O-->>G: SDP answer
  G->>O: ICE + DTLS handshake
  G->>O: Opus mic track
  O-->>G: Opus TTS track + DC events
```

Step 1 — Mint an ephemeral key

Never ship a long-lived OpenAI key inside a desktop binary. Stand up a tiny HTTPS endpoint that calls /v1/realtime/sessions with your real key and returns the 60-second client_secret.

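```go
import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// session mirrors the only part of the sessions response we need.
type session struct {
	ClientSecret struct {
		Value string `json:"value"`
	} `json:"client_secret"`
}

// mintEphemeral runs on your server, where the long-lived key lives.
func mintEphemeral(ctx context.Context) (string, error) {
	body := `{"model":"gpt-4o-realtime-preview-2025-06-03","voice":"alloy"}`
	req, err := http.NewRequestWithContext(ctx, "POST",
		"https://api.openai.com/v1/realtime/sessions", strings.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// A non-2xx reply would otherwise decode into an empty secret.
	if resp.StatusCode/100 != 2 {
		return "", fmt.Errorf("mint ephemeral key: %s", resp.Status)
	}

	var s session
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return "", err
	}
	return s.ClientSecret.Value, nil
}
```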
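The endpoint that wraps this function might look like the sketch below. The route name `/session` comes from the architecture diagram; the `minter` type, `newSessionHandler`, `fixedMinter`, and `probe` are hypothetical names introduced here, and the stub minter stands in for `mintEphemeral` so the example runs without calling OpenAI. Authentication of your own users is left out and would be needed in practice.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// minter abstracts the article's mintEphemeral so the handler can be
// exercised without network access (hypothetical type for this sketch).
type minter func(ctx context.Context) (string, error)

// newSessionHandler serves POST /session and replies with
// {"client_secret":{"value":"..."}}. The long-lived key never leaves
// the server; only the short-lived secret is returned.
func newSessionHandler(mint minter) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		secret, err := mint(r.Context())
		if err != nil {
			// Keep upstream error details out of the client response.
			http.Error(w, "could not mint key", http.StatusBadGateway)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Header().Set("Cache-Control", "no-store")
		json.NewEncoder(w).Encode(map[string]any{
			"client_secret": map[string]string{"value": secret},
		})
	})
}

// fixedMinter is a stub used only for the local demo below.
func fixedMinter(value string) minter {
	return func(context.Context) (string, error) { return value, nil }
}

// probe sends one in-process request to h and returns status and body.
func probe(h http.Handler, method string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/session", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := newSessionHandler(fixedMinter("ek_demo"))
	code, body := probe(h, http.MethodPost)
	fmt.Println(code, body)
}
```

In a real deployment you would pass `mintEphemeral` instead of the stub and mount the handler with `http.Handle("/session", ...)` behind TLS.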

Step 2 — Build the PeerConnection

```go
import "github.com/pion/webrtc/v4"

config := webrtc.Configuration{
	ICEServers: []webrtc.ICEServer{{URLs: []string{"stun:stun.l.google.com:19302"}}},
}
pc, err := webrtc.NewPeerConnection(config)
if err != nil {
	log.Fatal(err)
}

// Audio transceiver: OpenAI sends its TTS audio back on this.
// Note that it is declared Sendrecv, not receive-only.
if _, err := pc.AddTransceiverFromKind(webrtc.RTPCodecTypeAudio,
	webrtc.RTPTransceiverInit{Direction: webrtc.RTPTransceiverDirectionSendrecv}); err != nil {
	log.Fatal(err)
}

// Model events arrive as JSON text on the "oai-events" data channel.
dc, err := pc.CreateDataChannel("oai-events", nil)
if err != nil {
	log.Fatal(err)
}
dc.OnMessage(func(m webrtc.DataChannelMessage) {
	fmt.Println("event:", string(m.Data))
})
```

Step 3 — Capture mic and add the track

Use mediadevices to wrap a host mic into an Opus-encoded track. Pion will negotiate the codec automatically:

```go
import (
	"github.com/pion/mediadevices"
	"github.com/pion/mediadevices/pkg/codec/opus"
	_ "github.com/pion/mediadevices/pkg/driver/microphone" // registers the mic driver
)

opusParams, err := opus.NewParams()
if err != nil {
	log.Fatal(err)
}
codecSelector := mediadevices.NewCodecSelector(
	mediadevices.WithAudioEncoders(&opusParams))

// An empty constraints callback accepts the default microphone settings.
ms, err := mediadevices.GetUserMedia(mediadevices.MediaStreamConstraints{
	Audio: func(c *mediadevices.MediaTrackConstraints) {},
	Codec: codecSelector,
})
if err != nil {
	log.Fatal(err)
}
for _, t := range ms.GetAudioTracks() {
	// Send-only: this transceiver carries mic audio to OpenAI.
	if _, err := pc.AddTransceiverFromTrack(t.(webrtc.TrackLocal),
		webrtc.RTPTransceiverInit{Direction: webrtc.RTPTransceiverDirectionSendonly}); err != nil {
		log.Fatal(err)
	}
}
```

Step 4 — Trade SDP with OpenAI

This is the only OpenAI-specific bit: POST your SDP offer as application/sdp and you get the answer back as plain text.

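```go
offer, err := pc.CreateOffer(nil)
if err != nil {
	log.Fatal(err)
}
// Register the promise before SetLocalDescription starts gathering.
gatherDone := webrtc.GatheringCompletePromise(pc)
if err := pc.SetLocalDescription(offer); err != nil {
	log.Fatal(err)
}
<-gatherDone // the SDP now contains every ICE candidate

eph, err := mintEphemeral(ctx)
if err != nil {
	log.Fatal(err)
}
url := "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
req, err := http.NewRequest("POST", url, strings.NewReader(pc.LocalDescription().SDP))
if err != nil {
	log.Fatal(err)
}
req.Header.Set("Authorization", "Bearer "+eph)
req.Header.Set("Content-Type", "application/sdp")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
ans, err := io.ReadAll(resp.Body)
if err != nil {
	log.Fatal(err)
}
if err := pc.SetRemoteDescription(webrtc.SessionDescription{
	Type: webrtc.SDPTypeAnswer,
	SDP:  string(ans),
}); err != nil {
	log.Fatal(err)
}
```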
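If the ephemeral key has expired or the offer is rejected, the response body is an error message rather than SDP, and passing it to SetRemoteDescription produces a confusing failure. One way to guard against that is to pull the HTTP exchange into a helper that checks the status code first. The sketch below uses only the standard library; `exchangeSDP` and `stubExchange` are hypothetical names, and the stub server (which accepts only the token "good") stands in for OpenAI so the example runs offline. Treating any 2xx status as success is an assumption of this sketch.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// exchangeSDP posts an SDP offer and returns the SDP answer as text.
// Any non-2xx status becomes an error that carries the response body.
func exchangeSDP(ctx context.Context, client *http.Client, endpoint, token, offer string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, strings.NewReader(offer))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/sdp")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// Cap the read so a misbehaving server cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return "", err
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return "", fmt.Errorf("sdp exchange: status %d: %s",
			resp.StatusCode, strings.TrimSpace(string(body)))
	}
	return string(body), nil
}

// stubExchange runs exchangeSDP against a local stand-in server.
// It exists purely for the demo and is not part of any real API.
func stubExchange(token string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer good" ||
			r.Header.Get("Content-Type") != "application/sdp" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusCreated)
		io.WriteString(w, "v=0\r\ns=stub-answer\r\n")
	}))
	defer srv.Close()
	return exchangeSDP(context.Background(), srv.Client(), srv.URL, token, "v=0\r\n")
}

func main() {
	answer, err := stubExchange("good")
	fmt.Printf("%q %v\n", answer, err)
	_, err = stubExchange("expired")
	fmt.Println(err)
}
```

Passing a context also lets you put a deadline on the exchange, which matters because the ephemeral key is only valid for about a minute.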

Step 5 — Send a session.update over the data channel

Once the channel is open, push your system prompt and turn-detection config:

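```go
dc.OnOpen(func() {
	payload, err := json.Marshal(map[string]any{
		"type": "session.update",
		"session": map[string]any{
			"instructions": "You are CallSphere, a friendly receptionist.",
			"voice":        "alloy",
			// server_vad lets OpenAI decide when the caller's turn ends.
			"turn_detection":            map[string]any{"type": "server_vad", "threshold": 0.5},
			"input_audio_transcription": map[string]any{"model": "whisper-1"},
		},
	})
	if err != nil {
		log.Printf("marshal session.update: %v", err)
		return
	}
	if err := dc.SendText(string(payload)); err != nil {
		log.Printf("send session.update: %v", err)
	}
})
```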
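With server VAD enabled, events flow back on the same data channel, and the client has to decide what to do with each one. The sketch below decodes only the `type` field and maps it to an action. Treating `input_audio_buffer.speech_started` as the barge-in signal, and the `action` values themselves, are assumptions of this example; how you actually silence playback depends on your audio sink.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event holds the one field every data channel message carries.
type event struct {
	Type string `json:"type"`
}

// action is what the client should do in response to an event
// (hypothetical type for this sketch).
type action int

const (
	actionIgnore       action = iota
	actionStopPlayback        // caller started talking: silence local TTS
	actionLogError            // server reported a problem
)

// classify decodes a raw data channel message and picks an action.
func classify(raw []byte) (action, error) {
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return actionIgnore, fmt.Errorf("decode event: %w", err)
	}
	switch ev.Type {
	case "input_audio_buffer.speech_started":
		return actionStopPlayback, nil
	case "error":
		return actionLogError, nil
	default:
		return actionIgnore, nil
	}
}

func main() {
	for _, msg := range []string{
		`{"type":"input_audio_buffer.speech_started"}`,
		`{"type":"response.done"}`,
		`{"type":"error"}`,
	} {
		a, err := classify([]byte(msg))
		fmt.Println(msg, a, err)
	}
}
```

Inside `dc.OnMessage` you would call `classify(m.Data)` and, on `actionStopPlayback`, drop whatever audio is still queued for the speaker.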

Step 6 — Play remote audio

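Subscribe to the inbound track and pipe it to your speaker. The snippet assumes a `sink` that decodes Opus and writes to the OS audio device; check whether your mediadevices version ships a playback driver for this, or plug in your own decoder and output: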

```go
pc.OnTrack(func(t *webrtc.TrackRemote, _ *webrtc.RTPReceiver) {
	log.Printf("got remote %s track", t.Kind())
	for {
		// ReadRTP parses the packet, so only the Opus payload
		// (not the RTP header) reaches the sink.
		pkt, _, err := t.ReadRTP()
		if err != nil {
			return
		}
		// sink is a placeholder for your Opus decoder and audio output.
		sink.Write(pkt.Payload)
	}
})
```

Common pitfalls

  • Forgetting GatheringCompletePromise. OpenAI rejects half-baked SDP. Wait for ICE gathering before POSTing.
  • Long-lived API key in the binary. Always mint ephemeral keys server-side.
  • Wrong codec. Force Opus on both sides; Pion will fall back to PCMU otherwise.
  • No OnICEConnectionStateChange handler. You'll fly blind on transient drops.
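For the last pitfall, a small policy function keeps the state handling testable. The sketch below takes the state as a string, matching what Pion's `ICEConnectionState.String()` returns; the function name `nextStep` and the four step names are assumptions of this example, not part of Pion.

```go
package main

import "fmt"

// nextStep maps an ICE connection state name to a client-side step.
// The step names are hypothetical labels for this sketch.
func nextStep(state string) string {
	switch state {
	case "connected", "completed":
		return "ok"
	case "disconnected":
		// Often transient: give ICE a few seconds to recover.
		return "wait"
	case "failed":
		// Connectivity is gone: tear down and dial again.
		return "restart"
	case "closed":
		return "cleanup"
	default:
		// "new" and "checking" need no action yet.
		return "none"
	}
}

func main() {
	for _, s := range []string{"checking", "connected", "disconnected", "failed", "closed"} {
		fmt.Printf("%-12s -> %s\n", s, nextStep(s))
	}
}
```

Wiring it in is one call: `pc.OnICEConnectionStateChange(func(s webrtc.ICEConnectionState) { log.Println(s, nextStep(s.String())) })`.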

How CallSphere does this in production

CallSphere's real-estate agent OneRoof runs a Pion-based Go gateway at the edge. Each call gets its own PeerConnection, NATS hands the audio frames off to a transcription worker, and Postgres stores the run. Across 6 verticals and 37 agents we see 480–620ms p50 voice latency. Try it on the 14-day trial or book a live demo.

FAQ

Why Pion over Janus or LiveKit for this? Single binary, no media server, no Docker — perfect for a per-call sidecar.

Does it work behind NAT? Yes, with Google STUN. Add a TURN server for symmetric NAT users.

Can I run this on Fly.io? Yes, but pin to one region per call — WebRTC sessions are stateful.

What about Whisper transcription? Add input_audio_transcription in session.update; deltas arrive on the data channel.

How do I scale? One pod per N concurrent calls; OpenAI's quota dominates, not Pion.
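To make "N concurrent calls" concrete, each process can refuse new calls once it is full and let the load balancer try another pod. The sketch below is one way to do that with a buffered channel used as a semaphore; `callLimiter`, `acquire`, and `errAtCapacity` are hypothetical names, and the right value of N depends on your own measurements.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAtCapacity is returned when every call slot is in use.
var errAtCapacity = errors.New("call limit reached")

// callLimiter caps the number of calls handled at once.
type callLimiter struct {
	slots chan struct{}
}

func newCallLimiter(n int) *callLimiter {
	return &callLimiter{slots: make(chan struct{}, n)}
}

// acquire reserves a slot without blocking. On success it returns a
// release function that frees the slot and is safe to call twice.
func (l *callLimiter) acquire() (func(), error) {
	select {
	case l.slots <- struct{}{}:
		var once sync.Once
		return func() { once.Do(func() { <-l.slots }) }, nil
	default:
		return nil, errAtCapacity
	}
}

func main() {
	lim := newCallLimiter(2)
	release, _ := lim.acquire()
	_, _ = lim.acquire()
	_, err := lim.acquire()
	fmt.Println("third call:", err)
	release()
	_, err = lim.acquire()
	fmt.Println("after release:", err)
}
```

Call `acquire` before creating the PeerConnection and defer the release function so the slot is returned when the call ends.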
