AI Infrastructure

Build a Serverless Voice Agent on Lambda + API Gateway WebSocket (2026)

Sub-second voice agent with zero idle cost: API Gateway WebSocket ($connect/$disconnect/$default), Lambda per-message handler, DynamoDB session store, and OpenAI Realtime over WebSocket.

TL;DR — API Gateway WebSocket decouples the persistent client connection from compute, so Lambdas only spin up per message. For a voice agent you need three Lambdas ($connect, $disconnect, $default), a DynamoDB table mapping connection IDs to OpenAI Realtime session IDs, and an SQS-backed worker that holds the OpenAI socket open. ~$0.04 per call-minute at low scale.

What you'll build

A serverless WebSocket endpoint that accepts PCM audio frames from a browser, forwards them to OpenAI Realtime via a long-lived worker (Fargate Spot or EC2 Spot), receives audio back, and pushes it down the API Gateway connection using the @connections POST API. The browser plays it through Web Audio. No always-on infrastructure for the WebSocket layer — only the OpenAI bridge worker.

Prerequisites

  1. AWS account with API Gateway + Lambda + DynamoDB + SQS.
  2. OpenAI API key with Realtime (gpt-realtime or gpt-realtime-mini).
  3. AWS SAM or CDK; Node 20 runtime for the Lambdas.
  4. A small always-on worker (Fargate task or t4g.nano) — Lambda can't hold an external WS open for more than 15 min cleanly.

Architecture

```mermaid
flowchart LR
  B[Browser WS Client] -->|wss://| AG[API Gateway WebSocket]
  AG -->|$connect / $default| LAM[Lambda Handlers]
  LAM <-->|connectionId map| DDB[(DynamoDB sessions)]
  LAM -->|SendMessage| SQS[(SQS audio_in)]
  SQS --> WK[Bridge Worker Fargate]
  WK <-->|wss://| OA[OpenAI Realtime]
  WK -->|@connections POST| AG
  AG -->|audio frames| B
```

Step 1 — Provision the API Gateway WebSocket

```yaml
# template.yaml (AWS SAM)
Resources:
  WS:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: voice-agent
      ProtocolType: WEBSOCKET
      RouteSelectionExpression: "$request.body.action"
```

Define three integrations: $connect, $disconnect, $default — each pointing at a Lambda.
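Wiring one route looks like the sketch below — shown for $connect only; repeat the pair for $disconnect and $default. The resource names (`ConnectRoute`, `ConnectIntegration`, `ConnectFn`) are illustrative, not prescribed.

```yaml
ConnectRoute:
  Type: AWS::ApiGatewayV2::Route
  Properties:
    ApiId: !Ref WS
    RouteKey: $connect
    Target: !Join ["/", ["integrations", !Ref ConnectIntegration]]
ConnectIntegration:
  Type: AWS::ApiGatewayV2::Integration
  Properties:
    ApiId: !Ref WS
    IntegrationType: AWS_PROXY
    IntegrationUri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ConnectFn.Arn}/invocations
```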

Step 2 — DynamoDB table for connection state

```yaml
Sessions:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - { AttributeName: connectionId, AttributeType: S }
    KeySchema:
      - { AttributeName: connectionId, KeyType: HASH }
    TimeToLiveSpecification:
      AttributeName: ttl
      Enabled: true
```


TTL kills stale entries automatically; voice sessions rarely exceed 1 hour.

Step 3 — $connect and $default Lambdas

```js
// connect.mjs — register the new connection in DynamoDB
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

export const handler = async (event) => {
  const cid = event.requestContext.connectionId;
  await ddb.send(new PutItemCommand({
    TableName: process.env.SESSIONS,
    Item: {
      connectionId: { S: cid },
      ttl: { N: String(Math.floor(Date.now() / 1000) + 3600) } // expire in 1 hour
    }
  }));
  return { statusCode: 200 };
};
```

```js
// default.mjs — receive audio frames from the browser, hand off to SQS
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export const handler = async (event) => {
  const cid = event.requestContext.connectionId;
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.AUDIO_IN,
    MessageBody: JSON.stringify({ cid, frame: event.body }) // base64 PCM
  }));
  return { statusCode: 200 };
};
```

Step 4 — The bridge worker (Fargate)

```js
// worker.mjs — long-lived bridge between SQS audio frames and OpenAI Realtime
import WebSocket from "ws";
import { ApiGatewayManagementApiClient, PostToConnectionCommand } from "@aws-sdk/client-apigatewaymanagementapi";

const sessions = new Map(); // cid -> WS to OpenAI
const apigw = new ApiGatewayManagementApiClient({ endpoint: process.env.WS_ENDPOINT });

async function getOrOpen(cid) {
  if (sessions.has(cid)) return sessions.get(cid);
  const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // template literal — easy to drop the backticks
      "OpenAI-Beta": "realtime=v1"
    }
  });
  ws.on("message", async (data) => {
    const ev = JSON.parse(data.toString());
    if (ev.type === "response.audio.delta") {
      await apigw.send(new PostToConnectionCommand({
        ConnectionId: cid,
        Data: Buffer.from(ev.delta, "base64")
      }));
    }
  });
  sessions.set(cid, ws);
  return ws;
}
```

The worker pulls from SQS, opens (or reuses) an OpenAI socket per connection, forwards frames in, and pushes audio out via @connections.
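The branch inside `ws.on("message", …)` is easier to test if it's factored into a pure dispatcher. A minimal sketch — the `response.audio.delta` event shape matches the worker above; the `dispatchRealtimeEvent` name and the `error` branch are illustrative additions:

```javascript
// Decide what to do with one raw OpenAI Realtime event frame.
// Returns { action: "forward", audio: Buffer } for audio deltas,
// { action: "close" } on error events, { action: "ignore" } otherwise.
function dispatchRealtimeEvent(raw) {
  const ev = JSON.parse(raw.toString());
  if (ev.type === "response.audio.delta") {
    return { action: "forward", audio: Buffer.from(ev.delta, "base64") };
  }
  if (ev.type === "error") {
    return { action: "close", reason: ev.error?.message };
  }
  return { action: "ignore" };
}
```

The socket callback then shrinks to a switch on `action`, which keeps the AWS and OpenAI I/O at the edges and the logic unit-testable.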

Step 5 — Browser side: 16-bit PCM over WebSocket

```js
const ws = new WebSocket("wss://abc123.execute-api.us-east-1.amazonaws.com/prod");
const ctx = new AudioContext({ sampleRate: 24000 });
const src = ctx.createMediaStreamSource(
  await navigator.mediaDevices.getUserMedia({ audio: true })
);
const proc = ctx.createScriptProcessor(4096, 1, 1); // deprecated but simplest; AudioWorklet is the modern path
src.connect(proc);
proc.connect(ctx.destination);
proc.onaudioprocess = (e) => {
  const f32 = e.inputBuffer.getChannelData(0);
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    i16[i] = Math.max(-1, Math.min(1, f32[i])) * 0x7fff; // clamp, then scale to 16-bit
  }
  ws.send(i16.buffer);
};
```
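The downlink needs the inverse: frames arrive as 16-bit PCM and Web Audio wants Float32 samples. A minimal sketch — the helper is pure; the commented playback lines assume the same 24 kHz mono format used above:

```javascript
// Convert an ArrayBuffer of 16-bit little-endian PCM into the
// Float32 samples (-1..1) that an AudioBuffer channel expects.
function pcm16ToFloat32(buf) {
  const i16 = new Int16Array(buf);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x8000;
  return f32;
}

// Browser usage (sketch) inside ws.onmessage:
// const samples = pcm16ToFloat32(ev.data);
// const audio = ctx.createBuffer(1, samples.length, 24000);
// audio.getChannelData(0).set(samples);
// ...then play it through an AudioBufferSourceNode.
```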


Step 6 — $disconnect cleanup

```js
// disconnect.mjs — remove the session row when the client drops
import { DynamoDBClient, DeleteItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

export const handler = async (event) => {
  await ddb.send(new DeleteItemCommand({
    TableName: process.env.SESSIONS,
    Key: { connectionId: { S: event.requestContext.connectionId } }
  }));
  return { statusCode: 200 };
};
```

The bridge worker subscribes to a DynamoDB Stream (or polls) and closes its OpenAI socket when the row vanishes.
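If the worker takes the stream route, teardown reduces to extracting connection IDs from `REMOVE` records. A sketch — the record shape is standard DynamoDB Streams; the function name and the commented worker line are illustrative:

```javascript
// Pull the connection IDs of deleted session rows out of a
// DynamoDB Streams batch, so the worker knows which sockets to close.
function removedConnectionIds(streamEvent) {
  return streamEvent.Records
    .filter((r) => r.eventName === "REMOVE")
    .map((r) => r.dynamodb.Keys.connectionId.S);
}

// Worker side (sketch):
// for (const cid of removedConnectionIds(event)) sessions.get(cid)?.close();
```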

Pitfalls

  • API Gateway WebSocket has a 10-minute idle timeout — send a {type:"ping"} frame from the worker every 5 minutes to keep the connection alive.
  • Frame size limit is 32 KB per WS message; chunk audio if needed.
  • Lambda cold start on $default adds 200-300 ms — provisioned concurrency of 5 fixes it for ~$10/mo.
  • @connections POST is region-locked — your worker must run in the same region as the API.
  • Cost trap: API Gateway charges $1/M messages. At 50 frames/s that's 180k messages per hour, so about $0.18/hour per direction per active call before Lambda. Pre-aggregate frames if scaling.
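The 32 KB frame limit is easy to handle with a small chunker in the worker before posting to @connections. A minimal sketch:

```javascript
// Split an audio buffer into chunks that each fit under API
// Gateway's 32 KB per-frame WebSocket limit.
function chunkFrames(buf, max = 32 * 1024) {
  const chunks = [];
  for (let off = 0; off < buf.length; off += max) {
    chunks.push(buf.subarray(off, off + max)); // views, no copies
  }
  return chunks;
}
```

`subarray` returns views over the original buffer, so chunking adds no copies — each chunk is then posted as its own message.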

How CallSphere does this in production

CallSphere doesn't use API Gateway WebSocket for voice — we measured that at our scale (millions of minutes/month) it's 3-4x more expensive than running our own WebSocket fleet behind an ALB. We run FastAPI on :8084 on bare k3s nodes for Healthcare, and Pion Go + NATS for OneRoof multi-family: 37 agents, 90+ tools, 115+ DB tables, 6 verticals, $149/$499/$1499 plans, 14-day trial, 22% affiliate. For early-stage builders without our scale, the serverless pattern in this post is the right answer.

FAQ

Q: Can I drop the worker and put everything in Lambda? Only if every call lasts <15 min and you're OK with cold starts on every reconnect. The worker pattern is what makes this production-grade.

Q: Do I have to use OpenAI? No — swap the worker target for AWS Bedrock Nova Sonic, Azure GPT Realtime, or Cloudflare Workers AI.

Q: How do I add Twilio? Replace the $default handler with a Twilio Media Stream WebSocket served from a non-API-Gateway endpoint (Twilio doesn't sign API Gateway URLs out of the box). Cleanest is a separate ALB+Fargate path for Twilio.

Q: What about HIPAA? API Gateway WebSocket is HIPAA-eligible since 2024. Sign a BAA, enable VPC endpoints, encrypt SQS with KMS CMK.
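Encrypting the queue with a customer-managed key is one property in SAM — a sketch, assuming a CMK already defined elsewhere in the template as `AudioKey` (name illustrative):

```yaml
AudioIn:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: audio_in
    KmsMasterKeyId: !Ref AudioKey   # customer-managed CMK, not the AWS-managed alias
```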

Q: Cost at 1k concurrent calls? Roughly: $1.50 API Gateway messages + $0.20 Lambda + $0.10 SQS + $0.50 Fargate + $20 OpenAI = $22.30/hour, or about $0.022 per call-hour.
