Build a Serverless Voice Agent on Lambda + API Gateway WebSocket (2026)
Sub-second voice agent with zero idle cost: API Gateway WebSocket ($connect/$disconnect/$default), Lambda per-message handler, DynamoDB session store, and OpenAI Realtime over WebSocket.
TL;DR — API Gateway WebSocket decouples the persistent client connection from compute, so Lambdas only spin up per message. For a voice agent you need three Lambdas ($connect, $disconnect, $default), a DynamoDB table mapping connection IDs to OpenAI Realtime session IDs, and an SQS-backed worker that holds the OpenAI socket open. ~$0.04 per call-minute at low scale.
What you'll build
A serverless WebSocket endpoint that accepts PCM audio frames from a browser, forwards them to OpenAI Realtime via a long-lived worker (Fargate Spot or EC2 Spot), receives audio back, and pushes it down the API Gateway connection using the @connections POST API. The browser plays it through Web Audio. No always-on infrastructure for the WebSocket layer — only the OpenAI bridge worker.
Prerequisites
- AWS account with API Gateway + Lambda + DynamoDB + SQS.
- OpenAI API key with Realtime access (gpt-realtime or gpt-realtime-mini).
- AWS SAM or CDK; Node 20 runtime for the Lambdas.
- A small always-on worker (Fargate task or t4g.nano) — Lambda can't hold an external WS open for more than 15 min cleanly.
Architecture
```mermaid
flowchart LR
  B[Browser WS Client] -->|wss://| AG[API Gateway WebSocket]
  AG -->|$connect / $default| LAM[Lambda Handlers]
  LAM <-->|connectionId map| DDB[(DynamoDB sessions)]
  LAM -->|SendMessage| SQS[(SQS audio_in)]
  SQS --> WK[Bridge Worker Fargate]
  WK <-->|wss://| OA[OpenAI Realtime]
  WK -->|@connections POST| AG
  AG -->|audio frames| B
```
Step 1 — Provision the API Gateway WebSocket
```yaml
# template.yaml (AWS SAM)
Resources:
  WS:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: voice-agent
      ProtocolType: WEBSOCKET
      RouteSelectionExpression: "$request.body.action"
```
Define three routes — $connect, $disconnect, $default — each with a Lambda-proxy integration.
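Sketched in SAM, the route-plus-integration wiring looks roughly like the block below. The logical IDs ConnectFn, ConnectRoute, and ConnectInteg are illustrative names, and the Lambda functions themselves are assumed to be defined elsewhere in the template.

```yaml
# Illustrative route + integration pair for $connect; repeat the same
# pattern for $disconnect and $default with their own functions.
ConnectRoute:
  Type: AWS::ApiGatewayV2::Route
  Properties:
    ApiId: !Ref WS
    RouteKey: $connect
    Target: !Join ["/", ["integrations", !Ref ConnectInteg]]
ConnectInteg:
  Type: AWS::ApiGatewayV2::Integration
  Properties:
    ApiId: !Ref WS
    IntegrationType: AWS_PROXY
    IntegrationUri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ConnectFn.Arn}/invocations
```

You'll also need an AWS::Lambda::Permission per function so API Gateway can invoke it.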
Step 2 — DynamoDB table for connection state
```yaml
Sessions:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - { AttributeName: connectionId, AttributeType: S }
    KeySchema:
      - { AttributeName: connectionId, KeyType: HASH }
    TimeToLiveSpecification:
      AttributeName: ttl
      Enabled: true
```
TTL kills stale entries automatically; voice sessions rarely exceed 1 hour.
Step 3 — $connect and $default Lambdas
```js
// connect.mjs — store the new connection ID with a 1-hour TTL
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

export const handler = async (event) => {
  const cid = event.requestContext.connectionId;
  await ddb.send(new PutItemCommand({
    TableName: process.env.SESSIONS,
    Item: {
      connectionId: { S: cid },
      ttl: { N: String(Math.floor(Date.now() / 1000) + 3600) } // expires in 1h
    }
  }));
  return { statusCode: 200 };
};
```
```js
// default.mjs — receive audio frames from browser, hand off to SQS
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export const handler = async (event) => {
  const cid = event.requestContext.connectionId;
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.AUDIO_IN,
    MessageBody: JSON.stringify({ cid, frame: event.body }) // base64 PCM
  }));
  return { statusCode: 200 };
};
```
Step 4 — The bridge worker (Fargate)
```js
// bridge.mjs — long-lived worker: one OpenAI socket per connection
import WebSocket from "ws";
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand
} from "@aws-sdk/client-apigatewaymanagementapi";

const sessions = new Map(); // cid -> WS to OpenAI
const apigw = new ApiGatewayManagementApiClient({ endpoint: process.env.WS_ENDPOINT });

async function getOrOpen(cid) {
  if (sessions.has(cid)) return sessions.get(cid);
  const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1"
    }
  });
  ws.on("message", async (data) => {
    const ev = JSON.parse(data.toString());
    if (ev.type === "response.audio.delta") {
      // push decoded audio down the client's API Gateway connection
      await apigw.send(new PostToConnectionCommand({
        ConnectionId: cid,
        Data: Buffer.from(ev.delta, "base64")
      }));
    }
  });
  sessions.set(cid, ws);
  return ws;
}
```
The worker pulls from SQS, opens (or reuses) an OpenAI socket per connection, forwards frames in, and pushes audio out via @connections.
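A minimal sketch of that poll loop, assuming the AUDIO_IN queue URL is in the environment and the getOrOpen() helper from the block above is in scope (unpackAudioMessage is an illustrative name):

```javascript
// Pure helper: one SQS message body -> { cid, pcm } with base64 PCM decoded.
function unpackAudioMessage(body) {
  const { cid, frame } = JSON.parse(body);
  return { cid, pcm: Buffer.from(frame, "base64") };
}

async function pollLoop() {
  // Lazy-load the SDK so the helper above is usable without it installed.
  const { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } =
    await import("@aws-sdk/client-sqs");
  const sqs = new SQSClient({});
  for (;;) {
    const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: process.env.AUDIO_IN,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20 // long poll to cut empty receives
    }));
    for (const m of Messages) {
      const { cid, pcm } = unpackAudioMessage(m.Body);
      const ws = await getOrOpen(cid); // from the worker sketch above
      // In production, buffer until the socket's "open" event fires.
      ws.send(JSON.stringify({
        type: "input_audio_buffer.append",
        audio: pcm.toString("base64")
      }));
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: process.env.AUDIO_IN,
        ReceiptHandle: m.ReceiptHandle
      }));
    }
  }
}
```

Deleting each message only after a successful forward gives you at-least-once delivery; the Realtime API tolerates the occasional duplicate audio append.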
Step 5 — Browser side: 16-bit PCM over WebSocket
```js
const ws = new WebSocket("wss://abc123.execute-api.us-east-1.amazonaws.com/prod");
const ctx = new AudioContext({ sampleRate: 24000 });
const src = ctx.createMediaStreamSource(
  await navigator.mediaDevices.getUserMedia({ audio: true })
);
const proc = ctx.createScriptProcessor(4096, 1, 1);
src.connect(proc);
proc.connect(ctx.destination);

proc.onaudioprocess = (e) => {
  const f32 = e.inputBuffer.getChannelData(0);
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    i16[i] = Math.max(-1, Math.min(1, f32[i])) * 0x7fff; // clamp, then scale to int16
  }
  ws.send(i16.buffer);
};
```

ScriptProcessorNode is deprecated; AudioWorklet is the modern replacement, but the above is the shortest thing that works everywhere.
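The receive path — playing the audio API Gateway pushes back — can be sketched as below, assuming the server relays raw 24 kHz 16-bit PCM frames. pcm16ToFloat32, playFrame, and wirePlayback are illustrative names:

```javascript
// Pure helper: Int16 PCM samples -> Float32 samples in [-1, 1).
function pcm16ToFloat32(i16) {
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x8000;
  return f32;
}

let playHead = 0; // absolute AudioContext time where the next frame starts

// Queue one PCM frame back-to-back on the context for gapless playback.
function playFrame(ctx, i16, sampleRate = 24000) {
  const f32 = pcm16ToFloat32(i16);
  const buf = ctx.createBuffer(1, f32.length, sampleRate);
  buf.copyToChannel(f32, 0);
  const srcNode = ctx.createBufferSource();
  srcNode.buffer = buf;
  srcNode.connect(ctx.destination);
  playHead = Math.max(playHead, ctx.currentTime);
  srcNode.start(playHead);
  playHead += f32.length / sampleRate;
}

// Wire-up, reusing ws and ctx from the capture snippet above.
function wirePlayback(ws, ctx) {
  ws.binaryType = "arraybuffer";
  ws.onmessage = (e) => playFrame(ctx, new Int16Array(e.data));
}
```

Tracking playHead rather than calling start() immediately keeps back-to-back frames gapless even when network jitter delivers them in bursts.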
Step 6 — $disconnect cleanup
```js
// disconnect.mjs — drop the session row on $disconnect
import { DynamoDBClient, DeleteItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

export const handler = async (event) => {
  await ddb.send(new DeleteItemCommand({
    TableName: process.env.SESSIONS,
    Key: { connectionId: { S: event.requestContext.connectionId } }
  }));
  return { statusCode: 200 };
};
```
The bridge worker subscribes to a DynamoDB Stream (or polls) and closes its OpenAI socket when the row vanishes.
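One way to sketch the Stream side is a small Lambda on the table's stream that turns REMOVE records into close messages on the same AUDIO_IN queue, which the worker already polls. extractClosedConnections is an illustrative name:

```javascript
// Pure helper: pull connection IDs out of REMOVE records in a stream event.
function extractClosedConnections(event) {
  return (event.Records || [])
    .filter((r) => r.eventName === "REMOVE")
    .map((r) => r.dynamodb.Keys.connectionId.S);
}

// Lambda handler (export as `handler` in your module system of choice).
const handler = async (event) => {
  const { SQSClient, SendMessageCommand } = await import("@aws-sdk/client-sqs");
  const sqs = new SQSClient({});
  for (const cid of extractClosedConnections(event)) {
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.AUDIO_IN,
      MessageBody: JSON.stringify({ cid, close: true }) // worker closes its OpenAI socket
    }));
  }
  return { statusCode: 200 };
};
```

Routing cleanup through the existing queue means the worker needs no second event source — it just branches on the close flag in its poll loop.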
Pitfalls
- API Gateway WebSocket has a 10-minute idle timeout — have the worker send a {type:"ping"} frame every 5 min to keep the connection alive.
- Frame size limit is 32 KB per WS message; chunk audio if needed.
- Lambda cold start on $default adds 200-300 ms — provisioned concurrency of 5 fixes it for ~$10/mo.
- @connections POST is region-locked — your worker must call the API in the same region it's deployed in.
- Cost trap: API Gateway charges $1/M messages. At 50 frames/sec that's 180k messages/hour, roughly $0.18/hour per direction per active call before Lambda. Pre-aggregate frames if scaling.
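The pre-aggregation in that last cost bullet can be sketched as a tiny buffering shim in front of ws.send — batching ~10 frames cuts the message count 10x. makeAggregator is an illustrative name:

```javascript
// Buffer small Int16 PCM chunks; emit one concatenated message once at
// least minSamples have accumulated.
function makeAggregator(minSamples, send) {
  let pending = [];
  let count = 0;
  return (chunk) => {
    pending.push(chunk);
    count += chunk.length;
    if (count >= minSamples) {
      const out = new Int16Array(count);
      let off = 0;
      for (const c of pending) { out.set(c, off); off += c.length; }
      send(out);
      pending = [];
      count = 0;
    }
  };
}

// Usage: instead of ws.send(i16.buffer) per 4096-sample frame, flush every
// ~0.5 s of 24 kHz audio (12000 samples = 24 KB, under the 32 KB WS limit):
// const push = makeAggregator(12000, (i16) => ws.send(i16.buffer));
```

The trade-off is latency: each 0.5 s batch adds up to 0.5 s before audio reaches the worker, so tune the threshold against your latency budget.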
How CallSphere does this in production
CallSphere doesn't use API Gateway WebSocket for voice — at our scale (millions of minutes/month) we measured it at 3-4x the cost of running our own WebSocket fleet behind an ALB. We run FastAPI on port 8084 on bare k3s nodes for Healthcare, and Pion Go + NATS for OneRoof multi-family: 37 agents and 90+ tools across 6 verticals, backed by 115+ DB tables, with plans at $149/$499/$1499, a 14-day trial, and a 22% affiliate program. For early-stage builders without that scale, the serverless pattern in this post is the right answer.
FAQ
Q: Can I drop the worker and put everything in Lambda? Only if every call lasts <15 min and you're OK with cold starts on every reconnect. The worker pattern is what makes this production-grade.
Q: Do I have to use OpenAI? No — swap the worker target for AWS Bedrock Nova Sonic, Azure GPT Realtime, or Cloudflare Workers AI.
Q: How do I add Twilio?
Replace the $default handler with a Twilio Media Stream WebSocket served from a non-API-Gateway endpoint (Twilio doesn't sign API Gateway URLs out of the box). Cleanest is a separate ALB+Fargate path for Twilio.
Q: What about HIPAA? API Gateway (WebSocket APIs included) is on AWS's HIPAA-eligible services list. Sign a BAA, enable VPC endpoints, and encrypt SQS with a KMS CMK.
Q: Cost at 1k concurrent calls? Roughly: $1.50 API Gateway messages + $0.20 Lambda + $0.10 SQS + $0.50 Fargate + $20 OpenAI = $22.30/hour, or about $0.022 per call-hour.