---
title: "Build a Voice Agent on AWS Bedrock: Claude + Polly + Transcribe (2026)"
description: "Wire Amazon Transcribe streaming, Claude 4.7 Sonnet on Bedrock, and Polly generative voices into a sub-second voice agent. Real Python + boto3 code, IAM policy, and production tips."
canonical: https://callsphere.ai/blog/vw5h-build-voice-agent-aws-bedrock-claude-polly-transcribe
category: "AI Voice Agents"
tags: ["AWS", "Bedrock", "Claude", "Polly", "Transcribe", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-15T00:00:00.000Z
updated: 2026-05-07T16:30:05.079Z
---

# Build a Voice Agent on AWS Bedrock: Claude + Polly + Transcribe (2026)

> Wire Amazon Transcribe streaming, Claude 4.7 Sonnet on Bedrock, and Polly generative voices into a sub-second voice agent. Real Python + boto3 code, IAM policy, and production tips.

> **TL;DR** — Amazon Transcribe streams partial transcripts over the bidirectional `StartStreamTranscription` API (HTTP/2 event streams in the Python SDK; a WebSocket variant also exists), you forward final segments to Claude 4.7 Sonnet on Bedrock with the InvokeModel API, then synthesize the reply with Polly's `generative` engine. Three boto3 clients, one event loop, ~700ms voice-to-voice on us-east-1.

## What you'll build

A Python service that accepts an HTTP POST with raw 16kHz PCM (or via WebSocket), pipes the audio into Amazon Transcribe streaming, sends each finalized utterance to `anthropic.claude-sonnet-4-7-20250620-v1:0` on Bedrock, then streams the response text into Polly with the `generative` engine. The whole agent runs on a single `t3.small` EC2 or in a Fargate task — no GPU required.

## Prerequisites

1. AWS account with Bedrock access enabled in `us-east-1` (request access for Anthropic models in the Bedrock console).
2. IAM role with `transcribe:StartStreamTranscription`, `bedrock:InvokeModel`, and `polly:SynthesizeSpeech`.
3. Python 3.11, `boto3>=1.34`, `amazon-transcribe>=0.6.2` (`asyncio` ships in the standard library, no install needed).
4. An audio source: 16kHz mono PCM (works directly with Transcribe).

## Architecture

```mermaid
flowchart LR
  CALLER[Caller / Browser] -->|PCM16 16kHz| APP[Python Agent]
  APP -->|StartStreamTranscription| TRANS[Amazon Transcribe Streaming]
  TRANS -->|partial + final| APP
  APP -->|InvokeModel claude-4-7-sonnet| BR[Amazon Bedrock]
  BR -->|text reply| APP
  APP -->|SynthesizeSpeech engine=generative| POLLY[Amazon Polly]
  POLLY -->|MP3 / PCM| CALLER
```

## Step 1 — IAM policy for the agent role

Attach this minimal inline policy to the EC2 instance role or task role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "transcribe:StartStreamTranscription", "Resource": "*" },
    { "Effect": "Allow", "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-7-20250620-v1:0" },
    { "Effect": "Allow", "Action": "polly:SynthesizeSpeech", "Resource": "*" }
  ]
}
```

## Step 2 — Stream audio into Amazon Transcribe

```python
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
import asyncio

class Handler(TranscriptResultStreamHandler):
    def __init__(self, stream, on_final):
        super().__init__(stream)
        self.on_final = on_final

    async def handle_transcript_event(self, event):
        # Forward only finalized (non-partial) utterances to the callback.
        for r in event.transcript.results:
            if not r.is_partial and r.alternatives:
                await self.on_final(r.alternatives[0].transcript)

async def transcribe(pcm_iter, on_final):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US", media_sample_rate_hz=16000,
        media_encoding="pcm")
    async def feed():
        async for chunk in pcm_iter:
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()
    await asyncio.gather(feed(), Handler(stream.output_stream, on_final).handle_events())
```
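Transcribe wants audio in small events; AWS recommends roughly 50–200ms of audio per chunk so partials keep flowing. If your source hands you one big PCM buffer, a helper like this (the name `chunk_pcm` is ours) slices it into frames you can feed to `transcribe` above:

```python
async def chunk_pcm(pcm: bytes, frame_ms: int = 100, sample_rate: int = 16000):
    """Yield fixed-duration frames of 16-bit mono PCM.

    frame_ms=100 at 16 kHz mono / 16-bit -> 3200 bytes per frame,
    comfortably within Transcribe's recommended chunk duration.
    """
    bytes_per_frame = sample_rate * 2 * frame_ms // 1000  # 2 bytes per sample
    for i in range(0, len(pcm), bytes_per_frame):
        yield pcm[i:i + bytes_per_frame]
```

Usage: `await transcribe(chunk_pcm(audio_bytes), on_final)` — the async generator satisfies the `pcm_iter` parameter directly.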

## Step 3 — Call Claude 4.7 Sonnet on Bedrock

```python
import boto3, json
br = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_claude(history, user_text):
    history.append({"role": "user", "content": [{"type": "text", "text": user_text}]})
    resp = br.invoke_model(
        modelId="anthropic.claude-sonnet-4-7-20250620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "system": "You are a concise voice agent. Keep replies under 2 sentences.",
            "messages": history,
        }))
    text = json.loads(resp["body"].read())["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"type": "text", "text": text}]})
    return text
```

For lower latency, switch to `invoke_model_with_response_stream` and pipe deltas straight into Polly.
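A sketch of that streaming path, assuming the event shape Bedrock uses for Anthropic's messages streaming format (`{"chunk": {"bytes": <json>}}` events carrying `content_block_delta` payloads); the helper names `extract_delta` and `stream_claude` are ours:

```python
import json

def extract_delta(event: dict) -> str:
    """Pull the text delta out of one Bedrock response-stream event."""
    payload = json.loads(event["chunk"]["bytes"])
    if payload.get("type") == "content_block_delta":
        return payload.get("delta", {}).get("text", "")
    return ""

def stream_claude(br, history, user_text, on_delta):
    """Stream Claude's reply delta-by-delta instead of waiting for the full body."""
    history.append({"role": "user", "content": [{"type": "text", "text": user_text}]})
    resp = br.invoke_model_with_response_stream(
        modelId="anthropic.claude-sonnet-4-7-20250620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": history,
        }))
    full = []
    for event in resp["body"]:        # EventStream of chunk events
        delta = extract_delta(event)
        if delta:
            full.append(delta)
            on_delta(delta)           # e.g. feed a sentence buffer -> Polly
    text = "".join(full)
    history.append({"role": "assistant", "content": [{"type": "text", "text": text}]})
    return text
```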

## Step 4 — Synthesize with Polly generative voices

```python
polly = boto3.client("polly", region_name="us-east-1")
def synth(text, voice="Ruth"):  # Ruth/Stephen are generative voices
    out = polly.synthesize_speech(
        Text=text, VoiceId=voice, OutputFormat="pcm",
        SampleRate="16000", Engine="generative")
    return out["AudioStream"].read()
```

Generative voices add ~150ms vs neural but sound dramatically more human; use `neural` for stricter latency budgets.

## Step 5 — Glue: VAD, tool-use, and barge-in

Use a simple energy-based VAD (RMS threshold) to chunk inputs to Transcribe; throw away anything below 600ms of speech. For barge-in, kill the current Polly playback the moment Transcribe emits a non-empty partial. For tool-use, switch from `invoke_model` to Bedrock's `converse` API which supports native tool calling — Claude returns a `toolUse` block, you execute, and reply with a `toolResult` block.
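The energy-based VAD can be as small as this sketch; the 500 threshold (on a 16-bit scale maxing at 32767) is a rough starting point you should tune per microphone and codec:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[:n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    # Tune threshold empirically; phone audio after mu-law decode is noisier
    # than browser PCM, so one value rarely fits both.
    return rms(frame) > threshold
```

Track consecutive speech frames to enforce the 600ms minimum before opening a Transcribe segment.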

## Step 6 — Containerize and deploy on Fargate

```dockerfile
FROM python:3.11-slim
RUN pip install boto3 amazon-transcribe uvicorn fastapi
COPY app.py /app/app.py
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```

`docker build -t voice-agent . && aws ecr-public get-login-password | docker login ... && docker push ...`. Then run as a Fargate service behind an NLB; mTLS to Twilio if you're terminating PSTN.

## Step 7 — Wire to Twilio Media Streams

Convert Twilio's mu-law 8kHz frames to PCM16 16kHz (e.g. with `audioop.ulaw2lin` + `audioop.ratecv`; note that `audioop` is deprecated since Python 3.11 and removed in 3.13, so pin your runtime or vendor a G.711 codec) before forwarding into the Transcribe stream. Reverse the chain (PCM16 16kHz → mu-law 8kHz) on Polly output frames before sending `media` events back to Twilio.
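If you can't rely on `audioop` (it's gone in Python 3.13), the decode half can be done in pure Python. This is a hedged sketch of the standard G.711 mu-law expansion plus a crude linear-interpolation upsampler; production code would likely vendor a C implementation or a DSP library instead:

```python
import struct

BIAS = 0x84  # G.711 mu-law bias

def ulaw_byte_to_linear(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF
    t = (((u & 0x0F) << 3) + BIAS) << ((u >> 4) & 0x07)
    return (BIAS - t) if (u & 0x80) else (t - BIAS)

def ulaw_to_pcm16_16k(ulaw: bytes) -> bytes:
    """mu-law 8 kHz -> PCM16 16 kHz: decode, then double the rate by
    inserting an interpolated midpoint after each sample."""
    samples = [ulaw_byte_to_linear(b) for b in ulaw]
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return struct.pack(f"<{len(out)}h", *out)
```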

## Pitfalls

- **Bedrock model access** isn't on by default — request it once per region in the Bedrock console.
- **Transcribe streaming has a 4-hour cap** per session; reset on long calls.
- **Polly generative is regional** — only available in `us-east-1`, `eu-west-1`, `ap-northeast-1` as of May 2026.
- **Cost trap**: Polly generative is $30/M chars vs $16/M for neural ($4/M for standard). Cache common greetings.
- **boto3 retry storms**: set `Config(retries={"max_attempts": 1, "mode": "standard"})` on Bedrock; the default exponential backoff will blow your latency budget.
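The retry pitfall above as a concrete client config (the timeout values are our suggestions, not AWS defaults):

```python
import boto3
from botocore.config import Config

# One attempt, tight timeouts: fail fast and fall back to a canned reply
# instead of letting exponential backoff eat the voice-to-voice budget.
no_retry = Config(
    retries={"max_attempts": 1, "mode": "standard"},
    connect_timeout=2,
    read_timeout=10,
)
br = boto3.client("bedrock-runtime", region_name="us-east-1", config=no_retry)
```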

## How CallSphere does this in production

CallSphere's healthcare voice stack runs on FastAPI (port 8084) with OpenAI Realtime as the primary path, because we measured ~350ms lower TTFT than Bedrock InvokeModel for short utterances. We keep an AWS Bedrock + Polly fallback wired through the same FastAPI surface for HIPAA-locked tenants whose audio must never leave AWS, and Claude 4.7 Sonnet on Bedrock powers our 90+ tools across 6 verticals. We run 37 voice agents under one orchestration layer, with 115+ Postgres tables tracking every turn. Pricing tiers are $149/$499/$1499 with a 14-day trial and a 22% lifetime affiliate cut.

## FAQ

**Q: Why not just use Bedrock AgentCore?**
AgentCore is great for chat but doesn't give you raw audio control — you can't bridge Twilio media streams without a wrapper service anyway. Going direct to Transcribe + InvokeModel + Polly keeps you in the audio path.

**Q: Can I use Nova Sonic instead of this stack?**
Nova Sonic (Amazon's speech-to-speech model) is excellent and cuts latency further, but it's currently only routable through Bedrock InvokeModelWithBidirectionalStream which requires SigV4 signing on a streaming socket — more code than this tutorial.

**Q: How do I handle PHI?**
Sign a BAA with AWS, enable VPC endpoints for all three services so audio never traverses the public internet, and disable transcript logging (pair it with Transcribe's PII redaction for anything you persist).

**Q: What's the realistic latency?**
On us-east-1 with warm clients: Transcribe partial ~250ms, Bedrock TTFT ~400ms, Polly first-byte ~200ms. Voice-to-voice ~700ms.

**Q: Can I stream Claude's output into Polly?**
Yes — use `invoke_model_with_response_stream`, accumulate deltas into sentence boundaries (`. ! ?`), and call Polly per sentence. Cuts perceived latency by 40%.
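The sentence-boundary accumulator can be a tiny stateful class; `SentenceBuffer` is our name for it:

```python
import re

# Split after ., !, or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

class SentenceBuffer:
    """Accumulate streamed text deltas and emit complete sentences."""
    def __init__(self):
        self._buf = ""

    def feed(self, delta: str) -> list[str]:
        """Add a delta; return any sentences that just completed."""
        self._buf += delta
        parts = _BOUNDARY.split(self._buf)
        self._buf = parts.pop()   # last piece may still be mid-sentence
        return parts

    def flush(self) -> str:
        """Return whatever remains when the stream ends."""
        rest, self._buf = self._buf, ""
        return rest
```

Call `synth()` on every sentence `feed()` returns, and `flush()` once the response stream closes to catch a final sentence with no trailing whitespace.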

## Sources

- [Building intelligent AI voice agents with Pipecat and Amazon Bedrock — AWS](https://aws.amazon.com/blogs/machine-learning/building-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock-part-1/)
- [Amazon Transcribe Streaming Python SDK](https://github.com/awslabs/amazon-transcribe-streaming-sdk)
- [Amazon Polly Generative voices documentation](https://docs.aws.amazon.com/polly/latest/dg/generative-voices.html)
- [Amazon Bedrock Anthropic Claude messages API](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-anthropic-claude-messages.html)
- [aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock](https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock)

---

Source: https://callsphere.ai/blog/vw5h-build-voice-agent-aws-bedrock-claude-polly-transcribe
