By Sagar Shankaran, Founder of CallSphere
Wire Amazon Transcribe streaming, Claude 4.7 Sonnet on Bedrock, and Polly generative voices into a sub-second voice agent. Real Python + boto3 code, IAM policy, and production tips.
Key takeaways
TL;DR: Amazon Transcribe streams partial transcripts over a WebSocket-style StartStreamTranscription API. You forward final segments to Claude 4.7 Sonnet on Bedrock with the InvokeModel API, then synthesize the reply with Polly's generative engine. Three AWS clients, one event loop, ~700ms voice-to-voice on us-east-1.
A Python service that accepts raw 16kHz PCM over an HTTP POST (or a WebSocket), pipes the audio into Amazon Transcribe streaming, sends each finalized utterance to anthropic.claude-sonnet-4-7-20250620-v1:0 on Bedrock, then streams the response text into Polly with the generative engine. The whole agent runs on a single t3.small EC2 instance or in a Fargate task, with no GPU required.
Region: us-east-1 (request access for Anthropic models in the Bedrock console).
IAM permissions: transcribe:StartStreamTranscription, bedrock:InvokeModel, and polly:SynthesizeSpeech.
Python: boto3>=1.34, amazon-transcribe>=0.6.2, and asyncio (standard library).
flowchart LR
CALLER[Caller / Browser] -->|PCM16 16kHz| APP[Python Agent]
APP -->|StartStreamTranscription| TRANS[Amazon Transcribe Streaming]
TRANS -->|partial + final| APP
APP -->|InvokeModel claude-4-7-sonnet| BR[Amazon Bedrock]
BR -->|text reply| APP
APP -->|SynthesizeSpeech engine=generative| POLLY[Amazon Polly]
POLLY -->|MP3 / PCM| CALLER
Attach this minimal inline policy to the EC2 instance role or task role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscription",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-7-20250620-v1:0"
    },
    {
      "Effect": "Allow",
      "Action": "polly:SynthesizeSpeech",
      "Resource": "*"
    }
  ]
}
```
```python
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler


class Handler(TranscriptResultStreamHandler):
    def __init__(self, stream, on_final):
        super().__init__(stream)
        self.on_final = on_final

    async def handle_transcript_event(self, event):
        # Forward only finalized segments; partial results are skipped here.
        for r in event.transcript.results:
            if not r.is_partial and r.alternatives:
                await self.on_final(r.alternatives[0].transcript)


async def transcribe(pcm_iter, on_final):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def feed():
        # Push audio chunks into Transcribe, then close the input stream.
        async for chunk in pcm_iter:
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    # Send audio and consume transcript events concurrently.
    await asyncio.gather(feed(), Handler(stream.output_stream, on_final).handle_events())
```
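The transcribe() function above expects pcm_iter to be an async iterator of raw audio chunks, but the article does not show one. Here is a minimal sketch of such an iterator, assuming the audio is already in memory as 16-bit mono PCM. The name pcm_chunks and the 100ms chunk size are illustrative choices, not something the amazon-transcribe SDK requires.

```python
import asyncio


async def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100):
    """Yield fixed-duration slices of 16-bit mono PCM (hypothetical helper)."""
    # 2 bytes per sample, so 100 ms at 16 kHz is 3200 bytes.
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]
        # Hand control back to the event loop so other tasks can run between chunks.
        await asyncio.sleep(0)
```

In a live call you would replace the in-memory buffer with whatever delivers audio (a WebSocket receive loop, for example) and keep the same async-generator shape, so it can be passed straight to transcribe(pcm_chunks(audio), on_final).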
```python
import json

import boto3

br = boto3.client("bedrock-runtime", region_name="us-east-1")


def ask_claude(history, user_text):
    history.append({"role": "user", "content": [{"type": "text", "text": user_text}]})
    resp = br.invoke_model(
        modelId="anthropic.claude-sonnet-4-7-20250620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "system": "You are a concise voice agent. Keep replies under 2 sentences.",
            "messages": history,
        }),
    )
    text = json.loads(resp["body"].read())["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"type": "text", "text": text}]})
    return text
```
For lower latency, switch to invoke_model_with_response_stream and pipe deltas straight into Polly.
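To make the streaming variant concrete, here is a hedged sketch of pulling text deltas out of the event stream that invoke_model_with_response_stream returns in resp["body"]. It assumes the Anthropic Messages streaming format, where each event carries a JSON payload under chunk.bytes and text arrives in content_block_delta events. The helper name iter_text_deltas is made up for this example.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(events: Iterable[dict]) -> Iterator[str]:
    """Yield text fragments from a Bedrock response stream (hypothetical helper)."""
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue  # ignore events that carry no payload chunk
        payload = json.loads(chunk["bytes"])
        # Only content_block_delta events with a text_delta carry reply text.
        if payload.get("type") == "content_block_delta":
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                yield delta["text"]
```

With a real client you would pass the same modelId and body as in ask_claude to br.invoke_model_with_response_stream and iterate iter_text_deltas(resp["body"]). Because the helper takes any iterable of dicts, it can be exercised with canned events before touching AWS.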
```python
import boto3

polly = boto3.client("polly", region_name="us-east-1")


def synth(text, voice="Ruth"):  # Ruth/Stephen are generative voices
    out = polly.synthesize_speech(
        Text=text,
        VoiceId=voice,
        OutputFormat="pcm",
        SampleRate="16000",
        Engine="generative",
    )
    return out["AudioStream"].read()
```
Generative voices add ~150ms vs neural but sound dramatically more human; use neural for stricter latency budgets.
Use a simple energy-based VAD (RMS threshold) to chunk inputs to Transcribe; throw away anything below 600ms of speech. For barge-in, kill the current Polly playback the moment Transcribe emits a non-empty partial. For tool-use, switch from invoke_model to Bedrock's converse API which supports native tool calling — Claude returns a toolUse block, you execute, and reply with a toolResult block.
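The energy-based VAD described above could look roughly like this. It is a sketch under stated assumptions: 16-bit little-endian mono PCM, 20ms frames, and an RMS threshold of 500 that is a placeholder to tune against your own audio. The function names are illustrative.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def speech_segments(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
                    threshold: float = 500.0, min_speech_ms: int = 600):
    """Return (start_byte, end_byte) spans of speech lasting at least min_speech_ms."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000
    segments, start = [], None
    for offset in range(0, len(pcm), frame_bytes):
        voiced = rms(pcm[offset:offset + frame_bytes]) >= threshold
        if voiced and start is None:
            start = offset                    # speech begins
        elif not voiced and start is not None:
            segments.append((start, offset))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(pcm)))
    # Discard bursts shorter than the minimum speech duration.
    min_bytes = sample_rate * 2 * min_speech_ms // 1000
    return [(s, e) for s, e in segments if e - s >= min_bytes]
```

A production VAD usually adds a short hangover period so brief pauses inside a sentence do not split the segment. This version leaves that out to keep the thresholding logic visible.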
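```dockerfile
FROM python:3.11-slim
RUN pip install boto3 amazon-transcribe uvicorn fastapi
# Run from /app so that "app:app" resolves to /app/app.py rather than the /app directory.
WORKDIR /app
COPY app.py /app/app.py
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```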
Build and push the image: docker build -t voice-agent . && aws ecr-public get-login-password | docker login ... && docker push ... Then run it as a Fargate service behind an NLB, with mTLS to Twilio if you're terminating PSTN.
Convert Twilio's mu-law 8kHz frames to PCM16 16kHz with audioop.ulaw2lin + audioop.ratecv before forwarding into the Transcribe stream. Reverse the chain (PCM16 16kHz → mu-law 8kHz) on Polly output frames before media events back to Twilio.
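The article's route for the inbound conversion is audioop.ulaw2lin plus audioop.ratecv, which works on the Python 3.11 image used here. Since audioop was removed from the standard library in Python 3.13, the sketch below shows the same inbound step in pure Python as a stand-in. It decodes G.711 mu-law and doubles the sample rate with naive linear interpolation, which is not the same filter ratecv applies. The function names are illustrative.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def twilio_to_pcm16_16k(ulaw: bytes) -> bytes:
    """Convert 8 kHz mu-law bytes to 16 kHz PCM16 (little-endian)."""
    samples = [ulaw_to_linear(b) for b in ulaw]
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)   # midpoint between neighbours doubles the rate
    return struct.pack(f"<{len(out)}h", *out)
```

Twilio media frames arrive base64-encoded, so you would decode the payload first and then pass the bytes to twilio_to_pcm16_16k before sending them into the Transcribe stream.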
us-east-1, eu-west-1, ap-northeast-1 as of May 2026.
Set Config(retries={"max_attempts": 1, "mode": "standard"}) on the Bedrock client; the default exponential backoff will blow your latency budget.
CallSphere's Healthcare voice stack runs on FastAPI :8084 with OpenAI Realtime as the primary path because we measured 350ms lower TTFT than Bedrock InvokeModel for short utterances. We keep an AWS Bedrock + Polly fallback wired through the same FastAPI surface for HIPAA-locked tenants who need their audio to never leave AWS, and Claude 4.7 Sonnet on Bedrock powers our 90+ tools across 6 verticals. We run 37 voice agents under one orchestration layer with 115+ Postgres tables tracking every turn. Pricing tiers are $149/$499/$1499 with a 14-day trial and a 22% lifetime affiliate cut.
Q: Why not just use Bedrock AgentCore? AgentCore is great for chat but doesn't give you raw audio control — you can't bridge Twilio media streams without a wrapper service anyway. Going direct to Transcribe + InvokeModel + Polly keeps you in the audio path.
Q: Can I use Nova Sonic instead of this stack? Nova Sonic (Amazon's speech-to-speech model) is excellent and cuts latency further, but it's currently only routable through Bedrock InvokeModelWithBidirectionalStream which requires SigV4 signing on a streaming socket — more code than this tutorial.
Q: How do I handle PHI? Sign a BAA with AWS, enable VPC endpoints for all three services so audio never traverses the public internet, and turn off Transcribe content redaction logging.
Q: What's the realistic latency? On us-east-1 with warm clients: Transcribe partial ~250ms, Bedrock TTFT ~400ms, Polly first-byte ~200ms. Voice-to-voice ~700ms.
Q: Can I stream Claude's output into Polly? Yes. Use invoke_model_with_response_stream, accumulate deltas up to sentence boundaries (. ! ?), and call Polly per sentence. Cuts perceived latency by 40%.
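The sentence-accumulation step can be sketched as a small generator. This is one possible approach, not the only one: it emits a sentence only once a terminator (. ! ?) is followed by whitespace, so a decimal like "3.14" split across two deltas is not cut in half, and it flushes whatever remains when the stream ends. The name sentences is illustrative.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: one or more terminators followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]+\s+")


def sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into whole sentences (hypothetical helper)."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        match = _BOUNDARY.search(buffer)
        while match:
            sentence = buffer[:match.end()].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
            match = _BOUNDARY.search(buffer)
    # Flush the trailing text once the model stops streaming.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence can be handed to synth() as soon as it is complete, so Polly starts speaking while Claude is still generating the rest of the reply.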
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.