---
title: "Capstone: Building a Voice-Enabled Appointment Booking System from Scratch"
description: "Build a complete voice-powered appointment booking system using Twilio, speech-to-text, text-to-speech, calendar integration, and intelligent booking logic with a FastAPI backend."
canonical: https://callsphere.ai/blog/capstone-voice-enabled-appointment-booking-system
category: "Learn Agentic AI"
tags: ["Capstone Project", "Voice AI", "Twilio", "Appointment Booking", "STT/TTS", "Full-Stack AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T23:29:05.910Z
---

# Capstone: Building a Voice-Enabled Appointment Booking System from Scratch

> Build a complete voice-powered appointment booking system using Twilio, speech-to-text, text-to-speech, calendar integration, and intelligent booking logic with a FastAPI backend.

## System Architecture

A voice-enabled appointment booking system takes an inbound phone call, converts speech to text, processes the request through an AI agent, books or modifies appointments in a calendar, and speaks the response back to the caller. This capstone integrates Twilio for telephony, Deepgram for speech-to-text, OpenAI for the conversational agent, ElevenLabs for natural text-to-speech, and a PostgreSQL database for appointment storage.

The call flow is: Twilio receives the call and opens a WebSocket media stream to your backend. Your FastAPI backend receives raw audio frames, streams them to Deepgram for real-time transcription, sends the transcript to an AI agent, receives the agent response, converts it to speech via ElevenLabs, and streams the audio back through the Twilio WebSocket.

## Database Schema for Appointments

```python
# models.py
import uuid

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, String, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base

# Declarative base class shared by all ORM models below.
Base = declarative_base()

class Provider(Base):
    __tablename__ = "providers"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name = Column(String(200), nullable=False)
    specialty = Column(String(100))
    timezone = Column(String(50), default="America/New_York")

class TimeSlot(Base):
    __tablename__ = "time_slots"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    provider_id = Column(UUID(as_uuid=True), ForeignKey("providers.id"))
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime, nullable=False)
    is_available = Column(Boolean, default=True)

class Appointment(Base):
    __tablename__ = "appointments"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    slot_id = Column(UUID(as_uuid=True), ForeignKey("time_slots.id"))
    patient_name = Column(String(200), nullable=False)
    patient_phone = Column(String(20), nullable=False)
    reason = Column(String(500))
    confirmed = Column(Boolean, default=False)
    # func.now() renders as a SQL expression; a plain string would be stored
    # as a quoted literal default instead of being evaluated per row.
    created_at = Column(DateTime, server_default=func.now())
```

## Twilio WebSocket Integration

Twilio sends a webhook when a call arrives. You respond with TwiML that opens a bidirectional media stream to your server.

```mermaid
flowchart LR
    CALLER([Caller])
    TWILIO["Twilio media stream"]
    API{"FastAPI WebSocket handler"}
    STT["Deepgram speech-to-text"]
    AGENT["Booking agent"]
    DB[("PostgreSQL appointments")]
    TTS["ElevenLabs text-to-speech"]
    CALLER --> TWILIO --> API
    API --> STT --> AGENT
    AGENT --> DB
    AGENT --> TTS --> API
    API --> TWILIO --> CALLER
    style API fill:#4f46e5,stroke:#4338ca,color:#fff
    style AGENT fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#059669,stroke:#047857,color:#fff
```

```python
# routes/twilio.py
from fastapi import APIRouter, Request
from fastapi.responses import Response

router = APIRouter()

@router.post("/incoming-call")
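async def handle_incoming_call(request: Request):
    # Assumption: the public hostname Twilio reached (for example an ngrok
    # domain) arrives in the Host header and also serves /media-stream.
    host = request.headers["host"]
    # <Connect><Stream> opens a bidirectional media stream to the backend.
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <Stream url="wss://{host}/media-stream" />
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")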
```

The WebSocket handler receives audio frames from Twilio and manages the conversation loop.

```python
# routes/media_stream.py
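import base64
import json

from fastapi import APIRouter, WebSocket

from services.stt import connect_deepgram

router = APIRouter()

@router.websocket("/media-stream")
async def media_stream(ws: WebSocket):
    await ws.accept()
    stream_sid = None  # needed later to address audio sent back to Twilio
    deepgram_ws = await connect_deepgram()
    conversation_history = []

    try:
        async for raw in ws.iter_text():
            msg = json.loads(raw)

            if msg["event"] == "start":
                stream_sid = msg["start"]["streamSid"]

            elif msg["event"] == "media":
                # Twilio sends base64-encoded 8 kHz mu-law audio.
                audio_bytes = base64.b64decode(msg["media"]["payload"])
                await deepgram_ws.send(audio_bytes)

            elif msg["event"] == "stop":
                break
    finally:
        # Close the Deepgram socket even if the caller hangs up mid-stream.
        await deepgram_ws.close()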
```
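
The handler above only forwards inbound audio. To stream synthesized speech back, as the call flow describes, each chunk has to be wrapped in a `media` message that carries the `streamSid` captured from the `start` event. A minimal sketch, assuming the audio is already 8 kHz mu-law; the function name and chunk size are illustrative and not part of the original code:

```python
import base64
import json
from typing import Iterator

def build_media_messages(
    audio: bytes, stream_sid: str, chunk_size: int = 3200
) -> Iterator[str]:
    """Yield Twilio `media` messages carrying base64-encoded mu-law audio."""
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Inside the handler, each message would be sent with `await ws.send_text(message)`.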

## Booking Agent with Tool Calls

The AI agent uses tools to check availability, book slots, and cancel appointments.

```python
# agents/booking_agent.py
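from datetime import datetime, timedelta

from agents import Agent, function_tool  # OpenAI Agents SDK

from models import Appointment, Provider, TimeSlot

# Assumption: `db` is a SQLAlchemy session created elsewhere in the app.

@function_tool
def check_availability(provider_name: str, date: str) -> str:
    """Check available time slots for a provider on a given date."""
    target = datetime.strptime(date, "%Y-%m-%d")
    slots = db.query(TimeSlot).join(Provider).filter(
        Provider.name.ilike(f"%{provider_name}%"),
        TimeSlot.start_time >= target,
        TimeSlot.start_time < target + timedelta(days=1),
        TimeSlot.is_available == True,
    ).order_by(TimeSlot.start_time).all()
    # Assumption: report the open start times as one short sentence.
    if not slots:
        return f"No open slots for {provider_name} on {date}."
    times = ", ".join(s.start_time.strftime("%H:%M") for s in slots)
    return f"Open slots for {provider_name} on {date}: {times}."

@function_tool
def book_appointment(
    patient_name: str, patient_phone: str, slot_time: str, reason: str
) -> str:
    """Book an appointment at the specified time."""
    slot = db.query(TimeSlot).filter(
        TimeSlot.start_time == datetime.strptime(slot_time, "%Y-%m-%d %H:%M"),
        TimeSlot.is_available == True,
    ).first()
    if not slot:
        return "That time slot is no longer available."
    slot.is_available = False
    # patient_phone is NOT NULL in the Appointment model, so it must be supplied.
    appt = Appointment(
        slot_id=slot.id, patient_name=patient_name,
        patient_phone=patient_phone, reason=reason, confirmed=True,
    )
    db.add(appt)
    db.commit()
    return f"Appointment booked for {patient_name} at {slot_time}."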

booking_agent = Agent(
    name="Booking Agent",
    instructions="""You are a friendly appointment booking assistant on a phone call.
    Always confirm the provider, date, time, and reason before booking.
    Speak naturally since the caller is listening to TTS output.
    Keep responses under 2 sentences for quick voice delivery.""",
    tools=[check_availability, book_appointment],
)
```
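
Because the instructions ask for short, natural replies, tool output reads better over TTS when times are phrased the way a person would say them. The helpers below are a hypothetical addition, not part of the original agent, that turn slot start times into a single spoken sentence:

```python
from datetime import datetime

def spoken_time(dt: datetime) -> str:
    """Format a time the way a person would say it, e.g. '9:30 AM'."""
    hour = dt.hour % 12 or 12  # map 0 and 12 to 12 on a 12-hour clock
    suffix = "AM" if dt.hour < 12 else "PM"
    return f"{hour}:{dt.minute:02d} {suffix}"

def describe_slots(slots: list[datetime], limit: int = 3) -> str:
    """Summarize up to `limit` open slots in one short sentence."""
    if not slots:
        return "There are no open slots that day."
    times = [spoken_time(s) for s in sorted(slots)[:limit]]
    if len(times) == 1:
        return f"The only open slot is at {times[0]}."
    return f"Open slots are at {', '.join(times[:-1])} and {times[-1]}."
```

Capping the list at a few options keeps the reply within the two-sentence limit set in the agent instructions.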

## Speech-to-Text and Text-to-Speech Pipeline

Connect Deepgram for real-time STT with interim results, and ElevenLabs for low-latency TTS streaming.

```python
# services/stt.py
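import os

import httpx
import websockets

# Assumption: the ElevenLabs voice is configured through an environment variable.
VOICE_ID = os.environ["ELEVENLABS_VOICE_ID"]

async def connect_deepgram():
    # Twilio media streams carry headerless 8 kHz mono mu-law audio, so the
    # encoding and sample rate have to be declared explicitly.
    url = (
        "wss://api.deepgram.com/v1/listen"
        "?model=nova-2&punctuate=true&interim_results=true"
        "&encoding=mulaw&sample_rate=8000&channels=1"
    )
    # Note: newer releases of the websockets library name this argument
    # additional_headers instead of extra_headers.
    ws = await websockets.connect(url, extra_headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"
    })
    return ws

async def stream_tts(text: str) -> bytes:
    """Convert text to speech using ElevenLabs streaming API."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            # ulaw_8000 matches the audio format Twilio plays back on a call.
            params={"output_format": "ulaw_8000"},
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            json={"text": text, "model_id": "eleven_turbo_v2"},
        )
        resp.raise_for_status()  # surface auth or quota errors early
        return resp.content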
```
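
With interim results enabled, Deepgram emits several messages per utterance, and only the finalized ones should be handed to the agent. The sketch below assumes Deepgram's streaming `Results` payload, where the text sits under `channel.alternatives[0].transcript` and `is_final` marks a finalized segment; the function name is illustrative:

```python
import json
from typing import Optional

def extract_final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript of a finalized Deepgram result, else None."""
    msg = json.loads(raw_message)
    if msg.get("type") != "Results" or not msg.get("is_final"):
        return None  # metadata messages or interim hypotheses
    alternatives = msg.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    text = alternatives[0].get("transcript", "").strip()
    return text or None  # silence produces empty transcripts
```

Interim messages are still useful for detecting that the caller has started speaking, even though they are not passed to the agent.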

## Deployment and Testing

Deploy with Docker Compose using three services: the FastAPI backend, PostgreSQL, and an ngrok container for exposing your local WebSocket to Twilio during development. For production, deploy behind an nginx reverse proxy with TLS and configure Twilio to point to your domain.

Test the booking flow end-to-end by calling your Twilio number, requesting an appointment, confirming the details, and verifying the database record. Automated testing uses recorded audio fixtures played through the WebSocket handler.
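
A recorded fixture can be replayed by wrapping it in the same JSON events Twilio sends. The generator below is a sketch under stated assumptions: the fixture is headerless 8 kHz mu-law audio, frames are 20 ms (160 bytes), and the stream SID is a made-up test value:

```python
import base64
import json
from typing import Iterator

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio

def fixture_to_events(audio: bytes, stream_sid: str = "MZtest") -> Iterator[str]:
    """Yield start, media, and stop events that mimic a Twilio media stream."""
    yield json.dumps({
        "event": "start",
        "streamSid": stream_sid,
        "start": {"streamSid": stream_sid},
    })
    for offset in range(0, len(audio), FRAME_BYTES):
        frame = audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
    yield json.dumps({"event": "stop", "streamSid": stream_sid})
```

In a test, these strings could be sent through FastAPI's `TestClient.websocket_connect`, with the Deepgram connection replaced by a stub.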

## FAQ

### How do I handle interruptions when the caller speaks over the AI?

Implement barge-in detection by monitoring the Deepgram transcript stream while TTS audio is playing. When new speech is detected, immediately stop TTS playback by sending a `clear` message on the Twilio WebSocket, which discards any audio Twilio has buffered, then process the new utterance.
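
Twilio's media stream protocol defines a `clear` message for this purpose, addressed by the `streamSid` of the active call. A minimal sketch of the barge-in decision, where `tts_playing` is a hypothetical flag the handler would maintain and the helper names are illustrative:

```python
import json
from typing import Optional

def build_clear_message(stream_sid: str) -> str:
    """Build the message that tells Twilio to drop buffered outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})

def barge_in_message(
    transcript: str, tts_playing: bool, stream_sid: str
) -> Optional[str]:
    """Return a clear message if the caller spoke while TTS was playing."""
    if tts_playing and transcript.strip():
        return build_clear_message(stream_sid)
    return None
```

Checking interim transcripts rather than waiting for final ones lets playback stop as soon as the caller starts talking.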

### What latency should I target for a natural voice experience?

Aim for under 800ms total round trip from end of speech to the start of response audio. As a rough budget, Deepgram Nova-2 typically returns final transcripts within 200ms, the LLM response takes 300-400ms, and ElevenLabs streaming TTS begins output within 200ms, which adds up to 700-800ms.
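
To check a call against that budget, it helps to time each stage separately. The sketch below is an illustrative helper (the class and stage names are assumptions) that records per-stage latency and compares the total with the 800ms target:

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Record per-stage latency in milliseconds against a total target."""

    def __init__(self, target_ms: float = 800.0):
        self.target_ms = target_ms
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and store the result under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000

    @property
    def total_ms(self) -> float:
        return sum(self.stages.values())

    def within_budget(self) -> bool:
        return self.total_ms <= self.target_ms
```

A handler would wrap each step, for example `with budget.stage("tts"):` around the text-to-speech call, and log `budget.stages` at the end of every turn.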

### How do I prevent double-booking?

Use a database-level unique constraint (for example, on `appointments.slot_id`) or a `SELECT ... FOR UPDATE` lock on the time slot row. Wrap the availability check and booking in a single database transaction so that concurrent callers cannot book the same slot.
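
The unique-constraint option can be demonstrated without a running PostgreSQL server. The sketch below uses Python's built-in `sqlite3` as a stand-in (an assumption made for portability; the table is a simplified version of `appointments`) and shows the second booking of the same slot being rejected by the database itself:

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified appointments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE appointments ("
        " id INTEGER PRIMARY KEY,"
        " slot_id TEXT NOT NULL UNIQUE,"  # at most one appointment per slot
        " patient_name TEXT NOT NULL)"
    )
    return conn

def try_book(conn: sqlite3.Connection, slot_id: str, patient_name: str) -> bool:
    """Insert an appointment; return False if the slot is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO appointments (slot_id, patient_name) VALUES (?, ?)",
                (slot_id, patient_name),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

In the SQLAlchemy model, the equivalent would be `unique=True` on `Appointment.slot_id`. With the locking approach, the slot query would use `with_for_update()` instead.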

---

Source: https://callsphere.ai/blog/capstone-voice-enabled-appointment-booking-system