Post-Call Sentiment + Topic Extraction: CallSphere vs Vapi
CallSphere runs GPT-4o-mini on every call: sentiment, lead score, intent, topics, satisfaction, escalation flag. Vapi has no native analytics layer.
TL;DR
Voice AI without analytics is just an answering machine that talks back. CallSphere ships post-call analytics by default: GPT-4o-mini runs on every healthcare call to extract sentiment (-1.0 to 1.0), lead score (0-100), intent, topics, satisfaction (1-5), escalation flag (boolean), and an AI-generated summary. Vapi.ai is voice infrastructure with no native analytics layer — customers must wire their own LLM analytics pipeline. This post shows the analytics row schema, a sample row, the pipeline architecture, and what to ask in procurement.
Why Analytics Matter More Than the Voice Itself
Operations teams care less about the voice quality of an AI agent and more about three questions:
- Are calls solving the customer's problem? (intent + satisfaction)
- Are leads slipping through the cracks? (lead score + escalation flag)
- Are agents, human or AI, handling caller sentiment well? (sentiment trend by hour / day / queue)
These questions are answered with structured analytics on every call, not with manual sampling. A platform without native analytics forces the customer to build the analytics layer themselves — which most never finish.
The CallSphere call_log_analytics Schema
Every healthcare call (and calls in most other verticals) produces a row like:
| Column | Type | Range / Example |
|---|---|---|
| call_id | uuid | |
| sentiment_score | float | -1.0 to 1.0 |
| lead_score | int | 0 to 100 |
| intent | text | e.g., "appointment_booking", "billing_question" |
| topics | text[] | e.g., ["insurance", "rescheduling"] |
| satisfaction | int | 1 to 5 |
| escalation_flag | bool | |
| ai_summary | text | 2-3 sentences |
| analyzed_at | timestamp | |
| model_version | text | "gpt-4o-mini-2024-07-18" |
A sample row (synthetic):
sentiment_score: 0.42
lead_score: 78
intent: "new_patient_intake"
topics: ["insurance", "first_visit", "scheduling"]
satisfaction: 4
escalation_flag: false
ai_summary: "Caller is a new patient checking insurance acceptance and seeking first appointment. Provided Aetna PPO. Booked Tuesday 10am with Dr. Patel."
This row is queryable in any dashboard tool. Operations can filter:
- "Show me low-satisfaction calls in the last 7 days"
- "Top 10 topics for hot leads (lead_score > 80)"
- "Hourly sentiment trend for the front-desk queue"
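As an illustration, the first two filters might look like this in SQL, run here through Python's sqlite3 so the example is self-contained. The table and column names follow the call_log_analytics schema above; the rows and the date cutoff are synthetic.

```python
import sqlite3

# Self-contained demo: a local copy of the analytics table with two
# synthetic rows, queried the way an ops dashboard would query it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE call_log_analytics (
        call_id TEXT, sentiment_score REAL, lead_score INTEGER,
        intent TEXT, satisfaction INTEGER, escalation_flag INTEGER,
        analyzed_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO call_log_analytics VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("c1", 0.42, 85, "appointment_booking", 4, 0, "2025-01-10"),
        ("c2", -0.60, 40, "billing_question", 1, 1, "2025-01-12"),
    ],
)

# "Low-satisfaction calls in the last 7 days" (fixed cutoff for the demo)
low_sat = conn.execute(
    "SELECT call_id FROM call_log_analytics "
    "WHERE satisfaction <= 2 AND analyzed_at >= '2025-01-08'"
).fetchall()

# "Hot leads" (lead_score > 80) with their primary intent
hot = conn.execute(
    "SELECT call_id, intent FROM call_log_analytics WHERE lead_score > 80"
).fetchall()
```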
Vapi's Analytics Gap
Vapi customers who want this functionality must:
- Capture the post-call transcript via a webhook
- Send it to OpenAI / Anthropic with a structured-output prompt
- Parse and store the JSON
- Build dashboards on the analytics table
- Manage prompt drift and schema drift over time
- Add error handling, retries, idempotency
This is several engineer-weeks plus ongoing prompt engineering. And every change to the analytics schema requires re-running historical data — which most teams never do.
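A minimal sketch of the core DIY step (send transcript to an LLM, parse, validate). The prompt wording, JSON keys, and the fake_llm stand-in are illustrative; in production the call_llm argument would wrap a real OpenAI or Anthropic client with retries and idempotency handling.

```python
import json

# Hypothetical structured-output prompt; real prompts need examples,
# topic vocabularies, and version control.
PROMPT = (
    "Return JSON with keys: sentiment_score (-1..1), lead_score (0..100), "
    "intent, topics, satisfaction (1..5), escalation_flag. Transcript:\n"
)

def analyze_transcript(transcript, call_llm):
    raw = call_llm(PROMPT + transcript)   # the LLM round trip
    row = json.loads(raw)                 # parse the structured output
    # Range checks: the part of the DIY build teams skip until it breaks.
    if not -1.0 <= row["sentiment_score"] <= 1.0:
        raise ValueError("sentiment_score out of range")
    if not 0 <= row["lead_score"] <= 100:
        raise ValueError("lead_score out of range")
    if row["satisfaction"] not in (1, 2, 3, 4, 5):
        raise ValueError("satisfaction out of range")
    return row

# Offline stand-in for a real model response:
def fake_llm(prompt):
    return json.dumps({
        "sentiment_score": 0.42, "lead_score": 78,
        "intent": "new_patient_intake", "topics": ["insurance"],
        "satisfaction": 4, "escalation_flag": False,
    })

row = analyze_transcript("Caller asked whether the practice takes Aetna PPO.", fake_llm)
```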
CallSphere's Pipeline
The analytics pipeline runs as a background job after every call:
- Transcript and metadata captured to call_logs
- Job triggers GPT-4o-mini with structured prompt
- Output validated against schema (type checks + range checks)
- Row written to call_log_analytics
- Webhooks fire for downstream systems (CRM, BI)
- Dashboards refresh in near-real-time
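To make the idempotency concrete, here is a sketch of the job's write path, keyed by call_id so a re-run (or a forced backfill) never duplicates work. The dict-backed store and function names are illustrative, not CallSphere's actual code.

```python
# Illustrative in-memory store; in production this is the
# call_log_analytics table with call_id as a unique key.
analytics_store = {}

def run_analytics_job(call_id, transcript, analyze, force=False):
    if call_id in analytics_store and not force:
        return analytics_store[call_id]  # already analyzed: no-op
    row = analyze(transcript)
    row["model_version"] = "gpt-4o-mini-2024-07-18"  # logged with every row
    analytics_store[call_id] = row
    return row

def analyze(transcript):
    # Stand-in for the LLM analysis step
    return {"sentiment_score": 0.42, "lead_score": 78}

first = run_analytics_job("abc", "transcript text", analyze)
second = run_analytics_job("abc", "transcript text", analyze)  # no-op re-run
```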
Costs are predictable: GPT-4o-mini at ~$0.15 per 1M input tokens, with average call analysis ~2K tokens, makes per-call analytics cost ~$0.0003. Effectively free at any reasonable volume.
Mermaid: Analytics Pipeline
```mermaid
graph LR
    CALL[Call Ends] --> CAP[Capture Transcript]
    CAP --> CLOG[(call_logs)]
    CLOG --> JOB[Background Job]
    JOB --> LLM[GPT-4o-mini]
    LLM --> VAL[Schema Validation]
    VAL --> CLA[(call_log_analytics)]
    CLA --> DASH[Dashboards]
    CLA --> WHK[Webhooks]
    WHK --> CRM[CRM]
    WHK --> BI[BI]
    CLA --> BQ[BigQuery Export]
```
The pipeline is idempotent and re-runnable, so prompt or schema changes can backfill historical calls cleanly.
Comparison Table
| Analytics Capability | Vapi DIY | CallSphere |
|---|---|---|
| Sentiment scoring | Build yourself | Built-in |
| Lead scoring | Build yourself | Built-in |
| Intent detection | Build yourself | Built-in |
| Topic extraction | Build yourself | Built-in |
| Satisfaction estimate | Build yourself | Built-in |
| Escalation flag | Build yourself | Built-in |
| AI summary | Build yourself | Built-in |
| Schema-validated outputs | Build yourself | Default |
| Backfill on prompt changes | Build yourself | Built-in |
| Dashboard integration | Build yourself | Default |
| Webhook fanout | Build yourself | Default |
| Time-to-analytics | Weeks-months | Day 1 |
What "Lead Score" Actually Means in Healthcare
For healthcare voice, lead_score factors include:
- Caller is a new vs returning patient
- Insurance accepted by the practice
- Appointment booked vs not
- Caller intent strength ("just looking" vs "I need an MRI tomorrow")
- Sentiment trajectory through the call
- Likelihood-to-convert estimated from historical patterns
Scores in the 0-30 range typically represent informational inquiries; 30-60 are warm prospects; 60-80 are hot leads; 80+ are converted-on-call (e.g., booked an appointment, provided insurance).
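These bands can be captured in a small helper. Treating 30, 60, and 80 as belonging to the upper band is an assumption, since the ranges above share their edges:

```python
def lead_band(score: int) -> str:
    """Map a lead_score (0-100) to the bands described above."""
    if not 0 <= score <= 100:
        raise ValueError("lead_score must be 0-100")
    if score >= 80:
        return "converted-on-call"
    if score >= 60:
        return "hot"
    if score >= 30:
        return "warm"
    return "informational"
```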
Procurement-Friendly Analytics Checklist
- Is post-call analytics included by default?
- What model powers the analytics? (GPT-4o-mini, custom, etc.)
- What signals are extracted (sentiment, lead, intent, topics, satisfaction, escalation)?
- Are outputs schema-validated?
- Are the analytics queryable via SQL / BI tools?
- Are webhooks supported for CRM fanout?
- Are historical backfills supported on prompt updates?
- Are model versions logged with each row?
- What is the per-call analytics cost?
- Is analytics in scope for SOC 2 / HIPAA?
Real-World ROI Pattern
A 20-provider primary care group that switched from a Vapi-based intake bot (no analytics) to CallSphere reported the following changes after 90 days:
- Identified that "insurance acceptance" was the #1 unanswered topic, leading to FAQ updates that raised conversion 12%
- Found that low-satisfaction calls clustered around lunch hour due to staff hand-off issues; rescheduled coverage
- Caught 14 escalation_flag=true calls per week that had been silently dropping; recovered ~$60K in deferred services
Cost of CallSphere analytics: included. Cost of building the same in Vapi: estimated at 8 engineer-weeks plus ongoing maintenance.
CTA
Voice AI without analytics is missing the point. Book a CallSphere demo and see the analytics dashboard live, or check pricing.
FAQ
Why GPT-4o-mini specifically?
GPT-4o-mini provides excellent JSON schema conformance and topic extraction quality at near-trivial cost. It's purpose-fit for structured analytics on relatively short transcripts.
Can I bring my own analytics model?
Yes — enterprise plans support pluggable analytics. Customers have used Anthropic Claude Haiku, custom fine-tunes, or local models for on-prem deployments.
Are analytics rows tied back to specific moments in the call?
The transcript and timestamps are joinable via call_id, so a low-satisfaction flag can be traced to a specific transcript segment.
Is the lead-score model explainable?
Yes. The prompt asks the model to also return a brief reasoning string, captured in the row for ops review.
How is prompt drift managed?
Prompt versions are stored in git. The model_version column captures which prompt+model produced each row. Backfills can be scheduled when prompts change materially.
Deep Dive: Sentiment Score Interpretation
Sentiment is reported as a float from -1.0 to 1.0:
| Range | Interpretation | Typical Action |
|---|---|---|
| -1.0 to -0.5 | Strongly negative | Immediate manager review, possible outreach |
| -0.5 to -0.2 | Mildly negative | Review in daily summary |
| -0.2 to 0.2 | Neutral | No action |
| 0.2 to 0.5 | Mildly positive | Coaching recognition |
| 0.5 to 1.0 | Strongly positive | Best-practices identification |
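The table translates directly into routing logic. A sketch follows; boundary assignment at -0.5, -0.2, 0.2, and 0.5 is an assumption, since the ranges above share edges:

```python
def sentiment_action(score: float) -> str:
    """Map a sentiment_score (-1.0 to 1.0) to the action tiers above."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment_score must be in [-1.0, 1.0]")
    if score < -0.5:
        return "immediate manager review"
    if score < -0.2:
        return "review in daily summary"
    if score <= 0.2:
        return "no action"
    if score <= 0.5:
        return "coaching recognition"
    return "best-practices identification"
```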
Calibration: CallSphere's sentiment is calibrated against a held-out set of human-labeled calls. Model upgrades trigger a re-calibration to ensure score interpretations remain stable over time.
Topic Extraction in Practice
Topics are extracted as a multi-label set (a call can have multiple topics). The default vocabulary is vertical-specific:
Healthcare topics:
- appointment_booking, appointment_reschedule, appointment_cancel
- billing_question, insurance_verification, copay
- prescription_refill, medication_question
- new_patient, returning_patient
- referral, second_opinion
- complaint, compliment
Sales topics:
- pricing, demo_request, technical_question
- competitor_mention, integration_question
- decision_maker, budget_question, timeline
- objection_pricing, objection_features, objection_timing
Custom topic vocabularies can be defined per tenant.
Intent Classification
Intent is a single-label classification: the highest-confidence intent for the call as a whole. This is distinct from topics, which are multi-label.
Intent + topic together give a richer view than either alone:
- intent: "appointment_booking"
- topics: ["insurance", "first_visit", "scheduling"]
The pair lets ops dashboards filter by primary intent and drill into related topics.
Satisfaction Estimate
Satisfaction is a 1-5 estimate based on:
- Final sentiment of the call
- Whether the caller's primary goal was accomplished
- Sentiment trajectory (rising vs falling)
- Explicit satisfaction signals ("thank you", "that was helpful", "this is frustrating")
- Caller's tone in closing
Satisfaction differs from sentiment — a caller can be moderately negative throughout but ultimately satisfied if the issue was resolved.
Escalation Flag
The boolean escalation flag fires when the call meets any of:
- Caller explicitly requests a human ("can I talk to a person")
- Sentiment stays below -0.5 for a sustained stretch
- Healthcare emergency keywords detected
- Compliance-sensitive topics (e.g., complaint, threat)
- Agent could not resolve the caller's primary goal
When the flag fires during the call, a real-time alert can be sent to the operations team. After the call, the flag drives a "missed escalation" report.
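The trigger list can be sketched as a single predicate. The keyword sets and the "sustained" window semantics are assumptions for illustration; real detection runs on richer signals than substring matching.

```python
# Assumed keyword lists, for illustration only
EMERGENCY_KEYWORDS = {"chest pain", "overdose", "can't breathe"}
HUMAN_REQUESTS = {"talk to a person", "speak to a human"}

def should_escalate(transcript: str, recent_sentiment: list,
                    goal_resolved: bool) -> bool:
    text = transcript.lower()
    if any(kw in text for kw in HUMAN_REQUESTS | EMERGENCY_KEYWORDS):
        return True
    # "Sustained" dip: every score in the recent window below -0.5
    if recent_sentiment and all(s < -0.5 for s in recent_sentiment):
        return True
    # An unresolved primary goal also escalates
    return not goal_resolved
```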
AI Summary Quality
The AI summary is a 2-3 sentence narrative answering:
- Who called and what did they want?
- What did the agent do?
- What was the outcome?
Example summary: "New patient seeking dermatology consultation. Agent verified Aetna PPO acceptance and scheduled with Dr. Patel for Tuesday April 22 at 10am. Caller satisfied; no follow-up required."
The summary is the single most-used field by busy managers reviewing daily call digests.
Backfill Patterns
When prompts change materially, customers can backfill historical calls:
- Choose date range
- Choose call types
- Backfill writes new rows alongside originals (both retained for comparison)
- Difference report highlights which calls changed scores significantly
This is critical for honest reporting — without backfill, "improvements" in sentiment over time may just be prompt changes.
Cost Predictability
GPT-4o-mini at current pricing means:
- Average call: ~3-5 minutes, ~600-1,000 transcribed words, ~1,500-2,500 input tokens
- Analytics output: ~200-400 tokens
- Per-call analytics cost: ~$0.0003-$0.0005
For a 100K-call-per-month customer, total analytics LLM cost ~$30-50/month — effectively free relative to the value.
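The arithmetic checks out with a back-of-the-envelope helper. Note the output-token price here is an assumption (the post quotes only the $0.15 input price):

```python
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens (from the post)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens (assumed)

def per_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-call analytics LLM cost in USD."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# ~2,000 input + ~300 output tokens lands inside the $0.0003-$0.0005 range
cost = per_call_cost(2000, 300)
monthly = 100_000 * cost  # a 100K-call-per-month customer
```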
Privacy Considerations
Analytics is run on redacted transcripts (see the redaction post). PII spans are masked before the analytics LLM sees them. This means analytics outputs themselves carry minimal PII and can be safely stored in BI tools accessible to a broader audience.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.