
MOS Call Quality Scoring for AI Voice Operations in 2026: Beyond 4.2

MOS 4.3+ is the band where AI voice feels human. Drop below 3.6 and conversations break. Here is how to measure, improve, and alert on MOS in production AI voice using G.711, Opus, and the underlying packet loss / jitter / latency math.

Mean Opinion Score is the only call quality metric that matters in production AI voice in 2026. The ITU-T scale runs 1 to 5; 4.3 or higher is the band where AI voice feels human, and conversations start to break around 3.6. The math behind MOS converts packet loss, jitter, and one-way latency into a perceptual score via the E-Model. AI voice deployments running below 4.0 sustained will see customer satisfaction drop sharply, even when every other metric looks green.

Background

MOS was standardized by the ITU-T (recommendation P.800) in the 1990s as a subjective listening test: groups of trained listeners rate audio samples on a 1-to-5 scale, the average is the Mean Opinion Score. The objective MOS used in VoIP is computed via the E-Model (ITU-T G.107), which maps network metrics (latency, jitter, packet loss) plus codec choice into an R-factor and then to a MOS prediction.

The bands matter. MOS 4.3 to 5.0 is "excellent" - indistinguishable from in-person. MOS 4.0 to 4.2 is "good" - what a typical PSTN call sounds like. MOS 3.6 to 4.0 is "fair" - users notice but tolerate. MOS below 3.6 is "poor" - users start to abandon. AI voice has a tighter floor because the LLM-generated speech is already at risk of sounding robotic; combine that with a 3.5 MOS network and conversations break.

The dominant variables in 2026 are codec choice (G.711 caps at MOS 4.4 best case; Opus can hit 4.5+ at higher bitrates), one-way latency (target under 150 ms; degrades after 200 ms), packet loss (target under 1 percent; noticeable above 3 percent), and jitter (target under 30 ms; noticeable above 50 ms).
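Those targets are easy to encode as a first-pass health check on each call leg. A minimal Python sketch, using the threshold values quoted above (the function name and structure are illustrative, not from any monitoring product):

```python
# Sketch: flag per-call network metrics against the 2026 targets above.
# Thresholds are the ones quoted in this article, not an ITU standard.

def classify_call(latency_ms: float, loss_pct: float, jitter_ms: float) -> list[str]:
    """Return a list of threshold violations for one call leg."""
    issues = []
    if latency_ms > 150:
        issues.append("latency above 150 ms one-way target")
    if loss_pct > 1.0:
        issues.append("packet loss above 1 percent target")
    if jitter_ms > 30:
        issues.append("jitter above 30 ms target")
    return issues
```

A clean call returns an empty list; anything else names the variables worth investigating first.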


Architecture

flowchart TD
    A[RTP packets in / out] --> B[Per-call telemetry collector]
    B --> C[Compute one-way latency]
    B --> D[Compute jitter]
    B --> E[Compute packet loss]
    C --> F[E-Model R-factor]
    D --> F
    E --> F
    F --> G[Predicted MOS]
    G --> H{MOS < 4.0?}
    H -->|Yes| I[Alert + investigate]
    H -->|No| J[Log and continue]
    I --> K[Codec / network / transcoder root cause]

The E-Model is the canonical conversion. In simplified form: R = 93.2 - latency_impact - jitter_impact - loss_impact - codec_impact, then MOS = 1 + 0.035R + 0.000007R(R-60)(100-R) for 0 < R < 100 (MOS is 1 below that range and capped at 4.5 above it). Most VoIP monitoring tools (Twilio Voice Insights, Obkio, Paessler) implement this directly.
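The conversion can be sketched in a few lines of Python. The R-to-MOS mapping below follows the standard simplified formula; the impairment terms in estimate_r are rough illustrative approximations, not the full ITU-T G.107 model (which also covers the Is, A, and per-codec Ie-eff terms):

```python
# Minimal sketch of the simplified E-Model conversion described above.

def r_to_mos(r: float) -> float:
    """Map an R-factor to predicted MOS (formula valid for 0 < R < 100)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

def estimate_r(one_way_ms: float, jitter_ms: float, loss_pct: float,
               codec_ie: float = 0.0) -> float:
    """Rough R-factor: 93.2 baseline minus simplified impairments."""
    # Jitter-buffer depth adds effective delay (crude approximation).
    effective = one_way_ms + 2 * jitter_ms
    # Delay penalty is mild at first, then steep past ~177 ms.
    delay_imp = 0.024 * effective + 0.11 * max(0.0, effective - 177.3)
    loss_imp = 2.5 * loss_pct          # crude linear loss penalty
    return 93.2 - delay_imp - loss_imp - codec_ie
```

With zero impairments (R = 93.2), the mapping lands just above MOS 4.4, which matches the G.711 best-case ceiling cited earlier.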

CallSphere implementation

CallSphere measures MOS on every call across our six verticals. The Twilio Media Streams bridge that feeds OpenAI Realtime captures RTP-level telemetry; our call_quality table (one of 115+ DB tables) stores per-call latency, jitter, packet loss, and computed MOS. Healthcare AI calls are tagged with the patient ID (HIPAA-compliant) so we can correlate quality with clinical outcomes. Sales Calling AI tags with the lead ID so we can correlate quality with conversion rate. The MOS dashboard (one of 90+ tools) surfaces per-tenant rolling averages and triggers alerts when MOS drops below 4.0 sustained for 5 minutes. Default codec is Opus at 16 kHz for the AI side, transcoded to G.711 at the PSTN edge. Scale ($1499/mo) tenants get a per-call MOS report in the admin console; Growth ($499/mo) tenants get aggregate weekly. The 22% affiliate program credits Scale upgrades driven by quality SLAs.

Build steps

  1. Capture RTP statistics on every call: latency (RTT/2), jitter (RFC 3550 inter-arrival jitter), packet loss percentage.
  2. Compute MOS via the E-Model on each call, weighted by codec choice (G.711, Opus, AMR-WB).
  3. Store per-call MOS in your central database; tag with tenant, agent, and call type.
  4. Build a real-time dashboard showing rolling MOS averages per tenant per hour.
  5. Set alerts: MOS < 4.0 sustained for 5 min on a tenant; MOS < 3.6 on any single call.
  6. Investigate alerts root-cause: codec choice, network path, transcoder load, AI bridge buffer underrun.
  7. Tune codec selection per route: Opus where supported end-to-end; G.711 fallback at PSTN edge.
  8. Run weekly MOS reports per tenant; share with customer success on degradation trends.
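The jitter term in step 1 has a standard definition. A sketch of the RFC 3550 running inter-arrival jitter estimate, applied once per packet (transit time is arrival time minus RTP timestamp, both in a common clock unit):

```python
# Sketch of the RFC 3550 inter-arrival jitter estimate referenced in step 1.
# D is the change in relative transit time between consecutive packets;
# the running estimate J is smoothed with a 1/16 gain, per the RFC.

def update_jitter(j: float, transit_prev: float, transit_cur: float) -> float:
    """One RFC 3550 jitter update, in the same units as the transit times."""
    d = abs(transit_cur - transit_prev)
    return j + (d - j) / 16.0
```

The 1/16 gain means a single delayed packet nudges the estimate only slightly, while sustained jitter converges to its true level over a few dozen packets.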

FAQ

What is a "good" MOS for AI voice? 4.0 minimum sustained; 4.2 ideal. Below 4.0 the AI's already-synthetic speech starts to feel robotic; below 3.6 conversations break.

Does Opus beat G.711 for AI voice? Yes, when it runs end-to-end. Opus at 16 kHz can hit MOS 4.5+; G.711 caps at 4.4. The catch is that the PSTN side is typically G.711-only, so you transcode at the edge and lose some of the gain.


How much does latency hurt MOS? Significantly above 150 ms one-way. The E-Model penalizes latency steeply after 175 ms because it adds conversational delay that listeners notice.

Can I improve MOS just by buying better internet? Sometimes. If packet loss is the dominant variable, yes. If it is codec or one-way latency due to physical distance, no.

Does CallSphere expose MOS metrics? Yes, on Growth and Scale plans. Per-call detail is on Scale; aggregate weekly is on Growth. Starter shows session-level "good/fair/poor" labels only.


Start a 14-day trial with full MOS visibility, browse pricing for per-call analytics on Scale, or book a demo. Partners earn 22% via the affiliate program; SLA questions go to contact.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.