By Sagar Shankaran, Founder of CallSphere
Streaming changes the eval game — final-answer correctness isn't enough when users perceive the answer one token at a time. Here's the metric set that matters.
If your agent streams its output and your evaluation harness only grades the final concatenated string, you are blind to roughly half of what users experience. A streamed answer can be correct in aggregate and still feel awful: a slow first token, a jittery middle, an early sentence that contradicts a later one, or a confident hallucination that arrives at token 40 and gets corrected at token 200 (after the user has already started believing it). This post defines the streaming-specific metric set we run on every release of our voice and chat agents, with working Python code that wraps a streaming agent and emits TTFT, inter-token jitter, partial-answer correctness, mid-stream contradiction, and mid-stream claim-extraction-based hallucination signals. Models are pinned to gpt-4o-2024-11-20 for the agent under test and gpt-4.1-2025-04-14 for the streaming judge.
The standard agent evaluation stack (covered in our evaluation stack overview) grades the final output of a run against a reference or rubric. That works for batch and async use cases. It fails for any surface where the user reads or hears the response as it is generated, which now includes essentially every chat UI, every voice agent, and every realtime browser agent.
Concrete failure modes that final-answer eval misses:
You catch these only by evaluating along the stream, not at the end of it.
```mermaid
flowchart TD
    A[Agent stream] --> B[Token tap]
    B --> C[Latency metrics]
    B --> D[Rolling text buffer]
    D --> E{Every N tokens or sentence boundary}
    E --> F[Claim extractor LLM]
    F --> G[Claim history]
    G --> H[Contradiction check]
    G --> I[Groundedness check vs retrieved docs]
    D --> J[Partial-answer correctness probe]
    C --> K[Metrics sink]
    H --> K
    I --> K
    J --> K
    K --> L[LangSmith feedback per run]
    style F fill:#fef3c7
    style K fill:#dcfce7
    style L fill:#e0f2fe
```
Figure 1 — The streaming eval pipeline. The token tap is in-band with the agent stream; latency metrics fall out for free; semantic checks (claim extraction, contradiction, groundedness) run on a rolling window every N tokens or sentence boundary.
The pipeline has three parallel concerns: latency (cheap, deterministic), partial correctness (cheap if reference exists, judge-based otherwise), and mid-stream semantic drift (expensive — judge LLM on a rolling window).
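The "every N tokens or sentence boundary" trigger from Figure 1 can be sketched roughly as below. This is a minimal illustration, not the production trigger: the class name `BoundaryTrigger` and its fields are hypothetical, the 30-delta minimum is borrowed from the production notes later in the post, and the boundary test is deliberately naive (a delta ending in "3." would also fire).

```python
from dataclasses import dataclass

SENTENCE_END = (".", "?", "!")


@dataclass
class BoundaryTrigger:
    """Fires at a sentence boundary once enough deltas have accumulated."""

    min_tokens: int = 30  # assumed minimum number of deltas between extractions
    since_last: int = 0   # deltas seen since the last extraction

    def feed(self, delta: str) -> bool:
        """Record one streamed delta; return True when a checkpoint is due."""
        self.since_last += 1
        at_boundary = delta.rstrip().endswith(SENTENCE_END)
        if at_boundary and self.since_last >= self.min_tokens:
            self.since_last = 0
            return True
        return False
```

In a stream loop, `feed(data.delta)` would replace a fixed `token_count % N == 0` test, so extraction runs on complete sentences rather than on arbitrary cut points.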
| Metric | Type | What it catches | Cost |
|---|---|---|---|
| TTFT (time to first token) | Latency | "App feels broken" abandonment | Free |
| Inter-token p50 / p95 / p99 | Latency | Stream stutter, robotic TTS | Free |
| Stream smoothness (1 - CV of inter-token gaps) | Latency-derived | Jittery cadence | Free |
| Final-token latency | Latency | Slow completion at long contexts | Free |
| Partial-answer correctness @ N tokens | Judge | Wrong direction taken early | $$ |
| Mid-stream self-contradiction | Judge | Agent reverses itself silently | $$$ |
| Mid-stream hallucination (claim-extraction) | Judge + retrieval | Confident unsupported claims | $$$ |
| Premature commitment | Heuristic + judge | Decisions made before tool results | $$ |
| Stream-cancel safety | Heuristic | Useful answer if cut off mid-stream | $ |
The right-most column is the rough cost shape. Latency metrics fall out of the stream tap for free. The judge-based metrics are where the eval bill grows; we run them on 100% of dataset rows for full evals and on a 5% sample for online evals.
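As a rough sketch of how the "free" latency rows can be derived from one stream's timings, the helper below turns TTFT, the inter-token gaps, and the final latency into a summary dict. The function name and the output keys are illustrative assumptions; the percentiles use the standard library's `statistics.quantiles` with the inclusive method, which is one reasonable choice among several.

```python
import statistics


def latency_summary(ttft_ms: float, gaps_ms: list[float], final_ms: float) -> dict:
    """Summarize one stream's latency; percentiles need at least two gaps."""
    summary = {"ttft_ms": ttft_ms, "final_latency_ms": final_ms}
    if len(gaps_ms) >= 2:
        # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(gaps_ms, n=100, method="inclusive")
        summary.update(gap_p50_ms=cuts[49], gap_p95_ms=cuts[94], gap_p99_ms=cuts[98])
    return summary
```

A stream with 99 gaps of 8 ms and one 400 ms spike keeps its p50 at 8 ms while the p99 moves, which is why the tail percentiles are tracked alongside the median.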
This wraps an OpenAI Agents SDK streaming run and emits the metrics. Pin model snapshots.
```python
import asyncio
import json
import statistics
import time
from dataclasses import dataclass, field
from typing import Optional

from agents import Agent, Runner
from openai import AsyncOpenAI

JUDGE_MODEL = "gpt-4.1-2025-04-14"
judge = AsyncOpenAI()


@dataclass
class StreamMetrics:
    ttft_ms: Optional[float] = None
    inter_token_gaps_ms: list[float] = field(default_factory=list)
    final_latency_ms: Optional[float] = None
    partial_checkpoints: list[dict] = field(default_factory=list)
    extracted_claims: list[str] = field(default_factory=list)
    contradictions: list[dict] = field(default_factory=list)
    hallucinations: list[dict] = field(default_factory=list)


async def extract_claims(text: str) -> list[str]:
    """Return atomic factual claims from the text so far."""
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Extract atomic factual claims from the assistant text. "
                "Return one claim per line. Skip questions, hedges, and stylistic filler."
            )},
            {"role": "user", "content": text},
        ],
    )
    body = resp.choices[0].message.content or ""
    # Drop list markers and blank lines from the judge's output.
    return [c.strip("- ").strip() for c in body.splitlines() if c.strip()]


async def contradiction_check(claims: list[str]) -> list[list[str]]:
    """Return pairs of claims the judge considers logically contradictory."""
    if len(claims) < 2:
        return []
    joined = "\n".join(f"- {c}" for c in claims)
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a list of claims, identify any pair that is logically "
                "contradictory. Return JSON {\"contradictions\": [[\"claim A\", \"claim B\"]]}."
            )},
            {"role": "user", "content": joined},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}").get("contradictions", [])


async def groundedness_check(claim: str, evidence: str) -> dict:
    """Ask the judge whether a single claim is supported by the evidence."""
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Decide if CLAIM is supported by EVIDENCE. "
                "Return JSON {\"supported\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": f"CLAIM: {claim}\nEVIDENCE: {evidence}"},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}")


async def run_with_streaming_eval(agent: Agent, prompt: str, evidence: str) -> StreamMetrics:
    m = StreamMetrics()
    started = time.perf_counter()
    last_token_at: Optional[float] = None
    rolling: list[str] = []
    token_count = 0  # counts text deltas, used as a proxy for tokens
    CHECKPOINT_EVERY = 60  # tokens

    result = Runner.run_streamed(agent, input=prompt)
    async for event in result.stream_events():
        # Only raw text deltas carry streamed output; skip everything else.
        if event.type != "raw_response_event":
            continue
        data = event.data
        if data.type != "response.output_text.delta":
            continue

        now = time.perf_counter()
        if m.ttft_ms is None:
            m.ttft_ms = (now - started) * 1000
        if last_token_at is not None:
            m.inter_token_gaps_ms.append((now - last_token_at) * 1000)
        last_token_at = now
        rolling.append(data.delta)
        token_count += 1

        if token_count % CHECKPOINT_EVERY == 0:
            # Note: these inline awaits pause the loop, so the gap recorded
            # right after a checkpoint includes judge time.
            partial = "".join(rolling)
            claims = await extract_claims(partial)
            new_claims = [c for c in claims if c not in m.extracted_claims]
            m.extracted_claims.extend(new_claims)

            # Groundedness on each new claim
            for c in new_claims:
                g = await groundedness_check(c, evidence)
                if not g.get("supported", True):
                    m.hallucinations.append({
                        "claim": c,
                        "at_token": token_count,
                        "reason": g.get("reason"),
                    })

            # Self-contradiction across the whole claim history
            cs = await contradiction_check(m.extracted_claims)
            if cs:
                m.contradictions.append({"at_token": token_count, "pairs": cs})

            m.partial_checkpoints.append({
                "at_token": token_count,
                "claim_count": len(m.extracted_claims),
                "hallucination_count": len(m.hallucinations),
            })

    m.final_latency_ms = (time.perf_counter() - started) * 1000
    return m


def smoothness(gaps_ms: list[float]) -> float:
    """1 - coefficient of variation of the gaps, floored at 0."""
    if len(gaps_ms) < 5:
        return 1.0  # too few gaps to judge cadence
    mean = statistics.mean(gaps_ms)
    if mean == 0:
        return 1.0
    cv = statistics.pstdev(gaps_ms) / mean
    return max(0.0, 1.0 - cv)
```
A few production notes that are easy to miss:
- Run the judge calls off the stream loop: push checkpoints onto an asyncio.Queue consumed by a background task and surface results post-hoc; the snippet above keeps the inline form for clarity.
- Checkpoint on sentence boundaries: watch for ".", "?", and "!", and only extract when one fires and the token count since the last extraction is ≥ 30.

Inter-token gaps are not normally distributed. They cluster around ~8 ms when the model is ripping and spike to ~300–600 ms during what we suspect are scheduler hiccups on the OpenAI side. The metric we found most useful is the coefficient-of-variation-derived smoothness, 1 - (stdev / mean), clamped to [0, 1]. Values above 0.85 are subjectively fine; below 0.65, voice TTS pipelines start sounding stuttery.
| Smoothness range | User perception (voice) | Action |
|---|---|---|
| > 0.90 | Natural cadence | None |
| 0.80–0.90 | Slight unevenness | Watch trends |
| 0.65–0.80 | Noticeable stutter | Investigate model/region |
| < 0.65 | Robotic, broken | Page on-call |
We alert at p95 smoothness < 0.75 over 5-minute windows.
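To make the bands concrete, here is a small sketch that recomputes smoothness and maps a score onto the action column of the table. The function `smoothness_band` is hypothetical, and because the table's ranges share their edges, the boundary handling is an assumption: 0.90 and 0.80 are treated as "watch trends" and 0.65 as "investigate".

```python
import statistics


def smoothness(gaps_ms: list[float]) -> float:
    """1 - coefficient of variation, floored at 0 (same definition as above)."""
    if len(gaps_ms) < 5:
        return 1.0
    mean = statistics.mean(gaps_ms)
    if mean == 0:
        return 1.0
    return max(0.0, 1.0 - statistics.pstdev(gaps_ms) / mean)


def smoothness_band(score: float) -> str:
    """Map a smoothness score to an action; edge handling is an assumption."""
    if score > 0.90:
        return "none"
    if score >= 0.80:
        return "watch trends"
    if score >= 0.65:
        return "investigate model/region"
    return "page on-call"


steady = [8.0] * 20
spiky = [8.0] * 19 + [400.0]  # one long stall in an otherwise steady stream
print(smoothness_band(smoothness(steady)))  # none
print(smoothness_band(smoothness(spiky)))   # page on-call
```

A single 400 ms stall among twenty 8 ms gaps pushes the coefficient of variation above 1, so the score floors at 0. That sensitivity to isolated spikes is worth keeping in mind when reading per-session scores on short streams.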
Running a single hallucination check on the final answer has two problems: (a) the bad claim may have been corrected later, so the final string passes while the user already saw the wrong claim, and (b) catching it after the fact gives you no chance to intervene. Claim-by-claim mid-stream extraction lets you both grade per-token-window quality and, optionally, cancel the stream and re-run if a high-confidence hallucination fires before token 100.
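One way the optional cancel decision could look is sketched below. It assumes each hallucination record carries a `confidence` score in addition to the `at_token` field; the judge prompt shown earlier does not return a confidence, so that field, the 0.9 cutoff, and the function name are all hypothetical.

```python
def should_cancel(hallucinations: list[dict], token_limit: int = 100,
                  min_confidence: float = 0.9) -> bool:
    """True if any flagged claim is confident enough and arrived early enough."""
    return any(
        # "confidence" is an assumed field; records without it never trigger.
        h.get("confidence", 0.0) >= min_confidence and h["at_token"] < token_limit
        for h in hallucinations
    )
```

Keeping the decision a pure function of the recorded hallucinations makes it easy to replay against logged runs before enabling it on live traffic.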
In our voice-agent eval suite, mid-stream hallucination detection caught ~3.4x more hallucinations than end-of-stream evaluation alone, because the model's own self-correction at the end of a long answer was masking earlier bad claims. The companion piece on voice agent quality metrics has more on grounding-specific evaluators.
Some prompts have a "right direction" that should be observable early. Example: "What's the weather in Mumbai today?" — by token ~30, the answer should at minimum mention Mumbai and a temperature or condition; if it has wandered into the history of Mumbai's monsoon patterns, the direction is wrong even if the final answer ends up correct.
We model this as a checkpoint evaluator that takes (prompt, partial_text, expected_direction_rubric) and returns on_track: bool. The rubric is small — usually one or two sentences — and is part of the dataset row.
```python
import json


async def on_track(prompt: str, partial: str, rubric: str) -> dict:
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a user prompt, the assistant's partial answer so far, "
                "and a rubric for what an on-track answer should be doing by "
                "this point, return JSON {\"on_track\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": (
                f"PROMPT: {prompt}\nPARTIAL: {partial}\nRUBRIC: {rubric}"
            )},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}")
```
We run this at token 30, 90, and 200. Below 30, the answer is too short to grade. Above 200, the final-answer evaluator covers it.
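A minimal sketch of the checkpoint schedule, assuming the stream loop tracks a running token count: the helper reports which of the 30/90/200 checkpoints were crossed by the latest delta, so the probe fires once per checkpoint even if a delta advances the count by more than one. The name `due_checkpoints` is illustrative.

```python
CHECKPOINTS = (30, 90, 200)


def due_checkpoints(prev_count: int, new_count: int,
                    checkpoints: tuple[int, ...] = CHECKPOINTS) -> list[int]:
    """Return checkpoints crossed when the token count moves from prev to new."""
    return [c for c in checkpoints if prev_count < c <= new_count]
```

Inside the loop, each value returned for `(token_count - 1, token_count)` would trigger one `on_track` call with the partial text at that point.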
Every metric becomes a LangSmith feedback row keyed by the streaming run's run_id, so the continuous evaluation gate in CI/CD can apply thresholds the same way it does for non-streaming evaluators. The threshold shape we use:
| Metric | Floor | Max regression |
|---|---|---|
| ttft_ms p95 | < 900 ms | +120 ms |
| smoothness p50 | > 0.80 | -0.05 |
| partial_on_track @ 90 tokens | > 0.92 | -0.02 |
| mid_stream_contradiction_rate | < 0.5% | +0.2% |
| mid_stream_hallucination_rate | < 1.0% | +0.3% |
Floors stop slow drift; regression limits stop sudden cliffs. Both are required.
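The floor-plus-regression rule can be expressed as a small gate function. This is a sketch under stated assumptions, not the actual CI gate: the metric keys, the `Threshold` dataclass, and the representation of percentage rates as fractions (0.5% as 0.005) are illustrative choices, while the numbers come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    floor: float            # absolute bound from the "Floor" column
    max_regression: float   # allowed worsening versus the baseline run
    higher_is_better: bool


THRESHOLDS = {
    "ttft_ms_p95": Threshold(900.0, 120.0, higher_is_better=False),
    "smoothness_p50": Threshold(0.80, 0.05, higher_is_better=True),
    "partial_on_track_at_90": Threshold(0.92, 0.02, higher_is_better=True),
    "mid_stream_contradiction_rate": Threshold(0.005, 0.002, higher_is_better=False),
    "mid_stream_hallucination_rate": Threshold(0.010, 0.003, higher_is_better=False),
}


def gate(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, t in THRESHOLDS.items():
        cur, base = current[name], baseline[name]
        # After multiplying by sign, "bigger is better" holds for every metric.
        sign = 1.0 if t.higher_is_better else -1.0
        if sign * cur <= sign * t.floor:
            failures.append(f"{name}: {cur} breaches floor {t.floor}")
        if sign * (base - cur) > t.max_regression:
            failures.append(f"{name}: regressed from {base} to {cur}")
    return failures
```

Checking both conditions independently means a release can fail on regression alone while still sitting inside the floor, which is the "sudden cliff" case the floors would miss.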
Across April 2026 on our voice and chat agents handling roughly 280k sessions:
The ROI is in the regressions caught. In Q1 we caught four agent prompt changes that passed final-answer eval but failed streaming eval — three on partial correctness, one on a sudden TTFT regression caused by a tool added to the agent's tool list (which inflated the system prompt size). Without streaming eval, all four would have shipped.
Yes — the pipeline is model-agnostic. Replace the agent under test with any streaming runnable. The judge model is independent and can stay on a single high-quality model regardless of what the agent uses.
The extraction calls run off the critical path of the agent run — they consume the same stream the user does, but their results land asynchronously. The agent's perceived latency is unchanged. The eval run itself takes ~1.4x as long wall-clock as the underlying agent run.
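A rough sketch of that off-critical-path arrangement, using an `asyncio.Queue` drained by a background task. The function names are hypothetical, and `judge_fn` stands in for any async claim extractor; here the stream is any async iterator of text deltas rather than a real agent run.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

JudgeFn = Callable[[str], Awaitable[list[str]]]


async def eval_worker(queue: asyncio.Queue, judge_fn: JudgeFn, results: list[dict]) -> None:
    """Drain checkpoints until a None sentinel arrives."""
    while True:
        item = await queue.get()
        if item is None:
            break
        at_token, partial = item
        results.append({"at_token": at_token, "claims": await judge_fn(partial)})


async def stream_with_background_eval(deltas: AsyncIterator[str], judge_fn: JudgeFn,
                                      checkpoint_every: int = 60) -> list[dict]:
    """Consume the stream without awaiting the judge inside the loop."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list[dict] = []
    worker = asyncio.create_task(eval_worker(queue, judge_fn, results))
    rolling: list[str] = []
    count = 0
    async for delta in deltas:
        rolling.append(delta)
        count += 1
        if count % checkpoint_every == 0:
            queue.put_nowait((count, "".join(rolling)))  # never blocks the loop
    queue.put_nowait(None)  # sentinel: no more checkpoints
    await worker            # results surface after the stream ends
    return results
```

Because the loop only enqueues, inter-token gap measurements are not inflated by judge latency; the trade-off is that results arrive after the fact rather than in time to intervene.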
Same way we version reference outputs in the golden dataset workflow — they live in the dataset row metadata with a rubric_version field, and judge runs are tagged with the version so old experiments are still comparable.
Slightly different. For JSON streams, partial parsing is structural rather than semantic — you check that the partial JSON is prefix-valid and that critical keys appear in order. We use a streaming JSON parser and grade "is the schema being respected as it streams?" rather than running claim extraction.
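As a coarse illustration of structural grading, the sketch below checks two things on a partial JSON string: that no closing bracket mismatches its opener (ignoring string contents), and that the keys seen so far follow the expected order. It is not a full prefix-validity check and does not replace a streaming JSON parser; both function names are hypothetical, and the key check assumes a flat object with unique keys.

```python
import re

PAIRS = {"}": "{", "]": "["}


def brackets_consistent(partial: str) -> bool:
    """Coarse check: no closing bracket may mismatch, ignoring string contents."""
    stack: list[str] = []
    in_string = False
    escaped = False
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return True


def keys_in_order(partial: str, expected: list[str]) -> bool:
    """Check that the keys seen so far form a prefix of the expected order."""
    seen = [k for k in re.findall(r'"([^"\\]+)"\s*:', partial) if k in expected]
    return seen == expected[: len(seen)]
```

Both checks can run on every delta at negligible cost, which keeps structured-output streams in the "free" tier of the metric table.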
Cautiously. We do this only on online evals at production traffic, not on offline evals (where the goal is to measure, not intervene). Cancellation triggers a full re-run with a different seed/temperature. We cancel ~0.4% of streams this way; user-perceived quality on canceled-and-re-streamed sessions is statistically indistinguishable from sessions that streamed cleanly the first time.