Token-Level Evaluation of Streaming Agents: TTFT, Stream Smoothness, and Mid-Stream Hallucination Detection
Streaming changes the eval game — final-answer correctness isn't enough when users perceive the answer one token at a time. Here's the metric set that matters.
TL;DR
If your agent streams its output and your evaluation harness only grades the final concatenated string, you are blind to roughly half of what users experience. A streamed answer can be correct in aggregate and still feel awful — slow first token, jittery middle, an early sentence that contradicts a later one, a confident hallucination that arrives in token 40 and gets corrected in token 200 (after the user already started believing it). This post defines the streaming-specific metric set we run on every release of our voice and chat agents, with working Python code that wraps a streaming agent and emits TTFT, inter-token jitter, partial-answer correctness, mid-stream contradiction, and mid-stream claim-extraction-based hallucination signals. Models pinned to gpt-4o-2024-11-20 for the agent under test and gpt-4.1-2025-04-14 for the streaming judge.
Why Final-Answer Eval Is Insufficient
The standard agent evaluation stack (covered in our evaluation stack overview) grades the final output of a run against a reference or rubric. That works for batch and async use cases. It fails for any surface where the user reads or hears the response as it is generated, which now includes essentially every chat UI, every voice agent, and every realtime browser agent.
Concrete failure modes that final-answer eval misses:
- Slow TTFT — the user perceives the agent as broken and abandons before the (correct) answer finishes streaming.
- Jittery streaming — long pauses between tokens make voice TTS sound robotic and chat UIs feel hung, even when total latency is fine.
- Early hallucination, late correction — token 40 says "the meeting is at 3pm" and token 200 says "actually 4pm." Final answer is correct; user perception is broken.
- Mid-stream self-contradiction — the agent emits two incompatible facts and never reconciles them, but the final-string evaluator only checks one.
- Premature commitment — the agent commits to a course of action in token 20 that token 80's tool result invalidates.
You catch these only by evaluating along the stream, not at the end of it.
The Streaming Eval Pipeline
```mermaid
flowchart TD
    A[Agent stream] --> B[Token tap]
    B --> C[Latency metrics]
    B --> D[Rolling text buffer]
    D --> E{Every N tokens or sentence boundary}
    E --> F[Claim extractor LLM]
    F --> G[Claim history]
    G --> H[Contradiction check]
    G --> I[Groundedness check vs retrieved docs]
    D --> J[Partial-answer correctness probe]
    C --> K[Metrics sink]
    H --> K
    I --> K
    J --> K
    K --> L[LangSmith feedback per run]
    style F fill:#fef3c7
    style K fill:#dcfce7
    style L fill:#e0f2fe
```
Figure 1 — The streaming eval pipeline. The token tap is in-band with the agent stream; latency metrics fall out for free; semantic checks (claim extraction, contradiction, groundedness) run on a rolling window every N tokens or sentence boundary.
The pipeline has three parallel concerns: latency (cheap, deterministic), partial correctness (cheap if reference exists, judge-based otherwise), and mid-stream semantic drift (expensive — judge LLM on a rolling window).
Metric Catalog
| Metric | Type | What it catches | Cost |
|---|---|---|---|
| TTFT (time to first token) | Latency | "App feels broken" abandonment | Free |
| Inter-token p50 / p95 / p99 | Latency | Stream stutter, robotic TTS | Free |
| Stream smoothness (1 - CV of inter-token gaps) | Latency-derived | Jittery cadence | Free |
| Final-token latency | Latency | Slow completion at long contexts | Free |
| Partial-answer correctness @ N tokens | Judge | Wrong direction taken early | $$ |
| Mid-stream self-contradiction | Judge | Agent reverses itself silently | $$$ |
| Mid-stream hallucination (claim-extraction) | Judge + retrieval | Confident unsupported claims | $$$ |
| Premature commitment | Heuristic + judge | Decisions made before tool results | $$ |
| Stream-cancel safety | Heuristic | Useful answer if cut off mid-stream | $ |
The right-most column is the rough cost shape. Latency metrics fall out of the stream tap for free. The judge-based metrics are where the eval bill grows; we run them on 100% of dataset rows for full evals and on a 5% sample for online evals.
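For the online sample we hash the session ID rather than rolling a random number, so a session is either fully instrumented or not, and re-runs land in the same bucket. A minimal sketch (`session_id` is whatever stable identifier your sessions carry; the hashing scheme is a choice, not anything prescribed by an SDK):

```python
import hashlib

def in_eval_sample(session_id: str, rate_pct: int = 5) -> bool:
    """Deterministically bucket a session into the online-eval sample."""
    bucket = int.from_bytes(
        hashlib.sha256(session_id.encode()).digest()[:8], "big"
    ) % 100
    return bucket < rate_pct
```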
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
A Working Streaming Evaluator (Python)
This wraps an OpenAI Agents SDK streaming run and emits the metrics. Pin model snapshots.
```python
import asyncio
import json
import statistics
import time
from dataclasses import dataclass, field
from typing import Optional

from agents import Agent, Runner
from openai import AsyncOpenAI

JUDGE_MODEL = "gpt-4.1-2025-04-14"
judge = AsyncOpenAI()


@dataclass
class StreamMetrics:
    ttft_ms: Optional[float] = None
    inter_token_gaps_ms: list[float] = field(default_factory=list)
    final_latency_ms: Optional[float] = None
    partial_checkpoints: list[dict] = field(default_factory=list)
    extracted_claims: list[str] = field(default_factory=list)
    contradictions: list[dict] = field(default_factory=list)
    hallucinations: list[dict] = field(default_factory=list)


async def extract_claims(text: str) -> list[str]:
    """Return atomic factual claims from the text so far."""
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Extract atomic factual claims from the assistant text. "
                "Return one claim per line. Skip questions, hedges, and stylistic filler."
            )},
            {"role": "user", "content": text},
        ],
    )
    body = resp.choices[0].message.content or ""
    return [c.strip("- ").strip() for c in body.splitlines() if c.strip()]


async def contradiction_check(claims: list[str]) -> list[dict]:
    if len(claims) < 2:
        return []
    joined = "\n".join(f"- {c}" for c in claims)
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a list of claims, identify any pair that is logically "
                "contradictory. Return JSON {\"contradictions\": [[\"claim A\", \"claim B\"]]}."
            )},
            {"role": "user", "content": joined},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}").get("contradictions", [])


async def groundedness_check(claim: str, evidence: str) -> dict:
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Decide if CLAIM is supported by EVIDENCE. "
                "Return JSON {\"supported\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": f"CLAIM: {claim}\nEVIDENCE: {evidence}"},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}")


async def run_with_streaming_eval(agent: Agent, prompt: str, evidence: str) -> StreamMetrics:
    m = StreamMetrics()
    started = time.perf_counter()
    last_token_at: Optional[float] = None
    rolling: list[str] = []
    token_count = 0
    CHECKPOINT_EVERY = 60  # tokens

    result = Runner.run_streamed(agent, input=prompt)
    async for event in result.stream_events():
        if event.type != "raw_response_event":
            continue
        data = event.data
        if data.type != "response.output_text.delta":
            continue

        # Latency bookkeeping runs on every text delta.
        now = time.perf_counter()
        if m.ttft_ms is None:
            m.ttft_ms = (now - started) * 1000
        if last_token_at is not None:
            m.inter_token_gaps_ms.append((now - last_token_at) * 1000)
        last_token_at = now
        rolling.append(data.delta)
        token_count += 1

        # Semantic checks run on a rolling window every CHECKPOINT_EVERY tokens.
        if token_count % CHECKPOINT_EVERY == 0:
            partial = "".join(rolling)
            claims = await extract_claims(partial)
            new_claims = [c for c in claims if c not in m.extracted_claims]
            m.extracted_claims.extend(new_claims)
            # Groundedness on each new claim
            for c in new_claims:
                g = await groundedness_check(c, evidence)
                if not g.get("supported", True):
                    m.hallucinations.append({
                        "claim": c,
                        "at_token": token_count,
                        "reason": g.get("reason"),
                    })
            # Self-contradiction across the whole claim history
            cs = await contradiction_check(m.extracted_claims)
            if cs:
                m.contradictions.append({"at_token": token_count, "pairs": cs})
            m.partial_checkpoints.append({
                "at_token": token_count,
                "claim_count": len(m.extracted_claims),
                "hallucination_count": len(m.hallucinations),
            })

    m.final_latency_ms = (time.perf_counter() - started) * 1000
    return m


def smoothness(gaps_ms: list[float]) -> float:
    if len(gaps_ms) < 5:
        return 1.0
    mean = statistics.mean(gaps_ms)
    if mean == 0:
        return 1.0
    cv = statistics.pstdev(gaps_ms) / mean
    return max(0.0, 1.0 - cv)
```
A few production notes that are easy to miss:
- The judge calls should run concurrently with the stream; they must never block stream consumption. In practice we shunt them onto an `asyncio.Queue` consumed by a background task and surface results post hoc; the snippet above keeps the inline `await`s for clarity. A minimal sketch of the queue pattern follows this list.
- Claim extraction is the expensive step. Limit it to sentence boundaries, not every token: we use a simple regex to detect `.`, `?`, or `!` and only extract when one fires and the token count since the last extraction is ≥ 30.
- The groundedness check needs evidence. For RAG agents, the retrieved docs are the evidence. For tool-using agents without retrieval, you cannot run groundedness — fall back to "is this claim verifiable in real time" against a search tool, which is a different and harder check.
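A minimal sketch of both patterns (the off-path judge queue and the sentence-boundary gate), reusing the `extract_claims` helper from the snippet above; `judge_worker` and `should_checkpoint` are illustrative names, not SDK API:

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.?!][\"')\]]?\s*$")

def should_checkpoint(buffer: str, tokens_since_last: int) -> bool:
    # Fire only at a sentence boundary, and at most once per 30 tokens.
    return tokens_since_last >= 30 and bool(SENTENCE_END.search(buffer))

async def judge_worker(queue: asyncio.Queue, out: list[dict]) -> None:
    """Drain partial-text snapshots off the hot path of stream consumption."""
    while True:
        item = await queue.get()
        try:
            if item is None:  # sentinel: the stream has finished
                return
            partial, at_token = item
            claims = await extract_claims(partial)  # judge call, off the hot path
            out.append({"at_token": at_token, "claims": claims})
        finally:
            queue.task_done()
```

In the stream loop, the inline `await extract_claims(...)` becomes a non-blocking `queue.put_nowait(("".join(rolling), token_count))`; after the stream ends, enqueue the `None` sentinel and `await queue.join()` before reading `out`.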
Stream Smoothness in Practice
Inter-token gaps are not normally distributed. They cluster around ~8ms when the model is ripping and spike to ~300–600ms during what we suspect are scheduler hiccups on the OpenAI side. The metric we found most useful is the coefficient-of-variation-derived smoothness, 1 - (stdev / mean), clamped to [0, 1]. Values above 0.85 are subjectively fine; below 0.65, voice TTS pipelines start sounding stuttery.
| Smoothness range | User perception (voice) | Action |
|---|---|---|
| > 0.90 | Natural cadence | None |
| 0.80–0.90 | Slight unevenness | Watch trends |
| 0.65–0.80 | Noticeable stutter | Investigate model/region |
| < 0.65 | Robotic, broken | Page on-call |
We alert at p95 smoothness < 0.75 over 5-minute windows.
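The alert itself is a few lines once each completed run reports one smoothness score. A sketch, assuming an in-memory window that stands in for whatever metrics store you actually use:

```python
import time
from collections import deque

WINDOW_S = 300      # 5-minute window
ALERT_P95 = 0.75    # page when p95 smoothness over the window drops below this

scores: deque[tuple[float, float]] = deque()  # (timestamp, smoothness)

def record_and_check(score: float) -> bool:
    """Record one run's smoothness; return True if the window breaches the alert."""
    now = time.time()
    scores.append((now, score))
    while scores and scores[0][0] < now - WINDOW_S:
        scores.popleft()
    ordered = sorted(s for _, s in scores)
    if len(ordered) < 20:  # too few runs in the window to trust a percentile
        return False
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 < ALERT_P95
```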
Mid-Stream Hallucination — Why Claim-by-Claim Beats End-of-Stream
Running a single hallucination check on the final answer has two problems: (a) the bad claim may have been corrected later, so the final string passes while the user already saw the wrong claim, and (b) catching it after the fact gives you no chance to intervene. Claim-by-claim mid-stream extraction lets you both grade per-token-window quality and, optionally, cancel the stream and re-run if a high-confidence hallucination fires before token 100.
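A sketch of the cancel-and-re-run branch, reusing `StreamMetrics` from the evaluator above; `MAX_INTERVENE_TOKEN` is an illustrative cutoff, and the `cancel()` call assumes your streaming handle exposes one (the Agents SDK's `RunResultStreaming` does; a raw HTTP stream would close the response instead):

```python
MAX_INTERVENE_TOKEN = 100  # past this point, let the stream finish and grade it

def should_cancel(m: StreamMetrics, token_count: int) -> bool:
    """True if a grounding failure fired early enough that re-running is cheaper
    than letting the user keep reading a bad claim."""
    if token_count > MAX_INTERVENE_TOKEN:
        return False
    return any(h["at_token"] <= MAX_INTERVENE_TOKEN for h in m.hallucinations)

# In the stream loop, right after the groundedness checks:
#     if should_cancel(m, token_count):
#         result.cancel()  # stop streaming; the caller re-runs with a new seed
#         break
```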
In our voice-agent eval suite, mid-stream hallucination detection caught ~3.4x more hallucinations than end-of-stream evaluation alone, because the model's own self-correction at the end of a long answer was masking earlier bad claims. The companion piece on voice agent quality metrics has more on grounding-specific evaluators.
Partial-Answer Correctness Checkpoints
Some prompts have a "right direction" that should be observable early. Example: "What's the weather in Mumbai today?" — by token ~30, the answer should at minimum mention Mumbai and a temperature or condition; if it has wandered into the history of Mumbai's monsoon patterns, the direction is wrong even if the final answer ends up correct.
We model this as a checkpoint evaluator that takes (prompt, partial_text, expected_direction_rubric) and returns on_track: bool. The rubric is small — usually one or two sentences — and is part of the dataset row.
```python
async def on_track(prompt: str, partial: str, rubric: str) -> dict:
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a user prompt, the assistant's partial answer so far, "
                "and a rubric for what an on-track answer should be doing by "
                "this point, return JSON {\"on_track\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": (
                f"PROMPT: {prompt}\nPARTIAL: {partial}\nRUBRIC: {rubric}"
            )},
        ],
    )
    return json.loads(resp.choices[0].message.content or "{}")
```
We run this at tokens 30, 90, and 200. Below 30, the answer is too short to grade. Above 200, the final-answer evaluator covers it.
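Wiring the probe into the stream loop from earlier is one extra branch; a sketch, with the rubric assumed to come off the dataset row:

```python
from typing import Optional

CHECKPOINTS = frozenset({30, 90, 200})

async def probe_if_checkpoint(prompt: str, rolling: list[str],
                              token_count: int, rubric: str) -> Optional[dict]:
    """Run the on_track judge when the stream crosses a checkpoint token count."""
    if token_count not in CHECKPOINTS:
        return None
    verdict = await on_track(prompt, "".join(rolling), rubric)
    return {"at_token": token_count, **verdict}
```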
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Tying It Into LangSmith Feedback
Every metric becomes a LangSmith feedback row keyed by the streaming run's run_id, so the continuous evaluation gate in CI/CD can apply thresholds the same way it does for non-streaming evaluators. The threshold shape we use:
| Metric | Floor | Max regression |
|---|---|---|
| ttft_ms p95 | < 900 ms | +120 ms |
| smoothness p50 | > 0.80 | -0.05 |
| partial_on_track @ 90 tokens | > 0.92 | -0.02 |
| mid_stream_contradiction_rate | < 0.5% | +0.2% |
| mid_stream_hallucination_rate | < 1.0% | +0.3% |
Floors stop slow drift; regression limits stop sudden cliffs. Both are required.
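Pushing the metrics is plain `langsmith` SDK usage; a minimal sketch, assuming the run was traced so the streaming run's `run_id` is at hand (the feedback keys are our naming convention, not LangSmith's):

```python
from langsmith import Client

ls = Client()

def log_stream_feedback(run_id: str, m: StreamMetrics) -> None:
    """One feedback row per streaming metric, keyed to the traced run."""
    ls.create_feedback(run_id, key="ttft_ms", score=m.ttft_ms)
    ls.create_feedback(run_id, key="stream_smoothness",
                       score=smoothness(m.inter_token_gaps_ms))
    ls.create_feedback(run_id, key="mid_stream_hallucinations",
                       score=len(m.hallucinations))
    ls.create_feedback(run_id, key="mid_stream_contradictions",
                       score=len(m.contradictions))
```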
Real Numbers From Production
Across April 2026 on our voice and chat agents handling roughly 280k sessions:
- Mid-stream hallucination rate: 0.7% on chat, 1.1% on voice (voice is higher because grounding evidence is sparser).
- Mid-stream contradiction rate: 0.3% on chat, 0.4% on voice.
- TTFT p95: 760 ms on chat (gpt-4.1), 540 ms on voice (realtime).
- Stream smoothness p50: 0.91 on chat, 0.88 on voice.
- Eval cost overhead: streaming evals add ~$0.018/session at current pricing, vs ~$0.004 for end-only evaluation. We run streaming evals on 5% of online traffic and 100% of offline dataset runs, which lands at ~$420/month total.
The ROI is in the regressions caught. In Q1 we caught four agent prompt changes that passed final-answer eval but failed streaming eval — three on partial correctness, one on a sudden TTFT regression caused by a tool added to the agent's tool list (which inflated the system prompt size). Without streaming eval, all four would have shipped.
Frequently Asked Questions
Can I run streaming eval against non-OpenAI models?
Yes — the pipeline is model-agnostic. Replace the agent under test with any streaming runnable. The judge model is independent and can stay on a single high-quality model regardless of what the agent uses.
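Most providers expose an OpenAI-compatible streaming shape, so a sketch of the same latency tap over a raw Chat Completions stream shows how little is agent-specific (any client that yields text deltas will do):

```python
async def tap_raw_stream(client: AsyncOpenAI, model: str, prompt: str) -> StreamMetrics:
    """Same latency tap as run_with_streaming_eval, over a plain chat stream."""
    m = StreamMetrics()
    started = time.perf_counter()
    last: Optional[float] = None
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:
            continue  # e.g. trailing usage-only chunks carry no delta
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        now = time.perf_counter()
        if m.ttft_ms is None:
            m.ttft_ms = (now - started) * 1000
        if last is not None:
            m.inter_token_gaps_ms.append((now - last) * 1000)
        last = now
    m.final_latency_ms = (time.perf_counter() - started) * 1000
    return m
```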
Doesn't claim extraction every 60 tokens add a lot of latency to the eval run?
The extraction calls run off the critical path of the agent run — they consume the same stream the user does, but their results land asynchronously. The agent's perceived latency is unchanged. The eval run itself takes ~1.4x as long wall-clock as the underlying agent run.
How do I version the rubrics for partial-correctness checkpoints?
Same way we version reference outputs in the golden dataset workflow — they live in the dataset row metadata with a rubric_version field, and judge runs are tagged with the version so old experiments are still comparable.
What about streaming structured outputs (JSON mode)?
Slightly different. For JSON streams, partial parsing is structural rather than semantic — you check that the partial JSON is prefix-valid and that critical keys appear in order. We use a streaming JSON parser and grade "is the schema being respected as it streams?" rather than running claim extraction.
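A minimal sketch of the structural checks, as a cheap necessary condition rather than a full streaming parser (real implementations use an incremental parser; `is_prefix_valid_json` and `keys_in_order` are illustrative):

```python
def is_prefix_valid_json(partial: str) -> bool:
    """Cheap necessary condition: strings and brackets are consistent so far.
    Catches mismatched closers immediately; a real streaming parser is stricter."""
    stack: list[str] = []
    in_str = esc = False
    for ch in partial:
        if esc:
            esc = False
        elif in_str:
            if ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return False  # this prefix can never grow into valid JSON
    return True

def keys_in_order(partial: str, expected: list[str]) -> bool:
    """Check that the critical keys seen so far appear in the expected order
    (keys that have not streamed yet are ignored)."""
    positions = [partial.find(f'"{k}"') for k in expected]
    seen = [p for p in positions if p >= 0]
    return seen == sorted(seen)
```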
Should I cancel runs that fire mid-stream hallucination flags?
Cautiously. We do this only on online production traffic, not on offline evals (where the goal is to measure, not intervene). Cancellation triggers a full re-run with a different seed/temperature. We cancel ~0.4% of streams this way; user-perceived quality on canceled-and-re-streamed sessions is statistically indistinguishable from sessions that streamed cleanly the first time.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.