Token-Level Evaluation of Streaming Agents: TTFT, Stream Smoothness, and Mid-Stream Hallucination Detection

TL;DR

If your agent streams its output and your evaluation harness only grades the final concatenated string, you are blind to roughly half of what users experience. A streamed answer can be correct in aggregate and still feel awful: a slow first token, a jittery middle, an early sentence that contradicts a later one, or a confident hallucination that arrives at token 40 and gets corrected at token 200, after the user has already started believing it. This post defines the streaming-specific metric set we run on every release of our voice and chat agents, with working Python code that wraps a streaming agent and emits TTFT, inter-token jitter, partial-answer correctness, mid-stream contradiction, and claim-extraction-based mid-stream hallucination signals. Models are pinned to gpt-4o-2024-11-20 for the agent under test and gpt-4.1-2025-04-14 for the streaming judge.

Why Final-Answer Eval Is Insufficient

The standard agent evaluation stack (covered in our evaluation stack overview) grades the final output of a run against a reference or rubric. That works for batch and async use cases. It fails for any surface where the user reads or hears the response as it is generated, which now includes essentially every chat UI, every voice agent, and every realtime browser agent.

Concrete failure modes that final-answer eval misses:

Slow TTFT — the user perceives the agent as broken and abandons before the (correct) answer finishes streaming.
Jittery streaming — long pauses between tokens make voice TTS sound robotic and chat UIs feel hung, even when total latency is fine.
Early hallucination, late correction — token 40 says "the meeting is at 3pm" and token 200 says "actually 4pm." Final answer is correct; user perception is broken.
Mid-stream self-contradiction — the agent emits two incompatible facts and never reconciles them, but the final-string evaluator only checks one.
Premature commitment — the agent commits to a course of action in token 20 that token 80's tool result invalidates.

You catch these only by evaluating along the stream, not at the end of it.

The Streaming Eval Pipeline

flowchart TD
  A[Agent stream] --> B[Token tap]
  B --> C[Latency metrics]
  B --> D[Rolling text buffer]
  D --> E{Every N tokens or sentence boundary}
  E --> F[Claim extractor LLM]
  F --> G[Claim history]
  G --> H[Contradiction check]
  G --> I[Groundedness check vs retrieved docs]
  D --> J[Partial-answer correctness probe]
  C --> K[Metrics sink]
  H --> K
  I --> K
  J --> K
  K --> L[LangSmith feedback per run]
  style F fill:#fef3c7
  style K fill:#dcfce7
  style L fill:#e0f2fe

Figure 1 — The streaming eval pipeline. The token tap is in-band with the agent stream; latency metrics fall out for free; semantic checks (claim extraction, contradiction, groundedness) run on a rolling window every N tokens or sentence boundary.

The pipeline has three parallel concerns: latency (cheap, deterministic), partial correctness (cheap if reference exists, judge-based otherwise), and mid-stream semantic drift (expensive — judge LLM on a rolling window).

Metric Catalog

Metric	Type	What it catches	Cost
TTFT (time to first token)	Latency	"App feels broken" abandonment	Free
Inter-token p50 / p95 / p99	Latency	Stream stutter, robotic TTS	Free
Stream smoothness (1 - CV of inter-token gaps)	Latency-derived	Jittery cadence	Free
Final-token latency	Latency	Slow completion at long contexts	Free
Partial-answer correctness @ N tokens	Judge	Wrong direction taken early	$$
Mid-stream self-contradiction	Judge	Agent reverses itself silently	$$$
Mid-stream hallucination (claim-extraction)	Judge + retrieval	Confident unsupported claims	$$$
Premature commitment	Heuristic + judge	Decisions made before tool results	$$
Stream-cancel safety	Heuristic	Useful answer if cut off mid-stream	$

The right-most column is the rough cost shape. Latency metrics fall out of the stream tap for free. The judge-based metrics are where the eval bill grows; we run them on 100% of dataset rows for full evals and on a 5% sample for online evals.
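
The 5% online sample can be made deterministic by hashing a session identifier, so a given session is always in or always out of the sample. This is a minimal sketch, assuming each session has a string ID; the helper name `in_eval_sample` and the hashing scheme are illustrative, not the production sampler.

```python
import hashlib

def in_eval_sample(session_id: str, rate: float = 0.05) -> bool:
    """Return True for a stable fraction (about `rate`) of session IDs.

    Hypothetical helper: hashes the ID into [0, 1) and compares it to the rate.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# The decision is stable across calls and processes for the same ID.
session_ids = [f"session-{i}" for i in range(10_000)]
picked = [sid for sid in session_ids if in_eval_sample(sid)]
sample_fraction = len(picked) / len(session_ids)
print(f"sampled fraction: {sample_fraction:.3f}")
```

Hash-based sampling keeps the judge bill predictable and makes it possible to re-run the same sessions later when comparing two releases.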

A Working Streaming Evaluator (Python)

This wraps an OpenAI Agents SDK streaming run and emits the metrics. Pin model snapshots.

import asyncio
import statistics
import time
from dataclasses import dataclass, field
from typing import Optional
from agents import Agent, Runner
from openai import AsyncOpenAI

JUDGE_MODEL = "gpt-4.1-2025-04-14"
judge = AsyncOpenAI()

@dataclass
class StreamMetrics:
    ttft_ms: Optional[float] = None
    inter_token_gaps_ms: list[float] = field(default_factory=list)
    final_latency_ms: Optional[float] = None
    partial_checkpoints: list[dict] = field(default_factory=list)
    extracted_claims: list[str] = field(default_factory=list)
    contradictions: list[dict] = field(default_factory=list)
    hallucinations: list[dict] = field(default_factory=list)

async def extract_claims(text: str) -> list[str]:
    """Return atomic factual claims from the text so far."""
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Extract atomic factual claims from the assistant text. "
                "Return one claim per line. Skip questions, hedges, and stylistic filler."
            )},
            {"role": "user", "content": text},
        ],
    )
    body = resp.choices[0].message.content or ""
    return [c.strip("- ").strip() for c in body.splitlines() if c.strip()]

async def contradiction_check(claims: list[str]) -> list[list[str]]:
    """Return pairs of mutually contradictory claims, each as [claim A, claim B]."""
    if len(claims) < 2:
        return []
    joined = "\n".join(f"- {c}" for c in claims)
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a list of claims, identify any pair that is logically "
                "contradictory. Return JSON {\"contradictions\": [[\"claim A\", \"claim B\"]]}."
            )},
            {"role": "user", "content": joined},
        ],
    )
    import json
    return json.loads(resp.choices[0].message.content or "{}").get("contradictions", [])

async def groundedness_check(claim: str, evidence: str) -> dict:
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Decide if CLAIM is supported by EVIDENCE. "
                "Return JSON {\"supported\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": f"CLAIM: {claim}\nEVIDENCE: {evidence}"},
        ],
    )
    import json
    return json.loads(resp.choices[0].message.content or "{}")

async def run_with_streaming_eval(agent: Agent, prompt: str, evidence: str) -> StreamMetrics:
    m = StreamMetrics()
    started = time.perf_counter()
    last_token_at: Optional[float] = None
    rolling = []
    token_count = 0
    CHECKPOINT_EVERY = 60  # tokens

    result = Runner.run_streamed(agent, input=prompt)

    async for event in result.stream_events():
        if event.type != "raw_response_event":
            continue
        data = event.data
        if data.type != "response.output_text.delta":
            continue

        now = time.perf_counter()
        if m.ttft_ms is None:
            m.ttft_ms = (now - started) * 1000
        if last_token_at is not None:
            m.inter_token_gaps_ms.append((now - last_token_at) * 1000)
        last_token_at = now

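        rolling.append(data.delta)
        # Counts delta events; a delta may carry more than one token, so the
        # "token" positions recorded below are approximate.
        token_count += 1

        if token_count % CHECKPOINT_EVERY == 0:
            # Judge calls are awaited inline here for clarity. That pauses stream
            # consumption, so gaps measured right after a checkpoint include judge
            # latency; move this work to a background task before trusting them.
            partial = "".join(rolling)
            claims = await extract_claims(partial)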
            new_claims = [c for c in claims if c not in m.extracted_claims]
            m.extracted_claims.extend(new_claims)

            # Groundedness on each new claim
            for c in new_claims:
                g = await groundedness_check(c, evidence)
                if not g.get("supported", True):
                    m.hallucinations.append({
                        "claim": c,
                        "at_token": token_count,
                        "reason": g.get("reason"),
                    })

            # Self-contradiction across the whole claim history
            cs = await contradiction_check(m.extracted_claims)
            if cs:
                m.contradictions.append({"at_token": token_count, "pairs": cs})

            m.partial_checkpoints.append({
                "at_token": token_count,
                "claim_count": len(m.extracted_claims),
                "hallucination_count": len(m.hallucinations),
            })

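    # Text after the last full checkpoint window is not judged in this sketch,
    # and this figure includes time spent in the inline judge calls above.
    m.final_latency_ms = (time.perf_counter() - started) * 1000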
    return m

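def smoothness(gaps_ms: list[float]) -> float:
    """Return 1 - CV of the inter-token gaps, clamped to [0, 1]."""
    if len(gaps_ms) < 5:
        # Too few gaps to estimate variability; report a perfect score.
        return 1.0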
    mean = statistics.mean(gaps_ms)
    if mean == 0:
        return 1.0
    cv = statistics.pstdev(gaps_ms) / mean
    return max(0.0, 1.0 - cv)

A few production notes that are easy to miss:

The snippet above awaits the judge calls inline for clarity, which blocks stream consumption and distorts the inter-token gaps measured around each checkpoint. In production the judge calls must run concurrently with the stream: we shunt them to an asyncio.Queue consumed by a background task and surface results post-hoc.
Claim extraction is the expensive step. Limit it to sentence boundaries, not every token. We use a simple regex to detect ., ?, and ! and only extract when one fires and the token count since the last extraction is ≥ 30.
Groundedness check needs evidence. For RAG agents, the retrieved docs are the evidence. For tool-using agents without retrieval, you cannot run groundedness — fall back to "is this claim verifiable in real time" against a search tool, which is a different and harder check.
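
The sentence-boundary rule from the notes above can be sketched as a small trigger object. This is an illustration under stated assumptions: the class name `ExtractionTrigger` and the exact regex are hypothetical, and each delta is counted as one token, as in the evaluator.

```python
import re

# Sentence-ending punctuation followed by whitespace or the end of the delta.
SENTENCE_END = re.compile(r"[.?!](?:\s|$)")
MIN_TOKENS_BETWEEN = 30  # threshold taken from the note above


class ExtractionTrigger:
    """Decides when to run claim extraction on the rolling buffer."""

    def __init__(self, min_tokens: int = MIN_TOKENS_BETWEEN) -> None:
        self.min_tokens = min_tokens
        self.tokens_since_last = 0

    def should_extract(self, delta: str) -> bool:
        # Each delta is counted as one token (an approximation).
        self.tokens_since_last += 1
        if self.tokens_since_last < self.min_tokens:
            return False
        if SENTENCE_END.search(delta) is None:
            return False
        # Boundary reached and enough text has accumulated: fire and reset.
        self.tokens_since_last = 0
        return True


# Small threshold so the behavior is visible on a short stream.
trigger = ExtractionTrigger(min_tokens=3)
fired = [trigger.should_extract(d) for d in ["Hi", ".", " The", " end", ".", " Ok", "!"]]
print(fired)  # [False, False, False, False, True, False, False]
```

A regex on single deltas can misfire when a number such as 3.5 is split across two deltas, so treat the trigger as a cost-control heuristic, not a sentence parser.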

Stream Smoothness in Practice

Inter-token gaps are not normally distributed. They cluster around ~8ms when the model is ripping and spike to ~300–600ms during what we suspect are scheduler hiccups on the OpenAI side. The metric we found most useful is the coefficient-of-variation-derived smoothness, 1 - (stdev / mean), clamped to [0, 1]. Values above 0.85 are subjectively fine; below 0.65, voice TTS pipelines start sounding stuttery.
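
As a worked example of the formula, the sketch below computes gap percentiles and the smoothness score for a small, made-up list of gaps. The sample values are illustrative, not production data; `smoothness` is repeated from the evaluator so the block runs on its own, and `gap_percentiles` is a hypothetical helper.

```python
import statistics


def gap_percentiles(gaps_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 of inter-token gaps using inclusive quantiles."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(gaps_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def smoothness(gaps_ms: list[float]) -> float:
    """1 - (population stdev / mean), clamped to [0, 1]."""
    if len(gaps_ms) < 5:
        return 1.0
    mean = statistics.mean(gaps_ms)
    if mean == 0:
        return 1.0
    return max(0.0, 1.0 - statistics.pstdev(gaps_ms) / mean)


# Illustrative gaps in milliseconds: mean 9.0, population variance 1.4.
steady = [8.0, 9.0, 8.0, 10.0, 9.0, 8.0, 12.0, 9.0, 8.0, 9.0]
steady_score = smoothness(steady)       # 1 - sqrt(1.4) / 9, about 0.869
steady_pcts = gap_percentiles(steady)   # p50 is 9.0 for this list

# Two large spikes among otherwise steady gaps push CV above 1.
spiky = [8.0] * 48 + [300.0, 600.0]
spiky_score = smoothness(spiky)         # clamped to 0.0

print(round(steady_score, 3), steady_pcts["p50"], spiky_score)
```

Because the coefficient of variation is sensitive to outliers, a handful of very large gaps in a short stream can pull the score all the way to the 0 clamp, so it is worth reading smoothness alongside the p95 and p99 gaps.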

Smoothness range	User perception (voice)	Action
> 0.90	Natural cadence	None
0.80–0.90	Slight unevenness	Watch trends
0.65–0.80	Noticeable stutter	Investigate model/region
< 0.65	Robotic, broken	Page on-call

We alert at p95 smoothness < 0.75 over 5-minute windows.
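
The table above maps directly to a lookup. In this sketch the function name `smoothness_band` is hypothetical, and because the table's ranges share their edges, the lower bound of each band is assumed to be inclusive.

```python
def smoothness_band(score: float) -> tuple[str, str]:
    """Map a smoothness score to (user perception, action) per the table."""
    if score > 0.90:
        return ("Natural cadence", "None")
    if score >= 0.80:  # assumption: 0.80 belongs to the upper band
        return ("Slight unevenness", "Watch trends")
    if score >= 0.65:  # assumption: 0.65 belongs to the upper band
        return ("Noticeable stutter", "Investigate model/region")
    return ("Robotic, broken", "Page on-call")


for score in (0.95, 0.85, 0.70, 0.50):
    perception, action = smoothness_band(score)
    print(f"{score:.2f}: {perception} -> {action}")
```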

Mid-Stream Hallucination — Why Claim-by-Claim Beats End-of-Stream

Running a single hallucination check on the final answer has two problems: (a) the bad claim may have been corrected later, so the final string passes while the user already saw the wrong claim, and (b) catching it after the fact gives you no chance to intervene. Claim-by-claim mid-stream extraction lets you both grade per-token-window quality and, optionally, cancel the stream and re-run if a high-confidence hallucination fires before token 100.
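
The optional cancel-and-re-run decision can be sketched as a pure policy function. Note the assumptions: the groundedness judge shown earlier returns only `supported` and `reason`, so the `confidence` field, the 0.9 threshold, and the names `HallucinationFlag` and `should_cancel` are all hypothetical. Only the token-100 cutoff comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class HallucinationFlag:
    claim: str
    at_token: int
    confidence: float  # hypothetical judge-reported confidence in [0, 1]


def should_cancel(
    flag: HallucinationFlag,
    *,
    max_token: int = 100,          # cutoff from the text: before token 100
    min_confidence: float = 0.9,   # assumed meaning of "high-confidence"
) -> bool:
    """Cancel only for early, high-confidence hallucination flags."""
    return flag.at_token < max_token and flag.confidence >= min_confidence


early = HallucinationFlag("The meeting is at 3pm", at_token=40, confidence=0.95)
late = HallucinationFlag("The meeting is at 3pm", at_token=200, confidence=0.95)
unsure = HallucinationFlag("The meeting is at 3pm", at_token=40, confidence=0.60)
print(should_cancel(early), should_cancel(late), should_cancel(unsure))
```

Keeping the policy separate from the stream loop makes it easy to disable cancellation for offline runs, where the goal is to measure and not to intervene.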

In our voice-agent eval suite, mid-stream hallucination detection caught ~3.4x more hallucinations than end-of-stream evaluation alone, because the model's own self-correction at the end of a long answer was masking earlier bad claims. The companion piece on voice agent quality metrics has more on grounding-specific evaluators.

Partial-Answer Correctness Checkpoints

Some prompts have a "right direction" that should be observable early. Example: "What's the weather in Mumbai today?" — by token ~30, the answer should at minimum mention Mumbai and a temperature or condition; if it has wandered into the history of Mumbai's monsoon patterns, the direction is wrong even if the final answer ends up correct.

We model this as a checkpoint evaluator that takes (prompt, partial_text, expected_direction_rubric) and returns on_track: bool. The rubric is small — usually one or two sentences — and is part of the dataset row.

async def on_track(prompt: str, partial: str, rubric: str) -> dict:
    resp = await judge.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given a user prompt, the assistant's partial answer so far, "
                "and a rubric for what an on-track answer should be doing by "
                "this point, return JSON {\"on_track\": bool, \"reason\": str}."
            )},
            {"role": "user", "content": (
                f"PROMPT: {prompt}\nPARTIAL: {partial}\nRUBRIC: {rubric}"
            )},
        ],
    )
    import json
    return json.loads(resp.choices[0].message.content or "{}")

We run this at token 30, 90, and 200. Below 30, the answer is too short to grade. Above 200, the final-answer evaluator covers it.
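
Because a streamed delta can move the token count past a checkpoint without landing on it exactly, a threshold-crossing schedule is safer than an equality test. This is a minimal sketch; the class name `CheckpointSchedule` is hypothetical, and the caller would invoke `on_track` for each checkpoint it returns.

```python
CHECKPOINTS = (30, 90, 200)  # token positions from the text above


class CheckpointSchedule:
    """Tracks which partial-correctness checkpoints are still pending."""

    def __init__(self, checkpoints: tuple[int, ...] = CHECKPOINTS) -> None:
        self._pending = sorted(checkpoints)

    def due(self, token_count: int) -> list[int]:
        """Return checkpoints reached by token_count that have not fired yet."""
        fired = [c for c in self._pending if c <= token_count]
        self._pending = [c for c in self._pending if c > token_count]
        return fired


schedule = CheckpointSchedule()
print(schedule.due(10))   # []
print(schedule.due(30))   # [30]
print(schedule.due(250))  # [90, 200]
print(schedule.due(300))  # []
```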

Tying It Into LangSmith Feedback

Every metric becomes a LangSmith feedback row keyed by the streaming run's run_id, so the continuous evaluation gate in CI/CD can apply thresholds the same way it does for non-streaming evaluators. The threshold shape we use:

Metric	Floor	Max regression
ttft_ms p95	< 900 ms	+120 ms
smoothness p50	> 0.80	-0.05
partial_on_track @ 90 tokens	> 0.92	-0.02
mid_stream_contradiction_rate	< 0.5%	+0.2%
mid_stream_hallucination_rate	< 1.0%	+0.3%

Floors stop slow drift; regression limits stop sudden cliffs. Both are required.
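
The floor-plus-regression rule can be expressed as a small gate. This sketch encodes the table above under two assumptions: the percentage rows are treated as fractions (0.5% becomes 0.005), and the regression limits on rates are read as percentage points. The names `Threshold`, `THRESHOLDS`, and `check` are hypothetical, not a LangSmith API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    floor: float            # absolute limit from the "Floor" column
    max_regression: float   # allowed change vs. baseline, as a magnitude
    higher_is_better: bool


THRESHOLDS = {
    "ttft_ms_p95": Threshold(900.0, 120.0, higher_is_better=False),
    "smoothness_p50": Threshold(0.80, 0.05, higher_is_better=True),
    "partial_on_track_at_90": Threshold(0.92, 0.02, higher_is_better=True),
    "mid_stream_contradiction_rate": Threshold(0.005, 0.002, higher_is_better=False),
    "mid_stream_hallucination_rate": Threshold(0.010, 0.003, higher_is_better=False),
}


def check(metric: str, current: float, baseline: float) -> list[str]:
    """Return which rules fail for a metric: 'floor', 'regression', or both."""
    t = THRESHOLDS[metric]
    failures = []
    if t.higher_is_better:
        if current <= t.floor:
            failures.append("floor")
        if baseline - current > t.max_regression:
            failures.append("regression")
    else:
        if current >= t.floor:
            failures.append("floor")
        if current - baseline > t.max_regression:
            failures.append("regression")
    return failures


print(check("ttft_ms_p95", 760.0, 700.0))     # []
print(check("ttft_ms_p95", 880.0, 700.0))     # ['regression']
print(check("smoothness_p50", 0.78, 0.91))    # ['floor', 'regression']
```

The second call shows why both rules are needed: 880 ms is still under the floor, but a 180 ms jump from the baseline is the kind of sudden cliff the regression limit exists to catch.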

Real Numbers From Production

Across April 2026 on our voice and chat agents handling roughly 280k sessions:

Mid-stream hallucination rate: 0.7% on chat, 1.1% on voice (voice is higher because grounding evidence is sparser).
Mid-stream contradiction rate: 0.3% on chat, 0.4% on voice.
TTFT p95: 760 ms on chat (gpt-4.1), 540 ms on voice (realtime).
Stream smoothness p50: 0.91 on chat, 0.88 on voice.
Eval cost overhead: streaming evals add ~$0.018/session at current pricing, vs ~$0.004 for end-only evaluation. We run streaming evals on 5% of online traffic and 100% of offline dataset runs, which lands at ~$420/month total.
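
The cost figures can be sanity-checked with a few lines of arithmetic. The inputs are the numbers quoted above; the split between online and offline spend is inferred here and is not reported separately, and the sketch assumes every sampled session costs the quoted average.

```python
SESSIONS_PER_MONTH = 280_000      # "roughly 280k sessions"
ONLINE_SAMPLE_RATE = 0.05         # 5% of online traffic
STREAMING_EVAL_COST = 0.018       # USD per session, streaming evals
REPORTED_MONTHLY_TOTAL = 420.0    # USD, online plus offline

sampled_sessions = round(SESSIONS_PER_MONTH * ONLINE_SAMPLE_RATE)   # 14,000
online_cost = round(sampled_sessions * STREAMING_EVAL_COST, 2)      # 252.00
full_coverage_cost = round(SESSIONS_PER_MONTH * STREAMING_EVAL_COST, 2)  # 5,040.00

# Inferred remainder attributable to offline dataset runs.
implied_offline_cost = round(REPORTED_MONTHLY_TOTAL - online_cost, 2)    # 168.00

print(f"online sample: {sampled_sessions:,} sessions, ${online_cost:,.2f}")
print(f"100% online coverage would cost ${full_coverage_cost:,.2f}")
print(f"implied offline spend: ${implied_offline_cost:,.2f}")
```

Under these assumptions the 5% sample accounts for about $252 of the monthly total, and running streaming evals on all online traffic would cost roughly twenty times that.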

The ROI is in the regressions caught. In Q1 we caught four agent prompt changes that passed final-answer eval but failed streaming eval — three on partial correctness, one on a sudden TTFT regression caused by a tool added to the agent's tool list (which inflated the system prompt size). Without streaming eval, all four would have shipped.

Frequently Asked Questions

Can I run streaming eval against non-OpenAI models?

Yes — the pipeline is model-agnostic. Replace the agent under test with any streaming runnable. The judge model is independent and can stay on a single high-quality model regardless of what the agent uses.
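
One way to keep the latency tap independent of the agent framework is to write it against a plain async iterator of text chunks. This is a minimal sketch, assuming the agent under test can be adapted to yield strings; `tap_latency` and `fake_stream` are hypothetical names, and the fake stream stands in for a real model.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def tap_latency(chunks: AsyncIterator[str]) -> dict:
    """Consume any async text stream and record TTFT and inter-chunk gaps."""
    started = time.perf_counter()
    last: Optional[float] = None
    ttft_ms: Optional[float] = None
    gaps_ms: list[float] = []
    parts: list[str] = []
    async for chunk in chunks:
        now = time.perf_counter()
        if last is None:
            ttft_ms = (now - started) * 1000
        else:
            gaps_ms.append((now - last) * 1000)
        last = now
        parts.append(chunk)
    return {
        "ttft_ms": ttft_ms,
        "gaps_ms": gaps_ms,
        "final_latency_ms": (time.perf_counter() - started) * 1000,
        "text": "".join(parts),
    }


async def fake_stream() -> AsyncIterator[str]:
    # Stand-in for a real model stream; sleeps simulate generation time.
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0.01)
        yield piece


metrics = asyncio.run(tap_latency(fake_stream()))
print(metrics["text"], len(metrics["gaps_ms"]))
```

Any provider SDK that exposes an async stream of text deltas can be wrapped in a small generator and passed to the same tap.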

Doesn't claim extraction every 60 tokens add a lot of latency to the eval run?

The extraction calls run off the critical path of the agent run — they consume the same stream the user does, but their results land asynchronously. The agent's perceived latency is unchanged. The eval run itself takes ~1.4x as long wall-clock as the underlying agent run.
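
Keeping judge work off the critical path can be sketched with an asyncio.Queue and a background worker. This is an illustration, not the production harness: `judge_worker`, `consume_stream`, and the fake judge are hypothetical, and the checkpoint interval is shortened so the example stays small.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional


async def judge_worker(
    queue: "asyncio.Queue[Optional[str]]",
    results: list,
    judge_fn: Callable[[str], Awaitable[object]],
) -> None:
    """Drain snapshots from the queue and judge them one at a time."""
    while True:
        snapshot = await queue.get()
        try:
            if snapshot is None:  # sentinel: the stream has ended
                return
            results.append(await judge_fn(snapshot))
        finally:
            queue.task_done()


async def consume_stream(
    chunks: AsyncIterator[str],
    judge_fn: Callable[[str], Awaitable[object]],
    every: int = 3,
) -> tuple[str, list]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    results: list = []
    worker = asyncio.create_task(judge_worker(queue, results, judge_fn))
    parts: list[str] = []
    async for chunk in chunks:
        parts.append(chunk)
        if len(parts) % every == 0:
            # Enqueue a snapshot without awaiting the judge.
            queue.put_nowait("".join(parts))
    queue.put_nowait(None)
    await worker  # results are surfaced after the stream completes
    return "".join(parts), results


async def fake_stream() -> AsyncIterator[str]:
    for ch in "abcdef":
        await asyncio.sleep(0)
        yield ch


async def fake_judge(text: str) -> int:
    await asyncio.sleep(0.01)  # stands in for a judge LLM call
    return len(text)


text, results = asyncio.run(consume_stream(fake_stream(), fake_judge))
print(text, results)  # abcdef [3, 6]
```

The stream loop never awaits the judge, so timing measurements taken inside it are not inflated by judge latency; the extra wall-clock time is spent only at the end, while the worker finishes its backlog.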

How do I version the rubrics for partial-correctness checkpoints?

Same way we version reference outputs in the golden dataset workflow — they live in the dataset row metadata with a rubric_version field, and judge runs are tagged with the version so old experiments are still comparable.
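
A dataset row carrying a versioned rubric might look like the sketch below. Only the `rubric_version` field name comes from the answer above; the `DatasetRow` shape, the `tag_judgment` helper, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRow:
    prompt: str
    rubric: str
    metadata: dict = field(default_factory=dict)


def tag_judgment(row: DatasetRow, on_track: bool) -> dict:
    """Attach the rubric version to a judge result so runs stay comparable."""
    return {
        "on_track": on_track,
        "rubric_version": row.metadata.get("rubric_version", "unversioned"),
    }


row = DatasetRow(
    prompt="What's the weather in Mumbai today?",
    rubric="By this point the answer mentions Mumbai and a temperature or condition.",
    metadata={"rubric_version": "v2"},  # illustrative version label
)
tagged = tag_judgment(row, on_track=True)
print(tagged)
```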

What about streaming structured outputs (JSON mode)?

Slightly different. For JSON streams, partial parsing is structural rather than semantic — you check that the partial JSON is prefix-valid and that critical keys appear in order. We use a streaming JSON parser and grade "is the schema being respected as it streams?" rather than running claim extraction.
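
A rough version of the structural check can be written as a single pass over the partial text. This sketch is not a full streaming JSON parser: it only verifies that brackets close in a consistent order and records top-level object keys as they appear. The function names and the example schema keys are hypothetical.

```python
from typing import Optional


def scan_partial_json(partial: str) -> tuple[bool, list[str]]:
    """Return (brackets_consistent, top_level_keys_seen_so_far)."""
    stack: list[str] = []            # expected closing brackets
    keys: list[str] = []
    in_string = False
    escaped = False
    current: list[str] = []
    candidate: Optional[str] = None  # last completed string at top level
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
                current.append(ch)   # escapes are kept raw, not decoded
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                candidate = "".join(current) if stack == ["}"] else None
            else:
                current.append(ch)
            continue
        if ch == '"':
            in_string = True
            current = []
        elif ch == ":" and candidate is not None:
            keys.append(candidate)   # a top-level string followed by ':' is a key
            candidate = None
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
            candidate = None
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return False, keys   # closing bracket does not match
            candidate = None
        elif not ch.isspace():
            candidate = None
    return True, keys


def keys_in_order(seen: list[str], required: list[str]) -> bool:
    """True if the required keys seen so far appear in the required order."""
    relevant = [k for k in seen if k in required]
    return relevant == required[: len(relevant)]


ok, seen = scan_partial_json('{"intent": "book", "slots": {"time": "3p')
print(ok, seen)  # True ['intent', 'slots']
print(keys_in_order(seen, ["intent", "slots", "confirm"]))  # True
print(scan_partial_json('{"a": [1, 2}')[0])  # False
```

A production grader would use a real incremental parser to catch malformed literals and dangling commas; the scan above only covers bracket structure and key order.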

Should I cancel runs that fire mid-stream hallucination flags?

Cautiously. We do this only on online evals at production traffic, not on offline evals (where the goal is to measure, not intervene). Cancellation triggers a full re-run with a different seed/temperature. We cancel ~0.4% of streams this way; user-perceived quality on canceled-and-re-streamed sessions is statistically indistinguishable from sessions that streamed cleanly the first time.
