
Conversation Quality Metrics: Coherence, Relevance, and Helpfulness Scoring

Learn how to measure AI agent conversation quality using automated scoring rubrics, LLM-as-judge patterns, and human correlation validation to ensure your agent produces coherent and helpful responses.

Beyond Task Completion: Measuring How Well Agents Communicate

An agent can solve the user's problem yet still deliver a terrible experience. Rambling answers, contradicting itself mid-conversation, ignoring context from three messages ago, or answering a question nobody asked — these quality failures erode trust even when the final outcome is correct. Conversation quality metrics capture the how of agent communication, not just the what.

The three pillars of conversation quality are coherence (does the response make logical sense and flow naturally?), relevance (does it address what the user actually asked?), and helpfulness (does it provide actionable, useful information?).

Designing Scoring Rubrics

Rubrics convert subjective quality into structured, repeatable measurements. Each dimension gets a clear scale with concrete anchors.

from dataclasses import dataclass
from enum import IntEnum

class QualityScore(IntEnum):
    POOR = 1
    BELOW_AVERAGE = 2
    ADEQUATE = 3
    GOOD = 4
    EXCELLENT = 5

@dataclass
class QualityRubric:
    dimension: str
    anchors: dict[int, str]

COHERENCE_RUBRIC = QualityRubric(
    dimension="coherence",
    anchors={
        1: "Response contradicts itself or is incoherent",
        2: "Mostly understandable but has logical gaps",
        3: "Logically consistent, minor flow issues",
        4: "Well-structured with clear logical flow",
        5: "Perfectly coherent, natural conversational flow",
    },
)

RELEVANCE_RUBRIC = QualityRubric(
    dimension="relevance",
    anchors={
        1: "Completely off-topic or ignores the question",
        2: "Partially addresses the question with filler",
        3: "Addresses the question but includes irrelevant info",
        4: "Directly addresses the question with minimal extras",
        5: "Precisely targets the question with no wasted content",
    },
)

HELPFULNESS_RUBRIC = QualityRubric(
    dimension="helpfulness",
    anchors={
        1: "Provides no useful information or next steps",
        2: "Provides vague or generic information",
        3: "Provides some useful information",
        4: "Provides clear, actionable information",
        5: "Provides exceptional value, anticipates follow-ups",
    },
)

Concrete anchors are essential. Without them, a score of 3 means something different to every evaluator — human or LLM.
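It is worth sanity-checking that every rubric defines all five anchors before it reaches a judge prompt. A minimal sketch: `validate_rubric` is a hypothetical helper, and the `QualityRubric` dataclass from above is restated so the snippet runs standalone.

```python
from dataclasses import dataclass

@dataclass
class QualityRubric:
    dimension: str
    anchors: dict[int, str]

def validate_rubric(rubric: QualityRubric) -> None:
    """Raise ValueError unless the rubric defines non-empty anchors 1 through 5."""
    expected = set(range(1, 6))
    if set(rubric.anchors) != expected:
        raise ValueError(
            f"{rubric.dimension}: anchors must be exactly 1-5, "
            f"got {sorted(rubric.anchors)}"
        )
    for score, desc in rubric.anchors.items():
        if not desc.strip():
            raise ValueError(f"{rubric.dimension}: empty anchor for score {score}")

# Passes silently for a complete rubric
validate_rubric(QualityRubric(
    dimension="coherence",
    anchors={i: f"anchor {i}" for i in range(1, 6)},
))
```

Running this check at import time catches a rubric edit that drops an anchor before it silently skews every judge score downstream.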

Implementing LLM-as-Judge

Using a language model to evaluate another language model's output is surprisingly effective when done carefully. The key is a structured prompt with explicit rubric criteria.

import json

async def llm_judge_quality(
    llm_client,
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
    rubrics: list[QualityRubric],
    model: str = "gpt-4o-mini",
) -> dict:
    rubric_text = ""
    for rubric in rubrics:
        rubric_text += f"\n### {rubric.dimension.title()}\n"
        for score, desc in sorted(rubric.anchors.items()):
            rubric_text += f"  {score}: {desc}\n"

    history_text = "\n".join(
        f"{m['role']}: {m['content']}"
        for m in conversation_history[-6:]
    )

    prompt = f"""You are an expert evaluator for AI agent responses.
Score the agent response on each dimension using the rubric.

## Conversation History (last 6 messages)
{history_text}

## Current User Message
{user_message}

## Agent Response
{agent_response}

## Rubrics
{rubric_text}

Return JSON with this exact structure:
{{
  "coherence": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "relevance": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "helpfulness": {{"score": <1-5>, "reasoning": "<1 sentence>"}}
}}"""

    response = await llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

Setting temperature to 0 improves scoring consistency. Including conversation history is critical — relevance cannot be judged without knowing what came before.
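Even with `response_format={"type": "json_object"}`, validate the judge's output before trusting it. A defensive parsing sketch (`parse_judge_output` and `DIMENSIONS` are hypothetical names, not part of any SDK) that drops a malformed judgment rather than averaging it in:

```python
import json

DIMENSIONS = ("coherence", "relevance", "helpfulness")

def parse_judge_output(raw: str) -> dict[str, int]:
    """Validate the judge's JSON and return {dimension: score}.

    Raises ValueError on missing dimensions or out-of-range scores.
    """
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"missing score for {dim}")
        score = entry["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score out of range for {dim}: {score!r}")
        scores[dim] = score
    return scores

raw = (
    '{"coherence": {"score": 4, "reasoning": "clear flow"},'
    ' "relevance": {"score": 5, "reasoning": "on target"},'
    ' "helpfulness": {"score": 3, "reasoning": "some value"}}'
)
print(parse_judge_output(raw))  # {'coherence': 4, 'relevance': 5, 'helpfulness': 3}
```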


Deterministic Quality Checks

Not every quality signal requires an LLM. Some checks are simple and fast.

import re

def deterministic_quality_checks(
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
) -> dict:
    checks = {}

    # Response length check
    word_count = len(agent_response.split())
    checks["response_length"] = word_count
    checks["too_short"] = word_count < 10
    checks["too_long"] = word_count > 500

    # Repetition detection
    sentences = re.split(r'[.!?]+', agent_response)
    sentences = [s.strip().lower() for s in sentences if s.strip()]
    unique_ratio = (
        len(set(sentences)) / len(sentences) if sentences else 1.0
    )
    # 1.0 = every sentence unique; lower values mean more repetition
    checks["repetition_score"] = round(unique_ratio, 3)

    # Self-contradiction signal (simple heuristic)
    contradiction_phrases = [
        "actually, I was wrong",
        "I apologize, that's incorrect",
        "let me correct myself",
    ]
    checks["self_correction"] = any(
        phrase in agent_response.lower()
        for phrase in contradiction_phrases
    )

    # Question echo detection: strip punctuation so "color?" matches "color"
    if user_message.strip().endswith("?"):
        user_words = set(re.findall(r"[a-z0-9']+", user_message.lower()))
        response_first_sentence = (
            sentences[0] if sentences else ""
        )
        overlap = len(
            user_words & set(re.findall(r"[a-z0-9']+", response_first_sentence))
        )
        checks["question_echo_ratio"] = round(
            overlap / max(len(user_words), 1), 3
        )

    return checks

These fast checks flag obvious problems — extremely short responses, excessive repetition, or the agent echoing the question back instead of answering it. Run them on every response before invoking the more expensive LLM judge.
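One way to wire the two stages together is to gate the LLM judge behind the cheap checks. A sketch assuming the check names above; `should_invoke_judge` is a hypothetical helper, and the thresholds (0.6 repetition, 0.8 echo) are illustrative, not validated:

```python
def should_invoke_judge(checks: dict) -> bool:
    """Gate the expensive LLM judge behind the cheap deterministic checks.

    Responses that fail an obvious check get flagged for review directly;
    everything else proceeds to rubric scoring.
    """
    if checks.get("too_short") or checks.get("too_long"):
        return False
    if checks.get("repetition_score", 1.0) < 0.6:
        return False
    if checks.get("question_echo_ratio", 0.0) > 0.8:
        return False
    return True

print(should_invoke_judge({"too_short": False, "repetition_score": 0.95}))  # True
print(should_invoke_judge({"too_short": True}))  # False
```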

Validating LLM-Judge Correlation with Humans

An LLM judge is only useful if it agrees with human evaluators. Measure this with inter-annotator agreement.

from scipy import stats

def compute_judge_correlation(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    if len(human_scores) != len(llm_scores):
        raise ValueError("Score lists must be same length")
    if not human_scores:
        raise ValueError("Score lists must be non-empty")

    spearman_corr, spearman_p = stats.spearmanr(
        human_scores, llm_scores
    )
    exact_agreement = sum(
        1 for h, l in zip(human_scores, llm_scores) if h == l
    ) / len(human_scores)
    within_one = sum(
        1 for h, l in zip(human_scores, llm_scores)
        if abs(h - l) <= 1
    ) / len(human_scores)

    return {
        "spearman_correlation": round(spearman_corr, 3),
        "spearman_p_value": round(spearman_p, 4),
        "exact_agreement": round(exact_agreement, 3),
        "within_one_agreement": round(within_one, 3),
        "sample_size": len(human_scores),
    }

Target a Spearman correlation of 0.7 or higher and within-one agreement of 85 percent or better. If your LLM judge falls below these thresholds, refine your rubric anchors or switch to a more capable judge model.
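The agreement arithmetic is easy to verify by hand without scipy. A worked example on five toy score pairs, plus a hypothetical `judge_meets_bar` helper encoding the thresholds above:

```python
human = [4, 3, 5, 2, 4]
judge = [4, 4, 5, 1, 3]

# 2 of 5 pairs match exactly; all 5 are within one point
exact = sum(h == j for h, j in zip(human, judge)) / len(human)
within_one = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)
print(exact, within_one)  # 0.4 1.0

def judge_meets_bar(spearman: float, within_one_agreement: float) -> bool:
    """Apply the acceptance bars suggested above."""
    return spearman >= 0.7 and within_one_agreement >= 0.85

print(judge_meets_bar(0.75, within_one))  # True
```

This toy judge would pass the within-one bar but fail on exact agreement, which is typical: within-one agreement is the more forgiving and more stable of the two.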

FAQ

Which LLM should I use as a judge?

Use the strongest model you can afford for judging; GPT-4o or Claude both work well. Avoid using the same model that powers your agent, since judging your own output introduces self-preference bias. That said, a smaller, cheaper model like GPT-4o-mini performs surprisingly well on structured rubric scoring where the anchors are clear and concrete.

How many human annotations do I need to validate the judge?

Annotate at least 100 samples with two independent human raters per sample. Compute inter-rater agreement between humans first to establish your ceiling — the LLM judge cannot be more reliable than humans agree with each other. If human agreement is only 60 percent, do not expect 90 percent from the LLM.
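Raw percent agreement between two raters overstates reliability on a five-point scale because some agreement happens by chance. Cohen's kappa corrects for that; here is a minimal pure-Python sketch (the `cohen_kappa` helper is written for illustration; `sklearn.metrics.cohen_kappa_score` provides the same measure):

```python
from collections import Counter

def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label distribution
    expected = sum(
        (freq_a[k] / n) * (freq_b[k] / n)
        for k in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        return 1.0  # both raters gave a single constant label
    return (observed - expected) / (1 - expected)

rater_1 = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
rater_2 = [1, 2, 3, 4, 5, 1, 2, 3, 4, 4]
print(cohen_kappa(rater_1, rater_2))
```

Here the raters agree on 9 of 10 samples (0.9 raw agreement) but kappa is about 0.875 after discounting chance; with heavily skewed label distributions the gap is much larger.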

Should I score each turn individually or the whole conversation?

Both. Turn-level scoring catches individual bad responses. Conversation-level scoring catches patterns like the agent losing context over time or becoming repetitive across turns. Aggregate turn-level scores to get conversation-level trends, but also run a separate holistic evaluation on the full conversation.
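A sketch of the turn-to-conversation rollup, assuming each turn produced a `{dimension: score}` dict like the judge output above; `aggregate_turn_scores` is a hypothetical helper that reports both the mean and the worst single turn per dimension:

```python
from statistics import mean

def aggregate_turn_scores(turn_scores: list[dict[str, int]]) -> dict:
    """Roll turn-level judge scores up to conversation level.

    Reports the mean per dimension plus the worst single turn, since one
    very bad response can ruin an otherwise fine conversation.
    """
    dims = turn_scores[0].keys()
    return {
        dim: {
            "mean": round(mean(t[dim] for t in turn_scores), 2),
            "min": min(t[dim] for t in turn_scores),
        }
        for dim in dims
    }

turns = [
    {"coherence": 5, "relevance": 4, "helpfulness": 4},
    {"coherence": 4, "relevance": 5, "helpfulness": 3},
    {"coherence": 2, "relevance": 4, "helpfulness": 4},
]
print(aggregate_turn_scores(turns))
```

Tracking the per-dimension minimum alongside the mean surfaces the "one bad turn" failure mode that averages hide.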


#ConversationQuality #LLMasJudge #AgentEvaluation #NLPMetrics #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
