---
title: "Measuring Agent User Experience: CSAT, SUS, and Custom UX Metrics for AI Products"
description: "Build a comprehensive UX measurement framework for AI agents using CSAT surveys, System Usability Scale, custom behavioral metrics, A/B testing strategies, and analytics pipelines."
canonical: https://callsphere.ai/blog/measuring-agent-user-experience-csat-sus-custom-ux-metrics-ai-products
category: "Learn Agentic AI"
tags: ["UX Metrics", "CSAT", "Analytics", "AI Agents", "A/B Testing"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.051Z
---

# Measuring Agent User Experience: CSAT, SUS, and Custom UX Metrics for AI Products

> Build a comprehensive UX measurement framework for AI agents using CSAT surveys, System Usability Scale, custom behavioral metrics, A/B testing strategies, and analytics pipelines.

## You Cannot Improve What You Cannot Measure

Building a great AI agent UX requires continuous measurement. Intuition and user complaints are not enough — you need quantitative metrics that track experience quality over time, surface regressions quickly, and provide actionable data for improvement.

AI agents present unique measurement challenges. Traditional web analytics (page views, click-through rates) do not capture conversational quality. You need a layered approach combining survey-based metrics, behavioral signals, and AI-specific quality indicators.

## CSAT: Customer Satisfaction Score

CSAT is the most straightforward UX metric. Ask users to rate their experience on a 1-5 scale at the end of an interaction:

```mermaid
flowchart LR
    DONE(["Task completed"])
    CHECK{"Surveyed this
session already?"}
    SKIP(["Skip survey"])
    ASK["Ask for 1-5 rating"]
    STORE[("Store rating
and comment")]
    AGG["Aggregate CSAT
percent of 4-5 ratings"]
    GATE{"CSAT below
70 percent?"}
    FLAG(["Investigate systemic
UX problem"])
    OK(["Keep monitoring"])
    DONE --> CHECK
    CHECK -->|Yes| SKIP
    CHECK -->|No| ASK --> STORE --> AGG --> GATE
    GATE -->|Yes| FLAG
    GATE -->|No| OK
    style ASK fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style FLAG fill:#dc2626,stroke:#b91c1c,color:#fff
    style OK fill:#059669,stroke:#047857,color:#fff
```

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class SurveyTrigger(Enum):
    TASK_COMPLETED = "task_completed"
    HUMAN_ESCALATION = "human_escalation"
    SESSION_END = "session_end"
    ERROR_RECOVERY = "error_recovery"

@dataclass
class CSATSurvey:
    conversation_id: str
    trigger: SurveyTrigger
    rating: int | None           # 1-5
    comment: str | None
    timestamp: datetime
    task_type: str
    turns_in_conversation: int

class CSATCollector:
    """Collect and analyze CSAT scores for agent interactions."""

    SURVEY_MESSAGES = {
        SurveyTrigger.TASK_COMPLETED: (
            "I'm glad I could help! On a scale of 1-5, "
            "how would you rate your experience today?"
        ),
        SurveyTrigger.HUMAN_ESCALATION: (
            "Before I transfer you, could you rate your experience "
            "with me so far? (1-5, 5 being excellent)"
        ),
        SurveyTrigger.ERROR_RECOVERY: (
            "I know we hit a bump earlier. Now that it's resolved, "
            "how would you rate the overall experience? (1-5)"
        ),
    }

    def should_survey(
        self,
        conversation_id: str,
        trigger: SurveyTrigger,
        recent_survey_count: int,
    ) -> bool:
        """Avoid survey fatigue — limit frequency."""
        if recent_survey_count >= 1:
            return False  # Max one survey per session
        if trigger == SurveyTrigger.SESSION_END:
            return True
        # Survey any other trigger that has a prompt defined
        # (task completion, escalation, error recovery)
        return trigger in self.SURVEY_MESSAGES

    def calculate_csat_score(self, surveys: list[CSATSurvey]) -> dict:
        """Calculate CSAT percentage (% of 4 and 5 ratings)."""
        rated = [s for s in surveys if s.rating is not None]
        if not rated:
            return {"score": None, "sample_size": 0}

        satisfied = sum(1 for s in rated if s.rating >= 4)
        return {
            "score": round((satisfied / len(rated)) * 100, 1),
            "sample_size": len(rated),
            "average_rating": round(
                sum(s.rating for s in rated) / len(rated), 2
            ),
        }
```

Target a CSAT score of 80% or higher. Below 70% indicates a systemic UX problem.
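If you want to verify the arithmetic, the CSAT formula reduces to a one-liner. A standalone sketch (`csat_from_ratings` is an illustrative helper, separate from the collector class above):

```python
def csat_from_ratings(ratings: list[int]) -> float:
    """CSAT = percentage of 1-5 ratings that are 4 or 5."""
    satisfied = sum(1 for r in ratings if r >= 4)
    return round(satisfied / len(ratings) * 100, 1)

# Three of four respondents rated 4 or higher
print(csat_from_ratings([5, 4, 3, 5]))  # -> 75.0
```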

## System Usability Scale (SUS)

SUS is a standardized 10-question survey that produces a score from 0-100. It is ideal for periodic deep-dive assessments of your agent's usability:

```python
SUS_QUESTIONS = [
    "I think I would like to use this AI assistant frequently.",
    "I found the AI assistant unnecessarily complex.",
    "I thought the AI assistant was easy to use.",
    "I think I would need technical support to use this assistant.",
    "I found the various capabilities were well integrated.",
    "I thought there was too much inconsistency in this assistant.",
    "I imagine most people would learn to use this assistant quickly.",
    "I found the assistant very cumbersome to use.",
    "I felt very confident using the assistant.",
    "I needed to learn a lot before I could use this assistant.",
]

# Questions alternate between positive and negative framing
POSITIVE_QUESTIONS = {0, 2, 4, 6, 8}  # 0-indexed

def calculate_sus_score(responses: list[int]) -> float:
    """
    Calculate SUS score from 10 responses (each 1-5).
    Score ranges from 0 to 100. Above 68 is above average.
    Above 80 is excellent.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")

    adjusted = []
    for i, response in enumerate(responses):
        if i in POSITIVE_QUESTIONS:
            adjusted.append(response - 1)      # Positive: score - 1
        else:
            adjusted.append(5 - response)      # Negative: 5 - score

    return sum(adjusted) * 2.5

def interpret_sus_score(score: float) -> str:
    if score >= 80.3:
        return "Excellent (Grade A)"
    elif score >= 68:
        return "Good (Grade C) — above average"
    elif score >= 51:
        return "OK (Grade D) — below average, needs improvement"
    else:
        return "Poor (Grade F) — significant usability issues"
```
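A worked example makes the alternating scoring rule concrete. This condensed standalone version mirrors `calculate_sus_score` above and checks two boundary inputs:

```python
POSITIVE = {0, 2, 4, 6, 8}  # 0-indexed positively framed items

def sus(responses: list[int]) -> float:
    """SUS score for exactly 10 responses on a 1-5 scale."""
    assert len(responses) == 10
    adjusted = [r - 1 if i in POSITIVE else 5 - r for i, r in enumerate(responses)]
    return sum(adjusted) * 2.5

# All-neutral answers (3 everywhere): every item contributes 2 -> 50.0
print(sus([3] * 10))
# Ideal respondent: 5 on positive items, 1 on negative items -> 100.0
print(sus([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))
```

Note that an all-neutral respondent lands at 50, well below the 68-point average; SUS scores cluster high, which is why 68, not 50, is the midpoint to beat.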

## Custom Behavioral Metrics for AI Agents

Survey metrics capture stated satisfaction. Behavioral metrics capture actual usage patterns:

```python
@dataclass
class ConversationMetrics:
    conversation_id: str
    started_at: datetime
    ended_at: datetime
    total_turns: int
    user_turns: int
    agent_turns: int
    task_completed: bool
    escalated_to_human: bool
    errors_encountered: int
    errors_recovered: int
    clarification_questions_asked: int
    follow_up_prompts_clicked: int
    user_rephrased_count: int     # Times user had to rephrase
    time_to_first_value: float    # Seconds to first useful response
    idle_gaps: list[float]        # Seconds between user messages

def calculate_behavioral_health(
    metrics: list[ConversationMetrics],
) -> dict:
    """Calculate aggregate behavioral health indicators."""

    total = len(metrics)
    if total == 0:
        return {}

    task_completion_rate = (
        sum(1 for m in metrics if m.task_completed) / total * 100
    )

    escalation_rate = (
        sum(1 for m in metrics if m.escalated_to_human) / total * 100
    )

    avg_turns_to_completion = (
        sum(m.total_turns for m in metrics if m.task_completed)
        / max(sum(1 for m in metrics if m.task_completed), 1)
    )

    avg_rephrase_rate = (
        sum(m.user_rephrased_count for m in metrics) / total
    )

    avg_time_to_value = (
        sum(m.time_to_first_value for m in metrics) / total
    )

    error_recovery_rate = (
        sum(m.errors_recovered for m in metrics)
        / max(sum(m.errors_encountered for m in metrics), 1)
        * 100
    )

    return {
        "task_completion_rate": round(task_completion_rate, 1),
        "escalation_rate": round(escalation_rate, 1),
        "avg_turns_to_completion": round(avg_turns_to_completion, 1),
        "avg_rephrase_rate": round(avg_rephrase_rate, 2),
        "avg_time_to_value_seconds": round(avg_time_to_value, 1),
        "error_recovery_rate": round(error_recovery_rate, 1),
    }
```

Key thresholds to watch: task completion rate below 70% means the agent is failing its core job. Rephrase rate above 1.5 per conversation means the agent is not understanding users. Time to first value above 30 seconds means the onboarding or first response is too slow.
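These thresholds can be encoded directly as a health check. A minimal sketch (the function name and flag strings are illustrative):

```python
def behavioral_red_flags(
    task_completion_rate: float,  # percent of conversations completed
    rephrase_rate: float,         # average rephrases per conversation
    time_to_value: float,         # seconds to first useful response
) -> list[str]:
    """Return which of the thresholds above are violated."""
    flags = []
    if task_completion_rate < 70:
        flags.append("task completion below 70%")
    if rephrase_rate > 1.5:
        flags.append("rephrase rate above 1.5")
    if time_to_value > 30:
        flags.append("time to first value above 30s")
    return flags

print(behavioral_red_flags(82.0, 1.8, 12.0))  # -> ['rephrase rate above 1.5']
```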

## A/B Testing UX Changes

Test UX changes rigorously before rolling them out:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    test_id: str
    variants: dict[str, dict]   # variant_name -> config
    traffic_split: dict[str, float]  # variant_name -> percentage (0-1)
    primary_metric: str
    minimum_sample_size: int

def assign_variant(user_id: str, test: ABTestConfig) -> str:
    """Deterministically assign a user to a test variant."""
    hash_input = f"{test.test_id}:{user_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 1000) / 1000.0

    cumulative = 0.0
    for variant, split in test.traffic_split.items():
        cumulative += split
        if bucket < cumulative:
            return variant
    # Fallback in case the splits do not sum to exactly 1.0
    return next(iter(test.traffic_split))
```

Run each variant until `minimum_sample_size` is reached before reading results; stopping a test early on a promising trend inflates the false-positive rate.

## Alerting on Metric Regressions

Aggregate the survey and behavioral metrics into a periodic dashboard snapshot, and generate alerts whenever a metric crosses a threshold:

```python
@dataclass
class UXDashboard:
    csat_score: float          # CSAT percentage (share of 4-5 ratings)
    escalation_rate: float     # Percent of conversations escalated
    avg_rephrase_rate: float   # Average rephrases per conversation
    avg_time_to_value: float   # Seconds to first useful response

def generate_alerts(dashboard: UXDashboard) -> list[str]:
    """Generate alerts when metrics cross thresholds."""
    alerts = []

    if dashboard.csat_score < 70:
        alerts.append(
            f"CSAT at {dashboard.csat_score}% — "
            "below the 70% systemic-problem threshold"
        )

    if dashboard.escalation_rate > 30:
        alerts.append(
            f"Escalation rate at {dashboard.escalation_rate}% — "
            "agent may be failing common intents"
        )

    if dashboard.avg_rephrase_rate > 2.0:
        alerts.append(
            f"Users rephrasing {dashboard.avg_rephrase_rate}x on average — "
            "NLU needs tuning"
        )

    if dashboard.avg_time_to_value > 45:
        alerts.append(
            f"Time to value at {dashboard.avg_time_to_value}s — "
            "first response too slow"
        )

    return alerts
```

Wire these alerts into your team's notification system (Slack, PagerDuty) so regressions are caught the same day they happen.

## FAQ

### How often should I collect CSAT surveys without causing survey fatigue?

Limit surveys to one per user session and no more than once per week for the same user. Rotate between end-of-task surveys and periodic in-depth surveys (like SUS). A 10-15% survey response rate is normal for in-product surveys — do not try to survey everyone. If your response rate drops below 5%, your survey prompt is too intrusive or too frequent.

### What is the most important single metric for agent UX?

Task completion rate. If users cannot complete the task they came for, no amount of personality, formatting, or speed matters. Track it by task type (order lookup, returns, FAQ) so you can identify which specific flows are broken. A high overall completion rate can mask a 20% completion rate on a specific task that affects thousands of users.
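Per-task-type completion tracking is a simple group-by. A standalone sketch (`completion_by_task` is an illustrative helper):

```python
from collections import defaultdict

def completion_by_task(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records: (task_type, completed) pairs -> completion percent per task type."""
    totals: dict[str, int] = defaultdict(int)
    done: dict[str, int] = defaultdict(int)
    for task, completed in records:
        totals[task] += 1
        if completed:
            done[task] += 1
    return {t: round(done[t] / totals[t] * 100, 1) for t in totals}

print(completion_by_task([
    ("order_lookup", True), ("order_lookup", True),
    ("returns", True), ("returns", False),
]))  # -> {'order_lookup': 100.0, 'returns': 50.0}
```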

### How do I isolate whether a UX change or a model change caused a metric shift?

Never ship a UX change and a model change simultaneously. If your A/B test changes the greeting format at the same time you update the underlying model, you cannot attribute the metric movement. Use staged rollouts: ship the model change first, let metrics stabilize for a week, then launch the UX A/B test. If you must do both, use a 2x2 factorial design (old model + old UX, old model + new UX, new model + old UX, new model + new UX), but this requires 4x the sample size.
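The 2x2 factorial assignment can reuse the deterministic-hashing idea from the A/B testing section: hash each factor with its own salt so the two splits are independent. A sketch with illustrative names:

```python
import hashlib

def bucket(user_id: str, factor: str) -> str:
    """Deterministic 50/50 split for one factor (e.g. model or UX)."""
    h = int(hashlib.sha256(f"{factor}:{user_id}".encode()).hexdigest(), 16)
    return "new" if (h % 1000) / 1000.0 < 0.5 else "old"

def factorial_cell(user_id: str) -> tuple[str, str]:
    """Hash each factor independently -> four (model, ux) cells."""
    return (bucket(user_id, "model"), bucket(user_id, "ux"))
```

Because each factor uses its own hash salt, a user's model assignment tells you nothing about their UX assignment, which is exactly the independence a factorial analysis requires.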

---

#UXMetrics #CSAT #Analytics #AIAgents #ABTesting #AgenticAI #LearnAI #AIEngineering

