Building Custom Agent Dashboards: Visualizing Conversations, Costs, and Latency

The Key Metrics Every Agent Dashboard Needs

Generic application dashboards track request rate, error rate, and latency. Agent dashboards need those plus metrics unique to LLM workloads: token consumption, cost per conversation, tool call success rates, and conversation completion rates. Without these, you are flying blind on the dimensions that matter most for agent reliability and cost control.

The foundation is a metrics collection layer that captures these signals at the right granularity, and a visualization layer that makes patterns visible at a glance.

Exposing Prometheus Metrics from Your Agent

Use the prometheus_client library to define counters, histograms, and gauges that capture agent-specific signals.

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Conversation metrics
conversations_total = Counter(
    "agent_conversations_total",
    "Total conversations started",
    ["agent_name", "status"],
)

# LLM call metrics
llm_call_duration = Histogram(
    "agent_llm_call_duration_seconds",
    "LLM call latency in seconds",
    ["model", "agent_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

tokens_used = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["model", "token_type"],  # token_type: prompt or completion
)

# Tool metrics
tool_calls_total = Counter(
    "agent_tool_calls_total",
    "Total tool invocations",
    ["tool_name", "status"],
)

# Active conversations gauge
active_conversations = Gauge(
    "agent_active_conversations",
    "Currently active conversations",
    ["agent_name"],
)

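# Expose /metrics on port 9090. Prometheus itself listens on 9090 by default,
# so choose a different port if both run on the same host.
start_http_server(9090)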

Instrumenting the Agent Loop

Wrap the core agent operations to emit metrics on every call.

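import time

# Assumes llm_client (an async OpenAI-style client), execute_tool, and agent
# are defined elsewhere in the application.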

async def instrumented_llm_call(model: str, messages: list, agent_name: str):
    start = time.perf_counter()
    try:
        response = await llm_client.chat.completions.create(
            model=model, messages=messages
        )
        duration = time.perf_counter() - start
        llm_call_duration.labels(model=model, agent_name=agent_name).observe(duration)
        tokens_used.labels(model=model, token_type="prompt").inc(
            response.usage.prompt_tokens
        )
        tokens_used.labels(model=model, token_type="completion").inc(
            response.usage.completion_tokens
        )
        return response
    except Exception:
        # Record latency for failed calls too, then re-raise for the caller.
        duration = time.perf_counter() - start
        llm_call_duration.labels(model=model, agent_name=agent_name).observe(duration)
        raise

async def instrumented_tool_call(tool_name: str, arguments: dict):
    try:
        result = await execute_tool(tool_name, arguments)
        tool_calls_total.labels(tool_name=tool_name, status="success").inc()
        return result
    except Exception:
        tool_calls_total.labels(tool_name=tool_name, status="error").inc()
        raise

async def run_conversation(user_id: str, message: str, agent_name: str):
    active_conversations.labels(agent_name=agent_name).inc()
    try:
        result = await agent.run(message)
        conversations_total.labels(agent_name=agent_name, status="completed").inc()
        return result
    except Exception:
        conversations_total.labels(agent_name=agent_name, status="failed").inc()
        raise
    finally:
        active_conversations.labels(agent_name=agent_name).dec()
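
The success and error branches of instrumented_llm_call record the same latency observation. One way to remove that duplication is a small decorator that times any coroutine and reports the duration in a finally block. This is a stdlib-only sketch: timed and its observe callback are hypothetical names, and a plain list stands in for the histogram. In the agent you would pass something like llm_call_duration.labels(model=..., agent_name=...).observe instead.

```python
import asyncio
import functools
import time
from typing import Any, Awaitable, Callable


def timed(observe: Callable[[float], None]):
    """Time an async function and report elapsed seconds to `observe`.

    `observe` is a hypothetical callback; any callable accepting a float works.
    """
    def decorator(func: Callable[..., Awaitable[Any]]):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Runs on success and on failure, so errors are timed too.
                observe(time.perf_counter() - start)
        return wrapper
    return decorator


durations: list[float] = []  # stand-in for a Histogram in this sketch


@timed(durations.append)
async def fake_llm_call(fail: bool = False) -> str:
    await asyncio.sleep(0.01)
    if fail:
        raise RuntimeError("simulated provider error")
    return "ok"


async def demo() -> str:
    result = await fake_llm_call()
    try:
        await fake_llm_call(fail=True)
    except RuntimeError:
        pass  # the failed call was still timed
    return result


demo_result = asyncio.run(demo())
```

Both the successful and the failing call end up in durations, which is the property you want from latency histograms: slow failures such as timeouts are often the most informative samples.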

Building the Grafana Dashboard

Configure Prometheus as a Grafana data source, then create panels using PromQL queries for each KPI.

Conversation throughput: conversations per minute over time. rate() returns a per-second average, so multiply by 60:

sum by (agent_name) (rate(agent_conversations_total[5m])) * 60
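
To make the unit concrete: rate() yields a per-second average over the window, so a per-minute panel needs a factor of 60. The following is a simplified sketch of that arithmetic, including counter-reset handling. It ignores the window-boundary extrapolation that Prometheus applies, and simple_rate is an illustrative helper, not a Prometheus API.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate per-second rate from (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (process restart): the
    counter is assumed to have restarted from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# 30 conversations over 300 seconds, with a restart in the middle.
samples = [(0, 100), (100, 110), (200, 5), (300, 20)]
per_second = simple_rate(samples)
per_minute = per_second * 60
```

Here per_second is 0.1 and per_minute is 6.0; the restart between the second and third samples does not produce a negative rate.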

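LLM latency P95: the 95th percentile response time by model. Aggregate the bucket rates by le and model so the quantile is computed per model rather than per model and agent pair:

histogram_quantile(0.95, sum by (le, model) (rate(agent_llm_call_duration_seconds_bucket[5m])))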
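
histogram_quantile estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank, which is why the bucket boundaries limit how precise a P95 can be. A rough sketch of that estimate follows; estimate_quantile is an illustrative name and the counts are made up.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound, with the last one acting as +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into +Inf; report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative counts for a subset of the bucket bounds defined earlier.
observed = [(0.5, 40), (1.0, 70), (2.0, 90), (5.0, 98), (float("inf"), 100)]
p95 = estimate_quantile(0.95, observed)
```

With these counts the 95th observation falls in the 2.0 to 5.0 second bucket, so the estimate (3.875) can be anywhere in that 3-second range in reality. If P95 matters around a specific threshold, add bucket bounds near it.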

Token burn rate: tokens per minute, split by prompt vs completion:

sum by (token_type) (rate(agent_tokens_total[5m])) * 60

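Cost estimation panel: multiply token rates by per-token pricing, either directly in the query or through a recording rule. Each side is wrapped in sum() because the prompt and completion series carry different token_type labels and would not match in a bare addition. The result is in dollars per second, and the prices shown are examples to replace with your provider's current rates:

sum(rate(agent_tokens_total{token_type="prompt", model="gpt-4o"}[5m])) * 0.0000025
+
sum(rate(agent_tokens_total{token_type="completion", model="gpt-4o"}[5m])) * 0.00001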

Tool error rate — percentage of tool calls that fail:

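# Aggregate both sides by tool_name so the status label does not restrict the denominator to errors
sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[5m]))
/ sum by (tool_name) (rate(agent_tool_calls_total[5m]))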

Setting Up Alerts

Define Prometheus alerting rules that fire when agent KPIs breach thresholds.

# prometheus-alerts.yaml
groups:
  - name: agent_alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(agent_llm_call_duration_seconds_bucket[5m])) > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency exceeds 5 seconds"

      - alert: HighToolErrorRate
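        # Sum by tool_name on both sides so the denominator counts all statuses.
        expr: >
          sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[10m]))
          / sum by (tool_name) (rate(agent_tool_calls_total[10m])) > 0.1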
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool error rate above 10%"
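
The for field means the expression must be true at every rule evaluation for that long before the alert fires; a single false evaluation resets the timer. Below is a minimal simulation of that pending-to-firing behaviour, assuming a fixed evaluation interval. The function and state names are illustrative.

```python
def alert_states(results: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each evaluation of the rule expression."""
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, is_true in enumerate(results):
        now = i * interval_s
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluate every 60 s with `for: 3m`; a dip at the third evaluation resets the timer.
states = alert_states([True, True, False, True, True, True, True], 60, 180)
```

In this run the alert only reaches firing on the last evaluation, three minutes after the condition became true again. A longer for value trades slower detection for fewer pages on brief latency spikes.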

FAQ

How many Prometheus labels should I use per metric?

Keep label cardinality low. Labels like model, agent_name, and status are fine because they have a small, bounded set of values. Never use labels with high cardinality like user_id or conversation_id — these will cause Prometheus memory and performance issues. Track per-user data in a separate analytics database instead.
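
A quick way to sanity-check a label set is to multiply the number of distinct values per label, since each combination can become its own time series. The counts below are invented for illustration, and max_series is a hypothetical helper.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_value_counts.values())


bounded = max_series({"model": 4, "agent_name": 10, "status": 2})
with_user_id = max_series(
    {"model": 4, "agent_name": 10, "status": 2, "user_id": 50_000}
)
```

The bounded label set tops out at 80 series, while adding user_id raises the ceiling to 4,000,000. For histograms the number grows further, because every bucket is stored as a separate series.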

Should I track metrics in the agent code or use a sidecar?

Instrument directly in the agent code for LLM-specific metrics like token counts and tool call results, because only the application has that context. Use a sidecar or service mesh for infrastructure metrics like HTTP request rate and network latency. The two approaches complement each other.

How do I estimate costs when using multiple models?

Create a pricing lookup that maps model names to per-token costs, then apply it as a Grafana transformation or Prometheus recording rule. Update the pricing table whenever your provider changes rates. Some teams store costs in a database and join with token metrics in Grafana for more flexibility.
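
As a rough sketch of such a pricing lookup in application code: the table below maps model names to per-token prices and sums the cost of a batch of usage rows. The gpt-4o figures mirror the PromQL example earlier in the article; "small-model" and its prices are hypothetical, and estimate_cost is an illustrative name.

```python
from decimal import Decimal

# USD per token. Check your provider's current rates before relying on these.
PRICING = {
    "gpt-4o": {"prompt": Decimal("0.0000025"), "completion": Decimal("0.00001")},
    # Hypothetical second model, included only to show a multi-model table.
    "small-model": {"prompt": Decimal("0.0000002"), "completion": Decimal("0.0000008")},
}


def estimate_cost(usage: list[tuple[str, str, int]]) -> Decimal:
    """Sum cost over (model, token_type, token_count) rows.

    Unknown models or token types raise KeyError rather than being priced at zero.
    """
    return sum(
        (PRICING[model][token_type] * count for model, token_type, count in usage),
        Decimal("0"),
    )


cost = estimate_cost([
    ("gpt-4o", "prompt", 12_000),
    ("gpt-4o", "completion", 3_000),
    ("small-model", "prompt", 50_000),
])
```

Decimal avoids the rounding drift that accumulates when very small float prices are multiplied by large token counts. Raising on an unknown model is a deliberate choice here, so that a newly deployed model without a pricing entry is noticed instead of silently reported as free.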
