Building Custom Agent Dashboards: Visualizing Conversations, Costs, and Latency

By Sagar Shankaran, Founder of CallSphere

Quick answer

Build production-grade Grafana dashboards for AI agent systems that visualize conversation throughput, per-model costs, LLM latency percentiles, and tool usage patterns using Prometheus metrics.

The Key Metrics Every Agent Dashboard Needs

Generic application dashboards track request rate, error rate, and latency. Agent dashboards need those plus metrics unique to LLM workloads: token consumption, cost per conversation, tool call success rates, and conversation completion rates. Without these, you are flying blind on the dimensions that matter most for agent reliability and cost control.

The foundation is a metrics collection layer that captures these signals at the right granularity, and a visualization layer that makes patterns visible at a glance.

Exposing Prometheus Metrics from Your Agent

Use the prometheus_client library to define counters, histograms, and gauges that capture agent-specific signals.

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Conversation metrics
conversations_total = Counter(
    "agent_conversations_total",
    "Total conversations finished, labeled by outcome",
    ["agent_name", "status"],  # status: completed or failed
)

# LLM call metrics
llm_call_duration = Histogram(
    "agent_llm_call_duration_seconds",
    "LLM call latency in seconds",
    ["model", "agent_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

tokens_used = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["model", "token_type"],  # token_type: prompt or completion
)

# Tool metrics
tool_calls_total = Counter(
    "agent_tool_calls_total",
    "Total tool invocations",
    ["tool_name", "status"],
)

# Active conversations gauge
active_conversations = Gauge(
    "agent_active_conversations",
    "Currently active conversations",
    ["agent_name"],
)

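# Expose /metrics on port 9090. Prometheus itself listens on 9090 by default,
# so choose a different port if both run on the same host.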
start_http_server(9090)
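
Before pointing Prometheus at the endpoint, it can help to check locally what the client will expose. The following is a minimal sketch, assuming prometheus_client is installed; it re-declares two of the metrics on a private registry so it does not collide with the definitions above, and the recorded values and the "support" agent name are made up for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this check isolated from the process-wide default registry.
registry = CollectorRegistry()

tokens = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["model", "token_type"],
    registry=registry,
)
latency = Histogram(
    "agent_llm_call_duration_seconds",
    "LLM call latency in seconds",
    ["model", "agent_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
    registry=registry,
)

# Record a few illustrative values (invented for this check).
tokens.labels(model="gpt-4o", token_type="prompt").inc(120)
latency.labels(model="gpt-4o", agent_name="support").observe(0.5)
latency.labels(model="gpt-4o", agent_name="support").observe(1.5)

# generate_latest returns the text format that Prometheus scrapes from /metrics.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)

# get_sample_value reads a single series back, which is handy in unit tests.
prompt_tokens = registry.get_sample_value(
    "agent_tokens_total", {"model": "gpt-4o", "token_type": "prompt"}
)
```

The printed text shows one `_bucket` series per bucket bound plus `_sum` and `_count` for the histogram, which is worth keeping in mind when choosing labels.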

Instrumenting the Agent Loop

Wrap the core agent operations to emit metrics on every call.

import time

# Assumes llm_client is an async OpenAI-compatible client created elsewhere;
# the article does not show its construction.
async def instrumented_llm_call(model: str, messages: list, agent_name: str):
    start = time.perf_counter()
    try:
        response = await llm_client.chat.completions.create(
            model=model, messages=messages
        )
    finally:
        # Record latency for successful and failed calls alike.
        llm_call_duration.labels(model=model, agent_name=agent_name).observe(
            time.perf_counter() - start
        )
    # usage can be absent on some responses, so guard before counting tokens.
    if response.usage is not None:
        tokens_used.labels(model=model, token_type="prompt").inc(
            response.usage.prompt_tokens
        )
        tokens_used.labels(model=model, token_type="completion").inc(
            response.usage.completion_tokens
        )
    return response

async def instrumented_tool_call(tool_name: str, arguments: dict):
    try:
        result = await execute_tool(tool_name, arguments)
        tool_calls_total.labels(tool_name=tool_name, status="success").inc()
        return result
    except Exception:
        tool_calls_total.labels(tool_name=tool_name, status="error").inc()
        raise

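# Assumes `agent` is your agent object, defined elsewhere. user_id is not used
# as a metric label, which keeps label cardinality low.
async def run_conversation(user_id: str, message: str, agent_name: str):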
    active_conversations.labels(agent_name=agent_name).inc()
    try:
        result = await agent.run(message)
        conversations_total.labels(agent_name=agent_name, status="completed").inc()
        return result
    except Exception:
        conversations_total.labels(agent_name=agent_name, status="failed").inc()
        raise
    finally:
        active_conversations.labels(agent_name=agent_name).dec()
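
Timing code like the LLM wrapper above has to record latency whether the call succeeds or raises. One way to keep that in a single place, sketched here with only the standard library, is a small context manager that reports elapsed time to any callback. The name `observe_duration` is hypothetical and not part of prometheus_client; a plain list stands in for the histogram so the sketch runs on its own.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def observe_duration(observe: Callable[[float], None]) -> Iterator[None]:
    """Report elapsed wall-clock seconds to `observe`, even if the body raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observe(time.perf_counter() - start)


# Stand-in for a histogram: collect observations in a list for this demo.
recorded: list[float] = []

with observe_duration(recorded.append):
    time.sleep(0.01)  # pretend this is the LLM call

try:
    with observe_duration(recorded.append):
        raise RuntimeError("simulated provider error")
except RuntimeError:
    pass  # the failure still produced a latency observation

print(len(recorded))
```

With the metrics defined earlier, the callback would be `llm_call_duration.labels(model=model, agent_name=agent_name).observe`. prometheus_client histograms also provide a built-in `time()` helper that covers the simple case.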

Building the Grafana Dashboard

Configure Prometheus as a Grafana data source, then create panels using PromQL queries for each KPI.

Conversation throughput: conversations per minute by agent (rate() is per second, hence the factor of 60):

sum by (agent_name) (rate(agent_conversations_total[5m])) * 60

LLM latency P95 (the 95th percentile response time, aggregated by model):

histogram_quantile(0.95, sum by (le, model) (rate(agent_llm_call_duration_seconds_bucket[5m])))
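
Because histogram_quantile works from bucket counts rather than raw samples, the P95 it reports is an estimate interpolated inside one bucket. The sketch below reproduces that linear interpolation in plain Python to show how bucket bounds shape the result. The function name `estimate_quantile` and the sample counts are invented for illustration; the bucket bounds are the ones defined earlier.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket. Expects 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank lands in +Inf: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


bounds = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, math.inf]
cumulative = [0, 5, 20, 60, 90, 98, 100, 100, 100]  # made-up sample counts
p95 = estimate_quantile(0.95, list(zip(bounds, cumulative)))
print(p95)
```

With these counts the 95th observation falls in the 2.0 to 5.0 second bucket, so the estimate can only land somewhere in that range. If a threshold such as 5 seconds matters for alerting, keeping a bucket edge at or near that value makes the estimate more trustworthy around it.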

Token burn rate (tokens per minute, split by prompt vs completion):

sum by (token_type) (rate(agent_tokens_total[5m])) * 60

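Cost estimation panel: multiply token rates by per-token pricing, either inline or through a recording rule or Grafana transformation. Each side is wrapped in sum() because the prompt and completion series carry different token_type labels and would otherwise fail to match in the addition. The result is in USD per second; multiply by 3600 for an hourly figure:

sum(rate(agent_tokens_total{token_type="prompt", model="gpt-4o"}[5m])) * 0.0000025
+
sum(rate(agent_tokens_total{token_type="completion", model="gpt-4o"}[5m])) * 0.00001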

Tool error rate: the fraction of tool calls that fail (multiply by 100 for a percentage):

sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[5m]))
/ sum by (tool_name) (rate(agent_tool_calls_total[5m]))
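
The ratio this panel is after, errors divided by all calls for each tool, can be sketched in plain Python. The tool names and per-second rates below are made up for illustration; they stand in for what rate() would return per (tool_name, status) series.

```python
from collections import defaultdict

# Hypothetical per-second rates keyed by (tool_name, status).
rates = {
    ("search_kb", "success"): 4.5,
    ("search_kb", "error"): 0.5,
    ("book_appointment", "success"): 1.0,
}


def error_ratio_by_tool(rates):
    """Sum rates per tool first, then divide errors by the total."""
    errors = defaultdict(float)
    totals = defaultdict(float)
    for (tool, status), per_second in rates.items():
        totals[tool] += per_second
        if status == "error":
            errors[tool] += per_second
    return {tool: errors[tool] / total for tool, total in totals.items() if total > 0}


ratios = error_ratio_by_tool(rates)
print(ratios)
```

The key step is aggregating by tool before dividing, so that error series are compared against all calls rather than against themselves. One difference from this sketch: in PromQL a tool that has never recorded an error has no error series, so the division returns no sample for it instead of zero.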

Setting Up Alerts

Define Prometheus alerting rules that fire when agent KPIs breach thresholds.

# prometheus-alerts.yaml
groups:
  - name: agent_alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(agent_llm_call_duration_seconds_bucket[5m])) > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency exceeds 5 seconds"

      - alert: HighToolErrorRate
        expr: >
          sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[10m]))
          / sum by (tool_name) (rate(agent_tool_calls_total[10m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool error rate above 10%"

FAQ

How many Prometheus labels should I use per metric?

Keep label cardinality low. Labels like model, agent_name, and status are fine because they have a small, bounded set of values. Never use labels with high cardinality like user_id or conversation_id — these will cause Prometheus memory and performance issues. Track per-user data in a separate analytics database instead.
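
A rough way to see why this matters is to multiply out the label values. The sketch below estimates an upper bound on the number of time series a metric can produce; the function name and the label counts (3 models, 5 agents, 10,000 users) are assumptions chosen for illustration, and real counts are lower when not every combination occurs.

```python
from math import prod


def estimated_series(label_value_counts, histogram_buckets=None):
    """Upper bound on series for one metric.

    label_value_counts maps each label name to its number of distinct values.
    For histograms, pass the number of finite bucket bounds.
    """
    combinations = prod(label_value_counts.values())
    if histogram_buckets is None:
        return combinations
    # One _bucket series per finite bound, plus +Inf, plus _sum and _count.
    return combinations * (histogram_buckets + 3)


bounded = estimated_series({"model": 3, "agent_name": 5}, histogram_buckets=8)
with_user_id = estimated_series(
    {"model": 3, "agent_name": 5, "user_id": 10_000}, histogram_buckets=8
)
print(bounded, with_user_id)
```

Under these assumptions the latency histogram stays in the low hundreds of series with bounded labels, and grows past a million once a per-user label is added.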

Should I track metrics in the agent code or use a sidecar?

Instrument directly in the agent code for LLM-specific metrics like token counts and tool call results, because only the application has that context. Use a sidecar or service mesh for infrastructure metrics like HTTP request rate and network latency. The two approaches complement each other.

How do I estimate costs when using multiple models?

Create a pricing lookup that maps model names to per-token costs, then apply it as a Grafana transformation or Prometheus recording rule. Update the pricing table whenever your provider changes rates. Some teams store costs in a database and join with token metrics in Grafana for more flexibility.
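
As a minimal sketch of such a pricing lookup, the table below maps (model, token type) to a per-token price and reports any model it cannot price. The gpt-4o values mirror the cost panel earlier in the article; "example-small-model", its prices, and the function name are hypothetical placeholders. Decimal is used so that very small per-token prices do not pick up floating-point noise.

```python
from decimal import Decimal

# USD per token. Replace these with your provider's current rates.
PRICE_PER_TOKEN_USD = {
    ("gpt-4o", "prompt"): Decimal("0.0000025"),
    ("gpt-4o", "completion"): Decimal("0.00001"),
    ("example-small-model", "prompt"): Decimal("0.0000002"),      # hypothetical
    ("example-small-model", "completion"): Decimal("0.0000008"),  # hypothetical
}


def estimate_cost_usd(token_counts):
    """Return (total_cost, unpriced_keys) for {(model, token_type): tokens}."""
    total = Decimal("0")
    unpriced = []
    for key, tokens in token_counts.items():
        price = PRICE_PER_TOKEN_USD.get(key)
        if price is None:
            unpriced.append(key)  # surface gaps instead of silently costing zero
            continue
        total += price * tokens
    return total, unpriced


usage = {
    ("gpt-4o", "prompt"): 1_000_000,
    ("gpt-4o", "completion"): 200_000,
    ("unlisted-model", "prompt"): 5_000,
}
total, unpriced = estimate_cost_usd(usage)
print(total, unpriced)
```

Returning the unpriced keys alongside the total is a design choice for this sketch: a new model that is missing from the table then shows up as a visible gap rather than as an artificially low cost.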



