Tool Selection Accuracy: The Eval Most Teams Skip — and Should Not (2026)
Your agent picked the wrong tool 12% of the time and the final answer was still right. That's a latent bug. Here's the eval pipeline that surfaces it.
TL;DR
Final-answer correctness is the metric every agent team starts with. It is also the metric that hides the largest class of latent bugs in production agents: the model picked the wrong tool, the wrong tool happened to return enough information, and the assistant's natural-language reply was indistinguishable from a correct one. We instrumented our voice and chat agent platform for tool-selection accuracy in late 2025 and immediately found a 12% wrong-tool rate on a production agent whose final-answer eval was 0.97. The bugs were real — duplicated database writes, occasional hallucinated facts, and a slow cost regression from over-tooling. This post is the eval pipeline that catches it: a labeled dataset of (input → expected_tool, expected_args), a deterministic grader for tool name + arg shape, an LLM judge for arg semantics, and a confusion matrix you can stare at until the failure modes become obvious.
Why Final-Answer Eval Misses This
Consider a customer-service agent with two tools:
- get_order_status(order_id) — returns shipping status.
- get_order_history(user_id) — returns the user's last 10 orders.
User says: "where's my package?" The user has exactly one open order. The model picks get_order_history, gets back a list of one order, extracts the status, and answers correctly. Final-answer eval: 1.0.
The cost: get_order_history is 3× more expensive (more rows, larger tokenized result), it leaks PII (the other 9 historical orders) into the model's context window, and on the day the user has 10 open orders the agent will pick the most recent — not the one the user is actually asking about. The bug is dormant until the data distribution shifts.
This is the canonical case for tool-selection eval. You catch it now or you catch it during an incident.
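For concreteness, here is roughly how those two tools might be declared as OpenAI-style function schemas. The tool names come from the example above; the description text is hypothetical, and the second one already hints at the fix this post keeps returning to: say when not to use the expensive tool.

```python
# Hypothetical declarations for the two tools above (OpenAI-style function schemas).
# The description is the only signal the model has for choosing between them, so the
# cheap, narrow tool should say exactly when it applies and the expensive one when it does not.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the shipping status of a single order. "
                           "Use when the user asks where a current order or package is.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_history",
            "description": "Return the user's last 10 orders. Use only when the user "
                           "explicitly asks about past orders, not the status of a current one.",
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
            },
        },
    },
]
```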
The Three Failure Modes
Across the agents we have evaluated for our healthcare and real-estate deployments, tool-selection failures cluster into exactly three buckets:
| Failure mode | Frequency in our data | Final-answer eval catches it? |
|---|---|---|
| Over-tooling — calling more tools than needed (or a more expensive variant) | 6.4% | No |
| Wrong arg shape — right tool, malformed arguments (extra fields, wrong types) | 3.1% | Sometimes (if it errors) |
| Hallucinated tool name — model invents a tool that does not exist | 1.8% | Sometimes (SDK errors) |
| (Total wrong-tool rate) | ~11.3% | — |
The over-tooling case is by far the most expensive. It does not error, it does not produce wrong final answers, and it inflates token cost by 15–40% on a typical conversation. We saw a 22% cost regression on one agent before we built this eval; in retrospect the fix was a one-line tool description tightening.
The Eval Pipeline
```mermaid
flowchart LR
    A[Production traces in LangSmith] --> B[Sample 200-500 conversations]
    B --> C["Human label: expected_tool + expected_args per turn"]
    C --> D[Dataset committed to LangSmith]
    D --> E[Run agent on dataset inputs]
    E --> F["Deterministic grader: name match + arg-shape match"]
    E --> G["LLM judge: arg semantic equivalence"]
    F --> H[Per-tool confusion matrix]
    G --> H
    H --> I{Any tool below threshold?}
    I -->|yes| J[Tighten tool description / add few-shot]
    I -->|no| K[Ship — gate becomes CI]
    style C fill:#ffd
    style H fill:#cfc
    style J fill:#fee
```
Figure 1 — Tool-selection eval pipeline. The labeling step is the only expensive one and it is the one teams try to skip. Do not skip it.
Step 1 — Build the Labeled Dataset
Sample 200–500 real production conversations from your tracing layer. For each turn that triggered a tool call, a domain expert labels what the tool call should have been. Each labeled example records the input plus three label fields:
```json
{
  "input": {"messages": [...], "context": {...}},
  "expected_tool": "get_order_status",
  "expected_args": {"order_id": "ORD-9182"},
  "rationale": "User has one open order; status lookup is sufficient."
}
```
We label in batches with two reviewers per example and adjudicate disagreements. The first 200 examples cost ~12 person-hours; subsequent batches go faster as the labeling guide stabilizes.
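Committing the batch to LangSmith is a few lines with the SDK. A sketch, assuming the adjudicated labels live in a local labels.jsonl in the format above (the file name and layout are our own convention, not anything LangSmith requires):

```python
import json

from langsmith import Client

client = Client()
dataset = client.create_dataset(
    dataset_name="tool-selection-v2",
    description="Labeled (input -> expected_tool, expected_args) examples",
)

# labels.jsonl: one adjudicated example per line, in the JSON format shown above.
with open("labels.jsonl") as f:
    examples = [json.loads(line) for line in f]

client.create_examples(
    inputs=[ex["input"] for ex in examples],
    outputs=[
        {"expected_tool": ex["expected_tool"], "expected_args": ex["expected_args"]}
        for ex in examples
    ],
    dataset_id=dataset.id,
)
```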
Step 2 — Deterministic Grader
```python
from langsmith import evaluate

def tool_name_match(run, example) -> dict:
    """Exact match on the first tool call's name."""
    actual = run.outputs.get("tool_calls", [{}])[0].get("name")
    expected = example.outputs["expected_tool"]
    return {"key": "tool_name_match", "score": 1.0 if actual == expected else 0.0}

def tool_args_shape_match(run, example) -> dict:
    """Same key set and same value types as the labeled args."""
    actual = run.outputs.get("tool_calls", [{}])[0].get("arguments", {})
    expected = example.outputs["expected_args"]
    same_keys = set(actual.keys()) == set(expected.keys())
    same_types = all(
        type(actual.get(k)) == type(expected.get(k)) for k in expected
    )
    return {
        "key": "tool_args_shape_match",
        "score": 1.0 if (same_keys and same_types) else 0.0,
    }
```
These two graders are deterministic and cheap. They catch hallucinated tool names (name match fails), wrong arg shape (key set mismatch), and most over-tooling cases (the model picked a different tool than the label).
Step 3 — LLM Judge for Arg Semantics
The hard case: the model picked the right tool with the right arg shape, but the value is wrong. order_id="ORD-9182" vs. order_id="ORD-9183". A deterministic comparison would catch the mismatch but cannot tell you whether "ORD-9182" is plausibly derivable from the input or a hallucination. We use a structured LLM judge on gpt-4o-2024-08-06:
```python
import json

from openai import OpenAI

openai_client = OpenAI()

JUDGE_PROMPT = """You are grading a tool call against a reference label.

User input: {input}
Expected tool: {expected_tool}
Expected args: {expected_args}
Actual args: {actual_args}

Score 1.0 if the actual args are semantically equivalent to the expected args given the user input (e.g. same order, slot, user).
Score 0.5 if plausibly derivable but not optimal.
Score 0.0 if hallucinated or contradicts the input.

Return JSON: {{"score": float, "reason": str}}"""

def tool_args_semantic_match(run, example) -> dict:
    # Fill the judge prompt from the labeled example and the agent's actual call.
    actual_args = run.outputs.get("tool_calls", [{}])[0].get("arguments", {})
    prompt = JUDGE_PROMPT.format(
        input=example.inputs,
        expected_tool=example.outputs["expected_tool"],
        expected_args=example.outputs["expected_args"],
        actual_args=actual_args,
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    parsed = json.loads(response.choices[0].message.content)
    return {"key": "tool_args_semantic", "score": parsed["score"]}

evaluate(
    agent_runner,
    data="tool-selection-v2",
    evaluators=[tool_name_match, tool_args_shape_match, tool_args_semantic_match],
    experiment_prefix="tool-sel-baseline",
)
```
We tested judge agreement against three human raters on 100 held-out examples: gpt-4o-2024-08-06 agreed with majority human at 94%, gpt-4.1-2025-04-14 at 96%. Either is fine. Pin the snapshot or your "scores improving" graph will track model drift, not your fixes.
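There is no magic in "agreed with majority human." A minimal sketch of how that number can be computed, assuming each held-out example carries three human scores and one judge score on the same 0 / 0.5 / 1.0 scale:

```python
from collections import Counter

def judge_agreement(judge_scores, human_scores):
    """judge_scores: one 0/0.5/1.0 score per example from the LLM judge.
    human_scores: the three human raters' scores per example.
    Agreement = judge matches the majority human score; ties fall back to
    the first rater's score encountered."""
    hits = 0
    for judge, humans in zip(judge_scores, human_scores):
        majority = Counter(humans).most_common(1)[0][0]
        hits += judge == majority
    return hits / len(judge_scores)

# e.g. judge_agreement([1.0, 0.5], [[1.0, 1.0, 0.5], [0.0, 0.5, 0.5]]) -> 1.0
```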
The Confusion Matrix
The single most useful artifact this pipeline produces is a per-tool confusion matrix. Rows are the labeled "expected" tool; columns are what the model actually picked. Diagonal is correct selection.
| Expected ↓ / Actual → | get_order_status | get_order_history | get_shipping_eta | (none) |
|---|---|---|---|---|
| get_order_status | 142 | 18 | 3 | 1 |
| get_order_history | 0 | 47 | 0 | 0 |
| get_shipping_eta | 9 | 1 | 22 | 0 |
| (none) | 4 | 2 | 0 | 51 |
Reading this:
- 18 cases of expected get_order_status but model picked get_order_history — over-tooling. Tool description for get_order_history is too inviting. Fix: tighten its description to specify "use only when user explicitly asks about past orders, not current status."
- 9 cases of expected get_shipping_eta but model picked get_order_status — under-tooling. The shipping ETA tool's description does not advertise its specific value. Fix: add an example.
- 4 cases of model calling get_order_status when no tool should have been called (e.g. user said "thanks!"). Fix: add a "no tool needed" few-shot to the system prompt.
Each fix gets re-evaluated against the same dataset. We aim for ≥ 0.95 row-normalized accuracy on every diagonal cell (correct selections divided by that row's total) before we ship.
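Generating the matrix is a few lines once you have one (expected, actual) pair per labeled turn. A minimal sketch with pandas; the pairs here are illustrative, not the real experiment output:

```python
import pandas as pd

# One (expected_tool, actual_tool) pair per labeled turn; "(none)" marks turns
# where no tool should be / was called. Illustrative data only.
pairs = [
    ("get_order_status", "get_order_status"),
    ("get_order_status", "get_order_history"),  # over-tooling
    ("get_shipping_eta", "get_order_status"),   # under-tooling
    ("(none)", "get_order_status"),             # tool call where none was needed
]

expected = pd.Series([p[0] for p in pairs], name="expected")
actual = pd.Series([p[1] for p in pairs], name="actual")
matrix = pd.crosstab(expected, actual)
print(matrix)

# Row-normalized diagonal = per-tool selection accuracy, the ship gate above.
per_tool_accuracy = matrix.apply(
    lambda row: row.get(row.name, 0) / row.sum(), axis=1
)
print(per_tool_accuracy)
```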
Wiring It Into CI
Once the dataset is stable, this becomes a merge gate. Our PR template requires the eval link, and CI fails any PR that drops tool-selection accuracy below threshold:
```yaml
# .github/workflows/agent-eval.yml
- name: Run tool-selection eval
  run: |
    python eval/run.py \
      --dataset tool-selection-v2 \
      --threshold-name 0.95 \
      --threshold-args-shape 0.97 \
      --threshold-args-semantic 0.93 \
      --fail-on-regression
```
The thresholds are different per metric on purpose: name-match should be near-perfect once the prompt is right; arg-shape should be perfect under strict=True schemas; arg-semantic has irreducible noise from natural-language ambiguity and we accept ~0.93.
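eval/run.py is our internal wrapper. A minimal sketch of the gating logic it implements, assuming the eval step hands back one score dict per labeled example keyed by the three metric names; run_eval here is a hypothetical wrapper around the evaluate() call from Step 3:

```python
import argparse
import sys
from statistics import mean

def gate(results, thresholds):
    """results: one dict per labeled example, e.g.
    {"tool_name_match": 1.0, "tool_args_shape_match": 1.0, "tool_args_semantic": 0.5}.
    Returns True if any mean score falls below its threshold."""
    failed = False
    for metric, threshold in thresholds.items():
        score = mean(r[metric] for r in results)
        status = "OK  " if score >= threshold else "FAIL"
        print(f"{status} {metric}: {score:.3f} (threshold {threshold})")
        failed |= score < threshold
    return failed

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--threshold-name", type=float, default=0.95)
    parser.add_argument("--threshold-args-shape", type=float, default=0.97)
    parser.add_argument("--threshold-args-semantic", type=float, default=0.93)
    parser.add_argument("--fail-on-regression", action="store_true")
    args = parser.parse_args()

    results = run_eval(args.dataset)  # hypothetical: runs evaluate() and collects scores
    thresholds = {
        "tool_name_match": args.threshold_name,
        "tool_args_shape_match": args.threshold_args_shape,
        "tool_args_semantic": args.threshold_args_semantic,
    }
    if gate(results, thresholds) and args.fail_on_regression:
        sys.exit(1)
```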
What We Caught the First Week
After we shipped this pipeline on a customer-support agent running gpt-4.1-2025-04-14:
- A tool description that started with "Use this tool to..." was being matched too aggressively — every tool started with that phrase. Rewrote all 14 descriptions to lead with the trigger (e.g. "When the user asks about a refund, call this tool with..."). Tool-name accuracy went from 0.88 to 0.96.
- One tool's arg schema had a stale optional field (legacy_account_id) that the model occasionally filled with a hallucinated value. Removed it. Arg-shape match went from 0.94 to 0.99.
- A pair of tools (search_kb and get_kb_article) had ~30% mutual confusion. Merged them into one tool with a mode parameter. Mutual confusion → 0.
Total engineering time: about three days, including the labeling. Net effect: 22% cost reduction on the agent, no change in user-visible quality, and a new permanent gate against the same class of regression.
Closing: Tool Selection Is the Quiet Bug Class
Final-answer eval is necessary and not sufficient. If your agent uses two or more tools and you have not measured tool-selection accuracy, you are almost certainly carrying a 5–15% wrong-tool rate. It is invisible until it is not — and when it surfaces, it surfaces as a cost spike, a PII incident, or a class of "weird" complaints that resist debugging because the final answer always looks reasonable.
The pipeline above takes a week to build and pays for itself the first month. If you want to see it running end to end, the interactive demo on our site has a live tool-selection eval dashboard wired to the same agents that handle real customer calls in production.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.