---
title: "Tool Selection Accuracy: The Eval Most Teams Skip — and Should Not (2026)"
description: "Your agent picked the wrong tool 12% of the time and the final answer was still right. That's a latent bug. Here's the eval pipeline that surfaces it."
canonical: https://callsphere.ai/blog/tool-selection-accuracy-agent-eval-pipeline-2026
category: "Agentic AI"
tags: ["Tool Calling", "Function Calling", "OpenAI Agents SDK", "Agent Evaluation", "LangSmith", "Production AI", "AI Engineering"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.705Z
---

# Tool Selection Accuracy: The Eval Most Teams Skip — and Should Not (2026)

> Your agent picked the wrong tool 12% of the time and the final answer was still right. That's a latent bug. Here's the eval pipeline that surfaces it.

## TL;DR

Final-answer correctness is the metric every agent team starts with. It is also the metric that hides the largest class of latent bugs in production agents: the model picked the wrong tool, the wrong tool happened to return *enough* information, and the assistant's natural-language reply was indistinguishable from a correct one. We instrumented our [voice and chat agent platform](/products) for tool-selection accuracy in late 2025 and immediately found a 12% wrong-tool rate on a production agent whose final-answer eval was 0.97. The bugs were real — duplicated database writes, occasional hallucinated facts, and a slow cost regression from over-tooling. This post is the eval pipeline that catches it: a labeled dataset of `(input → expected_tool, expected_args)`, a deterministic grader for tool name + arg shape, an LLM judge for arg semantics, and a confusion matrix you can stare at until the failure modes become obvious.

## Why Final-Answer Eval Misses This

Consider a customer-service agent with two tools:

- `get_order_status(order_id)` — returns shipping status.
- `get_order_history(user_id)` — returns the user's last 10 orders.

User says: "where's my package?" The user has exactly one open order. The model picks `get_order_history`, gets back a list of one order, extracts the status, and answers correctly. Final-answer eval: 1.0.

The cost: `get_order_history` is 3× more expensive (more rows, larger tokenized result), it leaks PII (the other 9 historical orders) into the model's context window, and on the day the user has 10 open orders the agent will pick the most recent — not the one the user is actually asking about. The bug is dormant until the data distribution shifts.

This is the canonical case for tool-selection eval. You catch it now or you catch it during an incident.
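For concreteness, here is how the two tools from this example might be declared in the OpenAI function-calling format. The names come from the example above; the descriptions and strict schemas are our own illustrative sketch, not production text:

```python
# Illustrative declarations for the two-tool example above (OpenAI
# function-calling format). Descriptions are assumptions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": (
                "When the user asks about the status or location of a "
                "current order, call this with that order's id."
            ),
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_history",
            "description": (
                "Use only when the user explicitly asks about past "
                "orders, not the status of a current one. Returns the "
                "user's last 10 orders."
            ),
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
]
```

Note how each description leads with its trigger condition rather than a generic "use this tool to..." opener; that distinction is exactly what the eval below measures.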

## The Three Failure Modes

Across the agents we have evaluated for our [healthcare](/industries) and [real-estate](/industries) deployments, tool-selection failures cluster into exactly three buckets:

| Failure mode | Frequency in our data | Final-answer eval catches it? |
| --- | --- | --- |
| **Over-tooling** — calling more tools than needed (or a more expensive variant) | 6.4% | No |
| **Wrong arg shape** — right tool, malformed arguments (extra fields, wrong types) | 3.1% | Sometimes (if it errors) |
| **Hallucinated tool name** — model invents a tool that does not exist | 1.8% | Sometimes (SDK errors) |
| **(Total wrong-tool rate)** | **~11.3%** | — |

The over-tooling case is by far the most expensive. It does not error, it does not produce wrong final answers, and it inflates token cost by 15–40% on a typical conversation. We saw a 22% cost regression on one agent before we built this eval; in retrospect, the fix was a one-line tightening of a single tool's description.
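A fix of that kind is literally a description rewrite. A hypothetical before/after (the wording here is ours, not the production prompt):

```python
# Hypothetical description-tightening fix for an over-tooled history
# lookup. The "after" text leads with the trigger condition and names
# the cheaper alternative explicitly.
BEFORE = "Use this tool to look up a user's order history."

AFTER = (
    "Only when the user explicitly asks about past orders "
    "(e.g. 'what did I order last month?'), return the user's last "
    "10 orders. For the status of a current order, prefer "
    "get_order_status."
)
```

The "before" version matches almost any order-related utterance; the "after" version gives the model a reason *not* to call it.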

## The Eval Pipeline

```mermaid
flowchart LR
  A[Production traces in LangSmith] --> B[Sample 200-500 conversations]
  B --> C[Human label: expected_tool + expected_args per turn]
  C --> D[Dataset committed to LangSmith]
  D --> E[Run agent on dataset inputs]
  E --> F[Deterministic grader: name match + arg-shape match]
  E --> G[LLM judge: arg semantic equivalence]
  F --> H[Per-tool confusion matrix]
  G --> H
  H --> I{Any tool below threshold?}
  I -->|yes| J[Tighten tool description / add few-shot]
  I -->|no| K[Ship — gate becomes CI]
  J --> E
  style C fill:#ffd
  style H fill:#cfc
  style J fill:#fee
```

*Figure 1 — Tool-selection eval pipeline. The labeling step is the only expensive one and it is the one teams try to skip. Do not skip it.*

### Step 1 — Build the Labeled Dataset

Sample 200–500 real production conversations from your tracing layer. For each turn that triggered a tool call, a domain expert labels what the tool call *should* have been. The label has three fields:

```json
{
  "input": {"messages": [...], "context": {...}},
  "expected_tool": "get_order_status",
  "expected_args": {"order_id": "ORD-9182"},
  "rationale": "User has one open order; status lookup is sufficient."
}
```

We label in batches with two reviewers per example and adjudicate disagreements. The first 200 examples cost ~12 person-hours; subsequent batches go faster as the labeling guide stabilizes.
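The adjudication bookkeeping is simple enough to sketch. Assuming each reviewer's labels arrive as a mapping from example id to `(expected_tool, expected_args)` — the field names here are ours:

```python
def adjudication_queue(labels_a: dict, labels_b: dict) -> list:
    """Return the example ids where the two reviewers disagree.

    labels_a / labels_b map example_id -> (expected_tool, expected_args).
    Disagreements go to a third adjudicator before the example is
    committed to the dataset; agreements are committed directly.
    """
    return sorted(
        ex_id
        for ex_id in labels_a.keys() & labels_b.keys()
        if labels_a[ex_id] != labels_b[ex_id]
    )
```

Tracking the disagreement rate over time is also a cheap health check on the labeling guide: if it does not fall as the guide stabilizes, the tool boundaries themselves are probably ambiguous.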

### Step 2 — Deterministic Grader

```python
from langsmith import evaluate

def _first_tool_call(run) -> dict:
    # Guard against turns where the model made no tool call at all —
    # a bare .get("tool_calls", [{}])[0] would IndexError on an empty list.
    calls = run.outputs.get("tool_calls") or [{}]
    return calls[0]

def tool_name_match(run, example) -> dict:
    actual = _first_tool_call(run).get("name")
    expected = example.outputs["expected_tool"]
    return {"key": "tool_name_match", "score": 1.0 if actual == expected else 0.0}

def tool_args_shape_match(run, example) -> dict:
    actual = _first_tool_call(run).get("arguments", {})
    expected = example.outputs["expected_args"]
    same_keys = set(actual.keys()) == set(expected.keys())
    same_types = all(
        type(actual.get(k)) is type(expected.get(k)) for k in expected
    )
    return {
        "key": "tool_args_shape_match",
        "score": 1.0 if (same_keys and same_types) else 0.0,
    }
```

These two graders are deterministic and cheap. They catch hallucinated tool names (name match fails), wrong arg shape (key set mismatch), and most over-tooling cases (the model picked a different tool than the label).

### Step 3 — LLM Judge for Arg Semantics

The hard case: the model picked the right tool with the right arg shape, but the *value* is wrong. `order_id="ORD-9182"` vs. `order_id="ORD-9183"`. A deterministic comparison would catch the mismatch but cannot tell you whether `"ORD-9182"` is *plausibly* derivable from the input or a hallucination. We use a structured LLM judge on `gpt-4o-2024-08-06`:

```python
import json

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a tool call against a reference label.

User input: {input}
Expected tool: {expected_tool}
Expected args: {expected_args}
Actual args: {actual_args}

Score 1.0 if the actual args are semantically equivalent to the expected args
given the user input (e.g. same order, slot, user). Score 0.5 if plausibly
derivable but not optimal. Score 0.0 if hallucinated or contradicts the input.

Return JSON: {{"score": float, "reason": str}}"""

def tool_args_semantic_match(run, example) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(...)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    parsed = json.loads(response.choices[0].message.content)
    return {"key": "tool_args_semantic", "score": parsed["score"]}

evaluate(
    agent_runner,
    data="tool-selection-v2",
    evaluators=[tool_name_match, tool_args_shape_match, tool_args_semantic_match],
    experiment_prefix="tool-sel-baseline",
)
```

We tested judge agreement against three human raters on 100 held-out examples: `gpt-4o-2024-08-06` agreed with majority human at 94%, `gpt-4.1-2025-04-14` at 96%. Either is fine. Pin the snapshot or your "scores improving" graph will track model drift, not your fixes.

## The Confusion Matrix

The single most useful artifact this pipeline produces is a per-tool confusion matrix. Rows are the labeled "expected" tool; columns are what the model actually picked. Diagonal is correct selection.

| Expected ↓ / Actual → | get_order_status | get_order_history | get_shipping_eta | (none) |
| --- | --- | --- | --- | --- |
| **get_order_status** | 142 | **18** | 3 | 1 |
| **get_order_history** | 0 | 47 | 0 | 0 |
| **get_shipping_eta** | **9** | 1 | 22 | 0 |
| **(none)** | 4 | 2 | 0 | 51 |

Reading this:

- **18 cases** of expected `get_order_status` but model picked `get_order_history` — over-tooling. Tool description for `get_order_history` is too inviting. Fix: tighten its description to specify "use only when user explicitly asks about *past* orders, not current status."
- **9 cases** of expected `get_shipping_eta` but model picked `get_order_status` — under-tooling. The shipping ETA tool's description does not advertise its specific value. Fix: add an example.
- **4 cases** of model calling `get_order_status` when *no tool* should have been called (e.g. user said "thanks!"). Fix: add a "no tool needed" few-shot to the system prompt.

Each fix gets re-evaluated against the same dataset. We aim for ≥ 0.95 on every diagonal cell before we ship.
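Building the matrix itself is a few lines once you have an `(expected, actual)` pair per labeled turn; a sketch, with the aggregation shape being our own choice:

```python
from collections import Counter

def confusion_matrix(pairs):
    """pairs: iterable of (expected_tool, actual_tool) per labeled turn.

    Use "(none)" for turns where no tool was expected or called.
    Returns the raw cell counts plus the per-tool diagonal accuracy
    (row-normalized), which is the number the ship gate checks.
    """
    pairs = list(pairs)  # allow generators; we iterate twice
    counts = Counter(pairs)
    row_totals = Counter(expected for expected, _ in pairs)
    accuracy = {
        tool: counts[(tool, tool)] / total
        for tool, total in row_totals.items()
    }
    return counts, accuracy
```

The off-diagonal cells with the largest counts are where to spend your prompt-engineering time; the diagonal accuracies are what goes into CI.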

## Wiring It Into CI

Once the dataset is stable, this becomes a merge gate. Our PR template requires the eval link, and CI fails any PR that drops tool-selection accuracy below threshold:

```yaml
# .github/workflows/agent-eval.yml
- name: Run tool-selection eval
  run: |
    python eval/run.py \
      --dataset tool-selection-v2 \
      --threshold-name 0.95 \
      --threshold-args-shape 0.97 \
      --threshold-args-semantic 0.93 \
      --fail-on-regression
```

The thresholds are different per metric on purpose: name-match should be near-perfect once the prompt is right; arg-shape should be perfect under `strict=True` schemas; arg-semantic has irreducible noise from natural-language ambiguity and we accept ~0.93.
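The gating logic inside `eval/run.py` can be sketched in a few lines — the metric keys mirror the grader keys above; the aggregation and return shape are our assumptions:

```python
def gate(scores: dict, thresholds: dict) -> list:
    """Return a list of failure messages; an empty list means the gate passes.

    scores / thresholds map metric key -> mean score over the dataset,
    e.g. {"tool_name_match": 0.96, "tool_args_shape_match": 0.99}.
    A missing metric is treated as 0.0 so a broken run cannot pass.
    """
    return [
        f"{metric}: {scores.get(metric, 0.0):.3f} < {minimum:.3f}"
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    ]

# In CI, a non-empty result maps to sys.exit(1), which fails the job.
```

Treating a missing metric as a failure matters more than it looks: the most common CI eval bug is a grader silently not running, which otherwise reads as a pass.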

## What We Caught the First Week

After we shipped this pipeline on a customer-support agent running `gpt-4.1-2025-04-14`:

- A tool description that started with "Use this tool to..." was being matched too aggressively — every tool started with that phrase. Rewrote all 14 descriptions to lead with the *trigger* (e.g. "When the user asks about a refund, call this tool with..."). Tool-name accuracy went from 0.88 to 0.96.
- One tool's arg schema had a stale optional field (`legacy_account_id`) that the model occasionally filled with a hallucinated value. Removed it. Arg-shape match went from 0.94 to 0.99.
- A pair of tools (`search_kb` and `get_kb_article`) had ~30% mutual confusion. Merged them into one tool with a `mode` parameter. Mutual confusion → 0.
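The merged-tool shape looks like this in the function-calling format. The merged name and parameter names are hypothetical; only `search_kb` and `get_kb_article` come from the incident above:

```python
# Hypothetical merged declaration for search_kb + get_kb_article.
# One tool, one "mode" discriminator, so the model no longer has to
# choose between two near-identical descriptions.
KB_TOOL = {
    "type": "function",
    "function": {
        "name": "kb_lookup",
        "description": (
            "Knowledge-base access. mode='search' takes a free-text "
            "query and returns matching article ids; mode='fetch' "
            "takes an article_id and returns the full article."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "mode": {"type": "string", "enum": ["search", "fetch"]},
                "query": {"type": "string"},
                "article_id": {"type": "string"},
            },
            "required": ["mode"],
        },
    },
}
```

Merging is a blunt instrument — it trades a selection decision the model was failing for an argument decision it can handle — so we reach for it only after description tightening has failed.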

Total engineering time: about three days, including the labeling. Net effect: 22% cost reduction on the agent, no change in user-visible quality, and a new permanent gate against the same class of regression.

## Closing: Tool Selection Is the Quiet Bug Class

Final-answer eval is necessary and not sufficient. If your agent uses two or more tools and you have not measured tool-selection accuracy, you are almost certainly carrying a 5–15% wrong-tool rate. It is invisible until it is not — and when it surfaces, it surfaces as a cost spike, a PII incident, or a class of "weird" complaints that resist debugging because the final answer always looks reasonable.

The pipeline above takes a week to build and pays for itself the first month. If you want to see it running end to end, the [interactive demo](/demo) on our site has a live tool-selection eval dashboard wired to the same agents that handle real customer calls in production.

---

Source: https://callsphere.ai/blog/tool-selection-accuracy-agent-eval-pipeline-2026
