
Dynamic Tool Selection: AI Agents That Choose Tools Based on Context

Learn how AI agents select the right tool from a large toolset. Covers tool routing strategies, writing descriptions that guide selection, handling the too-many-tools problem, and building intelligent tool dispatchers.

The Tool Selection Problem

When an agent has 3 tools, the LLM picks the right one almost every time. At 10 tools, accuracy starts to decline. At 50+ tools, the model frequently picks the wrong tool, hallucinates parameters, or calls tools that are irrelevant to the task. This is the too-many-tools problem, and solving it is essential for building agents that work with large toolsets.

The fundamental insight is that tool selection is a search problem. The LLM needs enough information to discriminate between tools, but not so much that it is overwhelmed.

How LLMs Select Tools

When you provide tools to an LLM, the model uses three signals to decide which tool to call:

  1. The tool name — semantic meaning extracted from the function name
  2. The tool description — the primary source of selection guidance
  3. The parameter schema — structural hints about what data the tool expects

The description is by far the most important. A good description acts as a routing instruction.
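To make the three signals concrete, here is what a single tool definition might look like in the OpenAI Chat Completions tool format. The order-status tool itself is a hypothetical example:

```python
# One tool definition in the OpenAI Chat Completions tool format.
# All three selection signals are visible: the name, the description,
# and the parameter schema.
get_order_status_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",  # signal 1: semantic name
        "description": (             # signal 2: routing guidance
            "Look up the shipping status of a single order by its ID. "
            "Use when the user asks where an order is. "
            "Do NOT use for refunds or order changes."
        ),
        "parameters": {              # signal 3: structural hints
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order identifier, e.g. 'ORD-12345'",
                },
            },
            "required": ["order_id"],
        },
    },
}
```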

Writing Descriptions That Discriminate

Each tool description should answer three questions: what the tool does, when it should be used, and when a different tool should be used instead.

# Bad: overlapping, ambiguous descriptions
tools_bad = [
    {"name": "search", "description": "Search for information"},
    {"name": "lookup", "description": "Look up data"},
    {"name": "find", "description": "Find results"},
]

# Good: clear boundaries between tools
tools_good = [
    {
        "name": "search_web",
        "description": "Search the public internet for current information. Use for recent events, general knowledge, or topics not in our internal database. Do NOT use for internal company data."
    },
    {
        "name": "search_knowledge_base",
        "description": "Search the internal company knowledge base for policies, procedures, and documentation. Use for company-specific questions. Do NOT use for general internet searches."
    },
    {
        "name": "search_customer_db",
        "description": "Look up a specific customer by name, email, or ID in the customer database. Use when the user asks about a specific customer's account, orders, or status. Requires at least one identifier."
    },
]

The "Do NOT use for" clause is surprisingly effective. It gives the LLM a negative signal that prevents common misrouting.
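One way to catch overlapping descriptions before they cause misrouting is a lint pass over the toolset. The helper below is a rough heuristic sketch, not a standard API — it flags pairs of descriptions that share most of their words:

```python
def find_overlapping_descriptions(
    tools: list[dict], threshold: float = 0.5
) -> list[tuple[str, str]]:
    """Flag tool pairs whose descriptions share more than
    `threshold` of their words (Jaccard similarity)."""
    flagged = []
    for i, a in enumerate(tools):
        for b in tools[i + 1:]:
            words_a = set(a["description"].lower().split())
            words_b = set(b["description"].lower().split())
            overlap = len(words_a & words_b) / len(words_a | words_b)
            if overlap > threshold:
                flagged.append((a["name"], b["name"]))
    return flagged

# Two near-identical descriptions get flagged; the distinct one does not.
pairs = find_overlapping_descriptions([
    {"name": "search_web", "description": "Search the web for information"},
    {"name": "search_net", "description": "Search the internet for information"},
    {"name": "send_email", "description": "Send an email to a recipient"},
])
# pairs == [("search_web", "search_net")]
```

Word overlap is a crude proxy — the embedding-based approach described later catches semantic overlap too — but it is cheap enough to run in CI on every toolset change.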

Strategy 1: Tool Categories with Pre-Routing

For large toolsets, pre-filter tools based on the conversation context before passing them to the LLM:


from dataclasses import dataclass

@dataclass
class ToolCategory:
    name: str
    description: str
    keywords: list[str]
    tools: list[dict]

class ToolRouter:
    def __init__(self):
        self.categories: list[ToolCategory] = []

    def add_category(self, category: ToolCategory):
        self.categories.append(category)

    def select_tools(self, user_message: str, max_tools: int = 10) -> list[dict]:
        message_lower = user_message.lower()
        scored_categories = []

        for category in self.categories:
            score = sum(
                1 for kw in category.keywords
                if kw.lower() in message_lower
            )
            if score > 0:
                scored_categories.append((score, category))

        scored_categories.sort(key=lambda x: x[0], reverse=True)

        selected_tools = []
        for _, category in scored_categories:
            for tool in category.tools:
                if len(selected_tools) < max_tools:
                    selected_tools.append(tool)

        # Fallback: if no category matched, default to the first category's tools
        if not selected_tools:
            return self.categories[0].tools[:max_tools]

        return selected_tools

Usage:

router = ToolRouter()

router.add_category(ToolCategory(
    name="data_analysis",
    description="Tools for querying and analyzing data",
    keywords=["data", "query", "sql", "analyze", "statistics", "count", "average"],
    tools=[query_db_tool, chart_tool, export_csv_tool],
))

router.add_category(ToolCategory(
    name="communication",
    description="Tools for sending messages and notifications",
    keywords=["send", "email", "message", "notify", "slack", "alert"],
    tools=[send_email_tool, slack_tool, sms_tool],
))

# At runtime, only pass relevant tools to the LLM
relevant_tools = router.select_tools(user_message, max_tools=8)
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=relevant_tools,
)

Strategy 2: Two-Stage Tool Selection

For very large toolsets (50+ tools), use a two-stage approach where the first LLM call selects the tool category, and the second call uses only tools from that category:

async def two_stage_tool_selection(user_message: str, all_categories: list[ToolCategory]):
    # Stage 1: Ask LLM to pick the right category
    category_descriptions = "\n".join(
        f"- {cat.name}: {cat.description}"
        for cat in all_categories
    )

    stage1_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for routing
        messages=[
            {"role": "system", "content": f"Select the tool category most relevant to the user's request. Available categories:\n{category_descriptions}\n\nRespond with only the category name."},
            {"role": "user", "content": user_message},
        ],
    )

    selected_name = stage1_response.choices[0].message.content.strip()

    # Stage 2: Run agent with only tools from selected category
    selected_category = next(
        (cat for cat in all_categories if cat.name == selected_name),
        all_categories[0]
    )

    return await run_agent(
        user_message,
        tools=selected_category.tools,
        system_prompt="You are a helpful assistant.",
    )

Using a cheaper model (GPT-4o-mini) for routing keeps costs low while ensuring the main agent only sees relevant tools.

Strategy 3: Embedding-Based Tool Selection

For the most sophisticated approach, use embeddings to match user intent to tool descriptions:

import numpy as np

class EmbeddingToolSelector:
    def __init__(self, tools: list[dict]):
        self.tools = tools
        self.embeddings = None

    async def build_index(self):
        descriptions = [
            f"{t['function']['name']}: {t['function']['description']}"
            for t in self.tools
        ]
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=descriptions,
        )
        self.embeddings = np.array([e.embedding for e in response.data])

    async def select(self, query: str, top_k: int = 5) -> list[dict]:
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=[query],
        )
        query_embedding = np.array(response.data[0].embedding)

        # text-embedding-3-small returns unit-normalized vectors,
        # so a dot product is equivalent to cosine similarity
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        return [self.tools[i] for i in top_indices]

This approach scales to hundreds of tools and handles semantic matching — "show me revenue numbers" correctly routes to the database query tool even without the word "query" appearing.
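The ranking step itself is just a dot product followed by a top-k sort, and can be checked in isolation with toy vectors (the 3-dimensional values below are hypothetical stand-ins for real embeddings):

```python
import numpy as np

# Toy 3-dimensional "embeddings" for three tools. Real embeddings from
# text-embedding-3-small have 1536 dimensions and are unit-normalized,
# so a dot product equals cosine similarity.
tool_embeddings = np.array([
    [1.0, 0.0, 0.0],   # tool 0: e.g. database queries
    [0.0, 1.0, 0.0],   # tool 1: e.g. email sending
    [0.6, 0.8, 0.0],   # tool 2: a mix of both
])
query_embedding = np.array([0.9, 0.1, 0.0])  # query mostly about data

similarities = tool_embeddings @ query_embedding
top_indices = np.argsort(similarities)[-2:][::-1]  # top_k = 2
# similarities ≈ [0.9, 0.1, 0.62], so top_indices == [0, 2]
```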

FAQ

What is the maximum number of tools I should give an LLM at once?

Empirically, most models handle 10-15 tools well. Beyond 20, selection accuracy degrades noticeably. If you have more than 20 tools, use one of the pre-routing strategies described above to narrow the active toolset per conversation turn.

How do I debug tool selection mistakes?

Log the tool calls the LLM makes alongside the user message. Look for patterns: does the model confuse two specific tools? Add "Do NOT use for" clauses to their descriptions. Does it pick the right tool but with wrong parameters? The parameter descriptions need improvement. Track selection accuracy as a metric over time.

Should I fine-tune a model for tool selection?

Only as a last resort. For most applications, better tool descriptions, pre-routing, and the two-stage approach solve selection problems without fine-tuning. Fine-tuning makes sense when you have a very large, domain-specific toolset and can generate training data from production logs.


#ToolSelection #AgentArchitecture #FunctionCalling #AIAgents #AgenticAI #LearnAI #AIEngineering
