
Tool Search and Deferred Loading for Large Tool Sets

Learn how to use ToolSearchTool and defer_loading in the OpenAI Agents SDK to manage large tool inventories, reduce token usage, and dynamically load tools only when needed.

The Problem with Large Tool Sets

Every tool you give an agent costs tokens. The tool's name, description, and full parameter schema are serialized into the system prompt on every model call. With 5 tools, this overhead is negligible. With 50, it consumes thousands of tokens per turn. With 200 — a realistic number for enterprise agents connecting to multiple APIs — you are burning a significant portion of your context window before the conversation even starts.

Beyond token cost, large tool sets degrade model performance. Research consistently shows that models make better tool-calling decisions when presented with fewer, more relevant options. An agent with 200 tools in its prompt will make more selection errors than the same agent with 10 well-matched tools.

The OpenAI Agents SDK addresses this with two complementary features: ToolSearchTool for dynamic tool discovery and defer_loading for lazy schema loading.

How ToolSearchTool Works

ToolSearchTool is a meta-tool — a tool whose job is to find other tools. Instead of loading all tools into the agent's context, you load only the ToolSearchTool. When the agent needs a capability, it searches for relevant tools by description, loads them dynamically, and then calls them.


Here is the basic setup:

from agents import Agent, Runner, function_tool
from agents.tools import ToolSearchTool

# Define a large set of tools
@function_tool
def get_user_profile(user_id: str) -> dict:
    """Retrieve a user's profile including name, email, and preferences."""
    return {"user_id": user_id, "name": "Alice", "email": "[email protected]"}

@function_tool
def update_user_email(user_id: str, new_email: str) -> dict:
    """Update the email address for a given user."""
    return {"status": "updated", "user_id": user_id, "email": new_email}

@function_tool
def list_invoices(user_id: str, status: str = "all") -> list:
    """List all invoices for a user, optionally filtered by status."""
    return [{"invoice_id": "INV-001", "amount": 99.99, "status": "paid"}]

@function_tool
def create_support_ticket(user_id: str, subject: str, description: str) -> dict:
    """Create a new support ticket for a user."""
    return {"ticket_id": "TKT-42", "status": "open"}

@function_tool
def get_system_health() -> dict:
    """Check the health status of all backend services."""
    return {"api": "healthy", "database": "healthy", "cache": "healthy"}

# Create a searchable tool registry
all_tools = [
    get_user_profile, update_user_email,
    list_invoices, create_support_ticket,
    get_system_health,
]

agent = Agent(
    name="SupportAgent",
    instructions="""You are a customer support agent. Use tool_search
    to find the right tool for each task. Search by describing what
    you need to do.""",
    tools=[
        ToolSearchTool(tools=all_tools),
    ],
    model="gpt-4o",
)

When the agent runs, its initial context only contains the schema for ToolSearchTool — a single lightweight tool. When it needs to, say, look up a user profile, it calls tool_search with a query like "retrieve user profile information." The ToolSearchTool searches the registry by matching the query against tool names and descriptions, then returns the matching tool schemas. On the next turn, the agent can call the discovered tool directly.
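The search step can be pictured as keyword matching over the registry. The sketch below is not the SDK's internal algorithm; it is a minimal illustration of ranking tools by term overlap between the query and each tool's name and description (ToolInfo, REGISTRY, and search_tools are hypothetical names invented here):

```python
from dataclasses import dataclass

# Hypothetical sketch -- NOT the SDK's internals. Ranks registered tools
# by keyword overlap between the query and each tool's name + description.

@dataclass
class ToolInfo:
    name: str
    description: str

REGISTRY = [
    ToolInfo("get_user_profile",
             "Retrieve a user's profile including name, email, and preferences."),
    ToolInfo("list_invoices",
             "List all invoices for a user, optionally filtered by status."),
    ToolInfo("get_system_health",
             "Check the health status of all backend services."),
]

def search_tools(query: str, max_results: int = 5) -> list[str]:
    """Return tool names ranked by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for tool in REGISTRY:
        text = f"{tool.name.replace('_', ' ')} {tool.description}".lower()
        overlap = len(query_words & set(text.split()))
        if overlap:
            scored.append((overlap, tool.name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:max_results]]

print(search_tools("retrieve user profile information"))  # ['get_user_profile']
```

The real search is presumably more sophisticated, but the contract is the same: a natural-language query in, a short ranked list of matching tool schemas out.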

Deferred Loading with defer_loading=True

While ToolSearchTool handles the discovery pattern, defer_loading controls how individual tools are represented in the agent's context. When you set defer_loading=True on a tool, the SDK includes only the tool's name in the prompt — no description, no parameter schema. The full schema is loaded only when the agent decides to search for or call that tool.

from agents import Agent, function_tool
from agents.tools import ToolSearchTool

@function_tool(defer_loading=True)
def complex_data_migration(
    source_table: str,
    target_table: str,
    filter_column: str,
    filter_value: str,
    batch_size: int = 1000,
    dry_run: bool = True,
) -> dict:
    """Execute a filtered data migration between database tables
    with batching and optional dry-run mode."""
    return {"status": "migrated", "rows": 5000}

@function_tool(defer_loading=True)
def generate_analytics_report(
    report_type: str,
    date_from: str,
    date_to: str,
    dimensions: list[str] | None = None,
    metrics: list[str] | None = None,
    format: str = "json",
) -> dict:
    """Generate an analytics report with customizable dimensions,
    metrics, and output format."""
    return {"report_id": "RPT-001", "status": "generated"}

These tools have complex schemas with multiple parameters. Without deferred loading, each would consume hundreds of tokens in every prompt. With defer_loading=True, they appear as just their name until the agent needs them.
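To see why the name-only form is so much cheaper, compare the serialized size of a full tool schema against the deferred representation. The JSON below is a hand-written approximation of what generate_analytics_report might serialize to (not actual SDK output), and the 4-characters-per-token heuristic is a deliberately crude assumption:

```python
import json

# Rough illustration (not SDK output): a hand-written JSON schema for
# generate_analytics_report vs. the name-only deferred form, compared
# with a crude ~4-characters-per-token heuristic.

full_schema = {
    "name": "generate_analytics_report",
    "description": ("Generate an analytics report with customizable "
                    "dimensions, metrics, and output format."),
    "parameters": {
        "type": "object",
        "properties": {
            "report_type": {"type": "string"},
            "date_from": {"type": "string"},
            "date_to": {"type": "string"},
            "dimensions": {"type": "array", "items": {"type": "string"}},
            "metrics": {"type": "array", "items": {"type": "string"}},
            "format": {"type": "string", "default": "json"},
        },
        "required": ["report_type", "date_from", "date_to"],
    },
}

deferred_form = {"name": "generate_analytics_report"}

def approx_tokens(obj) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return len(json.dumps(obj)) // 4

print(f"full schema: ~{approx_tokens(full_schema)} tokens, "
      f"deferred: ~{approx_tokens(deferred_form)} tokens")
```

Even with this crude estimate, the full schema costs an order of magnitude more than the bare name, and that cost is paid on every model call.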


Combining ToolSearchTool with Deferred Loading

The most powerful pattern combines both features. ToolSearchTool discovers the right tools by semantic search, and deferred loading ensures those tools do not inflate every prompt:

from agents import Agent, Runner, function_tool
from agents.tools import ToolSearchTool
import asyncio

# Imagine 50+ tools across multiple domains
user_tools = [get_user_profile, update_user_email]  # ... many more
billing_tools = [list_invoices]  # ... many more
support_tools = [create_support_ticket]  # ... many more
admin_tools = [get_system_health]  # ... many more

all_tools = user_tools + billing_tools + support_tools + admin_tools

# All tools are deferred — minimal token cost
for tool in all_tools:
    tool.defer_loading = True

agent = Agent(
    name="UnifiedAgent",
    instructions="""You have access to a large set of tools through
    tool_search. Always search for the right tool before attempting
    an action. If no tool matches, tell the user what you cannot do.""",
    tools=[
        ToolSearchTool(tools=all_tools),
    ],
    model="gpt-4o",
)

async def main():
    result = await Runner.run(
        agent,
        input="Show me the latest invoices for user U-123",
    )
    print(result.final_output)

asyncio.run(main())

The agent's first model call sees only the tool_search schema. It searches for "invoices for user," discovers list_invoices, and on the next turn has the full schema available to call it with the correct parameters.

Token Impact Analysis

To understand the savings, consider a concrete example. A typical function tool with 5 parameters, a docstring, and type annotations consumes roughly 150-250 tokens in the prompt. With 100 such tools:

  • Without optimization: 15,000-25,000 tokens per model call just for tool schemas
  • With defer_loading only: ~2,000 tokens (just names)
  • With ToolSearchTool: ~200 tokens (just the search tool schema)
  • With both combined: ~200 tokens base, plus ~200-400 tokens for each discovered tool on subsequent turns

For agents that run multi-turn conversations, the cumulative savings are substantial. A 10-turn conversation with 100 tools saves roughly 150,000-250,000 tokens in total, depending on where each schema falls in the 150-250 token range.
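The arithmetic behind this estimate is easy to reproduce. All constants below are the assumed averages from the text, not measured values:

```python
# Back-of-the-envelope check of the savings above. All constants are the
# assumed averages from the text, not measured values.
TOOLS = 100
TURNS = 10
SEARCH_TOOL_TOKENS = 200   # approximate cost of the tool_search schema alone

def savings_over_run(tokens_per_schema: int) -> int:
    baseline = TOOLS * tokens_per_schema   # all schemas serialized every call
    optimized = SEARCH_TOOL_TOKENS         # only the search tool's schema
    return (baseline - optimized) * TURNS

low, high = savings_over_run(150), savings_over_run(250)
print(f"{low:,} to {high:,} tokens saved over {TURNS} turns")  # 148,000 to 248,000
```

This ignores the incremental cost of loading discovered tool schemas on later turns, which shaves a few hundred tokens per discovered tool off the savings but does not change the order of magnitude.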

Best Practices for Tool Search

Write descriptive tool docstrings. The search matches against names and descriptions. A tool called process_data with no docstring is nearly impossible to find by semantic search. Write explicit, keyword-rich descriptions.
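As a hypothetical contrast, compare a tool the search can barely match against one with a specific name and a keyword-rich docstring (both functions are invented for illustration):

```python
# Hypothetical contrast (both functions invented for illustration).

# Hard to discover: generic name, no docstring for the search to match.
def process_data(data: dict) -> dict:
    return data

# Easy to discover: specific name plus a keyword-rich description.
def reconcile_invoice_payments(invoice_id: str) -> dict:
    """Reconcile recorded payments against an invoice, flagging
    underpayments, overpayments, and duplicate transactions."""
    return {"invoice_id": invoice_id, "discrepancies": []}
```

A query like "find duplicate payments on an invoice" has several words to match in the second tool and none in the first.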

Group related tools logically. If you have 20 billing tools and 20 user management tools, the search performs better when tools in the same domain use consistent terminology.

Set search result limits. Configure how many tools the search returns to avoid flooding the agent's context with irrelevant matches:

search_tool = ToolSearchTool(
    tools=all_tools,
    max_results=5,
)

Cache discovered tools across turns. The SDK handles this internally — once a tool is discovered and loaded in a run, it stays available for subsequent turns without another search.

Fall back gracefully. Instruct the agent to report when it cannot find a matching tool rather than hallucinating a tool call. This prevents confusing errors and gives the user actionable feedback.

Key Takeaways

Tool search and deferred loading are essential patterns for any agent that operates across multiple domains or integrates with large API surfaces. They keep your token costs predictable, your model performance high, and your agent architecture scalable.


Written by

CallSphere Team
