By Sagar Shankaran, Founder of CallSphere
A bad tool description costs you 30%+ in argument accuracy. We share the contract-style template — purpose, when-to-use, when-NOT-to-use, parameter constraints, two examples — that lifts CallSphere's Healthcare 14-tool stack to 96% function-arg accuracy across GPT-4o, Claude, and Gemini.
Key takeaways
TL;DR — Research from Gorilla and ToolAlpaca shows precise tool descriptions improve parameter accuracy 30%+ versus terse ones. The fix: write each description as a contract — purpose, when to use, when NOT to use, parameter formats with examples, expected return shape — and never reuse a description across two tools.
A reliable tool description has six parts:
Best-in-class tool descriptions read like API docs written for a junior engineer who must pick the right call from 14 candidates inside 200ms.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
LLMs choose tools by computing similarity between the user turn and each tool's description. When two descriptions both contain "creates an appointment" and only diverge in fine print, the model's softmax flattens and you get the wrong tool 10–20% of the time. Adding a when NOT to use clause forces the descriptions apart in embedding space and gives the model a hard negative signal.
Modern LLMs in 2026 (GPT-4o, Claude 4.x, Gemini 2.5) are explicitly trained to parse description fields with structured headings. They reward you for treating the description like a contract and they punish ambiguity.
flowchart TD
USER[User turn] --> EMB[Embed turn]
EMB --> SIM{Similarity to each tool desc}
SIM -->|tight desc| PICK[Correct tool 96%]
SIM -->|vague desc| WRONG[Wrong tool 18%]
PICK --> ARGS[Fill args from contract]
ARGS --> CALL[Tool call]
CallSphere's Healthcare 14-tool system prompt uses this exact template per tool. The OneRoof real-estate vertical's Triage Aria routes across 10 specialist agents by reading their routing_hint field — the same description pattern. Salon's narrow 6-tool stack uses two-shot examples per description because the call space is small enough to demonstrate every common path. We measure tool accuracy nightly across all 37 agents, 90+ tools, 115+ DB tables, 6 verticals.
Pricing: Starter $149, Growth $499, Scale $1,499. 14-day trial + 22% affiliate. Build your own tool stack via the Build Your Agent flow.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
{
"name": "book_appointment",
"description": "Books a clinical appointment for an EXISTING patient.\n\nWHEN TO USE: caller has confirmed (1) a provider name from list_providers, (2) an ISO-8601 start_time, (3) an existing patient_id from lookup_patient.\n\nWHEN NOT TO USE:\n- Caller has not been verified — call lookup_patient first.\n- Provider unknown — call list_providers first.\n- Time is vague (\"sometime next week\") — clarify with caller first.\n\nEXAMPLES:\n1) Happy: book_appointment(patient_id='p_91', provider='dr_patel', start_time='2026-05-08T15:00-04:00')\n2) Reschedule: cancel_appointment then book_appointment.",
"parameters": {
"type": "object",
"additionalProperties": false,
"required": ["patient_id","provider","start_time"],
"properties": {
"patient_id": {"type":"string","pattern":"^p_[a-z0-9]{2,}$","description":"From lookup_patient"},
"provider": {"type":"string","description":"Slug from list_providers, e.g. dr_patel"},
"start_time": {"type":"string","format":"date-time","description":"ISO-8601 with TZ offset, never naive"}
}
}
}
Q: How long should a description be? 80–250 tokens. Below 50 you lose disambiguation; above 300 you crowd context.
Q: Should I include error cases? Yes — one line. "Returns 409 if the slot is taken; agent should call list_open_slots."
Q: Does this work for Claude and Gemini too?
Yes. The contract style is provider-agnostic. Claude additionally benefits from wrapping the when-NOT section in <not_for> tags (see post 7).
Q: What about strict-mode JSON schemas? Use them — see post 3. Strict mode enforces the contract at the API layer.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
GPT-Realtime-2 brings GPT-5-class reasoning into voice. What that means for tool-call reliability, structured output, and production agent design.
The public MCP registry crossed 9,400 servers in April 2026. Here is a curated walkthrough of the SaaS MCP servers CallSphere mounts in production, with OAuth 2.1 PKCE patterns.
Your agent picked the wrong tool 12% of the time and the final answer was still right. That's a latent bug. Here's the eval pipeline that surfaces it.
OpenAI's parallel function calling can cut latency in half — or burn money on dependent calls. The architecture, code, and an eval that proves the win.
Neo4j's agent-memory project ships short-term, long-term, and reasoning memory in one graph. Microsoft Agent Framework and LangChain both wire it in. Here is the production pattern.
AI SDK 5 ships fully typed chat for React, Svelte, Vue, and Angular plus first-class agent loop primitives. Here are the patterns that matter for shipping in 2026.
© 2026 CallSphere LLC. All rights reserved.