Prompt Engineering for Tool-Calling Agents: 10 Patterns That Work
By Sagar Shankaran, Founder of CallSphere
Tool-calling reliability is mostly a prompt-engineering problem. These are the 2026 patterns that consistently improve function-call accuracy.
Why Prompts Decide Reliability
Frontier models are generally capable function-callers. Reliability differences between agents come mostly from prompt design, not model choice. Get the prompts right and a mid-tier model can outperform a frontier model running on sloppy prompts.
This piece is a working catalog of 10 patterns that consistently improve tool-calling accuracy.
The Patterns
1. Single-purpose function names
2. Negative criteria in descriptions
3. Parameter sourcing rules
4. Examples in schema
5. Strict types and enums
6. Validate, error, retry
7. Group related tools
8. Confirm before destructive
9. Surface tool errors clearly
10. Pin tool list to context
1. Single-Purpose Function Names
Good: book_appointment, cancel_appointment, reschedule_appointment
Bad: appointment (with mode parameter)
Single-purpose functions are easier for the model to pick correctly. Multimode functions invite mode confusion.
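A minimal sketch of the split, written as Python dicts in the name/description/parameters layout that most function-calling APIs accept. The helper make_tool, the parameter names, and the descriptions are illustrative assumptions, not definitions from a real deployment.

```python
# Hypothetical tool definitions: one tool per action, no "mode" parameter.

def make_tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one single-purpose tool definition."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

APPOINTMENT_ID = {"type": "string", "description": "ID of an existing appointment."}
PATIENT_ID = {"type": "string", "description": "ID of a verified patient."}
START_TIME = {"type": "string", "description": "ISO 8601 start time."}

TOOLS = [
    make_tool("book_appointment", "Book a new appointment.",
              {"patient_id": PATIENT_ID, "start_time": START_TIME},
              ["patient_id", "start_time"]),
    make_tool("cancel_appointment", "Cancel an existing appointment.",
              {"appointment_id": APPOINTMENT_ID},
              ["appointment_id"]),
    make_tool("reschedule_appointment", "Move an existing appointment to a new time.",
              {"appointment_id": APPOINTMENT_ID, "start_time": START_TIME},
              ["appointment_id", "start_time"]),
]
```

Each tool carries its own required list, which a single appointment tool with a mode parameter cannot express as cleanly.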
2. Negative Criteria in Descriptions
Tell the model when NOT to call:
"Use this only after verifying patient via lookup_patient_*. Do NOT use this for rescheduling — use reschedule_appointment instead."
Explicit negatives prevent overlap mistakes.
3. Parameter Sourcing Rules
Tell the model where each parameter comes from:
"patient_id: must come from lookup_patient_by_phone or similar. Do not invent."
"start_time: must be from the available_slots returned by get_available_slots."
Hallucinated IDs become rare when sourcing rules are explicit.
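Sourcing rules in the prompt can also be backed by a server-side check. The sketch below assumes the server records every ID a tool returns during a conversation and rejects arguments it has never seen. The class SourcedIds and its method names are hypothetical.

```python
from typing import Optional

class SourcedIds:
    """Tracks values returned by tools during one conversation."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = {}

    def record(self, field: str, value: str) -> None:
        """Remember a value that a tool returned for this field."""
        self._seen.setdefault(field, set()).add(value)

    def check(self, field: str, value: str) -> Optional[str]:
        """Return an error message if the value was never returned by a tool."""
        if value in self._seen.get(field, set()):
            return None
        return (f"{field} '{value}' was not returned by any earlier tool call. "
                "Do not invent it.")

# Usage: record after a lookup, check before booking.
ids = SourcedIds()
ids.record("patient_id", "p-123")      # e.g. from lookup_patient_by_phone
ok = ids.check("patient_id", "p-123")  # None: the ID was sourced
bad = ids.check("patient_id", "p-999") # error string: the ID was invented
```

The error string is written for the model to read, so it can go back and call the lookup tool.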
4. Examples in Schema
JSON Schema's examples field is read by frontier models. Include 1-2 representative examples:
"examples": [
{ "patient_id": "a1b2c3...", "start_time": "2026-04-25T10:00:00-05:00", ... }
]
Examples are often more effective than additional descriptive text.
5. Strict Types and Enums
Use enum instead of free-form strings where possible:
"appointment_type": {
"type": "string",
"enum": ["new_patient", "follow_up", "emergency", "consultation"]
}
An enum constrains the output to valid values and reduces hallucinated types.
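One way to keep the schema and the server-side check in sync is to derive both from a single Python Enum. This is a sketch only; AppointmentType, enum_schema, and parse_appointment_type are names chosen for illustration.

```python
from enum import Enum

class AppointmentType(str, Enum):
    NEW_PATIENT = "new_patient"
    FOLLOW_UP = "follow_up"
    EMERGENCY = "emergency"
    CONSULTATION = "consultation"

def enum_schema(enum_cls: type[Enum]) -> dict:
    """Build the JSON Schema fragment for a string enum."""
    return {"type": "string", "enum": [member.value for member in enum_cls]}

def parse_appointment_type(raw: str) -> AppointmentType:
    """Convert a model-supplied string; raises ValueError for unknown values."""
    return AppointmentType(raw)
```

Adding a new appointment type then means changing one class, and the schema sent to the model updates with it.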
6. Validate, Error, Retry
Validate every tool call server-side. On failure, return a structured error the LLM can read:
{ "error": "patient_id is invalid: a1b2c3 is not a valid UUID. Did you mean to call lookup_patient first?" }
Specific error messages let the LLM correct itself in one retry.
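A minimal sketch of the validate, error, retry loop, assuming patient IDs are UUIDs as in the error message above. The function propose_args is a stand-in for a real model call, and both function names are hypothetical.

```python
import uuid
from typing import Callable, Optional

def validate_patient_id(args: dict) -> Optional[dict]:
    """Return a structured error dict, or None if patient_id is a valid UUID."""
    raw = args.get("patient_id")
    if not isinstance(raw, str):
        return {"error": "patient_id is missing. Did you mean to call lookup_patient first?"}
    try:
        uuid.UUID(raw)
    except ValueError:
        return {"error": f"patient_id is invalid: {raw} is not a valid UUID. "
                         "Did you mean to call lookup_patient first?"}
    return None

def call_with_retry(propose_args: Callable[[Optional[dict]], dict],
                    max_attempts: int = 2):
    """Ask for arguments, feeding any validation error back in.

    propose_args receives the previous error (None on the first attempt)
    and returns new arguments. Returns (args, None) on success or
    (None, last_error) once the attempts are used up.
    """
    error = None
    for _ in range(max_attempts):
        args = propose_args(error)
        error = validate_patient_id(args)
        if error is None:
            return args, None
    return None, error
```

Capping the attempts keeps a model that cannot fix its arguments from looping on the same error.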
7. Group Related Tools
For agents with many tools, group them:
"appointment_tools": [book_appointment, cancel_appointment, reschedule_appointment]
"patient_tools": [lookup_patient_by_phone, lookup_patient_by_id, create_patient]
Grouping helps the model navigate large tool catalogs.
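One possible way to present the groups is to render them as a short index in the system prompt. The sketch below assumes plain-text rendering; render_tool_groups is a hypothetical helper, and how groups are best surfaced will depend on the model and API in use.

```python
TOOL_GROUPS = {
    "appointment_tools": ["book_appointment", "cancel_appointment", "reschedule_appointment"],
    "patient_tools": ["lookup_patient_by_phone", "lookup_patient_by_id", "create_patient"],
}

def render_tool_groups(groups: dict[str, list[str]]) -> str:
    """Render one line per group for inclusion in a system prompt."""
    return "\n".join(f"{group}: {', '.join(tools)}" for group, tools in groups.items())

prompt_section = render_tool_groups(TOOL_GROUPS)
```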
8. Confirm Before Destructive
For irreversible actions (cancel, delete, send money):
System prompt: "For cancel, refund, and payment actions, always confirm with the user before calling the tool."
This adds a safety check and reduces costly mistakes.
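The prompt instruction can be paired with a server-side guard so that a destructive call without confirmation is refused rather than executed. This is a sketch under assumptions: refund_payment and the user_confirmed flag are hypothetical, and how confirmation is tracked depends on your application.

```python
# Tools that must never run without explicit user confirmation.
DESTRUCTIVE_TOOLS = {"cancel_appointment", "refund_payment"}  # refund_payment is hypothetical

def guard_tool_call(tool_name: str, user_confirmed: bool) -> dict:
    """Allow the call, or return an error the model can act on."""
    if tool_name in DESTRUCTIVE_TOOLS and not user_confirmed:
        return {
            "allowed": False,
            "error": f"{tool_name} requires explicit user confirmation. "
                     "Ask the user to confirm, then call again.",
        }
    return {"allowed": True}
```

The guard covers the case where the model skips the confirmation step despite the system prompt.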
9. Surface Tool Errors Clearly
When a tool errors, do not have the bot say "something went wrong." Have it say what went wrong and what the user can do:
"I couldn't find an available slot at that time. The next available slots are 2pm or 4pm."
10. Pin Tool List to Context
Don't change the available tool list mid-conversation if you can avoid it. Stable tool lists improve cache hit rates and reduce model confusion.
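A simple way to notice accidental changes is to fingerprint the serialized tool list and compare it across turns. The sketch below assumes tool definitions are JSON-serializable dicts; tool_list_fingerprint is a hypothetical helper.

```python
import hashlib
import json

def tool_list_fingerprint(tools: list[dict]) -> str:
    """Hash a canonical serialization of the tool list.

    Keys inside each tool are sorted so dict ordering does not matter,
    but the order of tools in the list is preserved on purpose: a
    reordered list is a different prompt.
    """
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging this value once per turn makes it easy to spot a conversation where the tool list drifted.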
Other Patterns Worth Knowing
- Use specific verbs in tool names ("schedule" vs "do")
- Order parameters from most-required to most-optional
- Document return shape in the description
- Indicate side effects ("this sends an email")
- Specify timezone handling if relevant
What Goes Wrong Without These
Without these patterns:
- Wrong tool selected
- Hallucinated IDs
- Loops on errors
- Destructive mistakes
- Cache misses inflate cost
Each is preventable with deliberate prompt design.
Test Coverage
Every tool you ship should have unit tests:
- Successful call with normal inputs
- Failure with bad inputs
- Edge cases the schema does not cover
- Long-tail valid inputs
The tests catch regressions when prompts or tool definitions change.
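The four cases above can be sketched with the standard unittest module. The validator validate_booking is a hypothetical stand-in for a real tool's server-side check; the missing timezone offset is an example of an edge case that a plain string type in the schema does not catch.

```python
import unittest
from datetime import datetime

VALID_TYPES = {"new_patient", "follow_up", "emergency", "consultation"}

def validate_booking(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    problems = []
    if args.get("appointment_type") not in VALID_TYPES:
        problems.append("appointment_type is not one of the allowed values")
    try:
        start = datetime.fromisoformat(args.get("start_time", ""))
    except (TypeError, ValueError):
        problems.append("start_time is not ISO 8601")
    else:
        if start.tzinfo is None:
            problems.append("start_time has no timezone offset")
    return problems

class BookAppointmentTests(unittest.TestCase):
    def test_normal_inputs(self):
        args = {"appointment_type": "follow_up", "start_time": "2026-04-25T10:00:00-05:00"}
        self.assertEqual(validate_booking(args), [])

    def test_bad_inputs(self):
        args = {"appointment_type": "checkup", "start_time": "tomorrow"}
        self.assertEqual(len(validate_booking(args)), 2)

    def test_edge_case_missing_offset(self):
        args = {"appointment_type": "follow_up", "start_time": "2026-04-25T10:00:00"}
        self.assertEqual(validate_booking(args), ["start_time has no timezone offset"])

    def test_long_tail_valid_input(self):
        # Half-hour offsets are valid but easy to forget.
        args = {"appointment_type": "emergency", "start_time": "2026-04-25T23:59:59+05:30"}
        self.assertEqual(validate_booking(args), [])
```

These tests cover the validator only; checking that the model picks the right tool needs a separate evaluation set.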
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.