Prompt Engineering for Tool-Calling Agents: 10 Patterns That Work
Tool-calling reliability is mostly a prompt-engineering problem. These are the patterns that, as of 2026, consistently improve function-call accuracy.
Why Prompts Decide Reliability
Frontier models are all generally capable function-callers. In practice, reliability differences between agents come more from prompt design than from model choice: get the prompts right and a mid-tier model can outperform a frontier model saddled with sloppy prompts.
This piece is a working catalog of 10 patterns that consistently improve tool-calling accuracy.
The Patterns
flowchart TB
P[Patterns] --> P1[1. Single-purpose function names]
P --> P2[2. Negative criteria in descriptions]
P --> P3[3. Parameter sourcing rules]
P --> P4[4. Examples in schema]
P --> P5[5. Strict types and enums]
P --> P6[6. Validate, error, retry]
P --> P7[7. Group related tools]
P --> P8[8. Confirm before destructive]
P --> P9[9. Surface tool errors clearly]
P --> P10[10. Pin tool list to context]
1. Single-Purpose Function Names
Good: book_appointment, cancel_appointment, reschedule_appointment
Bad: appointment (with mode parameter)
Single-purpose functions are easier for the model to pick correctly. Multimode functions invite mode confusion.
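As tool definitions, the difference looks something like this. A minimal sketch in the OpenAI-style function format, schemas abbreviated:

# Good: one tool per action. Tool choice reduces to name matching.
good_tools = [
    {"name": "book_appointment", "description": "Book a new appointment."},
    {"name": "cancel_appointment", "description": "Cancel an existing appointment."},
    {"name": "reschedule_appointment", "description": "Move an appointment to a new slot."},
]

# Bad: one tool, three behaviors. The model now has to get the tool
# AND the mode right, which doubles the ways a call can go wrong.
bad_tool = {
    "name": "appointment",
    "description": "Book, cancel, or reschedule an appointment.",
    "parameters": {
        "type": "object",
        "properties": {"mode": {"type": "string",
                                "enum": ["book", "cancel", "reschedule"]}},
    },
}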
2. Negative Criteria in Descriptions
Tell the model when NOT to call:
"Use this only after verifying patient via lookup_patient_*. Do NOT use this for rescheduling — use reschedule_appointment instead."
Explicit negatives prevent overlap mistakes.
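The negative criteria live in the tool's top-level description field. A sketch, with the field layout following the common OpenAI/Anthropic shape and parameters omitted:

book_appointment = {
    "name": "book_appointment",
    "description": (
        "Book a new appointment for a verified patient. "
        # The negative criteria: when NOT to call, and what to call instead.
        "Use this only after verifying the patient via lookup_patient_by_phone "
        "or lookup_patient_by_id. Do NOT use this for rescheduling; "
        "use reschedule_appointment instead."
    ),
    # parameters omitted for brevity
}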
3. Parameter Sourcing Rules
Tell the model where each parameter comes from:
"patient_id: must come from lookup_patient_by_phone or similar. Do not invent."
"start_time: must be from the available_slots returned by get_available_slots."
Hallucinated IDs become rare when sourcing rules are explicit.
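Sourcing rules belong on the individual parameters, in each property's description. A sketch:

book_appointment_parameters = {
    "type": "object",
    "properties": {
        "patient_id": {
            "type": "string",
            # Sourcing rule: the value must be copied from a prior tool result.
            "description": "Must come from lookup_patient_by_phone or "
                           "lookup_patient_by_id. Do not invent this value.",
        },
        "start_time": {
            "type": "string",
            "description": "Must be one of the available_slots returned by "
                           "get_available_slots. ISO 8601 with timezone offset.",
        },
    },
    "required": ["patient_id", "start_time"],
}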
4. Examples in Schema
JSON Schema's examples field is read by frontier models. Include 1-2 representative examples:
"examples": [
{ "patient_id": "a1b2c3...", "start_time": "2026-04-25T10:00:00-05:00", ... }
]
In practice, a concrete example nails down formats (timestamps, ID shapes) more effectively than additional descriptive text.
5. Strict Types and Enums
Use enum instead of free-form strings where possible:
"appointment_type": {
"type": "string",
"enum": ["new_patient", "follow_up", "emergency", "consultation"]
}
An enum constrains the output to valid values and reduces hallucinated types.
6. Validate, Error, Retry
Validate every tool call server-side. On failure, return a structured error the LLM can read:
{ "error": "patient_id is invalid: a1b2c3 is not a valid UUID. Did you mean to call lookup_patient first?" }
Specific error messages let the LLM correct itself in one retry.
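A sketch of the server-side loop: validate before executing, return the structured error as the tool result, and allow one retry. The message shape is OpenAI-style, and validate_call, execute, and call_model are hypothetical stand-ins for your own plumbing:

import json

MAX_RETRIES = 1

def run_tool_call(call, messages):
    """Validate, execute, and on failure feed the error back for one retry."""
    for _ in range(MAX_RETRIES + 1):
        error = validate_call(call)   # hypothetical: e.g. UUID check on patient_id
        if error is None:
            return execute(call)      # hypothetical: runs the validated tool
        # Return the error AS the tool result so the model can read it and fix the call.
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps({"error": error}),
        })
        call = call_model(messages)   # hypothetical: model retries with the error in context
    raise RuntimeError(f"tool call still invalid after {MAX_RETRIES} retry: {error}")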
7. Group Related Tools
For agents with many tools, group them:
"appointment_tools": [book_appointment, cancel_appointment, reschedule_appointment]
"patient_tools": [lookup_patient_by_phone, lookup_patient_by_id, create_patient]
Helps the model navigate large tool catalogs.
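One lightweight way to implement the grouping is a directory rendered into the system prompt; a sketch using the article's example tools:

TOOL_GROUPS = {
    "appointment_tools": ["book_appointment", "cancel_appointment",
                          "reschedule_appointment"],
    "patient_tools": ["lookup_patient_by_phone", "lookup_patient_by_id",
                      "create_patient"],
}

def tool_directory() -> str:
    """Render the groups as a short directory for the system prompt."""
    lines = ["Available tool groups:"]
    for group, tools in TOOL_GROUPS.items():
        lines.append(f"- {group}: {', '.join(tools)}")
    return "\n".join(lines)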
8. Confirm Before Destructive
For irreversible actions (cancel, delete, send money):
System prompt: "For cancel, refund, and payment actions, always confirm with the user before calling the tool."
This adds a human confirmation step and reduces costly mistakes.
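Belt and suspenders: enforce the same rule in code so a prompt regression can't silently drop the check. A sketch, where refund_payment, send_payment, and the confirmed_by_user flag are hypothetical names for your own plumbing:

DESTRUCTIVE = {"cancel_appointment", "refund_payment", "send_payment"}

def guard(call, state):
    """Block destructive tool calls until the user has explicitly confirmed."""
    if call["name"] in DESTRUCTIVE and not state.get("confirmed_by_user"):
        # Don't execute. Return an instruction the model will relay and act on.
        return {"error": "Confirmation required. Ask the user to confirm, "
                         "then call this tool again."}
    return execute(call)  # hypothetical executor, as in pattern 6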
9. Surface Tool Errors Clearly
When a tool errors, do not have the bot say "something went wrong." Have it say what went wrong and what the user can do:
"I couldn't find an available slot at that time. The next available slots are 2pm or 4pm."
10. Pin Tool List to Context
Don't change the available tool list mid-conversation if you can avoid it. Stable tool lists improve cache hit rates and reduce model confusion.
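Concretely, that means defining the tool list once and passing the same object on every turn, so the serialized prompt prefix stays byte-stable for caching. A sketch; call_model is again a hypothetical stand-in, and the tool definitions are from the earlier sketches:

# Defined once at startup and never mutated mid-conversation.
TOOLS = [book_appointment, cancel_appointment, reschedule_appointment]

def turn(messages):
    # The identical tools object every turn keeps the serialized prompt
    # prefix byte-stable, which is what prompt caching keys on.
    return call_model(messages=messages, tools=TOOLS)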
Other Patterns Worth Knowing
- Use specific verbs in tool names ("schedule" vs "do")
- Order parameters from most-required to most-optional
- Document return shape in the description
- Indicate side effects ("this sends an email")
- Specify timezone handling if relevant
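Several of these fold naturally into a single definition. A sketch that combines them; all fields and wording are illustrative:

send_confirmation_email = {
    # Specific verb in the name, not a generic "do" or "handle".
    "name": "send_confirmation_email",
    "description": (
        "Send a booking confirmation email to the patient. "
        # Side effect, return shape, and timezone handling documented up front.
        "Side effect: this sends a real email. "
        'Returns {"message_id": str, "sent_at": ISO-8601 UTC}. '
        "Times in the email body are rendered in the clinic's local timezone."
    ),
    "parameters": {
        "type": "object",
        # Required parameters listed first, optional ones after.
        "properties": {
            "patient_id": {"type": "string",
                           "description": "From lookup_patient_by_phone or _by_id."},
            "appointment_id": {"type": "string"},
            "cc_caregiver": {"type": "boolean",
                             "description": "Optional; defaults to false."},
        },
        "required": ["patient_id", "appointment_id"],
    },
}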
What Goes Wrong Without These
flowchart TD
Without[Without these patterns] --> W1[Wrong tool selected]
Without --> W2[Hallucinated IDs]
Without --> W3[Loops on errors]
Without --> W4[Destructive mistakes]
Without --> W5[Cache misses inflate cost]
Each is preventable with deliberate prompt design.
Test Coverage
Every tool you ship should have unit tests:
- Successful call with normal inputs
- Failure with bad inputs
- Edge cases the schema does not cover
- Long-tail valid inputs
The tests catch regressions when prompts or tool definitions change.
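A sketch of that checklist as pytest-style tests. book_appointment_impl is a hypothetical direct handle on the tool's implementation, and the IDs are placeholder fixtures:

def test_books_with_normal_inputs():
    result = book_appointment_impl(
        patient_id="7f9c2e44-0000-0000-0000-000000000000",  # placeholder UUID
        start_time="2026-04-25T10:00:00-05:00",
    )
    assert result["status"] == "booked"

def test_rejects_invented_patient_id():
    # The failure path should return the structured, self-correcting error
    # from pattern 6, not raise an opaque exception.
    result = book_appointment_impl(
        patient_id="a1b2c3",
        start_time="2026-04-25T10:00:00-05:00",
    )
    assert "not a valid UUID" in result["error"]

def test_edge_case_start_time_in_the_past():
    # The schema can't express "must be in the future"; the validator must.
    result = book_appointment_impl(
        patient_id="7f9c2e44-0000-0000-0000-000000000000",
        start_time="1999-01-01T10:00:00-05:00",
    )
    assert "error" in result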