Prompt Engineering for Tool-Calling Agents: 10 Patterns That Work
Tool-calling reliability is mostly a prompt-engineering problem. These are the patterns that, as of 2026, consistently improve function-call accuracy.
Why Prompts Decide Reliability
Frontier models are all generally capable function-callers. In practice, reliability differences between agents come more from prompt design than from model choice: get the prompts right and a mid-tier model can outperform a frontier model saddled with sloppy prompts.
This piece is a working catalog of 10 patterns that consistently improve tool-calling accuracy.
The Patterns
flowchart TB
P[Patterns] --> P1[1. Single-purpose function names]
P --> P2[2. Negative criteria in descriptions]
P --> P3[3. Parameter sourcing rules]
P --> P4[4. Examples in schema]
P --> P5[5. Strict types and enums]
P --> P6[6. Validate, error, retry]
P --> P7[7. Group related tools]
P --> P8[8. Confirm before destructive]
P --> P9[9. Surface tool errors clearly]
P --> P10[10. Pin tool list to context]
1. Single-Purpose Function Names
Good: book_appointment, cancel_appointment, reschedule_appointment
Bad: appointment (with mode parameter)
Single-purpose functions are easier for the model to pick correctly. Multimode functions invite mode confusion.
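As tool definitions, the difference looks something like this. A minimal sketch in the OpenAI-style function format, schemas abbreviated:

# Good: one tool per action. Tool choice reduces to name matching.
good_tools = [
    {"name": "book_appointment", "description": "Book a new appointment."},
    {"name": "cancel_appointment", "description": "Cancel an existing appointment."},
    {"name": "reschedule_appointment", "description": "Move an appointment to a new slot."},
]

# Bad: one tool, three behaviors. The model now has to get the tool
# AND the mode right, which doubles the ways a call can go wrong.
bad_tool = {
    "name": "appointment",
    "description": "Book, cancel, or reschedule an appointment.",
    "parameters": {
        "type": "object",
        "properties": {"mode": {"type": "string",
                                "enum": ["book", "cancel", "reschedule"]}},
    },
}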
2. Negative Criteria in Descriptions
Tell the model when NOT to call:
"Use this only after verifying patient via lookup_patient_*. Do NOT use this for rescheduling — use reschedule_appointment instead."
Explicit negatives prevent overlap mistakes.
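The negative criteria live in the tool's top-level description field. A sketch, with the field layout following the common OpenAI/Anthropic shape and parameters omitted:

book_appointment = {
    "name": "book_appointment",
    "description": (
        "Book a new appointment for a verified patient. "
        # The negative criteria: when NOT to call, and what to call instead.
        "Use this only after verifying the patient via lookup_patient_by_phone "
        "or lookup_patient_by_id. Do NOT use this for rescheduling; "
        "use reschedule_appointment instead."
    ),
    # parameters omitted for brevity
}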
3. Parameter Sourcing Rules
Tell the model where each parameter comes from:
"patient_id: must come from lookup_patient_by_phone or similar. Do not invent."
"start_time: must be from the available_slots returned by get_available_slots."
Hallucinated IDs become rare when sourcing rules are explicit.
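Sourcing rules belong on the individual parameters, in each property's description. A sketch:

book_appointment_parameters = {
    "type": "object",
    "properties": {
        "patient_id": {
            "type": "string",
            # Sourcing rule: the value must be copied from a prior tool result.
            "description": "Must come from lookup_patient_by_phone or "
                           "lookup_patient_by_id. Do not invent this value.",
        },
        "start_time": {
            "type": "string",
            "description": "Must be one of the available_slots returned by "
                           "get_available_slots. ISO 8601 with timezone offset.",
        },
    },
    "required": ["patient_id", "start_time"],
}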
4. Examples in Schema
JSON Schema's examples field is read by frontier models. Include 1-2 representative examples:
"examples": [
{ "patient_id": "a1b2c3...", "start_time": "2026-04-25T10:00:00-05:00", ... }
]
In practice, a concrete example nails down formats (timestamps, ID shapes) more effectively than additional descriptive text.
5. Strict Types and Enums
Use enum instead of free-form strings where possible:
"appointment_type": {
"type": "string",
"enum": ["new_patient", "follow_up", "emergency", "consultation"]
}
An enum constrains the output to valid values and reduces hallucinated types.
6. Validate, Error, Retry
Validate every tool call server-side. On failure, return a structured error the LLM can read:
{ "error": "patient_id is invalid: a1b2c3 is not a valid UUID. Did you mean to call lookup_patient first?" }
Specific error messages let the LLM correct itself in one retry.
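A sketch of the server-side loop: validate before executing, return the structured error as the tool result, and allow one retry. The message shape is OpenAI-style, and validate_call, execute, and call_model are hypothetical stand-ins for your own plumbing:

import json

MAX_RETRIES = 1

def run_tool_call(call, messages):
    """Validate, execute, and on failure feed the error back for one retry."""
    for _ in range(MAX_RETRIES + 1):
        error = validate_call(call)   # hypothetical: e.g. UUID check on patient_id
        if error is None:
            return execute(call)      # hypothetical: runs the validated tool
        # Return the error AS the tool result so the model can read it and fix the call.
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps({"error": error}),
        })
        call = call_model(messages)   # hypothetical: model retries with the error in context
    raise RuntimeError(f"tool call still invalid after {MAX_RETRIES} retry: {error}")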
7. Group Related Tools
For agents with many tools, group them:
"appointment_tools": [book_appointment, cancel_appointment, reschedule_appointment]
"patient_tools": [lookup_patient_by_phone, lookup_patient_by_id, create_patient]
Helps the model navigate large tool catalogs.
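One lightweight way to implement the grouping is a directory rendered into the system prompt; a sketch using the article's example tools:

TOOL_GROUPS = {
    "appointment_tools": ["book_appointment", "cancel_appointment",
                          "reschedule_appointment"],
    "patient_tools": ["lookup_patient_by_phone", "lookup_patient_by_id",
                      "create_patient"],
}

def tool_directory() -> str:
    """Render the groups as a short directory for the system prompt."""
    lines = ["Available tool groups:"]
    for group, tools in TOOL_GROUPS.items():
        lines.append(f"- {group}: {', '.join(tools)}")
    return "\n".join(lines)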
8. Confirm Before Destructive
For irreversible actions (cancel, delete, send money):
System prompt: "For cancel, refund, and payment actions, always confirm with the user before calling the tool."
This adds a human confirmation step and reduces costly mistakes.
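Belt and suspenders: enforce the same rule in code so a prompt regression can't silently drop the check. A sketch, where refund_payment, send_payment, and the confirmed_by_user flag are hypothetical names for your own plumbing:

DESTRUCTIVE = {"cancel_appointment", "refund_payment", "send_payment"}

def guard(call, state):
    """Block destructive tool calls until the user has explicitly confirmed."""
    if call["name"] in DESTRUCTIVE and not state.get("confirmed_by_user"):
        # Don't execute. Return an instruction the model will relay and act on.
        return {"error": "Confirmation required. Ask the user to confirm, "
                         "then call this tool again."}
    return execute(call)  # hypothetical executor, as in pattern 6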
9. Surface Tool Errors Clearly
When a tool errors, do not have the bot say "something went wrong." Have it say what went wrong and what the user can do:
"I couldn't find an available slot at that time. The next available slots are 2pm or 4pm."
10. Pin Tool List to Context
Don't change the available tool list mid-conversation if you can avoid it. Stable tool lists improve cache hit rates and reduce model confusion.
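Concretely, that means defining the tool list once and passing the same object on every turn, so the serialized prompt prefix stays byte-stable for caching. A sketch; call_model is again a hypothetical stand-in, and the tool definitions are from the earlier sketches:

# Defined once at startup and never mutated mid-conversation.
TOOLS = [book_appointment, cancel_appointment, reschedule_appointment]

def turn(messages):
    # The identical tools object every turn keeps the serialized prompt
    # prefix byte-stable, which is what prompt caching keys on.
    return call_model(messages=messages, tools=TOOLS)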
Other Patterns Worth Knowing
- Use specific verbs in tool names ("schedule" vs "do")
- Order parameters from most-required to most-optional
- Document return shape in the description
- Indicate side effects ("this sends an email")
- Specify timezone handling if relevant
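Several of these fold naturally into a single definition. A sketch that combines them; all fields and wording are illustrative:

send_confirmation_email = {
    # Specific verb in the name, not a generic "do" or "handle".
    "name": "send_confirmation_email",
    "description": (
        "Send a booking confirmation email to the patient. "
        # Side effect, return shape, and timezone handling documented up front.
        "Side effect: this sends a real email. "
        'Returns {"message_id": str, "sent_at": ISO-8601 UTC}. '
        "Times in the email body are rendered in the clinic's local timezone."
    ),
    "parameters": {
        "type": "object",
        # Required parameters listed first, optional ones after.
        "properties": {
            "patient_id": {"type": "string",
                           "description": "From lookup_patient_by_phone or _by_id."},
            "appointment_id": {"type": "string"},
            "cc_caregiver": {"type": "boolean",
                             "description": "Optional; defaults to false."},
        },
        "required": ["patient_id", "appointment_id"],
    },
}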
What Goes Wrong Without These
flowchart TD
Without[Without these patterns] --> W1[Wrong tool selected]
Without --> W2[Hallucinated IDs]
Without --> W3[Loops on errors]
Without --> W4[Destructive mistakes]
Without --> W5[Cache misses inflate cost]
Each is preventable with deliberate prompt design.
Test Coverage
Every tool you ship should have unit tests:
- Successful call with normal inputs
- Failure with bad inputs
- Edge cases the schema does not cover
- Long-tail valid inputs
The tests catch regressions when prompts or tool definitions change.
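A sketch of that checklist as pytest-style tests. book_appointment_impl is a hypothetical direct handle on the tool's implementation, and the IDs are placeholder fixtures:

def test_books_with_normal_inputs():
    result = book_appointment_impl(
        patient_id="7f9c2e44-0000-0000-0000-000000000000",  # placeholder UUID
        start_time="2026-04-25T10:00:00-05:00",
    )
    assert result["status"] == "booked"

def test_rejects_invented_patient_id():
    # The failure path should return the structured, self-correcting error
    # from pattern 6, not raise an opaque exception.
    result = book_appointment_impl(
        patient_id="a1b2c3",
        start_time="2026-04-25T10:00:00-05:00",
    )
    assert "not a valid UUID" in result["error"]

def test_edge_case_start_time_in_the_past():
    # The schema can't express "must be in the future"; the validator must.
    result = book_appointment_impl(
        patient_id="7f9c2e44-0000-0000-0000-000000000000",
        start_time="1999-01-01T10:00:00-05:00",
    )
    assert "error" in result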