
Prompt Engineering for Tool-Calling Agents: 10 Patterns That Work

Tool-calling reliability is mostly a prompt-engineering problem. The 2026 patterns that consistently improve function-call accuracy.

Why Prompts Decide Reliability

Frontier models are generally capable function-callers. Reliability differences between agents come mostly from prompt design, not model choice. Get the prompts right and a mid-tier model outperforms a frontier model with sloppy prompts.

This piece is a working catalog of 10 patterns that consistently improve tool-calling accuracy.

The Patterns

flowchart TB
    P[Patterns] --> P1[1. Single-purpose function names]
    P --> P2[2. Negative criteria in descriptions]
    P --> P3[3. Parameter sourcing rules]
    P --> P4[4. Examples in schema]
    P --> P5[5. Strict types and enums]
    P --> P6[6. Validate, error, retry]
    P --> P7[7. Group related tools]
    P --> P8[8. Confirm before destructive]
    P --> P9[9. Surface tool errors clearly]
    P --> P10[10. Pin tool list to context]

1. Single-Purpose Function Names

Good: book_appointment, cancel_appointment, reschedule_appointment
Bad: appointment (with mode parameter)

Single-purpose functions are easier for the model to pick correctly. Multimode functions invite mode confusion.
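In the common OpenAI-style tools format, the split above looks roughly like this. A minimal sketch; the parameter shapes are illustrative, not a real API:

```python
# Minimal sketch of single-purpose tool definitions in the OpenAI-style
# "tools" format. Parameter shapes here are illustrative.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book a new appointment for a verified patient.",
            "parameters": {
                "type": "object",
                "properties": {
                    "patient_id": {"type": "string"},
                    "start_time": {"type": "string", "format": "date-time"},
                },
                "required": ["patient_id", "start_time"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_appointment",
            "description": "Cancel an existing appointment by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"appointment_id": {"type": "string"}},
                "required": ["appointment_id"],
            },
        },
    },
]

# One name per action -- no "mode" parameter for the model to confuse.
TOOL_NAMES = [t["function"]["name"] for t in TOOLS]
```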

2. Negative Criteria in Descriptions

Tell the model when NOT to call:

"Use this only after verifying patient via lookup_patient_*. Do NOT use this for rescheduling — use reschedule_appointment instead."

Explicit negatives prevent overlap mistakes.
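A sketch of a description carrying that negative, plus a cheap startup lint. The `routes_overlap` helper is hypothetical, not a library function:

```python
# Hypothetical tool description embedding explicit negative criteria.
BOOK_DESCRIPTION = (
    "Book a new appointment. Use this only after verifying the patient "
    "via lookup_patient_by_phone. Do NOT use this for rescheduling -- "
    "use reschedule_appointment instead."
)

def routes_overlap(description: str, sibling_tool: str) -> bool:
    """Cheap lint: does this description redirect overlap cases to the
    sibling tool by name?"""
    return sibling_tool in description
```

Running a check like this over every pair of overlapping tools at startup catches missing negatives before they reach production.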

3. Parameter Sourcing Rules

Tell the model where each parameter comes from:

"patient_id: must come from lookup_patient_by_phone or similar. Do not invent."
"start_time: must be from the available_slots returned by get_available_slots."

Hallucinated IDs become rare when sourcing rules are explicit.
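Sourcing rules live in the parameter descriptions themselves. A sketch, with field names following the examples above:

```python
# Sketch: sourcing rules embedded directly in parameter descriptions.
BOOK_PARAMETERS = {
    "type": "object",
    "properties": {
        "patient_id": {
            "type": "string",
            "description": (
                "Must come from lookup_patient_by_phone or "
                "lookup_patient_by_id. Do not invent."
            ),
        },
        "start_time": {
            "type": "string",
            "description": (
                "Must be one of the available_slots returned by "
                "get_available_slots. Do not guess a time."
            ),
        },
    },
    "required": ["patient_id", "start_time"],
}
```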

4. Examples in Schema

JSON Schema's examples field is read by frontier models. Include 1-2 representative examples:

"examples": [
  { "patient_id": "a1b2c3...", "start_time": "2026-04-25T10:00:00-05:00", ... }
]

A single concrete example often communicates the expected format more effectively than additional descriptive text.
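A fuller sketch of a schema carrying one representative example, assuming the provider forwards the examples field to the model. The UUID and timestamp are illustrative values:

```python
# Sketch: a tool schema with one representative example attached.
# The UUID and timestamp below are illustrative, not real data.
BOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "patient_id": {"type": "string"},
        "start_time": {"type": "string", "format": "date-time"},
    },
    "required": ["patient_id", "start_time"],
    "examples": [
        {
            "patient_id": "8f14e45f-ceea-4e5b-9c1d-2b0a6f3d9e21",
            "start_time": "2026-04-25T10:00:00-05:00",
        }
    ],
}
```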


5. Strict Types and Enums

Use enum instead of free-form strings where possible:

"appointment_type": {
  "type": "string",
  "enum": ["new_patient", "follow_up", "emergency", "consultation"]
}

An enum constrains the output to valid values and reduces hallucinated types.

6. Validate, Error, Retry

Validate every tool call server-side. On failure, return a structured error the LLM can read:

{ "error": "patient_id is invalid: a1b2c3 is not a valid UUID. Did you mean to call lookup_patient first?" }

Specific error messages let the LLM correct itself in one retry.
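A minimal server-side validator that produces that kind of self-correcting error. `validate_book_call` is a hypothetical helper; only the UUID check is shown:

```python
import uuid

def validate_book_call(args: dict) -> dict:
    """Check a book_appointment call server-side before executing it.
    Returns {"ok": True} on success, or a structured error the model
    can read and act on in a single retry."""
    patient_id = args.get("patient_id", "")
    try:
        uuid.UUID(patient_id)
    except ValueError:
        return {
            "error": (
                f"patient_id is invalid: {patient_id!r} is not a valid "
                "UUID. Did you mean to call lookup_patient first?"
            )
        }
    return {"ok": True}
```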

7. Group Related Tools

For agents with many tools, group them:

"appointment_tools": [book_appointment, cancel_appointment, reschedule_appointment]
"patient_tools": [lookup_patient_by_phone, lookup_patient_by_id, create_patient]

Helps the model navigate large tool catalogs.

8. Confirm Before Destructive

For irreversible actions (cancel, delete, send money):

System prompt: "For cancel, refund, and payment actions, always confirm with the user before calling the tool."

This adds a safety check and reduces costly mistakes.
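The same rule can also be enforced in code as a guardrail, independent of whether the model obeys the prompt. A sketch; the tool names follow the examples in this post:

```python
# Tools whose effects are hard or impossible to undo.
DESTRUCTIVE_TOOLS = {"cancel_appointment", "refund_payment", "send_payment"}

def needs_confirmation(tool_name: str, user_confirmed: bool) -> bool:
    """Block a destructive call until the user has explicitly confirmed."""
    return tool_name in DESTRUCTIVE_TOOLS and not user_confirmed
```

Belt and suspenders: the system prompt asks the model to confirm, and the gate refuses to execute without confirmation.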

9. Surface Tool Errors Clearly

When a tool errors, do not have the bot say "something went wrong." Have it say what went wrong and what the user can do:

"I couldn't find an available slot at that time. The next available slots are 2pm or 4pm."

10. Pin Tool List to Context

Don't change the available tool list mid-conversation if you can avoid it. Stable tool lists improve cache hit rates and reduce model confusion.

Other Patterns Worth Knowing

  • Use specific verbs in tool names ("schedule" vs "do")
  • Order parameters from most-required to most-optional
  • Document return shape in the description
  • Indicate side effects ("this sends an email")
  • Specify timezone handling if relevant

What Goes Wrong Without These

flowchart TD
    Without[Without these patterns] --> W1[Wrong tool selected]
    Without --> W2[Hallucinated IDs]
    Without --> W3[Loops on errors]
    Without --> W4[Destructive mistakes]
    Without --> W5[Cache misses inflate cost]

Each is preventable with deliberate prompt design.

Test Coverage

Every tool you ship should have unit tests:

  • Successful call with normal inputs
  • Failure with bad inputs
  • Edge cases the schema does not cover
  • Long-tail valid inputs

The tests catch regressions when prompts or tool definitions change.
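A sketch of what that coverage looks like for a slot-validation helper. `validate_slot` and the slot values are hypothetical:

```python
# Hypothetical tool-side helper under test: checks a requested time
# against the slots the scheduling backend actually offered.
AVAILABLE_SLOTS = {
    "2026-04-25T10:00:00-05:00",
    "2026-04-25T14:00:00-05:00",
}

def validate_slot(start_time: str) -> bool:
    return start_time in AVAILABLE_SLOTS

def test_normal_input():
    assert validate_slot("2026-04-25T10:00:00-05:00")

def test_bad_input():
    assert not validate_slot("not-a-timestamp")

def test_edge_case_empty_string():
    assert not validate_slot("")
```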
