Debugging LLM Responses: When the Model Says Something Wrong or Unexpected

The Model Said What?

Every developer building AI agents hits the same wall: the model returns something confidently wrong, hallucinates data that does not exist, or ignores a clear instruction. The instinct is to rewrite the entire prompt from scratch. That is almost never the right first step.

Debugging LLM responses requires the same discipline as debugging traditional software. You isolate the problem, form a hypothesis, test it, and iterate. The difference is that LLMs are stochastic — the same input can produce different outputs — so your debugging toolkit needs to account for non-determinism.

Step 1: Capture the Full Request and Response

Before you change anything, log the exact request that produced the bad output. This means the system prompt, user message, conversation history, tool definitions, and all model parameters:

flowchart TD
    SPEC(["Task spec"])
    SYSTEM["System prompt<br/>role plus rules"]
    SHOTS["Few shot examples<br/>3 to 5"]
    VARS["Variable injection<br/>Jinja or f-string"]
    COT["Chain of thought<br/>or scratchpad"]
    CONSTR["Output constraint<br/>JSON schema"]
    LLM["LLM call"]
    EVAL["Offline eval<br/>LLM as judge plus regex"]
    GATE{"Score over<br/>threshold?"}
    COMMIT(["Promote to prod<br/>version pinned"])
    REVISE(["Revise prompt"])
    SPEC --> SYSTEM --> SHOTS --> VARS --> COT --> CONSTR --> LLM --> EVAL --> GATE
    GATE -->|Yes| COMMIT
    GATE -->|No| REVISE --> SYSTEM
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style COMMIT fill:#059669,stroke:#047857,color:#fff

import json
import openai
from datetime import datetime, timezone

class LLMDebugger:
    """Thin wrapper that records each request/response pair for later replay."""

    def __init__(self, client: openai.AsyncOpenAI):
        self.client = client
        self.debug_log = []

    async def chat(self, messages, model="gpt-4o", temperature=1.0, **kwargs):
        request_payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            **kwargs,  # tools, max_tokens, response_format, and so on
        }

        # Capture the full request with a timezone-aware UTC timestamp
        # (datetime.utcnow() is deprecated as of Python 3.12)
        debug_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "request": request_payload,
        }

        response = await self.client.chat.completions.create(**request_payload)

        # Capture the parts of the response needed to diagnose bad output
        debug_entry["response"] = {
            "content": response.choices[0].message.content,  # None for tool-only replies
            "finish_reason": response.choices[0].finish_reason,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
        }
        self.debug_log.append(debug_entry)
        return response

    def dump_last(self):
        """Pretty-print the most recent request/response pair."""
        if self.debug_log:
            # default=str keeps the dump from failing on kwargs that are not JSON-serializable
            print(json.dumps(self.debug_log[-1], indent=2, default=str))

With the full request captured, you can replay it to see if the problem is deterministic or intermittent.
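To replay a request later, the captured log has to outlive the process. The following is a minimal sketch, assuming a JSON Lines file is an acceptable storage format; `save_debug_log` and `load_requests` are hypothetical helper names, not part of the OpenAI SDK, and the entries are assumed to have the same shape as `LLMDebugger.debug_log`.

```python
import json
from pathlib import Path


def save_debug_log(entries, path):
    """Write one JSON object per line so entries are easy to grep and diff."""
    with Path(path).open("w", encoding="utf-8") as f:
        for entry in entries:
            # default=str is a fallback for values json cannot serialize natively
            f.write(json.dumps(entry, default=str) + "\n")


def load_requests(path):
    """Return only the request payloads, in the order they were logged."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line)["request"] for line in f if line.strip()]
```

Each loaded payload can then be sent again unchanged with `await client.chat.completions.create(**payload)`, so the replay uses exactly the prompt, history, and parameters that produced the bad output.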

Step 2: Check Temperature and Sampling

Temperature is the most common hidden cause of inconsistent behavior. A temperature of 1.0 introduces significant randomness. For agent tasks that require precision — tool selection, data extraction, classification — lower the temperature:

# High temperature: creative but unpredictable
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.0,  # Too high for structured tasks
)

# Low temperature: more consistent output, though still not fully deterministic
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.1,  # Suitable for tool calls and extraction
)

Run the same prompt 10 times at your current temperature. If the bad output appears in only 2 of 10 runs, the problem is more likely sampling variance than a systematic prompt flaw. If it appears in most runs, the prompt itself needs work.
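The repeated-run check can be scripted. This is a sketch under stated assumptions: `call` is a hypothetical zero-argument coroutine that sends your captured request and returns the model's text, and `is_bad` is a hypothetical predicate that recognizes the failure you are chasing. Neither comes from the OpenAI SDK.

```python
import asyncio
from typing import Awaitable, Callable


async def failure_rate(
    call: Callable[[], Awaitable[str]],
    is_bad: Callable[[str], bool],
    runs: int = 10,
) -> float:
    """Run the same request `runs` times and return the fraction of bad outputs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        # Sequential calls keep the sketch simple and avoid bursts against rate limits
        output = await call()
        if is_bad(output):
            failures += 1
    return failures / runs
```

A rate near 0.2 points toward sampling variance, while a rate near 1.0 suggests the prompt reliably produces the failure.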

Step 3: Isolate the Prompt Section

When the full prompt is long, identify which section is causing the issue. Remove sections one at a time and rerun the request with each reduced prompt:

import re

def build_diagnostic_prompts(full_system_prompt: str, user_message: str):
    """Generate prompt variants, each with one "## " section removed."""
    # Split just before each "## " heading so every section keeps its own heading marker
    sections = re.split(r"\n(?=## )", full_system_prompt)
    variants = []

    for i, section in enumerate(sections):
        # Remove one section at a time
        reduced = "\n".join(
            s for j, s in enumerate(sections) if j != i
        )
        variants.append({
            "removed_section": i,
            "section_preview": section[:80],
            "messages": [
                {"role": "system", "content": reduced},
                {"role": "user", "content": user_message},
            ],
        })
    return variants

If removing a section fixes the problem, that section contains a conflicting or confusing instruction.
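One way to act on the variants is to send each one and record which removals make the failure disappear. The sketch below assumes variants shaped like the output of `build_diagnostic_prompts`; `call` (a coroutine that takes a messages list and returns text) and `is_bad` are hypothetical stand-ins for your own client wrapper and failure check.

```python
import asyncio
from typing import Awaitable, Callable


async def find_suspect_sections(
    variants: list[dict],
    call: Callable[[list[dict]], Awaitable[str]],
    is_bad: Callable[[str], bool],
) -> list[int]:
    """Return the indexes of sections whose removal made the bad output go away."""
    suspects = []
    for variant in variants:
        output = await call(variant["messages"])
        if not is_bad(output):
            # The failure vanished once this section was removed
            suspects.append(variant["removed_section"])
    return suspects
```

Because a single run can pass by chance at a nonzero temperature, it is worth repeating each variant a few times before blaming a section.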

Step 4: Add Few-Shot Examples

When the model consistently misinterprets an instruction, few-shot examples are more effective than adding more explanation. Show the model what you want:

system_prompt = """You are a support agent. Extract the issue category.

Example input: "My payment was charged twice"
Example output: {"category": "billing", "urgency": "high"}

Example input: "How do I change my password?"
Example output: {"category": "account", "urgency": "low"}

Always respond with valid JSON only."""

Few-shot examples anchor the model to a specific output pattern. Two or three examples are usually sufficient.
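To confirm that the examples actually fixed the format, it helps to check outputs mechanically instead of by eye. This is a minimal sketch that assumes the two keys shown in the examples above, `category` and `urgency`, are the only ones expected; `check_extraction` is a hypothetical helper name.

```python
import json

REQUIRED_KEYS = {"category", "urgency"}  # taken from the few-shot examples


def check_extraction(raw: str) -> list[str]:
    """Return a list of format problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    extra = data.keys() - REQUIRED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

Running this over a batch of responses before and after adding the examples gives a concrete measure of whether the format issue is resolved.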

FAQ

How do I debug a hallucinated tool call where the model invents a tool that does not exist?

Check that your tool definitions include clear, distinct descriptions. Models hallucinate tool names when existing tool descriptions are vague or overlap. Reduce temperature to 0.1 for tool selection and verify that the tools array in your request contains all expected entries. If the model still invents tools, add a system instruction explicitly stating it must only use the tools provided.
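A quick guard can catch invented tools before your agent tries to execute them. The sketch below assumes tool definitions in the Chat Completions function-tool shape (`{"type": "function", "function": {"name": ...}}`) and a plain list of the tool names the model tried to call; `unknown_tool_names` is a hypothetical helper.

```python
def unknown_tool_names(tools: list[dict], called_names: list[str]) -> list[str]:
    """Return the called tool names that do not appear in the tool definitions."""
    defined = {
        tool["function"]["name"]
        for tool in tools
        if tool.get("type") == "function"
    }
    return [name for name in called_names if name not in defined]
```

If the returned list is non-empty, log the full request alongside it. A tool missing from the tools array and a tool the model invented look identical in the response, and only the request tells them apart.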

Should I always use temperature 0 for deterministic behavior?

Temperature 0 makes the output nearly deterministic but not perfectly so — there can be minor variations due to floating-point arithmetic across different hardware. Use temperature 0 or 0.1 for tasks requiring precision such as classification, extraction, and tool selection. Reserve higher temperatures for creative tasks like content generation where variety is desirable.

How many few-shot examples should I include to fix a recurring output format issue?

Two to three examples are usually enough to anchor the model to a specific format. More than five examples increase token usage without proportional improvement. Place examples near the beginning of the system prompt where they receive the most attention from the model.

