---
title: "Claude's Extended Thinking: When to Use It and When Not To"
description: "Understand Claude's extended thinking feature, how it improves reasoning quality for complex tasks, when it adds value vs. unnecessary cost, and implementation patterns for production applications."
canonical: https://callsphere.ai/blog/claude-extended-thinking-when-to-use
category: "Agentic AI"
tags: ["Extended Thinking", "Claude API", "Reasoning", "Chain of Thought", "Anthropic"]
author: "CallSphere Team"
published: 2026-01-28T00:00:00.000Z
updated: 2026-05-08T09:38:41.821Z
---

# Claude's Extended Thinking: When to Use It and When Not To

> Understand Claude's extended thinking feature, how it improves reasoning quality for complex tasks, when it adds value vs. unnecessary cost, and implementation patterns for production applications.

## What Is Extended Thinking?

Extended thinking is a Claude feature that allocates dedicated reasoning tokens before generating the final response. When enabled, Claude produces a chain-of-thought "thinking" block where it reasons through the problem step by step, then generates its answer based on that reasoning.

This is different from simply asking Claude to "think step by step" in the prompt. Extended thinking uses a separate token budget and processing phase specifically designed for deep reasoning, and the thinking content is returned separately from the response so you can inspect Claude's reasoning process.

## How to Enable Extended Thinking

```python
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Must be >= 1024 and less than max_tokens
    },
    },
    messages=[{
        "role": "user",
        "content": "A farmer needs to cross a river with a wolf, a goat, and a cabbage. The boat can only carry the farmer and one item. If left alone, the wolf will eat the goat, and the goat will eat the cabbage. How can the farmer get everything across safely?"
    }]
)

# The response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("\n=== RESPONSE ===")
        print(block.text)
```

## When Extended Thinking Adds Value

### Complex Mathematical Reasoning

Extended thinking dramatically improves accuracy on multi-step math problems. Without it, Claude might skip steps or make arithmetic errors. With it, Claude works through each step methodically.
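The call below is a sketch of how such a problem might be submitted. The budget and the word problem are illustrative, and the request is only constructed here, not sent:

```python
# Request parameters for a multi-step word problem (illustrative; pass to
# client.messages.create(**math_request) to actually send it).
math_request = dict(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": (
            "A tank holds 2,400 liters. Pump A fills it in 6 hours, pump B "
            "in 4 hours, and an open drain empties it in 12 hours. With all "
            "three running, how long until the tank is full? Show each step."
        ),
    }],
)

# The thinking budget must stay below max_tokens.
assert math_request["thinking"]["budget_tokens"] < math_request["max_tokens"]
```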

**Benchmark improvement**: On the MATH benchmark, extended thinking improves accuracy by 10-20 percentage points compared to standard responses.

### Code Architecture Decisions

When designing complex systems, extended thinking helps Claude consider more alternatives, evaluate tradeoffs, and arrive at better-reasoned recommendations:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": """Design the database schema for a multi-tenant SaaS application that needs:
- Per-tenant data isolation
- Shared resources for common configurations
- Audit logging for compliance
- Support for 10,000+ tenants with varying data volumes
- Sub-100ms query latency for dashboard queries

Consider row-level security, partitioning strategies, and caching layers."""
    }]
)
```

### Ambiguous Requirements Analysis

When requirements are vague or contradictory, extended thinking helps Claude identify and reason through the ambiguities:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{
        "role": "user",
        "content": """Our client wants a 'fast, secure, and cheap' authentication system
that supports 'millions of users' with 'zero downtime' and must be built in '2 weeks.'
Identify the tradeoffs and propose a realistic architecture."""
    }]
)
```

### Multi-Step Planning

Extended thinking excels at tasks that require planning multiple steps with dependencies:

- Migration planning for large codebases
- Incident response procedures
- Project decomposition and scheduling
- Complex SQL query construction

## When NOT to Use Extended Thinking

### Simple Factual Questions

"What is the capital of France?" does not benefit from extended thinking. The answer is immediate and certain. Thinking tokens are wasted.

### Template-Based Generation

Generating emails, form letters, or structured outputs from templates does not require deep reasoning. The overhead of thinking tokens adds cost without improving quality.

### Classification Tasks

Binary or multi-class classification is typically a pattern-matching task that does not benefit from extended reasoning:

```python
# DON'T use extended thinking for this
response = client.messages.create(
    model="claude-haiku-4-5",  # Use Haiku, no thinking
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Classify this email as spam or not spam: 'You won $1M! Click here...'"
    }]
)
```

### High-Volume, Low-Latency Applications

Extended thinking adds latency (the thinking phase runs before the response begins) and cost (thinking tokens are billed as output tokens). For chatbots handling thousands of concurrent conversations, the overhead is unjustified for routine queries.
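One common pattern is to gate thinking behind a complexity check so routine traffic stays fast and cheap. A minimal sketch, assuming an upstream `is_complex` flag from your own routing logic (the budgets and token limits are illustrative):

```python
def build_request(query: str, is_complex: bool) -> dict:
    """Return kwargs for client.messages.create(): cheap and fast by
    default, with extended thinking only for queries flagged as complex."""
    params = {
        "model": "claude-haiku-4-5",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": query}],
    }
    if is_complex:
        # Upgrade to a stronger model and enable a modest thinking budget.
        params["model"] = "claude-sonnet-4-5"
        params["max_tokens"] = 16000
        params["thinking"] = {"type": "enabled", "budget_tokens": 4000}
    return params
```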

## Cost and Latency Impact

### Token Costs

Thinking tokens are billed as output tokens. At Claude Sonnet output rates ($15 per million output tokens, assuming a typical 1,000-token response):

| Budget | Thinking Cost | Typical Response Cost | Total |
| --- | --- | --- | --- |
| 1,000 tokens | $0.015 | $0.015 | $0.030 |
| 5,000 tokens | $0.075 | $0.015 | $0.090 |
| 10,000 tokens | $0.150 | $0.015 | $0.165 |
| 50,000 tokens | $0.750 | $0.015 | $0.765 |
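The table follows directly from billing thinking as output; a tiny helper reproduces it (the $15-per-million rate and 1,000-token response are the table's assumptions):

```python
SONNET_OUTPUT_USD_PER_MTOK = 15.00  # assumed Sonnet output rate

def output_cost(thinking_tokens: int, response_tokens: int = 1000) -> float:
    """Total output cost: thinking tokens bill at the same per-token rate
    as the visible response tokens."""
    total = thinking_tokens + response_tokens
    return round(total * SONNET_OUTPUT_USD_PER_MTOK / 1_000_000, 3)

print(output_cost(10_000))  # matches the 10,000-token row: 0.165
```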

### Latency Impact

Thinking tokens must be generated before the response begins, which directly increases time to first token (TTFT):

- **1,000 thinking tokens**: +1-2 seconds TTFT
- **5,000 thinking tokens**: +5-10 seconds TTFT
- **10,000 thinking tokens**: +10-20 seconds TTFT

For interactive applications, keep thinking budgets modest (1,000-5,000 tokens). For offline analysis, larger budgets (10,000-50,000) are acceptable.

## Streaming with Extended Thinking

You can stream both the thinking and response phases:

```python
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Design a rate limiter for a distributed system."}],
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("\n[Thinking...]")
            elif event.content_block.type == "text":
                print("\n[Response]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                pass  # Optionally show thinking to user
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
```

## Practical Decision Framework

Use this flowchart to decide whether to enable extended thinking:

1. **Is the task time-sensitive (a user is waiting)?** -> No extended thinking
2. **Is the answer deterministic or template-based?** -> No extended thinking
3. **Does the task involve multi-step reasoning?** -> Yes, use 3,000-5,000 budget
4. **Does the task involve complex analysis with tradeoffs?** -> Yes, use 5,000-10,000 budget
5. **Is this an offline analysis or batch job?** -> Yes, use 10,000-50,000 budget
6. **Is correctness critical (financial, medical, legal)?** -> Yes, use maximum budget

## Multi-Turn Conversations with Thinking

In multi-turn conversations, you can append previous assistant turns to the history verbatim, thinking blocks included. The API strips thinking blocks from earlier turns before the model sees them, so they are not billed as input tokens. The exception is tool use: the thinking block from the current assistant turn must be passed back unmodified alongside the tool results.

```python
# First turn with thinking
messages = [{"role": "user", "content": "Design a caching strategy for our API."}]
response1 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)

# Second turn -- include previous thinking in history
messages.append({"role": "assistant", "content": response1.content})
messages.append({"role": "user", "content": "Now consider how this works with database read replicas."})

response2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)
```

## Redacting Thinking in Production

In some applications, you may want to use extended thinking for quality but not expose the thinking process to end users. The thinking content is returned in a separate block, making it easy to filter:

```python
def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(
        block.text for block in response.content if block.type == "text"
    )

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(
        block.thinking for block in response.content if block.type == "thinking"
    )
```

Log the thinking content for debugging and quality analysis, but only return the text response to users.

---

Source: https://callsphere.ai/blog/claude-extended-thinking-when-to-use
