Learn Agentic AI

Claude Extended Thinking: Leveraging Chain-of-Thought for Complex Reasoning

Learn how to use Claude's extended thinking feature to unlock deeper reasoning for complex agent tasks. Understand thinking blocks, budget tokens, and when extended thinking outperforms standard responses.

What Is Extended Thinking

Extended thinking is a Claude feature that lets the model "think out loud" before producing its final answer. When enabled, Claude generates an internal chain-of-thought reasoning trace — a thinking block — that works through the problem step by step before committing to a response.

This is not the same as asking Claude to "think step by step" in a prompt. Extended thinking is a model-level feature where Claude allocates dedicated compute to reasoning. The thinking happens in a structured thinking content block that is returned alongside the final text block, giving you visibility into the model's reasoning process.

Enabling Extended Thinking

Extended thinking requires a thinking configuration with a budget_tokens parameter that controls how many tokens Claude can spend on reasoning:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {"role": "user", "content": "Analyze the trade-offs between microservices and monolithic architecture for a startup with 5 engineers building a fintech product."}
    ]
)

# The thinking block contains the reasoning trace
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("=== RESPONSE ===")
        print(block.text)

The budget_tokens sets the maximum tokens Claude can use for thinking. The model may use fewer tokens if it reaches a conclusion early. The max_tokens must be larger than budget_tokens to leave room for the actual response.
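These constraints can be captured in a small helper that builds valid request parameters. This is a sketch: the 1024-token minimum and the 4000-token response headroom are assumptions for illustration, so check the API documentation for current limits.

```python
def thinking_params(budget_tokens: int, response_headroom: int = 4000) -> dict:
    """Build kwargs for messages.create with a valid thinking budget.

    Assumes budget_tokens must be at least 1024 and max_tokens must
    exceed budget_tokens; the headroom default is illustrative.
    """
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    return {
        "max_tokens": budget_tokens + response_headroom,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }
```

Calling `client.messages.create(model=..., messages=..., **thinking_params(8000))` then guarantees the budget and response ceiling stay consistent.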

Understanding the Response Structure

With extended thinking enabled, the response contains multiple content blocks:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=12000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[
        {"role": "user", "content": "Write a Python function that finds the longest palindromic substring in O(n) time using Manacher's algorithm."}
    ]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking used approximately {len(block.thinking.split())} words")
    elif block.type == "text":
        print(block.text)

# Token usage shows thinking tokens separately
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

The thinking block is visible to you as the developer, but it is not carried into subsequent turns: when you send prior assistant turns back, the API strips their thinking blocks. This means thinking does not accumulate context window usage across multi-turn conversations.
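Since prior thinking is dropped anyway, one approach is to append only the text blocks when building the next turn's history. A minimal sketch, assuming SDK-style content block objects that expose `.type` and `.text`:

```python
def history_safe_content(content_blocks) -> list:
    """Keep only text blocks when appending an assistant turn to history.

    Thinking blocks are dropped; each text block becomes a plain
    text content dict suitable for the messages list.
    """
    return [
        {"type": "text", "text": block.text}
        for block in content_blocks
        if block.type == "text"
    ]
```

Usage: `messages.append({"role": "assistant", "content": history_safe_content(response.content)})` before sending the next user turn.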

When to Use Extended Thinking

Extended thinking is most valuable for tasks that require multi-step reasoning:


import anthropic

client = anthropic.Anthropic()

# Complex analysis task - good candidate for extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system="You are a code review agent. Analyze code for bugs, security issues, and performance problems.",
    messages=[
        {"role": "user", "content": """Review this authentication function:

def authenticate(username, password):
    query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    result = db.execute(query)
    if result:
        token = base64.b64encode(f"{username}:{time.time()}".encode()).decode()
        session['token'] = token
        return {"status": "ok", "token": token}
    return {"status": "fail"}
"""}
    ]
)

for block in response.content:
    if block.type == "text":
        print(block.text)

This is ideal for extended thinking because the model needs to evaluate SQL injection risks, password storage issues, token generation weaknesses, and session management problems — multiple distinct analyses that benefit from structured reasoning.

Budget Token Strategies

The budget allocation depends on task complexity:

import anthropic

client = anthropic.Anthropic()

def smart_query(prompt: str, complexity: str = "medium") -> str:
    budgets = {
        "low": 2000,     # Simple factual questions
        "medium": 6000,  # Analysis and comparison tasks
        "high": 12000,   # Complex reasoning, code generation, math
    }

    budget = budgets.get(complexity, 6000)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=budget + 4000,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}]
    )

    return "".join(
        block.text for block in response.content if block.type == "text"
    )

# Low complexity - fast, cheap
answer = smart_query("What is the capital of France?", "low")

# High complexity - deep reasoning
answer = smart_query(
    "Design a rate limiting system that handles 100K requests/second with geographic distribution",
    "high"
)

Start with lower budgets and increase only when you observe the model cutting its reasoning short. Oversized budgets waste tokens (and money) without improving quality on simple tasks.

Extended Thinking in Agent Loops

When combining extended thinking with tool use, thinking happens before each tool call decision:

import anthropic

client = anthropic.Anthropic()

# Extended thinking works alongside tools
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    tools=[{
        "name": "run_sql",
        "description": "Execute a SQL query and return results.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }],
    messages=[
        {"role": "user", "content": "Find the top 5 customers by lifetime revenue, excluding test accounts."}
    ]
)

# Response may contain: thinking -> text -> tool_use
for block in response.content:
    print(f"Block type: {block.type}")

The thinking block reveals how Claude reasons about which tool to call and what arguments to provide, which is invaluable for debugging agent behavior.
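Continuing the loop after executing a tool has one wrinkle: per Anthropic's tool-use guidance, the assistant turn that requested the tool should be passed back verbatim, thinking block included, before the tool results. A sketch of that continuation step, with illustrative helper names:

```python
def pending_tool_calls(content_blocks) -> list:
    """Return the tool_use blocks from a response, in order."""
    return [b for b in content_blocks if b.type == "tool_use"]


def continue_with_results(messages, response, results_by_id) -> list:
    """Extend the history after executing the requested tools.

    The assistant turn is passed back as response.content verbatim
    (which preserves the thinking block); results_by_id maps each
    tool_use id to its string output.
    """
    return messages + [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": tool_id, "content": output}
            for tool_id, output in results_by_id.items()
        ]},
    ]
```

You would then call `client.messages.create` again with the extended list so Claude can reason over the tool output.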

FAQ

Does extended thinking increase costs?

Yes. Thinking tokens are billed as output tokens, which are more expensive than input tokens. A 10,000 token thinking budget could add significant cost per request. Use extended thinking selectively for tasks where the quality improvement justifies the cost, not for every API call.
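A rough per-request estimate can be computed from the usage counts the API returns. The per-million-token prices below are placeholders, not current pricing; substitute the rates for your model.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 3.00,   # placeholder $/1M input tokens
    output_price_per_m: float = 15.00, # placeholder $/1M output tokens
) -> float:
    """Rough request cost; thinking tokens are counted in output_tokens."""
    return (
        input_tokens * input_price_per_m
        + output_tokens * output_price_per_m
    ) / 1_000_000
```

For example, `estimate_cost_usd(response.usage.input_tokens, response.usage.output_tokens)` makes the cost of a large thinking budget concrete before you roll it out broadly.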

Can I use extended thinking with streaming?

Yes. When streaming with extended thinking, the thinking block arrives first as content_block_delta events carrying thinking_delta payloads, followed by text_delta payloads for the final answer. This lets you show a "reasoning" indicator to users while Claude thinks, then stream the final answer in real time.
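One way to structure this is a small event handler plus a streaming wrapper. The event and delta field names below follow the SDK's streaming events as described here and are assumptions to verify against the current SDK:

```python
def render_stream_event(event) -> str:
    """Map a streaming event to display text.

    Assumes content_block_delta events carry a delta with type
    "thinking_delta" (field .thinking) or "text_delta" (field .text);
    all other event types are ignored.
    """
    if getattr(event, "type", None) != "content_block_delta":
        return ""
    if event.delta.type == "thinking_delta":
        return "."  # reasoning-indicator tick while Claude thinks
    if event.delta.type == "text_delta":
        return event.delta.text
    return ""


def stream_with_thinking(client, prompt: str) -> str:
    """Stream a response, printing ticks during thinking, then the answer."""
    parts = []
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=12000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for event in stream:
            chunk = render_stream_event(event)
            print(chunk, end="", flush=True)
            parts.append(chunk)
    return "".join(parts)
```

Separating the handler from the network loop also makes the display logic easy to unit test without an API key.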

Should I include the thinking block in conversation history?

No, not for ordinary turns: the API strips thinking blocks from prior assistant turns, so resending them has no effect. The one exception is tool use, where the assistant turn that requested a tool must be passed back intact, thinking block included, alongside the tool results. If you need to reference Claude's reasoning in follow-up turns, extract the relevant parts from the thinking block and include them as regular text content in your messages.
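A small helper can lift that reasoning into plain text for reuse in later turns; the truncation length is an arbitrary illustration choice.

```python
def thinking_summary(content_blocks, max_chars: int = 500) -> str:
    """Concatenate thinking text so it can be quoted as regular content.

    Assumes SDK-style blocks with .type and .thinking attributes;
    output is truncated to max_chars to keep follow-up prompts small.
    """
    parts = [b.thinking for b in content_blocks if b.type == "thinking"]
    return " ".join(parts)[:max_chars]
```

The returned string can then be embedded in a later user message, e.g. "Earlier you reasoned: ...", without resending the thinking block itself.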


#Anthropic #Claude #ExtendedThinking #ChainOfThought #Reasoning #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

