Building Reasoning Agents with GPT-5 and o3 in 2026: When to Reach for the Big Brain

TL;DR

Reasoning models — gpt-5-2025-04-14, o3-2025-04-16, o4-mini-2025-04-16 — are not "smarter GPT-4o" you swap in across the board. They are a different latency, cost, and prompting shape that wins decisively for multi-step planning, structured tool decomposition, and constraint satisfaction, and loses badly for low-latency conversational turns and high-throughput tool-call wrapping. The architecture that has held up across the last twelve months on our agent platform is a hybrid: a reasoning model as the planner that emits a structured plan once per task, and a fast model as the executor that runs the plan node-by-node. The savings are real (3.4× lower cost than reasoning-everywhere, 5.8× lower p95 latency), and so is the quality lift over fast-model-only (planning correctness up from 71% to 94% on our internal eval suite). This post is the architecture, the OpenAI SDK + LangGraph code, and the numbers.

Why "Just Use o3 Everywhere" Is the Wrong Default

The temptation when GPT-5 and o3 land is the same temptation every model upgrade brings: bolt the new model into the same scaffold and call it done. With reasoning models, that approach actively hurts you on three axes.

Latency. o3-2025-04-16 at reasoning_effort: "medium" averages 14–22 seconds to first token on planning-shaped inputs in our measurements. gpt-4o-2024-08-06 averages 0.4–0.9 seconds. For a voice agent that has to respond inside a 1.2-second conversational budget, swapping in o3 for the entire turn is not a tradeoff — it is an outage.

Cost. o3 input tokens are roughly 8× gpt-4o, and reasoning models also charge for hidden reasoning tokens (you do not see them, you do pay for them). A typical multi-tool turn that costs $0.004 on gpt-4o costs $0.06–$0.11 on o3 because of the reasoning trace.

Prompt sensitivity. This is the one most teams get wrong. Reasoning models are worse at following heavy chain-of-thought instructions than fast models because they already do that internally. Telling o3 "think step by step, first identify the goal, then list constraints, then..." typically degrades output quality. They want the goal stated, the constraints listed, and the ask — full stop.

So the question is not "should we upgrade." The question is: where in the agent loop is the marginal token of reasoning worth eight pennies?

The Hybrid Architecture That Wins

For non-trivial agents — anything beyond single-tool retrieval — the pattern is:

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A[User input] --> B[Classifier - gpt-4o-mini]
  B -->|simple| C[Fast executor - gpt-4o]
  B -->|multi-step or constrained| D[Planner - o3]
  D --> E[Structured plan JSON]
  E --> F[LangGraph plan loop]
  F --> G[Executor - gpt-4o per node]
  G --> H{Plan node failed?}
  H -->|yes, recoverable| G
  H -->|yes, replan needed| D
  H -->|no| I{More nodes?}
  I -->|yes| G
  I -->|no| J[Final response]
  C --> J
  style D fill:#ffd
  style G fill:#cfc
  style J fill:#cfe

Figure 1 — Hybrid planner/executor split. The reasoning model touches the request once; the fast model does the heavy lifting per node.

Three roles, three models:

Classifier (gpt-4o-mini-2024-07-18) — sub-100ms, decides whether the request actually needs the planner. Most don't.
Planner (o3-2025-04-16 at reasoning_effort: "high" or gpt-5-2025-04-14) — emits a structured plan (a DAG of tool calls and assertions) once per qualifying request.
Executor (gpt-4o-2024-08-06) — runs each plan node, calls tools, fills slots. Fast, cheap, and crucially never sees the long reasoning context.

The economic logic: the planner is the only place reasoning quality compounds. Get the plan right and the executor's job is trivial. Get the plan wrong and no executor model — reasoning or otherwise — recovers cleanly.

Real Numbers From Our Production Eval Set

We benchmarked four configurations against our 700-case internal eval (multi-step scheduling, healthcare intake triage, real-estate-lead qualification, IT helpdesk runbook execution) drawn from the verticals we serve on CallSphere's industry deployments:

Configuration	Planning correctness	Mean cost per session	p95 latency	Failed-task rate
All gpt-4o	71%	$0.011	2.1s	18%
All gpt-4.1	78%	$0.018	2.4s	13%
All o3 high-effort	96%	$0.084	24.7s	4%
All gpt-5 medium-effort	95%	$0.061	11.8s	4%
Hybrid: o3 planner + gpt-4o executor	94%	$0.025	4.2s	5%

The hybrid loses ~2 points of planning correctness vs. o3-everywhere, gains 5.8× on p95 latency, and saves 70% on cost. For our use cases that is the right point on the curve. If you're optimizing for absolute correctness on high-stakes one-off tasks (legal research, complex code refactors) you'd push toward o3-everywhere. If you're optimizing for cost-per-conversation in a voice agent, you push toward fast-model-only and accept the 18% failed-task rate as a fallback to a human.

Per-Model Profile

Model	Best for	Avoid for	`reasoning_effort`	$/1M in	$/1M out
`gpt-4o-2024-08-06`	Conversational turns, tool wrapping, summarization	Multi-constraint planning, math-heavy tasks	n/a	$2.50	$10.00
`gpt-4.1-2025-04-14`	Long-context refactor, document synthesis	Latency-critical voice	n/a	$2.00	$8.00
`gpt-5-2025-04-14`	Mixed planning + execution where you want one model	Sub-second response budgets	low/medium/high	$5.00	$40.00
`o3-2025-04-16`	Hardest planning + adversarial reasoning	Anything chatty	low/medium/high	$10.00	$40.00
`o4-mini-2025-04-16`	Cheap reasoning for medium-complexity planning	When you need o3-tier rigor	low/medium/high	$1.10	$4.40

A note on reasoning_effort: it's the dial that matters most and the one most teams ignore. We default to medium for planners. high is for adversarial cases (jailbreak resistance, multi-stakeholder constraint satisfaction). low is when you want reasoning-model behavior at a price/latency closer to gpt-4.1 — useful for the second-pass replanner in our loop.

Code: A Reasoning Planner Inside a LangGraph Node

Here's the production-ish wiring. The planner is a single LangGraph node that calls o3 with a tight prompt and returns a JSON plan. The executor is a downstream node that consumes plan steps.

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal
from langgraph.graph import StateGraph, END

client = OpenAI()

# ── 1. Plan schema ─────────────────────────────────────────────
class PlanStep(BaseModel):
    id: str
    tool: Literal["search_appointments", "verify_insurance",
                  "send_confirmation", "lookup_provider", "respond"]
    args: dict
    depends_on: list[str] = Field(default_factory=list)
    success_check: str  # natural-language assertion

class Plan(BaseModel):
    goal: str
    constraints: list[str]
    steps: list[PlanStep]

# ── 2. The planner node ────────────────────────────────────────
PLANNER_SYSTEM = """You produce a JSON Plan to fulfill a user request.

Output ONLY valid Plan JSON matching the schema. Do not narrate.
Decompose the request into the minimum number of tool calls.
Each step has a success_check the executor will verify."""

def planner_node(state: dict) -> dict:
    # NOTE: o3 prompts are SHORT. Do not add chain-of-thought scaffolding.
    user_msg = (
        f"User request: {state['user_input']}\n"
        f"Known context: {state.get('context', {})}\n"
        f"Available tools: search_appointments, verify_insurance, "
        f"send_confirmation, lookup_provider, respond"
    )
    resp = client.chat.completions.create(
        model="o3-2025-04-16",
        reasoning_effort="medium",
        messages=[
            {"role": "system", "content": PLANNER_SYSTEM},
            {"role": "user",   "content": user_msg},
        ],
        response_format={"type": "json_object"},
    )
    plan = Plan.model_validate_json(resp.choices[0].message.content)
    return {"plan": plan, "step_idx": 0, "results": {}}

# ── 3. The executor node (fast model, per-step) ────────────────
EXEC_SYSTEM = """You execute ONE plan step. Call exactly the named tool
with the given args. Then verify the success_check holds against the
tool result. Reply with JSON: {ok: bool, result: any, note: str}."""

def executor_node(state: dict) -> dict:
    step = state["plan"].steps[state["step_idx"]]
    deps = {d: state["results"][d] for d in step.depends_on}
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": EXEC_SYSTEM},
            {"role": "user", "content":
                f"Step: {step.model_dump_json()}\n"
                f"Dependency outputs: {deps}\n"
                f"Success check: {step.success_check}"},
        ],
        tools=TOOL_SCHEMAS,
        tool_choice="auto",
    )
    result = run_tool_calls(resp)  # your tool dispatcher
    state["results"][step.id] = result
    state["step_idx"] += 1
    return state

# ── 4. Branching: replan if a step fails non-recoverably ───────
def route(state: dict) -> str:
    last_id = state["plan"].steps[state["step_idx"] - 1].id
    last = state["results"][last_id]
    if not last.get("ok") and last.get("note") == "needs_replan":
        return "planner"
    if state["step_idx"] >= len(state["plan"].steps):
        return END
    return "executor"

graph = StateGraph(dict)
graph.add_node("planner",  planner_node)
graph.add_node("executor", executor_node)
graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_conditional_edges("executor", route,
                            {"executor": "executor",
                             "planner":  "planner",
                             END: END})
agent = graph.compile()

Three things worth highlighting:

The planner prompt is short. No "let's think step by step." That's the single most common prompt-engineering mistake teams make when they first move to reasoning models. The model is already thinking; your scaffolding is at best noise and at worst confuses it.
response_format={"type": "json_object"} plus Pydantic validation gives us a hard contract. If o3 returns malformed JSON we surface it as a planner error and can fall back to gpt-4.1 as a backup planner.
Replan is rare but critical. About 4% of our sessions trigger a replan after the executor finds a step's success_check failing. Without that branch, you ship a happy-path-only agent.

When to Skip the Planner Entirely

The classifier in front of the graph is doing real work. About 62% of our voice-agent sessions are simple enough that the fast executor with no plan handles them in one or two tool calls. Routing those through o3 would burn money for no quality gain. The classifier prompt is roughly:

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Classify the request as simple (single intent, ≤2 tool calls likely) or complex (multi-constraint, multi-step, or contains explicit conditional like "if X then Y"). Output one word.

We run that on gpt-4o-mini-2024-07-18 at temperature 0. It costs essentially nothing and saves the planner ~62% of its work. The classifier itself is wrong about 4% of the time, and we treat false-negatives (simple-classified-as-complex) as harmless and false-positives (complex-classified-as-simple) as logged for the regression dataset.

What Reasoning Models Are Genuinely Good At

The wins we've seen most clearly:

Multi-constraint scheduling. "Book me with a female cardiologist who takes Aetna, available Tuesday or Thursday afternoon, English-speaking, within 10 miles." Fast models routinely drop one constraint. o3 holds all of them.
Tool-decomposition with hidden dependencies. When tool A's output unlocks an argument for tool C only if tool B succeeded. Reasoning models build the DAG correctly; fast models tend to flatten it.
Adversarial inputs and jailbreak resistance. The planner pass on a reasoning model with reasoning_effort: "high" catches roughly 2× more prompt-injection attempts than gpt-4o in our red-team set.
Numerical and date arithmetic. "Three business days after the second Tuesday of next month, excluding US federal holidays." Don't trust gpt-4o here. Trust o3 or o4-mini.

What they're not good at: anything chatty. Don't put o3 on the user-facing turn. The output style is too dense, too clinical. Save reasoning for the planner; let gpt-4o do the talking. We use this exact pattern in our demo experience — the visible agent is gpt-4o; the planner you don't see is o3.

The "When in Doubt" Heuristic

A senior-engineer rule of thumb that's held up across teams I've advised: if the agent's failure mode is "it answered fluently but did the wrong thing," you have a planning problem and you want a reasoning model in the planner. If the failure mode is "it answered slowly or weirdly," you have an executor problem and reasoning models will make it worse. Diagnose the failure mode first, pick the model second.

Frequently Asked Questions

Should I use GPT-5 or o3 for the planner?

Both work. GPT-5 is faster and slightly cheaper at comparable effort settings; o3 is marginally more rigorous on adversarial inputs. We default to o3 for healthcare and IT-helpdesk planners (where rigor matters more than latency) and gpt-5 for sales and real-estate planners (where the planner sits closer to the user-facing turn).

Can I just use o4-mini and skip the hybrid?

You can, and for medium-complexity workloads it's a real option — o4-mini is roughly 1/9th the cost of o3 with about 80% of the planning quality. The hybrid still wins because the executor is the bottleneck on cost, and o4-mini-as-executor is still 3× more expensive than gpt-4o-as-executor with no quality gain on execution-shaped tasks.

How do I evaluate this loop?

Score the plan and the final answer separately. Plan-correctness against a reference plan (or a rubric judge) catches planner regressions; final-answer correctness catches executor regressions. We dig into the trace-level eval pattern in the companion piece on evaluating reasoning traces.

What about streaming?

Reasoning models do not stream the reasoning trace, only the final response. For voice agents this is fine — the planner runs offline-of-the-turn while a "let me check that for you" filler plays. Don't try to stream o3 inside a sub-second voice loop. It does not work.

How sensitive is the planner to prompt changes?

Less than gpt-4o, more than you'd hope. We pin the planner system prompt with a version tag and re-run the full regression suite on any change. About 1 in 6 planner-prompt changes regresses some evaluator.

Building Reasoning Agents with GPT-5 and o3 in 2026: When to Reach for the Big Brain

TL;DR

Why "Just Use o3 Everywhere" Is the Wrong Default

The Hybrid Architecture That Wins

Real Numbers From Our Production Eval Set

Per-Model Profile

Code: A Reasoning Planner Inside a LangGraph Node

When to Skip the Planner Entirely

What Reasoning Models Are Genuinely Good At

The "When in Doubt" Heuristic

Frequently Asked Questions

Should I use GPT-5 or o3 for the planner?

Can I just use o4-mini and skip the hybrid?

How do I evaluate this loop?

What about streaming?

How sensitive is the planner to prompt changes?

Try CallSphere AI Voice Agents

Related Articles You May Like

How to Build a Golden Dataset for Production AI Agents

Agent Tracing 101: Spans, Sessions, and the Hidden Failure Modes They Reveal

The Agent Evaluation Stack in 2026: From Trace to Eval Score

LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

OpenAI Computer-Use Agents (CUA) in Production: Build + Evaluate a Real Workflow (2026)

Online vs Offline Agent Evaluation: The Pre-Deploy / Post-Deploy Split