---
title: "Building Your First Agent with the OpenAI Agents SDK in 2026: A Hands-On Walkthrough"
description: "Step-by-step build of a working agent with the OpenAI Agents SDK — Agent class, tools, handoffs, tracing — plus an eval pipeline that catches regressions before merge."
canonical: https://callsphere.ai/blog/building-first-agent-openai-agents-sdk-2026
category: "Agentic AI"
tags: ["OpenAI Agents SDK", "AI Agents", "LangSmith", "Agent Evaluation", "Production AI", "AI Engineering", "Python"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.488Z
---

# Building Your First Agent with the OpenAI Agents SDK in 2026: A Hands-On Walkthrough

> Step-by-step build of a working agent with the OpenAI Agents SDK — Agent class, tools, handoffs, tracing — plus an eval pipeline that catches regressions before merge.

## TL;DR

The OpenAI Agents SDK (`openai-agents` on PyPI) is, as of mid-2026, the cleanest way to ship a single-file production agent on top of OpenAI models. You get an `Agent` class, decorator-based tools, typed handoffs between specialists, and a tracing dashboard out of the box — without the ceremony of the legacy Assistants API or the conceptual sprawl of a full graph framework. This post is a hands-on walkthrough: we build a working appointment-scheduling agent end to end, wire it into [LangSmith](https://docs.langchain.com/langsmith/observability) for richer tracing, and then bolt on an offline eval pipeline that runs as a merge gate. By the end you will have copy-pasteable Python that runs, a mental model for when to use the SDK vs. raw chat completions, and a defensible "did this change make the agent better" workflow. Setup time on a clean machine: about 90 minutes.

## What the Agents SDK Actually Is

If the last time you looked at OpenAI's agent story it was the Assistants API, throw out your priors. The Agents SDK is a thinner library with a different philosophy: the agent loop, tools, handoffs, and guardrails live in your Python process, and OpenAI's server only sees plain Chat Completions or Responses calls plus whatever tracing you opt into. There is no server-side "assistant resource" to create, no "thread" to manage, no opaque "run" object you poll.

Concretely, the SDK gives you four primitives:

1. **Agent** — a configured LLM with a name, instructions, tool list, and an optional output schema.
2. **Tools** — Python callables decorated with `@function_tool` whose signatures and docstrings are auto-converted to JSON schema.
3. **Handoffs** — typed transitions where one agent can hand the conversation to another specialist agent (with optional input filtering).
4. **Runner** — `Runner.run(agent, input)` (or its sync/streaming variants), which executes the loop, manages tool calls, and emits traces.

That is the entire surface area for the 90% case. Anything more elaborate — graphs, branches, persistent multi-day workflows — is where you reach for [LangGraph](https://langchain-ai.github.io/langgraph/) or [Temporal](https://temporal.io). For a focused, single-purpose agent, the SDK is the right default.

## Project Setup

A clean install looks like this:

```bash
pip install 'openai-agents>=0.9' 'openai>=1.50' langsmith pydantic
export OPENAI_API_KEY=sk-...
export LANGSMITH_API_KEY=lsv2_...
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=appointment-agent-dev
```

We pin model snapshots aggressively. Floating aliases like `gpt-4o` move under your feet — the date-stamped form (`gpt-4o-2024-08-06`, `gpt-4.1-2025-04-14`) is what you want in any agent that ships to users. Across a year of running these in production, the single biggest source of "the agent regressed and we cannot reproduce it" tickets has been someone using a floating alias.
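
With the install in place, a three-line smoke test confirms the key and package are wired up. The SDK picks a default model when you don't pin one — fine for a sanity check, but pin before shipping, per the note above:

```python
from agents import Agent, Runner

# Minimal sanity check: no tools, no handoffs, default model.
agent = Agent(name="Smoke Test", instructions="Reply in one short sentence.")
print(Runner.run_sync(agent, "Confirm you can hear me.").final_output)
```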

## A Working Agent in 60 Lines

Here is the appointment-scheduling agent. It has two tools — list available slots and book a slot — plus a handoff to a billing specialist if the user wants to discuss pricing.

```python
import asyncio
from datetime import datetime
from typing import Literal
from pydantic import BaseModel, Field
from agents import Agent, Runner, function_tool, handoff

# ─── Tools ────────────────────────────────────────────────────────────────

@function_tool
def list_slots(day: str) -> list[str]:
    """Return available 30-minute slots for a given ISO date (YYYY-MM-DD)."""
    base = datetime.fromisoformat(day)
    return [
        (base.replace(hour=h, minute=m)).isoformat()
        for h in (9, 10, 14, 15)
        for m in (0, 30)
    ]

class BookingResult(BaseModel):
    confirmation_id: str = Field(..., description="Opaque booking ID")
    when: str = Field(..., description="ISO datetime of the booked slot")
    status: Literal["confirmed", "waitlisted"]

@function_tool
def book_slot(slot_iso: str, patient_name: str) -> BookingResult:
    """Book a 30-minute slot for the given patient. Idempotent on (slot_iso, patient_name)."""
    # In production this calls the scheduling backend. For the demo we mint a fake ID.
    return BookingResult(
        confirmation_id=f"CS-{abs(hash((slot_iso, patient_name))) % 10**8:08d}",
        when=slot_iso,
        status="confirmed",
    )

# ─── Specialist agent (handoff target) ────────────────────────────────────

billing_agent = Agent(
    name="Billing Specialist",
    model="gpt-4o-2024-08-06",
    instructions=(
        "You answer questions about pricing, insurance, and copays. "
        "Never quote a number you are not certain about — defer to a human if unsure."
    ),
)

# ─── Primary agent ────────────────────────────────────────────────────────

scheduler = Agent(
    name="Appointment Scheduler",
    model="gpt-4.1-2025-04-14",
    instructions=(
        "You are a calm, concise scheduling agent for a healthcare clinic. "
        "Always confirm the patient's full name and preferred date before calling list_slots. "
        "After booking, repeat the confirmation_id and time back to the patient."
    ),
    tools=[list_slots, book_slot],
    handoffs=[handoff(billing_agent)],
)

async def main():
    result = await Runner.run(
        scheduler,
        input="Hi, this is Maria Chen. Can I book something for May 12th?",
    )
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())
```

A few notes on what this code is doing — and why each choice matters:

- **Two different models.** We use `gpt-4.1-2025-04-14` for the primary scheduler (better at multi-turn dialog and tool routing) and `gpt-4o-2024-08-06` for the billing specialist (cheaper, fast, good enough for FAQ-style answers). Mixing models per agent is the SDK's biggest practical lever for cost control.
- **Pydantic return types on tools.** `book_slot` returns a `BookingResult`, not a dict. The SDK serializes this into the tool result the model sees, and the typed contract makes downstream evaluators trivial — you can assert on `confirmation_id.startswith("CS-")` without parsing free text.
- **Handoffs are typed transitions, not glorified prompts.** When the scheduler decides the user is asking about pricing, it emits a handoff event. The runner switches the active agent and replays the conversation against `billing_agent`'s instructions. The tracing dashboard renders this as a colored boundary in the run tree.

## The Agent Loop, Visually

```mermaid
flowchart TD
  A[User input] --> B["Runner.run(scheduler, input)"]
  B --> C[LLM call with tools + handoff schema]
  C --> D{Model output}
  D -->|tool_call| E["Execute @function_tool"]
  E --> F[Append tool result to messages]
  F --> C
  D -->|handoff| G[Switch active agent]
  G --> C
  D -->|final message| H[Return RunResult]
  H --> I[Trace flushed to OpenAI + LangSmith]
  style H fill:#cfc
  style I fill:#ffd
```

*Figure 1 — The loop is dead simple: call model, dispatch tool or handoff, repeat until a final message. Everything you might want to inspect (tool args, latency, token counts, agent boundaries) lands in the trace.*

The thing to internalize is that there is no magic. The runner is a `while True` over chat completions with two extra branches (tool call, handoff) before returning. If you understand that loop, you understand the SDK.
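
To make that concrete, here is a hand-rolled version of the same loop against raw chat completions — a conceptual sketch, not the SDK's actual internals, and it omits the handoff branch, retries, and tracing:

```python
import json
from openai import OpenAI

client = OpenAI()

def run_loop(model: str, messages: list, tools: dict, tool_schemas: list) -> str:
    """Minimal agent loop: call the model, execute any tool calls,
    append the results, and repeat until the model returns plain text."""
    while True:
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tool_schemas
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final message — exit the loop
        messages.append(msg)  # keep the assistant turn that requested the calls
        for call in msg.tool_calls:
            fn = tools[call.function.name]            # look up the Python callable
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(fn(**args), default=str),
            })
```

The SDK's runner adds the handoff branch (swap the active agent's instructions and tools, keep looping) and wraps every step in tracing — that is essentially the whole library.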

## Tracing: What You Get for Free

Just by importing `agents`, every `Runner.run` call emits a trace to OpenAI's [traces dashboard](https://platform.openai.com/traces) with the agent name, tool calls, handoffs, latencies, and token usage. The granularity is exactly what you want for production debugging — you can click into any node and see the prompt, the response, the tool args, and the cost.

For richer slicing (custom tags, datasets, online evals, side-by-side experiment views) we layer LangSmith on top. LangSmith ships an OpenAI client wrapper, and the SDK lets you swap in a default client, so the wiring is a few lines:

```python
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from agents import set_default_openai_client

set_default_openai_client(wrap_openai(AsyncOpenAI()))
```

After this, every LLM call inside any `Runner.run` shows up in your LangSmith project with the same tree structure. The two dashboards complement each other: OpenAI's is great for raw debugging, LangSmith's is where the dataset → experiment → eval loop lives.

## SDK vs. Raw Chat Completions

A reasonable question: do I actually need the SDK, or can I write the loop myself?

| Concern | Raw chat completions | OpenAI Agents SDK |
| --- | --- | --- |
| Lines of code for a 2-tool agent | ~120 | ~40 |
| Tool-schema generation | Hand-written JSON | From Python signature |
| Handoffs | DIY state machine | First-class, typed |
| Tracing | Roll your own | OpenAI dashboard + OTel |
| Streaming | Manual delta accumulation | `Runner.run_streamed` |
| Multi-agent orchestration | DIY | Built in |
| Lock-in | None | Soft (Pydantic + decorator) |
| Best for | Bespoke, single-call workflows | Anything with >1 tool or any handoff |

If you have one model call and one tool, raw completions are fine. The moment you have two tools or any concept of a "specialist," the SDK pays for itself within a day. We migrated our internal scheduling agent from raw completions to the SDK in March 2026 and deleted ~600 lines of glue code in the process.

## Bolting On an Eval Pipeline

A working agent is not a shippable agent. Before this thing goes near customers we need a way to tell whether changes make it better or worse. The pattern: build a small dataset of representative interactions, write evaluators, and run `langsmith.evaluate()` as a merge gate.

```python
from agents.items import ToolCallItem
from langsmith import Client, evaluate

client = Client()

# Seed a small dataset (do this once)
def seed_dataset():
    ds = client.create_dataset(
        "scheduler-smoke",
        description="Golden cases for the appointment scheduler",
    )
    examples = [
        {
            "inputs": {"input": "Hi, I'm Maria Chen. I need an appointment May 12th, morning if possible."},
            "outputs": {"must_call": "list_slots", "must_contain": "9:00"},
        },
        {
            "inputs": {"input": "What does this cost? Do you take Aetna?"},
            "outputs": {"must_handoff_to": "Billing Specialist"},
        },
        {
            "inputs": {"input": "Cancel my appointment please."},
            "outputs": {"must_contain": "I cannot cancel"},  # out of scope
        },
    ]
    for ex in examples:
        client.create_example(
            dataset_id=ds.id, inputs=ex["inputs"], outputs=ex["outputs"]
        )

# Wrap the agent for the evaluator. `evaluate` expects a sync callable,
# so we use Runner.run_sync (pair the async variant with `aevaluate`).
def predict(inputs: dict) -> dict:
    result = Runner.run_sync(scheduler, input=inputs["input"])
    return {
        "output": result.final_output,
        "tools_called": [
            item.raw_item.name
            for item in result.new_items
            if isinstance(item, ToolCallItem)
        ],
        "final_agent": result.last_agent.name,
    }

# Evaluators (kept simple — production uses LLM-as-judge for quality)
def tool_called(run, example) -> dict:
    expected = example.outputs.get("must_call")
    if not expected:
        return {"key": "tool_called", "score": 1}
    score = int(expected in run.outputs["tools_called"])
    return {"key": "tool_called", "score": score}

def handoff_correct(run, example) -> dict:
    expected = example.outputs.get("must_handoff_to")
    if not expected:
        return {"key": "handoff_correct", "score": 1}
    score = int(run.outputs["final_agent"] == expected)
    return {"key": "handoff_correct", "score": score}

# Run as a merge gate
results = evaluate(
    predict,
    data="scheduler-smoke",
    evaluators=[tool_called, handoff_correct],
    experiment_prefix="scheduler-pr-1234",
    max_concurrency=4,
)
```

Three things to call out. First, evaluators do not have to be LLM-as-judge — structural checks like "did the model call the right tool" catch the majority of regressions for free. Second, `experiment_prefix` lets you tag a run with the PR number so reviewers can diff against `main`'s baseline experiment. Third, `max_concurrency` is the knob you tune to your OpenAI rate limits; 4 is safe for the default tier.

In CI we wrap this in a script that pulls the latest main-tagged experiment, computes per-evaluator deltas, and exits non-zero if any score regresses by more than 2 points. That gate is what stops "I tweaked the prompt" PRs from quietly degrading quality.
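
A sketch of that gate, with two assumptions flagged: the baseline lives in a checked-in `baseline.json` of per-evaluator mean scores (the production script fetches it from the main-tagged LangSmith experiment instead), and the 2-point threshold is expressed as 0.02 on the 0–1 score scale:

```python
import json
import sys
from collections import defaultdict

def gate(results, baseline_path: str = "baseline.json", max_regression: float = 0.02) -> None:
    """Exit non-zero if any evaluator's mean score drops more than
    `max_regression` below the stored baseline."""
    scores: dict[str, list[float]] = defaultdict(list)
    for row in results:  # evaluate() results are iterable rows of run/example/evals
        for res in row["evaluation_results"]["results"]:
            scores[res.key].append(res.score or 0.0)
    means = {key: sum(vals) / len(vals) for key, vals in scores.items()}
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressed = [
        key for key, mean in means.items()
        if baseline.get(key, 0.0) - mean > max_regression
    ]
    if regressed:
        print(f"REGRESSION in {regressed}: {means} vs baseline {baseline}")
        sys.exit(1)
    print("Eval gate passed:", means)

gate(results)
```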

## What This Looks Like Inside CallSphere

The scheduler above is roughly the shape of one of the agents powering our [voice and chat platform](/products) — the same Pydantic tool contracts, the same handoff to a billing specialist, the same eval pipeline. The voice variant adds an STT/TTS layer on top, but the agent reasoning core is identical. That uniformity is why we can reuse evaluators across modalities and why the same regression dataset gates both the voice and chat builds. If you want to see the production version in action, the [interactive demo](/demo) runs the same code path against a sandbox dataset.

The SDK is not a complete platform — you still need session storage, retry logic, fallback models, and a real dataset curation process. But as the *core* of an agent, it is the smallest viable starting point that does not paint you into a corner. Start here, add the eval gate before you ship, and you will avoid 80% of the failure modes I have seen on agent teams in 2025–2026.

## FAQ

### Should I use `Runner.run` or `Runner.run_streamed`?

Streaming for any user-facing surface (voice, chat UI). Non-streaming for batch jobs and evaluators where you only need the final result. The streamed variant gives you token-level events you can pipe to a TTS engine; the non-streamed variant is simpler to test.
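
A minimal streaming sketch, reusing the `scheduler` agent from the walkthrough — the raw-delta filter is the stream you would pipe into a TTS engine:

```python
from openai.types.responses import ResponseTextDeltaEvent
from agents import Runner

async def stream_reply(user_input: str) -> None:
    result = Runner.run_streamed(scheduler, input=user_input)
    async for event in result.stream_events():
        # Token-level deltas; the stream also carries higher-level
        # run-item events for tool calls and handoffs.
        if event.type == "raw_response_event" and isinstance(
            event.data, ResponseTextDeltaEvent
        ):
            print(event.data.delta, end="", flush=True)
```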

### Can I use non-OpenAI models?

Yes — pass a custom client via `set_default_openai_client` pointing at any OpenAI-compatible endpoint (Anthropic via a proxy, Azure OpenAI, vLLM, Ollama). Tool calling quality varies wildly across providers; benchmark on your actual eval set before swapping.
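
For example, pointing the SDK at a local Ollama server — the URL and model name below are illustrative; substitute your provider's values:

```python
from openai import AsyncOpenAI
from agents import Agent, set_default_openai_api, set_default_openai_client

# Any OpenAI-compatible endpoint works; Ollama's default port is shown here.
set_default_openai_client(
    AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
)
# Most compatible servers implement Chat Completions, not the Responses API.
set_default_openai_api("chat_completions")

local_scheduler = Agent(
    name="Local Scheduler",
    model="llama3.1",
    instructions="You are a calm, concise scheduling agent.",
)
```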

### How do I persist conversation state across requests?

The SDK ships a `Session` abstraction that can manage message history for you, but for anything beyond a single process, persist the message list yourself (Postgres or Redis) and rehydrate it on each turn. Treat the SDK as stateless and own state in your app — one turn of that cycle is sketched below.
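
A sketch of the rehydrate-run-persist cycle, with an in-memory dict standing in for your Postgres/Redis layer; `RunResult.to_input_list()` returns the full transcript, tool calls included, in input-item form:

```python
_store: dict[str, list] = {}  # stand-in for Postgres/Redis

async def handle_turn(session_id: str, user_message: str) -> str:
    history = _store.get(session_id, [])
    history.append({"role": "user", "content": user_message})
    result = await Runner.run(scheduler, input=history)  # rehydrate + run
    _store[session_id] = result.to_input_list()          # persist the transcript
    return result.final_output
```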

### Is this production-ready or a toy?

Production. We run agents built on this SDK across [healthcare scheduling, real estate qualification, and after-hours support](/industries) — combined ~280k sessions per month — with a 99.7% successful-completion rate gated by the same eval pipeline shown above.

