OpenAI Agents SDK vs Assistants API in 2026: Migration Guide with Eval Parity
Honest principal-engineer comparison of the OpenAI Agents SDK and the legacy Assistants API, with a migration checklist and eval-parity strategy so you don't ship regressions.
TL;DR
If you built on the OpenAI Assistants API in 2024, you are now staring at a quiet but unmistakable signal: the Assistants API is in maintenance mode and the OpenAI Agents SDK (openai-agents) is where new investment lives. There is no hard end-of-life date as of May 2026, but the documentation has shifted, file/vector store ergonomics are awkward compared to the new tooling, and every interesting OpenAI primitive shipped in the last six months — Responses-API tool calls, structured outputs, native handoffs, OTel tracing — landed in the SDK first. This post is the migration plan I wish I had had when we moved our scheduling and intake agents off Assistants in Q1: a side-by-side concept map, a working code diff, the gotchas that bit us, and an eval-parity strategy so cutover is gated on numerical proof, not vibes.
Why Migrate Now
Three concrete reasons, in order of how often they bite teams:
- The Assistants API hides its loop on a server you cannot inspect. Threads, runs, and run steps are server resources. You poll `runs.retrieve` until `status == "completed"`. When something goes wrong you get a status code and maybe a tool-call delta. The Agents SDK runs the loop in your process, which means a Python debugger and a stack trace work the way they always have.
- Tool call ergonomics in the Assistants API require submission round-trips. Your code has to: receive a `requires_action` status → execute tools → submit outputs → re-poll. The Agents SDK collapses that to one `await Runner.run(...)`.
- File/vector store handling is bifurcated. Assistants ships its own File Search and Code Interpreter tools, separate from the new Responses API file primitives. Maintaining a codebase that uses both is unpleasant.
There is also a fourth, less-quantifiable reason: every ecosystem library — LangSmith, OTel exporters, eval frameworks — is investing on the SDK side. The Assistants integration story is increasingly stale.
Concept Map
| Assistants API | OpenAI Agents SDK | Notes |
|---|---|---|
| Assistant resource (server-side) | `Agent(...)` Python object | No server resource; instantiate per-process |
| Thread | Session + your own message store | You own persistence |
| Run (polling `status`) | `Runner.run(agent, input)` (awaitable) | One call, returns final result |
| `runs.submit_tool_outputs` | `@function_tool` decorator | Tools execute in-process |
| File Search tool | Responses API + your own retrieval | Use a real vector DB |
| Code Interpreter | Computer-use / sandboxed exec | Different primitive entirely |
| Handoffs | `handoff(target_agent)` | First-class in SDK; ad-hoc on Assistants |
| Streaming via SSE on runs | `Runner.run_streamed` async iterator | Cleaner Pythonic API |
The mental shift is from "configure a server resource and poll it" to "instantiate Python objects and call them." Most of the migration work is plumbing — moving thread/message persistence into your own database, replacing tool submission flows, and rebuilding file retrieval against a vector store you control.
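Owning persistence sounds scarier than it is. Here is a minimal sketch of what "Session + your own message store" can mean in practice, using stdlib `sqlite3`; the table layout and function names are illustrative assumptions, not anything the SDK prescribes.

```python
import sqlite3

# Hypothetical minimal message store that replaces server-side Threads.
conn = sqlite3.connect("conversations.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS messages (
           session_id TEXT,
           role TEXT,
           content TEXT,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def append_message(session_id: str, role: str, content: str) -> None:
    """Persist one turn; this is what a Thread used to hold for you."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()

def load_history(session_id: str) -> list[dict]:
    """Rebuild the input list you pass back into the agent on the next turn."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY rowid",
        (session_id,),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```

Whether you use SQLite, Postgres, or a key-value store matters far less than the fact that the history now lives somewhere you can query, redact, and back up.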
Side-by-Side Code
Same logical agent, both SDKs. A scheduling assistant with one tool.
Assistants API (legacy):
```python
import time, json
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[{
        "type": "function",
        "function": {
            "name": "list_slots",
            "description": "List available slots for a day.",
            "parameters": {
                "type": "object",
                "properties": {"day": {"type": "string"}},
                "required": ["day"],
            },
        },
    }],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="May 12 morning?"
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=assistant.id
)

while True:
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    if run.status == "requires_action":
        outputs = []
        for tc in run.required_action.submit_tool_outputs.tool_calls:
            args = json.loads(tc.function.arguments)  # parsed tool arguments (unused in this stub)
            result = {"slots": ["09:00", "09:30", "10:00"]}  # tool body
            outputs.append({"tool_call_id": tc.id, "output": json.dumps(result)})
        client.beta.threads.runs.submit_tool_outputs(
            thread_id=thread.id, run_id=run.id, tool_outputs=outputs
        )
    elif run.status in ("completed", "failed", "cancelled", "expired"):
        break
    else:
        time.sleep(0.4)

msgs = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
print(msgs.data[0].content[0].text.value)
```
Agents SDK (new):
```python
import asyncio
from agents import Agent, Runner, function_tool

@function_tool
def list_slots(day: str) -> list[str]:
    """List available slots for a day."""
    return ["09:00", "09:30", "10:00"]

scheduler = Agent(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[list_slots],
)

async def main():
    result = await Runner.run(scheduler, input="May 12 morning?")
    print(result.final_output)

asyncio.run(main())
```
That is the same agent. Forty-five lines of polling and JSON serialization collapse to twelve. The migration is rarely just this clean — you have thread persistence, file search, and probably some custom retry logic — but the core shape is dramatically simpler, and the simpler code is also the code that produces a better trace.
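Streaming follows the same in-process pattern. A sketch of the SDK's streaming surface, based on the `Runner.run_streamed` row in the concept map; exact event types can differ across SDK versions, so treat the `raw_response_event` filter as an assumption to verify against your installed release.

```python
import asyncio
from agents import Agent, Runner
from openai.types.responses import ResponseTextDeltaEvent

scheduler = Agent(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
)

async def main():
    # run_streamed returns a streaming result; you iterate events instead of consuming SSE.
    result = Runner.run_streamed(scheduler, input="May 12 morning?")
    async for event in result.stream_events():
        # Keep only raw token deltas; other event types cover tool calls, handoffs, etc.
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            print(event.data.delta, end="", flush=True)

asyncio.run(main())
```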
Migration Flow With Eval-Parity Gates
The mistake teams make is "rewrite, deploy, hope." The right move is to run both stacks against the same eval dataset and gate cutover on score parity. Otherwise you will discover the regression in production from a customer email at 11pm on a Friday.
```mermaid
flowchart TD
    A["Inventory: assistants, threads, tools, files"] --> B["Build smoke dataset from prod traces"]
    B --> C["Score legacy stack on dataset (baseline)"]
    C --> D["Port tools: tool fns -> @function_tool"]
    D --> E["Port instructions + model snapshots"]
    E --> F["Replace thread persistence with own store"]
    F --> G["Score SDK stack on same dataset"]
    G --> H{"SDK scores >= legacy (within 1pt)?"}
    H -->|No| I["Investigate: tool args? prompt? model?"]
    I --> D
    H -->|Yes| J["Shadow: run BOTH on 5% of prod traffic"]
    J --> K{"Online evals match within tolerance?"}
    K -->|No| I
    K -->|Yes| L["Cutover: route 100% to SDK"]
    L --> M["Decommission Assistants resources"]
    style C fill:#ffd
    style G fill:#ffd
    style H fill:#fcc
    style L fill:#cfc
```
Figure 1 — Migration is gated on eval parity, not deadline pressure. The shadow phase is what catches the bugs the offline dataset misses.
The two checkpoints are non-negotiable in my experience:
- Offline parity (step H). Run the same 200–700 row dataset through both stacks. New stack must score within 1 point of legacy on every evaluator. If it does not, do not proceed — debug.
- Shadow parity (step K). For 24–72 hours, route a slice of real traffic through both and compare online eval scores. Subtle drift (e.g., the new stack handles edge cases differently because of a tool-arg coercion change) only shows up here. A minimal routing sketch follows this list.
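One way to wire the shadow slice, sketched under assumptions: `run_legacy` and `run_sdk` are your two stack entry points (hypothetical module and function names, matching the eval code below), and the 5% sample is chosen by hashing a stable conversation id so the same conversation always lands in the same bucket.

```python
import hashlib
import logging

from my_legacy_assistants import run_legacy  # hypothetical entry points,
from my_sdk_agent import run_sdk             # same names as the eval snippet below

SHADOW_PERCENT = 5  # slice of real traffic that also runs through the SDK stack

def in_shadow(conversation_id: str) -> bool:
    """Deterministic sampling: the same conversation always gets the same decision."""
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    return bucket < SHADOW_PERCENT

async def handle_turn(conversation_id: str, user_message: str) -> str:
    # The legacy stack stays authoritative; its answer is what the user sees.
    legacy_answer = await run_legacy(user_message)
    if in_shadow(conversation_id):
        # The SDK stack runs alongside (in production you would fire this off
        # concurrently); only its trace and online eval scores are kept.
        sdk_answer = await run_sdk(user_message)
        logging.info(
            "shadow_compare conversation=%s legacy=%r sdk=%r",
            conversation_id, legacy_answer, sdk_answer,
        )
    return legacy_answer
```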
Building the Parity Dataset
The dataset is the load-bearing artifact. We seed it three ways:
- Sample real production threads from the Assistants API. Pull 500 threads stratified by intent, redact PII, store the user messages as inputs and the final assistant response as a reference (a sketch follows this list).
- Add known-bug traces. Anything you fixed in the last quarter goes in — the migration must not silently re-introduce shipped regressions.
- Add adversarial cases. Out-of-scope requests, prompt injections, ambiguous inputs. These are the cases where prompt drift between stacks shows up first.
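A sketch of that first seeding step, assuming you already have the thread IDs in your own database (the Assistants API has no list-all-threads endpoint) and that `redact_pii` is a placeholder for whatever scrubbing you use; the dataset name matches the `migration-parity` dataset the eval code below reads from.

```python
from langsmith import Client
from openai import OpenAI

openai_client = OpenAI()
ls_client = Client()

# Create once; reuse the dataset id on subsequent seeding runs.
dataset = ls_client.create_dataset("migration-parity")

def redact_pii(text: str) -> str:
    """Placeholder — plug in your real scrubbing here."""
    return text

thread_ids_from_your_db = ["thread_abc123"]  # hypothetical: pulled from your own records

for thread_id in thread_ids_from_your_db:
    msgs = openai_client.beta.threads.messages.list(thread_id=thread_id, order="asc")
    user_turns = [m for m in msgs.data if m.role == "user"]
    assistant_turns = [m for m in msgs.data if m.role == "assistant"]
    if not user_turns or not assistant_turns:
        continue
    ls_client.create_example(
        dataset_id=dataset.id,
        inputs={"input": redact_pii(user_turns[0].content[0].text.value)},
        outputs={"reference": redact_pii(assistant_turns[-1].content[0].text.value)},
    )
```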
Then both stacks evaluate against this dataset:
```python
from langsmith import Client, evaluate

from my_legacy_assistants import run_legacy
from my_sdk_agent import run_sdk
from my_evaluators import factual_match, tool_call_correct, no_hallucination

client = Client()

# Wrap each stack as a predictor
async def predict_legacy(inputs):
    return {"output": await run_legacy(inputs["input"])}

async def predict_sdk(inputs):
    return {"output": await run_sdk(inputs["input"])}

baseline = evaluate(
    predict_legacy,
    data="migration-parity",
    evaluators=[factual_match, tool_call_correct, no_hallucination],
    experiment_prefix="legacy-baseline",
    metadata={"stack": "assistants_api", "model": "gpt-4o-2024-08-06"},
)
candidate = evaluate(
    predict_sdk,
    data="migration-parity",
    evaluators=[factual_match, tool_call_correct, no_hallucination],
    experiment_prefix="sdk-candidate",
    metadata={"stack": "agents_sdk", "model": "gpt-4o-2024-08-06"},
)

# Compare
import pandas as pd

b = baseline.to_pandas().mean(numeric_only=True)
c = candidate.to_pandas().mean(numeric_only=True)
delta = (c - b).round(3)
print(delta)
```
LangSmith's Experiments view renders this as a side-by-side table with row-level diffs — invaluable for finding the specific cases where the new stack disagrees with the old one. Spend time in this view; it is where the gnarly bugs hide.
Migration Checklist
| Item | Why it matters | Common gotcha |
|---|---|---|
| Pin model snapshots in both stacks | Ensures the comparison is apples-to-apples | Legacy default may differ from SDK default |
| Recreate tool JSON schemas exactly | Argument coercion can shift behavior | Required vs. optional fields drift |
| Port system instructions verbatim first | Establishes baseline before optimizing | Resist "while I'm here" prompt edits |
| Replace File Search with explicit retrieval | SDK does not have a hosted equivalent | Vector DB choice affects scores |
| Move thread state to your own store | SDK is stateless across runs | Cold-start latency on first turn |
| Wire OTel + LangSmith from day one | Migration without traces is debugging blind | Set LANGSMITH_TRACING=true early |
| Add a feature flag for stack selection | Enables shadow + instant rollback | Forgotten flags become tech debt |
| Decommission only after 2 weeks of clean prod | Avoid premature cleanup | "We will delete it tomorrow" never happens |
Gotchas That Cost Us Real Time
Five things that surprised us during our own migration. Save yourself the bruises:
- Tool argument coercion differs subtly. The Assistants API tolerated some loose JSON shapes that the SDK's Pydantic-validated tool path rejects. Three of our 700 dataset rows started failing because a tool that previously accepted `day: "2026-05-12T00:00:00Z"` now requires `"2026-05-12"`. The fix is one Pydantic validator (sketched after this list), but only if you run the dataset and notice.
- Default temperature differs. Assistants defaulted to `1.0`; we had been overriding to `0.3` server-side. The SDK respects the model's default in the Responses API, which can shift outputs. Set temperature explicitly.
- File Search has no drop-in. If you used Assistants File Search, you are rebuilding retrieval from scratch with a real vector DB (we use pgvector inside Postgres). Budget a week per agent for this; it is the single biggest lift.
- `max_completion_tokens` is enforced harder. The SDK will truncate cleanly at the limit and surface a finish reason; the Assistants API was more forgiving. We saw two regressions traced to outputs being cut off mid-tool-call. Raise the limit and add an evaluator that flags `finish_reason != "stop"`.
- Streaming event shape changed. If you had UI code consuming SSE deltas from Assistants runs, it does not work as-is against the SDK's async iterator. Plan a UI release alongside the backend cutover, not after.
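The validator fix from the first gotcha looks roughly like this — a sketch assuming the tool takes a single `day` argument and you want to accept both the old ISO-datetime shape and the bare date; the model and field names are ours, not anything the SDK mandates.

```python
from datetime import date, datetime

from pydantic import BaseModel, field_validator


class ListSlotsArgs(BaseModel):
    day: date

    @field_validator("day", mode="before")
    @classmethod
    def coerce_datetime_to_date(cls, value):
        """Accept the old '2026-05-12T00:00:00Z' shape as well as '2026-05-12'."""
        if isinstance(value, str) and "T" in value:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).date()
        return value
```

Whether you attach this model to the tool signature or validate inside the tool body is up to you; the point is that the coercion is explicit and covered by a dataset row.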
Eval Parity in CI
Once you are running both stacks, the eval-parity check belongs in CI, not just one engineer's notebook. We add a dedicated job to the agent repo that runs the dataset through both stacks on every PR that touches migration code, and fails the build if the SDK stack falls more than 1 point behind on any evaluator. The same gate pattern as our continuous-eval CI/CD setup — just doubled, with two predictors instead of one.
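A sketch of the gate itself, assuming both experiments have already run (as in the comparison snippet above) and that each job writes its per-evaluator means to a JSON file; the filenames and the 1-point threshold are our conventions, matching step H in the flowchart.

```python
import json
import sys

PARITY_THRESHOLD = 1.0  # max allowed drop per evaluator

def check_parity(baseline_scores: dict[str, float], candidate_scores: dict[str, float]) -> int:
    """Return a process exit code: 0 if the SDK stack is within threshold on every evaluator."""
    failures = []
    for evaluator, baseline in baseline_scores.items():
        candidate = candidate_scores.get(evaluator, float("-inf"))
        if baseline - candidate > PARITY_THRESHOLD:
            failures.append(f"{evaluator}: {baseline:.2f} -> {candidate:.2f}")
    if failures:
        print("Eval parity gate FAILED:\n  " + "\n  ".join(failures))
        return 1
    print("Eval parity gate passed.")
    return 0

if __name__ == "__main__":
    # Hypothetical glue: the eval job dumps per-evaluator means to these files.
    with open("legacy_scores.json") as f:
        legacy = json.load(f)
    with open("sdk_scores.json") as f:
        sdk = json.load(f)
    sys.exit(check_parity(legacy, sdk))
```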
The job gets deleted two weeks after the legacy code is gone. While it lives, it is the difference between a clean cutover and a six-month tail of "wait, why is this one customer hitting an old code path."
After Cutover
Two final practices once the SDK stack is the only stack:
- Keep the parity dataset. It becomes the seed of your ongoing regression suite. Every shipped bug gets added; the dataset is now a permanent asset.
- Audit dependencies. Anything that imported `openai.beta.assistants` is now dead code. Grep it out. Leftover Assistants resources on OpenAI's side keep accruing storage costs for files you forgot about.
The migration is, frankly, mostly mechanical once you have the dataset. The hard part — and the part most teams skip — is committing to a numerical parity gate before you cut over. Skip the gate and you will ship regressions. Run the gate and the cutover is a non-event, which is exactly what a good migration should be.
FAQ
How long does a typical migration take?
For a single agent with 2–3 tools and no File Search dependency, a senior engineer can do it in 3–5 days including the eval-parity work. Add a week per agent that uses File Search. Add another week if your thread persistence is currently entirely server-side and needs to be moved to your own store.
Can I keep using both stacks indefinitely?
Technically yes, but the cognitive cost of maintaining two mental models for "how the agent works" is high, and the SDK is where new features land. Plan to fully cut over within a quarter of starting the migration.
What about the legacy Threads — do I lose conversation history?
Only if you deleted them. Pull the thread messages via the API before decommissioning, store them in your own DB, and you have a clean migration of historical context. We did this for ~40k threads in a single batch job.
Does the SDK support all the tool types Assistants did?
Native function tools: yes, and the ergonomics are better. File Search and Code Interpreter: not as hosted tools — you build the equivalents yourself or use the Responses API's file primitives. For most teams the function-tool case is 90% of the surface area; the rest is project-specific work.
How do I know if my parity dataset is big enough?
A working heuristic: the dataset is big enough when adding 50 more rows does not move the aggregate scores by more than the regression threshold you intend to gate on. For our scheduling agent that point was around 320 rows. Below 100 rows the comparison is too noisy to trust as a cutover gate. Pair the offline dataset with the shadow-traffic step and you have two independent gates protecting the migration.
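One way to operationalize that heuristic, sketched with pandas over the per-row results of the candidate experiment; the evaluator column names are assumptions, so adapt them to whatever your experiment export actually contains.

```python
import pandas as pd

def scores_are_stable(results: pd.DataFrame, score_cols: list[str],
                      step: int = 50, threshold: float = 1.0) -> bool:
    """True when dropping the last `step` rows moves no aggregate score by more than `threshold`."""
    if len(results) <= step:
        return False  # too few rows to even measure stability
    full = results[score_cols].mean()
    trimmed = results[score_cols].iloc[:-step].mean()
    return bool(((full - trimmed).abs() <= threshold).all())

# Usage, reusing the `candidate` experiment from the parity snippet above
# (the three column names here are assumptions about your evaluator output):
# rows = candidate.to_pandas()
# scores_are_stable(rows, ["factual_match", "tool_call_correct", "no_hallucination"])
```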
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.