OpenAI Agents SDK vs Assistants API in 2026: Migration Guide with Eval Parity
Honest principal-engineer comparison of the OpenAI Agents SDK and the legacy Assistants API, with a migration checklist and eval-parity strategy so you don't ship regressions.
TL;DR
If you built on the OpenAI Assistants API in 2024, you are now staring at a quiet but unmistakable signal: the Assistants API is in maintenance mode and the OpenAI Agents SDK (openai-agents) is where new investment lives. There is no hard end-of-life date as of May 2026, but the documentation has shifted, file/vector store ergonomics are awkward compared to the new tooling, and every interesting OpenAI primitive shipped in the last six months — Responses-API tool calls, structured outputs, native handoffs, OTel tracing — landed in the SDK first. This post is the migration plan I wish I had had when we moved our scheduling and intake agents off Assistants in Q1: a side-by-side concept map, a working code diff, the gotchas that bit us, and an eval-parity strategy so cutover is gated on numerical proof, not vibes.
Why Migrate Now
Three concrete reasons, in order of how often they bite teams:
- The Assistants API hides its loop on a server you cannot inspect. Threads, runs, and run steps are server resources. You poll `runs.retrieve` until `status == "completed"`. When something goes wrong you get a status code and maybe a tool-call delta. The Agents SDK runs the loop in your process, which means a Python debugger and a stack trace work the way they always have.
- Tool call ergonomics in the Assistants API require submission round-trips. Your code has to: receive a `requires_action` status → execute tools → submit outputs → re-poll. The Agents SDK collapses that to one `await Runner.run(...)`.
- File/vector store handling is bifurcated. Assistants ships its own File Search and Code Interpreter tools, separate from the new Responses API file primitives. Maintaining a codebase that uses both is unpleasant.
There is also a fourth, less-quantifiable reason: every ecosystem library — LangSmith, OTel exporters, eval frameworks — is investing on the SDK side. The Assistants integration story is increasingly stale.
Concept Map
| Assistants API | OpenAI Agents SDK | Notes |
|---|---|---|
| Assistant resource (server-side) | `Agent(...)` Python object | No server resource; instantiate per-process |
| Thread | Session + your own message store | You own persistence |
| Run (polling `status`) | `Runner.run(agent, input)` (awaitable) | One call, returns final result |
| `runs.submit_tool_outputs` | `@function_tool` decorator | Tools execute in-process |
| File Search tool | Responses API + your own retrieval | Use a real vector DB |
| Code Interpreter | Computer-use / sandboxed exec | Different primitive entirely |
| Handoffs | `handoff(target_agent)` | First-class in SDK; ad-hoc on Assistants |
| Streaming via SSE on runs | `Runner.run_streamed` async iterator | Cleaner Pythonic API |
The mental shift is from "configure a server resource and poll it" to "instantiate Python objects and call them." Most of the migration work is plumbing — moving thread/message persistence into your own database, replacing tool submission flows, and rebuilding file retrieval against a vector store you control.
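Owning persistence sounds scarier than it is. Here is a minimal sketch of what "Session + your own message store" can mean in practice, using stdlib `sqlite3`; the table layout and function names are illustrative assumptions, not anything the SDK prescribes.

```python
import sqlite3

# Hypothetical minimal message store that replaces server-side Threads.
conn = sqlite3.connect("conversations.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS messages (
           session_id TEXT,
           role TEXT,
           content TEXT,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def append_message(session_id: str, role: str, content: str) -> None:
    """Persist one turn; this is what a Thread used to hold for you."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()

def load_history(session_id: str) -> list[dict]:
    """Rebuild the input list you pass back into the agent on the next turn."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY rowid",
        (session_id,),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```

Whether you use SQLite, Postgres, or a key-value store matters far less than the fact that the history now lives somewhere you can query, redact, and back up.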
Side-by-Side Code
Same logical agent, both SDKs. A scheduling assistant with one tool.
Assistants API (legacy):
```python
import time, json
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[{
        "type": "function",
        "function": {
            "name": "list_slots",
            "description": "List available slots for a day.",
            "parameters": {
                "type": "object",
                "properties": {"day": {"type": "string"}},
                "required": ["day"],
            },
        },
    }],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="May 12 morning?"
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=assistant.id
)

while True:
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    if run.status == "requires_action":
        outputs = []
        for tc in run.required_action.submit_tool_outputs.tool_calls:
            args = json.loads(tc.function.arguments)  # parsed tool arguments (unused in this stub)
            result = {"slots": ["09:00", "09:30", "10:00"]}  # tool body
            outputs.append({"tool_call_id": tc.id, "output": json.dumps(result)})
        client.beta.threads.runs.submit_tool_outputs(
            thread_id=thread.id, run_id=run.id, tool_outputs=outputs
        )
    elif run.status in ("completed", "failed", "cancelled", "expired"):
        break
    else:
        time.sleep(0.4)

msgs = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
print(msgs.data[0].content[0].text.value)
```
Agents SDK (new):
```python
import asyncio
from agents import Agent, Runner, function_tool

@function_tool
def list_slots(day: str) -> list[str]:
    """List available slots for a day."""
    return ["09:00", "09:30", "10:00"]

scheduler = Agent(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[list_slots],
)

async def main():
    result = await Runner.run(scheduler, input="May 12 morning?")
    print(result.final_output)

asyncio.run(main())
```
That is the same agent. Forty-five lines of polling and JSON serialization collapse to twelve. The migration is rarely just this clean — you have thread persistence, file search, and probably some custom retry logic — but the core shape is dramatically simpler, and the simpler code is also the code that produces a better trace.
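Streaming follows the same in-process pattern. A sketch of the SDK's streaming surface, based on the `Runner.run_streamed` row in the concept map; exact event types can differ across SDK versions, so treat the `raw_response_event` filter as an assumption to verify against your installed release.

```python
import asyncio
from agents import Agent, Runner
from openai.types.responses import ResponseTextDeltaEvent

scheduler = Agent(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
)

async def main():
    # run_streamed returns a streaming result; you iterate events instead of consuming SSE.
    result = Runner.run_streamed(scheduler, input="May 12 morning?")
    async for event in result.stream_events():
        # Keep only raw token deltas; other event types cover tool calls, handoffs, etc.
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            print(event.data.delta, end="", flush=True)

asyncio.run(main())
```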
Migration Flow With Eval-Parity Gates
The mistake teams make is "rewrite, deploy, hope." The right move is to run both stacks against the same eval dataset and gate cutover on score parity. Otherwise you will discover the regression in production from a customer email at 11pm on a Friday.
```mermaid
flowchart TD
    A["Inventory: assistants, threads, tools, files"] --> B["Build smoke dataset from prod traces"]
    B --> C["Score legacy stack on dataset (baseline)"]
    C --> D["Port tools: tool fns -> @function_tool"]
    D --> E["Port instructions + model snapshots"]
    E --> F["Replace thread persistence with own store"]
    F --> G["Score SDK stack on same dataset"]
    G --> H{"SDK scores >= legacy (within 1pt)?"}
    H -->|No| I["Investigate: tool args? prompt? model?"]
    I --> D
    H -->|Yes| J["Shadow: run BOTH on 5% of prod traffic"]
    J --> K{"Online evals match within tolerance?"}
    K -->|No| I
    K -->|Yes| L["Cutover: route 100% to SDK"]
    L --> M["Decommission Assistants resources"]
    style C fill:#ffd
    style G fill:#ffd
    style H fill:#fcc
    style L fill:#cfc
```
Figure 1 — Migration is gated on eval parity, not deadline pressure. The shadow phase is what catches the bugs the offline dataset misses.
The two checkpoints are non-negotiable in my experience:
- Offline parity (step H). Run the same 200–700 row dataset through both stacks. New stack must score within 1 point of legacy on every evaluator. If it does not, do not proceed — debug.
- Shadow parity (step K). For 24–72 hours, route a slice of real traffic through both and compare online eval scores. Subtle drift (e.g., the new stack handles edge cases differently because of a tool-arg coercion change) only shows up here. A minimal routing sketch follows this list.
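One way to wire the shadow slice, sketched under assumptions: `run_legacy` and `run_sdk` are your two stack entry points (hypothetical module and function names, matching the eval code below), and the 5% sample is chosen by hashing a stable conversation id so the same conversation always lands in the same bucket.

```python
import hashlib
import logging

from my_legacy_assistants import run_legacy  # hypothetical entry points,
from my_sdk_agent import run_sdk             # same names as the eval snippet below

SHADOW_PERCENT = 5  # slice of real traffic that also runs through the SDK stack

def in_shadow(conversation_id: str) -> bool:
    """Deterministic sampling: the same conversation always gets the same decision."""
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    return bucket < SHADOW_PERCENT

async def handle_turn(conversation_id: str, user_message: str) -> str:
    # The legacy stack stays authoritative; its answer is what the user sees.
    legacy_answer = await run_legacy(user_message)
    if in_shadow(conversation_id):
        # The SDK stack runs alongside (in production you would fire this off
        # concurrently); only its trace and online eval scores are kept.
        sdk_answer = await run_sdk(user_message)
        logging.info(
            "shadow_compare conversation=%s legacy=%r sdk=%r",
            conversation_id, legacy_answer, sdk_answer,
        )
    return legacy_answer
```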
Building the Parity Dataset
The dataset is the load-bearing artifact. We seed it three ways:
- Sample real production threads from the Assistants API. Pull 500 threads stratified by intent, redact PII, store the user messages as inputs and the final assistant response as a reference (a sketch follows this list).
- Add known-bug traces. Anything you fixed in the last quarter goes in — the migration must not silently re-introduce shipped regressions.
- Add adversarial cases. Out-of-scope requests, prompt injections, ambiguous inputs. These are the cases where prompt drift between stacks shows up first.
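A sketch of that first seeding step, assuming you already have the thread IDs in your own database (the Assistants API has no list-all-threads endpoint) and that `redact_pii` is a placeholder for whatever scrubbing you use; the dataset name matches the `migration-parity` dataset the eval code below reads from.

```python
from langsmith import Client
from openai import OpenAI

openai_client = OpenAI()
ls_client = Client()

# Create once; reuse the dataset id on subsequent seeding runs.
dataset = ls_client.create_dataset("migration-parity")

def redact_pii(text: str) -> str:
    """Placeholder — plug in your real scrubbing here."""
    return text

thread_ids_from_your_db = ["thread_abc123"]  # hypothetical: pulled from your own records

for thread_id in thread_ids_from_your_db:
    msgs = openai_client.beta.threads.messages.list(thread_id=thread_id, order="asc")
    user_turns = [m for m in msgs.data if m.role == "user"]
    assistant_turns = [m for m in msgs.data if m.role == "assistant"]
    if not user_turns or not assistant_turns:
        continue
    ls_client.create_example(
        dataset_id=dataset.id,
        inputs={"input": redact_pii(user_turns[0].content[0].text.value)},
        outputs={"reference": redact_pii(assistant_turns[-1].content[0].text.value)},
    )
```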
Then both stacks evaluate against this dataset:
```python
from langsmith import Client, evaluate

from my_legacy_assistants import run_legacy
from my_sdk_agent import run_sdk
from my_evaluators import factual_match, tool_call_correct, no_hallucination

client = Client()

# Wrap each stack as a predictor
async def predict_legacy(inputs):
    return {"output": await run_legacy(inputs["input"])}

async def predict_sdk(inputs):
    return {"output": await run_sdk(inputs["input"])}

baseline = evaluate(
    predict_legacy,
    data="migration-parity",
    evaluators=[factual_match, tool_call_correct, no_hallucination],
    experiment_prefix="legacy-baseline",
    metadata={"stack": "assistants_api", "model": "gpt-4o-2024-08-06"},
)
candidate = evaluate(
    predict_sdk,
    data="migration-parity",
    evaluators=[factual_match, tool_call_correct, no_hallucination],
    experiment_prefix="sdk-candidate",
    metadata={"stack": "agents_sdk", "model": "gpt-4o-2024-08-06"},
)

# Compare
import pandas as pd

b = baseline.to_pandas().mean(numeric_only=True)
c = candidate.to_pandas().mean(numeric_only=True)
delta = (c - b).round(3)
print(delta)
```
LangSmith's Experiments view renders this as a side-by-side table with row-level diffs — invaluable for finding the specific cases where the new stack disagrees with the old one. Spend time in this view; it is where the gnarly bugs hide.
Migration Checklist
| Item | Why it matters | Common gotcha |
|---|---|---|
| Pin model snapshots in both stacks | Ensures the comparison is apples-to-apples | Legacy default may differ from SDK default |
| Recreate tool JSON schemas exactly | Argument coercion can shift behavior | Required vs. optional fields drift |
| Port system instructions verbatim first | Establishes baseline before optimizing | Resist "while I'm here" prompt edits |
| Replace File Search with explicit retrieval | SDK does not have a hosted equivalent | Vector DB choice affects scores |
| Move thread state to your own store | SDK is stateless across runs | Cold-start latency on first turn |
| Wire OTel + LangSmith from day one | Migration without traces is debugging blind | Set LANGSMITH_TRACING=true early |
| Add a feature flag for stack selection | Enables shadow + instant rollback | Forgotten flags become tech debt |
| Decommission only after 2 weeks of clean prod | Avoid premature cleanup | "We will delete it tomorrow" never happens |
Gotchas That Cost Us Real Time
Five things that surprised us during our own migration. Save yourself the bruises:
- Tool argument coercion differs subtly. The Assistants API tolerated some loose JSON shapes that the SDK's Pydantic-validated tool path rejects. Three of our 700 dataset rows started failing because a tool that previously accepted `day: "2026-05-12T00:00:00Z"` now requires `"2026-05-12"`. The fix is one Pydantic validator (sketched after this list), but only if you run the dataset and notice.
- Default temperature differs. Assistants defaulted to `1.0`; we had been overriding to `0.3` server-side. The SDK respects the model's default in the Responses API, which can shift outputs. Set temperature explicitly.
- File Search has no drop-in. If you used Assistants File Search, you are rebuilding retrieval from scratch with a real vector DB (we use pgvector inside Postgres). Budget a week per agent for this; it is the single biggest lift.
- `max_completion_tokens` is enforced harder. The SDK will truncate cleanly at the limit and surface a finish reason; the Assistants API was more forgiving. We saw two regressions traced to outputs being cut off mid-tool-call. Raise the limit and add an evaluator that flags `finish_reason != "stop"`.
- Streaming event shape changed. If you had UI code consuming SSE deltas from Assistants runs, it does not work as-is against the SDK's async iterator. Plan a UI release alongside the backend cutover, not after.
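The validator fix from the first gotcha looks roughly like this — a sketch assuming the tool takes a single `day` argument and you want to accept both the old ISO-datetime shape and the bare date; the model and field names are ours, not anything the SDK mandates.

```python
from datetime import date, datetime

from pydantic import BaseModel, field_validator


class ListSlotsArgs(BaseModel):
    day: date

    @field_validator("day", mode="before")
    @classmethod
    def coerce_datetime_to_date(cls, value):
        """Accept the old '2026-05-12T00:00:00Z' shape as well as '2026-05-12'."""
        if isinstance(value, str) and "T" in value:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).date()
        return value
```

Whether you attach this model to the tool signature or validate inside the tool body is up to you; the point is that the coercion is explicit and covered by a dataset row.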
Eval Parity in CI
Once you are running both stacks, the eval-parity check belongs in CI, not just one engineer's notebook. We add a dedicated job to the agent repo that runs the dataset through both stacks on every PR that touches migration code, and fails the build if the SDK stack falls more than 1 point behind on any evaluator. The same gate pattern as our continuous-eval CI/CD setup — just doubled, with two predictors instead of one.
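A sketch of the gate itself, assuming both experiments have already run (as in the comparison snippet above) and that each job writes its per-evaluator means to a JSON file; the filenames and the 1-point threshold are our conventions, matching step H in the flowchart.

```python
import json
import sys

PARITY_THRESHOLD = 1.0  # max allowed drop per evaluator

def check_parity(baseline_scores: dict[str, float], candidate_scores: dict[str, float]) -> int:
    """Return a process exit code: 0 if the SDK stack is within threshold on every evaluator."""
    failures = []
    for evaluator, baseline in baseline_scores.items():
        candidate = candidate_scores.get(evaluator, float("-inf"))
        if baseline - candidate > PARITY_THRESHOLD:
            failures.append(f"{evaluator}: {baseline:.2f} -> {candidate:.2f}")
    if failures:
        print("Eval parity gate FAILED:\n  " + "\n  ".join(failures))
        return 1
    print("Eval parity gate passed.")
    return 0

if __name__ == "__main__":
    # Hypothetical glue: the eval job dumps per-evaluator means to these files.
    with open("legacy_scores.json") as f:
        legacy = json.load(f)
    with open("sdk_scores.json") as f:
        sdk = json.load(f)
    sys.exit(check_parity(legacy, sdk))
```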
The job gets deleted two weeks after the legacy code is gone. While it lives, it is the difference between a clean cutover and a six-month tail of "wait, why is this one customer hitting an old code path."
After Cutover
Two final practices once the SDK stack is the only stack:
- Keep the parity dataset. It becomes the seed of your ongoing regression suite. Every shipped bug gets added; the dataset is now a permanent asset.
- Audit dependencies. Anything that imported `openai.beta.assistants` is now dead code. Grep it out. Leftover Assistants resources on OpenAI's side keep accruing storage costs for files you forgot about.
The migration is, frankly, mostly mechanical once you have the dataset. The hard part — and the part most teams skip — is committing to a numerical parity gate before you cut over. Skip the gate and you will ship regressions. Run the gate and the cutover is a non-event, which is exactly what a good migration should be.
FAQ
How long does a typical migration take?
For a single agent with 2–3 tools and no File Search dependency, a senior engineer can do it in 3–5 days including the eval-parity work. Add a week per agent that uses File Search. Add another week if your thread persistence is currently entirely server-side and needs to be moved to your own store.
Can I keep using both stacks indefinitely?
Technically yes, but the cognitive cost of maintaining two mental models for "how the agent works" is high, and the SDK is where new features land. Plan to fully cut over within a quarter of starting the migration.
What about the legacy Threads — do I lose conversation history?
Only if you deleted them. Pull the thread messages via the API before decommissioning, store them in your own DB, and you have a clean migration of historical context. We did this for ~40k threads in a single batch job.
Does the SDK support all the tool types Assistants did?
Native function tools: yes, and the ergonomics are better. File Search and Code Interpreter: not as hosted tools — you build the equivalents yourself or use the Responses API's file primitives. For most teams the function-tool case is 90% of the surface area; the rest is project-specific work.
How do I know if my parity dataset is big enough?
A working heuristic: the dataset is big enough when adding 50 more rows does not move the aggregate scores by more than the regression threshold you intend to gate on. For our scheduling agent that point was around 320 rows. Below 100 rows the comparison is too noisy to trust as a cutover gate. Pair the offline dataset with the shadow-traffic step and you have two independent gates protecting the migration.
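One way to operationalize that heuristic, sketched with pandas over the per-row results of the candidate experiment; the evaluator column names are assumptions, so adapt them to whatever your experiment export actually contains.

```python
import pandas as pd

def scores_are_stable(results: pd.DataFrame, score_cols: list[str],
                      step: int = 50, threshold: float = 1.0) -> bool:
    """True when dropping the last `step` rows moves no aggregate score by more than `threshold`."""
    if len(results) <= step:
        return False  # too few rows to even measure stability
    full = results[score_cols].mean()
    trimmed = results[score_cols].iloc[:-step].mean()
    return bool(((full - trimmed).abs() <= threshold).all())

# Usage, reusing the `candidate` experiment from the parity snippet above
# (the three column names here are assumptions about your evaluator output):
# rows = candidate.to_pandas()
# scores_are_stable(rows, ["factual_match", "tool_call_correct", "no_hallucination"])
```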
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.