By Sagar Shankaran, Founder of CallSphere
Honest principal-engineer comparison of the OpenAI Agents SDK and the legacy Assistants API, with a migration checklist and eval-parity strategy so you don't ship regressions.
Key takeaways
If you built on the OpenAI Assistants API in 2024, you are now staring at a quiet but unmistakable signal: the Assistants API is in maintenance mode and the OpenAI Agents SDK (openai-agents) is where new investment lives. There is no hard end-of-life date as of May 2026, but the documentation has shifted, file/vector store ergonomics are awkward compared to the new tooling, and every interesting OpenAI primitive shipped in the last six months — Responses-API tool calls, structured outputs, native handoffs, OTel tracing — landed in the SDK first. This post is the migration plan I wish I had had when we moved our scheduling and intake agents off Assistants in Q1: a side-by-side concept map, a working code diff, the gotchas that bit us, and an eval-parity strategy so cutover is gated on numerical proof, not vibes.
Three concrete reasons, in order of how often they bite teams:
runs.retrieve until status == "completed". When something goes wrong you get a status code and maybe a tool-call delta. The Agents SDK runs the loop in your process, which means a Python debugger and a stack trace work the way they always have.
requires_action status → execute tools → submit outputs → re-poll. The Agents SDK collapses that to one await Runner.run(...).
There is also a fourth, less-quantifiable reason: every ecosystem library (LangSmith, OTel exporters, eval frameworks) is investing on the SDK side. The Assistants integration story is increasingly stale.
| Assistants API | OpenAI Agents SDK | Notes |
|---|---|---|
| Assistant resource (server-side) | Agent(...) Python object | No server resource; instantiate per-process |
| Thread | Session + your own message store | You own persistence |
| Run (polling status) | Runner.run(agent, input) (awaitable) | One call, returns final result |
| runs.submit_tool_outputs | @function_tool decorator | Tools execute in-process |
| File Search tool | Responses API + your own retrieval | Use a real vector DB |
| Code Interpreter | Computer-use / sandboxed exec | Different primitive entirely |
| Handoffs | handoff(target_agent) | First-class in SDK; ad-hoc on Assistants |
| Streaming via SSE on runs | Runner.run_streamed async iterator | Cleaner Pythonic API |
The mental shift is from "configure a server resource and poll it" to "instantiate Python objects and call them." Most of the migration work is plumbing — moving thread/message persistence into your own database, replacing tool submission flows, and rebuilding file retrieval against a vector store you control.
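As a rough illustration of what owning thread and message persistence can look like, here is a minimal sketch built on SQLite from the standard library. The `MessageStore` class, the table layout, and the `conversation_id` column are assumptions made for this example; they are not part of either OpenAI SDK, and a production store would more likely live in whatever database your application already uses.

```python
import sqlite3


class MessageStore:
    """Hypothetical stand-in for server-side Threads: one row per message."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " conversation_id TEXT NOT NULL,"
            " role TEXT NOT NULL,"
            " content TEXT NOT NULL)"
        )

    def append(self, conversation_id: str, role: str, content: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT INTO messages (conversation_id, role, content)"
                " VALUES (?, ?, ?)",
                (conversation_id, role, content),
            )

    def history(self, conversation_id: str) -> list[dict[str, str]]:
        # Insertion order (the autoincrement id) doubles as turn order.
        rows = self.conn.execute(
            "SELECT role, content FROM messages"
            " WHERE conversation_id = ? ORDER BY id",
            (conversation_id,),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]
```

The idea is to call `append` after each user and assistant turn and to load `history` before the next run. Whether the role/content dictionaries can be passed to the runner unchanged depends on the SDK version you are on, so treat that part as something to verify.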
Same logical agent, both SDKs. A scheduling assistant with one tool.
Assistants API (legacy):
```python
import json
import time

from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[{
        "type": "function",
        "function": {
            "name": "list_slots",
            "description": "List available slots for a day.",
            "parameters": {
                "type": "object",
                "properties": {"day": {"type": "string"}},
                "required": ["day"],
            },
        },
    }],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="May 12 morning?"
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=assistant.id
)

# Poll until the run reaches a terminal state, servicing tool calls as they appear.
while True:
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    if run.status == "requires_action":
        outputs = []
        for tc in run.required_action.submit_tool_outputs.tool_calls:
            args = json.loads(tc.function.arguments)
            result = {"slots": ["09:00", "09:30", "10:00"]}  # tool body
            outputs.append({"tool_call_id": tc.id, "output": json.dumps(result)})
        client.beta.threads.runs.submit_tool_outputs(
            thread_id=thread.id, run_id=run.id, tool_outputs=outputs
        )
    elif run.status in ("completed", "failed", "cancelled", "expired"):
        break
    else:
        time.sleep(0.4)

# The newest message on the thread is the assistant's reply.
msgs = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
print(msgs.data[0].content[0].text.value)
```
Agents SDK (new):
```python
import asyncio

from agents import Agent, Runner, function_tool


@function_tool
def list_slots(day: str) -> list[str]:
    """List available slots for a day."""
    return ["09:00", "09:30", "10:00"]


scheduler = Agent(
    name="Scheduler",
    model="gpt-4o-2024-08-06",
    instructions="You are a scheduling assistant.",
    tools=[list_slots],
)


async def main():
    result = await Runner.run(scheduler, input="May 12 morning?")
    print(result.final_output)


asyncio.run(main())
```
That is the same agent. Forty-five lines of polling and JSON serialization collapse to twelve. The migration is rarely just this clean — you have thread persistence, file search, and probably some custom retry logic — but the core shape is dramatically simpler, and the simpler code is also the code that produces a better trace.
The mistake teams make is "rewrite, deploy, hope." The right move is to run both stacks against the same eval dataset and gate cutover on score parity. Otherwise you will discover the regression in production from a customer email at 11pm on a Friday.
```mermaid
flowchart TD
    A["Inventory: assistants, threads, tools, files"] --> B["Build smoke dataset from prod traces"]
    B --> C["Score legacy stack on dataset (baseline)"]
    C --> D["Port tools: tool fns -> @function_tool"]
    D --> E["Port instructions + model snapshots"]
    E --> F["Replace thread persistence with own store"]
    F --> G["Score SDK stack on same dataset"]
    G --> H{"SDK scores >= legacy (within 1pt)?"}
    H -->|No| I["Investigate: tool args? prompt? model?"]
    I --> D
    H -->|Yes| J["Shadow: run BOTH on 5% of prod traffic"]
    J --> K{"Online evals match within tolerance?"}
    K -->|No| I
    K -->|Yes| L["Cutover: route 100% to SDK"]
    L --> M["Decommission Assistants resources"]
    style C fill:#ffd
    style G fill:#ffd
    style H fill:#fcc
    style L fill:#cfc
```
Figure 1 — Migration is gated on eval parity, not deadline pressure. The shadow phase is what catches the bugs the offline dataset misses.
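The shadow step in the figure needs some way to pick a stable 5% of traffic. One possible approach, sketched here under assumptions, is to hash a conversation identifier into a bucket. The function name `in_shadow_cohort` and the use of a conversation ID as the key are illustrative choices, not something taken from either SDK.

```python
import hashlib


def in_shadow_cohort(conversation_id: str, percent: float = 5.0) -> bool:
    """Deterministically decide whether a conversation is shadowed.

    Hashing the ID (instead of calling random()) keeps every turn of the
    same conversation in the same cohort across processes and restarts.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 5.0 percent -> buckets 0..499
```

For conversations in the cohort, the legacy stack would still serve the user while the SDK stack runs on the same input in the background and both outputs are logged for online evaluation.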
The two checkpoints are non-negotiable in my experience:
The dataset is the load-bearing artifact. We seed it three ways:
Then both stacks evaluate against this dataset:
```python
import asyncio

from langsmith import Client
from langsmith.evaluation import aevaluate

from my_legacy_assistants import run_legacy
from my_sdk_agent import run_sdk
from my_evaluators import factual_match, tool_call_correct, no_hallucination

client = Client()
EVALUATORS = [factual_match, tool_call_correct, no_hallucination]


# Wrap each stack as a predictor
async def predict_legacy(inputs):
    return {"output": await run_legacy(inputs["input"])}


async def predict_sdk(inputs):
    return {"output": await run_sdk(inputs["input"])}


async def main():
    # The predictors are coroutines, so use aevaluate rather than evaluate.
    baseline = await aevaluate(
        predict_legacy,
        data="migration-parity",
        evaluators=EVALUATORS,
        experiment_prefix="legacy-baseline",
        metadata={"stack": "assistants_api", "model": "gpt-4o-2024-08-06"},
    )
    candidate = await aevaluate(
        predict_sdk,
        data="migration-parity",
        evaluators=EVALUATORS,
        experiment_prefix="sdk-candidate",
        metadata={"stack": "agents_sdk", "model": "gpt-4o-2024-08-06"},
    )

    # Compare mean scores per evaluator (to_pandas() requires pandas installed)
    b = baseline.to_pandas().mean(numeric_only=True)
    c = candidate.to_pandas().mean(numeric_only=True)
    delta = (c - b).round(3)
    print(delta)


asyncio.run(main())
```
LangSmith's Experiments view renders this as a side-by-side table with row-level diffs — invaluable for finding the specific cases where the new stack disagrees with the old one. Spend time in this view; it is where the gnarly bugs hide.
| Item | Why it matters | Common gotcha |
|---|---|---|
| Pin model snapshots in both stacks | Ensures the comparison is apples-to-apples | Legacy default may differ from SDK default |
| Recreate tool JSON schemas exactly | Argument coercion can shift behavior | Required vs. optional fields drift |
| Port system instructions verbatim first | Establishes baseline before optimizing | Resist "while I'm here" prompt edits |
| Replace File Search with explicit retrieval | SDK does not have a hosted equivalent | Vector DB choice affects scores |
| Move thread state to your own store | SDK is stateless across runs | Cold-start latency on first turn |
| Wire OTel + LangSmith from day one | Migration without traces is debugging blind | Set LANGSMITH_TRACING=true early |
| Add a feature flag for stack selection | Enables shadow + instant rollback | Forgotten flags become tech debt |
| Decommission only after 2 weeks of clean prod | Avoid premature cleanup | "We will delete it tomorrow" never happens |
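For the "recreate tool JSON schemas exactly" row, a small drift check can make the comparison mechanical. The sketch below takes two plain dictionaries shaped like the `parameters` object in the legacy tool definition; the function name `schema_drift` is hypothetical, and how you obtain the schema the new stack generates depends on your SDK version, so that step is left out.

```python
def schema_drift(legacy: dict, candidate: dict) -> list[str]:
    """List differences between two JSON-schema 'parameters' objects."""
    problems: list[str] = []
    legacy_props = legacy.get("properties", {})
    cand_props = candidate.get("properties", {})

    # Compare the property sets and the declared type of shared properties.
    for name in sorted(set(legacy_props) | set(cand_props)):
        if name not in cand_props:
            problems.append(f"missing property: {name}")
        elif name not in legacy_props:
            problems.append(f"extra property: {name}")
        else:
            old_type = legacy_props[name].get("type")
            new_type = cand_props[name].get("type")
            if old_type != new_type:
                problems.append(f"type changed for {name}: {old_type} -> {new_type}")

    # Required-vs-optional drift is the gotcha called out in the checklist.
    legacy_req = set(legacy.get("required", []))
    cand_req = set(candidate.get("required", []))
    for name in sorted(legacy_req - cand_req):
        problems.append(f"no longer required: {name}")
    for name in sorted(cand_req - legacy_req):
        problems.append(f"newly required: {name}")
    return problems
```

An empty list means the two schemas agree on property names, top-level types, and required fields. The check is deliberately shallow and does not descend into nested objects.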
Five things that surprised us during our own migration. Save yourself the bruises:
day: "2026-05-12T00:00:00Z" now requires "2026-05-12". The fix is one Pydantic validator, but only if you run the dataset and notice.
1.0; we had been overriding to 0.3 server-side. The SDK respects the model's default in the Responses API, which can shift outputs. Set temperature explicitly.
max_completion_tokens is enforced harder. The SDK will truncate cleanly at the limit and surface a finish reason; the Assistants API was more forgiving. We saw two regressions traced to outputs being cut off mid-tool-call. Raise the limit and add an evaluator that flags finish_reason != "stop".
Once you are running both stacks, the eval-parity check belongs in CI, not just one engineer's notebook. We add a dedicated job to the agent repo that runs the dataset through both stacks on every PR that touches migration code, and fails the build if the SDK stack falls more than 1 point behind on any evaluator. The same gate pattern as our continuous-eval CI/CD setup, just doubled, with two predictors instead of one.
The job is gone two weeks after the legacy code is deleted. While it lives, it is the difference between a clean cutover and a six-month tail of "wait, why is this one customer hitting an old code path."
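The gate itself can be very small. The following is a minimal sketch, assuming each stack's results have already been reduced to a mean score per evaluator on a 0 to 1 scale, so that "1 point" corresponds to 0.01. The function names and the dictionary shape are assumptions for illustration, not the author's actual CI job.

```python
def parity_failures(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return evaluators where the candidate trails the baseline by more than tolerance."""
    failures: dict[str, float] = {}
    for name, base_score in baseline.items():
        # A missing evaluator raises KeyError, which also fails the job.
        # Rounding avoids float noise flagging a drop of exactly `tolerance`.
        delta = round(candidate[name] - base_score, 6)
        if delta < -tolerance:
            failures[name] = delta
    return failures


def exit_code(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> int:
    """Print each failing evaluator and return a process exit code for CI."""
    failures = parity_failures(baseline, candidate, tolerance)
    for name, delta in failures.items():
        print(f"PARITY FAIL {name}: {delta:+.3f}")
    return 1 if failures else 0
```

In a CI script, the result of `exit_code(...)` would be passed to `sys.exit` so that any evaluator falling behind by more than the tolerance fails the build.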
Two final practices once the SDK stack is the only stack:
openai.beta.assistants is now dead code. Grep it out. Leftover Assistants resources on OpenAI's side keep accruing storage costs for files you forgot about.
The migration is, frankly, mostly mechanical once you have the dataset. The hard part, and the part most teams skip, is committing to a numerical parity gate before you cut over. Skip the gate and you will ship regressions. Run the gate and the cutover is a non-event, which is exactly what a good migration should be.
For a single agent with 2–3 tools and no File Search dependency, a senior engineer can do it in 3–5 days including the eval-parity work. Add a week per agent that uses File Search. Add another week if your thread persistence is currently entirely server-side and needs to be moved to your own store.
Technically yes, but the cognitive cost of maintaining two mental models for "how the agent works" is high, and the SDK is where new features land. Plan to fully cut over within a quarter of starting the migration.
Only if you deleted them. Pull the thread messages via the API before decommissioning, store them in your own DB, and you have a clean migration of historical context. We did this for ~40k threads in a single batch job.
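A thread export along those lines might look like the sketch below. It assumes the official `openai` Python client, where iterating the result of `client.beta.threads.messages.list(...)` is expected to follow pagination automatically; the `export_thread` name and the row layout are hypothetical, and only text content blocks are kept.

```python
def export_thread(client, thread_id: str) -> list[dict]:
    """Collect the text messages of one thread, oldest first."""
    rows = []
    pages = client.beta.threads.messages.list(
        thread_id=thread_id, order="asc", limit=100
    )
    for message in pages:
        # Skip non-text blocks (for example image files) in this sketch.
        text_parts = [
            block.text.value for block in message.content if block.type == "text"
        ]
        rows.append(
            {
                "thread_id": thread_id,
                "message_id": message.id,
                "role": message.role,
                "created_at": message.created_at,
                "content": "\n".join(text_parts),
            }
        )
    return rows
```

Each returned row can then be written to your own database. For a large batch you would likely add rate limiting and a record of which threads have already been exported.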
Native function tools: yes, and the ergonomics are better. File Search and Code Interpreter: not as hosted tools — you build the equivalents yourself or use the Responses API's file primitives. For most teams the function-tool case is 90% of the surface area; the rest is project-specific work.
A working heuristic: the dataset is big enough when adding 50 more rows does not move the aggregate scores by more than the regression threshold you intend to gate on. For our scheduling agent that point was around 320 rows. Below 100 rows the comparison is too noisy to trust as a cutover gate. Pair the offline dataset with the shadow-traffic step and you have two independent gates protecting the migration.
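The heuristic above can be checked mechanically. This is a minimal sketch for a single evaluator, assuming per-row scores are available as a list in the order the rows were added; `is_stable` and its default values are illustrative, with the threshold set to the same 0.01 used when scores are on a 0 to 1 scale.

```python
from statistics import fmean


def is_stable(scores: list[float], step: int = 50, threshold: float = 0.01) -> bool:
    """True if the most recent `step` rows moved the mean by at most `threshold`."""
    if len(scores) <= step:
        return False  # not enough rows to compare before and after
    before = fmean(scores[:-step])  # aggregate without the newest rows
    after = fmean(scores)  # aggregate including them
    return abs(after - before) <= threshold
```

Running this for each evaluator after every batch of new rows gives a simple signal for when to stop growing the dataset.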
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.