By Sagar Shankaran, Founder of CallSphere
Tool-calling agents drift on edge cases your prompt cannot fix. We walk through the OpenAI SFT recipe for gpt-4o + gpt-4o-mini in 2026, the JSONL format with `tools` arrays, strict-mode caveats, and a CallSphere-tested checklist for hitting 95% function-arg accuracy.
Key takeaways
TL;DR: Fine-tune gpt-4o-mini before reaching for gpt-4o. With 200–500 high-quality JSONL examples that include the full `tools` array per row, you can lift function-arg accuracy from ~82% (vanilla prompt) to 95%+ on a vertical tool surface, at $25/M training tokens and $0.30/$1.20 per 1M inference tokens.
OpenAI supervised fine-tuning (SFT) for tool-calling teaches a model which tool to pick, which arguments to fill, and how to format the call for your specific tool surface. Vanilla GPT-4o handles the public schema well, but vertical agents have private quirks — phone numbers in E.164, ICD-10 codes for healthcare, time zones inferred from caller location — that prompt-only systems hallucinate in 10–20% of calls.
Strict mode is supported during training but disabled at inference time when a fine-tuned model emits parallel tool calls, so design your training set to bias toward sequential calls if argument validation is critical.
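One way to act on the strict-mode caveat at inference time is to declare the tool with `strict: True` and send `parallel_tool_calls=False`, so the model emits at most one call per turn. The sketch below only builds the request arguments and makes no network call. The `create_appointment` schema, the `build_request` helper, and the fine-tuned model ID are illustrative assumptions, not CallSphere's production values.

```python
# Hypothetical tool schema for illustration; swap in your production tools array.
CREATE_APPOINTMENT = {
    "type": "function",
    "function": {
        "name": "create_appointment",
        "description": "Book an appointment with a provider.",
        "strict": True,  # strict mode: arguments must match the schema exactly
        "parameters": {
            "type": "object",
            "properties": {
                "provider": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
            },
            "required": ["provider", "start"],  # strict mode needs every property listed
            "additionalProperties": False,  # and no extra keys
        },
    },
}


def build_request(model, messages, tools):
    """Return keyword arguments for client.chat.completions.create (assumed helper)."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "parallel_tool_calls": False,  # at most one tool call per assistant turn
    }


kwargs = build_request(
    model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # placeholder fine-tuned model ID
    messages=[{"role": "user", "content": "book me with Dr. Patel tomorrow at 3"}],
    tools=[CREATE_APPOINTMENT],
)
# Send with the SDK as: client.chat.completions.create(**kwargs)
```

Keeping calls sequential costs an extra round trip when two tools are needed, which is usually an acceptable trade when argument validation matters more than latency.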
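Capture production traces with Stored Completions (store: true).
Format each JSONL row with messages and the same tools array used in production.
Launch the job via POST /v1/fine_tuning/jobs with model gpt-4o-mini-2024-07-18 or gpt-4o-2024-08-06.
flowchart TD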
PROD[Production traces] -->|store:true| LOG[(Stored Completions)]
LOG --> CURATE[Curate 200-500 hard cases]
CURATE --> FMT[JSONL: messages + tools]
FMT --> JOB[Fine-tune gpt-4o-mini]
JOB --> EVAL[OpenAI Evals]
EVAL -->|pass 95%| DEPLOY[Deploy]
EVAL -->|fail| CURATE
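The 95% gate in the flowchart can be approximated locally before wiring up a hosted eval. The following is a minimal sketch, not the OpenAI Evals API: it assumes you already have pairs of expected and predicted tool calls as dicts with `name` and `arguments` (a JSON string), and it treats a call as correct only on an exact match of tool name and parsed arguments. The function names are hypothetical.

```python
import json


def call_matches(expected, predicted):
    """True when the tool name and the parsed arguments are identical."""
    if expected["name"] != predicted["name"]:
        return False
    try:
        # Compare parsed JSON so key order and whitespace do not matter.
        return json.loads(expected["arguments"]) == json.loads(predicted["arguments"])
    except (json.JSONDecodeError, TypeError):
        return False  # unparseable arguments count as a miss


def arg_accuracy(pairs):
    """Fraction of (expected, predicted) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(call_matches(e, p) for e, p in pairs) / len(pairs)


def passes_gate(pairs, threshold=0.95):
    """Deploy only when accuracy reaches the threshold; otherwise curate more cases."""
    return arg_accuracy(pairs) >= threshold


pairs = [
    (  # same arguments, different key order: counts as a hit
        {"name": "create_appointment", "arguments": '{"provider": "dr_patel", "start": "2026-05-08T15:00-04:00"}'},
        {"name": "create_appointment", "arguments": '{"start": "2026-05-08T15:00-04:00", "provider": "dr_patel"}'},
    ),
    (  # wrong provider: counts as a miss
        {"name": "create_appointment", "arguments": '{"provider": "dr_patel"}'},
        {"name": "create_appointment", "arguments": '{"provider": "dr_patil"}'},
    ),
]
accuracy = arg_accuracy(pairs)
```

Exact match is a deliberately strict choice. Fields with legitimate variation, such as free-text notes, may need a looser comparison.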
CallSphere runs 37 specialized agents across 6 verticals (healthcare, behavioral health, salon, dental, MSP, real estate), each with a private slice of a shared surface of 90+ tools and 115+ DB tables. Healthcare's post-call analytics agent runs on gpt-4o-mini specifically because its tool surface is narrow (12 functions) and SFT lifts arg-accuracy from 84% to 96%. The OneRoof real-estate vertical uses the OpenAI Agents SDK, which natively respects the fine-tuned model's tool routing.
We expose this on every plan: Starter $149, Growth $499, Scale $1,499 — with a 14-day trial and 22% affiliate for partners. Run your own numbers in the ROI calculator.
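# 1. Capture traces in production
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# msgs and tools are the message list and tools array your app already sends
client.chat.completions.create(
    model="gpt-4o",
    messages=msgs,
    tools=tools,
    store=True,  # 30-day retention
    metadata={"app": "voice-router"},
)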
# 2. Format one JSONL row (pretty-printed here; in the file each row must sit on a single line)
{"messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "book me with Dr. Patel tomorrow at 3"},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "create_appointment",
            "arguments": "{\"provider\":\"dr_patel\",\"start\":\"2026-05-08T15:00-04:00\"}"}}]}
  ],
  "tools": [ /* same tools array as production */ ]}
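Before uploading, it can help to lint every row for the mistakes this format invites. The sketch below is an assumed local check, not an OpenAI utility: `validate_row` is a hypothetical helper that confirms a line parses as JSON, carries both `messages` and a non-empty `tools` array, and that each assistant tool call names a declared tool and passes its arguments as a JSON string.

```python
import json


def validate_row(line):
    """Return a list of problems found in one JSONL training row (empty list means OK)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    if not row.get("messages"):
        problems.append("missing messages")
    tools = row.get("tools") or []
    if not tools:
        problems.append("missing tools array")

    # Names of the functions declared in this row's tools array.
    known = {t.get("function", {}).get("name") for t in tools}

    for msg in row.get("messages") or []:
        for call in msg.get("tool_calls") or []:
            fn = call.get("function", {})
            name = fn.get("name")
            if name not in known:
                problems.append(f"tool_call uses undeclared tool {name!r}")
            try:
                json.loads(fn.get("arguments"))
            except (json.JSONDecodeError, TypeError):
                problems.append(f"arguments for {name!r} are not a JSON string")
    return problems


# Illustrative rows: one complete, one with the tools array dropped.
tool = {"type": "function", "function": {"name": "create_appointment", "parameters": {"type": "object"}}}
messages = [
    {"role": "user", "content": "book me with Dr. Patel tomorrow at 3"},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "create_appointment", "arguments": json.dumps({"provider": "dr_patel"})},
    }]},
]
good_row = json.dumps({"messages": messages, "tools": [tool]})
bad_row = json.dumps({"messages": messages})
```

Running `validate_row` over each line of the training file and refusing to upload when any list is non-empty catches a dropped `tools` array before it costs a training run.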
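# 3. Upload the training file, then launch the job
# "train.jsonl" is a placeholder path for the file built in step 2
with open("train.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3},
)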
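Launching the job returns immediately, so something has to wait for it to finish. A minimal polling sketch follows, assuming the OpenAI Python SDK's `client.fine_tuning.jobs.retrieve` and the job's `status` field. The `wait_for_job` helper, its 30-second default, and the injectable `sleep` argument are illustrative choices, not part of the SDK.

```python
import time

# Statuses after which a fine-tuning job will not change again.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal state, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATES:
            return job
        sleep(poll_seconds)  # injectable so tests can skip the real wait
```

Call it with the ID returned when the job was created. If the returned job's `status` is `"succeeded"`, its `fine_tuned_model` field holds the model name to pass as `model` at inference time.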
Include the tools array on every row. Without it, the model forgets the schema and falls back to natural-language tool descriptions.
Q: gpt-4o or gpt-4o-mini? Start with mini ($25/M training, $0.30/$1.20 inference). Only escalate to gpt-4o if mini's eval ceiling is too low after 3 epochs.
Q: How many examples are enough? 200 for a narrow surface (≤10 tools). 500–2,000 if you have 50+ tools or multi-step planning. Quality > volume.
Q: Will fine-tuning beat a bigger prompt? For tool selection, yes — past ~5 tools, prompt engineering returns flatten while SFT keeps lifting accuracy.
Q: What about catastrophic forgetting? Mix 10–15% general instruction-following examples to preserve out-of-domain reasoning.
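The mixing ratio is easy to get wrong by a few points, so here is a small sketch of one way to do it. It assumes two in-memory lists of already-formatted rows. `mix_in_general`, the 12% default, and the fixed seed are hypothetical choices within the 10–15% range above.

```python
import random


def mix_in_general(tool_rows, general_rows, general_fraction=0.12, seed=0):
    """Add enough general rows that they form `general_fraction` of the final set."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve g / (t + g) = f for g, where t is the number of tool-calling rows.
    n_general = round(general_fraction * len(tool_rows) / (1 - general_fraction))
    if n_general > len(general_rows):
        raise ValueError("not enough general rows for the requested fraction")
    rng = random.Random(seed)  # seeded so the training file is reproducible
    mixed = list(tool_rows) + rng.sample(list(general_rows), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder rows; in practice these are JSONL strings or dicts.
tool_rows = [f"tool-{i}" for i in range(88)]
general_rows = [f"general-{i}" for i in range(50)]
mixed = mix_in_general(tool_rows, general_rows)
general_share = sum(r.startswith("general") for r in mixed) / len(mixed)
```

With 88 tool-calling rows this adds 12 general rows, giving a 100-row set in which general examples make up 12%.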
Q: Do I need DPO too? Not initially. Run SFT first, measure, then add DPO if you have preference pairs (a good vs. a bad set of call arguments for the same input).
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.