By Sagar Shankaran, Founder of CallSphere
FP4 training was a research curiosity in 2024. By 2026 it ships in production frontier models. What changed and what tradeoffs remain.
Key takeaways
DeepSeek V4 (March 2026) is the first publicly described frontier model trained substantially in FP4. NVIDIA Blackwell's tensor cores run FP4 at twice the throughput of FP8 and four times that of BF16. The arithmetic of training cost finally pushed the industry past FP16 as the default for new pretraining.
This piece walks through what FP4 training actually means, how teams are doing it without quality regressions, and what is still a moving target.
flowchart LR
Fwd[Forward pass<br/>FP4 weights/activations] --> Loss
Loss --> Bwd[Backward pass<br/>FP4 gradients]
Bwd --> Master[FP32 master weights<br/>updated by optimizer]
Master --> CastF[Cast back to FP4 for next step]
You do not train end-to-end in FP4. The standard recipe in 2026:
The result: about 2x the throughput of FP8, roughly 4x BF16, while staying within 0.5 percent of BF16 quality on standard benchmarks.
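The loop in the diagram (FP4 forward and backward passes, FP32 master weights, a cast back to FP4 each step) can be sketched in a few lines of NumPy. This is a minimal simulation, not any lab's actual recipe: `fake_quant_fp4` and `train_step` are hypothetical helpers, the "FP4" values are rounded to the E2M1 grid but stored in FP32, and per-tensor scaling is assumed purely for brevity.

```python
import numpy as np

# Magnitudes representable by the E2M1 FP4 format (sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_fp4(x):
    """Round x to the nearest FP4 value after per-tensor scaling (result kept in FP32)."""
    x = np.asarray(x, dtype=np.float32)
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return x.copy()
    scale = amax / FP4_GRID[-1]              # map the largest magnitude onto 6.0
    scaled = np.abs(x) / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    return (np.sign(x) * FP4_GRID[idx] * scale).astype(np.float32)

def train_step(master_w, x, y, lr=0.05):
    """One linear-regression step: FP4-simulated compute, FP32 master update."""
    w_q = fake_quant_fp4(master_w)           # cast master weights down for this step
    x_q = fake_quant_fp4(x)                  # quantized activations
    err = x_q @ w_q - y                      # forward pass on quantized operands
    loss = float(np.mean(err ** 2))
    grad = fake_quant_fp4(2.0 * x_q.T @ err / len(y))  # quantized gradient
    return master_w - lr * grad, loss        # optimizer update stays in FP32

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8)).astype(np.float32)
true_w = rng.normal(size=(8, 1)).astype(np.float32)
y = x @ true_w

w = np.zeros((8, 1), dtype=np.float32)       # FP32 master weights
losses = []
for _ in range(200):
    w, loss = train_step(w, x, y)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The point of the master copy is that small updates accumulate in FP32 even when a single step is too small to move the 4-bit value. The loss here settles at a floor set by quantization error rather than reaching zero.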
Naive FP4 training diverges. Activations and gradients have wide dynamic ranges that 4 bits cannot represent. The patterns that made it work in 2025-2026:
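To make the dynamic-range problem concrete, the sketch below quantizes a tensor with a few large outliers in two ways: one scale for the whole tensor, and one scale per 32-element block. Block-wise scaling is the idea behind microscaling formats such as MXFP4 and NVFP4. Treating it as representative of what production recipes do is an assumption here, and `quantize_fp4` and `rel_error` are illustrative helpers, not a library API.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4(x, block_size=None):
    """Fake-quantize a 1-D array to FP4 with one scale per block (or per tensor)."""
    x = np.asarray(x, dtype=np.float32)
    block = x.size if block_size is None else block_size
    assert x.size % block == 0, "array length must be a multiple of block_size"
    blocks = x.reshape(-1, block)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)   # avoid a zero scale
    scaled = np.abs(blocks) / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)

def rel_error(x, xq):
    """Relative L2 error introduced by quantization."""
    return float(np.linalg.norm(x - xq) / np.linalg.norm(x))

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.02, size=1024).astype(np.float32)
acts[::256] = 8.0                      # four synthetic outliers

q_tensor = quantize_fp4(acts)                  # one scale for everything
q_block = quantize_fp4(acts, block_size=32)    # one scale per 32 values

err_tensor, err_block = rel_error(acts, q_tensor), rel_error(acts, q_block)
zeros_tensor = int(np.sum(q_tensor == 0))
zeros_block = int(np.sum(q_block == 0))

print(f"per-tensor: rel error {err_tensor:.4f}, values flushed to zero {zeros_tensor}")
print(f"per-block:  rel error {err_block:.4f}, values flushed to zero {zeros_block}")
```

With a single scale, the outliers set the range and nearly every ordinary value rounds to zero. With per-block scales, only the blocks that contain an outlier pay that price.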
DeepSeek V4 published technical details in their Q1 2026 paper. Key points:
Independent reproductions of parts of the recipe by Tsinghua and HuggingFace teams have validated that FP4 training is broadly reproducible — not a one-off.
flowchart TB
H100[H100 BF16/FP8] --> Old[Older training]
H200[H200 FP8 native] --> Mid[2024-2025 mainstream]
B200[Blackwell B200<br/>FP4 native] --> New[2026 frontier]
MI355[AMD MI355X<br/>FP4 native] --> NewAMD[2026 alternative]
Blackwell's FP4 tensor cores are the production hardware enabling this in 2026. AMD's MI355X added FP4 support and is closing the gap. Older H100 fleets cannot do FP4 natively — they emulate it slowly. The capex shift toward Blackwell is partly motivated by FP4 economics.
If you are pretraining a frontier model in 2026, FP4 is the default path on Blackwell hardware. If you are fine-tuning or doing post-training, the choice depends on framework support — most frameworks (Megatron-LM, NeMo, TorchTitan) support FP4 mixed-precision; some (smaller research libraries) do not yet.
For inference, FP4 weights are essentially free quality-wise for chat and agentic workloads. They are now the default in production.
FP4 training is the kind of news that lives or dies on second-week behavior. The first benchmark is marketing. The eval suite a week later is the truth. For CallSphere (Twilio + OpenAI Realtime + ElevenLabs + NestJS + Prisma + Postgres, 37 agents across 6 verticals), the bar for adopting any new model or API is unsentimental: does it shorten the inner loop on a real call, or just on a benchmark?
A base model is a checkpoint. A production LLM stack is a whole different artifact: eval gates that fail the build on regression, prompt caching that cuts repeated-system-prompt cost by 40-70%, structured outputs that prevent JSON drift on tool calls, fallback chains that route to a smaller-model retry when the primary times out, and request-side guardrails that cap tool calls per session before the loop spirals. CallSphere runs LLMs in tandem on purpose: gpt-4o-realtime for the live call (streaming audio in and out, tool calls inline) and gpt-4o-mini for post-call analytics (sentiment scoring, lead qualification, summary generation, and the lower-stakes async work that doesn't need realtime). That split is not a cost optimization — it's a reliability decision. Realtime is optimized for low-latency turn-taking; mini is optimized for cheap, deterministic batch scoring. Mixing them lets each do what it's good at without one regressing the other. The teams that struggle with LLMs in production almost always made the same mistake: they treated "the model" as a single dependency, instead of as a small portfolio of models, each pinned to a job, each behind its own eval suite, each with a documented fallback.
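Two of the pieces named above, the fallback chain and the per-session cap on tool calls, are small enough to sketch. Everything below is hypothetical scaffolding rather than CallSphere's code: `call_with_fallback`, `Session`, and the stub model functions are illustrative names, and a real client would replace the stubs with actual API calls.

```python
import concurrent.futures as cf
import time

class ToolBudgetExceeded(RuntimeError):
    """Raised when a session tries to exceed its tool-call cap."""

class Session:
    """Counts tool calls for one session and enforces a hard cap."""
    def __init__(self, max_tool_calls=8):
        self.max_tool_calls = max_tool_calls
        self.tool_calls = 0

    def record_tool_call(self):
        if self.tool_calls >= self.max_tool_calls:
            raise ToolBudgetExceeded(f"cap of {self.max_tool_calls} tool calls reached")
        self.tool_calls += 1

def call_with_fallback(prompt, chain, timeout_s=2.0):
    """Try each (name, fn) in order; move on when one times out or raises."""
    errors = []
    for name, fn in chain:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, prompt)
        try:
            return name, future.result(timeout=timeout_s)
        except cf.TimeoutError:
            errors.append((name, "timeout"))
        except Exception as exc:            # any model-side failure
            errors.append((name, repr(exc)))
        finally:
            pool.shutdown(wait=False)       # do not block on a stuck call
    raise RuntimeError(f"all models failed: {errors}")

# Stub "models" standing in for real API clients.
def slow_primary(prompt):
    time.sleep(0.3)                         # simulates a stalled primary model
    return "primary: " + prompt

def small_fallback(prompt):
    return "fallback: " + prompt

used, reply = call_with_fallback(
    "summarize the call",
    [("primary", slow_primary), ("small", small_fallback)],
    timeout_s=0.05,
)
print(used, "->", reply)
```

A thread-based timeout only stops waiting. It does not cancel the underlying request, so a production client would also want a request-level timeout on the call itself.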
Q: Does FP4 training actually move p95 latency or tool-call reliability?
A: Most of the time it doesn't, and that's the right starting assumption. The relevant test is whether it improves at least one of: p95 first-token latency, tool-call argument accuracy on noisy inputs, multi-turn handoff stability, or per-session cost. Real Estate deployments run 10 specialist agents with 30 tools, including vision-on-photos for listing intake and follow-up.
Q: What would have to be true before an FP4-trained model ships into production?
A: The eval gate is unsentimental — a regression suite that simulates real call traffic (noisy ASR, partial inputs, tool-call timeouts) measures four numbers, and a candidate has to win on three of four without losing badly on the fourth. Anything else is treated as a blog post, not a stack change.
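The "win three of four without losing badly on the fourth" rule is easy to encode. The sketch below assumes the four numbers are the ones listed in the previous answer (p95 first-token latency, tool-call argument accuracy, handoff stability, per-session cost). It also assumes that "losing badly" means more than 10 percent worse than baseline. Both are illustrative choices, as are the metric names and sample values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    higher_is_better: bool

# Assumed metric set; names are hypothetical.
METRICS = (
    Metric("p95_first_token_latency_ms", higher_is_better=False),
    Metric("tool_call_arg_accuracy", higher_is_better=True),
    Metric("handoff_stability", higher_is_better=True),
    Metric("cost_per_session_usd", higher_is_better=False),
)

def relative_gain(metric, baseline, candidate):
    """Fractional improvement over baseline; positive means the candidate is better."""
    delta = (candidate - baseline) / abs(baseline)
    return delta if metric.higher_is_better else -delta

def passes_gate(baseline, candidate, min_wins=3, max_loss=0.10):
    """Pass if the candidate wins on min_wins metrics and never loses by more than max_loss."""
    gains = {
        m.name: relative_gain(m, baseline[m.name], candidate[m.name])
        for m in METRICS
    }
    wins = sum(g > 0 for g in gains.values())
    worst = min(gains.values())
    return wins >= min_wins and worst >= -max_loss, gains

baseline = {
    "p95_first_token_latency_ms": 900.0,
    "tool_call_arg_accuracy": 0.92,
    "handoff_stability": 0.95,
    "cost_per_session_usd": 0.40,
}
# Faster and more accurate, slightly more expensive.
candidate = dict(baseline, p95_first_token_latency_ms=820.0,
                 tool_call_arg_accuracy=0.94, handoff_stability=0.96,
                 cost_per_session_usd=0.42)
# Same wins, but a 25 percent cost regression.
pricey = dict(candidate, cost_per_session_usd=0.50)

ok, gains = passes_gate(baseline, candidate)
ok_pricey, _ = passes_gate(baseline, pricey)
print(ok, ok_pricey)
```

Wired into CI, a gate like this fails the build when a candidate regresses, which is what keeps a model swap from being decided by a single headline benchmark.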
Q: Which CallSphere vertical would benefit from FP4-trained models first?
A: In a CallSphere deployment, new model and API capabilities land first in the post-call analytics pipeline (lower stakes, async, easy to roll back) and only later in the live realtime path. Today the verticals most likely to absorb new capability first are Sales and Healthcare, which already run the largest share of production traffic.
Want to see healthcare agents handle real traffic? Walk through https://healthcare.callsphere.tech or grab 20 minutes with the founder: https://calendly.com/sagar-callsphere/new-meeting.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.