The Headline

DeepSeek V4 (March 2026) is the first publicly described frontier model trained substantially in FP4. NVIDIA Blackwell's tensor cores run FP4 at twice the rate of FP8 and four times the rate of BF16. The arithmetic of training cost finally pushed the industry past FP16 as the default for new pretraining.

This piece walks through what FP4 training actually means, how teams are doing it without quality regressions, and what is still a moving target.

Mixed-Precision Training Refresher

flowchart LR
    Fwd[Forward pass<br/>FP4 weights/activations] --> Loss
    Loss --> Bwd[Backward pass<br/>FP4 gradients]
    Bwd --> Master[FP32 master weights<br/>updated by optimizer]
    Master --> CastF[Cast back to FP4 for next step]
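
The loop in the diagram can be sketched in a few lines of NumPy. This is a toy simulation under stated assumptions, not anyone's production recipe: `fake_quant_fp4` and `train` are hypothetical names, the quantizer uses a single per-tensor scale instead of microscaling blocks, and the "FP4" values are ordinary floats snapped to the E2M1 grid. What it does show is the key idea: the optimizer updates an FP32 master copy, and only a freshly cast low-precision copy is used in the forward pass.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element; the sign bit mirrors them.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x):
    """Snap x to the nearest E2M1 value after per-tensor scaling (simulation only)."""
    amax = np.max(np.abs(x))
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = np.abs(x) / scale
    idx = np.argmin(np.abs(scaled[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx] * scale

def train(steps=200, lr=0.05, seed=0):
    """Toy linear regression: quantized weights in the forward pass, FP32 master update."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, 8))
    w_true = rng.normal(size=(8,))
    y = x @ w_true
    master = np.zeros(8, dtype=np.float32)   # FP32 master weights
    losses = []
    for _ in range(steps):
        w_q = fake_quant_fp4(master)         # cast master weights down for this step
        err = x @ w_q - y                    # forward pass with quantized weights
        losses.append(float(np.mean(err ** 2)))
        # Gradient taken at the quantized weights, applied to the master copy
        # (a straight-through estimator).
        grad = 2.0 * x.T @ err / len(y)
        master -= (lr * grad).astype(np.float32)
    return master, losses

if __name__ == "__main__":
    master, losses = train()
    print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Small updates that would vanish if applied directly to a 4-bit value accumulate in the master copy until they are large enough to move the quantized weight to the next grid point, which is why the master weights stay in higher precision.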

You do not train end-to-end in FP4. The standard recipe in 2026:

Forward and backward pass: FP4 (specifically MXFP4 with E2M1 elements and E8M0 block scales)
Activations and gradients: MXFP6 or MXFP8 in critical layers
Master weights: still kept in FP32 or BF16 by the optimizer
Optimizer state: BF16 or FP8 with stochastic rounding

The result: about 2x the throughput of FP8, roughly 4x BF16, while staying within 0.5 percent of BF16 quality on standard benchmarks.
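
To make "MXFP4 with E2M1 elements and E8M0 block scales" concrete, here is a minimal NumPy sketch of block quantization. It is an illustration under assumptions: the function names are hypothetical, the default block size of 32 and the shared-exponent rule (floor of log2 of the block maximum, minus the largest E2M1 exponent) follow the OCP Microscaling specification as I understand it, and ties are rounded toward the smaller magnitude for simplicity instead of round-to-nearest-even. Real kernels pack the 4-bit codes; this version keeps them as floats for readability.

```python
import numpy as np

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2  # largest element exponent: 6.0 = 1.5 * 2**2

def mxfp4_quantize(x, block=32):
    """Return (element values on the E2M1 grid, per-block power-of-two exponents)."""
    x = np.asarray(x, dtype=np.float64)
    assert x.size % block == 0, "pad the tensor to a multiple of the block size"
    blocks = x.reshape(-1, block)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)          # avoid log2(0) for all-zero blocks
    exp = np.floor(np.log2(safe)) - E2M1_EMAX     # shared exponent per block
    exp = np.clip(exp, -127, 127)                 # range of an E8M0 scale
    scaled = blocks / (2.0 ** exp)
    mag = np.minimum(np.abs(scaled), 6.0)         # saturate at the largest E2M1 value
    idx = np.argmin(np.abs(mag[..., None] - E2M1_GRID), axis=-1)
    elems = np.sign(scaled) * E2M1_GRID[idx]
    return elems, exp.astype(np.int32)

def mxfp4_dequantize(elems, exp, shape):
    """Multiply each block by its power-of-two scale and restore the original shape."""
    return (elems * 2.0 ** exp.astype(np.float64)).reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 64))
    elems, exp = mxfp4_quantize(w)
    w_hat = mxfp4_dequantize(elems, exp, w.shape)
    print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```

With one 8-bit scale shared by 32 four-bit elements, storage works out to 4 + 8/32 = 4.25 bits per value. Because the scale is a pure power of two, applying it is an exponent shift, not a multiply, which is part of what makes the format cheap in hardware.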

Why This Required New Tricks

Naive FP4 training diverges. Activations and gradients have wide dynamic ranges that 4 bits cannot represent: an E2M1 element encodes only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) plus a sign. The patterns that made it work in 2025-2026:

Microscaling block sizes tuned per tensor: not all tensors tolerate the same block size. DeepSeek V4 uses block sizes from 16 to 128 depending on tensor type.
Stochastic rounding: applied in the FP4 cast, it keeps rounding error zero-mean and prevents systematic drift
Selective higher-precision layers: embeddings, layer norms, and the final classifier head stay BF16
Loss scaling: adapted for the FP4 dynamic range, a refinement of the older FP16 loss-scaling trick
Outlier handling: per-tensor outlier clipping or dedicated higher-precision storage for known outlier dimensions
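
Of the tricks above, stochastic rounding is the easiest to show in isolation. The sketch below is a NumPy simulation with a hypothetical function name, and it assumes the input has already been scaled into the E2M1 range of [-6, 6]. Each value is rounded to one of its two neighboring grid points, with the upper neighbor chosen with probability equal to how far the value sits between them, so the expected result equals the input.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element; the sign bit mirrors them.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_e2m1(x, rng):
    """Stochastically round pre-scaled values onto the E2M1 grid (simulation only)."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.clip(np.abs(x), 0.0, 6.0)
    # Index of the grid point just above each magnitude, kept within the table.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag, side="right"),
                     1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)            # 0 when exactly on lo, 1 when on hi
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.full(100_000, 2.2)                # sits between grid points 2.0 and 3.0
    sr = stochastic_round_e2m1(x, rng)
    print("round-to-nearest would give 2.0 every time")
    print("stochastic rounding mean:", float(sr.mean()))
```

Round-to-nearest maps every 2.2 to 2.0, a bias of -0.2 that compounds across millions of updates. Stochastic rounding trades that bias for zero-mean noise, which averages out over many steps.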

The DeepSeek V4 Recipe

DeepSeek published the technical details of V4 in its Q1 2026 paper. Key points:

Pretraining done substantially in FP4 (with critical components in higher precision)
~14 trillion tokens of training data
Mixture-of-Experts with FP4 expert weights
Multi-token prediction objective (related to but different from speculative decoding)
Total training compute reported substantially below comparable Llama-class models

Independent reproductions of parts of the recipe by Tsinghua and HuggingFace teams have validated that FP4 training is broadly reproducible — not a one-off.

Hardware

flowchart TB
    H100[H100 BF16/FP8] --> Old[Older training]
    H200[H200 FP8 native] --> Mid[2024-2025 mainstream]
    B200[Blackwell B200<br/>FP4 native] --> New[2026 frontier]
    MI355[AMD MI355X<br/>FP4 native] --> NewAMD[2026 alternative]

Blackwell's FP4 tensor cores are the production hardware enabling this in 2026. AMD's MI355X added FP4 support and is closing the gap. Older H100 fleets cannot do FP4 natively — they emulate it slowly. The capex shift toward Blackwell is partly motivated by FP4 economics.

What Still Doesn't Fit

Very small models: under ~3B parameters, FP4 training quality regressions are larger relative to BF16; the dollar savings are also smaller
Tasks with extreme tail dependence: math benchmarks and hard reasoning still show ~1 point regressions in some FP4 training runs; for the highest-quality math models, BF16 weights are still preferred
RL fine-tuning: PPO and GRPO fine-tunes are sensitive; many teams keep RLHF in BF16 even when pretraining was FP4

What This Means for Practitioners

If you are pretraining a frontier model in 2026, FP4 is the default path on Blackwell hardware. If you are fine-tuning or doing post-training, the choice depends on framework support — most frameworks (Megatron-LM, NeMo, TorchTitan) support FP4 mixed-precision; some (smaller research libraries) do not yet.

For inference, FP4 weights are essentially free quality-wise for chat and agentic workloads. They are now the default in production.

Sources

DeepSeek V4 technical report — https://github.com/deepseek-ai
"FP4 training in practice" NVIDIA — https://developer.nvidia.com/blog
OCP Microscaling specification — https://www.opencompute.org
"Microscaling formats for AI" research — https://arxiv.org/abs/2310.10537
"FP8 Formats for Deep Learning" Micikevicius et al. — https://arxiv.org/abs/2209.05433

FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 — operator perspective

FP4 training is the kind of news that lives or dies on second-week behavior. The first benchmark is marketing. The eval suite a week later is the truth. For CallSphere (Twilio + OpenAI Realtime + ElevenLabs + NestJS + Prisma + Postgres, 37 agents across 6 verticals), the bar for adopting any new model or API is unsentimental: does it shorten the inner loop on a real call, or just on a benchmark?

Base model vs. production LLM stack — the gap that costs you uptime

A base model is a checkpoint. A production LLM stack is a whole different artifact: eval gates that fail the build on regression, prompt caching that cuts repeated-system-prompt cost by 40-70%, structured outputs that prevent JSON drift on tool calls, fallback chains that route to a smaller-model retry when the primary times out, and request-side guardrails that cap tool calls per session before the loop spirals. CallSphere runs LLMs in tandem on purpose: gpt-4o-realtime for the live call (streaming audio in and out, tool calls inline) and gpt-4o-mini for post-call analytics (sentiment scoring, lead qualification, summary generation, and the lower-stakes async work that doesn't need realtime). That split is not a cost optimization — it's a reliability decision. Realtime is optimized for low-latency turn-taking; mini is optimized for cheap, deterministic batch scoring. Mixing them lets each do what it's good at without one regressing the other. The teams that struggle with LLMs in production almost always made the same mistake: they treated "the model" as a single dependency, instead of as a small portfolio of models, each pinned to a job, each behind its own eval suite, each with a documented fallback.

FAQs

Q: Does FP4 training actually move p95 latency or tool-call reliability?

A: Most of the time it doesn't, and that's the right starting assumption. The relevant test is whether it improves at least one of: p95 first-token latency, tool-call argument accuracy on noisy inputs, multi-turn handoff stability, or per-session cost. Real Estate deployments run 10 specialist agents with 30 tools, including vision-on-photos for listing intake and follow-up.

Q: What would have to be true before an FP4-trained model ships into production?

A: The eval gate is unsentimental — a regression suite that simulates real call traffic (noisy ASR, partial inputs, tool-call timeouts) measures four numbers, and a candidate has to win on three of four without losing badly on the fourth. Anything else is treated as a blog post, not a stack change.

Q: Which CallSphere vertical would benefit from FP4-trained models first?

A: In a CallSphere deployment, new model and API capabilities land first in the post-call analytics pipeline (lower stakes, async, easy to roll back) and only later in the live realtime path. Today the verticals most likely to absorb new capability first are Sales and Healthcare, which already run the largest share of production traffic.
