---
title: "FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16"
description: "FP4 training was a research curiosity in 2024. By 2026 it ships in production frontier models. What changed and what tradeoffs remain."
canonical: https://callsphere.ai/blog/fp4-training-deepseek-v4-blackwell-end-of-fp16-2026
category: "Large Language Models"
tags: ["FP4 Training", "DeepSeek", "NVIDIA Blackwell", "LLM Training"]
author: "CallSphere Team"
published: 2026-04-24T00:00:00.000Z
updated: 2026-05-08T17:27:37.343Z
---

# FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16

> FP4 training was a research curiosity in 2024. By 2026 it ships in production frontier models. What changed and what tradeoffs remain.

## The Headline

DeepSeek V4 (March 2026) is the first publicly described frontier model trained substantially in FP4. NVIDIA Blackwell's tensor cores accelerate FP4 at twice the rate of FP8 and four times BF16. The arithmetic of training cost finally pushed the industry past FP16 as the default for new pretraining.

This piece walks through what FP4 training actually means, how teams are doing it without quality regressions, and what is still a moving target.

## Mixed-Precision Training Refresher

```mermaid
flowchart LR
    Fwd["Forward pass<br/>FP4 weights/activations"] --> Loss
    Loss --> Bwd["Backward pass<br/>FP4 gradients"]
    Bwd --> Master["FP32 master weights<br/>updated by optimizer"]
    Master --> CastF[Cast back to FP4 for next step]
```

You do not train end-to-end in FP4. The standard recipe in 2026:

- **Forward and backward pass**: FP4 (specifically MXFP4 with E2M1 elements and E8M0 block scales)
- **Activations and gradients**: MXFP6 or MXFP8 in critical layers
- **Master weights**: still kept in FP32 or BF16 by the optimizer
- **Optimizer state**: BF16 or FP8 with stochastic rounding

The result: about 2x the throughput of FP8, roughly 4x BF16, while staying within 0.5 percent of BF16 quality on standard benchmarks.
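To make the element-plus-block-scale layout concrete, here is a minimal NumPy sketch of MXFP4-style fake quantization: E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign) sharing one power-of-two, E8M0-style scale per block. It is a numerics illustration under simplifying assumptions, not the packed 4-bit kernel a real training stack uses, and the ceil-based scale choice is a simplification of the OCP spec's saturating behavior.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element:
# 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_mxfp4(x, block_size=32):
    """Round-trip a tensor through MXFP4-style quantization:
    E2M1 elements sharing one power-of-two (E8M0-style) scale per block."""
    x = np.asarray(x, dtype=np.float32)
    n = x.size
    pad = (-n) % block_size
    blocks = np.pad(x.ravel(), (0, pad)).reshape(-1, block_size)

    # Choose the power-of-two scale that maps each block's max magnitude
    # into the E2M1 range [0, 6]. (Simplification: the OCP spec floors the
    # exponent instead and lets the top of the range saturate.)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))

    # Round each scaled element to the nearest representable E2M1 magnitude,
    # then restore the sign.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize by multiplying back the shared block scale.
    return (q * scale).ravel()[:n].reshape(x.shape)
```

Production kernels do this on packed 4-bit data with the scale handling fused into the tensor-core path; the sketch only shows the rounding and scaling arithmetic.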

## Why This Required New Tricks

Naive FP4 training diverges. Activations and gradients have wide dynamic ranges that 4 bits cannot represent. The patterns that made it work in 2025-2026:

- **Microscaling block sizes tuned per tensor**: not all tensors tolerate the same block size. DeepSeek V4 uses block sizes from 16 to 128 depending on tensor type.
- **Stochastic rounding** in the FP4 cast prevents systematic drift (see the sketch after this list)
- **Selective higher-precision layers**: embeddings, layer norms, and the final classifier head stay BF16
- **Loss scaling adapted for FP4 dynamic range** — a refinement of the older FP16 loss-scaling trick
- **Outlier handling**: per-tensor outlier clipping or dedicated higher-precision storage for known outlier dimensions
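To show why the stochastic rounding bullet matters, here is a minimal sketch reusing the FP4 grid from the block above: each magnitude (already scaled into the E2M1 range) rounds up or down to its neighboring grid points with probability proportional to proximity, so the cast is unbiased in expectation. That unbiasedness is what keeps tiny per-step rounding errors from accumulating into the systematic drift that round-to-nearest produces over millions of updates.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def stochastic_round_fp4(x, rng=None):
    """Stochastically round values already scaled into the E2M1 range onto the
    FP4 grid. The rounding is unbiased in expectation, unlike round-to-nearest."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=np.float32)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])

    # Bracket each magnitude between its lower and upper grid neighbours.
    hi = np.clip(np.searchsorted(FP4_GRID, mag), 1, len(FP4_GRID) - 1)
    lo = hi - 1
    gap = FP4_GRID[hi] - FP4_GRID[lo]

    # Round up with probability proportional to closeness to the upper
    # neighbour, so that E[rounded] == mag.
    p_up = (mag - FP4_GRID[lo]) / gap
    rounded = np.where(rng.random(mag.shape) < p_up, FP4_GRID[hi], FP4_GRID[lo])
    return np.sign(x) * rounded
```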

## The DeepSeek V4 Recipe

DeepSeek V4 published technical details in their Q1 2026 paper. Key points:

- Pretraining done substantially in FP4 (with critical components in higher precision)
- ~14 trillion tokens of training data
- Mixture-of-Experts with FP4 expert weights
- Multi-token prediction objective (related to but different from speculative decoding; sketched below)
- Total training compute reported substantially below comparable Llama-class models
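For the multi-token prediction bullet, here is a hedged sketch of the general objective: extra heads predict tokens several positions ahead, and their cross-entropies are averaged into the loss. This parallel-heads form is the simplest variant of the idea; published MTP designs differ in how the future-token predictors are wired, so treat this as the shape of the objective rather than the exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(hidden, heads, targets, n_future=2):
    """Average cross-entropy of predicting the tokens 1..n_future steps ahead
    from each position's final hidden state.

    hidden:  (batch, seq, d_model) hidden states from the trunk
    heads:   list of n_future nn.Linear(d_model, vocab_size) projection heads
    targets: (batch, seq) ground-truth token ids
    """
    losses = []
    for d in range(1, n_future + 1):
        logits = heads[d - 1](hidden[:, :-d])   # predict the token at position t + d
        labels = targets[:, d:]                 # ground truth shifted by d
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      labels.reshape(-1)))
    return torch.stack(losses).mean()
```

The link to speculative decoding the bullet mentions: the same extra heads can draft future tokens at inference time, but the training objective stands on its own whether or not they are used for drafting.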

Independent reproductions of parts of the recipe by Tsinghua and HuggingFace teams have validated that FP4 training is broadly reproducible — not a one-off.

## Hardware

```mermaid
flowchart TB
    H100[H100 BF16/FP8] --> Old[Older training]
    H200[H200 FP8 native] --> Mid[2024-2025 mainstream]
    B200["Blackwell B200<br/>FP4 native"] --> New[2026 frontier]
    MI355["AMD MI355X<br/>FP4 native"] --> NewAMD[2026 alternative]
```

Blackwell's FP4 tensor cores are the production hardware enabling this in 2026. AMD's MI355X added FP4 support and is closing the gap. Older H100 fleets cannot do FP4 natively — they emulate it slowly. The capex shift toward Blackwell is partly motivated by FP4 economics.

## What Still Doesn't Fit

- **Very small models**: under ~3B parameters, FP4 training quality regressions are larger relative to BF16; the dollar savings are also smaller
- **Tasks with extreme tail dependence**: math benchmarks and hard reasoning still show ~1 point regressions in some FP4 trainings; for the highest-quality math models, BF16 weights are still preferred
- **RL fine-tuning**: PPO and GRPO fine-tunes are sensitive; many teams keep RLHF in BF16 even when pretraining was FP4

## What This Means for Practitioners

If you are pretraining a frontier model in 2026, FP4 is the default path on Blackwell hardware. If you are fine-tuning or doing post-training, the choice depends on framework support — most frameworks (Megatron-LM, NeMo, TorchTitan) support FP4 mixed-precision; some (smaller research libraries) do not yet.

For inference, FP4 weights are essentially free quality-wise for chat and agentic workloads. They are now the default in production.

## Sources

- DeepSeek V4 technical report — [https://github.com/deepseek-ai](https://github.com/deepseek-ai)
- "FP4 training in practice" NVIDIA — [https://developer.nvidia.com/blog](https://developer.nvidia.com/blog)
- OCP Microscaling specification — [https://www.opencompute.org](https://www.opencompute.org)
- "Microscaling formats for AI" research — [https://arxiv.org/abs/2310.10537](https://arxiv.org/abs/2310.10537)
- "FP8 Formats for Deep Learning" Micikevicius et al. — [https://arxiv.org/abs/2209.05433](https://arxiv.org/abs/2209.05433)

## FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 — operator perspective

FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 is the kind of news that lives or dies on second-week behavior. The first benchmark is marketing. The eval suite a week later is the truth. For CallSphere — Twilio + OpenAI Realtime + ElevenLabs + NestJS + Prisma + Postgres, 37 agents across 6 verticals — the bar for adopting any new model or API is unsentimental: does it shorten the inner loop on a real call, or just on a benchmark?

## Base model vs. production LLM stack — the gap that costs you uptime

A base model is a checkpoint. A production LLM stack is a whole different artifact: eval gates that fail the build on regression, prompt caching that cuts repeated-system-prompt cost by 40-70%, structured outputs that prevent JSON drift on tool calls, fallback chains that route to a smaller-model retry when the primary times out, and request-side guardrails that cap tool calls per session before the loop spirals.

CallSphere runs LLMs in tandem on purpose: `gpt-4o-realtime` for the live call (streaming audio in and out, tool calls inline) and `gpt-4o-mini` for post-call analytics (sentiment scoring, lead qualification, summary generation, and the lower-stakes async work that doesn't need realtime). That split is not a cost optimization; it's a reliability decision. Realtime is optimized for low-latency turn-taking; mini is optimized for cheap, deterministic batch scoring. Mixing them lets each do what it's good at without one regressing the other.

The teams that struggle with LLMs in production almost always made the same mistake: they treated "the model" as a single dependency, instead of as a small portfolio of models, each pinned to a job, each behind its own eval suite, each with a documented fallback.
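A minimal sketch of the fallback-plus-guardrail pattern described above. CallSphere's actual stack is NestJS/TypeScript; this Python version only shows the control flow, and `call_model`, the timeout, and the budget of 8 tool calls are placeholders, not production values.

```python
import asyncio

PRIMARY = "gpt-4o-realtime"      # live-call model
FALLBACK = "gpt-4o-mini"         # cheaper retry / async model
MAX_TOOL_CALLS_PER_SESSION = 8   # guardrail: stop runaway tool loops (placeholder value)

async def complete_with_fallback(prompt, session, call_model, timeout_s=2.0):
    """Route to the primary model, fall back to the smaller model on timeout,
    and refuse further tool calls once the per-session budget is spent.
    `call_model(model, prompt)` is a hypothetical async client wrapper; the
    caller increments session["tool_calls"] whenever the model requests a tool."""
    if session["tool_calls"] >= MAX_TOOL_CALLS_PER_SESSION:
        return {"error": "tool-call budget exhausted", "model": None}
    try:
        return await asyncio.wait_for(call_model(PRIMARY, prompt), timeout_s)
    except asyncio.TimeoutError:
        # Documented fallback path: smaller model, logged as a degraded response.
        return await call_model(FALLBACK, prompt)
```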

## FAQs

**Q: Does FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 actually move p95 latency or tool-call reliability?**

A: Most of the time it doesn't, and that's the right starting assumption. The relevant test is whether it improves at least one of: p95 first-token latency, tool-call argument accuracy on noisy inputs, multi-turn handoff stability, or per-session cost. For context, Real Estate deployments run 10 specialist agents with 30 tools, including vision-on-photos for listing intake and follow-up, which is the surface area any new capability has to clear.

**Q: What would have to be true before FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 ships into production?**

A: The eval gate is unsentimental: a regression suite simulates real call traffic (noisy ASR, partial inputs, tool-call timeouts) and measures four numbers, and a candidate has to win on three of the four without losing badly on the fourth. Anything else is treated as a blog post, not a stack change.
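A sketch of that gating rule, assuming each of the four numbers has been normalized into a higher-is-better score; the metric names and the 5 percent threshold for "losing badly" are illustrative, not CallSphere's actual configuration.

```python
def passes_eval_gate(candidate, baseline, max_regression=0.05):
    """'Win on three of four without losing badly on the fourth.'
    Both dicts map metric name -> score, normalized so higher is better.
    Metric names and the 5% regression threshold are illustrative."""
    metrics = ["p95_first_token", "tool_call_accuracy",
               "handoff_stability", "cost_per_session"]
    wins = sum(candidate[m] >= baseline[m] for m in metrics)
    worst = min((candidate[m] - baseline[m]) / max(abs(baseline[m]), 1e-9)
                for m in metrics)
    return wins >= 3 and worst >= -max_regression
```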

**Q: Which CallSphere vertical would benefit from FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16 first?**

A: In a CallSphere deployment, new model and API capabilities land first in the post-call analytics pipeline (lower stakes, async, easy to roll back) and only later in the live realtime path. Today the verticals most likely to absorb new capability first are Sales and Healthcare, which already run the largest share of production traffic.

## See it live

Want to see healthcare agents handle real traffic? Walk through https://healthcare.callsphere.tech or grab 20 minutes with the founder: https://calendly.com/sagar-callsphere/new-meeting.

