Training models with only 4 bits | Fully-Quantized Training

The video discusses the breakthrough of training large language models using four-bit floating point precision (FP4), enabled by Nvidia’s Tensor Core technology and novel quantization formats like microscaling, which significantly reduce memory and computational demands while maintaining performance comparable to higher-precision training. It also highlights techniques such as stochastic rounding to address quantization challenges and envisions fully quantized FP4 training becoming a practical, cost-effective standard with the advent of new hardware like Nvidia’s Blackwell GPUs.

The video explores the cutting-edge development of training large language models (LLMs) using only four-bit floating point precision, a significant leap from the previous standards of 16-bit and 8-bit training. Historically, training precision has steadily decreased to reduce computational costs and memory usage, with notable milestones including Google’s BF16 format in 2018 and DeepSeek’s 8-bit training breakthrough in 2024. The new research pushes these boundaries further by employing 4-bit floating point (FP4) precision for the matrix multiplications during training, achieving performance comparable to 16-bit baselines. However, “fully quantized training” is something of a misnomer, as mixed-precision techniques are still used for certain sensitive operations within the model.

A key enabler of this advancement is Nvidia’s Tensor Core technology, which supports mixed-precision matrix multiplications with high throughput and efficient accumulation. Tensor Cores can take low-precision inputs like FP4 while accumulating results in higher-precision formats such as FP8 or FP32, preventing overflow and maintaining numerical stability. The latest Nvidia Blackwell GPUs introduce native support for FP4 operations, allowing weights, activations, and gradients to be quantized into FP4 before matrix multiplication, which significantly reduces memory bandwidth requirements and speeds up training. Despite the headline claim of going “all the way” to FP4, some parts of the training pipeline still operate in higher precision to ensure accuracy.
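To make the accumulation idea concrete, here is a minimal NumPy sketch that simulates the pattern in software rather than invoking real Tensor Core instructions: inputs are snapped to the FP4 (E2M1) value grid with a simple per-tensor scale, while the dot products themselves are accumulated in float32. The `quantize_fp4` and `fp4_matmul` helpers and the per-tensor scaling rule are illustrative assumptions, not part of any Nvidia API.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 number (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x):
    """Snap every entry of x to the nearest FP4 value after a per-tensor scale."""
    scale = np.max(np.abs(x)) / FP4_GRID[-1]            # map the largest magnitude onto 6.0
    scaled = np.abs(x) / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx], scale

def fp4_matmul(a, b):
    """Multiply FP4-quantized inputs while accumulating the dot products in float32."""
    qa, sa = quantize_fp4(a)
    qb, sb = quantize_fp4(b)
    acc = qa.astype(np.float32) @ qb.astype(np.float32)  # high-precision accumulation
    return acc * (sa * sb)                                # undo the per-tensor scales

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
print(np.abs(fp4_matmul(a, b) - a @ b).mean())            # quantization error vs. full precision
```

The point of the sketch is the split it illustrates: the inputs carry only four bits of information each, but the sums of their products never live in FP4, which is what keeps long dot products from overflowing or drowning in rounding error.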

The video also delves into the novel numeric formats designed specifically for low-bit quantization, such as the microscaling (MX) data formats developed collaboratively by major hardware vendors. These formats quantize values in blocks, each with its own scaling factor, enabling more precise representation of normally distributed model weights compared to uniform integer quantization. Floating point 4-bit formats like MXFP4 and Nvidia’s NVFP4 use a combination of exponent and mantissa bits to better capture the distribution of values, including special values like infinities and NaNs that help signal underflows or overflows during training. This nuanced approach to quantization is crucial for maintaining model performance at such low bit widths.
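As a rough illustration of the block-scaling idea, the NumPy sketch below quantizes a tensor in 32-element blocks, each with its own power-of-two scale, onto the FP4 (E2M1) value grid. It is deliberately simplified relative to the actual OCP MX specification: the scale-selection rule and the `mx_quantize` helper are assumptions made for clarity, and special values are ignored.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 32                                                      # elements sharing one scale

def mx_quantize(x):
    """Quantize a 1-D array MX-style: one power-of-two scale per 32-element block."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Power-of-two scale chosen so the largest magnitude in each block fits under 6.0.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1] + 1e-30))
    scaled = np.abs(blocks) / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)

x = np.random.randn(4096).astype(np.float32)
print(np.abs(x - mx_quantize(x)).mean())  # per-block scales keep error small across magnitudes
```

Because each small block gets its own scale, a few large outliers only degrade the precision of their own block rather than stretching the quantization range of the entire tensor.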

On the modeling side, several innovative techniques address challenges inherent to low-precision training, particularly the bias introduced by rounding during gradient quantization. The video highlights stochastic rounding as a key method to mitigate this bias: a gradient value that falls between two representable levels is rounded up or down at random, with probability proportional to its distance from each level, preventing systematic drift in the model’s learning process. This ensures that the quantization noise averages out over time, preserving the integrity of gradient descent despite the coarse granularity of a four-bit representation. While many other tricks exist in the literature, stochastic rounding stands out as a robust and likely enduring solution.
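A minimal NumPy sketch of stochastic rounding on a uniform grid shows why this matters; the grid spacing, the `stochastic_round` helper, and the toy gradient are illustrative choices, and in FP4 training the same idea is applied to the non-uniform FP4 levels.

```python
import numpy as np

def stochastic_round(x, step=0.5, rng=np.random.default_rng(0)):
    """Round x to a grid with spacing `step`, going up or down at random with
    probability proportional to proximity, so the result is unbiased on average."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                     # fractional distance past the lower level
    return (lower + (rng.random(x.shape) < prob_up)) * step

g = np.full(100_000, 0.1)                        # small gradient, well below the grid spacing
print(stochastic_round(g).mean())                # ~0.1: the signal survives in expectation
print((np.round(g / 0.5) * 0.5).mean())          # 0.0: nearest rounding erases it entirely
```

With round-to-nearest, every update smaller than half a quantization step is silently dropped, so consistently small gradients stop contributing to learning; stochastic rounding preserves those updates in expectation at the cost of some extra noise.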

Finally, the video assesses the current state and future prospects of fully quantized FP4 training. Although no publicly available pretrained FP4 models exist yet, experimental results show that training loss and benchmark performance closely match those of BF16-trained models, with only minor gaps that can be closed through fine-tuning. The tasks evaluated tend to be classification-oriented rather than complex language generation, suggesting more research is needed to fully validate FP4 training across diverse applications. As hardware like Nvidia’s Blackwell GPUs becomes more accessible, fully quantized four-bit training is poised to become a practical and cost-effective standard in the near future, potentially revolutionizing the efficiency of training large-scale neural networks.