
Attn-QAT: 4-Bit Attention With Quantization-Aware Training

arXiv:2603.00040v1 Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA's gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality drop from FP4 attention without explicit outlier-mitigation heuristics used in prior FP4 attention, and delivers up to a 1.5x speedup on an RTX 5090. Video demos can be found at https://drive.google.com/drive/folders/190F6xbBDUF2kGQYIcXBt3ehSYij5jlim?usp=sharing.
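To make the QAT setup concrete, the sketch below shows FP4 fake quantization applied to the attention inputs. It is a simulation under stated assumptions, not the paper's fused Triton kernels: it assumes the E2M1 FP4 value grid, per-tensor absmax scaling, and a straight-through estimator, and the helper names (fp4_fake_quant, qat_attention) are illustrative only.

    # Minimal FP4 (E2M1) fake-quantization sketch for attention QAT.
    # Assumptions: per-tensor absmax scaling, straight-through estimator;
    # names such as fp4_fake_quant and qat_attention are illustrative only.
    import torch

    # Magnitudes representable in E2M1 (sign handled separately).
    _FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def fp4_fake_quant(x: torch.Tensor) -> torch.Tensor:
        """Round x to the nearest FP4 value after per-tensor scaling; the
        result stays in x's dtype (simulated, not packed, quantization)."""
        grid = _FP4_GRID.to(device=x.device, dtype=x.dtype)
        scale = x.abs().amax().clamp(min=1e-8) / grid[-1]
        xs = (x / scale).clamp(-6.0, 6.0)
        idx = (xs.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
        q = grid[idx] * torch.sign(xs) * scale
        # Straight-through estimator: quantized values in the forward pass,
        # identity gradient in the backward pass.
        return x + (q - x).detach()

    def qat_attention(q, k, v):
        """Attention with FP4-fake-quantized operands for both matmuls."""
        qf, kf, vf = map(fp4_fake_quant, (q, k, v))
        scores = qf @ kf.transpose(-1, -2) / q.shape[-1] ** 0.5
        p = torch.softmax(scores, dim=-1)
        return fp4_fake_quant(p) @ vf

In this simulation the matmuls still run in high precision on rounded values; on FP4-capable GPUs the paper's kernels execute them natively in FP4, which is where the reported speedup comes from.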

Executive Summary

This study presents a systematic analysis of 4-bit quantization-aware training (QAT) for attention mechanisms in neural networks. The authors identify two key principles for stable FP4 attention: matching low-precision recomputation of attention scores in the backward pass, and resolving implicit precision assumptions in Flash Attention's gradient calculation. They propose Attn-QAT, which builds these principles into fused Triton training kernels and FP4 inference kernels, and show that it recovers the quality lost to FP4 attention while delivering up to a 1.5x speedup on an RTX 5090. The study underscores how sensitive attention is to precision assumptions and offers a practical route to reliable 4-bit attention.

Key Points

  • Reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on FP4-capable GPUs, and naive "drop-in" FP4 QAT for attention is unstable
  • Attn-QAT rests on two key principles for stable FP4 attention: matching low-precision recomputation of attention scores in the backward pass and resolving implicit precision assumptions in Flash Attention's gradient calculation (see the sketch after this list)
  • Attn-QAT recovers the quality lost to FP4 attention without the explicit outlier-mitigation heuristics used in prior FP4 attention work, and delivers up to a 1.5x speedup on an RTX 5090
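As a rough illustration of the first principle, the sketch below shows a backward pass that recomputes the attention probabilities from the same quantized operands the forward pass used (Flash Attention-style recomputation), rather than from high-precision values. It is an assumption-based simulation, not the paper's implementation; fake_quant is assumed to be an STE simulator such as fp4_fake_quant above, and the class and argument names are illustrative.

    # Sketch of "matching low-precision recomputation": the backward pass
    # rebuilds the attention probabilities from the same quantized Q and K
    # the forward pass used, instead of from high-precision values.
    # fake_quant is assumed to be an STE-based simulator; class and
    # argument names are illustrative, not the paper's API.
    import torch

    class MatchedRecomputeAttention(torch.autograd.Function):
        @staticmethod
        def forward(ctx, q, k, v, fake_quant):
            qq, kq, vq = fake_quant(q), fake_quant(k), fake_quant(v)
            scale = q.shape[-1] ** -0.5
            p = torch.softmax(qq @ kq.transpose(-1, -2) * scale, dim=-1)
            out = p @ vq
            # Save only the quantized operands; P is recomputed in backward.
            ctx.save_for_backward(qq, kq, vq)
            ctx.scale = scale
            return out

        @staticmethod
        def backward(ctx, dout):
            qq, kq, vq = ctx.saved_tensors
            # Recompute P from the quantized operands so the backward pass
            # sees exactly the values the forward pass produced.
            p = torch.softmax(qq @ kq.transpose(-1, -2) * ctx.scale, dim=-1)
            dv = p.transpose(-1, -2) @ dout
            dp = dout @ vq.transpose(-1, -2)
            # Softmax backward: dS = P * (dP - rowsum(dP * P)).
            ds = p * (dp - (dp * p).sum(dim=-1, keepdim=True))
            dq = ds @ kq * ctx.scale
            dk = ds.transpose(-1, -2) @ qq * ctx.scale
            # Straight-through: gradients taken at the quantized points are
            # passed back to the unquantized q, k, v directly.
            return dq, dk, dv, None

Calling MatchedRecomputeAttention.apply(q, k, v, fp4_fake_quant) behaves like the forward sketch above but, like Flash Attention, avoids storing the probability matrix between forward and backward.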

Merits

Strength in novel approach

Attn-QAT offers the first systematic treatment of 4-bit QAT for attention and resolves the instability of naive "drop-in" FP4 training, improving on prior FP4 attention methods that rely on explicit outlier-mitigation heuristics.

Demerits

Limitation in scope

The study focuses on the attention mechanism and on the diffusion and language models evaluated; its findings may not transfer directly to other network components or to other areas of computer vision and natural language processing.

Expert Commentary

This study makes a significant contribution to the field of neural networks by providing a systematic analysis of 4-bit QAT for attention mechanisms. The introduction of Attn-QAT has the potential to meaningfully advance model compression and optimization, particularly as FP4-capable hardware becomes more common. However, the study's limitations, including its focus on the attention mechanism alone, should be taken into account when evaluating its broader implications. Overall, this study is a valuable addition to the literature on 4-bit QAT and its applications.

Recommendations

  • Future studies should explore the application of Attn-QAT to other areas of computer vision and natural language processing.
  • Researchers should carefully consider the precision assumptions in attention mechanisms to ensure reliable and efficient computation.

Sources

  • arXiv:2603.00040v1 (Attn-QAT: 4-Bit Attention With Quantization-Aware Training)