
Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training


Seyed Morteza Emadi

arXiv:2602.18851v1 Announce Type: new Abstract: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.

Executive Summary

This article presents an approach to stabilizing low-precision training in transformers by deriving rank-aware spectral bounds on attention logits. The authors introduce a concentration inequality that accounts for the rank of the interaction matrix, yielding tighter tail bounds and principled overflow guarantees. The resulting geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while matching downstream accuracy across a range of architectures.

Key Points

  • Derivation of rank-aware concentration inequality for attention scores
  • Introduction of geometry-aware scale factors for principled overflow guarantees
  • Application to FP8 training with promising results across various transformer architectures
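To make the scale-factor idea concrete, here is a minimal sketch of how a per-layer FP8 scale could be derived from weight geometry alone. The function name, the normalized-input assumption $\|x_i\| \approx \sqrt{d}$, and the safety margin are illustrative choices, not the paper's exact formulation; only the FP8 E4M3 maximum of 448 and the $1/\sqrt{d_h}$ logit normalization come from standard practice and the abstract.

```python
import math

def fp8_scale_from_geometry(sigma_max, d, d_h, fp8_max=448.0, margin=2.0):
    """Hypothetical per-layer FP8 scale computed from weights alone.

    sigma_max : spectral norm ||W^Q W^{K^T}||_2 of the layer.
    Assumes normalized inputs with ||x_i|| ~ sqrt(d), so the logit
    magnitude is bounded by sigma_max * d / sqrt(d_h).  The paper's
    rank-aware bound is tighter; this worst-case version just shows
    the shape of the computation.
    """
    logit_bound = sigma_max * d / math.sqrt(d_h)
    # Scale so that margin * bound maps onto the FP8 dynamic range.
    return fp8_max / (margin * logit_bound)
```

For example, `fp8_scale_from_geometry(2.0, 1024, 64)` gives `0.875`: with margin 2, a worst-case logit of 256 is mapped to the E4M3 maximum of 448.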

Merits

Improved Overflow Guarantees

The rank-aware inequality tightens concentration by roughly a factor of $d/(\gamma r)$ over rank-agnostic bounds ($8$--$28\times$ in the architectures studied), translating into overflow guarantees that hold without observing activations.
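The improvement factor can be read off the exponents directly: the rank-aware tail $\exp(-d^{2}\alpha^{2}/(\gamma r))$ versus the rank-agnostic $\exp(-d\alpha^{2})$ gives a ratio of $d/(\gamma r)$. A quick sketch, where the $\gamma$ value is a placeholder since the abstract only states $\gamma > 1$:

```python
def concentration_gain(d, r, gamma):
    """Ratio of the rank-aware tail exponent d^2 a^2 / (gamma r)
    to the rank-agnostic exponent d a^2, which simplifies to d / (gamma r)."""
    return d / (gamma * r)

# GPT-2 XL-like shapes: d = 1600, per-head rank r = d_h = 64.
# With a placeholder gamma = 1.5 this gives ~16.7x, inside the
# 8-28x range reported in the abstract.
```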

Compatibility with Existing Architectures

The geometry-aware scaling is compatible with fused attention kernels, includes a grouped query attention formulation that avoids key expansion, and requires no significant modifications to existing architectures.

Demerits

Computational Overhead

The implicit power iteration used to compute per-layer scales requires additional matrix-vector products against $W^Q$ and $W^K$, introducing some computational overhead per training step.
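The overhead in question comes from estimating $\|W^Q W^{K\top}\|_2$. A minimal sketch of the implicit power iteration, which multiplies by $W^Q$ and $W^K$ separately and never materializes the $d \times d$ product (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def spectral_norm_qk(Wq, Wk, iters=50, seed=0):
    """Estimate ||Wq @ Wk.T||_2 via power iteration on implicit matvecs.

    Wq, Wk : (d, d_h) query/key projections.  The d x d interaction
    matrix M = Wq @ Wk.T is never formed, so each step costs O(d * d_h).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Wq.shape[0])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(iters):
        u = Wq @ (Wk.T @ v)           # u = M v
        u /= np.linalg.norm(u)
        v = Wk @ (Wq.T @ u)           # v = M^T u
        sigma = np.linalg.norm(v)     # converges to the top singular value
        v /= sigma
    return sigma
```

Each iteration costs four thin matrix-vector products, so even a few dozen iterations per layer are cheap relative to a training step, and the iteration can be warm-started from the previous step's vector.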

Limited Generalizability

The bounds are derived specifically for bilinear attention logits, so the method may not transfer directly to other neural network architectures, number formats, or training scenarios.

Expert Commentary

The article presents a significant contribution to the field of deep learning, particularly in the context of low-precision training. The derivation of rank-aware spectral bounds on attention logits provides a more nuanced understanding of the underlying mechanisms and offers a promising solution to the overflow problem. The proposed geometry-aware scaling method is well-motivated and effectively demonstrated across various transformer architectures. However, further research is needed to fully explore the potential benefits and limitations of this approach.

Recommendations

  • Further investigation into the applicability of the proposed method to other neural network architectures and training scenarios
  • Exploration of potential extensions or modifications to the geometry-aware scaling method to improve its efficiency and generalizability
