Lost in Backpropagation: The LM Head is a Gradient Bottleneck
arXiv:2603.10145v1 Announce Type: new Abstract: The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
Executive Summary
The article examines the softmax bottleneck in neural language models, where the last layer projects $D$-dimensional output features to $V$-dimensional logits with $D \ll V$, limiting both expressivity and optimization. The authors show that backpropagating $V$-dimensional gradients through this rank-$D$ layer causes unavoidable compression, with 95-99% of the gradient norm suppressed empirically, altering the training feedback to most parameters and yielding suboptimal update directions. Controlled pretraining experiments show that this gradient bottleneck can make trivial patterns unlearnable and degrade LLM training dynamics, motivating new LM head designs.
Key Points
- ▸ The softmax bottleneck is both an expressivity and optimization bottleneck
- ▸ Backpropagating gradients through the output layer causes significant compression
- ▸ The gradient bottleneck affects training dynamics and contributes to inefficiencies
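The compression described above can be sketched numerically. The following is a minimal illustration (not the paper's actual experiment) of why a rank-$D$ head suppresses most of a $V$-dimensional gradient: the feature gradient $W^\top g$ only sees the component of $g$ lying in the $D$-dimensional column space of $W$, and for a random subspace that component carries roughly a $D/V$ fraction of the squared norm. The dimensions $D = 768$ and $V = 50{,}000$ are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 768, 50_000  # illustrative hidden size and vocabulary size

# Random rank-D LM head: W maps D-dimensional features to V-dimensional logits.
W = rng.standard_normal((V, D))

# A V-dimensional logit gradient (e.g. softmax probabilities minus a one-hot target).
g = rng.standard_normal(V)

# Backprop through the head computes W^T g, which only retains the component
# of g inside col(W). Project g onto that D-dimensional subspace:
Q, _ = np.linalg.qr(W)        # orthonormal basis of col(W), shape (V, D)
g_proj = Q @ (Q.T @ g)        # projection of g onto col(W)

retained = np.linalg.norm(g_proj) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of squared gradient norm retained: {retained:.4f}")
```

For a random subspace the retained fraction concentrates near $D/V \approx 0.015$, i.e. roughly 98% of the gradient norm is lost, consistent in scale with the 95-99% suppression the paper reports for trained models.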
Merits
Theoretical Analysis
The authors provide a thorough theoretical analysis of the phenomenon, offering a clear understanding of the underlying issues.
Demerits
Limited Solutions
The article focuses primarily on characterizing the problem; it offers little concrete discussion of remedies, such as candidate LM head designs that would widen the gradient pathway.
Expert Commentary
The article makes a valuable contribution to the understanding of neural language models, shedding light on a previously underappreciated issue: the softmax bottleneck constrains not just what an LM can express but how it learns. The authors' empirical measurements and controlled pretraining experiments convincingly demonstrate the gradient bottleneck's impact on training dynamics. The work would be strengthened by a deeper treatment of potential remedies, such as alternative LM head designs or modifications to the training process. Nevertheless, the findings have important implications for building more efficient language models and argue for continued research into output-layer architecture.
Recommendations
- ✓ Further research into alternative language model head designs to mitigate the gradient bottleneck
- ✓ Investigation into modifications to the training process to improve the optimization of language models