Lost in Backpropagation: The LM Head is a Gradient Bottleneck
arXiv:2603.10145v1 Announce Type: new Abstract: The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
Executive Summary
The article examines the softmax bottleneck in neural language models, where the last layer projects $D$-dimensional output features to $V$-dimensional logits with $D \ll V$, limiting both expressivity and optimization. The authors show that backpropagating $V$-dimensional gradients through this rank-$D$ layer causes unavoidable compression, with 95-99% of the gradient norm suppressed empirically, altering the training feedback to most parameters and yielding suboptimal update directions. Controlled pretraining experiments show that this gradient bottleneck can make trivial patterns unlearnable and degrade LLM training dynamics, motivating new LM head designs.
Key Points
- ▸ The softmax bottleneck is both an expressivity and optimization bottleneck
- ▸ Backpropagating gradients through the output layer causes significant compression
- ▸ The gradient bottleneck affects training dynamics and contributes to inefficiencies
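The compression described above can be sketched numerically. The following is a minimal illustration (not the paper's actual experiment) of why a rank-$D$ head suppresses most of a $V$-dimensional gradient: the feature gradient $W^\top g$ only sees the component of $g$ lying in the $D$-dimensional column space of $W$, and for a random subspace that component carries roughly a $D/V$ fraction of the squared norm. The dimensions $D = 768$ and $V = 50{,}000$ are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 768, 50_000  # illustrative hidden size and vocabulary size

# Random rank-D LM head: W maps D-dimensional features to V-dimensional logits.
W = rng.standard_normal((V, D))

# A V-dimensional logit gradient (e.g. softmax probabilities minus a one-hot target).
g = rng.standard_normal(V)

# Backprop through the head computes W^T g, which only retains the component
# of g inside col(W). Project g onto that D-dimensional subspace:
Q, _ = np.linalg.qr(W)        # orthonormal basis of col(W), shape (V, D)
g_proj = Q @ (Q.T @ g)        # projection of g onto col(W)

retained = np.linalg.norm(g_proj) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of squared gradient norm retained: {retained:.4f}")
```

For a random subspace the retained fraction concentrates near $D/V \approx 0.015$, i.e. roughly 98% of the gradient norm is lost, consistent in scale with the 95-99% suppression the paper reports for trained models.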
Merits
Theoretical Analysis
The authors provide a thorough theoretical analysis of the phenomenon, offering a clear understanding of the underlying issues.
Demerits
Limited Solutions
The article focuses primarily on characterizing the problem; it offers little concrete discussion of remedies, such as candidate LM head designs that would widen the gradient pathway.
Expert Commentary
The article makes a valuable contribution to the understanding of neural language models, shedding light on a previously underappreciated issue: the softmax bottleneck constrains not just what an LM can express but how it learns. The authors' empirical measurements and controlled pretraining experiments convincingly demonstrate the gradient bottleneck's impact on training dynamics. The work would be strengthened by a deeper treatment of potential remedies, such as alternative LM head designs or modifications to the training process. Nevertheless, the findings have important implications for building more efficient language models and argue for continued research into output-layer architecture.
Recommendations
- ✓ Further research into alternative language model head designs to mitigate the gradient bottleneck
- ✓ Investigation into modifications to the training process to improve the optimization of language models