
Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget


Peter Balogh

arXiv:2603.03459v1 Announce Type: new Abstract: We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that nonlinearity need cannot be predicted from token identity: cross-corpus correlation is zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution where most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement -- and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.

Executive Summary

The article explores the necessity of nonlinearity in transformer MLPs, introducing a gate to decide when to replace the full MLP with a linear surrogate. Through experiments across various models and architectures, the authors find that nonlinearity need is contextual and cannot be predicted from token identity. The gate achieves significant linear routing at minimal perplexity cost, with some layers even beating the baseline with gating. The study demonstrates the potential for optimizing transformer architectures by reallocating the MLP budget.

Key Points

  • A gate with $d+1$ parameters decides, per token, whether to replace the full MLP with a linear surrogate
  • The need for nonlinearity is fully contextual and cannot be predicted from token identity (cross-corpus correlation $r < 0.05$)
  • The gate routes 25-56% of MLP computations to the linear path at under 1% perplexity cost in GPT-2
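The routing mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function and parameter names (`gated_mlp`, `w_gate`, `W_lin`, and the zero threshold) are hypothetical, but the parameter count matches the paper's description of a gate with $d+1$ parameters ($d$ weights plus one bias) choosing between the nonlinear MLP and a linear surrogate.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation used in GPT-style MLPs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, W1, b1, W2, b2, W_lin, b_lin, w_gate, b_gate):
    """Per-token routing between the full nonlinear MLP and a linear surrogate.

    x:      (n_tokens, d) hidden states
    W1, b1: (d, d_ff), (d_ff,)  -- MLP up-projection
    W2, b2: (d_ff, d), (d,)     -- MLP down-projection
    W_lin, b_lin: (d, d), (d,)  -- linear surrogate
    w_gate, b_gate: (d,), scalar -- the d+1-parameter gate
    """
    score = x @ w_gate + b_gate            # gate acts on the contextual hidden state
    use_linear = score < 0.0               # routing decision (threshold is illustrative)
    full = gelu(x @ W1 + b1) @ W2 + b2     # nonlinear path
    lin = x @ W_lin + b_lin                # cheap linear path
    return np.where(use_linear[:, None], lin, full)
```

Because the gate reads the hidden state rather than the token embedding, this sketch is consistent with the paper's finding that routing is contextual: the same token can take different paths in different contexts.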

Merits

Efficient Use of Computational Resources

The gate enables efficient use of computational resources by replacing near-linear MLP computations with a cheap linear surrogate, at under 1% perplexity cost in GPT-2.

Improved Model Performance

Beyond efficiency, linearizing selected layers improves quality: with a full training budget, four linearized layers yield a 10.2% perplexity improvement, and a two-phase gated approach reaches 17.3%, beating a vanilla fine-tuning control.

Demerits

Architecture Dependence

The approach is architecture-dependent: Pythia models show notably higher linearization costs than GPT-2, with only one Pythia-2.8B layer narrowly beating baseline.

Limited Generalizability

The findings may not generalize to all transformer-based models or tasks.

Expert Commentary

The article presents a meaningful contribution to transformer architecture optimization, demonstrating that a tiny gate can improve model performance while reducing computational cost. The key insight is distributional: although per-instance nonlinearity need is nearly unpredictable from token identity, the distribution of MLP computations is heavily skewed toward near-linear cases, which the gate exploits. The strong architecture dependence of the results, however, means the findings should be validated on additional model families before the technique can be recommended broadly.

Recommendations

  • Further research into the generalizability of the proposed approach across different models and architectures
  • Exploration of potential applications in natural language processing and other areas where transformers are used
