Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget
arXiv:2603.03459v1
Abstract: We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full …
Peter Balogh