Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
arXiv:2603.02227v1 (Announce Type: cross)
Abstract: Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a …
Keston Aquino-Michaels