Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions
arXiv:2603.06248v1
Abstract: Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this …
Aditya Varre, Mark Rofin, Nicolas Flammarion