Muon+: Towards Better Muon via One Additional Normalization Step
arXiv:2602.21545v1 Announce Type: new Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.
Executive Summary
The article proposes Muon+, an enhancement to the Muon optimizer that introduces an additional normalization step after orthogonalization. The authors demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a range of model scales and architectures, showing a consistent reduction in training and validation perplexity compared with Muon. The results highlight the potential of Muon+ to improve the training of large language models.
Key Points
- Introduction of an additional normalization step after orthogonalization in Muon+
- Extensive pre-training experiments across various model scales and architectures (GPT-style 130M–774M, LLaMA-style 60M–1B)
- Consistent reduction in training and validation perplexity compared with Muon
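To make the idea concrete, the sketch below shows the core Muon update (Newton-Schulz orthogonalization of the momentum matrix, with coefficients from the public Muon implementation) followed by one extra normalization step, which is the change Muon+ proposes. The abstract does not specify which normalization the paper uses; the RMS normalization here is purely an illustrative assumption, and `muon_plus_update` is a hypothetical helper name, not the authors' API.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient/momentum matrix via a
    quintic Newton-Schulz iteration, as in the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    X = G / (np.linalg.norm(G) + eps)  # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_plus_update(momentum, lr=0.02, eps=1e-7):
    """Muon: orthogonalize the momentum. Muon+: apply one additional
    normalization step after orthogonalization. The exact normalization is
    not given in the abstract; RMS normalization is assumed here only for
    illustration."""
    O = newton_schulz_orthogonalize(momentum)
    O = O / (np.sqrt(np.mean(O ** 2)) + eps)  # hypothetical extra normalization
    return -lr * O
```

Under this assumed normalization, every update has a fixed RMS magnitude regardless of how well the Newton-Schulz iteration converged, which would decouple the effective step size from residual scale fluctuations in the orthogonalized matrix.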
Merits
Improved Performance
Muon+ demonstrates a consistent reduction in training and validation perplexity relative to Muon, indicating its potential to enhance the pre-training of large language models.
Demerits
Computational Overhead
The additional normalization step adds computational overhead per optimizer step; the article's abstract does not report wall-clock cost, so the net effect on training efficiency remains to be quantified.
Expert Commentary
The article makes a focused contribution to optimization techniques for large language models. Adding a normalization step after orthogonalization in Muon+ reflects the interplay between orthogonalization and update scaling in determining optimizer behavior. The pre-training experiments, spanning GPT-style models (130M–774M) and LLaMA-style models (60M–1B) in both the compute-optimal regime and at a token-to-parameter ratio of roughly 200, provide strong evidence for the effectiveness of Muon+. That said, further research is needed to characterize when the extra normalization helps, and to extend the evaluation beyond perplexity and beyond the scales tested.
Recommendations
- Further research should explore Muon+ in downstream applications, such as fine-tuning and instruction-following, beyond pre-training perplexity.
- Development of more efficient and effective optimization techniques should weigh the computational overhead and potential limitations of the additional normalization step.