
Muon+: Towards Better Muon via One Additional Normalization Step

arXiv:2602.21545v1 Announce Type: new Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Executive Summary

The article proposes Muon+, an enhancement to the Muon optimizer that introduces an additional normalization step after orthogonalization. The authors demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a range of model scales and architectures, showing a consistent reduction in training and validation perplexity relative to Muon. The results suggest Muon+ is a low-cost way to improve pre-training of large language models.

Key Points

  • Introduction of an additional normalization step in Muon+
  • Extensive pre-training experiments across various model scales and architectures
  • Consistent reduction in training and validation perplexity relative to Muon
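The abstract describes Muon+ only at a high level: orthogonalize the momentum as in Muon, then apply one additional normalization. The sketch below is a hedged illustration, not the authors' implementation (see the linked repo for the exact update): it uses the Newton-Schulz iteration found in public Muon implementations and *assumes* the extra step is an RMS (Frobenius-norm) normalization of the orthogonalized update.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used in public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_plus_update(momentum, steps=5, eps=1e-7):
    """Hypothetical Muon+ step: orthogonalize, then normalize.
    The paper does not specify the norm here; this sketch ASSUMES
    RMS normalization (Frobenius norm divided by sqrt(numel))."""
    O = newton_schulz_orthogonalize(momentum, steps)
    return O / (np.linalg.norm(O) / np.sqrt(O.size) + eps)
```

A driver would subtract a learning rate times `muon_plus_update(momentum)` from each matrix-shaped parameter; after the assumed normalization, every update direction has unit RMS regardless of the momentum's scale.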

Merits

Improved Performance

Muon+ demonstrates a consistent reduction in training and validation perplexity relative to Muon, indicating its potential for improving the pre-training of large language models.

Demerits

Computational Overhead

The additional normalization step adds computation to each update, though it is likely small relative to the Newton-Schulz orthogonalization Muon already performs; the article does not quantify this overhead.

Expert Commentary

The article presents a meaningful contribution to optimization techniques for large language models. Adding a normalization step after orthogonalization reflects a clear understanding of how both operations shape the update geometry. The pre-training experiments, spanning GPT-style and LLaMA-style models from 60M to 1B parameters and token-to-parameter ratios up to roughly 200, provide solid evidence for the effectiveness of Muon+. Further research is still needed to establish how the gains hold up at larger scales and in other domains.

Recommendations

  • Further research should be conducted to explore the potential of Muon+ in various applications, such as natural language processing and text generation.
  • The development of more efficient and effective optimization techniques should be prioritized, taking into account the computational overhead and potential limitations of Muon+.
