
Muon+: Towards Better Muon via One Additional Normalization Step

arXiv:2602.21545v1 Announce Type: new Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Executive Summary

The article proposes Muon+, an enhancement to the Muon optimizer that introduces an additional normalization step after orthogonalization. The authors demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a range of model scales and architectures, showing a consistent reduction in training and validation perplexity relative to Muon. The results suggest Muon+ is a low-cost way to improve pre-training of large language models.

Key Points

  • Introduction of an additional normalization step in Muon+
  • Extensive pre-training experiments across various model scales and architectures
  • Consistent reduction in training and validation perplexity relative to Muon
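The abstract describes Muon+ only at a high level: orthogonalize the momentum as in Muon, then apply one additional normalization. The sketch below is a hedged illustration, not the authors' implementation (see the linked repo for the exact update): it uses the Newton-Schulz iteration found in public Muon implementations and *assumes* the extra step is an RMS (Frobenius-norm) normalization of the orthogonalized update.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used in public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_plus_update(momentum, steps=5, eps=1e-7):
    """Hypothetical Muon+ step: orthogonalize, then normalize.
    The paper does not specify the norm here; this sketch ASSUMES
    RMS normalization (Frobenius norm divided by sqrt(numel))."""
    O = newton_schulz_orthogonalize(momentum, steps)
    return O / (np.linalg.norm(O) / np.sqrt(O.size) + eps)
```

A driver would subtract a learning rate times `muon_plus_update(momentum)` from each matrix-shaped parameter; after the assumed normalization, every update direction has unit RMS regardless of the momentum's scale.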

Merits

Improved Performance

Muon+ demonstrates a consistent reduction in training and validation perplexity relative to Muon, indicating its potential for improving the pre-training of large language models.

Demerits

Computational Overhead

The additional normalization step adds computation to each update, though it is likely small relative to the Newton-Schulz orthogonalization Muon already performs; the article does not quantify this overhead.

Expert Commentary

The article presents a meaningful contribution to optimization techniques for large language models. Adding a normalization step after orthogonalization reflects a clear understanding of how both operations shape the update geometry. The pre-training experiments, spanning GPT-style and LLaMA-style models from 60M to 1B parameters and token-to-parameter ratios up to roughly 200, provide solid evidence for the effectiveness of Muon+. Further research is still needed to establish how the gains hold up at larger scales and in other domains.

Recommendations

  • Further research should be conducted to explore the potential of Muon+ in various applications, such as natural language processing and text generation.
  • The development of more efficient and effective optimization techniques should be prioritized, taking into account the computational overhead and potential limitations of Muon+.
