Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
arXiv:2602.17080v1 Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
Executive Summary
This article proposes a new optimizer, NAMO, and its diagonal extension, NAMO-D, which integrate orthogonalized momentum with norm-based Adam-type noise adaptation. The authors demonstrate that NAMO and NAMO-D outperform AdamW and Muon baselines when pretraining GPT-2 models. NAMO-D's design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure, yielding further gains over NAMO. The article establishes optimal convergence rates for both algorithms in the deterministic setting and noise-adaptive convergence guarantees in the stochastic setting, making it a significant contribution to the field of stochastic optimization.
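To make the described update concrete, here is a minimal sketch of a NAMO-style step, assuming the abstract's description: momentum is orthogonalized (here via an exact SVD polar factor; Muon itself approximates this with a Newton–Schulz iteration) and then scaled by a single norm-based adaptive stepsize. The function names and exact formulas are illustrative assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def orthogonalize(m):
    # Polar factor via SVD: U @ Vt is the nearest semi-orthogonal matrix to m.
    # (Muon approximates this cheaply with Newton-Schulz iterations.)
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def namo_step(w, grad, m, v, lr=0.02, beta1=0.95, beta2=0.99, eps=1e-8):
    """One hypothetical NAMO update (illustrative, not the paper's exact rule):
    orthogonalized momentum scaled by ONE shared norm-based adaptive stepsize,
    so the update direction stays semi-orthogonal."""
    m = beta1 * m + (1 - beta1) * grad                        # momentum buffer
    v = beta2 * v + (1 - beta2) * np.linalg.norm(grad) ** 2   # scalar 2nd moment
    o = orthogonalize(m)                                      # semi-orthogonal direction
    w = w - (lr / (np.sqrt(v) + eps)) * o                     # single adaptive stepsize
    return w, m, v
```

Because the adaptive quantity is a single scalar (a squared Frobenius norm, Adam-style but norm-based), scaling by it preserves the orthogonality of the direction, which is the property the summary highlights.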
Key Points
- ▸ Introduction of NAMO and NAMO-D optimizers
- ▸ Integration of orthogonalized momentum with norm-based Adam-type noise adaptation
- ▸ Demonstrated superiority over AdamW and Muon baselines in pretraining GPT-2 models
Merits
Improved Convergence Rates
The authors establish optimal convergence rates for NAMO and NAMO-D in the deterministic setting, and show that in the stochastic setting their guarantees adapt to the noise level of the gradients, making both algorithms competitive with state-of-the-art optimizers.
Neuron-wise Noise Adaptation
NAMO-D's design enables neuron-wise noise adaptation, which aligns with the common near block-diagonal Hessian structure, resulting in improved performance.
Demerits
Additional Computational Cost
The orthogonalization step (e.g., Newton-Schulz iterations) makes NAMO and NAMO-D more expensive than element-wise optimizers such as AdamW, although the authors report that NAMO's overhead relative to Muon is negligible; NAMO-D also introduces an additional clamping hyperparameter to tune.
Expert Commentary
The article presents a significant contribution to the field of stochastic optimization, introducing NAMO and NAMO-D as competitive optimizers. The authors' use of orthogonalized momentum and norm-based Adam-type noise adaptation is innovative and effective, resulting in improved convergence rates and performance. However, the additional computational cost of these optimizers may be a limitation in certain applications. Overall, the article demonstrates the potential of NAMO and NAMO-D to improve the training of large language models and other deep learning applications.
Recommendations
- ✓ Further research on the application of NAMO and NAMO-D to other deep learning tasks and models
- ✓ Investigation of the potential benefits and limitations of using NAMO and NAMO-D in combination with other optimization techniques