Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
arXiv:2602.17080v1 Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
Executive Summary
This article proposes a new optimizer, NAMO, and its diagonal extension, NAMO-D, which integrate orthogonalized momentum with norm-based Adam-type noise adaptation. The authors demonstrate that NAMO and NAMO-D outperform AdamW and Muon baselines when pretraining GPT-2 models. NAMO-D's design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure, yielding further gains over NAMO. The article establishes optimal convergence rates for both algorithms in the deterministic setting and noise-adaptive convergence guarantees in the stochastic setting, making it a significant contribution to the field of stochastic optimization.
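To make the described update concrete, here is a minimal sketch of a NAMO-style step, assuming the abstract's description: momentum is orthogonalized (here via an exact SVD polar factor; Muon itself approximates this with a Newton–Schulz iteration) and then scaled by a single norm-based adaptive stepsize. The function names and exact formulas are illustrative assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def orthogonalize(m):
    # Polar factor via SVD: U @ Vt is the nearest semi-orthogonal matrix to m.
    # (Muon approximates this cheaply with Newton-Schulz iterations.)
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def namo_step(w, grad, m, v, lr=0.02, beta1=0.95, beta2=0.99, eps=1e-8):
    """One hypothetical NAMO update (illustrative, not the paper's exact rule):
    orthogonalized momentum scaled by ONE shared norm-based adaptive stepsize,
    so the update direction stays semi-orthogonal."""
    m = beta1 * m + (1 - beta1) * grad                        # momentum buffer
    v = beta2 * v + (1 - beta2) * np.linalg.norm(grad) ** 2   # scalar 2nd moment
    o = orthogonalize(m)                                      # semi-orthogonal direction
    w = w - (lr / (np.sqrt(v) + eps)) * o                     # single adaptive stepsize
    return w, m, v
```

Because the adaptive quantity is a single scalar (a squared Frobenius norm, Adam-style but norm-based), scaling by it preserves the orthogonality of the direction, which is the property the summary highlights.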
Key Points
- ▸ Introduction of NAMO and NAMO-D optimizers
- ▸ Integration of orthogonalized momentum with norm-based Adam-type noise adaptation
- ▸ Demonstrated superiority over AdamW and Muon baselines in pretraining GPT-2 models
Merits
Improved Convergence Rates
The authors establish optimal convergence rates for NAMO and NAMO-D in the deterministic setting, and show that in the stochastic setting their guarantees adapt to the noise level of the gradients, making both algorithms competitive with state-of-the-art optimizers.
Neuron-wise Noise Adaptation
NAMO-D's design enables neuron-wise noise adaptation, which aligns with the common near block-diagonal Hessian structure, resulting in improved performance.
Demerits
Additional Computational Cost
The orthogonalization step (e.g., Newton-Schulz iterations) makes NAMO and NAMO-D more expensive than element-wise optimizers such as AdamW, although the authors report that NAMO's overhead relative to Muon is negligible; NAMO-D also introduces an additional clamping hyperparameter to tune.
Expert Commentary
The article presents a significant contribution to the field of stochastic optimization, introducing NAMO and NAMO-D as competitive optimizers. The authors' use of orthogonalized momentum and norm-based Adam-type noise adaptation is innovative and effective, resulting in improved convergence rates and performance. However, the additional computational cost of these optimizers may be a limitation in certain applications. Overall, the article demonstrates the potential of NAMO and NAMO-D to improve the training of large language models and other deep learning applications.
Recommendations
- ✓ Further research on the application of NAMO and NAMO-D to other deep learning tasks and models
- ✓ Investigation of the potential benefits and limitations of using NAMO and NAMO-D in combination with other optimization techniques