On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
arXiv:2602.15322v1 Announce Type: new Abstract: Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
Executive Summary
This article examines a new approach to optimizing large language models (LLMs). The authors first show that randomly masking parameter updates can be highly effective: a masked variant of RMSProp consistently outperforms recent state-of-the-art optimizers. Their analysis attributes this to a curvature-dependent geometric regularization that smooths the optimization trajectory. Building on this finding, they introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Magma serves as a simple drop-in replacement for adaptive optimizers, delivering consistent gains with negligible computational overhead; at the 1B model size, it reduces perplexity by over 19% and 9% relative to Adam and Muon, respectively. These results have notable implications for optimizer design in deep learning and language modeling.
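The masked-RMSProp idea can be illustrated with a short sketch. Everything below is an assumption for illustration only: the mask is drawn per coordinate with a fixed keep probability `keep_prob` and applied after RMSProp preconditioning; the paper's actual masking rate, schedule, and placement may differ.

```python
import numpy as np

def masked_rmsprop_step(param, grad, sq_avg, lr=1e-3, alpha=0.99,
                        eps=1e-8, keep_prob=0.5, rng=None):
    """One RMSProp step with a random per-coordinate update mask (illustrative)."""
    rng = rng or np.random.default_rng()
    # Standard RMSProp second-moment accumulator.
    sq_avg = alpha * sq_avg + (1.0 - alpha) * grad**2
    update = grad / (np.sqrt(sq_avg) + eps)
    # Each coordinate is updated with probability keep_prob, frozen otherwise.
    mask = rng.random(param.shape) < keep_prob
    return param - lr * mask * update, sq_avg
```

On a toy quadratic this still converges: each coordinate is simply updated on average once every 1/keep_prob steps, while the accumulator keeps tracking the full gradient.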
Key Points
- Introduction of Momentum-aligned gradient masking (Magma) as a novel optimization technique
- Demonstration of Magma's superior performance compared to recent state-of-the-art optimizers
- Analysis of the curvature-dependent geometric regularization induced by random masking
- Magma's potential as a simple drop-in replacement for adaptive optimizers
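The abstract states only that Magma "modulates the masked updates using momentum-gradient alignment"; the exact rule is not given there. The sketch below is one plausible reading, assuming the cosine similarity between the momentum buffer and the fresh gradient sets the per-step keep probability. This is a hypothetical formulation for intuition, not the paper's algorithm.

```python
import numpy as np

def magma_like_step(param, grad, momentum, sq_avg, lr=1e-3, beta=0.9,
                    alpha=0.99, eps=1e-8, rng=None):
    """Masked adaptive step gated by momentum-gradient alignment (illustrative)."""
    rng = rng or np.random.default_rng()
    momentum = beta * momentum + (1.0 - beta) * grad       # EMA of gradients
    sq_avg = alpha * sq_avg + (1.0 - alpha) * grad**2      # RMSProp accumulator
    update = momentum / (np.sqrt(sq_avg) + eps)
    # Alignment in [-1, 1]: high when momentum agrees with the current gradient.
    cos = float(momentum.ravel() @ grad.ravel()) / (
        np.linalg.norm(momentum) * np.linalg.norm(grad) + eps)
    keep_prob = 0.5 * (1.0 + cos)  # aligned -> mask fewer coordinates
    mask = rng.random(param.shape) < keep_prob
    return param - lr * mask * update, momentum, sq_avg
```

Under this reading, when momentum and gradient agree (a smooth stretch of the loss landscape) most coordinates are updated, and when they conflict (high curvature or noise) more of the update is masked out.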
Merits
Strength in novel approach
Magma departs from the prevailing trend of ever-more-sophisticated dense preconditioners, showing that a far simpler mechanism, random masking of updates, can be competitive. This fresh perspective could drive further work on lightweight optimizer design.
Robust experimental results
The study's extensive LLM pre-training experiments provide strong evidence for Magma's effectiveness, with consistent gains and negligible computational overhead.
Demerits
Limited scope of evaluation
The evaluation is limited to LLM pre-training, so it remains unclear whether Magma's gains carry over to fine-tuning or to other deep learning domains such as vision.
Lack of theoretical justification
While the authors provide empirical evidence and an analysis of the induced regularization, a more complete theoretical account of why momentum-gradient alignment is an effective modulation signal is still needed to fully justify Magma's use.
Expert Commentary
Magma represents a meaningful advance in deep learning optimization. By showing that random masking induces a curvature-dependent geometric regularization, and then shaping that masking with momentum-gradient alignment, the authors open a new direction for optimizing LLMs beyond increasingly elaborate preconditioners. The findings are promising, but further work is needed to characterize the underlying mechanism and to test Magma beyond pre-training. The emphasis on extensive experiments and the willingness to challenge the dominant dense-optimizer paradigm are both welcome, and it will be interesting to see how masking-based techniques shape future language models and training pipelines.
Recommendations
- Further investigation into Magma's theoretical foundations and potential applications beyond LLM pre-training
- Exploration of Magma's compatibility with other deep learning architectures and tasks