
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

arXiv:2603.10067v1 Announce Type: new Abstract: Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.

Executive Summary

The article introduces HTMuon, an enhancement of the Muon optimizer for large language model (LLM) training that incorporates a heavy-tailed spectral correction inspired by Heavy-Tailed Self-Regularization (HT-SR) theory. The authors identify a limitation of Muon’s orthogonalized update rule: it suppresses the emergence of heavy-tailed weight spectra and over-emphasizes noise-dominated directions. HTMuon retains Muon’s ability to capture parameter interdependencies while producing heavier-tailed updates and weight spectra, improving performance on LLM pretraining and image classification benchmarks. Notably, on LLaMA pretraining with the C4 dataset, HTMuon reduces perplexity by up to 0.98 relative to Muon. Theoretical support comes from a characterization of HTMuon as steepest descent under a Schatten-$q$ norm constraint and a convergence analysis in smooth non-convex settings. The implementation is publicly available, aiding reproducibility and adoption.
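The mechanism can be illustrated with a small sketch. Muon orthogonalizes each matrix update, flattening all singular values to 1; HTMuon, per the abstract, instead produces heavier-tailed updates corresponding to steepest descent under a Schatten-$q$ norm constraint. The function below is a hypothetical illustration of that idea only; the exponent `q`, the normalization, and the function name are assumptions for exposition, not the paper's exact rule.

```python
import numpy as np

def spectral_shaped_update(grad, q=0.5, eps=1e-8):
    """Hypothetical sketch of spectral shaping of a gradient/momentum matrix.

    Muon orthogonalizes the update (every singular value -> 1).  HTMuon is
    described as steepest descent under a Schatten-q norm constraint; here we
    illustrate the spirit of that by raising singular values to a power `q`
    instead of flattening them, yielding a heavier-tailed update spectrum.
    The exponent and normalization are illustrative assumptions.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    s_shaped = (s + eps) ** q        # q=0 recovers Muon-style orthogonalization
    s_shaped /= s_shaped.max()       # scale the leading direction to 1
    return U @ np.diag(s_shaped) @ Vt
```

With `q=0` every singular value maps to 1 and the result is Muon's orthogonalized update $UV^\top$; larger `q` preserves more of the spectrum's tail, so weaker (often noise-dominated) directions receive proportionally smaller steps.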

Key Points

  • HTMuon addresses Muon’s suppression of heavy-tailed spectra via HT-SR theory

Merits

Innovation

HTMuon introduces a novel spectral correction that aligns with HT-SR theory, offering a theoretically grounded enhancement to Muon without compromising existing strengths.

Performance Improvement

Empirical results demonstrate consistent gains across multiple benchmarks, with quantifiable reductions in perplexity.

Plug-in Compatibility

HTMuon functions as a modular upgrade applicable to existing Muon variants, enhancing flexibility for practitioners.

Demerits

Implementation Complexity

While the method is theoretically robust, integrating spectral corrections may add implementation complexity for developers unfamiliar with heavy-tailed dynamics.

Limited Scope

Empirical validation is currently confined to LLM pretraining and image classification; broader applicability remains untested.

Expert Commentary

HTMuon represents a sophisticated yet pragmatic advancement in optimizer design. The authors skillfully bridge theoretical insights from HT-SR with practical applications in LLM training, avoiding the common pitfall of overfitting to empirical trends. The characterization of HTMuon as steepest descent under a Schatten-$q$ norm constraint is particularly compelling, as it provides a rigorous foundation for the observed improvements. Moreover, the availability of the implementation fosters transparency and accelerates adoption. While the current experiments are narrowly scoped, the methodological rigor and reproducibility suggest this work will influence future optimizer research, potentially inspiring analogous corrections in other domains such as computer vision or reinforcement learning. This is not merely an incremental improvement; it is a meaningful contribution to the evolution of adaptive learning algorithms.
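For reference, the Schatten-$q$ norm invoked here is the standard $\ell_q$ norm of a matrix's singular values:

```latex
\[
  \|A\|_{S_q} \;=\; \Bigl(\sum_{i} \sigma_i(A)^{q}\Bigr)^{1/q},
\]
```

where $\sigma_i(A)$ denotes the $i$-th singular value of $A$. As $q \to \infty$ this recovers the spectral norm, under which the steepest-descent direction is known to be the orthogonalized update that Muon computes; varying $q$ thus interpolates toward updates that weight the tail of the spectrum more heavily, consistent with the paper's framing.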

Recommendations

  • Researchers should validate HTMuon on additional domains beyond LLMs to assess generalizability.
  • Engineers integrating HTMuon into production systems should conduct stability tests under varying hyperparameter regimes to mitigate potential edge-case instabilities.

Sources