
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure


Yongzhong Xu

arXiv:2602.18523v1

Abstract: Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge:

(1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds.
(2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization.
(3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode.
(4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance.
(5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization.

Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
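The paper does not include code in this digest, but the experimental setup is concrete enough to sketch. Below is a minimal, hypothetical reconstruction of the dual-task data: the modulus, token layout, and train fraction are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical reconstruction of the dual-task modular arithmetic data.
# p, the token layout, and the train fraction are assumptions.
import torch

p = 97                                 # assumed prime modulus
TASK_ADD, TASK_MUL = p, p + 1          # task tokens follow operand tokens 0..p-1

def make_task(task_token, op):
    """Enumerate all (a, b) pairs for one task: sequence = [a, b, task]."""
    a = torch.arange(p).repeat_interleave(p)
    b = torch.arange(p).repeat(p)
    x = torch.stack([a, b, torch.full_like(a, task_token)], dim=1)
    y = op(a, b) % p                   # answer token the model must predict
    return x, y

x_add, y_add = make_task(TASK_ADD, lambda a, b: a + b)   # mod-add
x_mul, y_mul = make_task(TASK_MUL, lambda a, b: a * b)   # mod-mul

# Shared-trunk training mixes both tasks; a single Transformer reads
# [a, b, task] and predicts the answer from the final position.
x, y = torch.cat([x_add, x_mul]), torch.cat([y_add, y_mul])
perm = torch.randperm(len(x))
n_train = int(0.5 * len(x))            # assumed train fraction
train_idx, test_idx = perm[:n_train], perm[n_train:]
```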

Executive Summary

This article extends geometric analysis of grokking to multi-task modular arithmetic, training shared-trunk Transformers on dual-task and tri-task objectives across a systematic weight decay sweep. The authors identify five consistent phenomena: staggered grokking order, universal integrability, a distinct weight decay phase structure, holographic incompressibility, and transverse fragility with redundancy. These results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy. The study thereby clarifies how weight decay and overparameterization shape generalization in deep networks.
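To make the incompressibility claim concrete: the abstract reports that SVD truncation destroys performance. The following is a hedged sketch of that kind of probe; `model` and `evaluate` are assumed helpers, and the rank grid is illustrative, not the paper's.

```python
# Sketch of an SVD-truncation probe: keep only the top-k singular
# directions of each 2-D weight matrix, then re-measure test accuracy.
import copy
import torch

@torch.no_grad()
def svd_truncate(model, k):
    for param in model.parameters():
        if param.ndim != 2:                    # only full weight matrices
            continue
        U, S, Vh = torch.linalg.svd(param, full_matrices=False)
        S[k:] = 0.0                            # drop all but the top-k modes
        param.copy_(U @ torch.diag(S) @ Vh)

for k in (64, 32, 16, 8, 4):                   # illustrative rank grid
    probe = copy.deepcopy(model)
    svd_truncate(probe, k)
    print(k, evaluate(probe))  # incompressibility: accuracy collapses quickly
```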

Key Points

  • staggered grokking order
  • universal integrability (see the trajectory PCA sketch after this list)
  • weight decay phase structure
  • holographic incompressibility
  • transverse fragility and redundancy
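A hedged sketch of the trajectory PCA behind the universal-integrability and 4--8-direction claims. It assumes a list `checkpoints` of parameter snapshots saved during training; this logging step and all names are assumptions, not the paper's code.

```python
# Trajectory PCA sketch: stack flattened checkpoints, center them, and
# read off the explained-variance spectrum of the training trajectory.
import torch

W = torch.stack([torch.cat([p.flatten() for p in ckpt]) for ckpt in checkpoints])
W = W - W.mean(dim=0, keepdim=True)            # center the trajectory
S = torch.linalg.svdvals(W)                    # singular values of (steps, dim)
explained = S**2 / (S**2).sum()                # per-direction variance share
print(explained[:10])  # the paper reports most variance in ~4-8 modes
```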

Merits

Strength in theoretical foundation

The article is grounded in a concrete geometric framework: it analyzes optimization trajectories directly, tracking quantities such as the dimension of the execution manifold, commutator defects, and curvature depth alongside test accuracy, which yields falsifiable dynamical claims rather than post-hoc description.
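One way this geometric framing becomes operational is the transverse-deletion intervention: decompose each gradient into a component tangent to a low-dimensional trajectory basis and a transverse remainder, then shrink the latter. The sketch below is an assumed reconstruction; the basis size, deletion fraction, and names are illustrative.

```python
# Hedged sketch of transverse deletion: build an orthonormal basis Q from
# recent trajectory steps, then remove a fraction alpha of each gradient's
# component orthogonal to span(Q) before the optimizer step.
import torch

def trajectory_basis(snapshots, k=8):
    """snapshots: (n, d) flattened weights over time -> (d, k) basis."""
    deltas = snapshots[1:] - snapshots[:-1]    # steps taken in weight space
    _, _, Vh = torch.linalg.svd(deltas, full_matrices=False)
    return Vh[:k].T                            # top-k principal step directions

def delete_transverse(grad, Q, alpha=0.1):
    """Remove fraction alpha of the gradient orthogonal to span(Q)."""
    tangent = Q @ (Q.T @ grad)                 # in-manifold component
    return tangent + (1.0 - alpha) * (grad - tangent)
```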

Methodological rigor

The authors sweep weight decay systematically, replicate results across seeds, and probe final solutions with multiple independent interventions (SVD truncation, magnitude pruning, uniform scaling, and gradient-component deletion), which gives the five reported phenomena genuine empirical weight.
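A minimal skeleton of how such a sweep can be structured, assuming AdamW and hypothetical `build_model`, `train_step`, `evaluate`, and `log` helpers; the decay grid and step budget are also assumptions.

```python
# Weight decay sweep skeleton; grid, horizon, and helpers are assumed.
import torch

for wd in (0.0, 0.01, 0.03, 0.1, 0.3, 1.0):    # includes the no-decay case
    model = build_model()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    for step in range(200_000):                 # long horizon: grokking is late
        loss = train_step(model, opt)
        if step % 1_000 == 0:
            # per-task test accuracy reveals the staggered grokking order
            log(wd, step, loss, evaluate(model))
```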

Implications for deep learning architectures

The study has practical implications for regularization and compression practice: if trained solutions are compact in trajectory space yet incompressible in weight space, naive post-hoc compression of such models is likely to fail, a caution potentially relevant well beyond modular arithmetic.

Demerits

Limited scope

The study focuses on a specific class of deep neural networks (shared-trunk Transformers) and tasks (modular arithmetic), limiting the generalizability of the findings to other architectures and applications.

Data requirements

The experiments demand long training horizons (grokking appears long after training loss reaches zero), repeated across seeds, tasks, and weight decay settings; the compute budget, more than the data (which is small and synthetic), may be a barrier to replication and extension.

Expert Commentary

The article presents a comprehensive analysis of the optimization dynamics behind multi-task grokking, tying generalization to measurable geometric signals: grokking is staged across tasks, preceded by commutator defects, and produces solutions that are compact in trajectory space yet incompressible in weight space. The framing of weight decay as compression pressure and overparameterization as geometric redundancy is a useful lens for thinking about regularization. That said, the narrow experimental scope and the long training horizons required temper how readily the findings transfer to other architectures and applications.

Recommendations

  • Future studies should test whether the five phenomena persist in other architectures and task families, beyond shared-trunk Transformers on modular arithmetic, before drawing broad architectural conclusions.
  • Researchers should quantify the compute cost of the long training horizons and weight decay sweeps involved and explore strategies for reducing the resource burden on replication.
