
Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds


Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

arXiv:2602.17798v1 Announce Type: new Abstract: Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob -- the concentration matrix $\Lambda$ -- that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-$k$ mass, and an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. On synthetic routing tasks, a 350M-parameter MoE language model with 8 experts, a 1.3B-parameter model with 16 experts, and a 2.7B-parameter model with 32 experts, GrMoE achieves 0\% routing collapse across all seeds, comparable or better perplexity with 15--30\% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.

Executive Summary

The article presents Grassmannian MoE (GrMoE), a novel routing framework for Mixture-of-Experts (MoE) models that operates on the Grassmannian manifold of subspaces. Gating weights arise from the concentration parameters of Matrix Bingham distributions, so a single concentration matrix Λ continuously controls routing entropy, replacing discrete top-k selection with a smooth, geometrically principled sparsity mechanism. An amortized variational inference procedure yields posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. Experiments on synthetic routing tasks and MoE language models from 350M to 2.7B parameters show zero routing collapse across all seeds, comparable or better perplexity with 15-30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that permits post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values correlated with linguistic specialization, giving interpretable routing behavior. The article also proves tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-k mass, and expert collapse, establishing a formal theory of concentration-controlled sparsity.
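The core idea of a single knob that trades routing entropy against sparsity can be illustrated with a toy sketch. Here a scalar `lam` stands in for the concentration matrix Λ, and plain temperature-scaled softmax stands in for the Matrix Bingham gate on the Grassmannian, which this sketch does not implement; it only demonstrates the entropy-vs-concentration relationship the summary describes. All names below are hypothetical.

```python
import numpy as np

def gate(scores, lam):
    # Softmax gating sharpened by a scalar concentration lam -- a simplified
    # stand-in for the Matrix Bingham concentration matrix in the paper.
    z = lam * scores
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Fixed router scores for 8 hypothetical experts.
scores = np.array([2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5])

# Routing entropy falls monotonically as the concentration knob is raised:
# small lam -> near-uniform (dense) routing, large lam -> near-one-hot (sparse).
ents = [entropy(gate(scores, lam)) for lam in (0.1, 1.0, 10.0)]
assert ents[0] > ents[1] > ents[2]
```

The monotone entropy decrease is the property GrMoE's theory makes precise: the paper's bounds tie the concentration spectrum directly to routing entropy and top-k mass, whereas this sketch only exhibits the qualitative behavior.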

Key Points

  • GrMoE operates on the Grassmannian manifold of subspaces, introducing a concentration-controlled sparsity mechanism.
  • The concentration matrix Λ controls routing entropy, replacing discrete top-k selection with a smooth, geometrically principled sparsity mechanism.
  • GrMoE enables uncertainty-aware expert assignment and resists expert collapse.
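One practical consequence of the smooth concentration-sparsity relationship is post-hoc sparsity tuning: because effective sparsity varies monotonically with the knob, a target sparsity level can be dialed in at inference time without retraining. The sketch below illustrates this with the same simplified scalar-temperature stand-in (not the paper's Matrix Bingham gate); `effective_experts`, the exponential of the gate entropy, is an assumed smooth sparsity measure, and all names are hypothetical.

```python
import numpy as np

def gate(scores, lam):
    # Scalar-temperature softmax as a stand-in for the Lambda-controlled gate.
    z = lam * scores - (lam * scores).max()
    p = np.exp(z)
    return p / p.sum()

def effective_experts(p):
    # exp(entropy): a smooth "effective number of experts" carrying mass.
    q = p[p > 1e-12]
    return float(np.exp(-(q * np.log(q)).sum()))

scores = np.array([2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5])

# Sweep the knob at inference time -- no retraining involved.
lams = np.linspace(0.1, 20.0, 50)
eff = [effective_experts(gate(scores, l)) for l in lams]
assert all(a >= b for a, b in zip(eff, eff[1:]))  # sparsity tightens monotonically

# Pick the smallest concentration whose gate is about as sparse as top-2 routing.
lam_star = next(l for l, e in zip(lams, eff) if e <= 2.0)
```

Because the mapping from concentration to effective sparsity is monotone, this one-dimensional search is well defined; in GrMoE the analogous tuning operates on the concentration spectrum of Λ rather than a single scalar.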

Merits

Strength in theoretical foundations

The article provides a formal theory of concentration-controlled sparsity, establishing tight bounds relating the Bingham concentration spectrum to routing entropy and expert collapse, which is a significant contribution to the field.

Demerits

Limitation in applicability

The evaluation is confined to synthetic routing tasks and language models of up to 2.7B parameters; whether the method generalizes to larger scales, other modalities, or downstream tasks remains to be established.

Expert Commentary

The article makes a significant contribution to MoE research: replacing discrete top-k selection with a concentration-controlled gate that resists expert collapse and supports uncertainty-aware assignment addresses two persistent failure modes of standard softmax routing. The main caveat is scope. The evaluation centers on synthetic routing tasks and moderate-scale language models, so behavior at larger scales and in other domains remains open. Overall, the article is a valuable addition to the literature on MoE routing, combining formal theory with interpretable empirical behavior.

Recommendations

  • Evaluate the method on larger models and real-world benchmarks beyond synthetic routing tasks to establish its broader applicability.
  • Explore applications in other domains, such as downstream natural language processing tasks and computer vision, where interpretable routing may be especially valuable.
