Sparse Crosscoders for diffing MoEs and Dense models
arXiv:2603.05805v1 Announce Type: new Abstract: Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly models multiple activation spaces. We train 5-layer dense and MoE models (with equal active parameters) on 1B tokens across code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve $\sim 87\%$ fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features than the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations while dense models distribute information across broader, more general-purpose features.
Executive Summary
This article presents a systematic comparison of Mixture of Experts (MoE) and dense model internals using crosscoders, a variant of sparse autoencoders. The authors train 5-layer dense and MoE models with equal active parameters on 1B tokens spanning code, scientific text, and English stories, and analyze their internal representations. The results show that MoE models learn significantly fewer unique features than dense models, with MoE-specific features exhibiting higher activation density. The study reveals that MoEs develop more specialized, focused representations, whereas dense models distribute information across broader, more general-purpose features. The findings provide new insights into the internal workings of MoE models and their parameter-efficient scaling.
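The crosscoder setup described above can be pictured as one shared sparse dictionary with a separate decoder per model, so each feature jointly reconstructs both activation spaces; BatchTopK keeps the k largest pre-activations per example on average, pooled across the batch. The sketch below is an illustration of that idea under assumed shapes and names (`W_enc`, `batch_topk`, etc.), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, k, batch = 64, 256, 8, 32

# One shared feature dictionary; separate decoders per model, so every
# latent feature reconstructs both the MoE and the dense activations.
W_enc = rng.normal(0, 0.02, (2 * d_model, n_feats))
W_dec_moe = rng.normal(0, 0.02, (n_feats, d_model))
W_dec_dense = rng.normal(0, 0.02, (n_feats, d_model))

def batch_topk(pre_acts, k):
    """BatchTopK: keep the k * batch_size largest pre-activations pooled
    across the whole batch (k per example on average), zero the rest."""
    n_keep = k * pre_acts.shape[0]
    thresh = np.partition(pre_acts.ravel(), -n_keep)[-n_keep]
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

# Toy activations from the two models at matched token positions.
a_moe = rng.normal(size=(batch, d_model))
a_dense = rng.normal(size=(batch, d_model))

pre = np.maximum(np.concatenate([a_moe, a_dense], axis=1) @ W_enc, 0.0)
f = batch_topk(pre, k)            # sparse feature activations
recon_moe = f @ W_dec_moe         # joint reconstruction of both spaces
recon_dense = f @ W_dec_dense
```

In a real training run the encoder and both decoders would be optimized to minimize the summed reconstruction error over both activation spaces; here only the forward pass is shown.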
Key Points
- ▸ MoE models learn significantly fewer unique features compared to dense models.
- ▸ MoE-specific features exhibit higher activation density than shared features.
- ▸ Dense models distribute information across broader, more general-purpose features.
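A common way to operationalize "unique" versus "shared" features in crosscoder diffing is to compare each feature's decoder norm across the two activation spaces: a feature whose decoder is much larger in one model's space is treated as specific to that model. The toy sketch below illustrates this; the thresholds, shapes, and scale factors are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n_feats, d_model = 8, 16

# Per-feature decoder directions into each model's activation space.
W_dec_moe = rng.normal(size=(n_feats, d_model))
W_dec_dense = rng.normal(size=(n_feats, d_model))
W_dec_moe[:2] *= 50.0     # make the first two features MoE-dominated (toy data)
W_dec_dense[2:4] *= 50.0  # ...and the next two dense-dominated

# Relative decoder norm: near 1 -> MoE-specific, near 0 -> dense-specific,
# in between -> shared. The 0.9 / 0.1 cutoffs are an illustrative choice.
norm_moe = np.linalg.norm(W_dec_moe, axis=1)
norm_dense = np.linalg.norm(W_dec_dense, axis=1)
rel = norm_moe / (norm_moe + norm_dense)

labels = np.where(rel > 0.9, "moe-specific",
         np.where(rel < 0.1, "dense-specific", "shared"))
```

Given such labels, the activation-density comparison in the key points reduces to measuring, per group, the fraction of tokens on which each feature fires.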
Merits
Strength in Methodology
The authors employ a systematic and robust methodology, using crosscoders to jointly model multiple activation spaces and achieving roughly 87% fractional variance explained.
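The ~87% figure is fractional variance explained (FVE), the standard reconstruction-quality metric for sparse autoencoders and crosscoders. A minimal sketch of how it is typically computed (the paper's exact normalization may differ) is:

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """FVE = 1 - ||x - x_hat||^2 / ||x - mean(x)||^2 over a batch of
    activations; 1.0 means perfect reconstruction."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(2)
x = rng.normal(size=(500, 32))                # toy "model activations"
x_hat = x + 0.3 * rng.normal(size=x.shape)    # imperfect toy reconstruction
fve = fraction_variance_explained(x, x_hat)
```

With 0.3-scale residual noise on unit-variance data, `fve` lands near 0.91, in the same ballpark as the paper's reported reconstruction quality.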
Insightful Findings
The study provides new insights into the internal workings of MoE models and their parameter-efficient scaling, shedding light on the differences in feature organization between MoE and dense models.
Demerits
Limitation in Scalability
The study is limited to small 5-layer models trained on 1B tokens, which may not be representative of larger-scale models and datasets, and the authors do not explore the implications of their findings for more complex tasks or domains.
Lack of Theoretical Analysis
The article does not provide a theoretical analysis of the differences between MoE and dense models, leaving open questions about the underlying mechanisms driving these differences.
Expert Commentary
The article presents a well-designed and executed study that sheds new light on the internal workings of MoE models. The use of crosscoders and the analysis of feature organization provide a nuanced understanding of the differences between MoE and dense models. However, the study's limited scale and lack of theoretical analysis leave open questions about the underlying mechanisms driving these differences. Nevertheless, the findings have practical implications for architecture choice and interpretability, and the study contributes to the ongoing debate about the optimal design of deep learning architectures.
Recommendations
- ✓ Future studies should explore the implications of the findings on more complex tasks or domains, and investigate the theoretical foundations of MoE models.
- ✓ Researchers should develop more efficient and effective deep learning architectures that take into account the parameter-efficient scaling of MoE models, while also addressing the limitations of dense models.