Sparse Crosscoders for diffing MoEs and Dense models
arXiv:2603.05805v1 Announce Type: new Abstract: Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly models multiple activation spaces. We train 5-layer dense and MoE models (with equal active parameters) on 1B tokens across code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve $\sim 87\%$ fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features than the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations while dense models distribute information across broader, more general-purpose features.
Executive Summary
This article presents a systematic comparison of Mixture of Experts (MoE) and dense model internals using crosscoders, a variant of sparse autoencoders. The authors train 5-layer dense and MoE models with equal active parameters on 1B tokens spanning code, scientific text, and English stories, and analyze their internal representations. The results show that MoE models learn significantly fewer unique features than dense models, with MoE-specific features exhibiting higher activation density. The study reveals that MoEs develop more specialized, focused representations, whereas dense models distribute information across broader, more general-purpose features. The findings provide new insights into the internal workings of MoE models and their parameter-efficient scaling.
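The crosscoder setup described above can be pictured as one shared sparse dictionary with a separate decoder per model, so each feature jointly reconstructs both activation spaces; BatchTopK keeps the k largest pre-activations per example on average, pooled across the batch. The sketch below is an illustration of that idea under assumed shapes and names (`W_enc`, `batch_topk`, etc.), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, k, batch = 64, 256, 8, 32

# One shared feature dictionary; separate decoders per model, so every
# latent feature reconstructs both the MoE and the dense activations.
W_enc = rng.normal(0, 0.02, (2 * d_model, n_feats))
W_dec_moe = rng.normal(0, 0.02, (n_feats, d_model))
W_dec_dense = rng.normal(0, 0.02, (n_feats, d_model))

def batch_topk(pre_acts, k):
    """BatchTopK: keep the k * batch_size largest pre-activations pooled
    across the whole batch (k per example on average), zero the rest."""
    n_keep = k * pre_acts.shape[0]
    thresh = np.partition(pre_acts.ravel(), -n_keep)[-n_keep]
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

# Toy activations from the two models at matched token positions.
a_moe = rng.normal(size=(batch, d_model))
a_dense = rng.normal(size=(batch, d_model))

pre = np.maximum(np.concatenate([a_moe, a_dense], axis=1) @ W_enc, 0.0)
f = batch_topk(pre, k)            # sparse feature activations
recon_moe = f @ W_dec_moe         # joint reconstruction of both spaces
recon_dense = f @ W_dec_dense
```

In a real training run the encoder and both decoders would be optimized to minimize the summed reconstruction error over both activation spaces; here only the forward pass is shown.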
Key Points
- ▸ MoE models learn significantly fewer unique features compared to dense models.
- ▸ MoE-specific features exhibit higher activation density than shared features.
- ▸ Dense models distribute information across broader, more general-purpose features.
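A common way to operationalize "unique" versus "shared" features in crosscoder diffing is to compare each feature's decoder norm across the two activation spaces: a feature whose decoder is much larger in one model's space is treated as specific to that model. The toy sketch below illustrates this; the thresholds, shapes, and scale factors are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n_feats, d_model = 8, 16

# Per-feature decoder directions into each model's activation space.
W_dec_moe = rng.normal(size=(n_feats, d_model))
W_dec_dense = rng.normal(size=(n_feats, d_model))
W_dec_moe[:2] *= 50.0     # make the first two features MoE-dominated (toy data)
W_dec_dense[2:4] *= 50.0  # ...and the next two dense-dominated

# Relative decoder norm: near 1 -> MoE-specific, near 0 -> dense-specific,
# in between -> shared. The 0.9 / 0.1 cutoffs are an illustrative choice.
norm_moe = np.linalg.norm(W_dec_moe, axis=1)
norm_dense = np.linalg.norm(W_dec_dense, axis=1)
rel = norm_moe / (norm_moe + norm_dense)

labels = np.where(rel > 0.9, "moe-specific",
         np.where(rel < 0.1, "dense-specific", "shared"))
```

Given such labels, the activation-density comparison in the key points reduces to measuring, per group, the fraction of tokens on which each feature fires.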
Merits
Strength in Methodology
The authors employ a systematic and robust methodology, using crosscoders to jointly model multiple activation spaces and achieving roughly 87% fractional variance explained.
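The ~87% figure is fractional variance explained (FVE), the standard reconstruction-quality metric for sparse autoencoders and crosscoders. A minimal sketch of how it is typically computed (the paper's exact normalization may differ) is:

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """FVE = 1 - ||x - x_hat||^2 / ||x - mean(x)||^2 over a batch of
    activations; 1.0 means perfect reconstruction."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(2)
x = rng.normal(size=(500, 32))                # toy "model activations"
x_hat = x + 0.3 * rng.normal(size=x.shape)    # imperfect toy reconstruction
fve = fraction_variance_explained(x, x_hat)
```

With 0.3-scale residual noise on unit-variance data, `fve` lands near 0.91, in the same ballpark as the paper's reported reconstruction quality.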
Insightful Findings
The study provides new insights into the internal workings of MoE models and their parameter-efficient scaling, shedding light on the differences in feature organization between MoE and dense models.
Demerits
Limitation in Scalability
The study is limited to small 5-layer models trained on 1B tokens, which may not be representative of larger-scale models and datasets, and the authors do not explore the implications of their findings for more complex tasks or domains.
Lack of Theoretical Analysis
The article does not provide a theoretical analysis of the differences between MoE and dense models, leaving open questions about the underlying mechanisms driving these differences.
Expert Commentary
The article presents a well-designed and executed study that sheds new light on the internal workings of MoE models. The use of crosscoders and the analysis of feature organization provide a nuanced understanding of the differences between MoE and dense models. However, the study's limited scale and lack of theoretical analysis leave open questions about the underlying mechanisms driving these differences. Nevertheless, the findings have practical implications for architecture choice and interpretability, and the study contributes to the ongoing debate about the optimal design of deep learning architectures.
Recommendations
- ✓ Future studies should explore the implications of the findings on more complex tasks or domains, and investigate the theoretical foundations of MoE models.
- ✓ Researchers should develop more efficient and effective deep learning architectures that take into account the parameter-efficient scaling of MoE models, while also addressing the limitations of dense models.