Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
arXiv:2603.02217v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
Executive Summary
This article addresses the deployment-time memory bottleneck created by the massive parameter footprint of Mixture-of-Experts (MoE) models. The authors organize retraining-free MoE compression into three paradigms (Expert Pruning, Expert Editing, and Expert Merging) and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch, which arises when experts are changed but the router is left untouched. To address this, they propose Router Knowledge Distillation (Router KD), a method that keeps expert parameters frozen and updates only the router by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained ones, attributed to their more complex routing decision boundaries. The findings highlight the importance of router calibration for efficient MoE compression.
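As a rough illustration of the Router KD idea, here is a toy numpy sketch (not the paper's implementation; the layer sizes, the dense soft gating, and the finite-difference optimizer are all simplifying assumptions). A teacher distribution comes from the full model; one expert is then pruned, and only the router weights are trained to minimize the KL divergence to the teacher's next-token distribution, with expert parameters frozen throughout:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy dense-gated MoE layer; all sizes are illustrative.
d, E, V, B = 8, 4, 16, 32
X = rng.normal(size=(B, d))                  # unlabeled calibration batch
experts = rng.normal(size=(E, d, V)) * 0.3   # frozen expert parameters
W_full = rng.normal(size=(d, E))             # original model's router

def moe_probs(W, expert_mask):
    gates = softmax(X @ W) * expert_mask     # zero out pruned experts
    gates = gates / gates.sum(-1, keepdims=True)
    per_expert = np.einsum('bd,edv->bev', X, experts)
    return softmax(np.einsum('be,bev->bv', gates, per_expert))

# Teacher: the uncompressed model's next-token distribution.
teacher = moe_probs(W_full, np.ones(E))

# Compress: drop expert 0; the router starts out untouched.
mask = np.ones(E); mask[0] = 0.0
W_router = W_full.copy()                     # the ONLY trainable weights

def kl(W):
    p = moe_probs(W, mask)
    return np.mean((teacher * (np.log(teacher) - np.log(p + 1e-12))).sum(-1))

# Router KD as numerical-gradient descent on router weights only.
lr, eps = 0.5, 1e-4
loss0 = kl(W_router)
for _ in range(40):
    base = kl(W_router)
    g = np.zeros_like(W_router)
    for i in range(d):
        for e in range(E):
            Wp = W_router.copy(); Wp[i, e] += eps
            g[i, e] = (kl(Wp) - base) / eps
    W_router -= lr * g
```

After calibration, `kl(W_router)` is lower than the initial `loss0`: the recalibrated router compensates for the missing expert without touching any expert weights. A real implementation would use backpropagation and the model's actual top-k routing rather than finite differences and dense gating.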
Key Points
- ▸ MoE models have a massive parameter footprint that creates a deployment-time memory bottleneck
- ▸ Router-expert mismatch (experts changed, router left untouched) is a major driver of persistent post-compression degradation
- ▸ Router Knowledge Distillation (Router KD) recovers performance by updating only the router, distilling the original model's next-token distribution on unlabeled calibration data
- ▸ Gains are substantially larger for fine-grained MoEs (many small experts) than coarse-grained ones, owing to more complex routing decision boundaries
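To make the router-expert mismatch concrete, a minimal numpy sketch (all shapes and weights are illustrative, not from the paper): pruning an expert while keeping the original router's gates shifts the layer's output, because the stale router still allocates probability mass as if the removed expert were present.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy dense-gated MoE layer (hypothetical sizes, not from the paper).
d, E = 8, 4
x = rng.normal(size=d)
W_router = rng.normal(size=(d, E))     # router weights, left untouched
experts = rng.normal(size=(E, d))      # row e: expert e's output for x

gates = softmax(x @ W_router)
full_out = gates @ experts             # output of the uncompressed layer

# Expert Pruning: drop the least-used expert, renormalize surviving gates.
drop = gates.argmin()
keep = np.arange(E) != drop
pruned_gates = gates[keep] / gates[keep].sum()
pruned_out = pruned_gates @ experts[keep]

# The stale router misallocates mass, so the outputs diverge.
mismatch = np.linalg.norm(full_out - pruned_out)
```

The nonzero `mismatch` is exactly what the paper attributes degradation to: the expert set changed, but the routing function did not, which motivates lightweight router calibration rather than leaving the router frozen.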
Merits
Strength
The article proposes a novel solution to the neglected issue of router-expert mismatch, which is a significant contribution to the field of MoE compression.
Methodological Rigor
The authors conduct experiments across representative methods in all three paradigms, demonstrating that Router KD consistently recovers performance lost to compression.
Demerits
Limitation
Router KD requires running the original, uncompressed model to produce teacher next-token distributions, so calibration itself still incurs the full model's memory and compute cost, which may be impractical in some deployment pipelines.
Scalability
The method depends on unlabeled calibration data, and gathering enough representative data may become a bottleneck for large-scale MoE models.
Expert Commentary
The article makes a significant contribution to MoE compression by isolating router-expert mismatch as a key, previously neglected source of post-compression degradation, and its evaluation across representative methods in all three paradigms is methodologically sound. Two caveats temper the results: calibration requires running the original uncompressed model to produce teacher distributions, and recovery depends on access to suitable unlabeled calibration data, both of which may limit applicability in some deployment settings. Even so, the findings have clear practical implications: practitioners deploying MoE models under memory constraints should treat lightweight router calibration as a default step after any retraining-free compression.
Recommendations
- ✓ Future research should focus on developing more efficient MoE compression methods that can address the issue of router-expert mismatch without requiring significant amounts of unlabeled calibration data.
- ✓ Researchers should prioritize the development of more scalable and efficient MoE models that can be deployed in real-world applications with memory constraints.