Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
arXiv:2603.02217v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
Executive Summary
This article addresses the deployment-time memory bottleneck created by the massive parameter footprint of Mixture-of-Experts (MoE) models. The authors organize retraining-free MoE compression into three paradigms (Expert Pruning, Expert Editing, and Expert Merging) and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch, which arises when experts are changed but the router is left untouched. To address this, they propose Router Knowledge Distillation (Router KD), a method that keeps expert parameters frozen and updates only the router by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained ones, attributed to their more complex routing decision boundaries. The findings highlight the importance of router calibration for efficient MoE compression.
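As a rough illustration of the Router KD idea, here is a toy numpy sketch (not the paper's implementation; the layer sizes, the dense soft gating, and the finite-difference optimizer are all simplifying assumptions). A teacher distribution comes from the full model; one expert is then pruned, and only the router weights are trained to minimize the KL divergence to the teacher's next-token distribution, with expert parameters frozen throughout:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy dense-gated MoE layer; all sizes are illustrative.
d, E, V, B = 8, 4, 16, 32
X = rng.normal(size=(B, d))                  # unlabeled calibration batch
experts = rng.normal(size=(E, d, V)) * 0.3   # frozen expert parameters
W_full = rng.normal(size=(d, E))             # original model's router

def moe_probs(W, expert_mask):
    gates = softmax(X @ W) * expert_mask     # zero out pruned experts
    gates = gates / gates.sum(-1, keepdims=True)
    per_expert = np.einsum('bd,edv->bev', X, experts)
    return softmax(np.einsum('be,bev->bv', gates, per_expert))

# Teacher: the uncompressed model's next-token distribution.
teacher = moe_probs(W_full, np.ones(E))

# Compress: drop expert 0; the router starts out untouched.
mask = np.ones(E); mask[0] = 0.0
W_router = W_full.copy()                     # the ONLY trainable weights

def kl(W):
    p = moe_probs(W, mask)
    return np.mean((teacher * (np.log(teacher) - np.log(p + 1e-12))).sum(-1))

# Router KD as numerical-gradient descent on router weights only.
lr, eps = 0.5, 1e-4
loss0 = kl(W_router)
for _ in range(40):
    base = kl(W_router)
    g = np.zeros_like(W_router)
    for i in range(d):
        for e in range(E):
            Wp = W_router.copy(); Wp[i, e] += eps
            g[i, e] = (kl(Wp) - base) / eps
    W_router -= lr * g
```

After calibration, `kl(W_router)` is lower than the initial `loss0`: the recalibrated router compensates for the missing expert without touching any expert weights. A real implementation would use backpropagation and the model's actual top-k routing rather than finite differences and dense gating.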
Key Points
- ▸ MoE models have a massive parameter footprint that creates a deployment-time memory bottleneck
- ▸ Router-expert mismatch (experts changed, router left untouched) is a major driver of persistent post-compression degradation
- ▸ Router Knowledge Distillation (Router KD) recovers performance by updating only the router, distilling the original model's next-token distribution on unlabeled calibration data
- ▸ Gains are substantially larger for fine-grained MoEs (many small experts) than coarse-grained ones, owing to more complex routing decision boundaries
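To make the router-expert mismatch concrete, a minimal numpy sketch (all shapes and weights are illustrative, not from the paper): pruning an expert while keeping the original router's gates shifts the layer's output, because the stale router still allocates probability mass as if the removed expert were present.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy dense-gated MoE layer (hypothetical sizes, not from the paper).
d, E = 8, 4
x = rng.normal(size=d)
W_router = rng.normal(size=(d, E))     # router weights, left untouched
experts = rng.normal(size=(E, d))      # row e: expert e's output for x

gates = softmax(x @ W_router)
full_out = gates @ experts             # output of the uncompressed layer

# Expert Pruning: drop the least-used expert, renormalize surviving gates.
drop = gates.argmin()
keep = np.arange(E) != drop
pruned_gates = gates[keep] / gates[keep].sum()
pruned_out = pruned_gates @ experts[keep]

# The stale router misallocates mass, so the outputs diverge.
mismatch = np.linalg.norm(full_out - pruned_out)
```

The nonzero `mismatch` is exactly what the paper attributes degradation to: the expert set changed, but the routing function did not, which motivates lightweight router calibration rather than leaving the router frozen.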
Merits
Strength
The article proposes a novel solution to the neglected issue of router-expert mismatch, which is a significant contribution to the field of MoE compression.
Methodological Rigor
The authors conduct experiments across representative methods in all three paradigms, demonstrating that Router KD consistently recovers performance lost to compression.
Demerits
Limitation
Router KD requires running the original, uncompressed model to produce teacher next-token distributions, so calibration itself still incurs the full model's memory and compute cost, which may be impractical in some deployment pipelines.
Scalability
The method depends on unlabeled calibration data, and gathering enough representative data may become a bottleneck for large-scale MoE models.
Expert Commentary
The article makes a significant contribution to MoE compression by isolating router-expert mismatch as a key, previously neglected source of post-compression degradation, and its evaluation across representative methods in all three paradigms is methodologically sound. Two caveats temper the results: calibration requires running the original uncompressed model to produce teacher distributions, and recovery depends on access to suitable unlabeled calibration data, both of which may limit applicability in some deployment settings. Even so, the findings have clear practical implications: practitioners deploying MoE models under memory constraints should treat lightweight router calibration as a default step after any retraining-free compression.
Recommendations
- ✓ Future research should focus on developing more efficient MoE compression methods that can address the issue of router-expert mismatch without requiring significant amounts of unlabeled calibration data.
- ✓ Researchers should prioritize the development of more scalable and efficient MoE models that can be deployed in real-world applications with memory constraints.