
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE


arXiv:2603.06003v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce the Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method with criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
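The abstract does not give ESAP's exact form, but a speculative-decoding-style acceptance proxy with the stated properties (teacher-forced, bounded, no autoregressive decoding) can be sketched. In speculative decoding, a draft token sampled from distribution q is accepted against target distribution p with probability min(1, p(x)/q(x)), and the expected acceptance rate equals the distribution overlap, sum over x of min(p(x), q(x)). The function name `expected_acceptance` and the exact averaging are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def expected_acceptance(p_full, p_pruned):
    """Teacher-forced speculative-acceptance proxy (illustrative sketch).

    p_full, p_pruned: (T, V) next-token probability distributions from the
    full and pruned models on the same teacher-forced prefix.

    Per position, the expected speculative acceptance probability is
        E_{x ~ q}[min(1, p(x)/q(x))] = sum_x min(p(x), q(x)),
    which lies in [0, 1] (it equals 1 - total-variation distance), making
    the proxy bounded and comparable across candidate pruned models.
    """
    per_position = np.minimum(p_full, p_pruned).sum(axis=-1)  # each in [0, 1]
    return float(per_position.mean())

# Identical distributions give a perfect score of 1.0; a model that puts
# all mass on a different token at one position is penalized there.
p = np.array([[0.5, 0.5, 0.0],
              [1.0, 0.0, 0.0]])
q = np.array([[0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0]])
print(expected_acceptance(p, p))  # 1.0
print(expected_acceptance(p, q))  # 0.5
```

Because the score only needs one teacher-forced pass per model, many candidate allocations can be compared without generating any tokens, which matches the abstract's claim of avoiding costly autoregressive decoding.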

Executive Summary

This article proposes EvoESAP, a method for optimizing non-uniform expert pruning in Sparse Mixture-of-Experts (SMoE) language models. The approach decouples pruning into within-layer expert ranking and across-layer budget allocation, leveraging a speculative-decoding-inspired metric called ESAP. EvoESAP is an evolutionary search framework that optimizes layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it plug-and-play with existing pruning criteria. The results show that EvoESAP consistently discovers non-uniform allocations that improve open-ended generation while preserving competitive multiple-choice accuracy compared to uniform pruning at the same sparsity. This research contributes to more efficient SMoE models, which matters for deploying large language models in memory- and throughput-constrained environments.

Key Points

  • EvoESAP decouples pruning into within-layer expert ranking and across-layer budget allocation.
  • The method leverages ESAP, a bounded, teacher-forced, speculative-decoding-inspired metric that measures how closely a pruned model's predictions match the full model's, without autoregressive decoding.
  • EvoESAP optimizes layer-wise sparsity allocation under a fixed global budget using an evolutionary search framework.
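The search described in the key points can be sketched as an evolutionary loop over per-layer pruning budgets, where every mutation moves budget between two layers so the global total is preserved. The function names (`mutate`, `evolve`), population size, and mutation scheme below are illustrative assumptions; the paper's actual operators and the scoring function (ESAP in the paper) are abstracted behind `score_fn`:

```python
import random

def mutate(alloc, n_experts, step=1):
    """Move `step` units of pruning budget from layer j to layer i,
    keeping the global budget sum(alloc) fixed and leaving at least
    one expert per layer."""
    a = alloc[:]
    i, j = random.sample(range(len(a)), 2)
    if a[i] + step <= n_experts - 1 and a[j] - step >= 0:
        a[i] += step
        a[j] -= step
    return a

def evolve(score_fn, n_layers, n_experts, budget, pop=16, gens=30, seed=0):
    """Evolve a non-uniform per-layer allocation from a uniform start.

    score_fn(alloc) -> float, higher is better (a stand-in for ESAP).
    Keeping the top half each generation makes the best score
    monotonically non-decreasing (elitism)."""
    random.seed(seed)
    base, rem = divmod(budget, n_layers)
    uniform = [base + (1 if l < rem else 0) for l in range(n_layers)]
    population = [uniform[:] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=score_fn, reverse=True)
        survivors = population[: pop // 2]
        population = survivors + [
            mutate(random.choice(survivors), n_experts)
            for _ in range(pop - len(survivors))
        ]
    return max(population, key=score_fn)

# Toy score that rewards concentrating pruning in later layers.
best = evolve(lambda a: sum(i * x for i, x in enumerate(a)),
              n_layers=4, n_experts=8, budget=8)
print(best, sum(best))  # global budget is preserved
```

Because the within-layer pruning order is held fixed, each candidate allocation only changes how many experts each layer drops, which is what makes the search plug-and-play with criteria like Frequency, EAN, SEER, and REAP.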

Merits

Strength in Optimizing Layer-wise Sparsity Allocation

EvoESAP's ability to optimize non-uniform layer-wise sparsity allocation under a fixed global budget is a significant merit: since layers differ in how much pruning they tolerate, reallocating budget across layers can improve quality at the same memory footprint as uniform pruning.

Improved Open-ended Generation

The results show that EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity), a critical application of SMoE models.

Demerits

Limited Evaluation on Specific Tasks

The article evaluates EvoESAP on open-ended generation and multiple-choice benchmarks; exploring its performance on a broader range of tasks and domains would strengthen the generality of the claims.

Complexity of Evolutionary Searching Framework

The evolutionary search in EvoESAP must score many candidate allocations, so even though ESAP avoids autoregressive decoding, the search can still require significant compute, which could be a limitation in practice.

Expert Commentary

The article presents a significant contribution to the field of SMoE models, leveraging a novel approach to optimize expert pruning and layer-wise sparsity allocation. The results are promising, and EvoESAP has the potential to improve the efficiency and effectiveness of SMoE models in real-world applications. However, the compute cost of the evolutionary search may be a limitation in practice, and it would be beneficial to explore EvoESAP's performance on other tasks and domains to further understand its implications. Overall, this research can inform practical decisions on the development and deployment of large language models under memory and throughput constraints.

Recommendations

  • Further research is needed to explore EvoESAP's performance on other tasks and domains.
  • The compute cost of the evolutionary search in EvoESAP should be quantified and reduced to make the method more practical for deployment.
