Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

arXiv:2604.06465v1. Abstract: Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

Executive Summary

This article introduces Evo-L2S, a novel framework addressing the 'Long-to-Short' (L2S) reasoning problem in large language models (LLMs). The core innovation lies in framing L2S as a multi-objective optimization challenge, explicitly balancing accuracy against output length through evolutionary model merging. Unlike prior scalarized approaches, Evo-L2S generates a Pareto front of merged models, offering a spectrum of robust trade-offs. An entropy-based subset sampling technique keeps the search computationally tractable. Experimental results across various model scales and mathematical benchmarks indicate significant reductions in reasoning trace length (over 50%) while maintaining or improving accuracy, presenting a promising avenue for more efficient LLM deployment without sacrificing performance.

Key Points

  • Evo-L2S frames the Long-to-Short (L2S) reasoning problem as a multi-objective optimization task.
  • It leverages evolutionary model merging to explicitly optimize the trade-off between accuracy and output length.
  • The framework generates a Pareto front of merged models, offering robust and non-dominated solutions.
  • An entropy-based subset sampling technique is introduced to make fitness estimation computationally tractable for large models.
  • Experiments show over 50% reduction in reasoning trace length with preserved or improved accuracy across various LLM scales and benchmarks.
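The abstract does not give implementation details, so the following is only a minimal sketch of the kind of loop these points describe. It assumes per-layer linear interpolation between a base model and its long-reasoning counterpart (a common merging recipe, not confirmed by the paper) and uses toy surrogate fitness functions in place of real benchmark evaluation:

```python
import random

def merge(base, reasoner, alphas):
    """Interpolate per-layer weights (flattened here for illustration);
    a real merge would operate on full parameter tensors."""
    return [a * r + (1 - a) * b
            for a, r, b in zip(alphas, reasoner, base)]

def evaluate(alphas):
    """Stand-in fitness returning (accuracy, token_length). A real run
    would build merge(base, reasoner, alphas) and decode it on a
    benchmark subset; these toy surrogates just create a trade-off."""
    mean = sum(alphas) / len(alphas)
    acc = 1.0 - 0.5 * abs(mean - 0.7)   # toy: accuracy peaks at mean=0.7
    length = 100 + 400 * mean           # toy: more reasoner weight, longer output
    return acc, length

def dominates(f1, f2):
    """For f = (accuracy, length): higher accuracy and shorter length win."""
    return f1[0] >= f2[0] and f1[1] <= f2[1] and f1 != f2

def evolve(n_layers=4, pop_size=16, generations=20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_layers)] for _ in range(pop_size)]
    for _ in range(generations):
        # Gaussian mutation, clipped to valid interpolation coefficients.
        children = [[min(1.0, max(0.0, a + rng.gauss(0, 0.1))) for a in p]
                    for p in pop]
        union = pop + children
        scored = [(ind, evaluate(ind)) for ind in union]
        # Elitist selection: keep non-dominated individuals, pad randomly.
        front = [ind for ind, f in scored
                 if not any(dominates(g, f) for _, g in scored)]
        pop = (front + [rng.choice(union) for _ in range(pop_size)])[:pop_size]
    scored = [(ind, evaluate(ind)) for ind in pop]
    return [f for _, f in scored
            if not any(dominates(g, f) for _, g in scored)]
```

The returned set approximates a Pareto front: no member is beaten on both accuracy and length by any other, so a practitioner can pick the point matching their deployment budget.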

Merits

Novel Multi-objective Formulation

Explicitly addresses the inherent trade-off in L2S reasoning, moving beyond brittle scalarized approaches to generate a robust Pareto front of solutions.
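A toy comparison illustrates why a fixed scalarization is brittle where a Pareto front is not (the accuracy/length pairs below are illustrative, not from the paper):

```python
def pareto_front(points):
    """Non-dominated (accuracy, length) pairs: higher accuracy and
    shorter length are both preferred."""
    def dominated(p):
        return any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)
    return [p for p in points if not dominated(p)]

def scalarized_pick(points, lam):
    """Fixed-hyperparameter baseline: maximize accuracy - lam * length.
    The winner flips as lam changes, which is exactly the brittleness
    a Pareto front avoids by returning the whole trade-off set."""
    return max(points, key=lambda p: p[0] - lam * p[1])

candidates = [(0.90, 400), (0.85, 200), (0.80, 300), (0.60, 100)]
```

Here `scalarized_pick` commits to one point whose identity depends entirely on `lam`, while `pareto_front(candidates)` returns every defensible trade-off at once.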

Computational Tractability

The entropy-based subset sampling technique is a crucial enabler for applying evolutionary methods to large language models, mitigating the typically high computational cost.
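The abstract does not spell out the sampling criterion. One plausible sketch, assuming the idea is to score benchmark items by the model's predictive entropy and evaluate candidate merges only on the most uncertain (hence most discriminative) ones:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_eval_subset(examples, entropies, k):
    """Keep the k highest-entropy examples for fitness estimation.
    Low-entropy items, which nearly every candidate merge handles the
    same way, are skipped to cut evaluation cost."""
    ranked = sorted(zip(entropies, examples), key=lambda t: -t[0])
    return [ex for _, ex in ranked[:k]]
```

Whatever the exact criterion in Evo-L2S, the effect is the same: each generation's fitness is estimated on a small, informative subset rather than the full benchmark suite.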

Empirical Rigor

Comprehensive experiments across multiple model scales (1.5B, 7B, 14B) and six diverse mathematical reasoning benchmarks lend significant credibility to the findings.

Significant Efficiency Gains

Achieving over 50% reduction in token length while maintaining or improving accuracy represents a substantial practical improvement for LLM inference efficiency.

Demerits

Domain Specificity of Benchmarks

While mathematical reasoning is a strong test, the generalizability of Evo-L2S's effectiveness across broader reasoning domains (e.g., legal, medical, common sense) remains to be fully explored.

Complexity of Evolutionary Search

Despite the sampling technique, evolutionary algorithms can still be computationally intensive and sensitive to hyperparameter tuning, which might be a barrier for practitioners without deep expertise.

Interpretability of Merged Models

The 'black-box' nature of merged models, especially those derived through evolutionary processes, might reduce interpretability compared to more direct fine-tuning or pruning methods.

Dependence on 'Original Reasoning Models'

The approach assumes the availability and quality of 'original reasoning models' with long chains of thought, which might not always be optimal or readily available for all tasks.

Expert Commentary

The Evo-L2S framework represents a significant methodological advance in addressing the critical L2S reasoning problem. Its shift from scalarized, often brittle, optimization to a multi-objective evolutionary approach is conceptually elegant and empirically validated. The generation of a Pareto front provides practitioners with a nuanced understanding of the achievable trade-offs, enabling informed deployment decisions rather than forcing suboptimal compromises. Crucially, the entropy-based sampling technique effectively bridges the gap between the computational demands of evolutionary search and the scale of modern LLMs, a non-trivial engineering feat. While the current benchmarks are mathematically focused, the underlying principle of optimizing for both accuracy and efficiency is universally applicable. Future research should rigorously test its generalizability across diverse, real-world reasoning tasks, particularly those in legal, medical, or scientific domains where both accuracy and explainability are paramount. The potential for these 'shortened' models to retain, or even enhance, accuracy while drastically cutting resource consumption is transformative for the widespread adoption of advanced reasoning AI.

Recommendations

  • Conduct extensive validation of Evo-L2S across a broader range of reasoning domains (e.g., legal case analysis, medical diagnostics, scientific hypothesis generation) to assess generalizability.
  • Investigate the interpretability and explainability of the merged models, especially how the 'shortened' reasoning traces arrive at conclusions, to build trust in high-stakes applications.
  • Explore hybrid approaches combining Evo-L2S with other model compression techniques (e.g., quantization, pruning) to achieve even greater efficiency gains.
  • Develop user-friendly tools and libraries that abstract away the complexity of evolutionary optimization, making Evo-L2S more accessible to a wider range of AI practitioners and researchers.
  • Analyze the sensitivity of Evo-L2S to different initial model architectures and training data, and provide guidelines for optimal application.

Sources

Original: arXiv - cs.CL