Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

arXiv:2602.21585v1 | Announce Type: new

Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.

Executive Summary

The article introduces Duel-Evolve, an evolutionary optimization algorithm that leverages pairwise comparisons from a large language model (LLM) to guide test-time optimization. This approach eliminates the need for external scalar rewards, ground-truth labels, and hand-crafted scoring functions. The algorithm aggregates noisy candidate comparisons using a Bayesian Bradley-Terry model and allocates the comparison budget using Double Thompson Sampling. The results show significant improvements over existing methods on MathBench and LiveCodeBench, demonstrating the effectiveness of pairwise self-preferences in optimizing LLM outputs.
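The propose-duel-refine loop described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `propose`, `mutate`, and `prefer` are stand-in callables for the LLM calls (candidate generation, refinement, and pairwise self-preference), and for brevity the sketch ranks candidates by empirical win rate, whereas the paper fits a Bayesian Bradley-Terry model and allocates duels with Double Thompson Sampling.

```python
import random

def duel_evolve(propose, mutate, prefer, pop_size=4, generations=3,
                duels_per_gen=12, seed=0):
    """Toy Duel-Evolve-style loop: duel candidates, keep the apparent
    winners, and refine them. All names here are illustrative stand-ins."""
    rng = random.Random(seed)
    population = [propose() for _ in range(pop_size)]
    for _ in range(generations):
        wins = [0] * pop_size
        games = [0] * pop_size
        for _ in range(duels_per_gen):
            # Random pairing for simplicity; the paper spends this budget
            # adaptively via Double Thompson Sampling.
            i, j = rng.sample(range(pop_size), 2)
            winner = i if prefer(population[i], population[j]) else j
            wins[winner] += 1
            games[i] += 1
            games[j] += 1
        # Rank by empirical win rate (the paper uses Bayesian skill estimates).
        rate = [w / g if g else 0.5 for w, g in zip(wins, games)]
        ranked = sorted(range(pop_size), key=lambda k: rate[k], reverse=True)
        parents = [population[k] for k in ranked[:pop_size // 2]]
        population = parents + [mutate(rng.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return population[0]  # best candidate under the preference signal
```

With candidates as numbers and preference meaning "larger is better", the loop climbs toward high values using only binary comparisons, never a scalar score.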

Key Points

  • Duel-Evolve replaces external scalar rewards with pairwise preferences from the LLM
  • The algorithm uses a Bayesian Bradley-Terry model to aggregate noisy candidate comparisons
  • Double Thompson Sampling is used to allocate the comparison budget and select high-quality parents
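As a concrete illustration of the last two points, the sketch below estimates candidate skills from a pairwise win matrix and picks the next duel. Both functions are simplified assumptions on my part: the Bradley-Terry fit uses the classic maximum-likelihood MM iteration (the paper's Bayesian variant additionally tracks posterior uncertainty), and the pair selection follows the spirit of Double Thompson Sampling (Wu & Liu, 2016), sampling pairwise win probabilities from Beta posteriors.

```python
import random

def fit_bradley_terry(wins, iters=200):
    """Maximum-likelihood Bradley-Terry skills via the classic MM update;
    wins[i][j] counts how often candidate i beat candidate j."""
    n = len(wins)
    skill = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (skill[i] + skill[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else skill[i])
        total = sum(new)
        skill = [s * n / total for s in new]  # fix the scale (identifiability)
    return skill

def double_thompson_pair(wins, rng):
    """Pick the next duel in the spirit of Double Thompson Sampling:
    sample pairwise win probabilities from Beta posteriors, choose the
    best sampled Copeland scorer, then its strongest sampled challenger."""
    n = len(wins)
    theta = [[rng.betavariate(wins[i][j] + 1, wins[j][i] + 1) if i != j else 0.5
              for j in range(n)] for i in range(n)]
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i)
                for i in range(n)]
    first = max(range(n), key=lambda i: copeland[i])
    challengers = [(rng.betavariate(wins[j][first] + 1, wins[first][j] + 1), j)
                   for j in range(n) if j != first]
    second = max(challengers)[1]
    return first, second
```

In an evolutionary loop, `double_thompson_pair` would replace uniform-random pairing so that comparison budget concentrates on plausible optima, while the fitted skills drive parent selection.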

Merits

Efficient Optimization

Duel-Evolve improves over existing methods (by 20 percentage points on MathBench and over 12 percentage points on LiveCodeBench) without requiring reward models, ground-truth labels, or hand-crafted scoring functions

Flexibility

Because it needs only pairwise comparisons, the algorithm applies to tasks where scalar scores are unavailable, sparse, or unreliable

Demerits

Computational Complexity

Fitting the Bayesian Bradley-Terry model and running Double Thompson Sampling add overhead on top of the LLM calls required for each pairwise comparison, and the space of candidate pairs grows quadratically with population size

Noise Sensitivity

Because the search signal comes entirely from the LLM's own noisy, possibly biased self-preferences, systematic judging errors could mislead the search rather than merely slow it

Expert Commentary

Duel-Evolve is a notable step for test-time optimization. By eliciting pairwise preferences from the generator LLM itself, it removes the need for external reward models and labels, and the Bayesian Bradley-Terry model combined with Double Thompson Sampling offers a principled way to aggregate noisy comparisons and spend the comparison budget. Open questions remain, however, about how the method behaves under heavily biased self-preferences or adversarial inputs, and how well the self-preference signal transfers beyond the math and coding benchmarks reported here.

Recommendations

  • Further evaluation of Duel-Evolve on diverse tasks and domains to demonstrate its generalizability
  • Investigation of the algorithm's robustness to noise and adversarial examples to improve its reliability in real-world scenarios
