SCOPE: Selective Conformal Optimized Pairwise LLM Judging
arXiv:2602.13110v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
Executive Summary
The article 'SCOPE: Selective Conformal Optimized Pairwise LLM Judging' introduces a novel framework for improving the reliability and calibration of large language models (LLMs) as judges in pairwise evaluations. The authors propose SCOPE, which combines selective judging with finite-sample statistical guarantees to control error rates. A key innovation is the Bidirectional Preference Entropy (BPE) method, which provides a bias-neutral uncertainty signal by querying the judge under both response positions. The study demonstrates that SCOPE consistently meets target risk levels while maintaining high coverage across various benchmarks and judge scales, outperforming naive baselines. The findings highlight the potential of SCOPE to enhance the practicality and accuracy of LLM-based evaluations.
Key Points
- ▸ SCOPE framework provides selective pairwise judging with finite-sample statistical guarantees.
- ▸ Bidirectional Preference Entropy (BPE) improves uncertainty quality and bias neutrality.
- ▸ SCOPE consistently meets target risk levels and maintains high coverage across benchmarks.
- ▸ Compared to naive baselines, SCOPE accepts up to 2.4× more judgments (MT-Bench, Qwen-7B) under the same target risk constraint.
Merits
Innovative Framework
The SCOPE framework introduces a novel approach to selective pairwise judging, providing finite-sample statistical guarantees that enhance the reliability of LLM-based evaluations.
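The selective step described above can be sketched as follows: given a labeled calibration set of uncertainty scores and per-judgment errors, pick the largest acceptance threshold whose empirical error rate among accepted judgments stays at or below the target level α. This is a simplified illustration only; the paper's actual procedure applies finite-sample conformal corrections under exchangeability, and the function and variable names here are hypothetical.

```python
def calibrate_threshold(uncertainties, errors, alpha):
    """Simplified selective-acceptance calibration (illustrative sketch).

    uncertainties: per-example uncertainty scores on a calibration set.
    errors: 1 if the judge's verdict on that example was wrong, else 0.
    alpha: target error rate among accepted (non-abstained) judgments.

    Returns the largest threshold tau such that accepting all examples
    with uncertainty <= tau keeps the empirical error rate <= alpha.
    Note: this sketch omits the finite-sample correction the paper uses.
    """
    pairs = sorted(zip(uncertainties, errors))
    best_tau = float("-inf")  # default: abstain on everything
    n_accepted, n_errors = 0, 0
    for u, e in pairs:  # sweep thresholds from most to least confident
        n_accepted += 1
        n_errors += e
        if n_errors / n_accepted <= alpha:
            best_tau = u
    return best_tau
```

At test time, judgments with uncertainty at or below the calibrated threshold are accepted; the rest are abstained on (e.g., deferred to a human annotator), which is how the error-rate guarantee translates into a coverage/risk trade-off.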
Improved Uncertainty Measurement
The BPE method effectively addresses the issue of bias in uncertainty signals by querying the judge under both response positions, leading to more accurate and reliable evaluations.
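A minimal sketch of how such a bias-neutral score might be computed, assuming the two directional preference probabilities are aggregated by simple averaging (the exact aggregation rule is not specified in the abstract, and the names below are illustrative):

```python
import math

def bpe_uncertainty(p_a_first: float, p_a_second: float) -> float:
    """Bidirectional Preference Entropy (illustrative sketch).

    p_a_first:  judge's probability that response A wins when A is shown first.
    p_a_second: judge's probability that A wins with the order swapped.

    Averaging the two probabilities makes the aggregate invariant to
    response order; the binary entropy of the aggregate is the
    uncertainty score (0 = fully confident, 1 = maximally uncertain).
    """
    p = 0.5 * (p_a_first + p_a_second)  # order-invariant aggregate
    if p in (0.0, 1.0):
        return 0.0  # entropy of a degenerate distribution
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Note how this neutralizes position bias: a judge that always favors whichever response is shown first yields p_a_first ≈ 1 and p_a_second ≈ 0, so the aggregate is ≈ 0.5 and the entropy is maximal, prompting SCOPE to abstain rather than accept a bias-driven verdict.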
Consistent Performance
SCOPE demonstrates consistent performance across different benchmarks and judge scales, meeting target risk levels and maintaining high coverage, which is crucial for practical applications.
Demerits
Complexity
SCOPE and BPE add measurable overhead: BPE queries the judge twice per comparison (once for each response order), doubling inference cost, and the conformal calibration step requires a held-out labeled calibration set. These costs could limit accessibility for smaller organizations or individual researchers.
Benchmark Limitations
The study's findings are based on specific benchmarks (MT-Bench, RewardBench, and Chatbot Arena), and the generalizability of the results to other evaluation contexts may need further validation.
Dependence on LLM Quality
The effectiveness of SCOPE is inherently dependent on the quality and calibration of the underlying LLM judge, which may vary across different models and applications.
Expert Commentary
The article presents a significant advancement in the field of LLM-based evaluations by introducing the SCOPE framework and BPE method. The rigorous approach to selective judging and the provision of finite-sample statistical guarantees address critical challenges in the calibration and bias of LLMs. The consistent performance across different benchmarks and judge scales underscores the practical utility of SCOPE. However, the complexity of implementation and the dependence on the quality of the underlying LLM judge are notable limitations. The study's findings have important implications for both practical applications and policy-making, emphasizing the need for reliable and bias-neutral AI systems. Future research could explore the generalizability of SCOPE to other evaluation contexts and its integration with other AI ethics and fairness frameworks.
Recommendations
- ✓ Further validation of SCOPE across a broader range of benchmarks and applications to assess its generalizability.
- ✓ Development of simplified implementations of SCOPE to enhance its accessibility for smaller organizations and individual researchers.
- ✓ Exploration of the integration of SCOPE with other AI ethics and fairness frameworks to provide a comprehensive approach to reliable and bias-neutral AI evaluations.