When LLM Judge Scores Look Good but Best-of-N Decisions Fail


Eddie Landesberg

arXiv:2603.12520v1 (Announce Type: cross)

Abstract: Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
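The matched-pair best-of-2 contrast in the abstract can be illustrated with a small simulation. This is a hedged sketch, not the paper's actual setup: it assumes a pointwise judge that only emits coarse 1-5 grades (and so must break frequent ties at random) versus an idealized pairwise judge that directly compares the two candidates and never ties.

```python
import random

random.seed(0)

def pointwise_pick(scores):
    """Argmax with random tie-breaking, as a coarse pointwise judge must do."""
    best = max(scores)
    return random.choice([i for i, s in enumerate(scores) if s == best])

# Synthetic best-of-2 audit: each prompt has two candidates with latent
# quality q in [0, 1]; the pointwise judge reports only a rounded 1-5
# grade, while the pairwise judge compares the latent qualities directly
# (an idealized stand-in for explicit pairwise prompting).
hits_point = hits_pair = n = 0
for _ in range(10_000):
    q = [random.random(), random.random()]
    grades = [round(1 + 4 * x) for x in q]      # coarse pointwise scores
    truth = max(range(2), key=q.__getitem__)    # truly better candidate
    hits_point += pointwise_pick(grades) == truth
    hits_pair += (0 if q[0] > q[1] else 1) == truth
    n += 1
print(f"pointwise top-1: {hits_point/n:.2f}  pairwise top-1: {hits_pair/n:.2f}")
```

In this toy setting the pairwise judge is always right, while the coarse pointwise judge loses accuracy exactly on the tied pairs, mirroring the paper's qualitative point that explicit pairwise judging recovers signal lost to ties.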

Executive Summary

The article highlights the limitations of using large language models as judges to score candidate responses when the deployment task is best-of-n selection within a prompt. A judge with moderate global correlation (r = 0.47) captures only 21.0% of the gain that perfect selection would achieve over random choice, because global agreement is driven largely by prompt-level baseline effects while selection depends on within-prompt ranking, and coarse pointwise scoring produces frequent ties. In a matched-pair best-of-2 audit, switching to explicit pairwise judging raises recovery from 21.1% to 61.2%, and the authors recommend reporting within-prompt signal, tie rates, and recovery/top-1 accuracy instead of global agreement alone.
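The recovery metric the summary refers to can be sketched in a few lines: the judge's gain over a random pick, divided by the oracle's gain. This is a minimal, self-contained sketch with synthetic scores; the function and variable names are illustrative, not from the paper.

```python
import statistics

def recovery(true_scores, judge_scores):
    """Best-of-n recovery: the fraction of the oracle's gain over random
    selection that the judge's argmax pick actually captures.

    Both arguments are per-prompt lists of per-candidate scores
    (synthetic stand-ins for reference labels and judge outputs).
    """
    judge_picked, oracle, rand = [], [], []
    for true, judged in zip(true_scores, judge_scores):
        best_by_judge = max(range(len(judged)), key=judged.__getitem__)
        judge_picked.append(true[best_by_judge])
        oracle.append(max(true))            # perfect selection
        rand.append(statistics.mean(true))  # expected value of a random pick
    gain_judge = statistics.mean(judge_picked) - statistics.mean(rand)
    gain_oracle = statistics.mean(oracle) - statistics.mean(rand)
    return gain_judge / gain_oracle

# Tiny best-of-4 example: the judge mis-ranks the first prompt's candidates.
true_scores  = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.9, 0.1, 0.6]]
judge_scores = [[0.8, 0.1, 0.9, 0.2], [0.2, 0.7, 0.1, 0.5]]
print(round(recovery(true_scores, judge_scores), 3))  # 0.5
```

Recovery of 1.0 means the judge's picks match perfect selection; 0.0 means the judge does no better than choosing at random.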

Key Points

  • Global correlation is not a reliable metric for best-of-n selection tasks
  • Within-prompt correlation is a more relevant metric for judge-based selection
  • Coarse pointwise scoring creates frequent ties in pairwise comparisons (67% in the benchmark), forcing arbitrary tie-breaking and reducing selection accuracy
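The first two points can be made concrete with a short sketch of the two diagnostics they contrast, plus a tie-rate counter for the third. The data and helper names below are synthetic illustrations, not the paper's implementation.

```python
import itertools
import statistics

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def within_prompt_r(true_scores, judge_scores):
    """Mean Pearson r computed separately inside each prompt, so
    prompt-level baseline shifts cannot inflate agreement."""
    return statistics.mean(
        pearson(t, j) for t, j in zip(true_scores, judge_scores)
    )

def tie_rate(judge_scores):
    """Fraction of within-prompt candidate pairs the judge scores identically."""
    ties = total = 0
    for scores in judge_scores:
        for a, b in itertools.combinations(scores, 2):
            ties += a == b
            total += 1
    return ties / total

# Two prompts with very different baselines: the judge tracks which prompt
# is harder but gets the within-prompt order of the first prompt backwards.
true_j  = [[0.0, 0.1], [1.0, 1.1]]
judge_j = [[5.1, 5.0], [9.0, 9.1]]
global_r = pearson([s for p in true_j for s in p],
                   [s for p in judge_j for s in p])
print(round(global_r, 2), round(within_prompt_r(true_j, judge_j), 2))
```

Here the global correlation is near 1 while the within-prompt correlation is 0: the prompt-level baseline carries all the global agreement, and none of it helps pick the better candidate inside a prompt.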

Merits

Improved Evaluation Metric

The proposed within-prompt signal, tie rates, and recovery/top-1 accuracy give a more faithful picture of a judge's performance in best-of-n selection tasks than global agreement alone.

Demerits

Limited Generalizability

The study's findings may not be generalizable to all large language models or deployment tasks, and further research is needed to confirm the results

Expert Commentary

The article provides a timely and important critique of the current evaluation metrics used for large language models. The authors' proposal to use within-prompt signal and tie rates as evaluation metrics has the potential to significantly improve the accuracy and reliability of model performance in best-of-n selection tasks. However, further research is needed to fully understand the implications of these findings and to develop more effective evaluation metrics for large language models.

Recommendations

  • Develop and evaluate more nuanced evaluation metrics for large language models in best-of-n selection tasks
  • Conduct further research to confirm the generalizability of the study's findings to other large language models and deployment tasks
