Academic

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

arXiv:2603.17145v1 Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards, which is also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectories, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
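
To make the reward gap concrete, here is a minimal Python illustration (a sketch, not the paper's formulation) contrasting a binary 0-1 reward with a regression-aware reward on a 1-5 scoring scale:

    # Sketch of the reward gap the abstract describes (not the paper's exact
    # reward): a binary reward treats every wrong score alike, while a
    # regression-aware reward credits near-misses on a 1-5 scale.

    def binary_reward(pred: float, truth: float) -> float:
        """Standard 0-1 accuracy reward: exact match or nothing."""
        return 1.0 if pred == truth else 0.0

    def regression_reward(pred: float, truth: float, max_err: float = 4.0) -> float:
        """Absolute-error reward rescaled to [0, 1] for a 1-5 scoring scale."""
        return 1.0 - abs(pred - truth) / max_err

    truth = 5.0
    for pred in (1.0, 4.0, 5.0):
        print(pred, binary_reward(pred, truth), regression_reward(pred, truth))
    # The binary reward scores both 1 and 4 as 0.0; the regression-aware
    # reward gives 4 a value of 0.75, reflecting the ordinal structure.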

Executive Summary

This article introduces REAL, a Regression-Aware Reinforcement Learning framework that optimizes regression rewards directly, addressing two complementary limitations: standard RL methods ignore the ordinal structure of scoring tasks, while existing regression-aware approaches are confined to Supervised Fine-Tuning (SFT). By employing a generalized policy gradient estimator, REAL decomposes optimization into exploration over Chain-of-Thought trajectories and regression-aware refinement of the final score, enabling the model to discover effective reasoning paths and generalize to out-of-domain benchmarks. Experiments across model scales (8B to 32B) show REAL outperforming both regression-aware SFT baselines and standard RL methods. The study underscores the value of integrating regression objectives into RL exploration for accurate LLM evaluation, a notable step forward for the LLM-as-a-Judge paradigm.

Key Points

  • REAL is a principled RL framework designed to optimize regression rewards
  • REAL employs a generalized policy gradient estimator to handle the policy-dependent regression objective (see the sketch after this list)
  • REAL achieves superior generalization on out-of-domain benchmarks, outperforming regression-aware SFT baselines and standard RL methods
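
The abstract only names the two components of the decomposition; the PyTorch sketch below shows one hypothetical way it could look, assuming the judge samples a chain of thought and then a single score token over the discrete scores 1-5, with the final prediction taken as the probability-weighted mean score. Every variable name here is illustrative, not the paper's API.

    import torch

    # Hypothetical sketch of the two-part objective named in the abstract;
    # NOT the paper's implementation. We assume the judge samples a CoT and
    # then a score token over the discrete scores 1..5, and that the final
    # prediction is the probability-weighted mean score, which makes the
    # regression objective policy-dependent.

    scores = torch.arange(1.0, 6.0)                        # candidate scores 1..5
    ground_truth = torch.tensor(5.0)

    # Quantities one rollout would produce (dummy values stand in here):
    cot_logprob = torch.tensor(-12.3, requires_grad=True)  # log pi(CoT | input)
    score_logits = torch.randn(5, requires_grad=True)      # logits over scores

    probs = torch.softmax(score_logits, dim=-1)
    pred = (probs * scores).sum()                          # policy-dependent prediction
    sq_err = (pred - ground_truth) ** 2

    # (1) Exploration over the CoT trajectory: a REINFORCE-style term using
    #     the detached negative squared error as the scalar reward.
    explore_loss = sq_err.detach() * cot_logprob           # = -reward * logprob

    # (2) Regression-aware refinement of the final score: the squared error
    #     is differentiable through the score distribution, so keep its grad.
    refine_loss = sq_err

    loss = explore_loss + refine_loss
    loss.backward()
    print(cot_logprob.grad)   # gradient driving CoT exploration
    print(score_logits.grad)  # gradient refining the score distribution

Because the prediction is an expectation under the policy itself, the error signal flows back along two routes: through the sampled CoT via its log-probability (the exploration term) and directly through the score distribution (the refinement term). This is the sense in which the objective is policy-dependent, and why a standard policy gradient alone would miss the second route.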

Merits

Strengths in Addressing Limitations

The REAL framework addresses two limitations at once: the reliance of standard RL methods on binary rewards, and the confinement of existing regression-aware approaches to Supervised Fine-Tuning (SFT).

Improved Generalization

REAL's ability to learn optimal reasoning paths and refine regression-aware predictions leads to superior generalization on out-of-domain benchmarks.

Methodological Innovation

The use of a generalized policy gradient estimator in REAL represents a novel and effective approach to addressing policy-dependent regression objectives.

Demerits

Technical Complexity

The REAL framework's reliance on a generalized policy gradient estimator may introduce additional technical complexity, potentially limiting its adoption and implementation.

Model Requirements

The experimental results presented in the study may be contingent on specific model scales (8B to 32B), potentially limiting the framework's applicability to smaller or larger models.

Evaluation Metrics

The study's focus on Pearson and Spearman correlation metrics may not capture the full range of evaluation metrics relevant to LLM-as-a-Judge paradigms.
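
For reference, the two metrics the paper reports can be computed with scipy.stats; the scores below are made-up values used only to show the calls:

    from scipy.stats import pearsonr, spearmanr

    # Made-up judge scores vs. human ratings, purely to illustrate the two
    # metrics in the abstract (Pearson measures linear agreement, Spearman
    # measures rank agreement).
    human = [1, 2, 3, 4, 5, 3, 2]
    judge = [2, 2, 3, 5, 5, 3, 1]

    print("Pearson r:", pearsonr(human, judge)[0])
    print("Spearman rho:", spearmanr(human, judge)[0])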

Expert Commentary

The REAL framework represents a significant step forward for the LLM-as-a-Judge paradigm, tackling the limitations of both binary-reward RL and SFT-confined regression approaches. The reported results make a strong case for integrating regression objectives into RL exploration as a route to more accurate LLM evaluation. That said, the framework's technical complexity, its validation only at the 8B-32B scale, and the narrow focus on correlation metrics all warrant scrutiny before broad adoption, and its practical implications will become clearer as the field matures.

Recommendations

  • Future research should focus on expanding the REAL framework's applicability to smaller or larger models, as well as exploring alternative evaluation metrics.
  • Work on LLM evaluation metrics and paradigms should build on REAL's demonstration that regression rewards can be optimized directly within RL.

Sources

  • arXiv:2603.17145v1