Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

arXiv:2602.14069v1 Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
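The core mechanism in the abstract, criterion-wise pairwise comparison with external aggregation, can be sketched in a few lines. Everything below is illustrative: the `Criterion` type, the voting rule, and the toy judge lambdas are assumptions standing in for LLM-as-a-Judge calls, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical criterion produced by a meta-rubric: a name, a weight, and a
# pairwise judge returning +1 if response A wins, -1 if B wins, 0 for a tie.
@dataclass
class Criterion:
    name: str
    weight: float
    compare: Callable[[str, str], int]  # stands in for an LLM-as-a-Judge call

def pairwise_preference(criteria: List[Criterion], a: str, b: str) -> int:
    """Aggregate criterion-level preferences externally: each criterion votes
    on the pair, and the weighted vote total decides the winner. No pointwise
    scalar score is ever assigned to a response on its own."""
    total = sum(c.weight * c.compare(a, b) for c in criteria)
    return 1 if total > 0 else (-1 if total < 0 else 0)

# Toy stand-ins for judge calls: prefer the shorter response on "concision"
# and the one containing a citation marker on "grounding".
criteria = [
    Criterion("concision", 0.4,
              lambda a, b: (len(a) < len(b)) - (len(a) > len(b))),
    Criterion("grounding", 0.6,
              lambda a, b: ("[1]" in a) - ("[1]" in b)),
]

print(pairwise_preference(criteria,
                          "Short answer [1]",
                          "A much longer answer with no source"))  # → 1
```

The point of the external aggregation step is that the combination rule stays inspectable: criterion votes and weights are visible, rather than being internalized into a judge's opaque scalar.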

Executive Summary

The paper introduces the Open Rubric System (OpenRS), a framework designed to address the limitations of scalar reward models in reinforcement learning (RL) by replacing them with a rubrics-based approach. OpenRS combines Pairwise Adaptive Meta-Rubrics (PAMR) with Pointwise Verifiable Rubrics (PVRs) to make the reward signal both transparent and adaptable. By judging responses against explicit, inspectable criteria rather than compressing preferences into a single opaque score, the system avoids the information bottleneck inherent in scalar rewards and thereby reduces brittleness and reward hacking. The framework is plug-and-play and can be integrated into existing RL training pipelines.

Key Points

  • OpenRS addresses the limitations of scalar reward models by using a rubrics-based approach.
  • The system employs PAMR and PVRs to create a transparent and adaptable reward system.
  • OpenRS avoids the scalar-reward information bottleneck by aggregating criterion-level preferences externally, improving discriminability in open-ended settings.
  • The framework includes a two-level meta-rubric refinement pipeline for consistency and editability.
  • OpenRS is instantiated as reward supervision in pairwise RL training.

Merits

Transparency and Adaptability

OpenRS provides a transparent and adaptable reward system by using explicit meta-rubrics and conditionally instantiated rubrics, which enhances the interpretability and flexibility of the reward mechanism.
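A minimal sketch of what "conditionally instantiated rubrics" could look like: the meta-rubric maps observed differences between two candidates to the criteria worth judging, so the rubric adapts per pair instead of being fixed up front. The difference detector and the mapping below are hypothetical placeholders for the paper's LLM-driven instantiation.

```python
# Toy stand-in for an LLM semantic diff: which coarse aspects differ
# between the two candidate responses?
def semantic_differences(a: str, b: str) -> set:
    diffs = set()
    if abs(len(a) - len(b)) > 20:
        diffs.add("length")
    if ("```" in a) != ("```" in b):
        diffs.add("code")
    return diffs

# Meta-rubric: a constitution-like mapping from difference type to the
# criterion that should govern that difference (illustrative entries only).
META_RUBRIC = {
    "length": "Prefer the response that is concise without omitting steps.",
    "code": "Prefer the response whose code example is runnable and relevant.",
}

def instantiate_rubric(a: str, b: str) -> list:
    """Instantiate only the criteria relevant to this pair's differences."""
    return [META_RUBRIC[d] for d in sorted(semantic_differences(a, b))]
```

Because the meta-rubric is an explicit mapping rather than learned judge weights, each instantiated criterion can be traced back to a named, editable principle.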

Improved Discriminability

Criterion-wise pairwise comparison, with rubrics instantiated from the semantic differences between candidate responses, lets OpenRS separate responses that a single scalar score would conflate, improving discriminability in open-ended settings and yielding more robust alignment with human preferences.

Scalability

The plug-and-play design of OpenRS makes it scalable and easily integrable into existing RL training pipelines, facilitating widespread adoption.
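One reading of "plug-and-play" is that the rubric judge exposes the same preference interface a learned reward model would, so a pairwise RL trainer can consume it unchanged. The sketch below labels sampled response pairs with a rubric-based preference function; all names and the (prompt, chosen, rejected) layout are assumptions, not the paper's API.

```python
from typing import Callable, List, Tuple

# A preference function over (prompt, a, b) returning +1, 0, or -1 --
# in OpenRS this would be the rubric judge; here it is any callable.
PreferenceFn = Callable[[str, str, str], int]

def build_preference_pairs(prompts: List[str],
                           sample: Callable[[str], str],
                           prefer: PreferenceFn) -> List[Tuple[str, str, str]]:
    """Label sampled response pairs with the rubric judge instead of a
    learned scalar reward model; the downstream trainer is unchanged."""
    dataset = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)
        verdict = prefer(prompt, a, b)
        if verdict > 0:
            dataset.append((prompt, a, b))   # (prompt, chosen, rejected)
        elif verdict < 0:
            dataset.append((prompt, b, a))
        # ties are dropped rather than labeled arbitrarily
    return dataset
```

Swapping the reward source without touching the trainer is what makes the design easy to drop into an existing pairwise RL pipeline.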

Demerits

Complexity

The complexity of the OpenRS framework, including the two-level meta-rubric refinement pipeline, may pose challenges in implementation and maintenance.

Human-in-the-Loop Requirements

The need for human-in-the-loop procedures for domain principle refinement may limit the scalability and efficiency of the system in certain applications.

Verification Overhead

The use of PVRs for verifiable rewards and guardrails may introduce additional computational overhead, potentially impacting the performance of the RL training process.
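The dual role the abstract assigns to PVRs, hard-constraint guardrail plus verifiable reward for objective sub-tasks, can be illustrated with programmatic checks. The specific checks below (a leak guardrail and a citation check) are invented for illustration; the overhead concern is simply that every response must pass through such checks.

```python
import re

def pvr_no_leak(response: str) -> bool:
    """Hypothetical hard-constraint PVR: the response must not leak an
    internal scratchpad tag (a degenerate behavior to guard against)."""
    return "<scratchpad>" not in response

def pvr_cites_source(response: str) -> float:
    """Hypothetical verifiable component: 1.0 if the response cites at
    least one numbered source like [1], else 0.0."""
    return 1.0 if re.search(r"\[\d+\]", response) else 0.0

def verifiable_reward(response: str) -> float:
    """Guardrail failure vetoes the response outright; otherwise the
    objective checks contribute verifiable reward."""
    if not pvr_no_leak(response):
        return -1.0
    return pvr_cites_source(response)
```

Because these checks are ordinary code, their cost is usually small next to the LLM judge calls; the overhead question matters most when PVRs themselves require model inference or external tools.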

Expert Commentary

The Open Rubric System (OpenRS) represents a significant advancement in the field of reinforcement learning, addressing critical limitations of traditional scalar reward models. By employing a rubrics-based approach, OpenRS enhances transparency and adaptability, which are essential for robust alignment with human preferences. The use of PAMR and PVRs provides a novel solution to the information bottleneck problem, improving discriminability in open-ended settings. However, the complexity of the framework and the need for human-in-the-loop procedures may pose challenges in implementation. Despite these limitations, OpenRS offers a promising direction for future research in RL, particularly in areas where human-AI alignment is paramount. The framework's potential to enhance the reliability and interpretability of AI systems makes it a valuable contribution to the field.

Recommendations

  • Further research should focus on simplifying the implementation of OpenRS to reduce complexity and improve scalability.
  • Exploring automated methods for domain principle refinement could enhance the efficiency and reduce the human-in-the-loop requirements of the system.
