Causally Robust Reward Learning from Reason-Augmented Preference Feedback

arXiv:2603.04861v1 Announce Type: new Abstract: Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe

Executive Summary

This article summarizes ReCouPLe, a framework for preference-based reward learning that addresses causal confusion by incorporating natural language rationales. Each rationale is treated as a guiding projection axis in an embedding space, so the model scores trajectories on features aligned with the stated reason while de-emphasizing unrelated context. Because the same rationales recur across tasks, ReCouPLe reuses causal directions wherever tasks share semantics and transfers preference knowledge to novel tasks without additional data or language-model fine-tuning. The authors report gains of up to 1.5x in reward accuracy under distribution shift and up to 2x in downstream policy performance on novel tasks, and they have released their code for open use.
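The projection-axis idea can be illustrated with a minimal sketch. The embeddings, dimensions, and variable names below are illustrative assumptions, not the authors' implementation; in practice the vectors would come from a pretrained language/trajectory encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
rationale_emb = rng.normal(size=16)   # embedding of a rationale, e.g. "avoids collisions"
trajectory_emb = rng.normal(size=16)  # embedding of a candidate trajectory

# Treat the rationale as a projection axis: score only the component of the
# trajectory embedding that lies along the unit-normalized rationale axis.
axis = rationale_emb / np.linalg.norm(rationale_emb)
score = float(trajectory_emb @ axis)

# Features orthogonal to the axis (potentially spurious context) contribute
# nothing to the score.
residual = trajectory_emb - score * axis
assert abs(float(residual @ axis)) < 1e-9
```

The key property is that any feature direction orthogonal to the rationale axis is invisible to the reward score, which is what makes the learned reward robust when spurious correlations shift at test time.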

Key Points

  • ReCouPLe addresses the issue of causal confusion in preference-based reward learning
  • The framework uses natural language rationales to guide the learning process
  • ReCouPLe enables the reuse of causal directions across tasks and transfers preference knowledge to novel tasks
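To connect the projection scores back to preference-based learning, a Bradley-Terry style pairwise loss can be computed over rationale-projected scores. This is a hedged sketch under assumed names and synthetic embeddings, not the paper's training objective.

```python
import numpy as np

def rationale_score(traj_emb: np.ndarray, rationale_emb: np.ndarray) -> float:
    """Score a trajectory along the rationale's unit axis (illustrative)."""
    axis = rationale_emb / np.linalg.norm(rationale_emb)
    return float(traj_emb @ axis)

def preference_loss(pref_emb, rej_emb, rationale_emb) -> float:
    """Negative log-likelihood (Bradley-Terry) that the preferred trajectory
    outscores the rejected one along the rationale axis."""
    margin = (rationale_score(pref_emb, rationale_emb)
              - rationale_score(rej_emb, rationale_emb))
    return float(np.log1p(np.exp(-margin)))  # -log sigmoid(margin)

rng = np.random.default_rng(1)
rationale = rng.normal(size=8)
# Synthetic pair: the preferred trajectory is aligned with the rationale,
# the rejected one is anti-aligned.
preferred = 2.0 * rationale + rng.normal(scale=0.1, size=8)
rejected = -rationale + rng.normal(scale=0.1, size=8)

loss = preference_loss(preferred, rejected, rationale)
assert 0.0 < loss < np.log(2)  # below the chance-level loss of log(2)
```

Because the same rationale axis can be shared across tasks with matching semantics, the same `rationale_score` direction can in principle be reused on a new task with no extra preference data.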

Merits

Strength in Addressing Causal Confusion

ReCouPLe effectively addresses the issue of causal confusion by incorporating natural language rationales, which provides a clear causal signal for the model to learn from.

Improved Performance

ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shift and by up to 2x in downstream policy performance on novel tasks.

Code Release

The authors release their code for open use, making it accessible for researchers and practitioners to build upon and adapt the framework.

Demerits

Limited Domain

The framework may be limited to domains where natural language rationales are readily available and can be effectively incorporated into the learning process.

Dependence on Rationale Quality

The quality of the rationales may significantly impact the effectiveness of ReCouPLe, and poor-quality rationales could lead to suboptimal performance.

Expert Commentary

ReCouPLe addresses a significant limitation of preference-based reward learning and demonstrates promising gains in both reward accuracy and downstream policy performance. However, further research is needed to probe the framework's limitations and potential biases, particularly in domains where natural language rationales are unavailable or of poor quality. The public code release further lowers the barrier for researchers and practitioners to build on the method.

Recommendations

  • Future research should investigate the limitations and potential biases of ReCouPLe across a wider range of domains and scenarios.
  • The framework's potential applications in real-world settings should be explored, particularly in domains where preferences are not well-defined or are subject to change.
