
SteerRM: Debiasing Reward Models via Sparse Autoencoders

Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

arXiv:2603.12795v1 Announce Type: new Abstract: Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.

Executive Summary

This article proposes SteerRM, a training-free method for debiasing reward models via Sparse Autoencoder (SAE)-based interventions. By isolating stylistic effects with contrastive paired responses, identifying bias-related SAE features, and suppressing them at inference time, SteerRM improves Hard-split accuracy on RM-Bench by 7.3 points on average across six reward models while preserving overall performance. The results suggest generalization across RM architectures and bias types, and reveal shared architecture-level bias encoding patterns. For alignment pipelines, this offers a practical and interpretable way to mitigate reward-model biases without retraining.

Key Points

  • SteerRM is a training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions.
  • The method isolates stylistic effects, identifies bias-related SAE features, and suppresses them at inference time.
  • SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance.
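The suppression step described above follows the standard SAE intervention pattern: encode an activation into sparse features, then subtract the decoder contribution of the flagged features. The sketch below is illustrative only; the weights, feature indices, and function names are hypothetical stand-ins, not the paper's implementation or its strength-stability selection criterion.

```python
import numpy as np

# Toy SAE with random weights; a real SAE would be trained on RM activations.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae))  # encoder: activation -> features
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))  # decoder: features -> activation

def suppress_features(x, bias_features):
    """Subtract the decoder contribution of selected SAE features from x.

    x: model activation vector, shape (d_model,)
    bias_features: indices of features flagged as bias-related
    """
    acts = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    # Remove only the flagged features' reconstruction; leave the rest intact,
    # avoiding the blunt activation suppression that entanglement makes lossy.
    removed = acts[bias_features] @ W_dec[bias_features]
    return x - removed

x = rng.normal(size=d_model)
bias_features = [3, 17, 42]  # hypothetical format-related feature indices
x_debiased = suppress_features(x, bias_features)
```

At inference time, the edited activation `x_debiased` would replace `x` at the chosen (per the paper, typically shallow) layer before the reward head scores the response; with an empty feature set the activation passes through unchanged.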

Merits

Interpretability and Practicality

SteerRM provides a clear and interpretable solution for debiasing reward models, making it a valuable contribution to the field. Its training-free approach also offers a practical advantage, as it does not require retraining or architectural modifications.

Generalizability Across RM Architectures

The study's experiments on a Gemma-based reward model, and on a controlled non-format bias, suggest that SteerRM generalizes across RM architectures and bias types rather than being tied to one model family or one stylistic cue.

Shared Architecture-Level Bias Encoding Patterns

The findings reveal shared architecture-level bias encoding patterns: format-related features are concentrated in shallow layers and transfer across models. This can inform the development of more robust, bias-resistant reward models.

Demerits

Limited Scope and Generalizability

The study primarily targets format-related biases; although one controlled non-format bias is tested, further research is needed to establish whether the method extends to other bias types and domains.

Methodological Assumptions and Limitations

The SAE-based interventions may be sensitive to specific methodological assumptions and limitations, which could affect the reliability and generalizability of the results.

Expert Commentary

The article presents a notable contribution to AI alignment research. By proposing a training-free method for debiasing reward models, SteerRM offers a practical and interpretable tool for alignment pipelines, and its findings bear on the broader problem of understanding and mitigating bias in AI systems. That said, the focus on format-related biases and the method's sensitivity to its underlying assumptions should be weighed carefully. Future work should probe SteerRM's applicability to other domains and build toward more robust, bias-resistant systems.

Recommendations

  • Future studies should investigate the applicability of SteerRM to other domains and types of biases.
  • Researchers should explore the development of more robust and bias-resistant AI systems, incorporating the insights from SteerRM and other related studies.
