Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu

arXiv:2603.20212v1 (Announce Type: new)

Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.

Executive Summary

This article introduces Fast-Slow Thinking Reward Models (F/S-RM), a hybrid architecture that integrates scalar and generative reward models used to align Large Language Models (LLMs) via Reinforcement Learning from Human Feedback (RLHF). Drawing on Dual Process Theory, F/S-RM combines the computational efficiency of scalar models ("fast thinking") with the superior accuracy of generative, chain-of-thought models ("slow thinking"). A dual-confidence activation mechanism decides when slow thinking is engaged, so costly chain-of-thought reasoning is invoked only when the fast scalar judgment is uncertain. The approach achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%, making it a promising solution for aligning LLMs with human feedback.

Key Points

  • F/S-RM trains a single model to integrate scalar (first-token score) and generative (CoT judgment) reward paradigms.
  • A dual-confidence activation mechanism regulates when slow, CoT-based thinking is engaged.
  • F/S-RM reports a 1.2% relative performance improvement over state-of-the-art models with 20.8% lower token consumption.
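The routing idea behind the key points above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the scoring functions, the single fast-path confidence gate, and the threshold value are all stand-ins (the paper's "dual-confidence" mechanism presumably combines confidence signals from both paradigms, whose exact form the abstract does not specify).

```python
# Illustrative sketch of fast-slow reward routing with a confidence gate.
# All function bodies and thresholds are toy placeholders, not the paper's method.
from dataclasses import dataclass

@dataclass
class Judgment:
    score: float     # reward score in [0, 1]
    used_cot: bool   # whether slow (CoT) thinking was activated

def fast_score(response: str) -> tuple[float, float]:
    """Stand-in for fast thinking: a first-token scalar score plus a confidence.
    Toy heuristic: score grows with length; confidence measures distance
    from the undecided midpoint 0.5."""
    score = min(len(response) / 100.0, 1.0)
    confidence = abs(score - 0.5) * 2.0
    return score, confidence

def slow_score(response: str) -> float:
    """Stand-in for slow thinking: an expensive CoT-based generative judgment."""
    return min(len(response) / 100.0, 1.0)

def judge(response: str, threshold: float = 0.8) -> Judgment:
    """Return the fast score when its confidence clears the threshold;
    otherwise escalate to the slow CoT path."""
    score, confidence = fast_score(response)
    if confidence >= threshold:
        return Judgment(score=score, used_cot=False)      # fast path suffices
    return Judgment(score=slow_score(response), used_cot=True)  # escalate
```

Because most responses are expected to clear the confidence gate, the expensive CoT pass runs only on the ambiguous minority, which is the source of the reported token savings.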

Merits

Strength in Efficiency

F/S-RM reduces computational costs while maintaining performance, making it a more practical solution for real-world applications.

Improved Performance

The hybrid architecture enables F/S-RM to outperform state-of-the-art reward models (a 1.2% relative improvement), offering a more effective approach for aligning LLMs with human feedback.

Flexibility and Adaptability

The dual-confidence activation mechanism lets F/S-RM adapt its reasoning effort to the difficulty of each scenario, making it a versatile solution across tasks.

Demerits

Limited Domain-Specific Knowledge

The proposed approach may not be directly applicable to tasks or domains that require extensive domain-specific knowledge, which could limit its generalizability.

Potential Complexity

The dual-confidence activation mechanism and the integration of scalar and generative models may introduce additional complexity, which could make F/S-RM more challenging to implement and maintain.

Dependence on Human Feedback

F/S-RM relies on human feedback, which can be time-consuming and expensive to obtain, potentially limiting its deployment in real-world applications.

Expert Commentary

F/S-RM is a significant contribution to the field of LLM alignment via RLHF. By integrating scalar and generative paradigms in a single model, it offers a more efficient and effective way to align LLMs with human feedback. However, the approach also faces challenges, including the added complexity of the dual-confidence activation mechanism and its dependence on human feedback. Further research is needed to address these limitations and to fully realize the potential of F/S-RM.

Recommendations

  • Future research should focus on developing more efficient and effective methods for obtaining and incorporating human feedback in the development and deployment of LLMs.
  • The development of F/S-RM should be accompanied by a thorough evaluation of its performance and efficiency in real-world applications, as well as its potential implications for policy and practice.

Sources

Original: arXiv - cs.CL